Learn Everything About DBMS and SQL

In your day-to-day life whenever you have a need to save some information, where do you save it? Especially when you may need to refer to it later, maybe something like your expenses for the month, or your todo or shopping list?

Correct! Many of us use software like Excel, Google Sheets, Notion, Notes app etc to keep a track of things that are important to us and we may need to refer to it in future. Everyone, be it humans or organizations, have need to store a lot of data that me useful for them later. For example, let’s think about School. At School, we would want to keep track of all of your’s attendance, assignments solved, codes written, coins, mentor sessions etc! We would also need to store details about instructors, mentors, TAs, batches, etc. And not forget all of your email, phone number, and password. Now, where will we do this?

For now, forget that you know anything about databases. Imagine yourself to be a new programmer who just knows how to write code in a programming language. Where will you store data so that you are able to retrieve it later and process that?

Correct! You will store it in files. You will write code to read data from files and write data to files. And you will write code to process that data. For example, you may create separate CSV (comma-separated values, you will understand as we proceed) files to store information about let’s say students, instructors, and batches. Eg:

students.csv
name, batch, psp, attendance, coins, rank
Naman, 1, 94, 100, 0, 1
Amit, 2, 81, 70, 400, 1
Aditya, 1, 31, 100, 100, 2
name, subjects, average_rating
Rachit, C++, 4.5
Rishabh, Java, 4.8
Aayush, C++, 4.9
id, name, start_date, end_date
1, AUG 22 Intermediate, 2022-08-01, 2023-08-01
2, AUG 22 Beginner, 2022-08-01, 2023-08-01

Now, let’s say you want to find out the average attendance of students in each batch. How will you do that? You will have to write code to read data from students.csv, and batches.csv, and then process it to find out the average attendance of students in each batch. Right? Do you think this will be very cumbersome?


Issues with Files as a Database

1. Inefficient

While the above set of data is very small in size, let’s think of actual school scale. We have 2M+ users in our system. Imagine going through a file with 2M lines, reading each line, processing it to find your relevant information. Even a very simple task like finding the psp of a student named Naman will require you to open the file, read each line, check if the name is Naman, and then return the psp. Time complexity wise, this is O(N) and very slow.

2. Integrity

Is there anyone stopping you from putting a new line in students.csv as Naman, 1, Hello, 100, 0, 1 . If you see that Hello that is unexpected. The PSP can’t be a string. But there is no one to validate and this can lead to very bad situations. This is known as a data integrity issue, where the data is not as expected.

3. Concurrency

Later in the course, you will learn about multi-threading and multi-processing. It is possible for more than 1 people to query the same data at the same time. Similarly, 2 people may update the same data at the same time. On save, whose version should you save? Imagine you give the same Google Doc to 2 people and both make changes on the same line and send it to you. Whose version will you consider to be correct? This is known as a concurrency issue.

4. Security

Earlier we talked about storing the passwords of users. Imagine them being stored on files. Anyone who has access to the file can see the password of all users. Also, anyone who has access to the file can update it as well. There is no authorization at the user level. Eg: a particular person may be only allowed to read, not write. ‘


What’s a Database

A database is nothing but a collection of related data. For example, the School will have a Database that stores information about our students, users, batches, classes, instructors, and everything else. Similarly, Facebook will have a database that stores information about all of it’s users, their posts, comments, likes, etc. The above way of storing data into files was also nothing but a database, though not the easiest one to use and with a lot of issues.

What’s a Database Management System (DBMS)

A DBMS as the name suggests is a software system that allows to efficiently manage a database. A DBMS allows us to create, retrieve, update, and delete data (often also called CRUD operations). It also allows to define rules to ensure data integrity, security, and concurrency. It also provides ways to query the data in the database efficiently. E.g.: find all students with PSP > 50, find all students in batch 1, find all students with rank 1 in their batch, etc. There are many database management systems, each with their tradeoffs. We will talk about the types of databases later.

Types of Databases

When you have to store some data, eg let’s say you are an instructor at School and want to keep track of attendance and PSP of every student of you, in what form will you store that?

Correct! Often one of the easiest and most intuitive ways to store data can be in the form of tables. For example for the mentioned use case, I may create a table with 3 columns: name, attendance, PSP and fill values for each of my students there. This is very intuitive and simple and is also how relational databases work.

Relational Databases

Relational Databases allow you to represent a database as a collection of multiple related tables. Each table has a set of columns and rows. Each row represents a record and each column represents a field. For example, in the above case, I may have a table with 3 columns: name, attendance, PSP and fill values for each of my students there. Let’s learn some properties of relational databases.

  • Relational Databases represent a database as a collection of tables with each table storing information about something. This something can be an entity or a relationship between entities. Example: I may have a table called students to store information about students of my batch (an entity). Similarly, I may have a table called student_batches to store information about which student is in which batch (a relationship between entities).
  • Every row is unique. This means that in a table, no 2 rows can have the same values for all columns. Example: In the student’s table, no 2 students can have the same name, attendance and PSP. There will be something different for example we might also want to store their roll number to distinguish 2 students having the same name.
  • All of the values present in a column hold the same data type. Example: In the student’s table, the name column will have string values, the attendance column will have integer values and the PSP column will have float values. It cannot happen that for some students PSP is a String.
  • Values are atomic. What does atomic mean? What does the word atom mean to you?

Correct. Similarly, atomic means indivisible. So, in a relational database, every value in a column is indivisible. Example: If we have to store multiple phone numbers for a student, we cannot store them in a single column as a list. How to store those, we will learn in the end of the course when we do Schema Design. Having said that, there are some SQL databases that allow you to store a list of values in a column. However, that is not a part of the SQL standard and is not supported by all databases. Even those that support, aren’t most optimal with queries on such columns.

  • The column sequence is not guaranteed. This is very important. SQL standard doesn’t guarantee that the columns will be stored in the same sequence as you define them. So, if you have a table with 3 columns: name, attendance, and PSP, it is not guaranteed that the data will be stored in the same sequence. So it is recommended to not rely on the sequence of columns and always use column names while writing queries. While MySQL guarantees that the order of columns shall be the same as defined at the time of creating a table, it is not a part of SQL standard and hence not guaranteed by all databases and relying on order can cause issues in the future when a new column is added in between.
  • The row sequence is not guaranteed. Similar to columns, SQL doesn’t guarantee the order in which rows shall be returned after any query. So, if you want to get rows in a particular order, you should always use the ORDER BY clause in your query which we will learn about in the next class. So when you write an SQL query, don’t assume that the first row will always be the same. The order of rows may change across multiple runs of the same query. Having said that, MySQL does return rows in order of their primary key (we will learn about this later today), but again, don’t rely on that as not guaranteed by SQL standards.
  • The name of every column is unique. This means that in a table, no 2 columns can have the same name. Example: In the student’s table, I cannot have 2 columns with the name. This is because if I have to write a query to get the name of a student, I will have to write SELECT name FROM students. Now if there are 2 columns with name names, how will the database know which one to return? Hence, the name of every column is unique.

Non-Relational Databases

Non-relational databases are those databases that don’t follow the relational model. They don’t store data in the form of tables. Instead, they store data in the form of documents, key-value pairs, graphs, etc.


Keys in Relational Databases

let’s say you are working at School and are maintaining a table of every student’s details. Someone tells you to update the PSP of Naman to 100. How will you do that? What can go wrong?

What if there are 2 Namans?

Correct. If there are 2 Namans, how will you know which one to update? This is where keys come into the picture. Keys are used to uniquely identify a row in a table. There are 2 important types of keys: Primary Key and Foreign Key. There are also other types of keys like Super Key, Candidate Key etc. Let’s learn about them one by one.

Super Keys

To understand this, let’s take an example of a student’s table at school with the following columns.

Which are the columns that can be used to uniquely identify a row in this table?

Let’s start with the name. How many of you think a name can be used to uniquely identify a row in this table?

Correct. A name is not a good idea to recognize a row. Why? Because there can be multiple students with the same name. So, if we have to update the PSP of a student, we cannot use a name to uniquely identify the student. Email and phone numbers on the other hand are a great idea, assuming no 2 students have the same email or phone number.

Do you think the value of the combination of columns (name, email) can uniquely identify a student? Do you think there will be only 1 student with a particular combination of name and email? Eg: will there be only 1 student (Naman, naman@school.com)?

Correct, similarly do you think (name, phone number) can uniquely identify a student? What about (name, email, phone number)? What about (name, email, PSP)? What about (email, PSP)?

The answer to each of the above is Yes. Each of these can be considered a Super Key. A super key is a combination of columns whose values can uniquely identify a row in a table. What do you think are other such super keys in the student’s table?

In the above keys, did you ever feel something like “but this column was useless to uniquely identify a row..”? Let’s take an example of (name, email, PSP). Do you think PSP is required to uniquely identify a row? Similarly, do you think a name is required as you anyways have an email right?

Now let’s remove the columns that weren’t necessary.

Also, let’s say we were an offline school, and students didn’t have email or phone numbers. In that case, what do you think schools use to uniquely identify a student? Eg: If we remove redundant columns from (name, email, PSP), we will be left with (email). Similarly, if we remove redundant columns from (name, email, phone number), we will be left with (phone number) or (email). These are known as candidate keys. A candidate key is a super key from which no column can be removed and still has the property of uniquely identifying a row. If any more column is removed from a candidate key, it will no longer be able to uniquely identify a row. Let’s take another example. Consider a table the school has for storing students’ attendance for every class.

| student_id | class_id | attendance |

What do you think are the candidate keys for this table? Do you think (student_id) is a candidate key? Will there be only 1 row with a particular student_id?

Is (class_id) a candidate key? Will there be only 1 row with a particular class_id?

Is (student_id, class_id) a candidate key? Will there be only 1 row with a particular combination of student_id and class_id?

Yes! (student_id, class_id) is a candidate key. If we remove any of the columns of this, the remaining part is not a candidate key. Eg: If we remove student_id, we will be left with (class_id). But there can be multiple rows with the same class_id. Similarly, if we remove class_id, we will be left with (student_id). But there can be multiple rows with the same student_id. Hence, (student_id, class_id) is a candidate key.

Is (student_id, class_id, attendance) a candidate key? Will there be only 1 row with a particular combination of student_id, class_id and attendance?

But can we remove any column from this and still have a candidate key? Eg: If we remove attendance, we will be left with (student_id, class_id). This is a candidate key. Hence, (student_id, class_id, attendance) is not a candidate key.

Primary Key

We just learnt about super keys and candidate keys. Can 1 table have multiple candidate keys? Yes. The table earlier had both (email), and (phone number) as candidate keys. A key in MySQL plays a very important role. For example, MySQL orders the data in disk by the key. Similarly, by default, it returns answers to queries ordered by key. Thus, it is important that there is only 1 key. And that is called the primary key. A primary key is a candidate key that is chosen to be the key for the table. In the student’s table, we can choose (email) or (phone number) as the primary key. Let’s choose (email) as the primary key.

Sometimes, we may have to or want to create a new column to be the primary key. Eg: If we have a student’s table with columns (name, email, phone number), we may have to create a new column called roll number or studentId to be the primary key. This may be because let’s say a user can change their email or phone number. Something that is used to uniquely identify a row should ideally never change. Hence, we create a new column called roll number or studentid to be the primary key.

Foreign Keys

Let’s say we have a table called batches which stores information about batches at school. It has columns (id, name, startDate, endDate). We would want to know for every student, which batch they belong to. How can we do that?

Correct, We can add batchId column in student’s table. But how do we know which batch a student belongs to? How do we ensure that the batchId we are storing in the student’s table is a valid batchId? What if someone puts the value in batchID column as 4 but there is no batch with id 4 in the batches table. We can set such kind of constraints using foreign keys. A foreign key is a column in a table that references a column in another table. It has nothing to do with primary, candidate, or super keys. It can be any column in 1 table that refers to any column in other tables. In our case, batchId is a foreign key in the students table that references the id column in the batches table. This ensures that the batchId we are storing in the student’s table is a valid batchId. If we try to insert any value in the batchID column of the students’ table that isn’t present in the id column of the batches table, it will fail. Another example:

Let’s say we have a year table as: | id | year | number_of_days |

and we have a table of students as: | id | name | year |

Is the year column in the student’s table a foreign key?

The correct answer is yes. It is a foreign key that references the id column in the year’s table. Again, a foreign key has nothing to do with a primary key, candidate key etc. It is just any column on one side that references another column on the other side. Often it doesn’t make sense to have that and you just keep the primary key of the other table as the foreign key. If not a primary key, it should be a column with a unique constraint. Else, there will be ambiguities.

Okay, now let’s think of what can go wrong with foreign keys.

Correct, let’s say we have students and batches tables as follows:

Now let’s say we delete the row with batch_id 2 from batches table. What will happen? Yes, the students Jim and Jack will be orphaned. They will be at the students table but there will be no batch with id 2. This is called orphaning. This is one of the problems with foreign keys. Another problem is that if we update the batch_id of a batch in batches table, it will not be updated in students table. Eg: If we update the batch_id of Batch A from 1 to 4, the students John and Jane will still have batch_id as 1. This is called inconsistency.

To fix these, MySQL allows you to set ON DELETE constraints so that cascading deletes happen when such a delete happens.

6 Comments

    • Hey Arlette,
      Thank you for commenting. We will surely make one specific article for SEO. Meanwhile, you can subscribe to TechAlgoSpotlight for the Latest Article Notifications.

Leave a Reply

Your email address will not be published. Required fields are marked *