CodingwithMax
Here to help YOU become a data scientist who gets hired⚡️, makes more 💰 & gains freedom 🏝 Want to become a coder or data scientist? I've got you!
🤷 “But what exactly is a relational database?”
You’re likely familiar with the term database, but sometimes this term may be preceded with another description like “relational” or “NoSQL”. What exactly do these terms mean?
Here we’ll just focus on the relational database, since that is a specific type of database, whereas NoSQL basically means all databases that aren’t relational databases. 🤔
Sometimes you may hear things like “relational databases are databases that follow the relational model”. Well obviously… but that definition doesn’t really help with anything.
So, let’s go at it one step at a time.
A core component of relational databases is that they’re organized into a table-like structure. Each of these table-like structures can also be referred to as a relation.
The intuition you already have from creating tables in programs like Excel or Google Sheets, or from tables within documents, carries over nicely to how to imagine the tables in a relational database. 📋
Just like normal tables, the tables in a relational database contain columns, which are also called attributes, and each column specifies what type of data is held within it.
It does this in two ways:
1️⃣ The column’s name, for example customer_id.
2️⃣ Each column actually has a data type associated with it. For example, the customer_id column may be an integer column, the price column could be float or double (which are just fancy words for decimal), and the product_name column could be a text column.
This is great because we know, and the database also knows, that everything inside the price column is a decimal. That way, if we ask the database to calculate a sum over all prices, for example, we know, and it knows, that it will only encounter decimals and doesn’t have to deal with any funny business.
❗ There is one exception to this, namely if we say we allow our columns to contain null values, i.e. if we allow the columns to contain missing data. But even in these cases computing a sum is possible because we can just skip the missing values.
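To make the typed columns and the missing-value exception concrete, here is a minimal sketch using Python's built-in sqlite3 module. The purchases table and its rows are made up for illustration, and SQLite is more relaxed about column types than most relational databases, so treat this as a sketch rather than a description of any particular production setup.

```python
import sqlite3

# Hypothetical table and values, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE purchases (
        customer_id  INTEGER,
        product_name TEXT,
        price        REAL      -- allowed to be NULL (missing)
    )
    """
)
conn.executemany(
    "INSERT INTO purchases (customer_id, product_name, price) VALUES (?, ?, ?)",
    [
        (1, "T-Shirt", 39.99),
        (2, "Socks", None),   # None is stored as NULL, i.e. a missing price
        (3, "Hat", 10.01),
    ],
)

# SUM skips NULL values instead of failing on them.
total = conn.execute("SELECT SUM(price) FROM purchases").fetchone()[0]
print(total)
```

The row with the missing price is simply left out of the sum, so only the two known prices are added up.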
Then we have the other component, the rows, which are also called records; you may also hear them referred to as entries.
A row is a collection of values, one for each column in that table, and all of these values correspond to the same event, entry, record, or whatever other name you want to refer to it as.
For example, if you sell a T-Shirt👕, and you have a table that has the columns:
- name
- price
- quantity
- location
Then the row associated with this sale would have a value for each of these columns, e.g.
- T-Shirt
- 39.99
- 1
- New York
You can also see a similar visual depiction of this on the graphic of this post. There are some slight discrepancies, which we will get into next.
So far, this is a pretty standard description of how tables are usually organised (except maybe for each column having a specific data type).
However, the tables in a relational database also have some other additional properties too. To make it easier for intuition, we left them out in the above example, but we will get into them now!
The first is that each table needs to have something called a “primary key”. The primary key is supposed to act as a unique identifier for that row, and duplicate values of a primary key within a table are strictly forbidden.
A primary key is usually implemented in one of two ways. The first is a continuously increasing number ⬆️, which therefore just indicates the row number.
The other common implementation is a unique identifier. For example, a unique identifier is created for each website event, store purchase, or whatever else you’re recording in a table.
In addition, we can also add further unique requirements to a table. For example, we can say something like “Only 1 record is allowed to exist for a specific customer, store, item, and time combination”. ⚠️
This way we can protect against potential duplicate records being recorded in our database, and it allows us to add further restrictions on the structure of our data in the database based on our conceptual understanding of the data and where it comes from.
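Here is a rough sketch of what a primary key plus an extra uniqueness rule could look like, again using Python's built-in sqlite3 module. The sales table, its column names, and the values are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        id          INTEGER PRIMARY KEY,   -- increasing number assigned per row
        customer_id INTEGER NOT NULL,
        store_id    INTEGER NOT NULL,
        item_id     INTEGER NOT NULL,
        sold_at     TEXT    NOT NULL,
        -- only one record per customer, store, item, and time combination
        UNIQUE (customer_id, store_id, item_id, sold_at)
    )
    """
)

insert_sql = (
    "INSERT INTO sales (customer_id, store_id, item_id, sold_at) "
    "VALUES (?, ?, ?, ?)"
)
row = (7, 3, 42, "2020-12-24 10:15:00")

conn.execute(insert_sql, row)          # the first insert succeeds
try:
    conn.execute(insert_sql, row)      # the same combination a second time
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True          # the database refuses the duplicate

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(duplicate_rejected, row_count)
```

The second insert is rejected by the database itself, so the duplicate never makes it into the table, no matter which application tried to write it.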
Another cool feature of relational databases is that within one table we can reference the row of another table by using that row’s primary key.
For example, imagine we have 2 tables:
One table (called event_log) which tracks individual website events, and has the following columns:
- id
- user_id
- timestamp
- event_type_id
- store_id
- location
And a second table (called event_info) that describes the different types of events, and has the columns
- id
- name
As a quick note, every column with the word “id” in it will only contain integer numbers.
We won’t dive too deep into the reason here, but this is a very common thing to do because it takes up much less space to store e.g. the integer 15 than the text “Pop-up Close”.
Coming back to our tables, our event_info table has two columns: The id column, which is the primary key and the name column, which is the actual name of the event (for example “Login”).
This table will hold all the different events that are possible, and will be much smaller, since the number of events that can happen on our site is ultimately limited.
Every website event will be logged in our event_log table, with the primary key of the event_log table being the id column. This table will likely be large as many events can happen on the site, and the same user can perform the same event multiple times over (e.g. logging in or refreshing and loading a page).
One column that’s particularly interesting here is the “event_type_id” column. This column holds the id of the type of each event, and each number here directly corresponds to the appropriate event in the event_info table. ↔️
In other words, the event_type_id value directly references the id column of the event_info table, which is also the event_info table’s primary key. In this case we would call the event_type_id a “foreign key”, since it holds the primary key of another table.
Like I mentioned above, it’s common to use ids instead of the actual descriptive names because you end up saving a lot of space. 💾
It’s also possible for a table to have more than 1 column that holds a foreign key, and using foreign keys is what allows us to easily make links across different tables. We can then use these links to optionally enrich the information of the table we’re looking at.
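The event_log and event_info tables described above can be sketched with Python's built-in sqlite3 module. To keep things short, the store_id and location columns are left out, and all ids, names, and timestamps are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked to

conn.execute("CREATE TABLE event_info (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE event_log (
        id            INTEGER PRIMARY KEY,
        user_id       INTEGER NOT NULL,
        timestamp     TEXT    NOT NULL,
        event_type_id INTEGER NOT NULL REFERENCES event_info (id)  -- foreign key
    )
    """
)

conn.executemany(
    "INSERT INTO event_info (id, name) VALUES (?, ?)",
    [(1, "Login"), (15, "Pop-up Close")],
)
insert_event = "INSERT INTO event_log (user_id, timestamp, event_type_id) VALUES (?, ?, ?)"
conn.executemany(
    insert_event,
    [
        (101, "2021-03-01 09:00:00", 1),
        (102, "2021-03-01 09:05:00", 15),
        (101, "2021-03-02 08:30:00", 1),
    ],
)

# Follow the foreign key to replace each numeric id with the event's name.
rows = conn.execute(
    """
    SELECT event_log.user_id, event_info.name
    FROM event_log
    JOIN event_info ON event_info.id = event_log.event_type_id
    ORDER BY event_log.id
    """
).fetchall()
print(rows)

# An event type that does not exist in event_info is refused.
try:
    conn.execute(insert_event, (103, "2021-03-02 10:00:00", 99))
    bad_reference_rejected = False
except sqlite3.IntegrityError:
    bad_reference_rejected = True
```

The JOIN is the "link" between the two tables: the small event_info table is looked up once per logged event, using nothing but the integer id.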
For example, for the event_log table, we have another column called store_id. It’s not unlikely that this column holds a foreign key which references a specific row in a “store” table.
The store table itself may have columns with information about
- Store size
- Store location
- Opening hours
- etc.
And thus, if we want to, we can easily bring in and consider additional information from the stores that we may want to use for filtering or in an analysis.
There are some other cool additional features that relational databases have to make querying and storage more efficient, such as indexing and partitioning, but that’s a topic for another post 😊
As you can hopefully see now, relational databases basically expand on the intuition we already have about tables by adding additional features which make it easy for us to create links between different tables. They also help us make sure our data is intact by letting us define specific rules to avoid things like duplicate entries.
“Why should I learn SQL if I already know Excel?”
When you think of reports and tables you may directly associate that with Excel, and that’s fair. We’ve learned in school and in college about how to use Excel to perform analysis and create reports, and you may even be using it at work for that same purpose. 📊
But why is it that newer technology companies prefer their analysts using SQL over Excel for data processing and analysis? 💻
Well, if you’ve ever used Excel for more than just a small project with a couple of data points, you’ve probably experienced some of the following frustrations:
⏱️ You scroll and it takes a bit for the next rows to load.
❄️ You add in a new row, and the application freezes as it updates calculated values (such as a sum or average over an entire column).
😨 You accidentally open the wrong spreadsheet, and now you have to wait for the application to unfreeze as it loads and renders all of the data from the large spreadsheet you accidentally opened.
Now don’t get me wrong, Excel has many uses, especially when it comes to things like easily formatting overview reports with numbers, graphs, and annotations.
But… Excel hits a wall when the amount of data grows large. We start experiencing this when our number of rows goes into the thousands, or tens of thousands, and our number of columns goes into the tens and hundreds. 📈
In those cases Excel spends a significant amount of time in just continuously rendering all of the data so that you can actually see it on the screen, and every time a new data point is added it needs to update the calculated cells to incorporate the new values.
With the amount of data now being collected everywhere, and with sites reaching monthly active user counts in the hundreds of millions and higher, Excel is obviously not a suitable approach to handling data in these types of situations.
And that’s because Excel is not meant to be used like a database, yet people still use it that way. Likely because with databases it can feel like the data is locked away from you if you’re not a “programmer”, and if you want to gain access to it you need to use code and get to the raw source and retrieve it. ⛏️
But that’s where SQL comes in. SQL is a user-friendly language that lets you query a relational database with ease, so that you can retrieve data, process it, and use it to do analysis and create reports from it.
And the cool thing is that the tables in a relational database are conceptually very similar to how you already understand tables in spreadsheets, so your intuition carries over.
But relational databases also come with some extra cool bonuses like:
✅ We don’t have any loading times waiting for the graphical portion to update and show us the newest row of raw data (which we likely won’t look at anyway)
✅ We can assign multiple machines to manage a database, meaning we can scale much better to larger amounts of data
✅ The database itself can optimise how it stores and keeps track of data, and we don’t have to worry about it
SQL is then the language that lets us tell the database what data we want (and which data we don’t want), as well as how we want it processed (if at all). It also lets us perform aggregations, like the number of daily active users our site had per day over the last week, so we can use the results directly in a report. 🎉
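As an illustration of such an aggregation, here is a small sketch of a "daily active users" query run through Python's built-in sqlite3 module. The event_log rows and the chosen week are made up, and the date handling shown is SQLite's; other databases use slightly different date functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log (id INTEGER PRIMARY KEY, user_id INTEGER, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO event_log (user_id, timestamp) VALUES (?, ?)",
    [
        (1, "2021-03-01 09:00:00"),
        (1, "2021-03-01 17:30:00"),  # same user twice on one day counts once
        (2, "2021-03-01 12:00:00"),
        (1, "2021-03-02 08:45:00"),
        (3, "2021-02-01 10:00:00"),  # outside the chosen week, so filtered out
    ],
)

# Tell the database which data we want (WHERE) and how to process it (GROUP BY).
daily_active_users = conn.execute(
    """
    SELECT DATE(timestamp) AS day, COUNT(DISTINCT user_id) AS active_users
    FROM event_log
    WHERE timestamp >= ? AND timestamp < ?
    GROUP BY day
    ORDER BY day
    """,
    ("2021-02-24", "2021-03-03"),
).fetchall()
print(daily_active_users)
```

The result is already in report shape: one row per day with the number of distinct users seen on that day.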
As companies continue to grow and evolve, and everyone continues to track more and more data, analysts start turning to SQL, rather than Excel, to help them perform the required tasks.
In these cases SQL becomes the better, more efficient, and more scalable alternative, which also seamlessly lets you access the data in the relational databases and data warehouses where companies store their data.
“Wow… SQL is popping up everywhere, but why is SQL so widely used now?”
It’s likely you’ve seen or heard of SQL being mentioned somewhere, maybe you even thought about picking it up yourself, but you may still be wondering, why now? Why is SQL becoming popular now? Why am I hearing about it so much now? 🤔
One obvious answer would be that you’re seeing more SQL content because you’ve recently gotten into data stuff and our beloved recommendation engines have just started finding more tailored content to show you, but let’s ignore that reason.
Interestingly, SQL has been around since the 1970s, and so it’s not actually a “new” thing. In fact, there have already been several hype waves surrounding SQL, but this “wave” is a bit different.
But why is it different? Many of the reasons can be traced back to one thing: we’re in the era of Big Data. 🗄️
The amount of data being produced, collected, and stored by everyone is exploding massively 📈. It’s become standard to have a database that is actually made up of many separate machines, just because one machine couldn’t possibly handle storing and retrieving all this data.
In addition, even if you could run everything on one machine, you’d still face problems with internet connection speed limitations (think of how long it takes to download a 1 GB file, now imagine databases that store Tera-, Peta-, or even Exabytes of data). 📡
Not only that, but it’s likely that when you want data from a database you don’t actually want all the data. In fact, it’s likely you only want a much smaller portion of the data (maybe only the last week’s, last month’s, or last year’s).
And for that smaller portion, it’s likely you want this data processed somehow. I.e. junk data removed, prices converted to a standard currency, numerical ids replaced with their actual terms, and possibly also calculating aggregates (like total number of items sold, total number of customers, or average order value).
It’s much more efficient to do all of this filtering, processing, and aggregation before the data is sent out of the database.
Why? Because otherwise you’ll spend a lot of time waiting for data to be sent over the internet ⏳, only to do all of these steps later somewhere else and throw away a lot of the data you didn’t actually need.
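A tiny sketch of the difference, using Python's built-in sqlite3 module and a made-up orders table. With an in-memory database nothing travels over a network, so this only illustrates how many rows leave the database in each approach, not actual transfer times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [
        ("2020-11-30", 20.0),
        ("2020-12-24", 40.0),
        ("2020-12-26", 60.0),
        ("2021-01-05", 80.0),
    ],
)
start, end = "2020-12-24", "2020-12-31"

# Option A: pull every row out of the database, then filter and average in Python.
all_rows = conn.execute("SELECT order_date, amount FROM orders").fetchall()
kept = [amount for order_date, amount in all_rows if start <= order_date <= end]
avg_in_python = sum(kept) / len(kept)

# Option B: let the database filter and aggregate, so only one row comes back.
result_rows = conn.execute(
    "SELECT AVG(amount) FROM orders WHERE order_date BETWEEN ? AND ?",
    (start, end),
).fetchall()
avg_in_database = result_rows[0][0]

print(len(all_rows), len(result_rows), avg_in_python, avg_in_database)
```

Both options give the same average order value, but option B sends back a single row instead of the whole table. With four rows that doesn't matter; with billions of rows it does.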
In fact, databases are actually optimised for quick and efficient filtering, processing, and aggregation 🤓. That’s because they keep track of what data is stored where, so when you do filtering, it’s likely the database doesn’t even need to read every bit of data to see if it fits within the filter conditions or not, because it already knows what is stored where.
The era of Big Data has actually brought with it development of many new technologies, and SQL has also been extended to have applications there too.
For example, rather than having a standard relational database used to quickly store and retrieve data to use in the application, you now also have separate databases, called data warehouses, built and optimised specifically for analysts and scientists running complex queries over massive amounts of data.
The need for data warehouses and the general adoption of relational databases has brought with it significant technological advances that make SQL even more efficient, which then further extends its reach and applications, which then feeds back to more work being put into developing and optimising the technologies around it. 🔁
In fact, SQL has become so popular and useful that it has even been extended away from only being used in databases, to also forming a core component of many other big data tools that have been developed. 👩🏽‍💻
🤔 "Uhhhh, I’m afraid to ask at this point… but what is SQL anyway?"
You’ve likely heard of SQL by now, either through reading some (blog) article, hearing it mentioned in a YouTube video, or even seeing it listed on job descriptions.
If you’re already super familiar with SQL and are using it on a day-to-day basis, then what you’re about to read probably won’t be mind blowing for you.
But if you’re just getting started in the world of data or analytics, and want to get your bearings straight, then this next bit will hopefully help you get clarity on what SQL is all about. 🤓
You probably already know generally that SQL can be used to access, or query if you like, databases. But what exactly does that mean, and why do you need SQL for it? 🧐
Now before we go deeper into anything else I want to make sure to add one clarifying remark: SQL cannot be used with all databases (in fact, there’s a whole group of databases specifically called NoSQL databases).
However, we won’t dive any deeper down that rabbit hole. Instead we’ll first briefly focus on the databases where SQL does apply, namely relational databases, and then learn a bit more about SQL and what it can do. 🔎
Relational databases are basically databases that are organised into one, or more commonly multiple tables, each made up of rows and columns. Each row is a separate entry or entity, and each column holds a specific piece of information about each entry.
Therefore we know that each row in the table contains the same information as the other rows in it, and we also know what type of information is held in each column, and that this type of information is consistent across rows.
For example, if a column holds a date, then the values we will encounter here will all be dates, if it holds a price, then the values will be prices, etc.
The tables can be of different natures, for example, some tables may contain additional descriptive information, like
- what length and colour options a specific pair of pants have,
- what brand it belongs to
- what year it was designed in
- etc.
These types of tables are called dimension tables.
Others may contain information about things that have occurred, for example, what item was sold, at what store, at what time and date, and what price it was sold for. These types of tables are called fact tables.
If we want to know more information about the item itself, we could look it up in the dimension table we mentioned previously, where we can e.g. find the associated brand for each item.
(👇 You can see a visual representation of an example of these tables on the accompanying graphic to this post.)
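In code, a dimension table and a fact table could look roughly like the sketch below, which uses Python's built-in sqlite3 module. The table names item_dim and sales_fact, the brands, and all values are hypothetical and may not match the graphic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: descriptive information about each item.
conn.execute(
    """
    CREATE TABLE item_dim (
        item_id     INTEGER PRIMARY KEY,
        name        TEXT,
        brand       TEXT,
        design_year INTEGER
    )
    """
)
# Fact table: one row per thing that happened (here, a sale).
conn.execute(
    """
    CREATE TABLE sales_fact (
        sale_id INTEGER PRIMARY KEY,
        item_id INTEGER REFERENCES item_dim (item_id),
        store   TEXT,
        sold_at TEXT,
        price   REAL
    )
    """
)

conn.executemany(
    "INSERT INTO item_dim (item_id, name, brand, design_year) VALUES (?, ?, ?, ?)",
    [(1, "Slim Pants", "Acme", 2019), (2, "Cargo Pants", "Zenith", 2020)],
)
conn.executemany(
    "INSERT INTO sales_fact (item_id, store, sold_at, price) VALUES (?, ?, ?, ?)",
    [
        (1, "New York", "2020-12-24 10:00:00", 50.0),
        (1, "Boston", "2020-12-26 15:30:00", 50.0),
        (2, "New York", "2020-12-27 11:45:00", 30.0),
    ],
)

# Look up each sold item's brand in the dimension table, then total up per brand.
sales_per_brand = conn.execute(
    """
    SELECT item_dim.brand, COUNT(*) AS items_sold, SUM(sales_fact.price) AS revenue
    FROM sales_fact
    JOIN item_dim ON item_dim.item_id = sales_fact.item_id
    GROUP BY item_dim.brand
    ORDER BY item_dim.brand
    """
).fetchall()
print(sales_per_brand)
```

The fact table never stores the brand itself. It only stores the item_id, and the brand is pulled in from the dimension table when we need it.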
Each table, therefore, has a specific structure to it, and the tables can also be linked to each other so that we can go between them to get more detailed information.
And that particular structure of tables that have pre-defined, consistent columns, and defined associations between tables, is also one of the reasons why SQL can be as useful and powerful as it is.
So now let’s actually get to it, what is SQL anyway?
SQL (short for Structured Query Language) is a language that allows us to write queries against these structured tables.
Ok… but what does that actually look like?
It could look like us saying something like:
“Please give me all the sales that happened between Christmas and New Year’s and order them in chronological order, oh, and also I am only interested in products that are electronic.”
Now obviously we wouldn’t actually write it like this, instead, in SQL, this would look something more like:
SELECT *
FROM purchased_items
WHERE "date" >= '2020-12-24' AND "date" …
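For a complete, runnable version of the request described above, here is a hedged sketch using Python's built-in sqlite3 module. Only the purchased_items table name, the "date" column, and the 2020-12-24 start date come from the query above. The end date of 2021-01-01, the category column, the 'electronics' value, and the sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The id, product, and category columns are hypothetical.
conn.execute(
    """
    CREATE TABLE purchased_items (
        id       INTEGER PRIMARY KEY,
        product  TEXT,
        category TEXT,
        "date"   TEXT
    )
    """
)
conn.executemany(
    'INSERT INTO purchased_items (id, product, category, "date") VALUES (?, ?, ?, ?)',
    [
        (1, "Headphones", "electronics", "2020-12-28"),
        (2, "Laptop", "electronics", "2020-12-24"),
        (3, "T-Shirt", "clothing", "2020-12-26"),      # not electronic
        (4, "Phone", "electronics", "2021-02-01"),     # outside the date range
    ],
)

rows = conn.execute(
    """
    SELECT *
    FROM purchased_items
    WHERE "date" >= '2020-12-24'
      AND "date" <= '2021-01-01'        -- assumed end of the period
      AND category = 'electronics'      -- assumed column and value
    ORDER BY "date"                     -- chronological order
    """
).fetchall()
print(rows)
```

Each part of the plain-English request maps onto one clause: the date range and the "electronic" condition go into WHERE, and the chronological ordering goes into ORDER BY.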
Once and for all – what are correlation and causation? How do you differentiate between correlation vs causation?
>> https://codingwithmax.com/correlation-vs-causation-examples/
In this blog post, we discuss what correlations and causations are, some properties and types of correlations plus what noise is, and of course, you’ll find some examples to guide you along the way!
So psyched about this collaboration with freeCodeCamp! 👨‍💻
If you've heard the term 'data science' thrown around but you're not *quite* sure what the term entails - check out this short 90 min mini-course that'll get you up to speed.👇
Learn the basics of Data Science Learn the basic components of Data Science in this beginner's course from Coding With Max. This course covers: Statistics: learn about the types of data you'll encounter, types of averages, variance, standard deviation, correlation, and more. Data visualization: learn about why we need to visualize....
Check out my newest blog post about the importance of storytelling:
https://www.codingwithmax.com/blog/6-steps-to-storytelling-data-scientist
6 Steps to Storytelling Your Data Like a Senior Data Scientist Being able to analyze data properly has always been important, even before we went into the digital and big data era. Data analysis is a very important skillset for scientists, because models are built on the results that we see in experiments, and if we are able to properly analyze our experimental
We've got a new post up on the blog on exactly how to tell stories with data like an advanced data scientist.
Because as we all know - data analysis is only half the story. The other half is being able to explain your findings and convince others of your conclusions.
Check out the blog post to find out how to master this skill!
https://www.codingwithmax.com/blog/6-steps-to-storytelling-data-scientist
Does your code sometimes take forever to run? Is your code hung-up on doing the same type of data manipulation for lots and lots of data?
This blog post can help you speed up your code, especially reducing computation time for bulk data manipulation:
https://www.codingwithmax.com/blog/shortening-code-making-it-run-fast
The Secret to Shortening Your Code and Making it Run 150x Faster Today, I want to talk about something that honestly blew my programming mind. I was taking a course on Neural Networks and Deep Learning - the one by Andrew Ng, former head of Baidu AI Group and Google Brain. And I stumbled upon a method I’d never encountered before to shortening your code and mak...
Wondering which programming language to learn?
Wonder no more - Python is the answer to any programming problem. Check out these 9 reasons as to why Python is the only programming language you'll ever need!
https://www.codingwithmax.com/blog/why-python-is-best-programming-language
Curious about the new growing field of data science but not sure how to get started with it? We've got you covered!
Check out our new blog post: the Ultimate Step-by-Step Guide on Getting Started with Data Science!
https://www.codingwithmax.com/blog/step-by-step-guide-getting-started-data-science
Ultimate Step-by-Step Guide on Getting Started with Data Science Data science is becoming a hotter and hotter topic by the minute, and data scientists are more and more in demand at all sorts of companies. I personally like to think of data scientists as the watermelon of the fruit aisle in the summer. Everyone wants one - but there is a limit