An introduction to SQL for data science
In this post you'll learn about the SQL language and how to SELECT data, filter it with WHERE clauses, create new features with CASE WHEN and aggregate data with GROUP BY.

Contents
- An introduction to SQL and the Star Wars data set
- Using SELECT to select our data
- Filtering data with WHERE clauses
- Changing the order of our data with ORDER BY
- Creating new features with conditional logic using CASE WHEN
- Aggregating our data with GROUP BY
- Filtering our aggregated data with HAVING
An introduction to SQL and the Star Wars data set
SQL, which stands for Structured Query Language, was developed in the early 1970s at IBM. It's a declarative, domain-specific language designed for managing data held in a relational database management system (RDBMS). A relational database is a collection of tables (data sets) that are structured, i.e. the data fits nicely into rows and columns, and that can be linked to one another, i.e. they are relational.
- Tables: Data is stored in tables, which are similar to spreadsheets. For example, we might have a Sales table that records every transaction and a Stores table that contains information about each of the stores in which the sales took place.
- Rows (Records): Each row in a table represents a single, unique entry. In the Stores table, each row would represent one store.
- Columns (Attributes): Each column defines a specific type of information for every row. In the Sales table, columns might include Store_ID, Date, Order_ID, Sales_Amount.
- Keys: Special columns are used to uniquely identify rows and link tables. A primary key (like Order_ID) is a unique identifier for each row in the Sales table. A foreign key is a primary key from one table that is included in another table to establish a link. For instance, the Store_ID would be included in the Sales table so we can link each order back to the store it was made in.
SQL was designed to solve the challenges of storing, retrieving, and manipulating data efficiently. Its primary use cases were, and still are, centred around four key operations: Data Retrieval (Querying), Data Manipulation, Data Definition and Data Control. In today's tutorial we'll be focussing on data querying, which is the most common use of SQL. SQL allows us to extract specific information from large datasets quickly and (hopefully) without needing to write complex code. In the next section we'll see how the SELECT statement allows us to easily specify exactly what data we want, from which tables and under what conditions.
We mentioned at the start that SQL is a 'declarative language', and this is actually one of its most powerful features. It means we describe what we want from the data, not how the database should get it. The query optimiser handles the "how" for us behind the scenes. If we imagine our database is like a library, the declarative nature of SQL working in tandem with the optimiser is a bit like asking a librarian for a book. We simply tell them we're looking for a copy of The Hobbit by J. R. R. Tolkien (declarative) and they'll go off and efficiently find it for us. We don't need to give them a set of instructions (imperative) like, "Walk to the fiction section, go to aisle 5, find the 'T' shelf, and locate the book". The librarian, like the database's query optimiser, already knows the most efficient way to fulfil our request.
For this tutorial we'll be using the Star Wars data set which contains information on 87 different characters from the Star Wars universe. It usually comes loaded as part of the dplyr package in R but we can download it from the dplyr repo on GitHub. I've loaded it into a table called 'misc.dbo.starwars_data'. Let's go ahead and have a look at our data. We tell SQL which table we want to use with a FROM table_name and finish our query with a ;. You might also notice why SQL is sometimes called a 'shouty' language as it uses a lot of caps. SQL is actually very flexible about how we can write our code, which has pros and cons. For this tutorial we'll be following the general style guide of upper case keywords (SELECT, FROM, etc.) and lower case table and variable names.
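Here's a simple first query (the columns picked are just for illustration):

```sql
SELECT
    name,
    height,
    mass
FROM misc.dbo.starwars_data;
```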
As you can see we've got our keywords (helpfully highlighted too) in upper case, and each of the variables we're using is on its own line with an indent. Our table name is also in lower case to match the variables. These are handy formatting tips for the following reasons:
- Keywords in uppercase → improves readability.
- Columns indented and on separate lines → easy to add/remove columns without missing commas.
- Consistent indentation → works well when joins and conditions are added later.
- Lowercase identifiers → avoids confusion with SQL keywords.
Using SELECT to select our data
The SELECT statement is how we retrieve data in SQL. It allows us to tell the database exactly which pieces of information we want to see. For example, if we only wanted to know the 'name' and 'height' of our Star Wars characters, we'd run the following query:
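```sql
SELECT
    name,
    height
FROM misc.dbo.starwars_data;
```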
We are asking the database to return two columns: name, which contains the names of Star Wars characters, and height, which contains their heights in centimetres. This means that, regardless of how many columns exist in the underlying table, only these two will be included in the results. Notice though that we get all the rows in our data set returned. It also looks like it's stored in alphabetical order of character name. We can see that some characters have missing or empty values for their height, which SQL records as NULL. NULL in SQL is not a value in its own right but rather a special marker that indicates that the information simply does not exist in our database.
The FROM clause specifies where the database should look for the data. In this case, 'misc.dbo.starwars_data' refers to a table named 'starwars_data', stored within the 'dbo' schema, which in turn is part of the 'misc' database. Although we write SELECT before FROM in the query, the database processes them in the opposite order. It first locates the table specified in FROM, then builds a working dataset from it and finally selects the requested columns.
Understanding this order of operations is important for building more complex queries. In a simplified form, SQL processes queries by first reading from the source (FROM), then filtering rows (WHERE), grouping them (GROUP BY), filtering the groups (HAVING), selecting the columns or expressions to return (SELECT), and finally ordering the results (ORDER BY). We'll run through each of these different keywords in turn.
One useful feature of SELECT is that it is flexible in both content and order. The order in which we list the columns is the order in which they will appear in the output. For example, let's ask for the 'height' column first this time.
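Our query becomes:

```sql
SELECT
    height,
    name
FROM misc.dbo.starwars_data;
```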
If we wanted to select every column and every row in our data, rather than write them all out, we can use a * to tell SQL that we want to return everything.
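For example:

```sql
SELECT *
FROM misc.dbo.starwars_data;
```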
We'll see later how we can use more complicated expressions and functions in our SELECT list. This means that as our analysis becomes more advanced, we can start performing calculations, reformatting values, and creating entirely new derived columns directly in the query results.
Sometimes column names in a database can be a bit cryptic, too long or not descriptive enough for our purposes. SQL allows us to assign a temporary, more readable name to a column in our result set. This is called 'aliasing', and it's done using the `AS` keyword.
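For example, we can rename a few columns (and, as we'll see below, the table itself) like so:

```sql
SELECT
    name   AS character_name,
    height AS height_cm,
    mass   AS weight_kg
FROM misc.dbo.starwars_data AS sw;
```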
We can see that with our new aliasing:
- name will appear as character_name, making it obvious that the value refers to the name of a Star Wars character.
- height will appear as height_cm, making the unit of measurement explicit and reducing ambiguity.
- mass will appear as weight_kg, again clarifying what the value represents and what units it is measured in.
We are also applying an alias to the table itself. The FROM misc.dbo.starwars_data AS sw statement gives the table a shorter reference name (sw) for use within the query. This becomes particularly useful when queries grow more complex — for example, when we join multiple tables or need to reference the same table multiple times. Using a short alias like sw saves typing, keeps our code clean and makes it easier to follow which table a column belongs to.
Aliases are especially valuable in collaborative environments where the output of a query may be shared with others who are not familiar with the original database schema. By taking a few extra moments to label our columns clearly, we make our results more intuitive, easier to interpret, and less prone to misinterpretation.
Filtering data with WHERE clauses
So far with our SELECT, we've been telling SQL which columns we want to return and then bringing back all of the rows for them. The WHERE clause filters rows from a table, allowing us to retrieve only the data that meets specific criteria. For example, let's reuse our original query to select 'name' and 'height' but this time we'll only return rows for characters that are exactly 200cm tall. We do this by using a WHERE after our FROM and specifying our condition i.e. height = 200.
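Our query now looks like this:

```sql
SELECT
    name,
    height
FROM misc.dbo.starwars_data
WHERE height = 200;
```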
We can see this time we've only got 1 row returned along with our selected columns. In our query we're asking for characters who are exactly 200 centimetres tall. The = operator checks for equality, so only rows where the value in the height column is precisely 200 will be included. If we instead wanted all of the characters that are 200cm or taller, we can use '>=' i.e. greater than or equal to 200.
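Updating our filter gives:

```sql
SELECT
    name,
    height
FROM misc.dbo.starwars_data
WHERE height >= 200;
```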
Now we have more characters, all of whom are at least 200cm tall. Below is a list of all the common logical operators available in SQL for filtering with. We'll run through most of these but feel free to experiment with different ones yourself.
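- = (equal to) and <> or != (not equal to)
- >, >= and <, <= (greater than / less than, optionally "or equal to")
- BETWEEN ... AND ... (within an inclusive range)
- IN (...) (matches any value in a list)
- LIKE (pattern matching with % and _ wildcards)
- IS NULL / IS NOT NULL (tests for missing values)
- AND, OR and NOT (combine or negate conditions)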
So far we've been applying our WHERE filter onto the same columns that are in our SELECT statement so we can easily see the relationship between how they work i.e. use WHERE to filter for height >=200 and then SELECT height to confirm that all the rows are >=200.
Remember though the order in which SQL applies keywords: read the source (FROM), filter rows (WHERE), group them (GROUP BY), filter the groups (HAVING), select the columns (SELECT) and order the results (ORDER BY). Because WHERE filters as SQL reads the table from the database, we can apply a filter on a column even if we don't subsequently SELECT it. For example, let's apply the same WHERE height >= 200 but this time ask for 'homeworld' instead of the 'height' column.
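The query becomes:

```sql
SELECT
    name,
    homeworld
FROM misc.dbo.starwars_data
WHERE height >= 200;
```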
We can see that we've got the same number of rows and the same characters i.e. the ones with height >= 200cm but this time we've not actually returned the height column at all.
We can also specify a range of values to filter by using the BETWEEN keyword. For example, let's say we wanted all of the characters whose height is between 150 and 200 cm inclusive.
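The query looks like this:

```sql
SELECT
    name,
    height
FROM misc.dbo.starwars_data
WHERE height BETWEEN 150 AND 200;
```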
Both 150 and 200 are part of the result, making BETWEEN especially convenient when we want to capture values across a whole band. So far all of our filtering has been on numerical columns but we can also filter on text columns too. For example, we can filter on the character name to bring back all the data for just one character.
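For example, to bring back everything we have on Chewbacca:

```sql
SELECT *
FROM misc.dbo.starwars_data
WHERE name = 'Chewbacca';
```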
Here our filter performs an exact match on the text string. Depending on which database you're using, case sensitivity can sometimes be a factor when working with strings. For example, some databases would be happy to accept 'chewbacca' as a match for the string 'Chewbacca' whereas others will not due to the difference in case for the first letter.
Most databases have built-in functions that will change the case of a string for you. If you wanted to make the search case-insensitive, it's often best practice to use one of these to remove any potential issues that could be caused by case mismatches. For example, we can tell SQL to make all the letters lowercase in our string and then simply search for 'chewbacca', which would remove any worries around case sensitivity in our filter.
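A sketch using the widely supported LOWER() function:

```sql
SELECT
    name
FROM misc.dbo.starwars_data
WHERE LOWER(name) = 'chewbacca';  -- matches 'Chewbacca', 'CHEWBACCA', etc.
```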
We saw how we could filter for a range of numerical values with the BETWEEN keyword. For text we have something similar which is IN. This is where we provide a list of strings (wrapped in brackets) that we want to filter on and tell SQL to return anything IN that list. This is effectively a shorthand for multiple OR conditions (which we'll cover later) and is very efficient when we have a list of distinct values to match against. To find characters whose eye colour is one of several specific options, such as 'brown', 'hazel', or 'black' we can use the IN operator like so: WHERE eye_color IN ('brown', 'hazel', 'black').
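Written out as a full query:

```sql
SELECT
    name,
    eye_color
FROM misc.dbo.starwars_data
WHERE eye_color IN ('brown', 'hazel', 'black');
```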
For more flexible text matching, where we know a part of the text but not the whole thing, we use the LIKE operator with wildcard characters. For example, WHERE name LIKE '%Skywalker%' will find any character whose name contains 'Skywalker', regardless of what comes before or after it. The percentage sign (%) acts as a wildcard, representing any sequence of zero or more characters. This is incredibly useful for searching within text fields where exact matches aren't feasible.
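As a full query, that might look like:

```sql
SELECT
    name
FROM misc.dbo.starwars_data
WHERE name LIKE '%Skywalker%';  -- matches 'Luke Skywalker', 'Anakin Skywalker', etc.
```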
Working with real-world datasets often means encountering missing information, represented by NULL. In the case of SQL, NULL means "unknown" or "missing" rather than zero, an empty string, or false. This distinction is important because any direct comparison with NULL (e.g., birth_year = NULL) will always return UNKNOWN, which is then treated as false. Instead, we must use IS NULL or IS NOT NULL to correctly identify rows where data is missing or present.
Sometimes, when reading in CSVs or moving data between systems, NULL values can end up being converted into strings, i.e. the special NULL marker gets stored as the literal string "NULL" instead. To make sure the code in the next section runs ok, the below block of code updates our base table to explicitly specify the format for each of our columns and how NULLs should be recorded. Depending on how your data was read in, you might have to play around with it to get it to run on your system.
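A rough sketch of that clean-up, assuming the affected columns were loaded as text (the exact columns and syntax will depend on your system):

```sql
-- Sketch only: convert literal 'NULL' strings back into real NULLs.
-- Only needed if your CSV import stored missing values as the text 'NULL'.
UPDATE misc.dbo.starwars_data
SET height     = NULLIF(height, 'NULL'),
    mass       = NULLIF(mass, 'NULL'),
    birth_year = NULLIF(birth_year, 'NULL');
```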
To find characters for whom we do have a populated birth_year, we use WHERE birth_year IS NOT NULL. This ensures our output and any future analysis would only include complete data points.
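For example:

```sql
SELECT
    name,
    birth_year
FROM misc.dbo.starwars_data
WHERE birth_year IS NOT NULL;
```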
To find characters with unknown birth years (missing data), we use WHERE birth_year IS NULL. This identifies all rows where the birth_year column has no value.
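And conversely:

```sql
SELECT
    name,
    birth_year
FROM misc.dbo.starwars_data
WHERE birth_year IS NULL;
```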
Often, our filtering needs are more complex, requiring us to combine several conditions. We do this using logical operators: AND and OR. When we need all specified conditions to be true, we use AND. For example, WHERE height > 180 AND mass > 80 AND species = 'Human' will only return rows for characters who are both taller than 180 cm and heavier than 80 kg and are of the 'Human' species. All criteria must be met for a row to be included.
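As a full query:

```sql
SELECT
    name,
    height,
    mass,
    species
FROM misc.dbo.starwars_data
WHERE height > 180
  AND mass > 80
  AND species = 'Human';
```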
Conversely, OR allows us to return rows where at least one of the conditions is true. In the query WHERE species = 'Droid' OR birth_year > 50, we'll get characters who are either a 'Droid' (regardless of their birth year) or who have a birth_year greater than 50 (regardless of their species). It broadens our result set to include rows meeting any of the stated conditions.
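For example:

```sql
SELECT
    name,
    species,
    birth_year
FROM misc.dbo.starwars_data
WHERE species = 'Droid'
   OR birth_year > 50;
```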
We can see in our output that we do have a collection of droids and characters with a birth year > 50. For example, BB8 has a missing birth year so would fail the birth_year>50 condition. However he gets included because he passes the species = 'Droid' filter.
A critical point when combining AND and OR is the order of operations: SQL processes AND conditions before OR conditions, which can lead to unexpected results if we're not careful. When working with AND and especially OR, it is therefore good practice to use brackets () to tell SQL explicitly how you want the evaluation to occur.
For example, let's say we're after droids and humans that are all over 200cm tall. We might try writing this as WHERE species = 'Human' OR species = 'Droid' AND height > 200. However, without brackets, the database will interpret this as species = 'Human' OR (species = 'Droid' AND height > 200). This means it will return all humans, irrespective of their height, plus droids taller than 200 cm. This is not what we intended as we can see from the output below.
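Here's the ambiguous version:

```sql
SELECT
    name,
    species,
    height
FROM misc.dbo.starwars_data
-- AND binds tighter than OR, so this returns ALL humans
-- plus only the droids taller than 200cm.
WHERE species = 'Human' OR species = 'Droid' AND height > 200;
```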
Whereas what we wanted was tall humans or tall droids, what we've ended up with is all humans along with our tall droids. To ensure the logic is clear and behaves as expected, we use brackets to define our desired order of operations. The query WHERE (species = 'Human' OR species = 'Droid') AND height > 200 clearly instructs the database to first find all characters who are either 'Human' or 'Droid' and then, from that filtered group, select only those who are taller than 200 cm. This ensures we get only tall humans and tall droids, reflecting our intended logic. Using brackets diligently prevents logical errors and makes our queries far more readable and predictable.
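And the corrected, bracketed version:

```sql
SELECT
    name,
    species,
    height
FROM misc.dbo.starwars_data
WHERE (species = 'Human' OR species = 'Droid')
  AND height > 200;
```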
We can see that by adding brackets we get the result we expected i.e. all the humans/droids that are also over 200cm tall and, as it turns out, only Darth Vader meets the specified criteria.
So far we've been selecting and filtering our data. Remember at the start how we saw that our data was in alphabetical order based on the character's name? In the next section we'll see how we can change the sort order with ORDER BY.
Changing the order of our data with ORDER BY
By default, SQL does not guarantee any particular ordering of rows unless we explicitly specify it. This means that if we want our results to be interpretable, comparable across runs or useful for reporting, we need to control how they are sorted. Ordering is especially important when we are looking for the top or bottom values in a dataset, such as the tallest characters, the heaviest droids or the oldest humans. It also ensures consistency when results are shared or integrated into dashboards. The ORDER BY clause allows us to explicitly tell SQL the order in which rows are returned in our results. Let's try sorting our data by height and seeing what happens.
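A first attempt:

```sql
SELECT
    name,
    height
FROM misc.dbo.starwars_data
ORDER BY height;
```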
These results might look a little strange at first. Since no direction is specified, i.e. whether we want to sort ascending (smallest to largest) or descending (largest to smallest), the sort is ascending by default. We can see that, for characters with non-NULL heights, the shortest is at the top and heights then increase down the list. However, our first few rows are all NULL heights. It is a common convention, although it varies by database provider, that NULL values are sorted either at the very beginning or the very end of the result set. This is why it can be helpful to filter them out, as we're probably less interested in missing height values and more concerned with sorting and inspecting populated ones.
Let's try sorting our data in a descending order this time. We'll use ORDER BY mass DESC command to tell the database to sort our characters based on their mass in descending order, meaning the heaviest characters will appear at the top of our list. We'll also filter out any characters with missing mass values.
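The query looks like this:

```sql
SELECT
    name,
    mass
FROM misc.dbo.starwars_data
WHERE mass IS NOT NULL
ORDER BY mass DESC;
```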
This query allows us to quickly see which characters are heaviest and we can see that Jabba Desilijic Tiure is by far the heaviest character in our data set!
We can actually tell the database to sort on multiple columns with the order in which we specify them determining the order in which they are applied. For example, the following query first sorts characters by species name in alphabetical order then within each species group, the results are further ordered by height in descending order, so that the tallest members of each species appear first.
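```sql
SELECT
    name,
    species,
    height
FROM misc.dbo.starwars_data
ORDER BY
    species ASC,
    height DESC;
```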
We can see that our multi-sort has been successfully applied, with 'species' sorting in ascending order (which is A-Z for text) and then, within each species, height in descending order i.e. tallest first. This is most obvious for the droids, where there are multiple characters with the same species and the tallest, IG-88, is indeed at the top of the group.
Let's try one last ORDER BY with three different sorting criteria. First, we'll arrange characters by species in alphabetical order. Within each species, we'll sort by height from tallest to shortest. Let's pretend we need a tie-breaker for when we have characters with the same height and species. We'll use birth year as a third level of sorting, placing the oldest characters first. We'll also filter out anyone with NULL values for any of our sorting criteria.
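Putting all three criteria together:

```sql
SELECT
    name,
    species,
    height,
    birth_year
FROM misc.dbo.starwars_data
WHERE species IS NOT NULL
  AND height IS NOT NULL
  AND birth_year IS NOT NULL
ORDER BY
    species ASC,
    height DESC,
    birth_year DESC;  -- birth_year is in BBY, so larger values = older characters
```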
By layering the sorting in this way, the query produces results that are well structured and interpretable. We can easily compare characters within the same species, see the tallest individuals listed at the top of each group and then understand how age differentiates those with identical heights. This approach illustrates the flexibility of ORDER BY in SQL, allowing us to define multiple levels of precedence so that our results are not only consistent but also meaningful for analysis.
Creating new features with conditional logic using CASE WHEN
The CASE WHEN statement is one of SQL's most powerful conditional logic tools, allowing us to implement sophisticated decision-making directly within our queries. At its core, CASE WHEN enables us to evaluate multiple conditions sequentially and return different values based on which condition is met first. This functionality is incredibly helpful in data science workflows where we frequently need to categorise, clean or transform our data based on specific business rules or analytical requirements.
The fundamental structure of CASE WHEN operates much like an if-else statement in programming languages. SQL evaluates each WHEN condition in the order they appear, and as soon as it finds a condition that evaluates to TRUE, it returns the corresponding result and stops checking further conditions. This sequential evaluation means that the order of our conditions is very important—we must place our most specific or restrictive conditions first to ensure they're properly evaluated before more general conditions.
The logical conditions that we can apply with our CASE WHEN are the same as we could use to filter our data with WHERE. This time however we're changing our data based on those conditions rather than removing it entirely. When no conditions evaluate to TRUE, SQL will return the value specified in the ELSE clause. If we don't include an ELSE clause and none of our conditions are met, SQL returns NULL. This behaviour requires careful consideration during query design as unexpected NULL values can affect downstream calculations and analyses. Let's have a look at a few examples.
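Here's an example query (the 'Unknown' label for NULL heights is just one sensible choice):

```sql
SELECT
    name,
    height,
    CASE
        WHEN height IS NULL THEN 'Unknown'
        WHEN height > 200 THEN 'Very tall'
        WHEN height BETWEEN 150 AND 200 THEN 'Average height'
        ELSE 'Below average height'
    END AS height_category
FROM misc.dbo.starwars_data
ORDER BY height DESC;
```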
In this query, we're creating a new categorical variable called height_category based on the continuous height variable. Notice how we've structured our conditions strategically: we first check for NULL values, which is crucial in real-world datasets where missing data is common. By handling NULLs explicitly at the beginning, we prevent them from falling through to our ELSE clause unexpectedly.
Our subsequent conditions follow a logical hierarchy from most restrictive to least restrictive. We identify characters over 200cm as 'Very tall', then those between 150 and 200 as 'Average height', with everyone else falling into the 'Below average height' category. This approach ensures that each character receives exactly one classification, creating clean, mutually exclusive categories essential for statistical analysis.
The ORDER BY clause at the end sorts our results by height in descending order, making it easier to verify that our categorisation logic is working correctly. This is a good practice when developing and testing CASE WHEN statements so we can quickly spot any logical errors.
Often as part of our data preparation for creating features for a machine learning model we need to create binary indicator variables (0/1 flags). The CASE WHEN in SQL makes this very easy to do.
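A sketch with a few illustrative flags:

```sql
SELECT
    name,
    CASE WHEN species = 'Droid' THEN 1 ELSE 0 END AS is_droid,
    CASE WHEN height > 200 THEN 1 ELSE 0 END AS is_tall,
    CASE WHEN birth_year IS NULL THEN 1 ELSE 0 END AS missing_birth_year
FROM misc.dbo.starwars_data;
```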
The pattern for creating binary flags is to have one CASE WHEN per flag: WHEN the condition is met THEN 1 ELSE 0 END. This creates clean 0/1 variables that can be directly used in mathematical calculations or fed into machine learning algorithms.
Let's have a look at one final example that makes use of some more complex logic to create some new features including a nested CASE WHEN statement.
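Here's a sketch of such a query (the category labels and the BMI cut-off of 25 are illustrative, and it assumes the films column is stored as a single delimited string):

```sql
SELECT
    name,
    -- Multi-condition logic combining species and film appearance.
    CASE
        WHEN species = 'Human' AND films LIKE '%A New Hope%' THEN 'Original film human'
        WHEN species = 'Droid' THEN 'Droid'
        ELSE 'Other'
    END AS character_group,
    -- Nested CASE: only calculate BMI when height and mass are both present.
    CASE
        WHEN height IS NOT NULL AND mass IS NOT NULL THEN
            CASE
                WHEN mass / ((height / 100.0) * (height / 100.0)) >= 25 THEN 'High BMI'
                ELSE 'Normal BMI'
            END
        ELSE 'Incomplete data'
    END AS bmi_category
FROM misc.dbo.starwars_data;
```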
This query illustrates several advanced techniques. The first CASE WHEN statement implements multi-condition logic using AND operators, allowing us to create more complex classification rules. For instance, we specifically identify human characters who appeared in the original Star Wars film, creating a unique category that combines species information with film appearance data.
The use of the LIKE operator with wildcards (%A New Hope%) demonstrates how we can search for partial text matches within string fields—a common requirement when working with text data that might contain multiple values or vary in formatting.
Our second CASE WHEN statement demonstrates nested CASE WHEN logic, where we have a CASE statement within another CASE statement(!). This nested approach allows us to implement hierarchical decision trees: first, we check whether we have complete data (non-NULL height and mass), and only if we do, we proceed to calculate and categorise the BMI. If the data is incomplete, we immediately assign an 'Incomplete data' classification without attempting the mathematical calculation.
This nested structure prevents potential errors that could arise from attempting calculations on NULL values and provides clear, actionable feedback about data quality issues—a crucial consideration in any data science workflow where data completeness and quality directly impact analytical outcomes.
Aggregating our data with GROUP BY
So far we've been working with data all at the same level of aggregation i.e. one row per character. We've filtered it, sorted it and created new features but all at a character level. Now we'll look at how we can aggregate our data to create summary statistics based on the table as a whole as well as specific groups within it.
For example, we've done a lot of work using the heights of individual characters to see if they are tall, short, missing, etc. We might also be interested to know what the average height across all the characters in our table is. We can do this by applying an aggregating function to our data. These tell the database that we want to aggregate our data in some way i.e. combine multiple rows into a single summary statistic. Common aggregating functions include:
- COUNT(): This function counts the number of rows that match a specified condition.
  - COUNT(*) counts all rows in the table.
  - COUNT(column_name) counts non-null values in a specific column.
- SUM(): This function calculates the total sum of a numeric column.
- AVG(): This function calculates the average value of a numeric column.
- MIN(): This function returns the minimum (smallest) value in a selected column.
- MAX(): This function returns the maximum (largest) value in a selected column.
Let's see how these work by calculating the average height of characters in our table.
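The query is simply:

```sql
SELECT
    AVG(1.0 * height) AS avg_height  -- multiplying by 1.0 guards against integer division in some databases
FROM misc.dbo.starwars_data;
```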
We can see that the average height of all characters in our dataset is 174.6cm (note that NULL values are ignored). In essence we're calculating a single statistic across our entire dataset. The AVG function processes all non-NULL height values and returns a single summary measure. Whilst this provides useful baseline information, it also obscures potentially important variations between different subgroups within our data. This is where GROUP BY comes in.
The GROUP BY clause allows us to partition our dataset into distinct groups based on shared characteristics, then apply aggregate functions to calculate statistics within each group. Without GROUP BY, aggregate functions like COUNT, AVG, and SUM operate across the entire result set, producing a single summary value like we saw above. When we introduce GROUP BY, these same functions operate within each distinct group, producing one summary row per group.
The conceptual model behind GROUP BY involves three distinct phases: scanning all qualifying rows, partitioning them into groups based on the specified grouping columns, and then calculating aggregate functions within each partition. This process fundamentally changes the granularity of our analysis from individual observations to group-level statistics.
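For example, grouping our characters by species:

```sql
SELECT
    species,
    COUNT(*) AS num_characters,
    AVG(height) AS avg_height
FROM misc.dbo.starwars_data
WHERE species IS NOT NULL
GROUP BY species;
```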
This query changes our level of aggregation by partitioning our characters into species-based groups. Instead of a single average height that might mask important biological differences, we now see the average height within each species. The addition of COUNT(*) also provides context by showing us how many characters comprise each average—essential information for assessing the reliability of our statistics.
We use the WHERE clause to exclude NULL species values before grouping. This decision reflects a common data science principle: explicitly handle missing data rather than allowing it to create misleading or uninformative groups. By filtering out NULLs, we ensure that our species-based analysis focuses on cases where we have meaningful grouping information.
To run our query, behind the scenes, SQL first identifies all unique species values, creates separate partitions for each species, and then calculates the average height and count within each partition. This process converts our row-level character data into species-level statistical summaries. Building on our basic grouping concept, we can calculate multiple statistics simultaneously to help us learn more about our data.
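For example:

```sql
SELECT
    species,
    COUNT(*) AS num_characters,
    AVG(height) AS avg_height,
    MIN(height) AS min_height,
    MAX(height) AS max_height,
    AVG(mass)  AS avg_mass
FROM misc.dbo.starwars_data
WHERE species IS NOT NULL
GROUP BY species
ORDER BY num_characters DESC;
```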
Here we calculate multiple descriptive statistics within each species group. We can see the averages for our metrics but also the variation i.e. minimum and maximum values within each group. The ORDER BY ranks species from most to least represented in our dataset to give a bit more insight.
Real-world data science often requires analysis across multiple dimensions simultaneously. GROUP BY accommodates this need through multi-column grouping:
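```sql
-- The aggregates shown are illustrative.
SELECT
    species,
    sex,
    COUNT(*) AS num_characters,
    AVG(height) AS avg_height
FROM misc.dbo.starwars_data
WHERE species IS NOT NULL
GROUP BY
    species,
    sex
HAVING COUNT(*) > 1;  -- covered in more detail below
```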
This query introduces multi-dimensional grouping by partitioning our data based on both species and sex. The resulting groups represent unique combinations of these characteristics—for example, (Human, male), (Human, female), (Droid, none), etc. Multi-dimensional groupings allow us to calculate our summary statistics within ever-smaller subgroups of data.
The HAVING clause (which we'll cover in more detail in the next section) was added so we only return sub-groups with more than 1 character in them. HAVING allows us to do post-aggregation filtering. Unlike WHERE, which filters individual rows before grouping, HAVING filters entire groups after aggregation calculations are complete. In this example, our HAVING condition excludes groups with only one character, focusing our analysis on combinations where we can calculate meaningful comparative statistics.
As well as using multiple columns to group by, we can also incorporate our CASE WHEN conditional logic into the aggregations:
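```sql
-- Illustrative conditional aggregations; assumes films is a delimited string.
SELECT
    species,
    COUNT(*) AS num_characters,
    -- Filtered count: only rows meeting the condition are counted.
    COUNT(CASE WHEN height > 200 THEN 1 END) AS num_tall,
    -- Conditional average: we control exactly which values are averaged.
    AVG(CASE WHEN height IS NOT NULL THEN 1.0 * height END) AS avg_height,
    -- Text-based condition inside an aggregate.
    COUNT(CASE WHEN films LIKE '%A New Hope%' THEN 1 END) AS in_a_new_hope,
    -- Ratio metric: proportion of 'tall' characters within each group.
    ROUND(100.0 * COUNT(CASE WHEN height > 200 THEN 1 END) / COUNT(*), 1) AS pct_tall
FROM misc.dbo.starwars_data
WHERE species IS NOT NULL
GROUP BY species;
```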
This query demonstrates the sophisticated analytical capabilities achieved by combining GROUP BY with conditional aggregation patterns. The COUNT(CASE WHEN condition THEN 1 END) pattern creates filtered counts within each group—we're essentially asking "within each species group, how many characters meet this specific condition?"
The conditional averaging technique shows how we can calculate statistics on subsets of data within each group. Rather than calculating the average height including potentially problematic NULL values, we explicitly control which values participate in our average calculation. This level of control over aggregate calculations is useful when dealing with real-world data quality issues. Our film appearance counting using LIKE pattern matching demonstrates how GROUP BY analysis can incorporate text-based conditions.
The percentage calculation creates what's essentially a key performance indicator (KPI) within each group—the proportion of characters meeting our "tall" criteria. Such ratio calculations transform absolute counts into comparable metrics across groups of different sizes, enabling fair comparisons between species with vastly different representation in our dataset.
Filtering our aggregated data with HAVING
The HAVING clause filters entire groups after aggregation rather than individual rows before grouping like WHERE would. This is useful as it allows us to apply filtering criteria to summarised data rather than raw observations.
Understanding HAVING requires appreciating SQL's order of execution, which fundamentally determines when different types of filtering can occur. Let's recap again: FROM identifies source tables, WHERE filters individual rows, GROUP BY partitions the remaining rows into groups, HAVING filters these groups based on aggregate conditions, SELECT chooses columns and applies functions, and finally ORDER BY sorts the results. This sequence means that HAVING operates on already-summarised data, enabling us to apply conditions that simply wouldn't be possible at the row level.
The conceptual distinction between WHERE and HAVING can be summarised as: WHERE asks "which individual rows should participate in our groups?" whilst HAVING asks "which completed groups should appear in our final results?" This difference becomes crucial when we need to exclude groups that don't meet certain aggregate criteria—such as species with too few characters for reliable statistical analysis, or categories that don't exceed certain threshold values.
A common use for HAVING is to deal with the challenge of excluding groups with insufficient sample sizes. Our HAVING clause here filters out any species represented by only a single character, ensuring that our statistical calculations reflect meaningful group patterns rather than individual quirks.
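A sketch of such a query, with the HAVING clause doing the group-level filtering:

```sql
SELECT
    species,
    COUNT(*) AS num_characters,
    AVG(height) AS avg_height
FROM misc.dbo.starwars_data
WHERE species IS NOT NULL   -- row-level filter, applied before grouping
GROUP BY species
HAVING COUNT(*) > 1;        -- group-level filter, applied after aggregation
```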
Notice how we structure our filtering logic across both WHERE and HAVING clauses. The WHERE condition 'species IS NOT NULL' removes individual rows lacking species information before any grouping occurs. This pre-filtering ensures that our GROUP BY operation works with clean data and prevents NULL species from forming an uninformative group. The 'HAVING COUNT(*) > 1' operates after grouping is complete, examining each species group and excluding those with only one member. The WHERE ensures data quality by removing problematic individual records, whilst HAVING ensures analytical reliability by removing groups that don't meet our minimum criteria for meaningful interpretation.
We can also specify multiple conditions in our HAVING clause:
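```sql
SELECT
    species,
    COUNT(*) AS num_characters,
    AVG(mass) AS avg_mass
FROM misc.dbo.starwars_data
WHERE species IS NOT NULL
GROUP BY species
HAVING COUNT(*) >= 3        -- sufficient sample size
   AND AVG(mass) > 50;      -- meaningful average mass
```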
This query illustrates how HAVING enables more complicated group selection based on multiple aggregate criteria. The AND operator in our HAVING clause ensures that groups must satisfy both conditions—they need at least 3 characters AND an average mass over 50kg. This conjunction creates a refined subset of our data that meets multiple analytical criteria simultaneously, enabling more focused and reliable analysis.
The power of this approach lies in its ability to combine statistical and substantive criteria within a single filtering step. We're specifically seeking species that meet two distinct aggregate conditions: sufficient sample size (at least 3 characters) and a meaningful biological characteristic (average mass exceeding 50kg). The sample size requirement ensures our average mass calculations are based on reasonable data volumes, whilst the mass threshold focuses our analysis on species of particular interest—perhaps those large enough to pose physical threats.
SQL's execution order places a couple of constraints on how we can use HAVING. First off, HAVING is designed to accompany GROUP BY. This reflects the fundamental purpose of the HAVING clause—it only makes sense to filter groups when groups have actually been created. Without GROUP BY there are no explicit groups to filter (most databases will simply treat the whole table as a single group). Secondly, HAVING cannot reference column aliases created in the SELECT clause because HAVING executes before SELECT. If we want to filter based on a calculated value, we must repeat the entire calculation in our HAVING clause rather than using the alias.
Let's have a look at one last complicated query that tries to put all our different learnings into practice in one place:
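```sql
-- A sketch: the film groupings are illustrative and assume films is a delimited string.
SELECT
    CASE
        WHEN films LIKE '%A New Hope%'
          OR films LIKE '%The Empire Strikes Back%'
          OR films LIKE '%Return of the Jedi%' THEN 'Original trilogy'
        ELSE 'Other films'
    END AS film_era,
    COUNT(*) AS num_characters,
    ROUND(100.0 * SUM(CASE WHEN species = 'Human' THEN 1 ELSE 0 END)
                / SUM(CASE WHEN species IS NOT NULL THEN 1 ELSE 0 END), 1) AS pct_human
FROM misc.dbo.starwars_data
GROUP BY
    -- The full expression must be repeated: GROUP BY runs before SELECT,
    -- so it can't see the film_era alias.
    CASE
        WHEN films LIKE '%A New Hope%'
          OR films LIKE '%The Empire Strikes Back%'
          OR films LIKE '%Return of the Jedi%' THEN 'Original trilogy'
        ELSE 'Other films'
    END
HAVING COUNT(*) > 1;
```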
This query showcases several advanced concepts that frequently arise in data science applications. First, we're grouping by a calculated expression rather than a simple column—our CASE WHEN statement creates film categories on the fly, demonstrating how GROUP BY can work with derived rather than stored values.
The grouping expression itself implements a common analytical pattern: collapsing detailed categories into broader, more analytically useful groups. Rather than analysing each film separately, we're comparing the original trilogy against all other films, creating a more focused analysis that addresses specific research questions about the evolution of character representation across different eras of the franchise.
Our HAVING clause applies the same sample size filter as previous examples, but now it's filtering film categories rather than species. This demonstrates the flexibility of HAVING—the same logical pattern (excluding small groups) applies across different analytical contexts and grouping schemes.
The complex percentage calculation within our SELECT clause illustrates why understanding execution order matters. If we wanted to filter based on this percentage, we couldn't reference the pct_human alias in our HAVING clause. Instead, we would need to repeat the entire calculation: HAVING ROUND(100.0 * SUM(CASE WHEN species = 'Human' THEN 1 ELSE 0 END) / SUM(CASE WHEN species IS NOT NULL THEN 1 ELSE 0 END), 1) > some_threshold.
Well done on making it to the end of this tutorial! Hopefully now you've got a better understanding of the SQL language and how you can chain together the different clauses to create powerful and insightful analysis. If you'd like to learn how to join data with SQL there's a tutorial here. Or if you fancy putting some of your new skills into practice there's a tutorial on how to calculate market basket association measures here.

