
Joining data in Python using pandas

In this post you'll learn how to merge data with pandas using the standard joins such as inner, left and full joins, along with some tips and tricks for common challenges such as merging multiple tables with different join keys. We'll also look at filtering and unioning.


pandas and the starwars data set

 

For this tutorial we're going to run through the different types of joins that are available in the pandas library. If you're not familiar with pandas there's a longer post all about data wrangling and manipulation with it here. We'll be using the starwars data set, which contains information on 87 different characters from the Star Wars universe. It usually comes loaded as part of the dplyr package in R but we can load it from the dplyr repo on GitHub. We'll create a couple of filtered tables from it so we can practice joining them together and see how each of the different joins in pandas behaves.

To keep things nice and simple we've got two main data sets we'll be using. I like to give my data sets names that help me remember what I've done to them, so apologies they're not very original. The first is the 'humans_droids' table, a collection of all the human and droid characters, which has 41 rows and 2 columns. The second data set, which I've just called 'humans', is just the human characters and has 35 rows and 2 columns.
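Here's a minimal sketch of how these two tables might be built. I'm assuming the CSV is still published at this path in the tidyverse/dplyr repo, and the column picks ('homeworld' for the first table, 'species' for the second) are based on the outputs we'll see later in the post:

```python
import pandas as pd

# load the starwars data from the dplyr repo on GitHub
url = "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"
starwars = pd.read_csv(url)

# all human and droid characters: 41 rows x 2 columns
humans_droids = starwars.loc[
    starwars["species"].isin(["Human", "Droid"]), ["name", "homeworld"]
]

# just the human characters: 35 rows x 2 columns
humans = starwars.loc[starwars["species"] == "Human", ["name", "species"]]
```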

There are actually two options for the syntax when we come to join our data sets in pandas. Both of the examples below do exactly the same thing, but the first is the more commonly used as it's a bit clearer about exactly which tables are being joined:
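As a sketch, here are the two equivalent ways of writing the same join (which the post showed first is my assumption; the top-level function names both tables explicitly, which fits the description):

```python
# option 1: the top-level pandas function - both tables are named explicitly
joined = pd.merge(humans_droids, humans, on="name")

# option 2: the DataFrame method - the left table is the one we call .merge() on
joined = humans_droids.merge(humans, on="name")
```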

Joining best practice

Before we have a look at how the joins work in pandas, it's worth mentioning two bits of best practice that apply to joining data in most languages:

  1. Don't join on missing data: the pandas merge documentation explicitly warns that "If both key columns contain rows where the key is a null value, those rows will be matched against each other" so it's very easy to end up with one-many or many-many joins unintentionally. If the missing data occurs differently across tables you can also end up joining to a different row than the one you intended.

  2. Avoid many to many joins: this is where the column you're joining on across tables has duplicate row values in both tables. As each row matches to every other row you end up with a table much larger than your original e.g. if you had 10 duplicates in your join column in both tables your final table will be 10x10=100 rows.

There will be rare occasions when you actually do want to do one of these, but more often than not they'll happen by accident and have the potential to give you strange results or, depending on the size of your data, crash your session or server. In the first example below, C-3PO, who has an NA value for 'hair_color', ends up joining against every character with missing 'hair_color' information. In the many-many join, our table goes from 87 rows all the way to 1,335 rows. Now imagine if we had millions of rows in our original join!
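A sketch of how both pitfalls can be reproduced at once by joining the starwars table to itself on 'hair_color' (the exact row count you get depends on the column you pick):

```python
# 'hair_color' has both NAs and lots of duplicate values, so NA keys match
# each other and every duplicated key matches every other occurrence of it
blown_up = pd.merge(starwars, starwars, on="hair_color")
print(len(starwars), "->", len(blown_up))  # the table multiplies in size
```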

 

You can also see in the output how pandas handles columns from different tables having the same name. If they're not specified in the 'on=' then pandas keeps both sets and gives them a '_x' or '_y' suffix to show which table they came from. The first table in the join is given _x and the second _y. We'll see how we can change these if we want to later on.

Finally, unlike in SQL where you need to explicitly list which columns are to be used to make the join, pandas is able to infer them for us if they have the same name in both data sets. This is a neat ability but I'd always recommend manually specifying them in the 'on=' as it's easy to unintentionally join on more or different columns than you intended. Note too that, unlike dplyr, which prints a message telling you which columns it joined on, pandas merges silently, so a wrong guess can easily go unnoticed.

Inner, right, left, outer and cross joins

If you've ever used joins in SQL these are basically their equivalents in pandas. Pandas joins keep all columns from both data sets and join based on matching rows. We'll run through each of these in turn, specifying which type of join we want with the 'how=' option. We'll start with the inner join as it's the default. An inner join only keeps rows that occur in both tables. We can see how this works with our example data sets:
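A sketch of the inner join with our two tables:

```python
# inner join: only keep rows whose 'name' appears in both tables
inner = pd.merge(humans_droids, humans, how="inner", on="name")
print(inner.shape)  # (35, 3): every human matches, the 6 droids drop out
```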

Selecting how='inner' tells pandas we only want to return rows that have a match in both tables. The 'on="name"' tells pandas which columns to use to join the two tables, i.e. we tell pandas which columns the tables have in common and it's between these that it finds matching rows. The column we specify in 'on=' is sometimes referred to as the 'join key'.

As the 'humans' data set is essentially a smaller subset of 'humans_droids', when we inner join them we get a data set that has all the rows from 'humans', since all the rows from 'humans' are in 'humans_droids', and we gain an extra column, 'species'. As an inner join only returns rows that occur in both tables, our new columns are complete/fully populated i.e. they have no missing values. The Venn diagram below shows a more visual way to understand what's going on:

[Venn diagram: an inner join returns only the rows in the overlap of the two tables]

Next up is the left join. This keeps all of the rows and columns from the first data set and adds any new columns from the second data set. Where there is a match on our join key, these new columns will be populated with values from the second table. Where there isn't a match, the row from the first table is still kept and the new columns will simply be given a missing/NA value to show that there was no match in the second table for that row. Let's see how this looks with our data:
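A sketch of the left join:

```python
# left join: keep every row of humans_droids, pull in 'species' where names match
left = pd.merge(humans_droids, humans, how="left", on="name")
print(left.shape)  # (41, 3): the droids get NaN in the new 'species' column
```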

We can see our data set this time is 41 rows as it's essentially a copy of our first/left-hand table 'humans_droids'. We gain the 'species' column from the second/right-hand table and where there are matching values for 'name' the corresponding 'species' value is brought over with the new column. Where there wasn't a match we just get an NA value. The Venn diagram for our left join looks like this:

[Venn diagram: a left join keeps all rows from the left-hand table and appends matching data from the right-hand table]

The black box around the left-hand table is there to emphasise that it's kept in its entirety after the join and info from the right-hand table is only appended where there is a match.

 

Having seen how the left join works in terms of keeping the first table and appending data from the second, what happens if we do our left join again but this time switch the order in which we join the tables?
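A sketch with the table order switched:

```python
# the same left join but with the tables the other way round
left_swapped = pd.merge(humans, humans_droids, how="left", on="name")
print(left_swapped.shape)  # (35, 3): every human matches, 'homeworld' fully populated
```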

This time we get a copy of the 'humans' table with a new 'homeworld' column from our second/right-hand table. We saw with our inner join how every row in 'humans' matches a row in 'humans_droids' and so every row of the new 'homeworld' column is populated with a value from our right-hand table on this occasion.

 

As well as the left join, which keeps a copy of the first table and appends data from the second, there is its opposite, the right join. It doesn't tend to get used much as you can always turn a right join into a left join by switching the table order, but for completeness let's see it in an example. In fact we can rewrite our previous left join to work as a right join:
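A sketch of the right join, which gives the same rows as the previous left join:

```python
# right join: keep every row from the right-hand table instead
right = pd.merge(humans_droids, humans, how="right", on="name")
# same 35 rows as pd.merge(humans, humans_droids, how="left", on="name"),
# only the column order differs
```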

[Venn diagram: a right join keeps all rows from the right-hand table and appends matching data from the left-hand table]

The next join is the outer join. Whereas until now we've only been keeping rows that matched in some way, an outer join combines both tables fully so we keep all the rows and all the columns from both tables. As all the data from both tables is kept it doesn't matter which is the left table or right table:
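A sketch of the outer join:

```python
# outer join: keep all rows and all columns from both tables
outer = pd.merge(humans_droids, humans, how="outer", on="name")
print(outer.shape)  # (41, 3) here, since 'humans' is a subset of 'humans_droids'
```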

[Venn diagram: an outer join keeps all rows from both tables]

The final join is the cross join, sometimes called a cartesian join. This is where every row from the first table is combined against every row from the second table. The result is a new table containing all possible pairings of rows between the two original tables. 

These types of joins are helpful if you ever need to compare all rows from one table against all other rows from a second table. For example, in the Yule's Q association code there is a step where we need to create every possible pair of products to see which ones customers have bought both of.
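A sketch of a cross join; note that how='cross' takes no 'on=' argument and needs pandas 1.2 or later:

```python
# cross join: every row of the first table paired with every row of the second
crossed = pd.merge(humans_droids, humans, how="cross")
print(crossed.shape)  # 41 * 35 = 1,435 rows; the shared 'name' column gets _x/_y suffixes
```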

 

Tips and Tricks

 

The next section just runs through some examples of challenges that can crop up when joining data out in the wild and some hints and tips on how to overcome them.

What if my tables have names in common but I'm not joining on them?

We saw earlier that pandas handles column names that are common across tables but not in the 'on=' by giving them a '_x' or '_y' suffix. Let's remind ourselves of this by creating a new table, 'humans_flag', with a column called 'species' that, instead of recording the species of the character, just takes the value 'yes'/'no' depending on whether the character is human. If we then join this table we can see what happens when we have two columns with the same name that we aren't joining by:
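A sketch of how 'humans_flag' might be built (the construction is my own) and what the default suffixes look like:

```python
# 'species' here is just a yes/no flag for whether the character is human
humans_flag = starwars[["name"]].copy()
humans_flag["species"] = starwars["species"].eq("Human").map({True: "yes", False: "no"})

# both tables have a 'species' column we aren't joining on, so pandas
# keeps both as species_x (left table) and species_y (right table)
flagged = pd.merge(humans, humans_flag, on="name")
```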

If we don't want to use these default suffixes we can supply our own with the 'suffixes=' option. We pass a pair of suffixes, one for the first table and one for the second. This way, rather than just recording the fact that there were two columns with the same name in the join and which table they came from, we can try and make them more informative:
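A sketch with custom suffixes (the suffix names here are just illustrative):

```python
# replace the default _x/_y with more descriptive suffixes
flagged = pd.merge(humans, humans_flag, on="name",
                   suffixes=("_detail", "_is_human"))
```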

That looks a bit neater now and we can see that our two species columns are doing slightly different jobs for us. The first is a more general column that records the exact species of the character whereas the second is a binary indicator for whether the specific character is human or not.

What if my join keys have different names in each table?

For this we've got two options. We could change the name of the join key so they match but this could get annoying if we've got lots of tables to join. An easier way is to deal with it in the join by telling pandas what the different names of the join keys are. We do this by using 'left_on=' and 'right_on=' instead of just 'on='. Let's see how this works in practice by changing the 'name' to 'character_name' and saving it in a new object:
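A sketch, renaming the key in a copy of 'humans':

```python
# rename the join key in a copy of the table
humans_renamed = humans.rename(columns={"name": "character_name"})

# tell pandas what the key is called on each side of the join
merged = pd.merge(humans_droids, humans_renamed,
                  left_on="name", right_on="character_name")
# note: both key columns are kept in the output when their names differ
```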

What if I need to join on more than one column at a time?

So far all our joins have just been joining on one column at a time. Sometimes, particularly if we've got aggregated data, we might need to use multiple columns in our joins. Let's make some dummy data that records the average height and mass for characters but we'll group by 'species' and 'homeworld'. We can then also make a copy of the original starwars table but only keep the columns we need for the example:
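A sketch of the dummy data (the variable names are my own):

```python
# average height and mass per species/homeworld combination
avg_stats = (
    starwars.groupby(["species", "homeworld"], as_index=False)[["height", "mass"]]
    .mean()
)

# trimmed copy of the original table with just the columns we need
starwars_small = starwars[["name", "species", "homeworld"]]
```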

If we now want to join our aggregated data onto our second data set we need to make sure we use both 'species' and 'homeworld' in our join, otherwise we'll end up duplicating rows. For example, we've got two different averages for droids as some are from 'Naboo' and some from 'Tatooine'. We can't just join on 'species' otherwise every droid would get both of these averages, which wouldn't be right.

 

We get round this in a similar way to how we did with the differently named join keys. When we join our data we pass a list of column names in our 'on=' which tells pandas we want to join on multiple columns at the same time:
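A sketch of the multi-column join:

```python
# pass a list to on= to join on both columns at the same time
joined = pd.merge(starwars_small, avg_stats, on=["species", "homeworld"])
```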

What if I need to join on multiple columns and they have different names?

If we've got multiple columns we need to join on and they've got different names, we can combine the two previous approaches: we still pass lists of column names, but to 'left_on=' and 'right_on=' instead of 'on=':
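A sketch, using a hypothetical renamed copy of the averages table:

```python
# hypothetical copy of avg_stats with differently named keys
avg_renamed = avg_stats.rename(
    columns={"species": "species_name", "homeworld": "world"}
)

# pass lists to left_on= and right_on= instead of on=
joined = pd.merge(
    starwars_small, avg_renamed,
    left_on=["species", "homeworld"],
    right_on=["species_name", "world"],
)
```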

What if I want to join more than two tables at a time?

The easiest way to join lots of tables is probably to do the merges one at a time on separate lines. However, if you really want to do it all in one statement you can. For example, let's say we wanted to take our 'humans' data set, join on the 'homeworld' column from 'humans_droids' and then join against our averages table:
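A sketch of the chained version:

```python
# chain the merges in one statement
chained = (
    humans
    .merge(humans_droids, on="name")                # adds 'homeworld'
    .merge(avg_stats, on=["species", "homeworld"])  # adds the averages
)
```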

Is there a way to tell which observations come from which table after joining?

Sometimes when we're joining it's useful to know which observations are in both tables or just one of them. Pandas has a handy way of capturing this with the 'indicator=True' option:
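A sketch using indicator=True:

```python
# indicator=True adds a _merge column recording where each row was found
tracked = pd.merge(humans, humans_droids, how="outer", on="name", indicator=True)
print(tracked["_merge"].value_counts())
# expect 35 'both' (the humans) and 6 'right_only' (the droids)
```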

As we're outer joining we keep observations from both tables. The '_merge' column is the indicator that tells us which table each row came from. We can see that all the humans are in 'both' tables (which makes sense since we're joining 'humans' to 'humans_droids') but the droids are flagged as 'right_only' as they only appear in the right-hand table.

Filtering and joins

So far we've looked at joins which modify the first table by adding columns from the second table and potentially adding or removing rows too. The 'indicator=True' makes it easy to filter out observations after joining but sometimes we might not actually want to have all the additional columns from the second table and just want to filter our original table directly. 

 

For example, we might have a list of dates going back years, with the sales for each day, recorded in one table. In another table we might have 50 columns, one of which is the current financial quarter's dates. To get just the sales for the latest quarter we could inner join the tables, but then we'd also bring over 49 extra columns we don't need from the second table. We could drop these beforehand but that still adds an extra step. Thankfully, filtering the table directly is easy to do in pandas. I've included these here even though they're not technically joins; they're the equivalent of a semi-join or anti-join in dplyr. First let's compare how filtering the table works differently to inner joining. The below code is the equivalent of a semi-join:
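A sketch of the semi-join-style filter next to an inner join:

```python
# semi-join: keep rows of humans_droids whose name appears in humans,
# without bringing over any columns from the right-hand table
semi = humans_droids[humans_droids["name"].isin(humans["name"])]

# compare with an inner join, which also brings over the 'species' column
inner = pd.merge(humans_droids, humans, on="name")
```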

[Venn diagram: a semi-join keeps the matching rows from the left-hand table without adding any columns]

You can see the difference in behaviour: with the inner join we get the extra 'species' column from the right table whereas with the .isin() filter we don't. Another neat feature of filtering is that it doesn't mind if you have duplicate values in your join keys. It will never duplicate out rows if it finds one-many or many-many matches; it only ever keeps or drops rows. If we already have duplicate rows in our left table these will be preserved, we just won't get any more from joining to our right table.

The second type of filtering we can do is equivalent to an anti-join. Whereas the previous filter only returns rows that match in both tables, an anti-join returns rows from the left table that don't match rows in the right table. This can be helpful if, say, you've got a list of exclusions in one table that you need to apply to another. For example, let's say we've still got our sales by day table but we want to remove public holidays as we think these might skew our data. If we have a second table of public holiday dates we can filter against this to only keep dates in our first table that aren't in the public holiday table. To do this with filters in pandas we can simply negate our previous semi-join logic with ~ i.e. only return rows that aren't in the second table:
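A sketch of the anti-join-style filter:

```python
# anti-join: keep rows of humans_droids whose name does NOT appear in humans
anti = humans_droids[~humans_droids["name"].isin(humans["name"])]
print(anti.shape)  # (6, 2): just the droids are left
```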

[Venn diagram: an anti-join keeps the rows from the left-hand table that have no match in the right-hand table]

concat for union and union all type operations

So far we've been joining tables side-by-side i.e. left and right. But what if we wanted to join top to bottom i.e. append new data to our table? This is where we can use concat. As the name suggests, it concatenates tables along a given axis. By default it joins the tables top to bottom i.e. axis=0. It mimics the behaviour of a union all in SQL or dplyr in that it keeps all rows, including duplicate occurrences, but concat is less particular about all of the columns being present in both base tables.
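A sketch of the union-all-style concat:

```python
# stack the two tables top to bottom, keeping all rows from both
unioned = pd.concat([humans, humans_droids], axis=0, ignore_index=True)
print(unioned.shape)  # (76, 3): 35 + 41 rows, duplicates included
```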

The new data set is 76 rows long which is a combination of 35 rows from 'humans' and 41 from 'humans_droids'. Since the 'homeworld' column from the second table isn't present in the first, it is created but just filled with NA for all rows that came from the first table. Likewise, observations from the second table will have NA for the 'species' column from the first table. 

[Diagram: concat stacks the tables top to bottom, keeping all rows like a SQL union all]

If we want to append data but without having any common rows occurring twice, then we can use the concat function with the drop_duplicates() function. This works in exactly the same way as concat except we remove any duplicate rows from the final data set. The way drop_duplicates() works is it looks for complete duplicates i.e. rows that have the same values across all columns. In our data we don't actually have this as although we have the same 'name' column in both tables, the other columns are different as they are not shared between tables. This means if we run the below code we don't actually remove any duplicates.
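A sketch showing why nothing gets dropped here:

```python
# no row is identical across ALL columns, so nothing is removed
deduped = pd.concat([humans, humans_droids]).drop_duplicates()
print(len(deduped))  # still 76
```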

We still have 76 rows! To implement the deduping logic as required we can tell pandas to use a subset of columns to check for duplicates. This way, rather than having to look for rows that are the same across all columns, it only finds duplicate rows within the specified columns. In our case that would be the 'name' column.
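A sketch of deduping on just the join key:

```python
# only check the 'name' column for duplicates, keeping the first occurrence
deduped = pd.concat([humans, humans_droids]).drop_duplicates(subset="name")
print(len(deduped))  # 41: each character appears once
```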

That last result looks a lot better with each name only occurring once. We've successfully concatenated our tables without duplicating out any rows. This concludes this tutorial on joining data with pandas. If you'd like a more detailed tutorial on manipulating data with pandas you can find one here or you can see how these joins can be put to use to calculate Yule's Q for product associations here.
