Quick Python Pandas: Two Ways to Split a Dataframe

Quick Python Pandas: Two Ways to Split a Dataframe

Extract two smaller dataframes from a big one ─ sequentially or randomly

·

3 min read

Let's say you have a quite big dataframe. For whatever reason, you want to split it into two parts in some way. In this article, I will cover two common ways to do this:

  1. split the dataframe into two parts: its first N rows (or a percentage of the number of rows) and the rest;
  2. split it randomly into two parts, by a number of rows or percentage.

As example data, I will use my Spotify streaming history during some three months, which contains 6184 rows. (Which means I listened to almost 70 songs per day on average! Who would have thought?)

If you happen to love any of these songs or artists, let me know! 😎

import pandas as pd
df = pd.read_json('./StreamingHistory.json')

Screenshot 2021-09-28 at 00.52.47.png

1. Take the first N rows of the dataframe

In both cases, the tricky part is storing the second dataframe in another variable. Let's see how to do that.

First, let's take the first 1000 lines from the dataframe and store it in a new dataframe df_1:

df_1 = df.iloc[0:1000]

Screenshot 2021-09-28 at 00.52.58.png

That was pretty straightforward: iloc takes an index range as argument, and slices the dataframe according to that range.

To get the leftover part of our initial dataframe, we can use this line:

df_2 = df.loc[~df.index.isin(df_1.index)].reset_index(drop=True)

Screenshot 2021-09-28 at 00.54.07.png

In this case, loc does the slicing by a boolean condition ─ we are taking only those indexes that are not in df_1. Then, we have to reset indexing in the resulting slice, otherwise df_2 would start from index 1000.

Bonus: if you want to specify the size of df_1 using a fraction of the original df, you can define it like this:

frac = 0.9
df_1 = df.iloc[0:round(len(df) * frac)]

2. Split it randomly into two parts

One common reason for splitting a dataset into two random parts is to train and validate a machine learning model. In PyTorch there is a built-in method called random_split to do this. However, it returns a Subset object instead of the original Dataset type. So if you have a customized dataset type, it might not be straightforward to use random_split.

In that case, or in any other case when you need two (or one!) random parts of a dataframe, you can use the Pandas method sample. We can either specify the number of rows we want in our random sample or the ratio of the original dataframe, respectively with args n or frac:

df_1 = df.sample(n=5000)  # or: frac=0.9
df_2 = df.loc[~df.index.isin(df_t.index)].reset_index(drop=True)
df_1 = df_1.reset_index(drop=True)

We have to reset indexing here on df_1 as well because indexes of a random dataframe would just be random indexes. But! We have to do it after defining df_2, otherwise we would not get the remaining rows correctly.

Here are our resulting dataframes:

Screenshot 2021-09-28 at 01.26.01.png

Screenshot 2021-09-28 at 01.26.10.png

Notice that the records in the second dataframe are still in chronological order, unlike the first one.

This random sampling might seemingly not make much sense if you have time series data, as in this example. Or it could, if you want to create an amazing random playlist based on your past listening!

Conclusion

These were some easy tips, but I keep forgetting how to get the second dataframe when I work on similar tasks. Hopefully, I've explained this clearly for other readers as well as future-me! 🔮

If you have any questions or suggestions about Python and Pandas, please drop a comment. Or just to talk about music, of course!

And by the way, thanks for reading 😊 You've won a beautiful song by Italian rock band Måneskin (Eurovision 2021 winners).

Photo credits: