How to split a list inside a Dataframe cell into rows in Pandas

How to split a list inside a Dataframe cell into rows in Pandas

I started the “What’s cooking?” Kaggle challenge and wanted to do some data analysis. The given data set consists of three columns. Unfortunately, the last one is a list of ingredients.

I wanted to calculate how often an ingredient is used in every cuisine and how many cuisines use the ingredient. I had to split the list in the last column and use its values as rows. Additionally, I had to add the correct cuisine to every row.

Let’s look at an example. If my dataset looks like this:

1
2
cuisine_1,id_1,[ingredient_1, ingredient_2, ingredient_3]
cuisine_2,id_2,[ingredient_4, ingredient_5]

I want that output:

1
2
3
4
5
cuisine_1,id_1,ingredient_1
cuisine_1,id_1,ingredient_2
cuisine_1,id_1,ingredient_3
cuisine_2,id_2,ingredient_4
cuisine_2,id_2,ingredient_5

I wrote some code that was doing the job and worked correctly but did not look like Pandas code. Look at this, I dissected the data frame and rebuilt it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
ingredients = []
cuisines = []
ids = []
for _, row in data.iterrows():
    cuisine = row.cuisine
    identifier = row.id
    for ingredient in row.ingredients:
        cuisines.append(cuisine)
        ingredients.append(ingredient)
        ids.append(identifier)
ingredient_to_cuisine = pd.DataFrame({
    "id": ids,
    "ingredient": ingredients,
    "cuisine": cuisines
})

There must be a better way to do it.

Are you interested in data engineering?

Check out my other blog https://easydata.engineering

The better way

Let’s do it step by step.

Firstly, we have to split the ingredients column (which contains a list of values) into new columns. It is easy to do, and the output preserves the index. The index is important. We are going to need it ;)

1
data.ingredients.apply(pd.Series)
For testing I limited the data set to three rows.
For testing I limited the data set to three rows.

Now we can merge the columns with the rest of the data set. There is a lot of empty values, but that is fine. We will get rid of them later.

1
2
data.ingredients.apply(pd.Series) \
    .merge(data, left_index = True, right_index = True)

First of all, I don’t need the old ingredients column anymore. So, let’s drop it.

1
2
3
data.ingredients.apply(pd.Series) \
    .merge(data, right_index = True, left_index = True) \
    .drop(["ingredients"], axis = 1)

Now we can transform the numeric columns into separate rows using the melt function. Note that I use the cuisine and the id as the identifier variables.

1
2
3
4
data.ingredients.apply(pd.Series) \
    .merge(data, right_index = True, left_index = True) \
    .drop(["ingredients"], axis = 1) \
    .melt(id_vars = ['cuisine', 'id'], value_name = "ingredient")

It looks like the “variable” column contains the ids of the numeric columns. It is useless therefore we can remove that too.

1
2
3
4
5
data.ingredients.apply(pd.Series) \
    .merge(data, right_index = True, left_index = True) \
    .drop(["ingredients"], axis = 1) \
    .melt(id_vars = ['cuisine', 'id'], value_name = "ingredient") \
    .drop("variable", axis = 1)

I told you that we will get rid of the empty values. Now, it is time to do it.

1
2
3
4
5
6
data.ingredients.apply(pd.Series) \
    .merge(data, right_index = True, left_index = True) \
    .drop(["ingredients"], axis = 1) \
    .melt(id_vars = ['cuisine', 'id'], value_name = "ingredient") \
    .drop("variable", axis = 1) \
    .dropna()

Done! Now we have a data set that has the ingredients in separate rows.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you watch programming live streams, check out my YouTube channel.
You can also follow me on Twitter: @mikulskibartosz

If you want to hire me, send me a message on LinkedIn or Twitter.


Bartosz Mikulski
Bartosz Mikulski * data scientist / software/data engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group