How to split a list inside a Dataframe cell into rows in Pandas

How to split a list inside a Dataframe cell into rows in Pandas

I started the “What’s cooking?” Kaggle challenge and wanted to do some data analysis. The given data set consists of three columns. Unfortunately, the last one is a list of ingredients.

I wanted to calculate how often an ingredient is used in every cuisine and how many cuisines use the ingredient. I had to split the list in the last column and use its values as rows. Additionally, I had to add the correct cuisine to every row.

Let’s look at an example. If my dataset looks like this:

1
2
cuisine_1,id_1,[ingredient_1, ingredient_2, ingredient_3]
cuisine_2,id_2,[ingredient_4, ingredient_5]

I want that output:

1
2
3
4
5
cuisine_1,id_1,ingredient_1
cuisine_1,id_1,ingredient_2
cuisine_1,id_1,ingredient_3
cuisine_2,id_2,ingredient_4
cuisine_2,id_2,ingredient_5

I wrote some code that was doing the job and worked correctly but did not look like Pandas code. Look at this, I dissected the data frame and rebuilt it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
ingredients = []
cuisines = []
ids = []
for _, row in data.iterrows():
    cuisine = row.cuisine
    identifier = row.id
    for ingredient in row.ingredients:
        cuisines.append(cuisine)
        ingredients.append(ingredient)
        ids.append(identifier)
ingredient_to_cuisine = pd.DataFrame({
    "id": ids,
    "ingredient": ingredients,
    "cuisine": cuisines
})

There must be a better way to do it.

Do you want to show your product/service to 25000 data science enthusiasts every month? I am looking for companies which would like to become a partner of this blog.

Are you interested? Is your employer interested? Here are the details of the offer.

The better way

Let’s do it step by step.

Firstly, we have to split the ingredients column (which contains a list of values) into new columns. It is easy to do, and the output preserves the index. The index is important. We are going to need it ;)

1
data.ingredients.apply(pd.Series)
For testing I limited the data set to three rows.
For testing I limited the data set to three rows.

Now we can merge the columns with the rest of the data set. There is a lot of empty values, but that is fine. We will get rid of them later.

1
2
data.ingredients.apply(pd.Series) \
    .merge(data, left_index = True, right_index = True)

First of all, I don’t need the old ingredients column anymore. So, let’s drop it.

1
2
3
data.ingredients.apply(pd.Series) \
    .merge(data, right_index = True, left_index = True) \
    .drop(["ingredients"], axis = 1)

Now we can transform the numeric columns into separate rows using the melt function. Note that I use the cuisine and the id as the identifier variables.

1
2
3
4
data.ingredients.apply(pd.Series) \
    .merge(data, right_index = True, left_index = True) \
    .drop(["ingredients"], axis = 1) \
    .melt(id_vars = ['cuisine', 'id'], value_name = "ingredient")

It looks like the “variable” column contains the ids of the numeric columns. It is useless therefore we can remove that too.

1
2
3
4
5
data.ingredients.apply(pd.Series) \
    .merge(data, right_index = True, left_index = True) \
    .drop(["ingredients"], axis = 1) \
    .melt(id_vars = ['cuisine', 'id'], value_name = "ingredient") \
    .drop("variable", axis = 1)

I told you that we will get rid of the empty values. Now, it is time to do it.

1
2
3
4
5
6
data.ingredients.apply(pd.Series) \
    .merge(data, right_index = True, left_index = True) \
    .drop(["ingredients"], axis = 1) \
    .melt(id_vars = ['cuisine', 'id'], value_name = "ingredient") \
    .drop("variable", axis = 1) \
    .dropna()

Done! Now we have a data set that has the ingredients in separate rows.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you watch programming live streams, check out my YouTube channel.
You can also follow me on Twitter: @mikulskibartosz

If you want to hire me, send me a message on LinkedIn or Twitter.


If this article was helpful, consider donating to WWF or any other charity of your choice.
Bartosz Mikulski
Bartosz Mikulski * data scientist / software engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group