Visualize common elements of two datasets using NetworkX
Yet another story from the “What’s cooking?” Kaggle competition. I was looking at other people’s Kaggle kernels and found a very interesting one.
The author noticed that one of the distinctive characteristics of cuisine are pairs of ingredients (for example salt + pepper, olive oil + vinegar, eggs + bacon, etc.) In the original kernel, the author used NLTK to convert the ingredients to bigrams. This solution has one huge problem. When you have an ingredient like “olive oil” it becomes a tuple (“olive”, “oil”). Two words, one ingredient. Not what I wanted.
From a list of ingredients to bigrams
Our starting point is a dataset which looks like this:
Every row consists of an identifier, the name of the cuisine and a list of ingredients. I want a list of pairs. If the ingredient list has three elements: “eggs, salt, pepper” I want three pairs: (“eggs”, “salt”), (“eggs”, “pepper”), and (“salt”, “pepper”).
1 2 from itertools import combinations dataset['bigrams'] = dataset.ingredients.apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x,2)])
Visualise common pairs of ingredients In the next step, I want to find the most popular pairs of ingredients. Then I want to create a graph with edges between cuisine and its ingredients.
Firstly, I have to convert the list of bigrams to data frame rows:
1 2 3 4 5 6 ingredient_to_pairs = train.bigrams.apply(pd.Series) \ .merge(train, right_index = True, left_index = True) \ .drop(["ingredients", "bigrams"], axis = 1) \ .melt(id_vars = ['cuisine', 'id'], value_name = "bigrams") \ .drop("variable", axis = 1) \ .dropna()
Now I have to count the pairs, sort them by the number of elements, and select the most popular ones.
1 2 3 4 5 6 7 8 9 10 mexican = ingredient_to_pairs[ingredient_to_pairs["cuisine"] == "mexican"] \ .drop(columns = "cuisine") \ .groupby(["bigrams"]).count().sort_values("id", ascending = False)[:25] mexican['cuisine'] = 'mexican' italian = ingredient_to_pairs[ingredient_to_pairs["cuisine"] == "italian"] \ .drop(columns = "cuisine") \ .groupby(["bigrams"]).count().sort_values("id", ascending = False)[:25] italian['cuisine'] = 'italian' combined = pd.concat([mexican, italian]) combined = combined.reset_index()
Finally, I can generate the graph using NetworkX. I use the circular layout because it makes it trivial to spot the ingredients popular in both cuisines.
1 2 3 4 5 6 7 8 9 import networkx as nx g = nx.from_pandas_edgelist(combined, source = 'cuisine', target = 'bigrams') pos = nx.circular_layout(g) cmap = plt.cm.RdYlGn colors = [n for n in range(len(g.nodes()))] nx.draw_networkx(g, pos, node_size = combined['id'].values * 4, edge_color = 'grey', cmap = cmap, node_color = colors, font_size = 15, width = 3) plt.title("Top 25 Bigrams for Mexican and Italian cuisine", fontsize = 40) plt.gcf().set_size_inches(60, 60) plt.show()
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz