Looking for structure in data — Andrews curves plot explained

This may be one of those tools I will never use in real life, but I found it interesting enough to learn anyway.

An Andrews curves plot is a way of visualizing the structure of the independent variables. One possible use is as a lightweight check of whether the features are good enough to train a classifier, or whether we should keep doing feature engineering and data cleaning because the data is still a mess.

This works because the independent variables are summarized and the data is reduced to two dimensions. To be more precise, every observation is turned into a curve that can be drawn on a two-dimensional plot. When we draw the plot and color the curves by label, we see whether the observations are tangled with each other or grouped in separate streams.
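To make the "observation becomes a curve" step concrete, here is a minimal sketch of the standard Andrews formula, f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..., evaluated for t between -pi and pi. The helper name `andrews_curve` and the sample values are my own illustration, not part of any library.

```python
import numpy as np


def andrews_curve(x, t):
    """Evaluate the Andrews curve of one observation x at the angles t.

    f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ...
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature contributes a constant offset.
    curve = np.full_like(t, x[0] / np.sqrt(2.0))
    # The remaining features alternate between sine and cosine terms.
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            curve += coeff * np.sin(harmonic * t)
        else:
            curve += coeff * np.cos(harmonic * t)
    return curve


t = np.linspace(-np.pi, np.pi, 200)
observation = [0.2, 0.7, 0.5]  # made-up, already scaled feature values
curve = andrews_curve(observation, t)
```

Every row of the dataset produces one such curve, so observations with similar feature values end up as curves that run close together.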

In pandas, all we need is one function and preprocessed data (the values should be scaled to the range [0.0, 1.0], here with Scikit-learn's MinMaxScaler).

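import pandas as pd
from pandas.plotting import andrews_curves
from sklearn.preprocessing import MinMaxScaler

# X is assumed to hold three columns: feature_1, feature_2, and a numeric label
scaler = MinMaxScaler()

# fit_transform learns the min/max of every column and scales it in one step;
# calling transform on an unfitted scaler raises NotFittedError
X = pd.DataFrame(scaler.fit_transform(X), columns = ['feature_1', 'feature_2', 'label'])

andrews_curves(X, 'label', colormap = 'winter')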
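The snippet above depends on an existing `X`. For a version that runs on its own, here is a sketch with made-up data. The random features, the sample size, and the choice to scale only the feature columns (leaving the label untouched) are my assumptions, not the setup used for the chart in this post.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Made-up data: two features that carry no information about the label.
raw = pd.DataFrame({
    "feature_1": rng.normal(50.0, 10.0, size=60),
    "feature_2": rng.normal(0.0, 3.0, size=60),
    "label": rng.integers(0, 2, size=60),
})

# Scale only the feature columns to [0, 1]; the label stays as it is.
feature_columns = ["feature_1", "feature_2"]
scaled = raw.copy()
scaled[feature_columns] = MinMaxScaler().fit_transform(raw[feature_columns])

# One curve is drawn per row, colored by the value in the "label" column.
ax = andrews_curves(scaled, "label", colormap="winter")
```

The function returns a matplotlib Axes object, so the figure can be saved with `ax.figure.savefig(...)` or shown interactively after switching to a GUI backend.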

How to interpret the chart

What do we see here? Both colors are mixed, and there is no way to tell the classes apart. The same problem exists in the feature dataset: these two features alone give us no clear way to predict the target class.
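For contrast, it may help to see what "tangled" and "separate streams" look like in numbers. The sketch below builds one tangled and one clearly separated made-up dataset and compares the gap between the class-mean curves with the spread inside each class. The `separation_score` helper is an ad hoc measure I made up for this illustration; it is not an established metric and not something pandas computes.

```python
import numpy as np


def andrews_curves_matrix(X, t):
    """Return one Andrews curve per row of X, evaluated at the angles t."""
    X = np.asarray(X, dtype=float)
    basis = [np.full_like(t, 1.0 / np.sqrt(2.0))]
    for i in range(1, X.shape[1]):
        harmonic = (i + 1) // 2
        basis.append(np.sin(harmonic * t) if i % 2 == 1 else np.cos(harmonic * t))
    return X @ np.vstack(basis)  # shape: (n_observations, len(t))


def separation_score(X, labels, t):
    """Largest gap between the two class-mean curves, in units of within-class spread."""
    curves = andrews_curves_matrix(X, t)
    a, b = curves[labels == 0], curves[labels == 1]
    gap = np.abs(a.mean(axis=0) - b.mean(axis=0))
    spread = 0.5 * (a.std(axis=0) + b.std(axis=0))
    return float(np.max(gap / spread))


rng = np.random.default_rng(0)
t = np.linspace(-np.pi, np.pi, 200)
labels = np.repeat([0, 1], 100)

# Tangled: both classes come from the same distribution.
tangled = rng.uniform(0.0, 1.0, size=(200, 2))

# Separated: class 1 has clearly larger feature values than class 0.
separated = np.vstack([
    rng.uniform(0.0, 0.4, size=(100, 2)),
    rng.uniform(0.6, 1.0, size=(100, 2)),
])

tangled_score = separation_score(tangled, labels, t)
separated_score = separation_score(separated, labels, t)
print(f"tangled: {tangled_score:.2f}, separated: {separated_score:.2f}")
```

A much larger score for the separated dataset corresponds to what the eye picks up on the plot: two bundles of curves that barely overlap, instead of one mixed bundle.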


Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
