Ludwig machine learning model in Kaggle

There is something beautifully simple about using Ludwig in Kaggle. It is almost like calling a “make_me_a_model(data)” function.

Almost. I decided to give it a try. Since it was my first time using Ludwig, I made it a little bit hardcore. Ludwig is advertised as a “magical” tool that trains deep learning models without the need to write code, so let’s use it exactly like that: I am going to provide a minimal configuration and see what happens.

First of all, we must install the Ludwig library. It turned out that the version available in the PyPI repository was throwing errors when running in a Kaggle kernel, so I installed the most recent code straight from the repository.

!pip install https://github.com/uber/ludwig/archive/master.zip

After that, it was easy. I loaded the training dataset, defined the features, and ran Ludwig. Note that I used the “PUBG Finish Placement Prediction” dataset.

import pandas as pd
from ludwig.api import LudwigModel

data = pd.read_csv('../input/train_V2.csv')

# A minimal model definition: just the input feature names and types,
# plus the output feature we want to predict.
model_definition = {
    'input_features': [
        {'name': 'assists', 'type': 'numerical'},
        {'name': 'boosts', 'type': 'numerical'},
        ...
        {'name': 'matchType', 'type': 'category'}
    ],
    'output_features': [{'name': 'winPlacePerc', 'type': 'numerical'}]
}

model = LudwigModel(model_definition)
model.train(data)
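By the way, the elided list of input features does not have to be typed by hand. Here is a sketch (plain pandas, not part of Ludwig) that derives the feature definitions from the dataframe's dtypes. The helper name and the list of skipped id-like columns are my own assumptions, chosen to match the PUBG dataset:

```python
import pandas as pd

def infer_input_features(df, target, skip=('Id', 'groupId', 'matchId')):
    """Map dataframe columns to Ludwig-style feature definitions.

    Numeric columns become 'numerical' features; everything else
    becomes 'category'. The target and id-like columns are skipped.
    This is a hypothetical convenience helper, not a Ludwig API.
    """
    features = []
    for name, dtype in df.dtypes.items():
        if name == target or name in skip:
            continue
        kind = 'numerical' if pd.api.types.is_numeric_dtype(dtype) else 'category'
        features.append({'name': name, 'type': kind})
    return features

# A tiny example with the same column names as the PUBG dataset:
sample = pd.DataFrame({
    'Id': ['a', 'b'],
    'assists': [0, 1],
    'boosts': [2, 0],
    'matchType': ['squad', 'solo'],
    'winPlacePerc': [0.5, 0.9],
})
print(infer_input_features(sample, target='winPlacePerc'))
# [{'name': 'assists', 'type': 'numerical'},
#  {'name': 'boosts', 'type': 'numerical'},
#  {'name': 'matchType', 'type': 'category'}]
```

The result can be passed as the `input_features` value of the model definition above.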

After that, I loaded the test dataset, ran the “predict” function and saved the results in an output file.

import pandas as pd

data = pd.read_csv('../input/test_V2.csv')
ids = data['Id']

predictions = model.predict(data)
model.close()

# Combine the test set ids with the predicted placements
# into the submission format expected by Kaggle.
output = pd.concat([
    pd.DataFrame(ids, columns=['Id']),
    pd.DataFrame(predictions, columns=['winPlacePerc'])
], axis=1)

output.to_csv('submission.csv', index=False)
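Before uploading, it is worth sanity-checking the submission, because Kaggle rejects files with missing rows or NaN predictions. Here is a minimal sketch; the column names match the snippet above, the [0, 1] range check is specific to this competition's target, and the helper itself is my own addition:

```python
import pandas as pd

def validate_submission(submission, expected_ids):
    """Basic checks on a submission dataframe before uploading to Kaggle."""
    assert list(submission.columns) == ['Id', 'winPlacePerc'], 'wrong columns'
    assert not submission['winPlacePerc'].isna().any(), 'missing predictions'
    assert len(submission) == len(expected_ids), 'row count mismatch'
    # winPlacePerc is a placement percentile, so it must fall in [0, 1].
    assert submission['winPlacePerc'].between(0, 1).all(), 'value out of range'

# Example with a tiny fake submission:
fake = pd.DataFrame({'Id': ['a', 'b'], 'winPlacePerc': [0.1, 0.9]})
validate_submission(fake, expected_ids=['a', 'b'])
```

In the script above, you would call it as `validate_submission(output, expected_ids=ids)` right before `to_csv`.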

It is still running, so I have no idea what the score is. Maybe it is going to run out of memory or exceed the available processing time. I don’t know yet.

If it works well without any human-driven preprocessing, I will start to worry a little bit. I think it is going to fail miserably.

How are such tools going to change data science and machine learning engineering? I think that the only change we are going to see is less time spent on writing boring code. After all, implementing a preprocessing pipeline is fun the first time you do it. Maybe the second or even the fifth time it is fun too. At some point, it gets boring.

Perhaps we should start reading more business books, because it looks like simple machine learning can be successfully automated. What is left for us? In my opinion, we can now focus on the creative part of data science: finding new data, adding features to the dataset, looking for new business problems, and applying ML to solve them.

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz