Ludwig machine learning model in Kaggle

There is something beautifully simple about using Ludwig in Kaggle. It is almost like calling a “make_me_a_model(data)” function.

Almost. I decided to give it a try. It was the first time I was using Ludwig, so I decided to make it a little bit hardcore. It is advertised as a “magical” tool that trains deep learning models without the need to write code. Let’s use it like this. I am going to provide a minimal configuration and see what happens.

First of all, we must install the Ludwig library. It turned out that the version available in the pip repository threw errors when running in a Kaggle kernel, so I installed the most recent code straight from the GitHub repository.

!pip install https://github.com/uber/ludwig/archive/master.zip

After that, it was easy. I loaded the training dataset, defined the features, and ran Ludwig. Note that I used the “PUBG Finish Placement Prediction” dataset.

import pandas as pd

data = pd.read_csv('../input/train_V2.csv')

model_definition = {
    'input_features': [
        {'name': 'assists', 'type': 'numerical'},
        {'name': 'boosts', 'type': 'numerical'},
        ...
        {'name': 'matchType', 'type': 'category'}
    ],
    'output_features': [{'name': 'winPlacePerc', 'type': 'numerical'}]
}

from ludwig import LudwigModel

model = LudwigModel(model_definition)
model.train(data)
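Typing out every feature by hand gets tedious for a wide dataset. The list can also be derived from the column types, which is sketched below under some assumptions: `build_model_definition` is a hypothetical helper (not part of Ludwig), it maps every numeric column to 'numerical' and everything else to 'category', and it expects a plain dict of column names and dtype strings. That mapping is a crude heuristic. Identifier columns such as 'Id' should be excluded explicitly.

```python
def build_model_definition(column_dtypes, target, ignored=()):
    """Build a minimal Ludwig-style model definition from column dtypes.

    column_dtypes: dict of column name -> dtype string, e.g. 'int64' or 'object'
    target: name of the column to predict (assumed to be numerical)
    ignored: column names to leave out, such as row identifiers
    """
    input_features = []
    for name, dtype in column_dtypes.items():
        if name == target or name in ignored:
            continue
        # Heuristic: numeric dtypes become 'numerical', the rest 'category'.
        is_number = dtype.startswith(('int', 'uint', 'float'))
        feature_type = 'numerical' if is_number else 'category'
        input_features.append({'name': name, 'type': feature_type})
    return {
        'input_features': input_features,
        'output_features': [{'name': target, 'type': 'numerical'}],
    }


# Small illustration with a hand-written dtype mapping.
example_dtypes = {
    'Id': 'object',
    'assists': 'int64',
    'matchType': 'object',
    'winPlacePerc': 'float64',
}
example_definition = build_model_definition(
    example_dtypes, 'winPlacePerc', ignored=('Id',)
)
```

With a pandas DataFrame, the dtype mapping can be obtained with `data.dtypes.astype(str).to_dict()`.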

After that, I loaded the test dataset, ran the “predict” function and saved the results in an output file.

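import pandas as pd

data = pd.read_csv('../input/test_V2.csv')
ids = data['Id']

predictions = model.predict(data)
model.close()

# predict() returns a DataFrame. This assumes Ludwig names the column
# '<output feature>_predictions', so it is selected explicitly. Wrapping the
# whole DataFrame with columns=['winPlacePerc'] would yield a column of NaN.
output = pd.DataFrame({
    'Id': ids.values,
    'winPlacePerc': predictions['winPlacePerc_predictions'].values
})

output.to_csv('submission.csv', index=False)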
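Since a kernel run takes a long time, it may be worth checking the predictions for obvious problems before saving them. The sketch below is a hypothetical helper, not part of Ludwig or Kaggle. It assumes the target is a percentile between 0 and 1 (which is how the competition describes winPlacePerc) and that a regression output is not guaranteed to stay in that range.

```python
import math


def find_submission_problems(ids, predictions, low=0.0, high=1.0):
    """Return a list of problems found; an empty list means nothing looks wrong."""
    ids = list(ids)
    predictions = list(predictions)
    problems = []

    # Every row of the test set needs exactly one prediction.
    if len(ids) != len(predictions):
        problems.append(f"{len(ids)} ids but {len(predictions)} predictions")
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")

    def is_missing(value):
        return value is None or math.isnan(value)

    missing = sum(1 for p in predictions if is_missing(p))
    if missing:
        problems.append(f"{missing} missing predictions")

    # Values outside the expected range hint at an unbounded regression output.
    out_of_range = sum(
        1 for p in predictions if not is_missing(p) and not (low <= p <= high)
    )
    if out_of_range:
        problems.append(f"{out_of_range} predictions outside [{low}, {high}]")

    return problems
```

It could be called as `find_submission_problems(output['Id'], output['winPlacePerc'])`. If out-of-range values show up, pandas `Series.clip(0, 1)` is one simple way to bring them back into range.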

It is still running, so I have no idea what the score is. Maybe it is going to run out of memory or the available processing time. I don’t know yet.

If it works well without any human-driven preprocessing, I will start to worry a little bit. I think it is going to fail miserably.

How are such tools going to change data science and machine learning engineering? I think the only change we are going to see is less time spent writing boring code. After all, implementing a preprocessing pipeline is fun the first time you do it. Maybe the second or even the fifth time it is fun too. At some point it gets boring.

Perhaps we should start reading more business books, because it looks like simple machine learning can be successfully automated. What is left for us? In my opinion, we can now focus on the creative part of data science: finding new data, adding features to the dataset, looking for new business problems, and applying ML to solve them.
