LightAutoML vs Titanic: 80% accuracy in several lines of code

Published in TDS Archive · 7 min read · Apr 12, 2021

In this tutorial, we are going to talk about how to automatically create ML models in several lines of code using the open-source framework LightAutoML for the Titanic Survival competition on Kaggle.

At the end of 2020, the AutoML Team at Sber AI Lab released the open-source Python library LightAutoML, an Automated Machine Learning (AutoML) framework. It is designed to be lightweight and efficient for various tasks (binary/multiclass classification and regression) on tabular datasets that contain different types of features: numeric, categorical, dates, texts, etc.

LightAutoML installation is pretty simple: pip install -U lightautoml
Official LightAutoML documentation
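
Before running the rest of the tutorial, it can be handy to confirm that the installation succeeded. The snippet below is a small sketch using only the standard library; the helper name is_installed is our own and is not part of LightAutoML.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Print a hint instead of failing later with an ImportError.
if not is_installed("lightautoml"):
    print("LightAutoML not found; run: pip install -U lightautoml")
```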

LightAutoML provides not only presets for end-to-end ML task solving but also an easy-to-use ML pipeline constructor, including data preprocessing elements, advanced feature generation, CV schemes (including nested CVs), hyperparameter tuning, different models, and composition building methods. It also gives the user an option to generate model training and profiling reports to check model results and find insights that are not obvious from the initial dataset.

Below we show how to solve the Titanic: Machine Learning from Disaster competition using LightAutoML, from importing Python libraries to saving the final submission file.

Step 0.0. Super-short Titanic solution based on LightAutoML

[Code screenshot: extremely short Titanic solution based on LightAutoML]

The code above is available as a kernel here and scores 0.77990 on the Kaggle public leaderboard in just 7 minutes and 12 lines. The main LightAutoML part is only 3 lines, from the 8th to the 10th.

Below we will discuss another kernel with a score of 0.79665, which is more structured like a real business ML solution and can be used as a template for your own projects.

Step 0.1. Import necessary libraries

In this step we import 3 standard Python libraries, several libraries from the usual data science toolkit (numpy, pandas, and sklearn), and 2 presets from LightAutoML: TabularAutoML and TabularUtilizedAutoML. We will discuss later what they can do and how they differ.

# Standard python libraries
import os
import time
import re

# Installed libraries
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Imports from LightAutoML package
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task

Step 0.2. Datasets load

Now we need to load the train and test datasets and the submission file, which we should fill with the predicted class:

%%time

train_data = pd.read_csv('../input/titanic/train.csv')
train_data.head()

test_data = pd.read_csv('../input/titanic/test.csv')
test_data.head()

submission = pd.read_csv('../input/titanic/gender_submission.csv')
submission.head()

Step 0.3. Additional expert features creation block

The cell below prepares some hand-crafted features that can help LightAutoML separate positive and negative class objects. The logic behind them includes ticket type extraction from the Ticket column, family size calculation, name feature cleaning, etc.:

def get_title(name):
    # Raw string so that "\." is passed to the regex engine unchanged
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

def create_extra_features(data):
    data['Ticket_type'] = data['Ticket'].map(lambda x: x[0:3])
    data['Name_Words_Count'] = data['Name'].map(lambda x: len(x.split()))
    data['Has_Cabin'] = data["Cabin"].map(lambda x: 1 - int(type(x) == float))
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    
    data['CategoricalFare'] = pd.qcut(data['Fare'], 5).astype(str)
    data['CategoricalAge'] = pd.cut(data['Age'], 5).astype(str)
    
    data['Title'] = data['Name'].apply(get_title).replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    data['Title'] = data['Title'].replace('Mlle', 'Miss')
    data['Title'] = data['Title'].replace('Ms', 'Miss')
    data['Title'] = data['Title'].replace('Mme', 'Mrs')
    data['Title'] = data['Title'].map({"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}).fillna(0)
    return data

train_data = create_extra_features(train_data)
test_data = create_extra_features(test_data)
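
To make the Title logic easier to follow, here is a pure-Python sketch of the same steps applied to single names instead of a DataFrame column. The constants and the helper title_code are our own names, the sample names are made up for illustration, and the function returns plain integers, whereas the pandas version above produces a float column because of fillna(0).

```python
import re

RARE_TITLES = {"Lady", "Countess", "Capt", "Col", "Don", "Dr",
               "Major", "Rev", "Sir", "Jonkheer", "Dona"}
TITLE_ALIASES = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
TITLE_CODES = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}


def get_title(name):
    """Return the first word that follows a space and ends with a dot."""
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""


def title_code(name):
    """Map a passenger name to the integer title code (0 if unknown)."""
    title = get_title(name)
    if title in RARE_TITLES:
        title = "Rare"
    title = TITLE_ALIASES.get(title, title)
    return TITLE_CODES.get(title, 0)


# Illustrative names, not rows from the competition data.
for sample in ["Smith, Mr. John", "Doe, Mlle. Jane", "Brown, Dr. Alan", "NoTitleHere"]:
    print(sample, "->", title_code(sample))
```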

Step 0.4. Data splitting for train-validation

To validate the models we are going to build, we need to split the dataset into train and validation parts:

%%time

tr_data, valid_data = train_test_split(train_data,
                                       test_size=0.2,
                                       stratify=train_data['Survived'],
                                       random_state=42)
print('Parts sizes: tr_data = {}, valid_data = {}'
      .format(tr_data.shape, valid_data.shape))
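
The stratify argument keeps the share of survivors roughly the same in both parts. The sketch below illustrates the idea in pure Python on synthetic labels; it is not how sklearn implements train_test_split, and the helper stratified_split is a hypothetical name, so exact part sizes may differ from the sklearn result.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, valid_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Synthetic labels: 60 negatives and 40 positives.
labels = [0] * 60 + [1] * 40
train_idx, valid_idx = stratified_split(labels)
print("Positive share in train:", sum(labels[i] for i in train_idx) / len(train_idx))
print("Positive share in valid:", sum(labels[i] for i in valid_idx) / len(valid_idx))
```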

= LightAutoML preset usage =

Step 1. Create Task object

Now we are ready to build the model for predicting the Survived target variable. First of all, we set up the type of model we need using the LightAutoML Task class object, where the valid values are:

'binary' for binary classification
'reg' for regression and
'multiclass' for multiclass classification

As we have a binary classification competition, we set up the Task object with the 'binary' value and the F1 metric to pay more attention to the precision-recall balance of the model predictions:

def f1_metric(y_true, y_pred):
    return f1_score(y_true, (y_pred > 0.5).astype(int))

task = Task('binary', metric = f1_metric)
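
Note that f1_metric thresholds the predictions at 0.5 before scoring, because the model outputs probabilities while F1 is defined on class labels. The sketch below spells out that calculation by hand on a tiny made-up example; the function name f1_from_probabilities is ours, and in practice sklearn's f1_score does this work.

```python
def f1_from_probabilities(y_true, y_prob, threshold=0.5):
    """Threshold probabilities, then compute F1 for the positive class."""
    y_pred = [int(p > threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and probabilities: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]
print("F1:", f1_from_probabilities(y_true, y_prob))
```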

Step 2. Setup columns roles

Here the roles setup marks the Survived column as the target and drops PassengerId together with the Name and Ticket columns, which were already used to build the expert features:

%%time

roles = {'target': 'Survived',
         'drop': ['PassengerId', 'Name', 'Ticket']}
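
A quick sanity check on the roles can catch typos in column names before a long training run. The helper feature_columns below is a hypothetical utility of our own, not a LightAutoML function; it only shows which columns remain as features once the target and dropped columns are excluded.

```python
def feature_columns(columns, roles):
    """Return the columns left as features after removing target and dropped ones."""
    target = roles["target"]
    dropped = list(roles.get("drop", []))
    missing = [c for c in [target] + dropped if c not in columns]
    if missing:
        raise KeyError("Columns named in roles but absent from data: {}".format(missing))
    return [c for c in columns if c != target and c not in dropped]


# Original Titanic train columns plus the expert features created earlier.
columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp",
           "Parch", "Ticket", "Fare", "Cabin", "Embarked",
           "Ticket_type", "Name_Words_Count", "Has_Cabin", "FamilySize",
           "CategoricalFare", "CategoricalAge", "Title"]
roles = {"target": "Survived", "drop": ["PassengerId", "Name", "Ticket"]}

features = feature_columns(columns, roles)
print(len(features), "feature columns:", features)
```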

Step 3. Create AutoML model from preset

TabularAutoML preset pipeline. Structure of our first model: a linear model, LightGBM with expert params, and LightGBM with Optuna-optimized params are combined by weighted averaging to create the final predictions

To develop our first LightAutoML model with the structure above, we use the TabularAutoML preset. In code, it looks like this:

automl = TabularAutoML(task = task, 
                       timeout = 600, # 600 seconds = 10 minutes
                       cpu_limit = 4, # Optimal for Kaggle kernels
                       general_params = {'use_algos': [['linear_l2', 
                                         'lgb', 'lgb_tuned']]})
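
The final step of this preset is a weighted average of the model predictions. The sketch below shows what such a blend computes for two passengers; the probabilities and weights are invented for illustration, and the way LightAutoML actually chooses its weights is not shown here.

```python
def weighted_average(predictions, weights):
    """Blend per-model probability lists into one list using the given weights."""
    total_weight = sum(weights[name] for name in predictions)
    n_rows = len(next(iter(predictions.values())))
    blended = []
    for row in range(n_rows):
        value = sum(weights[name] * predictions[name][row] for name in predictions)
        blended.append(value / total_weight)
    return blended


# Hypothetical probabilities from the three algorithms for two passengers.
predictions = {
    "linear_l2": [0.2, 0.8],
    "lgb": [0.4, 0.6],
    "lgb_tuned": [0.6, 1.0],
}
# Hypothetical blending weights.
weights = {"linear_l2": 0.2, "lgb": 0.3, "lgb_tuned": 0.5}

blended = weighted_average(predictions, weights)
print("Blended probabilities:", blended)
```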

Base algorithms currently available for use_algos in general_params:

Linear model (called 'linear_l2')
LightGBM model with expert params based on dataset ('lgb')
LightGBM with tuned params using Optuna ('lgb_tuned')
CatBoost model with expert params ('cb') and
CatBoost with params from Optuna ('cb_tuned')

As you can see, use_algos is a list of lists: this is the notation for creating ML pipelines with as many levels of algorithms as you want. For example, [['linear_l2', 'lgb', 'cb'], ['lgb_tuned', 'cb']] stands for 3 algorithms on the first level and 2 on the second. After the second level is fully trained, predictions from its 2 algorithms are combined by weighted averaging to construct the final prediction. The full set of parameters (not only general ones) that can be provided to customize TabularAutoML can be found in its YAML config.
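
To get a feel for the nested notation, the sketch below walks through a use_algos value level by level and rejects names that are not among the five listed in this article. The helper describe_use_algos and the KNOWN_ALGOS set are our own illustration, not part of the LightAutoML API, and the library may accept other names in other versions.

```python
# Algorithm names listed in this article.
KNOWN_ALGOS = {"linear_l2", "lgb", "lgb_tuned", "cb", "cb_tuned"}


def describe_use_algos(use_algos):
    """Return a list of (level_number, algorithms) pairs, validating the names."""
    summary = []
    for level, algos in enumerate(use_algos, start=1):
        unknown = [name for name in algos if name not in KNOWN_ALGOS]
        if unknown:
            raise ValueError("Unknown algorithms on level {}: {}".format(level, unknown))
        summary.append((level, list(algos)))
    return summary


two_level = [["linear_l2", "lgb", "cb"], ["lgb_tuned", "cb"]]
for level, algos in describe_use_algos(two_level):
    print("Level {}: {} algorithm(s): {}".format(level, len(algos), ", ".join(algos)))
```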

To fit our TabularAutoML preset model on the train part of the dataset, we use the code below:

oof_pred = automl.fit_predict(tr_data, roles = roles)

As a result of the fit_predict function, we receive Out-of-Fold (OOF for short) predictions. They are based on the inner CV of LightAutoML and can be used to calculate model performance metrics on the train data.
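
The key property of OOF predictions is that each row is predicted by a model that never saw that row during training. The toy sketch below shows this with a deliberately trivial "model" that predicts the mean target of its training folds; the fold assignment and the model are our own simplifications and do not reproduce the inner CV of LightAutoML.

```python
def mean_model_oof(y, n_folds=3):
    """Out-of-fold predictions from a model that predicts the training-fold mean."""
    oof = [None] * len(y)
    for fold in range(n_folds):
        # Train on every row that is NOT in the current fold.
        train_targets = [t for i, t in enumerate(y) if i % n_folds != fold]
        fold_prediction = sum(train_targets) / len(train_targets)
        # Predict only for the rows that were held out.
        for i in range(len(y)):
            if i % n_folds == fold:
                oof[i] = fold_prediction
    return oof


y = [0, 1, 1, 0, 1, 1]
oof = mean_model_oof(y)
print("OOF predictions:", oof)
```

Because every prediction comes from a model fitted without the corresponding row, scoring OOF predictions against the train labels gives a far less optimistic estimate than scoring a model on the same data it was fitted on.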

Step 4. Predict on validation data and check scores

Now we have a trained model and we want to receive predictions for the validation data:

valid_pred = automl.predict(valid_data)

And as we have the ground truth labels for these objects, let’s check how good we are:

def acc_score(y_true, y_pred):
    return accuracy_score(y_true, (y_pred > 0.5).astype(int))

print('OOF acc: {}'.format(acc_score(tr_data['Survived'].values, oof_pred.data[:, 0])))
print('VAL acc: {}'.format(acc_score(valid_data['Survived'].values, valid_pred.data[:, 0])))

The results are pretty good and stable — 84.4% accuracy for OOF and 83.2% for validation data in 2.5 minutes. But we want even more :)

Step 5. Create LightAutoML model with time utilization

Below we create a specific AutoML preset for TIMEOUT utilization (it tries to spend as much of the TIMEOUT budget as possible without exceeding it):

automl = TabularUtilizedAutoML(task = task, 
                       timeout = 600, # 600 seconds = 10 minutes
                       cpu_limit = 4, # Optimal for Kaggle kernels
                       general_params = {'use_algos': [['linear_l2', 
                                         'lgb', 'lgb_tuned']]})

It’s time to fit and get a better result:

oof_pred = automl.fit_predict(tr_data, roles = roles)

As you can see, the API is the same for both presets so you can easily check each of them without much coding.

Step 6. Predict on validation data and check scores for the utilized model

Prediction API for TabularUtilizedAutoML is also the same:

valid_pred = automl.predict(valid_data)

And now we check scores:

print('OOF acc: {}'.format(acc_score(tr_data['Survived'].values,      oof_pred.data[:, 0])))
print('VAL acc: {}'.format(acc_score(valid_data['Survived'].values, valid_pred.data[:, 0])))

Wow! 85.5% for OOF and 82.7% for validation data in less than 9 minutes. The validation score is a little bit lower here, but the validation part contains only 179 passengers. The OOF score increase is more meaningful, as it is calculated over 712 passengers.
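
A rough way to quantify why the smaller validation part is noisier is the binomial standard error of an accuracy estimate, sqrt(p * (1 - p) / n). The sketch below applies it to the scores above; it assumes predictions behave like independent trials, which is only an approximation (OOF predictions in particular are not fully independent).

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


valid_se = accuracy_standard_error(0.827, 179)  # validation part
oof_se = accuracy_standard_error(0.855, 712)    # OOF on the train part

print("Validation accuracy standard error: {:.3f}".format(valid_se))
print("OOF accuracy standard error: {:.3f}".format(oof_se))
```

Under these assumptions the validation estimate is uncertain by almost 3 percentage points, so a drop from 83.2% to 82.7% is well within noise, while the OOF estimate is roughly twice as precise.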

Step 7. Retrain selected model on the full dataset and predict for the real test

Now we know what model to use to receive good results on the Titanic dataset, so it’s time to retrain it on the whole dataset:

automl = TabularUtilizedAutoML(task = task, 
                       timeout = 600, # 600 seconds = 10 minutes
                       cpu_limit = 4, # Optimal for Kaggle kernels
                       general_params = {'use_algos': [['linear_l2', 
                                         'lgb', 'lgb_tuned']]})
oof_pred = automl.fit_predict(train_data, roles = roles)
test_pred = automl.predict(test_data)

Step 8. Prepare submission for Kaggle

As we have already loaded the sample submission file, the only thing we need to do is insert our predictions into it and save the file:

submission['Survived'] = (test_pred.data[:, 0] > 0.5).astype(int)
submission.to_csv('automl_utilized_600_f1_score.csv', index = False)
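
For reference, the resulting file has two columns, PassengerId and Survived, with one 0/1 label per test passenger. The standard-library sketch below builds the same layout from made-up probabilities; the helper build_submission_csv is a hypothetical name and is not needed when pandas is available.

```python
import csv
import io


def build_submission_csv(passenger_ids, probabilities, threshold=0.5):
    """Return submission CSV text with thresholded 0/1 predictions."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, probability in zip(passenger_ids, probabilities):
        writer.writerow([passenger_id, int(probability > threshold)])
    return buffer.getvalue()


# Made-up probabilities for the first two test passengers.
csv_text = build_submission_csv([892, 893], [0.12, 0.77])
print(csv_text)
```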

Step 9. Submit to Kaggle!!!

Our prepared submission scores 0.79665 on the Kaggle public leaderboard.

Conclusion

In this tutorial we created a step-by-step solution for the Titanic Survival competition using LightAutoML — an open-source framework for fast, automatic ML model creation.

Full tutorial code is available in Kaggle kernels here (and here for the super-short version of the solution); just give it a try on this dataset or any other one you have. It can surprise you :)

Stay tuned for more examples!