1. Introduction
Note! This is a beta release. Please report any issues or bugs you find to cloudsupport@techilatechnologies.com.
This document is intended for Techila Distributed Computing Engine (TDCE) End-Users who work in the Machine Learning (ML) field and use Python as their development language. The purpose of this document is to provide an overview of Techila AutoML, using code samples and example material that highlight different aspects of how Techila AutoML can be used.
If you are unfamiliar with the terminology or the operating principles of TDCE, information on these can be found in Introduction to Techila Distributed Computing Engine.
2. Requirements
This chapter contains a list of requirements that must be met in order to use Techila AutoML.
2.1. Operating System
Techila AutoML requires a Debian 10 operating system. If you are using Mac OS X, Microsoft Windows, or a different Linux distribution, you can run Techila AutoML in a Docker environment. Please see Running in Docker for more information.
Additionally, you will need to use Techila Workers that have a Linux operating system.
2.2. Python Packages and Versions
Techila AutoML supports Python 3.7.
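If you are unsure which version your environment uses, you can check it before installing anything. A minimal sketch:
import sys

# Techila AutoML supports Python 3.7; warn if the interpreter differs.
print("Python version:", sys.version.split()[0])
if sys.version_info[:2] != (3, 7):
    print("Warning: this interpreter is not Python 3.7")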
2.2.1. Installing the techila Python Package
Techila AutoML uses functionality from the techila Python package.
Note! If you plan on running your computations in Docker, you do not need to install the techila package to your own computer. Instead, the build scripts included in the Techila SDK will install the package automatically to a Docker image. Please see Running in Docker for more details.
Instructions for manually installing the techila package can be found in the Techila Python guide.
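If you are unsure whether the techila package is already installed, a quick check such as the one below can be used (a minimal sketch; it only tests that the module can be found on your Python path):
import importlib.util

# Look up the techila package without importing it.
if importlib.util.find_spec("techila") is None:
    print("techila package not found - please install it first")
else:
    print("techila package is available")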
2.2.2. Installing techila_ml Requirements
The package requirements of the techila_ml package are listed in the techila/lib/techila_ml/requirements.txt file.
Note! If you plan on running your computations in Docker, you do not need to install any of the required packages to your own computer. Instead, the build scripts included in the Techila SDK will install the package requirements automatically to a Docker image. Please see Running in Docker for more details.
Please note that installing the requirements may take up to 1 hour.
You can install the package requirements by running the following commands:
cd path/to/techila/lib/techila_ml
pip3 install -r requirements.txt
2.2.3. Installing the techila_ml Python Package
The techila_ml package is included in the Techila SDK, in the folder techila/lib/techila_ml.
Note! If you plan on running your computations in Docker, you do not need to install the techila_ml package to your own computer. Instead, the build scripts included in the Techila SDK will install the package automatically to a Docker image. Please see Running in Docker for more details.
You can install the techila_ml package by running the following commands:
cd path/to/techila/lib/techila_ml
python3 setup.py install --user
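After the installation has finished, you can verify that the package can be imported and that find_best_model is available (a minimal sketch):
# Import the entry point used in the examples below.
from techila_ml import find_best_model

print("techila_ml installed, find_best_model is callable:", callable(find_best_model))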
3. Example Notebooks
Python notebooks containing example material for Techila AutoML can be found in the following folder in the Techila SDK.
- techila/examples/techila_ml/notebooks
Please see the links below for HTML versions of these notebooks:
4. Example Python Scripts
This chapter contains code samples that illustrate how you can use Techila AutoML and the available features with different types of machine learning datasets. These examples can be found in the following folder in the Techila SDK.
- techila/examples/techila_ml/scripts
4.1. Supported Data Types
This example shows the syntax for defining the input data when using Techila AutoML.
import numpy as np
import pandas as pd
from techila_ml import find_best_model
from techila_ml.configs import OptionalPackages
OptionalPackages.use = False
# Number of Techila jobs
n_jobs = 2
# Number of iterations
n_iterations = 8
def load_data():
    # Function for generating dummy training data.
    # Arbitrary numpy data example
    X_train = np.random.random((500, 20))
    y_train = np.random.randint(0, 2, 500)
    X_validation = np.random.random((50, 20))
    y_validation = np.random.randint(0, 2, 50)
    # Alternatively, pandas data format could also be used. Uncomment to use.
    # X_train = pd.DataFrame(X_train)
    # y_train = pd.Series(y_train)
    # X_validation = pd.DataFrame(X_validation)
    # y_validation = pd.Series(y_validation)
    return {'X_train': X_train, 'y_train': y_train, 'X_validation': X_validation, 'y_validation': y_validation}
# Load the data locally.
data = load_data()
# Search for the best model in TDCE.
res = find_best_model(
    n_jobs,
    n_iterations,
    data,
    task='classification',
)
print(f"best score: {res['best_cv_score']}")
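The same data dictionary structure also works with data loaded from your own files. The sketch below is a hypothetical example: the file name mydata.csv and the column name target are placeholders, not part of the Techila SDK.
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV workflow: 'mydata.csv' and the 'target' column are placeholders.
df = pd.read_csv('mydata.csv')
X = df.drop(columns=['target'])
y = df['target']
X_train, X_validation, y_train, y_validation = train_test_split(X, y, random_state=0)
data = {'X_train': X_train, 'y_train': y_train, 'X_validation': X_validation, 'y_validation': y_validation}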
4.2. MNIST
This example shows how to use Techila AutoML to find the best model for the MNIST data set.
from techila_ml import find_best_model
import pandas as pd
from techila_ml.configs import OptionalPackages
OptionalPackages.use = False
# Number of Techila Jobs
n_jobs = 8
# Number of iterations
n_iterations = 16
def load_data():
    from keras.datasets import mnist
    (X_train, y_train), (X_validation, y_validation) = mnist.load_data()
    # Convert to pandas format.
    X_train = pd.DataFrame(X_train.reshape(-1, X_train[0].size))
    y_train = pd.Series(y_train)
    X_validation = pd.DataFrame(X_validation.reshape(-1, X_validation[0].size))
    y_validation = pd.Series(y_validation)
    return {'X_train': X_train, 'y_train': y_train, 'X_validation': X_validation, 'y_validation': y_validation}
# Load the data locally.
data = load_data()
# Search for the best model in TDCE.
res = find_best_model(
    n_jobs,
    n_iterations,
    data,
    task='classification',
    optimization={
        'optimizer': 'skopt',
    }
)
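As in the previous example, the best score found during the search can be printed from the result dictionary:
print(f"best score: {res['best_cv_score']}")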
4.3. IRIS Dataset - Autostopping
This example shows how to use the autostopping feature in Techila AutoML to automatically stop the optimization process when the model’s performance has not improved during the most recent iterations.
from techila_ml import find_best_model
from techila_ml.configs import OptionalPackages
OptionalPackages.use = False
# Number of Techila jobs
n_jobs = 20
# Number of iterations
n_iterations = 1600
def load_data():
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    X, y = load_iris(return_X_y=True)
    X_train, X_validation, y_train, y_validation = train_test_split(X, y, random_state=0)
    return {'X_train': X_train, 'y_train': y_train, 'X_validation': X_validation, 'y_validation': y_validation}
# Load the data locally.
data = load_data()
# Search for the best model in TDCE.
res = find_best_model(
    n_jobs,
    n_iterations,
    data,
    task='classification',
    optimization={
        'optimizer': 'skopt',
        'study_auto_stopping': True,
        'auto_stopping': False,
    }
)
4.4. Diabetes Dataset - Regression
This example shows how you can apply Techila AutoML to solve a regression problem.
from techila_ml import find_best_model
from techila_ml.configs import OptionalPackages
OptionalPackages.use = False
# Number of Techila jobs
n_jobs = 20
# Number of iterations
n_iterations = 160
def load_data():
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split
    X, y = load_diabetes(return_X_y=True)
    X_train, X_validation, y_train, y_validation = train_test_split(X, y, random_state=0)
    return {'X_train': X_train, 'y_train': y_train, 'X_validation': X_validation, 'y_validation': y_validation}
# Load the data locally.
data = load_data()
# Search for the best model in TDCE.
res = find_best_model(
    n_jobs,
    n_iterations,
    data,
    task='regression',
    optimization={
        'optimizer': 'skopt',
    }
)
print(f"best score: {res['best_cv_score']}")
4.5. IRIS Dataset - Random Search
This example shows how you can use a random search instead of skopt when using Techila AutoML.
from techila_ml import find_best_model
from techila_ml.configs import OptionalPackages
OptionalPackages.use = False
# Number of Techila jobs
n_jobs = 20
# Number of iterations
n_iterations = 100
def load_data():
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    X, y = load_iris(return_X_y=True)
    X_train, X_validation, y_train, y_validation = train_test_split(X, y, random_state=0)
    return {'X_train': X_train, 'y_train': y_train, 'X_validation': X_validation, 'y_validation': y_validation}
# Load the data locally.
data = load_data()
# Search for the best model in TDCE.
res = find_best_model(
    n_jobs,
    n_iterations,
    data,
    task='classification',
    optimization={
        'optimizer': 'random',
    }
)
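If you want to compare the random search against skopt on the same data, the two runs can be wrapped in a simple loop (a minimal sketch based on the examples above):
# Run the same search with both optimizers and print the resulting scores.
for optimizer in ('random', 'skopt'):
    res = find_best_model(
        n_jobs,
        n_iterations,
        data,
        task='classification',
        optimization={
            'optimizer': optimizer,
        }
    )
    print(f"{optimizer}: best score {res['best_cv_score']}")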
5. Running in Docker
This chapter contains examples of how you can run Techila AutoML in Docker.
Before continuing, please make sure that you have Docker installed on your computer.
The flow listed below describes how you can use Docker to run Techila AutoML. Using the Docker approach minimizes the differences between your local environment and the TDCE environment. This can be useful in situations where differences (e.g. in package versions) between your local Python development environment and the TDCE execution environment are causing problems.
- Download the TechilaSDK.zip to your own computer from the Techila Configuration Wizard.
- Extract TechilaSDK.zip to your own computer. Make a mental note of where you extracted it. The example flow below assumes that the TechilaSDK.zip was extracted to /home/user/techila. This directory should contain files called techila_settings.ini and admin.jks.
- Copy the TechilaSDK.zip from your own computer, from where you downloaded it, to the current working directory. After copying the file, it should be located in the same folder as the Dockerfile. The TechilaSDK.zip file will be included in the image in the next step (excluding credentials).
- Modify the yourimagenamehere parameter below to have a descriptive name for your image.

  sudo docker build -f Dockerfile -t yourimagenamehere .

  After modifying the command, run it. This will create a Docker image that can be used to run the Techila SDK.
- Next you will need to create a Bundle from the container image you just created. This can be done using the command shown below. Before running the command, please update the following values:
  - /tmp/dockertmp - Modify this to point to a directory on your computer that can be used to store the Docker image.
  - /home/user/techila - Modify this to point to the directory where you extracted the TechilaSDK.zip file on your computer. This is the directory that contains the techila_settings.ini and admin.jks files.
  - yourimagenamehere - Modify this to match your image name (the one you defined earlier, when executing the docker build command).

  Modify the command shown below with the values you are using and run the command to create the Bundle.

  sudo docker run -it -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/dockertmp:/tmp -v /home/user/techila:/techila yourimagenamehere /usr/bin/python3 py/createcontainerbundle.py

  Creating the Bundle may take several minutes (up to 30, depending on your network speed).
- After the Bundle has been created, you can run an example that is included in the Techila SDK to verify that everything works:
sudo docker run -it -v /home/user/techila:/techila -e TECHILA_ML_DOCKER=true yourimagenamehere /usr/bin/python3 /techila/examples/techila_ml/scripts/run_datatypes.py
In addition to the TECHILA_ML_DOCKER environment variable, Docker usage can also be specified with the docker parameter of find_best_model (docker=False|True|<bundlename>).
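For example, a search could be launched with the docker parameter enabled as sketched below (assuming data, n_jobs and n_iterations have been defined as in the earlier examples):
# Sketch: enable Docker for the computations via the docker parameter.
res = find_best_model(
    n_jobs,
    n_iterations,
    data,
    task='classification',
    docker=True,
)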