
From Machine Learning Model to Production

This project introduces the procedure to take a machine learning model from development to deployment in business production, so that non-technical end users can use the model to make predictions. For that purpose, the next three projects are created:

MODEL DEVELOPMENT --> project focused on assessing the input dataset and developing a machine learning model with high learning power to predict new data.

NEW DATA PREDICTION --> project focused on using the developed model to make predictions through a friendly graphic user interface intended for non-technical end users.

ONLINE SERVER MANAGEMENT --> project focused on managing an online server to accept end user HTTP requests for new data prediction using the developed model.


The dataset used in the project is from Kaggle and available here. It corresponds to a dataset with almost 70k samples, 12 features and a binary target. Each sample corresponds to a patient, and the binary target indicates whether the patient has a cardiovascular disease or not. Note the project does not focus on maximizing model performance, but on creating an acceptable operating machine learning model and introducing the procedure to deploy it into production.

As a reference, the work is done in Python, using the FLASK framework to manage the online server, the SCIKIT LEARN framework to develop the machine learning model and the TKINTER framework to develop the graphic user interface for end user predictions.

The project work is split into the next sections.

SECTION 1: MODEL DEVELOPMENT AND SERIALIZATION
SECTION 2: LOCAL PREDICTION
SECTION 3: ONLINE SERVER PREDICTION
SECTION 4: DOCKER CONTAINER INTRODUCTION
SECTION 1: MODEL DEVELOPMENT AND SERIALIZATION
This section has the purpose of creating an acceptable operating machine learning model to predict whether the patients have a cardiovascular disease or not. Following that purpose, an exploratory data analysis (EDA) is performed using the code below to get more familiar with the input dataset.
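As an illustration, a minimal EDA sketch in line with those steps is shown next; the file name cardio_train.csv and the ';' separator are assumptions about the Kaggle export, not details taken from the original code.

import pandas as pd

# Load the Kaggle cardiovascular dataset (file name and separator are assumptions)
df = pd.read_csv('cardio_train.csv', sep=';')

# Basic exploratory data analysis
print(df.shape)                                   # samples and columns
df.info()                                         # types and non-null counts
print(df.isna().sum())                            # missing values per feature
print(df.describe())                              # statistical summary
print(df['cardio'].value_counts(normalize=True))  # target class balance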
The dataset does not have NA values, so imputation is not required. If it did, a simple strategy imputing the mean for numerical features and the most frequent category for categorical features might be applied.
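If that fallback were needed, a minimal sketch with SCIKIT LEARN imputers could look like this (hypothetical, not part of the original code):

from sklearn.impute import SimpleImputer

# Hypothetical fallback: mean for numerical columns, most frequent value for categorical ones
num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')
# e.g. df[['height', 'weight']] = num_imputer.fit_transform(df[['height', 'weight']])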
The dataset has 69301 samples with 12 features and 1 target class. All data is defined as numerical with type int64 and float64, but according to the description some features are categorical. Note that having a categorical feature with numerical categories does not make the feature numerical, and it needs to be treated as categorical. Therefore, type conversion will be performed to make sure each feature is treated as needed.
The description of each feature is depicted below and the main highlights are captured next.

- ID: it is just a reference number for each patient, so it will be removed from the input dataset to keep the model as simple as possible while the learning power is not affected.
- AGE: it is a numerical feature, and it looks good based on statistical information.
- GENDER: it is a binary categorical feature.
- HEIGHT: it is a numerical feature, and it looks good based on statistical information.
- WEIGHT: it is a numerical feature, and it looks good based on statistical information.
- AP_HI: it is a numerical feature, and it has outliers in both extremes. Therefore, the outliers will be clipped according to the 1.5 x IQR rule.
- AP_LO: it is a numerical feature, and it has outliers in both extremes. Therefore, the outliers will be clipped according to the 1.5 x IQR rule.
- CHOLESTEROL: it is a categorical feature with three categories.
- GLUC: it is a categorical feature with three categories.
- SMOKE: it is a binary categorical feature.
- ALCO: it is a binary categorical feature.
- ACTIVE: it is a binary categorical feature.
- CARDIO: it is the binary target, which is uniformly distributed between the two classes.
As described earlier, some preprocessing is needed on the input dataset, such as applying the proper type conversion or clipping outliers. That work is performed with the code below.
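A minimal sketch of that preprocessing is shown next, assuming the lowercase column names used in the Kaggle dataset:

# Remove the patient reference number
df = df.drop(columns=['id'])

# Treat the numerically coded features as categorical
categorical = ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']
df[categorical] = df[categorical].astype('category')

# Clip blood pressure outliers according to the 1.5 x IQR rule
for col in ['ap_hi', 'ap_lo']:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)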
Then, the input dataset is split between training and testing with an 80/20 proportion.
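A sketch of that split is shown below; the stratification and random seed are assumptions, not taken from the original code.

from sklearn.model_selection import train_test_split

X = df.drop(columns=['cardio'])
y = df['cardio']

# 80/20 split, stratified on the balanced binary target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)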
The next step is to create the model, and in order to show a more realistic example, the model applies one hot encoding to categorical features and standardization to numerical features. Additionally, prior to that, it applies a logarithm transformation to the features AGE, AP_HI and AP_LO. Therefore, the model is defined with a three step Pipeline as follows:

Step 1 --> Logarithm transformation in AGE, AP_HI and AP_LO. The rest of the features are unchanged. Note that as each feature receives a different approach, a ColumnTransformer class is used.

Step 2 --> Standardization in numerical features and one hot encoding in categorical features. Note that as each feature receives a different approach, a ColumnTransformer class is also used. Additionally, one one hot feature is dropped for binary categorical features to avoid model collinearities, since it is not helpful to have a feature and its exact opposite because they really provide the same information.

Step 3 --> Logistic regression modeling
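
A minimal sketch of that three step Pipeline is shown next. It assumes the lowercase column names from the Kaggle dataset, uses np.log1p as the logarithm transformation and drop='if_binary' to drop one one hot column only for binary features; the exact choices in the original code may differ.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

log_features = ['age', 'ap_hi', 'ap_lo']
numerical = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']
categorical = ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']

# Step 1: logarithm transformation only in AGE, AP_HI and AP_LO, rest unchanged
log_step = ColumnTransformer(
    [('log', FunctionTransformer(np.log1p, feature_names_out='one-to-one'), log_features)],
    remainder='passthrough', verbose_feature_names_out=False)
log_step.set_output(transform='pandas')  # keep column names for the next step

# Step 2: standardization for numerical features, one hot encoding for categorical ones
encode_step = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('cat', OneHotEncoder(drop='if_binary'), categorical)])

# Step 3: logistic regression modeling
model = Pipeline([
    ('log', log_step),
    ('encode', encode_step),
    ('clf', LogisticRegression(max_iter=1000))])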
Once the three step pipeline is defined, it is fitted with the training data set and tested against the testing data set. The results show an accuracy around 0.73 in both cases, which validates the model.
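A sketch of that fit and evaluation step:

# Fit on the training split and check the accuracy on both splits
model.fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))
print('Test accuracy:', model.score(X_test, y_test))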
After the model is validated, the next step is to serialize it, generating a PKL file with the already trained model, which will be used in the model deployment stage.
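A minimal serialization sketch using joblib (the file name cardio_model.pkl is an assumption):

import joblib

# Serialize the fitted pipeline into a PKL file for the deployment stage
joblib.dump(model, 'cardio_model.pkl')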
To validate that the model deployment is properly implemented, the three patient examples below are stored to be used as a sanity check in the next section.
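As an illustration only (which three patients were stored is not detailed here), the sanity check material could be prepared as follows:

# Keep three test patients and their development predictions as a sanity check
examples = X_test.iloc[:3]
examples.to_csv('examples.csv', index=False)
print(model.predict(examples))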
SECTION 2: LOCAL PREDICTION
The developed model might be intended to be used by a medical professional who is not familiar with software development or machine learning. Therefore, all the technical design should be masked under a friendly graphic user interface allowing the medical professional to introduce the patient data and make a prediction. For the current project, a simple interface like the one depicted below was created with the TKINTER framework.
In the graphic user interface, the end user can introduce the parameters manually, but for the sake of simplicity an AUTOFILL button was included to automatically insert the parameters of the three stored examples.
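A hypothetical minimal version of such an interface is sketched next; the widget layout, labels and autofill values are placeholders, not the original design.

import tkinter as tk

FEATURES = ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
            'cholesterol', 'gluc', 'smoke', 'alco', 'active']

root = tk.Tk()
root.title('Cardiovascular disease prediction')

# One label and entry box per feature
entries = {}
for row, name in enumerate(FEATURES):
    tk.Label(root, text=name.upper()).grid(row=row, column=0, sticky='w')
    entries[name] = tk.Entry(root)
    entries[name].grid(row=row, column=1)

def autofill():
    # Placeholder values standing in for one of the stored examples
    example = {'age': 18300, 'gender': 2, 'height': 168, 'weight': 62.0,
               'ap_hi': 110, 'ap_lo': 80, 'cholesterol': 1, 'gluc': 1,
               'smoke': 0, 'alco': 0, 'active': 1}
    for name, value in example.items():
        entries[name].delete(0, tk.END)
        entries[name].insert(0, str(value))

tk.Button(root, text='AUTOFILL', command=autofill).grid(row=len(FEATURES), column=0)
# The PREDICT action is wired up in the prediction sketches further below
tk.Button(root, text='PREDICT', command=lambda: None).grid(row=len(FEATURES), column=1)
root.mainloop()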
The key point of the graphic user interface is the action applied after pressing the PREDICT button. In this first approach, the model is loaded locally in the end user computer and a prediction is performed with the input data introduced by the user, as depicted below.
Then, the prediction is translated to string text to allow the end user to understand the prediction in an easy way.
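A minimal sketch of that local prediction action, assuming the hypothetical cardio_model.pkl file and the FEATURES list from the earlier sketches:

import joblib
import pandas as pd

# Local approach: the serialized pipeline is loaded in the end user computer
local_model = joblib.load('cardio_model.pkl')

def predict_local(user_values):
    # user_values: dict of feature name -> numeric value read from the interface entries
    sample = pd.DataFrame([user_values], columns=FEATURES)
    prediction = local_model.predict(sample)[0]
    # Translate the numerical prediction into an easy to read message
    if prediction == 1:
        return 'The patient is predicted to have a cardiovascular disease'
    return 'The patient is predicted to be healthy'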
As a reference, the example below shows a prediction for a healthy patient.
As a final step, the local prediction is performed for the three stored examples to check that the model serialization was properly done. As demonstrated below, all three predictions match the predictions made during development.
EXAMPLE 1 - MODEL SERIALIZATION PREDICTION
EXAMPLE 2 - MODEL SERIALIZATION PREDICTION
EXAMPLE 3 - MODEL SERIALIZATION PREDICTION
SECTION 3: ONLINE SERVER PREDICTION
In the previous section a local prediction approach was defined, storing the model in the end user computer. However, that approach is highly inefficient and unrealistic in the professional world. The local approach requires the end user to have a copy of the serialized model and to manage in the local computer all the libraries, frameworks and dependencies required for the model to make the prediction. Each new user would need to configure the local computer according to the model needs prior to making a prediction. That process is inefficient and might lead to potential failures, since a change in a dependency version might cause the model not to work properly.

In light of the above constraints, a better approach is to make predictions from an online server located anywhere in the world through HTTP requests. Therefore, the end user does not need to configure the local computer with the required model dependencies. Instead, the end user only needs an internet connection to launch a server request to make the prediction, and everything related to the model prediction is managed in the online server. Moreover, this online server can be used for multiple applications, not just for the model prediction under discussion.

The machine hosting the online server should run the code below to launch the server. Note the FLASK framework is used for that purpose. First of all, a FLASK application is generated and the serialized model is loaded.
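A sketch of that start-up code; the file name cardio_model.pkl is the hypothetical name used in the earlier serialization sketch.

from flask import Flask, request, jsonify
import joblib
import pandas as pd

# Create the FLASK application and load the serialized model once at start-up
app = Flask(__name__)
model = joblib.load('cardio_model.pkl')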
Then, an HTTP POST method function is defined to allow the end user to share the input data with the online server and receive the response after making the prediction. Note that data is shared over HTTP in JSON format, so this function transforms the input data into a PANDAS dataframe to make the prediction, and transforms the result back to JSON to return the response.
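A minimal sketch of that POST function; the /predict route name is an assumption.

@app.route('/predict', methods=['POST'])
def predict():
    # The end user shares the patient data as JSON in the request body
    data = request.get_json()
    # Rebuild a one-row PANDAS dataframe (JSON keys must match the training feature names)
    sample = pd.DataFrame([data])
    prediction = int(model.predict(sample)[0])
    # Return the prediction to the end user as JSON
    return jsonify({'prediction': prediction})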
Finally, the online server must be enabled. In the example below, the server is enabled locally at port 8000 only as a reference, but in practice the machine hosting the online server can be located anywhere in the world.
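A sketch of that last step:

if __name__ == '__main__':
    # Local reference only: in practice host and port depend on where the server is deployed
    app.run(host='0.0.0.0', port=8000)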
Once the online server is enabled, the server continuously waits to receive prediction requests launched by the end users.
Therefore, the action after pressing the PREDICT button in the graphic user interface must be modified as follows to allow the end user to make a prediction request. The code below transforms the input user data to JSON format and launches a predict HTTP request to the online server, sharing the input user JSON data. Once the online server receives the request, it makes the prediction and returns the response.
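A minimal sketch of that modified PREDICT action, assuming the hypothetical /predict route above and a placeholder server address:

import requests

SERVER_URL = 'http://127.0.0.1:8000/predict'  # placeholder address of the online server

def predict_remote(user_values):
    # Send the interface input data as JSON and read the prediction from the response
    response = requests.post(SERVER_URL, json=user_values)
    prediction = response.json()['prediction']
    if prediction == 1:
        return 'The patient is predicted to have a cardiovascular disease'
    return 'The patient is predicted to be healthy'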
The requests are tracked by the online server, which generates a log entry each time a request is received, as depicted below.
SECTION 4: DOCKER CONTAINER INTRODUCTION
The previous section defines an online server prediction approach, but it still requires managing the online server manually to match all the libraries, frameworks, dependencies and versions the model requires. That might be a hard task.

To make the process more straightforward, the container concept might be applied. A container is a running instance of an application image or, in simple words, a space in the machine hosting the online server with all the dependencies required to run the application already set up. There are platforms such as Docker which handle that process in a very optimized and efficient way, and a single computer can host multiple Docker containers. This is the most common approach in the machine learning industry.

Docker needs as input a Dockerfile to build an application image and run a specific container for that application. The Dockerfile is a recipe defining what is needed to run the current application. Among other things, the Dockerfile might include the baseline image, image metadata such as maintainer and version, run commands to install software and set up the container environment, the machine learning model itself, URLs to download and extract files, working directories for subsequent instructions, environment variables, entrypoint executables to constantly run, health checks...
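As an illustration of those ingredients, a hypothetical Dockerfile for this application could look like the sketch below; the base image, file names and port are assumptions, not the project's actual recipe.

# Baseline image and metadata
FROM python:3.10-slim
LABEL maintainer="example@example.com" version="1.0"

# Working directory for the subsequent instructions
WORKDIR /app

# Install the software dependencies required by the model
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the serialized model and the server code into the image
COPY cardio_model.pkl server.py ./

# Environment variables and exposed port
ENV PORT=8000
EXPOSE 8000

# Entrypoint executable that keeps the server running
ENTRYPOINT ["python", "server.py"]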

The container creation process is simplified in the picture below.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero