Developing a Machine Learning system Part 3


This is part 3 of the series “Developing a machine learning system”.

The goal of every article in this series is to reduce the gap between theoretical and applied machine learning. If you have been following the Learncodeonline machine learning blogs in the order they were published, you are ready to build a machine learning system from scratch. The goal of this series is to build a working machine learning system that can classify three types of flowers. For simplicity, we use one of the most popular toy datasets, the iris flower dataset.

The goal of this article is to develop a fully working machine learning system that predicts a class label given an input vector of features. We take a case-study approach and compare the performance of two classical machine learning algorithms: k-nearest neighbors (knn) and logistic regression.

After reading this article you will know:

  • how to create a working algorithm using the sklearn library
  • how to connect all parts of the system
  • how you, as a beginner, can deploy a machine learning system

Before proceeding further it is recommended to follow this series from part 1.

Importing all required libraries

It is good practice to import all required libraries at the beginning rather than in the middle of the code.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score  # sklearn.cross_validation was removed in sklearn 0.20
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

The first two libraries have already been discussed in previous parts; the only one left is sklearn (scikit-learn). It is one of the most popular libraries for applying machine learning algorithms. It consists of pre-built models that we can use on our own data rather than coding everything from scratch each time we work with a new dataset. In real-world model development, coding everything from scratch is rarely practical.

Reading and understanding how our data is represented

iris = pd.read_csv('iris.csv')
print(iris.head())  # show the first rows to see how the data is represented

Expected output,

Everything looks right except the representation of the species label, which is stored as a string. In this article we convert every string label into a vector using one-hot encoding, so that predictions and labels can be compared as numeric vectors.

Converting class labels to one-hot-encoding

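specieslabel = iris['species']  # assumption: the string labels live in the 'species' column
lb = preprocessing.LabelBinarizer()
lb.fit(['setosa', 'virginica', 'versicolor'])  # learn the three class names
Y = lb.transform(specieslabel)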

This changes our string labels to one-hot-encoded vectors, e.g. [0,0,1]. LabelBinarizer orders the classes alphabetically, so [0,0,1] stands for virginica.
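To make the encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding does. The helper one_hot is hypothetical and written only for illustration; it is not part of sklearn. It sorts the class names alphabetically, which is also the order in which LabelBinarizer stores them in its classes_ attribute.

```python
def one_hot(labels, classes):
    """Encode each label as a 0/1 vector with a single 1 at its class position."""
    ordered = sorted(classes)  # alphabetical class order (assumption for this sketch)
    position = {name: i for i, name in enumerate(ordered)}
    encoded = []
    for label in labels:
        row = [0] * len(ordered)
        row[position[label]] = 1
        encoded.append(row)
    return ordered, encoded


classes, vectors = one_hot(
    ['setosa', 'virginica', 'versicolor', 'setosa'],
    ['setosa', 'virginica', 'versicolor'],
)
print(classes)  # ['setosa', 'versicolor', 'virginica']
print(vectors)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each row contains exactly one 1, and its position identifies the class.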

Split train and test data

This step is necessary because, after training an algorithm on part of the data, we need data that the algorithm has not seen before to check its performance on real data. We use the usual split ratio of 70% train and 30% test.

X = iris.drop(['species'], axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
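The idea behind the split can be sketched in plain Python: shuffle the row indices with a fixed seed, then hold out 30% of them. The helper split_indices is hypothetical, and the 150-row size is an assumption based on the standard iris dataset. It does not reproduce the exact shuffle that train_test_split performs, only the principle.

```python
import random


def split_indices(n_samples, test_size=0.3, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(test_size * n_samples)
    return indices[n_test:], indices[:n_test]  # (train, test)


train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))  # 105 45
```

Fixing the seed plays the same role as random_state=0 above: running the code again gives the same train and test rows.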

Implementing algorithms

This section is divided into two parts: part 1 covers the implementation of knn, and part 2 covers the implementation of logistic regression.

cv_score = []
k = []
for i in range(1, 30, 2):  # odd values of k from 1 to 29
    knn = KNeighborsClassifier(n_neighbors=i)
    scores = cross_val_score(knn, X_train, Y_train, cv=10, scoring='accuracy')
    cv_score.append(scores.mean())  # mean accuracy over the 10 folds
    k.append(i)
best_k = k[cv_score.index(max(cv_score))]
print("best k for knn ", best_k)

Expected output: “best k for knn 5”

The flow of this snippet is quite simple. First we declare two lists:

cv_score: stores the mean cross-validation score of the model at each value of i. The function cross_val_score is sklearn's implementation of k-fold cross-validation; its cv argument sets the number of folds (10 here).

k: stores the different values of k that were tried, so that we can select the best value of k for knn.

Then we start a loop to check the model's performance for different values of k, going from 1 up to (but not including) 30 in steps of 2. After the loop completes, we extract the k with the best accuracy score.
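What k-fold cross-validation does with the training rows can be sketched in plain Python. The generator k_fold_indices is a hypothetical helper, and 105 training rows is an assumption (70% of a 150-row dataset). It only illustrates how the data is partitioned; sklearn's own splitter may assign rows to folds differently.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists for each of n_folds contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=105, n_folds=10))
print([len(validation) for _, validation in folds])
# [11, 11, 11, 11, 11, 10, 10, 10, 10, 10]
```

For each fold, the model is fitted on the train indices and scored on the validation indices; the mean of the ten scores is the value stored for that k.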

Building final knn model

knn = KNeighborsClassifier(n_neighbors=best_k)  # use the k selected by cross-validation
knn.fit(X_train, Y_train)
y_pred = knn.predict(X_test)
print("Accuracy on test data ", accuracy_score(Y_test, y_pred)*100, "%")

Expected output: “Accuracy on test data 95%”

Logistic regression

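For simplicity, we do not tune the regularization in this article and keep sklearn's defaults, so there is no need for a cross-validation dataset. Note that LogisticRegression() is not unregularized by default: it applies L2 regularization with C=1.0.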

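# LogisticRegression expects 1-D class labels, so turn each one-hot row back into a class index
logisticregression = LogisticRegression()
logisticregression.fit(X_train, Y_train.argmax(axis=1))
y_pred = logisticregression.predict(X_test)
print("Accuracy on test data ", accuracy_score(Y_test.argmax(axis=1), y_pred)*100, "%")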

Expected output: “Accuracy on test data 96%”

As this model does not involve any hyperparameter selection, we can skip cross-validation, fit the model on the training data with default parameters, and then predict the labels of the test data.
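For readers who do not have iris.csv at hand, the comparison can be sketched end to end as follows. This is a sketch under stated assumptions, not the exact code used for the results above: it loads the copy of iris bundled with sklearn, keeps the labels as plain integers instead of one-hot vectors, sets n_neighbors=5 to match the value reported earlier, and raises max_iter so the logistic regression solver has room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use the copy of iris bundled with sklearn instead of iris.csv.
X, y = load_iris(return_X_y=True)  # y holds integer class labels 0, 1, 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    # max_iter is raised from its default only to give the solver room to converge.
    "logistic regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # train on the 70% split
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, "accuracy on test data:", round(results[name] * 100, 2), "%")
```

Because both models are evaluated on the same held-out rows, their accuracies can be compared directly.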

This ends our model-building procedure. Keep in mind that iris is a toy dataset: in real-life predictive modeling, an accuracy of over 90% is a big achievement.

It is recommended that you implement this code on your own machine, as the key to learning machine learning is implementing models and understanding data. Getting good at machine learning requires a great amount of hard work and dedication, as this branch of computer science draws heavily on applied mathematics and on concepts from algorithms, among other areas. This was the last article of this series; I hope you enjoyed reading it.


About the author


I write blogs about Machine Learning and data science

By abhinavsinghml
