Analyzing IRIS Dataset With Keras and TensorFlow

We will analyze the famous IRIS dataset.
Before we start, here are the basic steps that any typical Machine Learning based Data Analysis workflow consists of (the workflow diagram walks through Phase 1: Data Preparation, Phase 2: Model Building and Model Cross Validation, Phase 3: Model Selection, and Phase 4: Model Training and Evaluation).

Most of our time will be spent in Phases 1 and 2.

The full source code is hosted on GitHub.

Let's walk through the process with the IRIS dataset.

Phase 1 : Data Preparation
Since the IRIS dataset comes prepackaged with sklearn, we are spared the trouble of downloading it. Furthermore, the dataset is already cleaned and labeled, so we just need to put the data in the format we will use in the application.

First, let us get all the imports out of the way. These will be used at various points in the code.

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import LabelBinarizer
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

Next we begin the data preparation phase.
Since the IRIS dataset involves classifying flowers into three kinds: setosa, versicolor and virginica,
it behooves us to one-hot encode the target. The dataset uses 0, 1 and 2 for the respective classes; we will convert these into one-hot encoded vectors.
The value of "seed" below will be used later as random_state.
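
To see what the encoder actually produces, here is a minimal sketch (using a throwaway demo_encoder, not part of the main code) of the three class labels mapped to one-hot vectors:

from sklearn.preprocessing import LabelBinarizer

demo_encoder = LabelBinarizer()
print(demo_encoder.fit_transform([0, 1, 2]))
# [[1 0 0]   <- setosa (0)
#  [0 1 0]   <- versicolor (1)
#  [0 0 1]]  <- virginica (2)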

Step 1: Create Dataframes for features and target

encoder = LabelBinarizer()
seed = 42

iris = datasets.load_iris()
iris_data_df = pd.DataFrame(data=iris.data, columns=iris.feature_names,
                       dtype=np.float32)
target = encoder.fit_transform(iris.target)
iris_target_df = pd.DataFrame(data=target, columns=iris.target_names) 

Step 2: Create training and testing datasets

X_train,X_test,y_train,y_test = train_test_split(iris_data_df,
                                                 iris_target_df,
                                                 test_size=0.30,
                                                 random_state=seed)

Step 3: Feature scaling.

ML algorithms tend to perform best when all dataset features are on the same scale. For neural networks in particular, we use MinMaxScaler with a range between 0 and 1.
Also, since the MinMaxScaler transformation makes us lose the column and index labels, we need to recreate the dataframes for the scaled features.
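
Under the hood, MinMaxScaler applies x_scaled = (x - x_min) / (x_max - x_min) to each column. A minimal sketch of that formula with made-up numbers:

import numpy as np

col = np.array([4.3, 5.8, 7.9])   # made-up feature column
print((col - col.min()) / (col.max() - col.min()))
# [0.         0.41666667 1.        ]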

scaler = MinMaxScaler(feature_range=(0,1))

X_train = pd.DataFrame(scaler.fit_transform(X_train),
                               columns=X_train.columns,
                               index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                           columns=X_test.columns,
                           index=X_test.index)
Optional micro detail:
Note that we use fit_transform on X_train and transform on X_test.
It is very important that we don’t fit on test data.
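
To make the point concrete, here is a minimal sketch with made-up values: a scaler fitted only on the training data may map an unseen test value outside [0, 1], which is expected and harmless; refitting on the test set would instead silently change the scale and leak information about the test distribution into preprocessing.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

demo_scaler = MinMaxScaler(feature_range=(0, 1))
demo_scaler.fit(np.array([[4.0], [7.0]]))          # made-up training range
print(demo_scaler.transform(np.array([[8.5]])))    # [[1.5]] -- outside [0, 1], and that's OK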

Phase 2 : Model Building and Model Cross Validation

In this phase we build various models and use cross validation to see how they perform.

Since we wish to keep things simple, we will build just one model: a DNN (Dense Neural Network).

Optional micro detail:
In the diagram, each line of [Model Building] + [Model Cross Validation] corresponds to a model and its cross validation. Since we will be building only one model, DNN, we will have just one line in the Phase-2 of the diagram.

Step 1: Build model

Since we are building a classifier (to classify whether a flower is setosa, versicolor or virginica), we will use a softmax activation function in the output layer.

def model():
    """build the Keras model callback"""
    model = Sequential()
    model.add(Dense(8, input_dim=4, activation='tanh', name='layer_1'))
    model.add(Dense(10, activation='tanh', name='layer_2'))
    model.add(Dense(10, activation='tanh', name='layer_3'))
    model.add(Dense(3, activation='softmax', name='output_layer'))
    
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=['accuracy'])
    return model
Optional micro detail:

  1. The input layer (layer_1) has 4 inputs corresponding to the 4 feature columns in X_train. The output_layer has 3 outputs for the three classes in the target.
  2. You can try different activation functions in layer_1, layer_2 and layer_3, such as sigmoid or relu (see the sketch after this list).
    Sometimes the choice of activation function affects the results a great deal. In most cases, you will be using non-linear activation functions like tanh, sigmoid or relu.
  3. metrics=['accuracy']: Here the default accuracy is used, which is categorical_accuracy for this model. Some other accuracy settings are binary_accuracy, sparse_categorical_accuracy etc. The default is the most appropriate here.
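
As an example of point 2, here is a minimal sketch of a relu-based variant (the name model_relu is ours, not part of the original code) that you could pass as build_fn in the next step instead of model:

def model_relu():
    """Hypothetical relu variant of model() for experimentation."""
    m = Sequential()
    m.add(Dense(8, input_dim=4, activation='relu', name='layer_1'))
    m.add(Dense(10, activation='relu', name='layer_2'))
    m.add(Dense(10, activation='relu', name='layer_3'))
    m.add(Dense(3, activation='softmax', name='output_layer'))
    m.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=['accuracy'])
    return m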

Step 2: Create estimator

estimator = KerasClassifier(
        build_fn=model,
        epochs=200, batch_size=20,
        verbose=2)
Optional micro detail:
Note that setting verbose=2 will print some helpful messages during training. If you don't want to see these messages, set verbose=0 to turn them off.
The debug messages will show two metrics: loss and accuracy. loss refers to the categorical_crossentropy loss function used in the model; accuracy refers to the default accuracy metric, which is categorical_accuracy.

Step 3: Do the Model cross validation

Since the dataset is quite small, we can get away with just 5-fold cross validation.
On bigger datasets, increase the number of folds by modifying n_splits.

kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X_train, y_train, cv=kfold)
print("Model Performance: mean: %.2f%% std: (%.2f%%)" % (results.mean()*100, results.std()*100))

At this stage, we can run the code. The whole code up to this point looks like this:

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import LabelBinarizer
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

encoder = LabelBinarizer()
seed = 42

iris = datasets.load_iris()
iris_data_df = pd.DataFrame(data=iris.data, columns=iris.feature_names,
                       dtype=np.float32)
target = encoder.fit_transform(iris.target)
iris_target_df = pd.DataFrame(data=target, columns=iris.target_names)

X_train,X_test,y_train,y_test = train_test_split(iris_data_df,
                                                 iris_target_df,
                                                 test_size=0.30,
                                                 random_state=seed)

scaler = MinMaxScaler(feature_range=(0,1))

X_train = pd.DataFrame(scaler.fit_transform(X_train),
                               columns=X_train.columns,
                               index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                           columns=X_test.columns,
                           index=X_test.index)

def model():
    """build the Keras model callback"""
    model = Sequential()
    model.add(Dense(8, input_dim=4, activation='tanh', name='layer_1'))
    model.add(Dense(10, activation='tanh', name='layer_2'))
    model.add(Dense(10, activation='tanh', name='layer_3'))
    model.add(Dense(3, activation='softmax', name='output_layer'))
    
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=['accuracy'])
    return model

estimator = KerasClassifier(
        build_fn=model,
        epochs=200, batch_size=20,
        verbose=2)
kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X_train, y_train, cv=kfold)
print("Model Performance: mean: %.2f%% std: (%.2f%%)" % (results.mean()*100, results.std()*100))

The mean performance is around 93% which is quite good.

Phase 3 : Model Selection
In our case this is very easy: we have just one model, the DNN we created in Phase 2, and its mean cross-validation performance of around 93% is quite good.

If we had created multiple models in Phase 2, we would compare their performances in this phase and choose the best one.

Phase 4 : Model Training
Now we are in the fast lane. This phase is usually pretty quick. All we have to do is fit the model again.

Optional micro detail:
The cross_val_score API call in Phase 2 makes a clone of the model and uses that for cross validation. Hence we will need to fit the model again in Phase 4.
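
Conceptually, what cross_val_score does per fold looks roughly like this simplified sketch (not the actual sklearn implementation):

from sklearn.base import clone

# Simplified sketch: cross_val_score fits a fresh clone per fold,
# so our original estimator is left unfitted.
fold_estimator = clone(estimator)
# fold_estimator.fit(...) then runs on that fold's training split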

Here we instantiate the model and fit it.

model = model()  # rebind the name: model is now the built network, not the function
model.fit(
       X_train,
       y_train,
       epochs=200,
       shuffle=True,  # shuffle data randomly;
                      # NNs perform best on randomly shuffled data
       verbose=2      # tell Keras to print more detailed info
                      # during training so we know what is going on
       )

Then we do model evaluation. This is the place where we use the test data for the first time: our final check of the model before we deploy it in production.

# run the test dataset through the trained model;
# evaluate returns [loss, accuracy]
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print("{} : {:.2f}%".format(model.metrics_names[1], accuracy*100))
print("{} : {:.4f}".format(model.metrics_names[0], loss))

You can run the code again to see how the model performs on the test data.
We get around 98% accuracy (categorical_accuracy), which is pretty good.

OK. So now it's time to try the model out on some unseen data. Since we didn't train the model on X_test, we will use it for prediction. We will also write a helper function for evaluating the results.

predicted_targets = model.predict_classes(X_test)
true_targets = encoder.inverse_transform(y_test.values)

def performance_tracker(predicted, expected):
    """Print the class counts in the test set and any mispredictions."""
    flowers = {0:'setosa', 1:'versicolor', 2:'virginica'}
    print("Flowers in test set: Setosa={} Versicolor={} Virginica={}".format(
            y_test.setosa.sum(), y_test.versicolor.sum(),
            y_test.virginica.sum()))
    for pred, exp in zip(predicted, expected):
        if pred != exp:
            print("ERROR: {} predicted as {}".format(flowers[exp],
                  flowers[pred]))
            
performance_tracker(predicted_targets, true_targets)

Cool, there were 45 flowers in the test set and only one sample was mispredicted. The rest were classified correctly.
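
If you prefer a more standard summary than the helper above, sklearn's confusion matrix works as well; a minimal sketch using the predicted_targets and true_targets arrays from above:

from sklearn.metrics import confusion_matrix

# Rows are true classes (setosa, versicolor, virginica),
# columns are predicted classes.
print(confusion_matrix(true_targets, predicted_targets))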



