Gurtaj's blog!

This post contains the main parts of a Jupyter notebook (run on Kaggle) in which I made my first submission to a Kaggle competition. The competition was to build a digit classifier based on the MNIST data set, and to make it as accurate as possible on the competition test data.

Rather than deep learning, I created a simple baseline model that classifies each digit by its distance from a per-digit average image. First, the standard setup:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# install fastkaggle if not available
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *

setup_comp is from the fastkaggle library; it gets the path to the data for the competition. If we’re not on Kaggle, it downloads the data and installs any modules passed to it as strings.

comp = 'digit-recognizer'

path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
from fastai.vision.all import *

Let’s check what’s in the data.

path.ls()
(#3) [Path('../input/digit-recognizer/sample_submission.csv'),Path('../input/digit-recognizer/train.csv'),Path('../input/digit-recognizer/test.csv')]

We have a train.csv and a test.csv. test.csv is what we use for the submission, so it looks like we will be creating both our training set and our validation set from train.csv.

Let’s look at the data.

df = pd.read_csv(path/'train.csv')
df
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
41995 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41996 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41997 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41998 6 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41999 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

42000 rows × 785 columns

It is just as described in the competition guidelines.

Let’s split this into training and validation data.
We will split by rows.
We will use 80% for training and 20% for validation. 80% of 42,000 is 33,600, so that will be our split index.

train_data = df.iloc[:33_600,:]
valid_data = df.iloc[33_600:,:]

len(train_data)/42000,len(valid_data)/42000
(0.8, 0.2)
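
Note that this takes the first 33,600 rows as they come, so we’re relying on train.csv not being ordered by label (the label column above suggests it isn’t). If we wanted to be safe, we could shuffle before splitting. A minimal sketch, not what this notebook actually ran:

### hypothetical alternative: shuffle the rows first, then split as before
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
train_data = shuffled.iloc[:33_600, :]
valid_data = shuffled.iloc[33_600:, :]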

Our pixel values can be anywhere between 0 and 255. For good practice, and for ease of use later, we’ll normalise these values by dividing by 255 so that they all lie between 0 and 1.

For the training data:

pixel_value_columns = train_data.iloc[:,1:]
label_value_column = train_data.iloc[:,:1]

pixel_value_columns = pixel_value_columns.apply(lambda x: x/255)
train_data = pd.concat([label_value_column, pixel_value_columns], axis=1)

train_data.describe()
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 33600.000000 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 ... 33600.000000 33600.000000 33600.000000 33600.000000 33600.000000 33600.000000 33600.0 33600.0 33600.0 33600.0
mean 4.459881 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000801 0.000454 0.000255 0.000086 0.000037 0.000007 0.0 0.0 0.0 0.0
std 2.885525 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.024084 0.017751 0.013733 0.007516 0.005349 0.001326 0.0 0.0 0.0 0.0
min 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
25% 2.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
50% 4.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
75% 7.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
max 9.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.996078 0.996078 0.992157 0.992157 0.956863 0.243137 0.0 0.0 0.0 0.0

8 rows × 785 columns

And for the validation data:

pixel_value_columns = valid_data.iloc[:,1:]
label_value_column = valid_data.iloc[:,:1]

pixel_value_columns = pixel_value_columns.apply(lambda x: x/255)
valid_data = pd.concat([label_value_column, pixel_value_columns], axis=1)

valid_data.describe()
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 8400.000000 8400.0 8400.0 8400.0 8400.0 8400.0 8400.0 8400.0 8400.0 8400.0 ... 8400.000000 8400.000000 8400.000000 8400.000000 8400.000000 8400.000000 8400.0 8400.0 8400.0 8400.0
mean 4.443690 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.001098 0.000482 0.000137 0.000050 0.000190 0.000027 0.0 0.0 0.0 0.0
std 2.896668 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.027281 0.019766 0.008370 0.003489 0.012709 0.002482 0.0 0.0 0.0 0.0
min 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
25% 2.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
50% 4.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
75% 7.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
max 9.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.992157 0.996078 0.627451 0.309804 0.996078 0.227451 0.0 0.0 0.0 0.0

8 rows × 785 columns

Baseline model: distance from mean

We’ll create a baseline model. For this we take the average value of each pixel, across all the images (each stored as a 784-value row in our data), separately for each label (each digit from 0 to 9). Each average acts as our ‘ideal’ version of the corresponding digit, and we predict an input’s label by finding whichever average is ‘closest’ to it.

all_training_zeros = train_data.loc[train_data['label'] == 0]
all_training_ones = train_data.loc[train_data['label'] == 1]
all_training_twos = train_data.loc[train_data['label'] == 2]
all_training_threes = train_data.loc[train_data['label'] == 3]
all_training_fours = train_data.loc[train_data['label'] == 4]
all_training_fives = train_data.loc[train_data['label'] == 5]
all_training_sixes = train_data.loc[train_data['label'] == 6]
all_training_sevens = train_data.loc[train_data['label'] == 7]
all_training_eights = train_data.loc[train_data['label'] == 8]
all_training_nines = train_data.loc[train_data['label'] == 9]

### the below values are excluding the labels of each of the respective items
### (the mean calculated on pixel values only)
average_zero = all_training_zeros.iloc[:,1:].mean(0)
average_one = all_training_ones.iloc[:,1:].mean(0)
average_two = all_training_twos.iloc[:,1:].mean(0)
average_three = all_training_threes.iloc[:,1:].mean(0)
average_four = all_training_fours.iloc[:,1:].mean(0)
average_five = all_training_fives.iloc[:,1:].mean(0)
average_six = all_training_sixes.iloc[:,1:].mean(0)
average_seven = all_training_sevens.iloc[:,1:].mean(0)
average_eight = all_training_eights.iloc[:,1:].mean(0)
average_nine = all_training_nines.iloc[:,1:].mean(0)

averages = [average_zero, average_one, average_two, average_three, average_four, average_five, average_six, average_seven, average_eight, average_nine]
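
As a quick sanity check (an optional aside, not part of the original notebook), we could reshape one of these 784-value averages back into its 28×28 grid and plot it; it should look like a blurry version of its digit. A sketch, assuming matplotlib is available:

### optional: visualise an average digit (assumes matplotlib)
import matplotlib.pyplot as plt

plt.imshow(average_zero.to_numpy().reshape(28, 28), cmap='gray')
plt.title('average zero')
plt.show()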

We can measure our distances using root mean squared error (RMSE): take the differences between corresponding pixel values, square them, average them, and take the square root. Let’s see how that works on a few items.

### excluding the label value from our calculation
first_zero = all_training_zeros.iloc[0,1:]

((first_zero-average_zero)**2).mean()**0.5
0.2529256872506997

We can see that this distance is small, as expected: the zero is close to the average zero. Now let’s see how far the same zero item is from the average four.

((first_zero-average_four)**2).mean()**0.5
0.387471198659217

A much larger distance, also as expected.

So now, for each item, we can check its distance from each of the average digits and then take our prediction to be whichever distance is the shortest.

Let’s define an RMSE function to use on our dataframes.

def df_rmse(a,b):
    return ((a-b)**2).mean()**0.5

Let’s create a list of the labels.

labels = [0,1,2,3,4,5,6,7,8,9]

And now the function (our model) that will return the label with the lowest distance.

def get_label_prediction(x):
    distances = []
    for average in averages:
        distances.append(df_rmse(x, average))
    return labels[distances.index(min(distances))]
get_label_prediction(first_zero)
0
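
This works, but it recomputes ten Series-based RMSEs per image inside a Python loop, which gets slow across thousands of rows. As an optional aside (not the code this notebook ran), the whole thing can be vectorised with NumPy: since RMSE is a monotone function of the squared distance, finding the nearest average by plain squared distances gives the same answer.

### vectorised sketch (assumes the `averages` list built above)
avg_matrix = np.stack([a.to_numpy() for a in averages])   # shape (10, 784)

def predict_all(pixels_df):
    X = pixels_df.to_numpy()                              # shape (n, 784)
    ### squared distances for all (image, average) pairs at once, using
    ### ||x - a||^2 = ||x||^2 - 2*x.a + ||a||^2
    d2 = (X**2).sum(1)[:, None] - 2 * X @ avg_matrix.T + (avg_matrix**2).sum(1)
    return d2.argmin(axis=1)                              # nearest average = label

For example, predict_all(valid_data.iloc[:, 1:]) would produce all the validation predictions in one call.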

Let’s give this a run against our validation set and see what proportion of the predictions made are actually correct.

def is_correct(pred,actual):
    return pred == actual

Let’s produce a list recording whether each prediction matches its actual label.

correct_or_not_list = [is_correct(get_label_prediction(row[1][1:]), row[1][0]) for row in valid_data.iterrows()]

correct_or_not_list[:10]
[True, True, True, True, True, True, True, True, False, False]

We can work out our accuracy on this validation data by first converting all True booleans to 1 and all False booleans to 0, and then taking the mean of all the values to see what proportion of our predictions were correct.

def get_accuracy(data):
    correct_or_not_list = [is_correct(get_label_prediction(row[1][1:]), row[1][0]) for row in data.iterrows()]
    boolean_number_list = [int(boolean) for boolean in correct_or_not_list]
    
    return np.array(boolean_number_list).mean()
get_accuracy(valid_data)
0.8126190476190476

81.3%. I deem this a good baseline. We can run this model on the test data provided and submit it to the Kaggle competition, aiming to beat it in future iterations.
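
Before moving on to the test set, an optional diagnostic (not in the original notebook) would be per-digit accuracy, to see which digits this baseline struggles with most:

### hypothetical check: accuracy broken down by digit
for digit in labels:
    subset = valid_data.loc[valid_data['label'] == digit]
    print(digit, round(get_accuracy(subset), 3))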

Let’s first take a look at the test data to see if it’s in the same format as our training data.

df = pd.read_csv(path/'test.csv')
df.describe()
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 ... 28000.000000 28000.000000 28000.000000 28000.000000 28000.000000 28000.0 28000.0 28000.0 28000.0 28000.0
mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.164607 0.073214 0.028036 0.011250 0.006536 0.0 0.0 0.0 0.0 0.0
std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 5.473293 3.616811 1.813602 1.205211 0.807475 0.0 0.0 0.0 0.0 0.0
min 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
25% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
50% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
75% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
max 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 253.000000 254.000000 193.000000 187.000000 119.000000 0.0 0.0 0.0 0.0 0.0

8 rows × 784 columns

Let’s also normalise the pixel values, like we did before. One difference from train.csv: test.csv has no label column, so all 784 columns are pixel values and we can simply divide the whole dataframe by 255.

### test.csv has no label column, so every column is a pixel value
test_data = df / 255

test_data.describe()
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 ... 28000.000000 28000.000000 28000.000000 28000.000000 28000.000000 28000.0 28000.0 28000.0 28000.0 28000.0
mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000646 0.000287 0.000110 0.000044 0.000026 0.0 0.0 0.0 0.0 0.0
std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.021464 0.014184 0.007112 0.004726 0.003167 0.0 0.0 0.0 0.0 0.0
min 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
25% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
50% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
75% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
max 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.992157 0.996078 0.756863 0.733333 0.466667 0.0 0.0 0.0 0.0 0.0

8 rows × 784 columns

Now let’s create our predictions for each of the test images.

predictions = [get_label_prediction(row[1]) for row in test_data.iterrows()]
predictions[:10]
[2, 0, 9, 4, 3, 7, 0, 3, 0, 3]

Let’s take a look at how our submission of these predictions should look.

ss = pd.read_csv(path/'sample_submission.csv')

ss
ImageId Label
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
... ... ...
27995 27996 0
27996 27997 0
27997 27998 0
27998 27999 0
27999 28000 0

28000 rows × 2 columns

So let’s utilise this sample submission by keeping the IDs and inserting our predictions into the “Label” column.

results = pd.Series(predictions, name="Label")
ss['Label'] = results

ss
ImageId Label
0 1 2
1 2 0
2 3 9
3 4 4
4 5 3
... ... ...
27995 27996 9
27996 27997 7
27997 27998 3
27998 27999 9
27999 28000 2

28000 rows × 2 columns

Now let’s create a csv to submit.

ss.to_csv('subm.csv', index=False)
!head subm.csv
ImageId,Label
1,2
2,0
3,9
4,4
5,3
6,7
7,0
8,3
9,0

Looks good! Now we can submit this to Kaggle.

We can submit straight from this notebook if we are running it on Kaggle; otherwise we can use the Kaggle API.
In this case I did it directly through the Kaggle notebook UI, selecting my ‘subm.csv’ file from the output folder there (see these guidelines).
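
For reference, a submission via the Kaggle CLI would look something like this (assuming the kaggle package is installed and API credentials are configured):

!kaggle competitions submit -c digit-recognizer -f subm.csv -m "mean-distance baseline"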

After submitting the results, a score of 0.80750 was given (a perfect score being 1.00000), which lines up closely with our 81.3% validation accuracy. Not a bad result at all for such a simple model. The next thing to try will be a simple neural net…