This post contains the main parts of a Jupyter notebook (run on Kaggle) in which I made my first submission to a Kaggle competition. The competition was to build a digit classifier based on the MNIST data set, and to make it as accurate as possible on the competition test data.
I created a model that was not deep learning, but instead a simple baseline model: it computes the average pixel values for each digit (0-9) across the training images, and classifies an image as whichever digit's average it is closest to.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# install fastkaggle if not available
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle
from fastkaggle import *
`setup_comp` is from the fastkaggle library. It returns the path to the competition data; if we're not on Kaggle it will download the data, and it will also install any of the modules passed to it as strings.
comp = 'digit-recognizer'
path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
from fastai.vision.all import *
Let's check what's in the data.
path.ls()
(#3) [Path('../input/digit-recognizer/sample_submission.csv'),Path('../input/digit-recognizer/train.csv'),Path('../input/digit-recognizer/test.csv')]
We have a `train.csv` and a `test.csv`. `test.csv` is what we use for the submission, so it looks like we will be creating our validation set, as well as our training set, from `train.csv`.
Let's look at the data.
df = pd.read_csv(path/'train.csv')
df
| | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
41996 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
41997 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
41998 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
41999 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
42000 rows × 785 columns
It is just as described in the competition guidelines.
Let's split this into training and validation data.
We will split by rows.
We will use 80% for training and 20% for validation. 80% of 42,000 is 33,600 so that will be our split index.
train_data = df.iloc[:33_600,:]
valid_data = df.iloc[33_600:,:]
len(train_data)/42000,len(valid_data)/42000
(0.8, 0.2)
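One thing to note: this is a purely positional split (the first 80% of rows for training, the last 20% for validation). If we suspected the rows of `train.csv` were ordered in some way, a shuffled split would be safer. A rough sketch of that alternative (not used in the rest of this notebook), assuming a fixed seed for reproducibility:
# alternative: shuffle the rows before splitting, so the validation set isn't just the last 20% of the file
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
train_data_shuffled = shuffled.iloc[:33_600, :]
valid_data_shuffled = shuffled.iloc[33_600:, :]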
Our pixel values can be anywhere between 0 and 255. For good practice, and ease of use later, we’ll normalise all these values by dividing by 255 so that they are all values between 0 and 1.
For the training data:
pixel_value_columns = train_data.iloc[:,1:]
label_value_column = train_data.iloc[:,:1]
pixel_value_columns = pixel_value_columns.apply(lambda x: x/255)
train_data = pd.concat([label_value_column, pixel_value_columns], axis=1)
train_data.describe()
| | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 33600.000000 | 33600.0 | 33600.0 | 33600.0 | 33600.0 | 33600.0 | 33600.0 | 33600.0 | 33600.0 | 33600.0 | ... | 33600.000000 | 33600.000000 | 33600.000000 | 33600.000000 | 33600.000000 | 33600.000000 | 33600.0 | 33600.0 | 33600.0 | 33600.0 |
mean | 4.459881 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000801 | 0.000454 | 0.000255 | 0.000086 | 0.000037 | 0.000007 | 0.0 | 0.0 | 0.0 | 0.0 |
std | 2.885525 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.024084 | 0.017751 | 0.013733 | 0.007516 | 0.005349 | 0.001326 | 0.0 | 0.0 | 0.0 | 0.0 |
min | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 2.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
50% | 4.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
75% | 7.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
max | 9.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.996078 | 0.996078 | 0.992157 | 0.992157 | 0.956863 | 0.243137 | 0.0 | 0.0 | 0.0 | 0.0 |
8 rows × 785 columns
And for the validation data:
pixel_value_columns = valid_data.iloc[:,1:]
label_value_column = valid_data.iloc[:,:1]
pixel_value_columns = pixel_value_columns.apply(lambda x: x/255)
valid_data = pd.concat([label_value_column, pixel_value_columns], axis=1)
valid_data.describe()
| | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8400.000000 | 8400.0 | 8400.0 | 8400.0 | 8400.0 | 8400.0 | 8400.0 | 8400.0 | 8400.0 | 8400.0 | ... | 8400.000000 | 8400.000000 | 8400.000000 | 8400.000000 | 8400.000000 | 8400.000000 | 8400.0 | 8400.0 | 8400.0 | 8400.0 |
mean | 4.443690 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.001098 | 0.000482 | 0.000137 | 0.000050 | 0.000190 | 0.000027 | 0.0 | 0.0 | 0.0 | 0.0 |
std | 2.896668 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.027281 | 0.019766 | 0.008370 | 0.003489 | 0.012709 | 0.002482 | 0.0 | 0.0 | 0.0 | 0.0 |
min | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 2.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
50% | 4.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
75% | 7.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
max | 9.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.992157 | 0.996078 | 0.627451 | 0.309804 | 0.996078 | 0.227451 | 0.0 | 0.0 | 0.0 | 0.0 |
8 rows × 785 columns
We'll create a baseline model. For this we will take the average value of each pixel, across all the images (stored as flattened vectors in our data), for each label (each digit from 0-9). Each average will act as our 'ideal' version of the corresponding digit. We can then classify an input by finding which of these averages it is 'closest' to, and taking that digit as the predicted label.
### select the training rows for each digit, filtering on train_data's own label column
all_training_zeros = train_data.loc[train_data['label'] == 0]
all_training_ones = train_data.loc[train_data['label'] == 1]
all_training_twos = train_data.loc[train_data['label'] == 2]
all_training_threes = train_data.loc[train_data['label'] == 3]
all_training_fours = train_data.loc[train_data['label'] == 4]
all_training_fives = train_data.loc[train_data['label'] == 5]
all_training_sixes = train_data.loc[train_data['label'] == 6]
all_training_sevens = train_data.loc[train_data['label'] == 7]
all_training_eights = train_data.loc[train_data['label'] == 8]
all_training_nines = train_data.loc[train_data['label'] == 9]
### the below values are excluding the labels of each of the respective items
### (the mean calculated on pixel values only)
average_zero = all_training_zeros.iloc[:,1:].mean(0)
average_one = all_training_ones.iloc[:,1:].mean(0)
average_two = all_training_twos.iloc[:,1:].mean(0)
average_three = all_training_threes.iloc[:,1:].mean(0)
average_four = all_training_fours.iloc[:,1:].mean(0)
average_five = all_training_fives.iloc[:,1:].mean(0)
average_six = all_training_sixes.iloc[:,1:].mean(0)
average_seven = all_training_sevens.iloc[:,1:].mean(0)
average_eight = all_training_eights.iloc[:,1:].mean(0)
average_nine = all_training_nines.iloc[:,1:].mean(0)
averages = [average_zero, average_one, average_two, average_three, average_four, average_five, average_six, average_seven, average_eight, average_nine]
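Since each row is a flattened 28×28 image, we can sanity-check one of these averages by reshaping it back into a 28×28 grid and plotting it. A minimal sketch, assuming matplotlib is available (it is on Kaggle):
import matplotlib.pyplot as plt
# reshape the 784 average pixel values into a 28x28 grid and display it
plt.imshow(average_zero.to_numpy().reshape(28, 28), cmap='gray')
plt.title('average zero')
plt.show()
It should look like a blurry zero, since it is the mean of every zero in the training set.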
We can check our distances using root mean squared error (RMSE). Let's see how that works on a few items.
### excluding the label value from our calculation
first_zero = all_training_zeros.iloc[0,1:]
((first_zero-average_zero)**2).mean()**0.5
0.2529256872506997
We can see that this is very close (a small difference), as expected. Now let’s see how far the same zero item is from the average four.
((first_zero-average_four)**2).mean()**0.5
0.387471198659217
A much larger distance, also as expected.
So now, for each item, we can check its distance from each of the average digits and take our prediction to be whichever one gives the shortest distance.
Let's define an RMSE function to use on our dataframes.
def df_rmse(a,b):
    return ((a-b)**2).mean()**0.5
Let's create a list of the labels.
labels = [0,1,2,3,4,5,6,7,8,9]
And now the function (our model) that will return the label with the lowest distance.
def get_label_prediction(x):
    distances = []
    for average in averages:
        distances.append(df_rmse(x, average))
    return labels[distances.index(min(distances))]
get_label_prediction(first_zero)
0
Let's run this against our validation set and see what proportion of the predictions are actually correct.
def is_correct(pred,actual):
    return pred == actual
Let’s produce a list of predictions against actuals.
# row[1] is the row as a Series: element 0 is the label, elements 1: are the pixel values
correct_or_not_list = [is_correct(get_label_prediction(row[1][1:]), row[1][0]) for row in valid_data.iterrows()]
correct_or_not_list[:10]
[True, True, True, True, True, True, True, True, False, False]
We can work out our accuracy on this validation data by first converting all `True` booleans to `1` and all `False` booleans to `0`, and then taking the mean of all values to see what proportion of our predictions were correct.
def get_accuracy(data):
    correct_or_not_list = [is_correct(get_label_prediction(row[1][1:]), row[1][0]) for row in data.iterrows()]
    boolean_number_list = [int(boolean) for boolean in correct_or_not_list]
    return np.array(boolean_number_list).mean()
get_accuracy(valid_data)
0.8126190476190476
81.3%. I deem this a good baseline. We can run this model on the test data provided and submit it to the Kaggle competition, with the aim of beating it in future iterations.
Let’s first take a look at the test data to see if it is the same format as our training data.
df = pd.read_csv(path/'test.csv')
df.describe()
| | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | ... | 28000.000000 | 28000.000000 | 28000.000000 | 28000.000000 | 28000.000000 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 |
mean | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.164607 | 0.073214 | 0.028036 | 0.011250 | 0.006536 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
std | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.473293 | 3.616811 | 1.813602 | 1.205211 | 0.807475 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
max | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 253.000000 | 254.000000 | 193.000000 | 187.000000 | 119.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
8 rows × 784 columns
Let's also normalise the pixel values, like we did before. Note that `test.csv` has no label column, so this time we can simply divide every column by 255.
test_data = df.apply(lambda x: x/255)
test_data.describe()
| | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | ... | 28000.000000 | 28000.000000 | 28000.000000 | 28000.000000 | 28000.000000 | 28000.0 | 28000.0 | 28000.0 | 28000.0 | 28000.0 |
mean | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000646 | 0.000287 | 0.000110 | 0.000044 | 0.000026 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
std | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.021464 | 0.014184 | 0.007112 | 0.004726 | 0.003167 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
max | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.992157 | 0.996078 | 0.756863 | 0.733333 | 0.466667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
8 rows × 784 columns
Now let’s create our predictions for each of the test images.
# each row is just the 784 pixel values, since the test data has no label column
predictions = [get_label_prediction(row[1]) for row in test_data.iterrows()]
predictions[:10]
[2, 0, 9, 4, 3, 7, 0, 3, 0, 3]
Let's take a look at how our submission of these predictions should look.
ss = pd.read_csv(path/'sample_submission.csv')
ss
| | ImageId | Label |
---|---|---|
0 | 1 | 0 |
1 | 2 | 0 |
2 | 3 | 0 |
3 | 4 | 0 |
4 | 5 | 0 |
... | ... | ... |
27995 | 27996 | 0 |
27996 | 27997 | 0 |
27997 | 27998 | 0 |
27998 | 27999 | 0 |
27999 | 28000 | 0 |
28000 rows × 2 columns
So let's utilise this sample submission by keeping the IDs and inserting our predictions into the "Label" column.
results = pd.Series(predictions, name="Label")
ss['Label'] = results
ss
| | ImageId | Label |
---|---|---|
0 | 1 | 2 |
1 | 2 | 0 |
2 | 3 | 9 |
3 | 4 | 4 |
4 | 5 | 3 |
... | ... | ... |
27995 | 27996 | 9 |
27996 | 27997 | 7 |
27997 | 27998 | 3 |
27998 | 27999 | 9 |
27999 | 28000 | 2 |
28000 rows × 2 columns
Now let’s create a csv to submit.
ss.to_csv('subm.csv', index=False)
!head subm.csv
ImageId,Label
1,2
2,0
3,9
4,4
5,3
6,7
7,0
8,3
9,0
Looks good! Now we can submit this to Kaggle.
We can do it straight from this notebook if we are running it on Kaggle; otherwise we can use the Kaggle API.
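For the API route, something like the following should work when running locally with Kaggle credentials configured (a hedged sketch; `iskaggle` comes from fastkaggle, and the submission message is just an arbitrary description):
if not iskaggle:
    from kaggle import api
    # upload subm.csv as a submission to the digit-recognizer competition
    api.competition_submit_cli('subm.csv', 'average-pixel baseline', comp)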
In this case I did it directly using the Kaggle notebook UI, selecting my 'subm.csv' file from the output folder there (see these guidelines).
After submitting the results, a score of 0.80750 was given (a perfect score being 1.00000), which lines up closely with our validation accuracy. Not a bad result at all for such a simple model. The next thing to try will be a simple neural net…