Gurtaj's blog!

Introduction

In the previous notebook, I applied some core concepts I had learned about creating and training models to my then-failing linear equation model on the MNIST data set.
The optimisations I had made were to do with how many activations I was producing per image, and with what shape my data was in whenever it needed processing.

The updated linear equation model was very successful, so I am now going to apply the same optimisations to my two-linear-layer model.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
# install fastkaggle if not available
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *
comp = 'digit-recognizer'

path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /Users/gaz/.kaggle/kaggle.json'
from fastai.vision.all import *
path.ls()
(#3) [Path('digit-recognizer/test.csv'),Path('digit-recognizer/train.csv'),Path('digit-recognizer/sample_submission.csv')]
df = pd.read_csv(path/'train.csv')
df
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
41995 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41996 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41997 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41998 6 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41999 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

42000 rows × 785 columns

train_data_split = df.iloc[:33_600,:]
valid_data_split = df.iloc[33_600:,:]

len(train_data_split)/42000,len(valid_data_split)/42000
(0.8, 0.2)
pixel_value_columns = train_data_split.iloc[:,1:]
label_value_column = train_data_split.iloc[:,:1]

pixel_value_columns = pixel_value_columns.apply(lambda x: x/255)
train_data = pd.concat([label_value_column, pixel_value_columns], axis=1)

train_data.describe()
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 33600.000000 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 ... 33600.000000 33600.000000 33600.000000 33600.000000 33600.000000 33600.000000 33600.0 33600.0 33600.0 33600.0
mean 4.459881 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000801 0.000454 0.000255 0.000086 0.000037 0.000007 0.0 0.0 0.0 0.0
std 2.885525 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.024084 0.017751 0.013733 0.007516 0.005349 0.001326 0.0 0.0 0.0 0.0
min 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
25% 2.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
50% 4.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
75% 7.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
max 9.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.996078 0.996078 0.992157 0.992157 0.956863 0.243137 0.0 0.0 0.0 0.0

8 rows × 785 columns

pixel_value_columns_tensor = torch.tensor(train_data.iloc[:,1:].values).float()
label_value_column_tensor = torch.tensor(train_data.iloc[:,:1].values).float()

train_ds = list(zip(pixel_value_columns_tensor,label_value_column_tensor))

We’ll make this a function, so that we can do the same again for our validation data.

train_dl = DataLoader(train_ds, batch_size=256)
train_xb,train_yb = first(train_dl)

train_xb.shape,train_yb.shape
(torch.Size([256, 784]), torch.Size([256, 1]))
def dataset_from_dataframe(dframe):
    pixel_value_columns = dframe.iloc[:,1:]
    label_value_column = dframe.iloc[:,:1]

    pixel_value_columns = pixel_value_columns.apply(lambda x: x/255)

    # use the local, normalised columns (not the global train_data) so this works for any dataframe passed in
    pixel_value_columns_tensor = torch.tensor(pixel_value_columns.values).float()
    label_value_column_tensor = torch.tensor(label_value_column.values).float()

    return list(zip(pixel_value_columns_tensor, label_value_column_tensor))
valid_ds = dataset_from_dataframe(valid_data_split)

valid_dl = DataLoader(valid_ds, batch_size=256)

To ease my mind and help spot places where I could be making errors, I’ll make a function that can visually show a particular input (digit image) to me.

def show_image(item):
    item = item.view(28,28) * 255
    plt.gray()
    plt.imshow(item, interpolation='nearest')
    plt.show()

Now, for my sanity, I’ll test an image from train_xb.

show_image(train_xb[0])

[Output: a 28×28 greyscale image of the first training digit]

def init_params(size): return (torch.rand(size) - 0.5).requires_grad_()
# 30 lots of 784 length weight arrays, therefore 30 outputs
w1 = init_params((30,784))
b1 = init_params((30, 1))
# 10 lots of 30 length weight arrays, therefore 10 outputs
w2 = init_params((10,30))
b2 = init_params((10, 1))
def simple_nn(batch):
    res = w1@batch.T + b1
    res = F.relu(res)
    res = w2@res + b2
    return F.softmax(res, dim=0)
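
Before using it, here is a quick trace of the shapes involved (a sketch I’m adding for clarity, using the batch size of 256 from above): w1 @ batch.T is [30, 784] @ [784, 256] = [30, 256], b1 with shape [30, 1] broadcasts across the 256 columns, and the second layer gives [10, 30] @ [30, 256] = [10, 256]. So each column holds the ten class scores for one image, and softmax(dim=0) normalises down each column.

dummy_xb = torch.rand(256, 784)        # a made-up batch, purely to check the shapes
print((w1 @ dummy_xb.T + b1).shape)    # torch.Size([30, 256])
print(simple_nn(dummy_xb).shape)       # torch.Size([10, 256])
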
first_batch_predictions = simple_nn(train_xb)

first_batch_predictions.shape
torch.Size([10, 256])

Let’s see what we get from the first column (results of the first image).

first_batch_predictions[:,0]
tensor([1.0476e-03, 5.4595e-03, 9.7319e-02, 2.4645e-02, 4.4612e-03, 9.0264e-02,
        2.5988e-04, 7.6168e-01, 1.2942e-02, 1.9236e-03],
       grad_fn=<SelectBackward0>)

Just to confirm that we applied the softmax call across the right dimension, let’s ensure all ten of these values add up to 1.

sum(first_batch_predictions[:,0])
tensor(1., grad_fn=<AddBackward0>)
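
As a side note (a small sketch I’m adding, not part of the original run), here is what the dim argument controls. With our [10, 256] layout, dim=0 normalises each column, i.e. one image’s ten class scores, whereas dim=1 would wrongly normalise each class across the 256 images.

toy = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])   # pretend: 3 classes, 2 images
F.softmax(toy, dim=0).sum(dim=0)   # tensor([1., 1.]) - each image's scores sum to 1
F.softmax(toy, dim=1).sum(dim=1)   # tensor([1., 1., 1.]) - each class would sum to 1 instead
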

Loss Function

number_of_classes = 10

def one_hot(yb):
    batch_size = len(yb)
    one_hot_yb = torch.zeros(batch_size, number_of_classes)
    x_coordinates_array = torch.arange(len(one_hot_yb))
    # used `.squeeze()` because yb originally has the size (batch_size, 1) and we just want a size of (batch_size). ([1, 2, 3, ...] instead of [[1], [2], [3], ...])
    # used `.long()` because: "tensors used as indices must be long, int, byte or bool tensors"
    y_coordinates_array = yb.squeeze().long()
    # set to `1.` rather than `1` because: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source.
    one_hot_yb[x_coordinates_array, y_coordinates_array] = torch.tensor(1.)
    
    return one_hot_yb.T
def rmse(a, b):
    b = one_hot(b)
    mse = nn.MSELoss()
    loss = torch.sqrt(mse(a, b))
    
    return loss
one_hot(train_yb),one_hot(train_yb).shape,one_hot(train_yb)[:,0],train_yb[0]
(tensor([[0., 1., 0.,  ..., 0., 0., 0.],
         [1., 0., 1.,  ..., 0., 0., 1.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 1., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 torch.Size([10, 256]),
 tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]),
 tensor([1.]))

Let’s run it on our test batch predictions from above.

rmse(first_batch_predictions, train_yb)
tensor(0.3866, grad_fn=<SqrtBackward0>)
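
For a rough sense of scale (a quick sketch I’m adding, not part of the original run): a model that predicted a uniform 0.1 for every class would score sqrt((9 × 0.1² + 0.9²) / 10) ≈ 0.3, so 0.3866 from randomly initialised weights is in the expected ballpark.

uniform_preds = torch.full((10, 256), 0.1)   # every class gets probability 0.1 for every image
rmse(uniform_preds, train_yb)                # ≈ tensor(0.3000)
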

Trainability

def calc_grad(batch_inputs, batch_labels, batch_model):
    batch_preds = batch_model(batch_inputs)
    loss = rmse(batch_preds, batch_labels)
    loss.backward()
def train_epoch(dl, batch_model, params, lr):
    for xb,yb in dl:
        calc_grad(xb, yb, batch_model)
        for p in params:
            # step each parameter against its gradient, then reset the gradient for the next batch
            p.data -= p.grad*lr
            p.grad.zero_()
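
As an aside, this manual update is exactly what a basic SGD optimiser would do for us. A rough equivalent using torch.optim (just a sketch for comparison, not what this notebook actually uses) could look like this, with the same 0.01 learning rate used below:

opt = torch.optim.SGD([w1, b1, w2, b2], lr=0.01)

def train_epoch_with_optimizer(dl, batch_model, opt):
    for xb, yb in dl:
        calc_grad(xb, yb, batch_model)
        opt.step()        # performs p -= p.grad * lr for every parameter
        opt.zero_grad()   # resets the gradients, like p.grad.zero_() above
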

Validation and Metric

def get_predicted_label(pred):
    #returns index of highest value in tensor, which conveniently is also directly the digit/label that it corresponds to
    return torch.argmax(pred)
get_predicted_label(torch.tensor([0,4,3,2,6,1]))
tensor(4)

Let’s test this on some predictions from first_batch_predictions to ensure that we are getting sensible values (values from 0-9).

get_predicted_label(first_batch_predictions[:,0]),get_predicted_label(first_batch_predictions[:,1]),get_predicted_label(first_batch_predictions[:,3]),get_predicted_label(first_batch_predictions[:,5]),get_predicted_label(first_batch_predictions[:,33])
(tensor(7), tensor(3), tensor(3), tensor(3), tensor(3))

Accuracy

def batch_accuracy(preds, yb):
    #remember each column in our preds is an individual prediction, so we transpose preds in order to iterate through each prediction in our list comprehension below
    preds = torch.tensor([get_predicted_label(pred) for pred in preds.T])
    # is_correct is a tensor of True and False values
    is_correct = preds==yb.squeeze()
    # now we turn all True values into 1 and all False values into 0, then return the mean of those values
    return is_correct.float().mean()
batch_accuracy(simple_nn(train_xb[:100]),train_yb[:100])
tensor(0.0800)
def validate_epoch(dl, batch_model):
    accuracies = [batch_accuracy(batch_model(xb),yb) for xb,yb in dl]
    # turn list of tensors into one single tensor of stacked values, so that we can then calculate the mean across all those values
    stacked_tensor = torch.stack(accuracies)
    mean_tensor = stacked_tensor.mean()
    # round() works on a plain Python number rather than a tensor, so we use item() to extract the value (and then round to four decimal places)
    return round(mean_tensor.item(), 4)
validate_epoch(valid_dl, simple_nn)
0.0644

Train for Number of Epochs

lr = 0.01
params = w1,b1,w2,b2
train_epoch(train_dl, simple_nn, params, lr)
validate_epoch(valid_dl, simple_nn)
0.082

Now let’s train our model for 500 more epochs and see if it improves.

for i in range(500):
    train_epoch(train_dl, simple_nn, params, lr)
    # run validate_epoch on every 50th iteration
    if i % 50 == 0:
        print(validate_epoch(valid_dl, simple_nn), ' ')
0.1113  
0.5865  
0.7313  
0.7669  
0.8314  
0.8682  
0.8825  
0.8929  
0.8995  
0.9053  

That’s ~91% accuracy on our validation data. The single linear function model in the previous notebook reached ~87% accuracy in the same number of epochs, so we are getting higher accuracy with this two-layer network. One key thing to notice is the huge jump in accuracy from the first printout to the second: that is a drastic improvement in trainability over the single-layer linear model in the previous notebook.

I’ll now run this on the test data and submit it to the kaggle competition in order to see if it’s any good…

Submission

test_df = pd.read_csv(path/'test.csv')

test_df.describe()
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 ... 28000.000000 28000.000000 28000.000000 28000.000000 28000.000000 28000.0 28000.0 28000.0 28000.0 28000.0
mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.164607 0.073214 0.028036 0.011250 0.006536 0.0 0.0 0.0 0.0 0.0
std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 5.473293 3.616811 1.813602 1.205211 0.807475 0.0 0.0 0.0 0.0 0.0
min 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
25% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
50% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
75% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
max 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 253.000000 254.000000 193.000000 187.000000 119.000000 0.0 0.0 0.0 0.0 0.0

8 rows × 784 columns

Now we format it into what our model expects. Since we don’t have labels for this data, we’ll make a batch of inputs only (the batch will be the whole of the test data).

test_tensor = torch.tensor(test_df.values)/255

test_tensor.shape
torch.Size([28000, 784])

Let’s take a look at the first image, just for sanity.

show_image(test_tensor[0])

[Output: a 28×28 greyscale image of the first test digit]

Now let’s create a function that produces a meaningful output (digit prediction) for each image, using our model.

def predict(batch):
    preds = simple_nn(batch)
    # convert tensor to numpy value
    preds = [get_predicted_label(pred).numpy() for pred in preds.T]
    
    return preds
preds = predict(test_tensor)

preds[:5]
[array(2), array(0), array(9), array(9), array(2)]
pred_labels_series = pd.Series(preds, name="Label")

pred_labels_series
0        2
1        0
2        9
3        9
4        2
        ..
27995    9
27996    7
27997    3
27998    9
27999    2
Name: Label, Length: 28000, dtype: object
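
Note the dtype is object because each element is a 0-d numpy array rather than a plain Python integer. It still writes out fine below, but if we wanted an integer Series we could convert the predictions first (an optional tweak I’m sketching here, not what was actually run):

pred_labels_series = pd.Series([int(p) for p in preds], name="Label")   # dtype: int64
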
sample_submission = pd.read_csv(path/'sample_submission.csv')
sample_submission
ImageId Label
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
... ... ...
27995 27996 0
27996 27997 0
27997 27998 0
27998 27999 0
27999 28000 0

28000 rows × 2 columns

len(test_df)
28000
sample_submission['Label'] = pred_labels_series

sample_submission
ImageId Label
0 1 2
1 2 0
2 3 9
3 4 9
4 5 2
... ... ...
27995 27996 9
27996 27997 7
27997 27998 3
27998 27999 9
27999 28000 2

28000 rows × 2 columns

# this outputs the actual file
sample_submission.to_csv('subm.csv', index=False)
#this shows the head (first few lines)
!head subm.csv
ImageId,Label
1,2
2,0
3,9
4,9
5,2
6,7
7,0
8,3
9,0
!kaggle competitions list
!kaggle competitions files digit-recognizer 
!kaggle competitions submit -c digit-recognizer  -f ./subm.csv -m "going from single linear function model to two-linear-layer model"

This received a score of 0.90142, an improvement on the previous model yet again!

Conclusion

Using just the things learnt in the previous notebook, we have managed to get our previously non-working two-linear-layer model to work, and with better results than the single-layer model (as one would expect).

The next step will be to try out our two-linear-layer model that utilises PyTorch’s nn modules and see if any of the things we have learnt thus far will help us to get that working too. This will be done in a separate notebook.