Gurtaj's blog!

Introduction

In the previous notebook, I applied some core concepts I had learned about creating and training models to my then-failing linear equation model on the MNIST data set.
The optimisations I had made were to do with how many activations I was producing per image, and with what shape my data was in whenever it needed processing.

The updated linear equation model was very successful, so I am now going to apply the same optimisations to my two-linear-layer model.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
# install fastkaggle if not available
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *
comp = 'digit-recognizer'

path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /Users/gaz/.kaggle/kaggle.json'
from fastai.vision.all import *
path.ls()
(#3) [Path('digit-recognizer/test.csv'),Path('digit-recognizer/train.csv'),Path('digit-recognizer/sample_submission.csv')]
df = pd.read_csv(path/'train.csv')
df
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
41995 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41996 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41997 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41998 6 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41999 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

42000 rows × 785 columns

train_data_split = df.iloc[:33_600,:]
valid_data_split = df.iloc[33_600:,:]

len(train_data_split)/42000,len(valid_data_split)/42000
(0.8, 0.2)
pixel_value_columns = train_data_split.iloc[:,1:]
label_value_column = train_data_split.iloc[:,:1]

pixel_value_columns = pixel_value_columns.apply(lambda x: x/255)
train_data = pd.concat([label_value_column, pixel_value_columns], axis=1)

train_data.describe()
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 33600.000000 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 33600.0 ... 33600.000000 33600.000000 33600.000000 33600.000000 33600.000000 33600.000000 33600.0 33600.0 33600.0 33600.0
mean 4.459881 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000801 0.000454 0.000255 0.000086 0.000037 0.000007 0.0 0.0 0.0 0.0
std 2.885525 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.024084 0.017751 0.013733 0.007516 0.005349 0.001326 0.0 0.0 0.0 0.0
min 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
25% 2.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
50% 4.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
75% 7.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0
max 9.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.996078 0.996078 0.992157 0.992157 0.956863 0.243137 0.0 0.0 0.0 0.0

8 rows × 785 columns

pixel_value_columns_tensor = torch.tensor(train_data.iloc[:,1:].values).float()
label_value_column_tensor = torch.tensor(train_data.iloc[:,:1].values).float()

train_ds = list(zip(pixel_value_columns_tensor,label_value_column_tensor))

We’ll make this a function, so that we can do the same again for our validation data.

train_dl = DataLoader(train_ds, batch_size=256)
train_xb,train_yb = first(train_dl)

train_xb.shape,train_yb.shape
(torch.Size([256, 784]), torch.Size([256, 1]))
def dataset_from_dataframe(dframe):
    pixel_value_columns = dframe.iloc[:,1:]
    label_value_column = dframe.iloc[:,:1]

    pixel_value_columns = pixel_value_columns.apply(lambda x: x/255)

    # use the local, normalised columns (not the global train_data) so this works for any dataframe passed in
    pixel_value_columns_tensor = torch.tensor(pixel_value_columns.values).float()
    label_value_column_tensor = torch.tensor(label_value_column.values).float()

    return list(zip(pixel_value_columns_tensor, label_value_column_tensor))
valid_ds = dataset_from_dataframe(valid_data_split)

valid_dl = DataLoader(valid_ds, batch_size=256)

To ease my mind and help spot places where I could be making errors, I’ll make a function that can visually show a particular input (digit image) to me.

def show_image(item):
    item = item.view(28,28) * 255
    plt.gray()
    plt.imshow(item, interpolation='nearest')
    plt.show()

Now, for my sanity, I’ll test an image from train_xb.

show_image(train_xb[0])

[Output: a 28×28 greyscale image of the first training digit]

def init_params(size): return (torch.rand(size) - 0.5).requires_grad_()
# 30 lots of 784 length weight arrays, therefore 30 outputs
w1 = init_params((30,784))
b1 = init_params((30, 1))
# 10 lots of 30 length weight arrays, therefore 10 outputs
w2 = init_params((10,30))
b2 = init_params((10, 1))
def simple_nn(batch):
    res = w1@batch.T + b1
    res = F.relu(res)
    res = w2@res + b2
    return F.softmax(res, dim=0)
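
Before using it, here is a quick trace of the shapes involved (a sketch I’m adding for clarity, using the batch size of 256 from above): w1 @ batch.T is [30, 784] @ [784, 256] = [30, 256], b1 with shape [30, 1] broadcasts across the 256 columns, and the second layer gives [10, 30] @ [30, 256] = [10, 256]. So each column holds the ten class scores for one image, and softmax(dim=0) normalises down each column.

dummy_xb = torch.rand(256, 784)        # a made-up batch, purely to check the shapes
print((w1 @ dummy_xb.T + b1).shape)    # torch.Size([30, 256])
print(simple_nn(dummy_xb).shape)       # torch.Size([10, 256])
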
first_batch_predictions = simple_nn(train_xb)

first_batch_predictions.shape
torch.Size([10, 256])

Let’s see what we get from the first column (results of the first image).

first_batch_predictions[:,0]
tensor([1.0476e-03, 5.4595e-03, 9.7319e-02, 2.4645e-02, 4.4612e-03, 9.0264e-02,
        2.5988e-04, 7.6168e-01, 1.2942e-02, 1.9236e-03],
       grad_fn=<SelectBackward0>)

Just to confirm that we applied the softmax call across the right dimension, let’s ensure all ten of these values add up to 1.

sum(first_batch_predictions[:,0])
tensor(1., grad_fn=<AddBackward0>)
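
As a side note (a small sketch I’m adding, not part of the original run), here is what the dim argument controls. With our [10, 256] layout, dim=0 normalises each column, i.e. one image’s ten class scores, whereas dim=1 would wrongly normalise each class across the 256 images.

toy = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])   # pretend: 3 classes, 2 images
F.softmax(toy, dim=0).sum(dim=0)   # tensor([1., 1.]) - each image's scores sum to 1
F.softmax(toy, dim=1).sum(dim=1)   # tensor([1., 1., 1.]) - each class would sum to 1 instead
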

Loss Function

number_of_classes = 10

def one_hot(yb):
    batch_size = len(yb)
    one_hot_yb = torch.zeros(batch_size, number_of_classes)
    x_coordinates_array = torch.arange(len(one_hot_yb))
    # used `.squeeze()` because yb originally has the size (batch_size, 1) and we just want a size of (batch_size). ([1, 2, 3, ...] instead of [[1], [2], [3], ...])
    # used `.long()` because: "tensors used as indices must be long, int, byte or bool tensors"
    y_coordinates_array = yb.squeeze().long()
    # set to `1.` rather than `1` because: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source.
    one_hot_yb[x_coordinates_array, y_coordinates_array] = torch.tensor(1.)
    
    return one_hot_yb.T
def rmse(a, b):
    b = one_hot(b)
    mse = nn.MSELoss()
    loss = torch.sqrt(mse(a, b))
    
    return loss
one_hot(train_yb),one_hot(train_yb).shape,one_hot(train_yb)[:,0],train_yb[0]
(tensor([[0., 1., 0.,  ..., 0., 0., 0.],
         [1., 0., 1.,  ..., 0., 0., 1.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 1., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 torch.Size([10, 256]),
 tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]),
 tensor([1.]))

Let’s run it on our test batch predictions from above.

rmse(first_batch_predictions, train_yb)
tensor(0.3866, grad_fn=<SqrtBackward0>)
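
For a rough sense of scale (a quick sketch I’m adding, not part of the original run): a model that predicted a uniform 0.1 for every class would score sqrt((9 × 0.1² + 0.9²) / 10) ≈ 0.3, so 0.3866 from randomly initialised weights is in the expected ballpark.

uniform_preds = torch.full((10, 256), 0.1)   # every class gets probability 0.1 for every image
rmse(uniform_preds, train_yb)                # ≈ tensor(0.3000)
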

Trainability

def calc_grad(batch_inputs, batch_labels, batch_model):
    batch_preds = batch_model(batch_inputs)
    loss = rmse(batch_preds, batch_labels)
    loss.backward()
def train_epoch(dl, batch_model, params, lr):
    for xb,yb in dl:
        calc_grad(xb, yb, batch_model)
        for p in params:
            # step each parameter against its gradient, then reset the gradient for the next batch
            p.data -= p.grad*lr
            p.grad.zero_()
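
As an aside, this manual update is exactly what a basic SGD optimiser would do for us. A rough equivalent using torch.optim (just a sketch for comparison, not what this notebook actually uses) could look like this, with the same 0.01 learning rate used below:

opt = torch.optim.SGD([w1, b1, w2, b2], lr=0.01)

def train_epoch_with_optimizer(dl, batch_model, opt):
    for xb, yb in dl:
        calc_grad(xb, yb, batch_model)
        opt.step()        # performs p -= p.grad * lr for every parameter
        opt.zero_grad()   # resets the gradients, like p.grad.zero_() above
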

Validation and Metric

def get_predicted_label(pred):
    #returns index of highest value in tensor, which conveniently is also directly the digit/label that it corresponds to
    return torch.argmax(pred)
get_predicted_label(torch.tensor([0,4,3,2,6,1]))
tensor(4)

Let’s test this on some predictions from first_batch_predictions to ensure that we are getting sensible values (values from 0-9).

get_predicted_label(first_batch_predictions[:,0]),get_predicted_label(first_batch_predictions[:,1]),get_predicted_label(first_batch_predictions[:,3]),get_predicted_label(first_batch_predictions[:,5]),get_predicted_label(first_batch_predictions[:,33])
(tensor(7), tensor(3), tensor(3), tensor(3), tensor(3))

Accuracy

def batch_accuracy(preds, yb):
    #remember each column in our preds is an individual prediction, so we transpose preds in order to iterate through each prediction in our list comprehension below
    preds = torch.tensor([get_predicted_label(pred) for pred in preds.T])
    # is_correct is a tensor of True and False values
    is_correct = preds==yb.squeeze()
    # now we turn all True values into 1 and all False values into 0, then return the mean of those values
    return is_correct.float().mean()
batch_accuracy(simple_nn(train_xb[:100]),train_yb[:100])
tensor(0.0800)
def validate_epoch(dl, batch_model):
    accuracies = [batch_accuracy(batch_model(xb),yb) for xb,yb in dl]
    # turn list of tensors into one single tensor of stacked values, so that we can then calculate the mean across all those values
    stacked_tensor = torch.stack(accuracies)
    mean_tensor = stacked_tensor.mean()
    # round() works on a plain Python number rather than a tensor, so we use item() to extract the value (and then round to four decimal places)
    return round(mean_tensor.item(), 4)
validate_epoch(valid_dl, simple_nn)
0.0644

Train for Number of Epochs

lr = 0.01
params = w1,b1,w2,b2
train_epoch(train_dl, simple_nn, params, lr)
validate_epoch(valid_dl, simple_nn)
0.082

Now let’s train our model for 500 more epochs and see if it improves.

for i in range(500):
    train_epoch(train_dl, simple_nn, params, lr)
    # run validate_epoch on every 50th iteration
    if i % 50 == 0:
        print(validate_epoch(valid_dl, simple_nn), ' ')
0.1113  
0.5865  
0.7313  
0.7669  
0.8314  
0.8682  
0.8825  
0.8929  
0.8995  
0.9053  

That’s ~91% accuracy on our validation data. The single linear function model in the previous notebook reached ~87% accuracy in the same number of epochs, so we are getting higher accuracy with this two-layer network. One key thing to notice is the huge jump in accuracy from the first printout to the second: that is a drastic improvement in trainability over the single-layer linear model in the previous notebook.

I’ll now run this on the test data and submit it to the kaggle competition in order to see if it’s any good…

Submission

test_df = pd.read_csv(path/'test.csv')

test_df.describe()
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 ... 28000.000000 28000.000000 28000.000000 28000.000000 28000.000000 28000.0 28000.0 28000.0 28000.0 28000.0
mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.164607 0.073214 0.028036 0.011250 0.006536 0.0 0.0 0.0 0.0 0.0
std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 5.473293 3.616811 1.813602 1.205211 0.807475 0.0 0.0 0.0 0.0 0.0
min 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
25% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
50% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
75% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
max 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 253.000000 254.000000 193.000000 187.000000 119.000000 0.0 0.0 0.0 0.0 0.0

8 rows × 784 columns

Now we format it into what our model expects. Since we don’t have labels for this data, we’ll make a batch of inputs only (the batch will be the whole of the test data).

test_tensor = torch.tensor(test_df.values)/255

test_tensor.shape
torch.Size([28000, 784])

Let’s take a look at the first image, just for sanity.

show_image(test_tensor[0])

[Output: a 28×28 greyscale image of the first test digit]

Now let’s create a function that produces a meaningful output (digit prediction) for each image, using our model.

def predict(batch):
    preds = simple_nn(batch)
    # convert tensor to numpy value
    preds = [get_predicted_label(pred).numpy() for pred in preds.T]
    
    return preds
preds = predict(test_tensor)

preds[:5]
[array(2), array(0), array(9), array(9), array(2)]
pred_labels_series = pd.Series(preds, name="Label")

pred_labels_series
0        2
1        0
2        9
3        9
4        2
        ..
27995    9
27996    7
27997    3
27998    9
27999    2
Name: Label, Length: 28000, dtype: object
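
Note the dtype is object because each element is a 0-d numpy array rather than a plain Python integer. It still writes out fine below, but if we wanted an integer Series we could convert the predictions first (an optional tweak I’m sketching here, not what was actually run):

pred_labels_series = pd.Series([int(p) for p in preds], name="Label")   # dtype: int64
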
sample_submission = pd.read_csv(path/'sample_submission.csv')
sample_submission
ImageId Label
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
... ... ...
27995 27996 0
27996 27997 0
27997 27998 0
27998 27999 0
27999 28000 0

28000 rows × 2 columns

len(test_df)
28000
sample_submission['Label'] = pred_labels_series

sample_submission
ImageId Label
0 1 2
1 2 0
2 3 9
3 4 9
4 5 2
... ... ...
27995 27996 9
27996 27997 7
27997 27998 3
27998 27999 9
27999 28000 2

28000 rows × 2 columns

# this outputs the actual file
sample_submission.to_csv('subm.csv', index=False)
#this shows the head (first few lines)
!head subm.csv
ImageId,Label
1,2
2,0
3,9
4,9
5,2
6,7
7,0
8,3
9,0
!kaggle competitions list
!kaggle competitions files digit-recognizer 
!kaggle competitions submit -c digit-recognizer  -f ./subm.csv -m "going from single linear function model to two-linear-layer model"

This received a score of 0.90142, an improvement on the previous model yet again!

Conclusion

Using just the things learnt in the previous notebook, we have managed to get our previously non-working two-linear-layer model to work, and with better results than the single-layer model (as one would expect).

The next step will be to try out our two-linear-layer model that utilises PyTorch’s nn modules and see if any of the things we have learnt thus far will help us to get that working too. This will be done in a separate notebook.