Alongside Git

Stockroom is built to be used alongside git. This tutorial walks through a typical git workflow that uses stockroom to
- Store data
- Use that data to train a network in PyTorch
- Version the model as we go
- Tag the hyper parameters in different experiments

For this tutorial, we train a small fully connected PyTorch network to classify handwritten digits from the MNIST dataset. We have divided the whole tutorial into 4 stages.
1. Setup the repository
2. Download some data and store it in stockroom
3. Train the network and save the model + hyper parameters
4. Fine tune the hyper parameters

1. Setup the repository

In a typical software development project, we’ll have a git repository ready. Let’s make that first.

Initialize git

[1]:
!git init
Initialized empty Git repository in /home/hhsecond/mypro/stockroom/examples/.git/

Initialize stock

We need to initialize a stock repository at the same location. A stock initialization is essentially a hangar initialization (if a hangar repo doesn’t already exist at the given location) followed by the creation of a head.stock file.

[2]:
!stock init --name sherin --email a@b.c
Hangar Repo initialized at: /home/hhsecond/mypro/stockroom/examples/.hangar
Stock file created

Initial git commit

Now we need to make the first commit. Remember, we use this notebook to drive the tutorial workflow. Versioning the notebook itself is not a good idea in this case, since each checkout would change the state of the notebook and hinder us from moving forward. In a typical project workflow, however, you would version everything.

[3]:
!git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        .gitignore
        .ipynb_checkpoints/
        head.stock
        requirements.txt
        with-git.ipynb

nothing added to commit but untracked files present (use "git add" to track)
[4]:
!echo "\ndownloads" > .gitignore
!git add .gitignore head.stock
!git commit -m 'initialized repo'
[master (root-commit) bfe72a8] initialized repo
 2 files changed, 2 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 head.stock

2. Download & Store Data

As in most tutorials, we’ll build a fully connected network to predict handwritten digits from the MNIST dataset.

Download images

We download the data using the utility functions below (inspired by https://gist.github.com/goldsborough/6dd52a5e01ed73a642c1e772084bcd03).

[5]:
from urllib.request import urlretrieve
import gzip
import os
import sys


def report_download_progress(chunk_number, chunk_size, file_size):
    if file_size != -1:
        percent = min(1, (chunk_number * chunk_size) / file_size)
        bar = '#' * int(64 * percent)
        sys.stdout.write('\r0% |{:<64}| {}%'.format(bar, int(percent * 100)))


def download(destination_path, url):
    if os.path.exists(destination_path):
        print('{} already exists, skipping ...'.format(destination_path))
    else:
        print('Downloading {} ...'.format(url))
        urlretrieve(url, destination_path, reporthook=report_download_progress)

def unzip(zipped_path):
    unzipped_path = os.path.splitext(zipped_path)[0]
    if os.path.exists(unzipped_path):
        print('{} already exists, skipping ... '.format(unzipped_path))
        return
    with gzip.open(zipped_path, 'rb') as zipped_file:
        with open(unzipped_path, 'wb') as unzipped_file:
            unzipped_file.write(zipped_file.read())
            print('\nUnzipped {} ...'.format(zipped_path))
[6]:
from pathlib import Path

RESOURCES = [
    'train-images-idx3-ubyte.gz',
    'train-labels-idx1-ubyte.gz',
    't10k-images-idx3-ubyte.gz',
    't10k-labels-idx1-ubyte.gz',
]

path = Path('downloads')
path.mkdir(exist_ok=True)

for resource in RESOURCES:
    destination = os.path.join(str(path), resource)
    url = 'http://yann.lecun.com/exdb/mnist/{}'.format(resource)
    download(destination, url)
    unzip(destination)
downloads/train-images-idx3-ubyte.gz already exists, skipping ...
downloads/train-images-idx3-ubyte already exists, skipping ...
downloads/train-labels-idx1-ubyte.gz already exists, skipping ...
downloads/train-labels-idx1-ubyte already exists, skipping ...
downloads/t10k-images-idx3-ubyte.gz already exists, skipping ...
downloads/t10k-images-idx3-ubyte already exists, skipping ...
downloads/t10k-labels-idx1-ubyte.gz already exists, skipping ...
downloads/t10k-labels-idx1-ubyte already exists, skipping ...
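
The unzipped files use the IDX layout: a big-endian header whose third byte is a data type code and whose fourth byte is the number of dimensions, followed by one 32-bit size per dimension and then the raw values. The python-mnist package used in the next step takes care of this for us, so the following is only a minimal sketch of the header parsing. `read_idx_header` is a hypothetical helper, and it runs on a hand-built in-memory buffer rather than on the downloaded files.

```python
import struct


def read_idx_header(buffer):
    """Return (dtype_code, dims) parsed from the start of an IDX byte buffer."""
    # The first two bytes are zero, the third is the data type code
    # (0x08 means unsigned byte), the fourth is the number of dimensions.
    zero, dtype_code, ndim = struct.unpack_from('>HBB', buffer, 0)
    if zero != 0:
        raise ValueError('not an IDX buffer')
    # Each dimension size is a 4-byte big-endian unsigned integer.
    dims = struct.unpack_from('>{}I'.format(ndim), buffer, 4)
    return dtype_code, dims


# Hypothetical header for a file holding 2 images of 28x28 unsigned bytes.
fake_header = struct.pack('>HBB', 0, 0x08, 3) + struct.pack('>3I', 2, 28, 28)
dtype_code, dims = read_idx_header(fake_header)
print(dtype_code, dims)
```

A 28x28 image flattens to 784 values, which is the sample size used when the data is stored.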

Store to StockRoom

We need hangar columns (created here as arraysets) ready before stockroom can store data in them.

[7]:
!hangar arrayset create image INT64 784
!hangar arrayset create label INT64 1
!stock commit -m 'arrayset initialized'
Initialized Arrayset: image
Initialized Arrayset: label
Commit message:
arrayset initialized
Commit Successful. Digest: a=28a09ff56d69697bc313561b362200ae94b389d5
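
The two arraysets are declared with a fixed dtype and size (INT64 with 784 elements for `image`, INT64 with 1 element for `label`), so every sample written to them has to match. The sketch below shows one way to build matching arrays with NumPy. The `pixels` and `digit` values are hypothetical stand-ins for a single MNIST sample, and passing `dtype` explicitly is an assumption on our side: the default integer type NumPy picks for a Python list can differ between platforms.

```python
import numpy as np

# Hypothetical stand-ins for one MNIST sample: a flat list of 784 pixel
# values and an integer class label.
pixels = [0] * 784
digit = 7

# An explicit dtype avoids relying on the platform's default integer type.
img = np.array(pixels, dtype=np.int64)
# reshape(1) turns the scalar label into a one-element array.
label = np.array(digit, dtype=np.int64).reshape(1)

print(img.shape, img.dtype, label.shape, label.dtype)
```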
[8]:
from mnist import MNIST
mndata = MNIST(path)
[9]:
images, labels = mndata.load_training()
tmpimages, tmplabels = mndata.load_testing()
images.extend(tmpimages)
labels.extend(tmplabels)
[10]:
from stockroom import StockRoom
stock = StockRoom()
[11]:
from tqdm import tqdm
import numpy as np

with stock.optimize(write=True):
    for i in tqdm(range(len(images))):
        img = np.array(images[i])
        label = np.array(labels[i]).reshape(1)
        stock.data['image', i] = img
        stock.data['label', i] = label
 * Checking out COMMIT: a=28a09ff56d69697bc313561b362200ae94b389d5
100%|██████████| 70000/70000 [00:28<00:00, 2433.96it/s]
[12]:
!stock commit -m 'added data'
Commit message:
added data
Commit Successful. Digest: a=d6b2e5d8bbc397eda5448b3eadc0dc39e14c123e

3. Network training

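Let’s build a simple fully connected network in PyTorch. The training function reads the number of epochs from the stockroom tags, then saves and commits the model weights at the end of every epoch.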

[14]:
from tqdm import tqdm
import torch
from stockroom import StockRoom

def train(model, optimizer, criterion):
    stock = StockRoom()

    with stock.optimize():
        # the number of epochs comes from the tag stored in stockroom
        for epoch in range(stock.tag['epoch']):
            running_loss = 0
            trange = tqdm(range(70000))
            for i in trange:
                optimizer.zero_grad()
                # read one image, convert to float and scale pixels to [0, 1]
                sample = torch.from_numpy(stock.data['image', i]).float()
                sample /= 255
                # add a batch dimension so the loss sees a batch of one
                out = model(sample).unsqueeze(0)
                label = torch.from_numpy(stock.data['label', i])
                loss = criterion(out, label)
                running_loss += loss.item()
                loss.backward()
                optimizer.step()
                # show the average loss so far on the progress bar
                if i % 1000 == 0 and i != 0:
                    trange.set_description(str(running_loss / i))
            # version the weights at the end of every epoch
            stock.model['mnist'] = model.state_dict()
            stock.commit('added model')
[15]:
import torch.nn as nn

stock.tag['lr'] = 0.01
stock.tag['momentum'] = 0.5
stock.tag['epoch'] = 2
stock.commit('hyper params')

input_size = 784
hidden_sizes = [32, 16]
output_size = 10

model = nn.Sequential(
    nn.Linear(input_size, hidden_sizes[0]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[0], hidden_sizes[1]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[1], output_size),
    nn.LogSoftmax())
[15]:
Sequential(
  (0): Linear(in_features=784, out_features=32, bias=True)
  (1): ReLU()
  (2): Linear(in_features=32, out_features=16, bias=True)
  (3): ReLU()
  (4): Linear(in_features=16, out_features=10, bias=True)
  (5): LogSoftmax()
)
[16]:
from torch import optim

optimizer = optim.SGD(model.parameters(), lr=stock.tag['lr'], momentum=stock.tag['momentum'])
criterion = nn.NLLLoss()
 * Checking out COMMIT: a=5c291a0b2d946e3bfa359f754837a112df575bd6
 * Checking out COMMIT: a=5c291a0b2d946e3bfa359f754837a112df575bd6
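
The network ends in a LogSoftmax layer and is trained with NLLLoss. For a single sample, that combination amounts to taking the negative log-probability the network assigns to the correct class. The following is a minimal pure-Python sketch of that computation, not PyTorch's implementation. The function names and the example logits are made up for illustration.

```python
import math


def log_softmax(logits):
    """Log-probabilities for a list of raw scores."""
    # Subtracting the maximum keeps exp() from overflowing.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]


# Hypothetical raw scores for a 3-class problem; class 0 is the target.
logits = [2.0, 0.5, -1.0]
log_probs = log_softmax(logits)
loss = nll_loss(log_probs, 0)
print(loss)
```

The loss approaches zero as the probability of the target class approaches one, which is why a falling running loss in the progress bar indicates better predictions.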
[17]:
train(model, optimizer, criterion)
 * Checking out COMMIT: a=5c291a0b2d946e3bfa359f754837a112df575bd6
  0%|          | 0/70000 [00:00<?, ?it/s]/home/hhsecond/anaconda3/envs/stockroom/lib/python3.7/site-packages/torch/nn/modules/container.py:92: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  input = module(input)
0.33214209180101045: 100%|██████████| 70000/70000 [01:20<00:00, 869.94it/s]
0.20504333917206713: 100%|██████████| 70000/70000 [01:21<00:00, 854.18it/s]

4. Fine tuning

The loss doesn’t go below 0.2 with the hyper parameters we have. Let’s try increasing the number of neurons in the hidden layers.

[18]:
hidden_sizes = [128, 64]

model = nn.Sequential(
    nn.Linear(input_size, hidden_sizes[0]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[0], hidden_sizes[1]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[1], output_size),
    nn.LogSoftmax())
[18]:
Sequential(
  (0): Linear(in_features=784, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=64, bias=True)
  (3): ReLU()
  (4): Linear(in_features=64, out_features=10, bias=True)
  (5): LogSoftmax()
)
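
To get a feel for how much capacity the wider hidden layers add, we can count the trainable parameters of both configurations. Each `Linear` layer with a bias holds `in_features * out_features` weights plus `out_features` biases. `count_parameters` below is a hypothetical helper written for this comparison only.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


small = count_parameters([784, 32, 16, 10])    # hidden_sizes = [32, 16]
large = count_parameters([784, 128, 64, 10])   # hidden_sizes = [128, 64]
print(small, large)  # 25818 109386
```

The wider network has roughly four times as many parameters, which is consistent with the slower iterations per second in the training output.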
[20]:
optimizer = optim.SGD(model.parameters(), lr=stock.tag['lr'], momentum=stock.tag['momentum'])

train(model, optimizer, criterion)
 * Checking out COMMIT: a=8b8b7a2f7966acf1c3b5820470fdc34580ef6aaa
 * Checking out COMMIT: a=8b8b7a2f7966acf1c3b5820470fdc34580ef6aaa
 * Checking out COMMIT: a=8b8b7a2f7966acf1c3b5820470fdc34580ef6aaa
0.22921190452411824: 100%|██████████| 70000/70000 [03:47<00:00, 307.77it/s]
0.12486177534811682: 100%|██████████| 70000/70000 [03:25<00:00, 340.91it/s]

Now that the model has enough learning capacity, we can try reducing the learning rate to keep the loss from jittering across the valley.

[21]:
stock.tag['lr'] = 0.003
stock.commit('new lr value')
optimizer = optim.SGD(model.parameters(), lr=stock.tag['lr'], momentum=stock.tag['momentum'])

train(model, optimizer, criterion)
 * Checking out COMMIT: a=b54ed6f62420c590e2d3206907e239dfa17945f2
 * Checking out COMMIT: a=b54ed6f62420c590e2d3206907e239dfa17945f2
 * Checking out COMMIT: a=b54ed6f62420c590e2d3206907e239dfa17945f2
0.05775309108767264: 100%|██████████| 70000/70000 [04:33<00:00, 255.85it/s]
0.0373574975491017: 100%|██████████| 70000/70000 [04:57<00:00, 235.37it/s]
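
The jittering mentioned above can be reproduced on a toy problem. The sketch below runs plain gradient descent (no momentum) on a one-dimensional quadratic and counts how often the parameter jumps from one side of the minimum to the other. The curvature, the starting point and both learning rates are made-up values chosen for illustration. They are unrelated to the MNIST loss surface or to the values stored in the tags.

```python
def descend(lr, steps=20, curvature=10.0, start=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x**2, returning the path."""
    x = start
    path = [x]
    for _ in range(steps):
        grad = curvature * x  # derivative of f at x
        x -= lr * grad
        path.append(x)
    return path


def sign_flips(path):
    """Number of steps that land on the opposite side of the minimum at 0."""
    return sum(1 for a, b in zip(path, path[1:]) if a * b < 0)


high = descend(lr=0.18)  # overshoots the minimum on every step
low = descend(lr=0.05)   # approaches the minimum from one side

print(sign_flips(high), sign_flips(low))
print(abs(high[-1]), abs(low[-1]))
```

With the larger step the parameter bounces across the valley and ends up further from the minimum after the same number of steps. With the smaller step it settles without crossing.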

Conclusion

Great! We now have a well trained MNIST classifier, along with the data and the hyper parameters we used, all saved in stockroom. This tutorial skips some practical training methodologies, such as splitting the dataset into training, validation and test sets. The purpose of this example is to show how stockroom could be used in a real world scenario. Stockroom is still under active development, and more features, such as dataloaders for PyTorch and TensorFlow, are planned.
