# Minimal Viable Deep Learning Infrastructure
Deep learning pipelines are hard to reason about and difficult to code consistently.
Instead of remembering where to put everything and making a different choice for each project, this repository is an attempt to standardize on good defaults.
Think of it like a mini pytorch-lightning, with all the gory internals exposed for extension and modification.

This project lives here: https://github.com/publicmatt.com/ml_pipeline.
## Usage

```
make help  # lists available options
```

### Install

Install the conda requirements:

```
make install
```

### Data

Download the MNIST data from pjreddie's website:

```
make data
```

### Run

Run the code on MNIST:

```
make run
```
## Tutorial
The motivation for building a template for deep learning pipelines is this: deep learning is hard enough without every codebase being a little different.
Especially in a research lab, standardizing on a few components makes switching between projects easier.

In this template, you'll see the following:
### directory structure
```
.
├── README.md
├── environment.yml
├── launch.sh
├── Makefile
├── data
│   ├── mnist_test.csv
│   └── mnist_train.csv
├── docs
│   └── 2023-01-26.md
├── src
│   ├── config
│   │   └── main.yaml
│   ├── data
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── collate.py
│   │   └── dataset.py
│   ├── eval.py
│   ├── __init__.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── cnn.py
│   │   └── linear.py
│   ├── pipeline
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── logger.py
│   │   ├── runner.py
│   │   └── utils.py
│   ├── sample.py
│   └── train.py
└── test
    ├── __init__.py
    └── test_pipeline.py

8 directories, 25 files
```
### what and why?
#### environment.yml
- Hutch Research has standardized on conda.
- here's a good tutorial on getting that set up: see Seth's email.
#### launch.sh or Makefile
- to install and run stuff.
- houses common operations and scripts.
- `launch.sh` is used to dispatch training.
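A minimal sketch of what the Makefile targets behind the Usage section might look like (the recipes here are assumptions, not copied from the repo):

```make
.PHONY: help install data run

help: ## list available targets
	@grep -E '^[a-z]+:.*##' Makefile

install: ## create the conda environment
	conda env create -f environment.yml

data: ## download the MNIST csv files
	mkdir -p data
	curl -o data/mnist_train.csv https://pjreddie.com/media/files/mnist_train.csv
	curl -o data/mnist_test.csv https://pjreddie.com/media/files/mnist_test.csv

run: ## launch training with the default config
	python -m src.train
```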
#### README.md
- explains the project and how to run it.
- lists authors.
- lists resources that new collaborators might need.
- lives in the root dir.
- can also exist inside any dir.
- reads nicely on github.com.
#### docs/
- switching projects is easier with these in place.
- organize them by meeting or weekly agenda.
- generally a collection of markdown files.
#### test/
- TODO
- pytest: unit testing.
- good for checking data shapes; not sure what else yet.
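A shape check might look like this, for example (a sketch; `MnistDataset` and its constructor are assumptions about what src/data/dataset.py provides):

```python
# test/test_pipeline.py
from src.data.dataset import MnistDataset  # assumed name/location


def test_sample_shape():
    dataset = MnistDataset("data/mnist_test.csv")  # assumed constructor
    image, label = dataset[0]
    # every MNIST sample should be a 1x28x28 image with an int label
    assert image.shape == (1, 28, 28)
    assert 0 <= label <= 9
```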
#### data/
- raw data.
- do not commit these to the repo, generally:

```
echo "*.csv" >> data/.gitignore
```
#### __init__.py
- creates modules out of a dir.
- `import module` works b/c of these.
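With the empty `__init__.py` files in place, for instance, the directories in the tree above import as packages:

```python
from src.model import cnn, linear        # src/model/ is a package
from src.pipeline import runner, logger  # so is src/pipeline/
```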
#### src/model/
- if you have a large project, you might have multiple architectures/models.
- small projects might just have `model/VGG.py` or `model/3d_unet.py`.
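A minimal `model/linear.py`, say, is just a thin `nn.Module` (a sketch; the real file's contents are an assumption):

```python
# src/model/linear.py
import torch
from torch import nn


class Linear(nn.Module):
    """Flatten a 28x28 image and classify it with one linear layer."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(28 * 28, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x.flatten(start_dim=1))
```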
#### src/config
- based on the Hydra python package (see the config section below).
- quickly change run variables and hyperparameters.
#### src/pipeline
- where the magic happens.
- `train.py` creates all the objects, hands them off to the runner for batching, and monitors each epoch (see the sketch under architecture below).
#### testing
- `if __name__ == "__main__"` is a good way to test things.
- enables lots of breakpoints.
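For example, at the bottom of a module (the dataset path and class are assumptions):

```python
# e.g. at the bottom of src/data/dataset.py
if __name__ == "__main__":
    dataset = MnistDataset("data/mnist_test.csv")  # defined above in the file
    image, label = dataset[0]
    breakpoint()  # drop into the debugger to inspect shapes interactively
    print(image.shape, label)
```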
#### config
- Hydra config.
- quickly experiment with hyperparameters.
- good way to define env. variables.
- lr, workers, batch_size
- debug
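On the Python side, Hydra hands the config to the entrypoint roughly like this (a sketch; the exact field names are assumptions based on the bullets above):

```python
# src/train.py
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="config", config_name="main", version_base=None)
def main(cfg: DictConfig):
    # values come from src/config/main.yaml; any field can be
    # overridden on the command line, e.g.:
    #   python -m src.train lr=3e-4 batch_size=128 debug=true
    print(cfg.lr, cfg.workers, cfg.batch_size, cfg.debug)


if __name__ == "__main__":
    main()
```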
#### data
- collate functions!
- datasets.
- dataloader.
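The three fit together like so (a sketch, assuming the csv layout of pjreddie's files: the label first, then 784 pixel values per row):

```python
# src/data/dataset.py and src/data/collate.py, roughly
import csv

import torch
from torch.utils.data import DataLoader, Dataset


class MnistDataset(Dataset):  # assumed name
    def __init__(self, path: str):
        with open(path) as f:
            self.rows = [[int(x) for x in row] for row in csv.reader(f)]

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, i: int):
        label, *pixels = self.rows[i]
        image = torch.tensor(pixels, dtype=torch.float32).reshape(1, 28, 28) / 255
        return image, label


def collate(batch):
    """Stack samples into (B, 1, 28, 28) images and (B,) labels."""
    images, labels = zip(*batch)
    return torch.stack(images), torch.tensor(labels)


loader = DataLoader(MnistDataset("data/mnist_train.csv"),
                    batch_size=64, num_workers=2, collate_fn=collate)
```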
#### formatting python
- python type hints.
- automatic formatting with the `black` package.
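For instance, a small typed helper that `black` would keep consistently formatted (a hypothetical function, not one from the repo):

```python
import torch


def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of predicted classes that match their labels."""
    return (logits.argmax(dim=1) == labels).float().mean().item()
```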
#### running
- tqdm to track progress.
- wandb for logging.
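Inside the epoch loop the two combine roughly like this (a sketch; the project name and metrics are assumptions, and `epochs`/`loader`/`step` stand in for the real objects):

```python
import wandb
from tqdm import tqdm

wandb.init(project="ml_pipeline")  # assumed project name

for epoch in range(epochs):
    for images, labels in tqdm(loader, desc=f"epoch {epoch}"):
        loss = step(images, labels)  # hypothetical training step
        wandb.log({"loss": loss.item(), "epoch": epoch})
```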
#### architecture
- dataloader, optimizer, criterion, device, and state are constructed in `main`, but passed to an object that runs batches.
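Concretely, the hand-off might look like this (a sketch; the real `Runner` interface in src/pipeline/runner.py is an assumption):

```python
import torch
from torch import nn


class Runner:
    """Owns the batch loop; everything it needs is injected from main."""

    def __init__(self, model, loader, optimizer, criterion, device):
        self.model = model.to(device)
        self.loader = loader
        self.optimizer = optimizer
        self.criterion = criterion
        self.device = device

    def run_epoch(self) -> float:
        total = 0.0
        for images, labels in self.loader:
            images, labels = images.to(self.device), labels.to(self.device)
            self.optimizer.zero_grad()
            loss = self.criterion(self.model(images), labels)
            loss.backward()
            self.optimizer.step()
            total += loss.item()
        return total / len(self.loader)


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    loader = ...  # built from src/data as sketched above
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    runner = Runner(model, loader, optimizer, nn.CrossEntropyLoss(), device)
    runner.run_epoch()
```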