# Minimal Viable Deep Learning Infrastructure
Deep learning pipelines are hard to reason about and difficult to code consistently.
Instead of remembering where to put everything and making a different choice for each project, this repository is an attempt to standardize on good defaults.
Think of it like a mini pytorch-lightning, with all the gory internals exposed for extension and modification.

This project lives here: https://github.com/publicmatt.com/ml_pipeline.
## Usage

```
make help  # lists available options
```

### Install

Install the conda requirements:

```
make install
```

### Data

Download the MNIST data from pjreddie's website:

```
make data
```

### Run

Run the code on MNIST:

```
make run
```
## Tutorial
The motivation for building a template for deep learning pipelines is this: deep learning is hard enough without every codebase being a little different.
Especially in a research lab, standardizing on a few components makes switching between projects easier.

In this template, you'll see the following:
### directory structure
```
.
├── README.md
├── environment.yml
├── launch.sh
├── Makefile
├── data
│   ├── mnist_test.csv
│   └── mnist_train.csv
├── docs
│   └── 2023-01-26.md
├── src
│   ├── config
│   │   └── main.yaml
│   ├── data
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── collate.py
│   │   └── dataset.py
│   ├── eval.py
│   ├── __init__.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── cnn.py
│   │   └── linear.py
│   ├── pipeline
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── logger.py
│   │   ├── runner.py
│   │   └── utils.py
│   ├── sample.py
│   └── train.py
└── test
    ├── __init__.py
    └── test_pipeline.py

8 directories, 25 files
```
### what and why?
#### environment.yml
- Hutch Research has standardized on conda.
- here's a good tutorial on getting that set up: see Seth's email.
#### launch.sh or Makefile
- to install and run stuff.
- houses common operations and scripts.
- `launch.sh` is used to dispatch training.
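A minimal sketch of what the Makefile targets behind the Usage section might look like (the recipes here are assumptions, not copied from the repo):

```make
.PHONY: help install data run

help: ## list available targets
	@grep -E '^[a-z]+:.*##' Makefile

install: ## create the conda environment
	conda env create -f environment.yml

data: ## download the MNIST csv files
	mkdir -p data
	curl -o data/mnist_train.csv https://pjreddie.com/media/files/mnist_train.csv
	curl -o data/mnist_test.csv https://pjreddie.com/media/files/mnist_test.csv

run: ## launch training with the default config
	python -m src.train
```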
#### README.md
- explains the project and how to run it.
- lists authors.
- lists resources that new collaborators might need.
- lives in the root dir.
- can also exist inside any dir.
- reads nicely on github.com.
#### docs/
- switching projects is easier with these in place.
- organize them by meeting or weekly agenda.
- generally a collection of markdown files.
#### test/
- TODO
- pytest: unit testing.
- good for checking data shapes; not sure what else yet.
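A shape check might look like this, for example (a sketch; `MnistDataset` and its constructor are assumptions about what src/data/dataset.py provides):

```python
# test/test_pipeline.py
from src.data.dataset import MnistDataset  # assumed name/location


def test_sample_shape():
    dataset = MnistDataset("data/mnist_test.csv")  # assumed constructor
    image, label = dataset[0]
    # every MNIST sample should be a 1x28x28 image with an int label
    assert image.shape == (1, 28, 28)
    assert 0 <= label <= 9
```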
#### data/
- raw data.
- do not commit these to the repo, generally:

```
echo "*.csv" >> data/.gitignore
```
#### __init__.py
- creates modules out of a dir.
- `import module` works b/c of these.
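With the empty `__init__.py` files in place, for instance, the directories in the tree above import as packages:

```python
from src.model import cnn, linear        # src/model/ is a package
from src.pipeline import runner, logger  # so is src/pipeline/
```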
#### src/model/
- if you have a large project, you might have multiple architectures/models.
- small projects might just have `model/VGG.py` or `model/3d_unet.py`.
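A minimal `model/linear.py`, say, is just a thin `nn.Module` (a sketch; the real file's contents are an assumption):

```python
# src/model/linear.py
import torch
from torch import nn


class Linear(nn.Module):
    """Flatten a 28x28 image and classify it with one linear layer."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(28 * 28, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x.flatten(start_dim=1))
```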
#### src/config
- based on the Hydra python package (see the config section below).
- quickly change run variables and hyperparameters.
#### src/pipeline
- where the magic happens.
- `train.py` creates all the objects, hands them off to the runner for batching, and monitors each epoch (see the sketch under architecture below).
#### testing
- `if __name__ == "__main__"` is a good way to test things.
- enables lots of breakpoints.
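For example, at the bottom of a module (the dataset path and class are assumptions):

```python
# e.g. at the bottom of src/data/dataset.py
if __name__ == "__main__":
    dataset = MnistDataset("data/mnist_test.csv")  # defined above in the file
    image, label = dataset[0]
    breakpoint()  # drop into the debugger to inspect shapes interactively
    print(image.shape, label)
```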
#### config
- Hydra config.
- quickly experiment with hyperparameters.
- good way to define env. variables.
- lr, workers, batch_size
- debug
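On the Python side, Hydra hands the config to the entrypoint roughly like this (a sketch; the exact field names are assumptions based on the bullets above):

```python
# src/train.py
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="config", config_name="main", version_base=None)
def main(cfg: DictConfig):
    # values come from src/config/main.yaml; any field can be
    # overridden on the command line, e.g.:
    #   python -m src.train lr=3e-4 batch_size=128 debug=true
    print(cfg.lr, cfg.workers, cfg.batch_size, cfg.debug)


if __name__ == "__main__":
    main()
```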
#### data
- collate functions!
- datasets.
- dataloader.
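The three fit together like so (a sketch, assuming the csv layout of pjreddie's files: the label first, then 784 pixel values per row):

```python
# src/data/dataset.py and src/data/collate.py, roughly
import csv

import torch
from torch.utils.data import DataLoader, Dataset


class MnistDataset(Dataset):  # assumed name
    def __init__(self, path: str):
        with open(path) as f:
            self.rows = [[int(x) for x in row] for row in csv.reader(f)]

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, i: int):
        label, *pixels = self.rows[i]
        image = torch.tensor(pixels, dtype=torch.float32).reshape(1, 28, 28) / 255
        return image, label


def collate(batch):
    """Stack samples into (B, 1, 28, 28) images and (B,) labels."""
    images, labels = zip(*batch)
    return torch.stack(images), torch.tensor(labels)


loader = DataLoader(MnistDataset("data/mnist_train.csv"),
                    batch_size=64, num_workers=2, collate_fn=collate)
```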
#### formatting python
- python type hints.
- automatic formatting with the `black` package.
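For instance, a small typed helper that `black` would keep consistently formatted (a hypothetical function, not one from the repo):

```python
import torch


def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of predicted classes that match their labels."""
    return (logits.argmax(dim=1) == labels).float().mean().item()
```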
#### running
- tqdm to track progress.
- wandb for logging.
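Inside the epoch loop the two combine roughly like this (a sketch; the project name and metrics are assumptions, and `epochs`/`loader`/`step` stand in for the real objects):

```python
import wandb
from tqdm import tqdm

wandb.init(project="ml_pipeline")  # assumed project name

for epoch in range(epochs):
    for images, labels in tqdm(loader, desc=f"epoch {epoch}"):
        loss = step(images, labels)  # hypothetical training step
        wandb.log({"loss": loss.item(), "epoch": epoch})
```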
#### architecture
- dataloader, optimizer, criterion, device, and state are constructed in `main`, but passed to an object that runs batches.
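Concretely, the hand-off might look like this (a sketch; the real `Runner` interface in src/pipeline/runner.py is an assumption):

```python
import torch
from torch import nn


class Runner:
    """Owns the batch loop; everything it needs is injected from main."""

    def __init__(self, model, loader, optimizer, criterion, device):
        self.model = model.to(device)
        self.loader = loader
        self.optimizer = optimizer
        self.criterion = criterion
        self.device = device

    def run_epoch(self) -> float:
        total = 0.0
        for images, labels in self.loader:
            images, labels = images.to(self.device), labels.to(self.device)
            self.optimizer.zero_grad()
            loss = self.criterion(self.model(images), labels)
            loss.backward()
            self.optimizer.step()
            total += loss.item()
        return total / len(self.loader)


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    loader = ...  # built from src/data as sketched above
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    runner = Runner(model, loader, optimizer, nn.CrossEntropyLoss(), device)
    runner.run_epoch()
```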