Minimal Viable Deep Learning Infrastructure
Deep learning pipelines are hard to reason about and difficult to code consistently.
Instead of remembering where to put everything and making a different choice for each project, this repository is an attempt to standardize on good defaults.
Think of it like a mini PyTorch Lightning, with all the gory internals exposed for extension and modification.
This project lives here: https://github.com/publicmatt.com/ml_pipeline.
Usage
make help # lists available options.
Install:
Install the conda requirements:
make install
Data:
Download the MNIST data from PJReddie's website:
make data
Run:
Run the code on MNIST with the following command:
make run
Tutorial
The motivation for building a template for deep learning pipelines is this: deep learning is hard enough without every codebase being a little different.
Especially in a research lab, standardizing on a few components makes switching between projects easier.
In this template, you'll see the following:
directory structure
.
├── README.md
├── environment.yml
├── launch.sh
├── Makefile
├── data
│   ├── mnist_test.csv
│   └── mnist_train.csv
├── docs
│   └── 2023-01-26.md
├── src
│   ├── config
│   │   └── main.yaml
│   ├── data
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── collate.py
│   │   └── dataset.py
│   ├── eval.py
│   ├── __init__.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── cnn.py
│   │   └── linear.py
│   ├── pipeline
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── logger.py
│   │   ├── runner.py
│   │   └── utils.py
│   ├── sample.py
│   └── train.py
└── test
    ├── __init__.py
    └── test_pipeline.py
8 directories, 27 files
what and why?
- `environment.yml`: hutch research has standardized on conda.
  - here's a good tutorial on getting that set up: seth email
 
- `launch.sh` or `Makefile`: to install and run stuff.
  - houses common operations and scripts.
  - `launch.sh` to dispatch training.
 
- `README.md`: explain the project and how to run it.
  - list authors.
  - list resources that new collaborators might need.
  - lives in the root level dir.
  - can also exist inside any dir.
  - reads nicely on github.com.
 
- `docs/`: switching projects is easier with these in place.
  - organize them by meeting, or weekly agenda.
  - generally a collection of markdown files.
 
- `test/`: TODO
  - pytest: unit testing.
  - good for checking data shape. not sure what else.
 
- `data/`: raw data
  - do not commit these to repo generally.
    - `echo "*.csv" >> data/.gitignore`
 
 
- `__init__.py`: creates modules out of dir.
  - `import module` works b/c of these.
 
- `src/model/`: if you have a large project, you might have multiple architectures/models.
  - small projects might just have `model/VGG.py` or `model/3d_unet.py`.
 
- `src/config`: based on the hydra python package.
  - quickly change run variables and hyperparameters.
 
- `src/pipeline`: where the magic happens.
  - `train.py` creates all the objects, hands them off to runner for batching, monitors each epoch.
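
For the `test/` entry above, a data-shape check might look like the sketch below. It is not the contents of `test/test_pipeline.py`: `parse_row` and `test_row_shape` are hypothetical names, and the sketch assumes each CSV row is a label followed by 28x28 pixel values. pytest collects any function named `test_*`, so no pytest import is needed.

```python
# Hypothetical parser and test; names are illustrative, not taken from the repository.
EXPECTED_PIXELS = 28 * 28  # MNIST images are 28x28 grayscale


def parse_row(line: str) -> tuple[int, list[int]]:
    """Split one CSV line into (label, pixels), assuming the label comes first."""
    values = [int(v) for v in line.strip().split(",")]
    return values[0], values[1:]


def test_row_shape() -> None:
    # Build a fake row instead of reading data/ so the test runs anywhere.
    fake_line = ",".join(["7"] + ["0"] * EXPECTED_PIXELS)
    label, pixels = parse_row(fake_line)
    assert 0 <= label <= 9
    assert len(pixels) == EXPECTED_PIXELS
```

Shape checks like this tend to catch off-by-one column errors before they reach the model.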
 
testing
- `if __name__ == "__main__"`: good way to test things.
  - enables lots of breakpoints.
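
A minimal sketch of that pattern, assuming a small helper module. `flatten` is a made-up function for illustration and does not come from this repository. The guarded block only runs when the file is executed directly, so it is a cheap place for a smoke check or a `breakpoint()`.

```python
# Hypothetical module; not the repository's code.
def flatten(image: list[list[int]]) -> list[int]:
    """Flatten a 2D image (list of rows) into a single list of pixels."""
    return [pixel for row in image for pixel in row]


if __name__ == "__main__":
    # Runs only on direct execution (python this_file.py), never on import.
    sample = [[0, 1], [2, 3]]
    out = flatten(sample)
    # breakpoint()  # uncomment to drop into pdb and inspect `out`
    assert out == [0, 1, 2, 3]
    print("flatten ok:", out)
```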
config
- Hydra config.
  - quickly experiment with hyperparameters
  - good way to define env. variables
    - lr, workers, batch_size
    - debug
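
To make the idea concrete without depending on Hydra, here is a standard-library stand-in. It is not Hydra and not the contents of `src/config/main.yaml`. The keys come from the list above, the default values are invented, and `apply_overrides` is a hypothetical helper that mimics the `key=value` override style Hydra accepts on the command line.

```python
from dataclasses import dataclass, fields


@dataclass
class Config:
    # Keys named in the list above; the default values are made up for illustration.
    lr: float = 1e-3
    workers: int = 4
    batch_size: int = 64
    debug: bool = False


def apply_overrides(cfg: Config, overrides: list[str]) -> Config:
    """Apply `key=value` strings to the config, converting to each field's type."""
    types = {f.name: f.type for f in fields(cfg)}
    for item in overrides:
        key, _, raw = item.partition("=")
        if key not in types:
            raise KeyError(f"unknown config key: {key}")
        kind = types[key]
        # bool("false") is True in Python, so booleans need explicit parsing.
        value = raw.lower() in ("1", "true", "yes") if kind is bool else kind(raw)
        setattr(cfg, key, value)
    return cfg


cfg = apply_overrides(Config(), ["lr=0.01", "batch_size=128", "debug=true"])
print(cfg)
```

Keeping every run variable in one typed object means an experiment is described by its overrides alone.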
 
 
data
- collate functions!
- datasets.
- dataloader.
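
A collate function turns a list of individual samples into one batch. The sketch below uses plain Python lists so it runs anywhere, and it is not the code in `src/data/collate.py`. `Sample` and `collate` are hypothetical names. With PyTorch, a function like this would be passed as the `collate_fn` argument of `torch.utils.data.DataLoader` and would typically return tensors instead of lists.

```python
Sample = tuple[list[float], int]  # (pixels, label); a hypothetical sample type


def collate(samples: list[Sample]) -> tuple[list[list[float]], list[int]]:
    """Turn a list of (pixels, label) pairs into (batch_of_pixels, batch_of_labels)."""
    pixels = [p for p, _ in samples]
    labels = [y for _, y in samples]
    # Fail early on ragged input; a real collate function might pad instead.
    if len({len(p) for p in pixels}) > 1:
        raise ValueError("samples have different lengths")
    return pixels, labels


batch = collate([([0.0, 0.5], 3), ([1.0, 0.25], 7)])
```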
formatting python
- python type hints.
- automatic formatting with the `black` package.
running
- tqdm to track progress.
- wandb for logging.
architecture
- dataloader, optimizer, criterion, device, state are constructed in main, but passed to an object that runs batches.
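
A stripped-down sketch of that hand-off, assuming nothing about the real `src/pipeline/runner.py`. `Runner`, `run_epoch`, `double`, and `mse` are hypothetical names, and plain lists and functions stand in for tensors and modules. The optimizer, device, and state would be passed in the same way and are left out to keep the example short.

```python
from typing import Callable, Iterable

Batch = tuple[list[float], list[float]]  # (inputs, targets)


class Runner:
    """Runs batches; every collaborator is built elsewhere and passed in."""

    def __init__(
        self,
        dataloader: Iterable[Batch],
        model: Callable[[list[float]], list[float]],
        criterion: Callable[[list[float], list[float]], float],
    ) -> None:
        self.dataloader = dataloader
        self.model = model
        self.criterion = criterion

    def run_epoch(self) -> float:
        """Return the mean loss over all batches in the dataloader."""
        losses = [self.criterion(self.model(x), y) for x, y in self.dataloader]
        return sum(losses) / len(losses)


def double(xs: list[float]) -> list[float]:
    """Toy model: multiply every input by two."""
    return [2.0 * x for x in xs]


def mse(pred: list[float], target: list[float]) -> float:
    """Mean squared error over one batch."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def main() -> float:
    # Construct everything here, then hand it to the runner.
    dataloader = [([1.0, 2.0], [2.0, 4.0]), ([3.0], [6.0])]
    return Runner(dataloader, double, mse).run_epoch()


print(main())
```

Because the runner only sees what it is given, swapping the model or the loss is a one-line change in `main`.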