ml_pipeline/README.md

# Mimimal Viable Deep Learning Infrastructure

Deep learning pipelines are hard to reason about and difficult to code consistently.

Instead of remembering where to put everything and making a different choice for each project, this repository is an attempt to standardize on good defaults.

Think of it like a mini-pytorch lightening, with all the fory internals exposed for extension and modification.

This project lives here: [https://github.com/publicmatt.com/ml_pipeline](https://github.com/publicmatt.com/ml_pipeline).


# Usage

```bash
make help # lists available options.
```

## Install:

Install the conda requirements:

```bash
make install
```

## Data:

Download mnist data from PJReadie's website:

```bash
make data
```

## Run:

Run the code on MNIST with the following command:

```bash
make run
```

# Tutorial

The motivation for building a template for deep learning pipelines is this: deep learning is hard enough without every code baase being a little different.

Especially in a research lab, standardizing on a few components makes switching between projects easier.

In this template, you'll see the following:

## directory structure
```
.
├── README.md
├── environment.yml
├── launch.sh
├── Makefile
├── data
│   ├── mnist_test.csv
│   └── mnist_train.csv
├── docs
│   └── 2023-01-26.md
├── src
│   ├── config
│   │   └── main.yaml
│   ├── data
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── collate.py
│   │   └── dataset.py
│   ├── eval.py
│   ├── __init__.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── cnn.py
│   │   └── linear.py
│   ├── pipeline
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── logger.py
│   │   ├── runner.py
│   │   └── utils.py
│   ├── sample.py
│   └── train.py
└── test
    ├── __init__.py
    └── test_pipeline.py

8 directories, 25 files

```

## what and why?

- `environment.yml`
    - hutch research has standardized on conda
    - here's a good tutorial on getting that setup: [seth email](emailto:bassetis@wwu.edu)
- `launch.sh` or `Makefile`
    - to install and run stuff.
    - houses common operations and scripts.
    - `launch.sh` to dispatch training.
- `README.md`
    - explain the project and how to run it.
    - list authors.
    - list resources that new collaborators might need.
    - root level dir.
    - can exist inside any dir.
    - reads nicely on github.com.
- `docs/`
    - switching projects is easier with these in place.
    - organize them by meeting, or weekly agenda.
    - generally collection of markdown files.
- `test/`
    - TODO
    - pytest: unit testing.
    - good for data shape. not sure what else.
- `data/`
    - raw data
    - do not commit these to repo generally.
        - `echo "*.csv" >> data/.gitignore`
- `__init__.py`
    - creates modules out of dir.
    - `import module` works b/c of these.
- `src/model/`
    - if you have a large project, you might have multiple architectures/models.
    - small projects might just have `model/VGG.py` or `model/3d_unet.py`.
- `src/config`
    - based on hydra python package.
    - quickly change run variables and hyperparameters.
- `src/pipeline`
    - where the magic happens.
    - `train.py` creates all the objects, hands them off to runner for batching, monitors each epoch.

## testing
- `if __name__ == "__main__"`.
    - good way to test things
- enables lots breakpoints.

## config
- Hydra config.
    - quickly experiment with hyperparameters
    - good way to define env. variables
        - lr, workers, batch_size
        - debug

## data
- collate functions!
- datasets.
- dataloader.

## formatting python
- python type hints.
- automatic linting with the `black` package.

## running
- tqdm to track progress.
- wandb for logging.

## architecture
- dataloader, optimizer, criterion, device, state are constructed in main, but passed to an object that runs batches.