# Minimal Viable Deep Learning Infrastructure
Deep learning pipelines are hard to reason about and difficult to code consistently.
Instead of remembering where to put everything and making a different choice for each project, this repository is an attempt to standardize on good defaults.
Think of it as a mini PyTorch Lightning, with all the gory internals exposed for extension and modification.
This project lives here: [https://github.com/publicmatt.com/ml_pipeline](https://github.com/publicmatt.com/ml_pipeline).
# Usage
```bash
make help # lists available options.
```
## Install:
Install the conda requirements:
```bash
make install
```
## Data:
Download the MNIST data (CSV format) from pjreddie's website:
```bash
make data
```
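
Each row of these CSVs is one image: a label followed by 784 pixel values. A minimal sketch of loading them, assuming that label-first layout:

```python
import numpy as np

# Each row is [label, pixel_0, ..., pixel_783] (assumed layout).
raw = np.loadtxt("data/mnist_train.csv", delimiter=",")
labels = raw[:, 0].astype(np.int64)               # digit labels 0-9
images = raw[:, 1:].reshape(-1, 28, 28) / 255.0   # scale pixels to [0, 1]
print(images.shape, labels.shape)                 # (60000, 28, 28) (60000,)
```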
## Run:
Run the code on MNIST with the following command:
```bash
make run
```
# Tutorial
The motivation for building a template for deep learning pipelines is this: deep learning is hard enough without every code base being a little different.
Especially in a research lab, standardizing on a few components makes switching between projects easier.
In this template, you'll see the following:
## directory structure
```
.
├── README.md
├── environment.yml
├── launch.sh
├── Makefile
├── data
│   ├── mnist_test.csv
│   └── mnist_train.csv
├── docs
│   └── 2023-01-26.md
├── src
│   ├── config
│   │   └── main.yaml
│   ├── data
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── collate.py
│   │   └── dataset.py
│   ├── eval.py
│   ├── __init__.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── cnn.py
│   │   └── linear.py
│   ├── pipeline
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── logger.py
│   │   ├── runner.py
│   │   └── utils.py
│   ├── sample.py
│   └── train.py
└── test
    ├── __init__.py
    └── test_pipeline.py
8 directories, 25 files
```
## what and why?
- `environment.yml`
    - Hutch Research has standardized on conda.
    - for a good tutorial on getting that set up, [email Seth](mailto:bassetis@wwu.edu).
- `launch.sh` or `Makefile`
    - to install and run stuff.
    - houses common operations and scripts.
    - `launch.sh` to dispatch training.
- `README.md`
    - explains the project and how to run it.
    - lists authors.
    - lists resources that new collaborators might need.
    - lives at the root level, but one can exist inside any dir.
    - reads nicely on github.com.
- `docs/`
    - switching projects is easier with these in place.
    - organize them by meeting, or weekly agenda.
    - generally a collection of markdown files.
- `test/`
    - TODO
    - pytest: unit testing (see the sketch after this list).
    - good for data shapes; not sure what else.
- `data/`
    - raw data.
    - do not commit these to the repo, generally: `echo "*.csv" >> data/.gitignore`
- `__init__.py`
    - turns a dir into a module.
    - `import module` works because of these.
- `src/model/`
    - a large project might have multiple architectures/models.
    - a small project might just have `model/VGG.py` or `model/3d_unet.py`.
- `src/config`
    - based on the hydra python package (see the config section below).
    - quickly change run variables and hyperparameters.
- `src/pipeline`
    - where the magic happens.
    - `train.py` creates all the objects, hands them off to the runner for batching, and monitors each epoch.
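
As mentioned under `test/` above, a shape check is an easy first unit test. Here is a minimal sketch (the `torch.nn.Linear` model is a stand-in for whatever lives in `src/model/`, not the project's actual code):

```python
# e.g. a hypothetical test/test_shapes.py
import torch

def test_model_output_shape():
    # Stand-in model: any nn.Module mapping flattened MNIST images
    # to 10 class logits fits this test.
    model = torch.nn.Linear(28 * 28, 10)
    batch = torch.randn(32, 28 * 28)  # fake batch of 32 flattened images
    out = model(batch)
    assert out.shape == (32, 10)  # one logit per digit class
```

Run it with `pytest test/`.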
## testing
- `if __name__ == "__main__":`
    - a good way to test things.
    - makes it easy to drop in breakpoints.
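
For example, a module can end with a block like this (a minimal sketch of the pattern; `make_fake_batch` is a placeholder, not something from this repo):

```python
import torch

def make_fake_batch(batch_size: int = 4) -> torch.Tensor:
    """Placeholder for whatever the module actually defines."""
    return torch.randn(batch_size, 28 * 28)

# `python the_module.py` runs only this block; a plain `import` skips it,
# so each module can carry its own quick smoke test.
if __name__ == "__main__":
    batch = make_fake_batch()
    print(batch.shape)  # quick sanity check
    breakpoint()        # drop into pdb and poke around
```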
## config
- Hydra config.
    - quickly experiment with hyperparameters.
    - a good way to define run variables: `lr`, `workers`, `batch_size`, `debug`.
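
A minimal sketch of the Hydra pattern, assuming an entry point at the repo root and assuming `src/config/main.yaml` defines the keys used below:

```python
import hydra
from omegaconf import DictConfig

# Hydra loads src/config/main.yaml and merges command-line overrides,
# e.g. `python train.py lr=3e-4 batch_size=128`.
@hydra.main(version_base=None, config_path="src/config", config_name="main")
def main(cfg: DictConfig) -> None:
    print(cfg.lr, cfg.workers, cfg.batch_size)  # assumed config keys

if __name__ == "__main__":
    main()
```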
## data
- collate functions!
- datasets.
- dataloader.
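
Here is how the three pieces fit together, as a minimal sketch (the dataset and collate function are illustrative, not the ones in `src/data/`):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Illustrative dataset: random 'images' with integer labels."""
    def __len__(self) -> int:
        return 100

    def __getitem__(self, idx: int):
        return torch.randn(28 * 28), idx % 10

def collate(batch):
    """Stack individual (x, y) samples into batch tensors."""
    xs, ys = zip(*batch)
    return torch.stack(xs), torch.tensor(ys)

loader = DataLoader(ToyDataset(), batch_size=32, collate_fn=collate)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([32, 784]) torch.Size([32])
```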
## formatting python
- python type hints.
- automatic formatting with the `black` package.
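
For example, a small type-hinted function (a trivial sketch; `black` will normalize its spacing and line breaks automatically):

```python
import torch

def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of correct predictions; the hints document the contract."""
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()
```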
## running
- `tqdm` to track progress.
- `wandb` for logging.
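
A minimal sketch of both together in a training loop (the project name, data, and logged keys are placeholders):

```python
import torch
import wandb
from tqdm import tqdm

wandb.init(project="ml_pipeline")  # placeholder project name

model = torch.nn.Linear(28 * 28, 10)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    # tqdm wraps any iterable and draws a progress bar.
    for _ in tqdm(range(100), desc=f"epoch {epoch}"):
        x = torch.randn(32, 28 * 28)        # fake batch
        y = torch.randint(0, 10, (32,))
        loss = criterion(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        wandb.log({"loss": loss.item()})    # one point per step
```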
## architecture
- the dataloader, optimizer, criterion, device, and state are constructed in `main`, then passed to an object that runs the batches.
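
A minimal sketch of that separation (the `Runner` interface below is an assumption about the shape of `src/pipeline/runner.py`, not a copy of it):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class Runner:
    """Owns the batch loop; everything it needs is injected from main."""
    def __init__(self, loader, model, optimizer, criterion, device):
        self.loader, self.model = loader, model
        self.optimizer, self.criterion, self.device = optimizer, criterion, device

    def run_epoch(self) -> float:
        total = 0.0
        for x, y in self.loader:
            x, y = x.to(self.device), y.to(self.device)
            loss = self.criterion(self.model(x), y)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            total += loss.item()
        return total / len(self.loader)

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(28 * 28, 10).to(device)
    loader = DataLoader(
        TensorDataset(torch.randn(256, 28 * 28), torch.randint(0, 10, (256,))),
        batch_size=32,
    )
    runner = Runner(
        loader,
        model,
        torch.optim.SGD(model.parameters(), lr=1e-2),
        torch.nn.CrossEntropyLoss(),
        device,
    )
    print(runner.run_epoch())  # monitor each epoch from main

if __name__ == "__main__":
    main()
```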