Merge pull request #11 from drivendata/update-doc

Add more to overview
Peter Bull 2016-04-23 16:52:49 -04:00
commit dee0d3ff21
1 changed file with 31 additions and 12 deletions


@@ -1,29 +1,44 @@
# Cookiecutter Data Science
_A logical, reasonably standardized, but flexible project structure for doing and sharing data science work._

## Why use this project structure?

We often think of data analysis as just the resulting report, insights, or visualizations. Even though these end products are generated by code, it's easy to focus on making the products _look nice_ and ignore the _quality of the code that generates them_. While these end products are generally the main event, **code quality is still important**! And we're not talking about bikeshedding the aesthetics or pedantic formatting standards — it's ultimately about correctness and reproducibility.

It's no secret that good analyses are often the result of very scattershot and serendipitous explorations, tentative experiments, and rapidly testing what works and what doesn't. That is all part of the process for getting to the good stuff, and there is no magic bullet to turn data exploration into a simple, linear progression. That being said, once started, it is not a process that lends itself to thinking carefully about the structure of your code or project layout.

We think it's a pretty big win all around to let someone else do that up-front thinking and setup for you. Here's why:
### Other people will thank you
> Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run `rails new` to get a standard project skeleton like everybody else.

A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging into extensive documentation. Well-organized code is self-documenting and provides a lot of context for your code without much overhead. People will thank you for this because they can:
- Collaborate easily with you on this analysis
- Easily learn from your analysis about the process and the domain
- Feel confident in the conclusions the analysis presents
A consistent project structure means less random searching for what gets called where. A good example of this can be found in web development frameworks like Ruby on Rails, Django, and most others. Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run `rails new` to get a standard project skeleton like everybody else. And because the default project structure is _reasonably logical_ and _standard across most projects_, it takes almost no time at all for somebody who has never seen a particular project to figure out where they would find the various moving parts.

Ideally, that's how it should be when a colleague opens up your data science project.
### You will thank you
Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use `make_figures.py.old`, `make_figures_working.py` or `new_make_figures01.py` to get things done. Here are some questions we've learned to ask with a sense of existential dread:

* Are we supposed to go in and join the "region" column to the data before we get started or did that come from one of the notebooks?
* Come to think of it, which notebook do we have to run first before running the plotting code: was it "process data" or "clean data"?
* Where did the shapefiles get downloaded from for the geographic plots you made?
* _Et cetera, times infinity._

These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work, for example separation of concerns, abstracting analysis as a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph), and engineering best practices like version control.
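The DAG idea is easiest to see in a sketch. The step and file names below are hypothetical (and a real project would typically drive this with a tool like `make`), but the point is that the ordering of the steps lives in code rather than in anyone's memory:

```python
# pipeline.py (hypothetical): each step consumes the outputs of the previous one,
# so the dependency order is written down once instead of being tribal knowledge.

def get_raw_data():
    """Download source data into data/raw/ (stubbed out here)."""
    print("downloading raw data -> data/raw/")

def clean_data():
    """Depends on get_raw_data; writes a cleaned copy into data/interim/."""
    print("cleaning data -> data/interim/")

def build_features():
    """Depends on clean_data; writes model-ready features into data/processed/."""
    print("building features -> data/processed/")

def make_figures():
    """Depends on build_features; renders figures into reports/figures/."""
    print("rendering figures -> reports/figures/")

if __name__ == "__main__":
    # Running the chain top to bottom answers "which step runs first?" once and for all.
    get_raw_data()
    clean_data()
    build_features()
    make_figures()
```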
## Getting started
With this in mind, we've created a data science cookiecutter template for projects in Python. Your analysis doesn't have to be in Python, but the template does provide some Python boilerplate that you'd want to remove (in the `src` folder for example, and the Sphinx documentation skeleton in `docs`).
### Requirements
@@ -88,17 +103,17 @@ Starting a new project is as easy as running this command at the command line. N
## Opinions
There are some opinions implicit in the project structure that have grown out of our experience with what works and what doesn't when collaborating on data science projects. Some of the opinions are about workflows, and some of the opinions are about tools that make life easier. Here are some of the beliefs which this project is built on -- if you've got thoughts, please [contribute or share them](#contributing).
### Data is immutable
Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis. You shouldn't have to run all of the steps every time you want to make a new figure (see [Analysis is a DAG](#analysis-is-a-dag)), but anyone should be able to reproduce the final products with only the code in `src` and the data in `data/raw`.
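As a minimal sketch of what this looks like in a single step (assuming pandas; the file names and the transformation are hypothetical), everything reads from `data/raw` and writes somewhere else, never back over the original:

```python
# Hypothetical cleaning step: data/raw is treated as read-only, and every derived
# artifact lands in data/processed so the original download is always recoverable.
from pathlib import Path

import pandas as pd

RAW = Path("data/raw/survey_responses.csv")          # never edited, never overwritten
PROCESSED = Path("data/processed/survey_clean.csv")

df = pd.read_csv(RAW)
df = df.dropna(subset=["region"])                    # example transformation only
PROCESSED.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(PROCESSED, index=False)                    # written alongside, not over, the raw file
```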
Also, if data is immutable, it doesn't need source control in the same way that code does. Therefore, ***by default, the data folder is included in the `.gitignore` file.*** If you have a small amount of data that rarely changes, you may want to include the data in the repository. Github currently warns if files are over 50MB and rejects files over 100MB. Some other options for storing/syncing large data include [AWS S3](https://aws.amazon.com/s3/) with a syncing tool (e.g., [`s3cmd`](http://s3tools.org/s3cmd)), [Git Large File Storage](https://git-lfs.github.com/), [Git Annex](https://git-annex.branchable.com/), and [dat](http://dat-data.com/). Currently by default, we ask for an S3 bucket and use `s3cmd` to sync data in the `data` folder with the server.
### Notebooks are for exploration and communication

Notebook packages like the [Jupyter notebook](http://jupyter.org/), [Beaker notebook](http://beakernotebook.com/), [Zeppelin](http://zeppelin-project.org/), and other literate programming tools are very effective for exploratory data analysis. However, these tools can be less effective for reproducing an analysis. When we use notebooks in our work, we often subdivide the `notebooks` folder. For example, `notebooks/exploratory` contains initial explorations, whereas `notebooks/reports` is more polished work that can be exported as html to the `reports` directory.
Since notebooks are challenging objects for source control (e.g., diffs of the `json` are often not human-readable and merging is near impossible), we recommend not collaborating directly with others on Jupyter notebooks. There are two steps we recommend for using notebooks effectively:
@@ -144,6 +159,10 @@ One effective approach to this is use [virtualenv](https://virtualenv.pypa.io/en
If you have more complex requirements for recreating your environment, consider a virtual machine based approach such as [Docker](https://www.docker.com/) or [Vagrant](https://www.vagrantup.com/). Both of these tools use text-based formats (Dockerfile and Vagrantfile respectively) you can easily add to source control to describe how to create a virtual machine with the requirements you need.
### Keep secrets out of version control
You _really_ don't want to leak your AWS secret key or Postgres username and password on GitHub. Enough said, mostly — see the [Twelve Factor App](http://12factor.net/) principles on this point. We generally use a `.env` file that, thanks to the `.gitignore`, never gets committed to the repository; it defines your secrets as environment variables, which are read in automatically by a package like `dotenv` in Python.
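A minimal sketch of that pattern in Python, assuming the `python-dotenv` package (the variable name is hypothetical):

```python
# The .env file itself stays out of git; this code only reads whatever it defines
# into the process environment.
import os

from dotenv import find_dotenv, load_dotenv

# Walk up from the current directory to find .env, then load its variables.
load_dotenv(find_dotenv())

database_url = os.environ.get("DATABASE_URL")  # None if the secret isn't configured
```

The `.env` file itself is just `KEY=value` lines (for example, a hypothetical `DATABASE_URL=postgres://...`) and is listed in `.gitignore` so it never reaches the repository.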
## Contributing
The Cookiecutter Data Science project is opinionated, but not afraid to be wrong. Best practices change, tools evolve, and lessons are learned. **The goal of this project is to make it easier to start, structure, and share an analysis.** [Pull requests](https://github.com/drivendata/cookiecutter-data-science/pulls) and [filing issues](https://github.com/drivendata/cookiecutter-data-science/issues) are encouraged. We'd love to hear what works for you, and what doesn't.