Disable syntax highlighting for folder structure

- Wording tweaks
This commit is contained in:
Isaac Slavitt 2016-04-28 19:12:52 -04:00
parent 34867e1bff
commit 25512d610a
1 changed files with 27 additions and 11 deletions

View File

@ -4,24 +4,30 @@ _A logical, reasonably standardized, but flexible project structure for doing an
## Why use this project structure?
We often think of data analysis as just the resulting report, insights, or visualizations. Even though these end products are generated by code, it's easy to focus on making the products _look nice_ and ignore the _quality of the code that generates them_. While these end products are generally the main event, **code quality is still important**! And we're not talking about bikeshedding the aesthetics or pedantic formatting standards — it's ultimately about correctness and reproducibility.
> We're not talking about bikeshedding the indentation aesthetics or pedantic formatting standards — ultimately, data science code quality is about correctness and reproducibility.
It's no secret that good analyses are often the result of very scattershot and serendipitous explorations, tentative experiments, and rapidly testing what works and what doesn't. That is all part of the process for getting to the good stuff, and there is no magic bullet to turn data exploration into a simple, linear progression. That being said, once started it is not a process that lends itself to thinking carefully about the structure of your code or project layout.
When we think about data analysis, we often think just about the resulting reports, insights, or visualizations. While these end products are generally the main event, it's easy to focus on making the products _look nice_ and ignore the _quality of the code that generates them_. Because these end products are created programmatically, **code quality is still important**! And we're not talking about bikeshedding the indentation aesthetics or pedantic formatting standards — ultimately, data science code quality is about correctness and reproducibility.
We think it's a pretty big win all around to let someone else do that up-front thinking and setup for you. Here's why:
It's no secret that good analyses are often the result of very scattershot and serendipitous explorations. Tentative experiments and rapidly testing approaches that might not work out are all part of the process for getting to the good stuff, and there is no magic bullet to turn data exploration into a simple, linear progression.
That being said, once started it is not a process that lends itself to thinking carefully about the structure of your code or project layout, so it's best to start with a clean, logical structure and stick to it throughout. We think it's a pretty big win all around to use a fairly standardized setup like this one. Here's why:
### Other people will thank you
> Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run `rails new` to get a standard project skeleton like everybody else.
A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging in to extensive documentation. Well organized code is self-documenting and provides a lot of context for your code without much overhead. People will thank you for this because they can:
A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging in to extensive documentation. It also means that they don't necessarily have to read 100% of the code before knowing where to look for very specific things.
- Collaborate easily with you on this analysis
- Easily learn from your analysis about the process and the domain
- Feel confident in the conclusions the analysis presents
Well organized code tends to be self-documenting in that the organization itself provides context for your code without much overhead. People will thank you for this because they can:
A consistent project structure means less random searching for what gets called where. A good example of this can be found in web development frameworks like Ruby on Rails, Django, and most others. Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run `rails new` to get a standard project skeleton like everybody else. And because the default project structure is _reasonably logical_ and _standard across most projects_, it takes almost no time at all for somebody who has never seen a particular project to figure out where they would find the various moving parts.
- Collaborate more easily with you on this analysis
- Learn from your analysis about the process and the domain
- Feel confident in the conclusions at which the analysis arrives
A good example of this can be found in any of the major web development frameworks like Django or Ruby on Rails. Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run `rails new` to get a standard project skeleton like everybody else. Because that default project structure is _logical_ and _reasonably standard across most projects_, it is much easier for somebody who has never seen a particular project to figure out where they would find the various moving parts.
Another great example is the [Filesystem Hierarchy Standard](https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard) for Unix-like systems. The `/etc` directory has a very specific purpose, as does the `/tmp` folder, and everybody (more or less) agrees to honor that social contract. That means an Red Hat user and an Ubuntu user both know roughly where to look for certain types of files, even when using each other's system — or any other standards-compliant system for that matter!
Ideally, that's how it should be when a colleague opens up your data science project.
@ -29,13 +35,23 @@ Ideally, that's how it should be when a colleague opens up your data science pro
Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use `make_figures.py.old`, `make_figures_working.py` or `new_make_figures01.py` to get things done. Here are some questions we've learned to ask with a sense of existential dread:
* Are we supposed to go in and join the "region" column to the data before we get started or did that come from one of the notebooks?
* Are we supposed to go in and join the column X to the data before we get started or did that come from one of the notebooks?
* Come to think of it, which notebook do we have to run first before running the plotting code: was it "process data" or "clean data"?
* Where did the shapefiles get downloaded from for the geographic plots you made?
* Where did the shapefiles get downloaded from for the geographic plots?
* _Et cetera, times infinity._
These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work, for example separation of concerns, abstracting analysis as a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph), and engineering best practices like version control.
### Nothing here is binding
> "A foolish consistency is the hobgoblin of little minds" — Ralph Waldo Emerson (and [PEP 8!](https://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds))
Disagree with a couple of the default folder names? Working on a project that's a little nonstandard and doesn't exactly fit with the current structure? Prefer to use a different package than one of the (few) defaults?
**Go for it!** This is a lightweight structure, and is intended to be a good _starting point_ for many projects. Or, as PEP 8 put it:
> Consistency within a project is more important. Consistency within one module or function is the most important. ... However, know when to be inconsistent -- sometimes style guide recommendations just aren't applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!
## Getting started
With this in mind, we've created a data science cookiecutter template for projects in Python. Your analysis doesn't have to be in Python, but the template does provide some Python boilerplate that you'd want to remove (in the `src` folder for example, and the Sphinx documentation skeleton in `docs`).
@ -58,7 +74,7 @@ Starting a new project is as easy as running this command at the command line. N
## Directory structure
```none
```nohighlight
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.