The inspiration for this post came most recently from a slide deck by Ming Tang, a bioinformatician at Harvard, and a new Chromebook Data Science course offered by Jeffrey Leek from Johns Hopkins University.
However, this is a topic I’ve been thinking about for some time. A number of other great resources I’d read in the past (1,2,3,4) inspired me to create a GitHub repo for a Gold Standard workflow for setting up a new data science project directory.
Leek has talked about the Duke scandal (along with Roger Peng and Rafa Irizarry) on the simplystats blog in the past, so it wasn’t surprising that it also made it into his course. For those of you who don’t want to watch the presentation “The Importance of Reproducible Research in High-Throughput Biology: Case Studies in Forensic Bioinformatics”, I’ll give the Coles Notes version (but I warn you, you’re missing out!).
Researchers from Duke University published a study on using personalized genomics for patient-specific chemotherapy treatment. Two researchers from MD Anderson, Baggerly and Coombes, requested the data and code. It took many months of back-and-forth (hounding) to actually get the data and code, and when they did, it was disorganized and poorly documented. Baggerly and Coombes eventually found an error in the code and showed that it actually put patients at risk, which led to a major lawsuit and terminated clinical trials.
As an aside, the reluctance to share data is, sadly, still prevalent today. Case in point: the now-infamous “Research Parasite” editorial published by the New England Journal of Medicine, a leading journal in the field. I’m also including screenshots I saved of some unknown graduate student’s frustration at tracking down data/code (source unknown; sorry, I forgot).
I know that some scientists will skim the abstract of a manuscript first, but my stamp of approval is a Reproducibility Statement with all of the (raw) data and code available. If you don’t see something like this, RUN!
Failure to make your project reproducible is academic misconduct and can have serious repercussions. It was one of the charges (“failure to properly document and preserve research results”), among many others, for the recently disgraced Cornell researcher Brian Wansink (I’m not saying it’s worse than the p-hacking allegations but I refuse to say it’s any better either). This excellent post on a Gold Standard for software documentation by Daniele Procida sums it up nicely when he says:
“It doesn’t matter how good your software is, because if the documentation is not good enough, people will not use it.
Even if for some reason they have to use it because they have no choice, without good documentation, they won’t use it effectively or the way you’d like them to.”
So follow the sage advice of Mr. Procida and make it effortless for others to understand what you did in your project and reproduce those results. It’s essential for collaborating with colleagues in the present and also for posterity’s sake (e.g. your future self when you’re asked to re-run an analysis 6 months down the road, or any other researcher wishing to revisit your work). Leek considers it important enough to suggest that you “budget 10–20% of the time you will be working on a data science project just to organizing and documenting your work.”
Since my GitHub repo already addresses the Gold Standard for setting up folders in your data science project (*please star the repo as I plan on updating and improving it in the coming days*), I’ll talk about another important aspect of data science project management, which is file naming.
Jenny Bryan gives three key principles of file naming for data science projects.
- Machine readable
- Human readable
- Plays well with default ordering
For machine readability, we want to avoid spaces, punctuation, periods and any other special characters (except _ and -, and of course the period that separates the file extension).
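Here is a minimal sketch in R of what cleaning up a file name along these lines could look like. The helper name sanitize_filename and the choice to replace offending characters with a hyphen are my own assumptions for illustration; they are not part of Jenny Bryan’s advice.

```r
# Hypothetical helper, not from the original post: make a file name machine readable.
# Runs of characters other than letters, digits, "_" and "-" become a single hyphen;
# the final extension (the part after the last period) is kept as-is.
sanitize_filename <- function(name) {
  ext  <- tools::file_ext(name)            # e.g. "md"
  stem <- tools::file_path_sans_ext(name)  # name without the final extension
  stem <- gsub("[^A-Za-z0-9_-]+", "-", stem)  # replace disallowed runs with "-"
  stem <- gsub("^-+|-+$", "", stem)           # trim leading/trailing hyphens
  if (nzchar(ext)) paste0(stem, ".", ext) else stem
}

sanitize_filename("Esophageal Cancer: Report (v2).md")
# "Esophageal-Cancer-Report-v2.md"
```

Note that this sketch only protects the last extension, so a double extension such as .nii.gz would have its inner period replaced; handle those separately if you need them.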
For human readability, you want to give files meaningful names. When naming R objects there is a tendency to abbreviate object names; this is okay as long as you include a comment. For example, a comment can explain that cv_perf_recall_rf holds the validation recall calculated for each cross-validation fold of a random forest model.
# cv_perf_recall_rf: validation recall for each cross-validation fold of the random forest model
mutate(validate_recall = map2_dbl(validate_actual, validate_predicted, ~recall(actual = .x, predicted = .y)))
However, when we’re naming files I would caution against acronyms unless absolutely necessary and, if so, include that information in the README file.
The next piece of advice is to place dates and numbers at the beginning of the file name. Always use the ISO 8601 date format (yyyy-mm-dd) and left-pad numbers with zeros. The number of digits is determined by how many files you may generate. Say you expect to save 100 structural MRI image files; then a file name should look like this: 001_T1_mri.nii.gz. If you thought you would actually generate 1000 files, it would look like this: 0025_T1_mri.nii.gz.
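The padding and date rules are easy to automate. Below is a rough sketch in base R; the helper names pad_width, numbered_name and dated_name, and the example date, are hypothetical and only there to illustrate the idea.

```r
# Hypothetical helpers, not from the original post.

# Number of digits needed for the largest file number you expect.
pad_width <- function(n_files) nchar(sprintf("%d", as.integer(n_files)))

# Prefix a file name with a zero-padded number.
numbered_name <- function(i, n_files, suffix) {
  prefix <- formatC(as.integer(i), width = pad_width(n_files),
                    flag = "0", format = "d")
  paste0(prefix, "_", suffix)
}

# Prefix a file name with an ISO 8601 date (yyyy-mm-dd).
dated_name <- function(date, suffix) {
  paste0(format(as.Date(date), "%Y-%m-%d"), "_", suffix)
}

numbered_name(1, 100, "T1_mri.nii.gz")    # "001_T1_mri.nii.gz"
numbered_name(25, 1000, "T1_mri.nii.gz")  # "0025_T1_mri.nii.gz"
dated_name("2019-03-07", "report.md")     # "2019-03-07_report.md" (made-up date)
```

Because every prefix has the same width, a plain alphabetical sort of the file names is also the numerical (or chronological) order, which is exactly what “plays well with default ordering” asks for.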
Taking A Contrarian Stance
Leek also says you should avoid case sensitivity; for example, Esophageal-Cancer_Report.md is obviously a horrible filename (my fingers hurt from all the extra key-strokes 😫), and he suggests that esophageal-cancer_report.md is superior.
Respectfully, I have to disagree with him here. Personally, I find camelCase aesthetically pleasing: esophagealCancer_report.md looks much more pleasant to me, and it doesn’t come with the risks that Leek mentions, as long as you don’t forget to include the -iname flag, which makes the find command in Linux match file names case-insensitively. If you’re forgetful, or just efficient (i.e. lazy), you could always just include this as an alias in your .bashrc file 🤷
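If you would rather stay inside R than drop to the shell, a rough counterpart to find with -iname is list.files() with ignore.case = TRUE. The sketch below is only an illustration: the throwaway directory and the file in it are made up for the demo.

```r
# Create a throwaway directory with one camelCase file (hypothetical demo data).
demo_dir <- file.path(tempdir(), "naming_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "esophagealCancer_report.md")))

# Case-insensitive match on the file name, whatever case you happen to type.
hits <- list.files(demo_dir,
                   pattern = "^ESOPHAGEALCANCER_REPORT\\.md$",
                   ignore.case = TRUE)
hits
# "esophagealCancer_report.md"
```

The pattern argument is a regular expression rather than a shell glob, which is why the period is escaped and the name is anchored with ^ and $.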
Making your file name start with a capital letter is obviously a bad idea as it causes you to add additional key strokes to generate the capital letter (e.g. Shift +
If you use R, you should read up on the here package, which Jenny Bryan recommends and which does away with the awkward workflow problems that setwd() can cause.
Read her blog post “Project-oriented workflow” to get the low-down on the how and why.
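To give a flavour of it, here is a small sketch. It assumes the here package is installed, and the folder and file names are hypothetical. Instead of calling setwd() with a path that only exists on your machine, you build paths relative to the project root:

```r
# Assumes the here package is installed: install.packages("here")
# The folder and file names below are hypothetical.
if (requireNamespace("here", quietly = TRUE)) {
  # here::here() joins its arguments onto the project root it detects
  # (for example the folder holding an .Rproj file or a .git directory).
  raw_path <- here::here("data", "raw", "001_T1_mri.nii.gz")
  print(raw_path)
}
```

The same line then works unchanged for a collaborator who cloned the project into a completely different location.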
Follow this Gold Standard data science project management advice and you’ll have no problem dealing with “Big Data” 🙄
The “Gold Standard” of Data Science Project Management was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Source link https://www.r-bloggers.com/the-gold-standard-of-data-science-project-management/