File Organization in Research

I believe that organizing your files and folders in a smart way substantially simplifies many aspects of your PhD studies and your research. The setup I document here is probably most useful for Computer Science, and especially Machine Learning, but should also be relevant to related fields. Note that there is definitely no ultimate solution to this problem.

Use a Git Repository

I keep most of my work in one large git repository (named “phd”), which I push to my Bitbucket account (they offer free private repositories and have an awesome academic license!). If you are still using Subversion or do not use any revision control system, go and read up on git immediately! My phd repository has roughly the following layout:

courses/
it/
publications.bib
readings/
research/
reviews/
software/
supervision/
teaching/
thesis/
writings/

courses stores the material of all lectures I attended as a student during my studies.

it contains small code snippets, documentation on how to use some obscure university services, and some data backup scripts.

publications.bib contains my own publications in a single BibTeX file.

readings stores the 1000+ PDFs of articles I have read at some point.

research is where all files directly related to my research go. When I start working on something new that is not closely related to a previous project, I just create a new subfolder in there, something like svm_on_steroids. Ideally you would start a separate git repository for each new project, but while initially prototyping ideas I find it convenient to keep the files in the same git repository. If I share code or write-ups with colleagues in some other SVN or git repository, I just check out that repository into the research folder and add its name to a .gitignore, so that git ignores the folder.

reviews is organized by year at the top level, with one subfolder per conference/journal. I put all material related to reviewing in there (papers, supplements, reviews).

software holds scientific software, third-party or my own, that I use in more than one research project.

supervision stores information related to the projects of master's students. This might include the project description, but also the code and the report they hand in.

teaching contains subfolders (usually external SVN repositories) storing material for classes that I’m co-teaching. I always prefix the name/acronym of the lecture with the year, so that things stay in order.

thesis is self-explanatory, and writings contains talks, grant proposals/reports, and write-ups not related to any of the larger projects in the research folder.
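The .gitignore step mentioned for the research folder can be automated with a small helper. The following is only a minimal sketch in Python; the function name ignore_folder and the example path research/shared_project are hypothetical and not part of my actual setup.

```python
from pathlib import Path


def ignore_folder(repo_root: Path, folder: str) -> bool:
    """Add `folder` to repo_root/.gitignore unless it is already listed.

    Returns True if a new entry was written, False if it was already there.
    (Hypothetical helper, shown for illustration only.)
    """
    gitignore = repo_root / ".gitignore"
    # A trailing slash makes the pattern match directories only.
    entry = folder.strip("/") + "/"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    if entry in existing:
        return False
    existing.append(entry)
    gitignore.write_text("\n".join(existing) + "\n")
    return True
```

Calling ignore_folder(Path("phd"), "research/shared_project") after checking out a colleague's repository would then keep the outer repository from tracking that folder. Calling it a second time changes nothing.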

Archive

In the research and teaching folders I have a special folder called archive. When I think I’m done with a project/course, I move its folder into the archive folder and prefix the folder name with the year of completion. If these subfolders are external git or SVN repositories, I still tend to move everything into the phd git repository (say, by doing an svn export) to prevent any data loss in the future.
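The archiving step is simple enough to script. Below is a rough sketch, not the exact procedure I follow: the function name archive_project and the underscore between year and folder name are assumptions made for illustration.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def archive_project(project: Path, year: Optional[int] = None) -> Path:
    """Move `project` into a sibling `archive` folder, prefixed with the year.

    Hypothetical helper; the `<year>_<name>` naming is an assumed convention.
    """
    # Default to the current year as the year of completion.
    year = year if year is not None else date.today().year
    archive_dir = project.parent / "archive"
    archive_dir.mkdir(exist_ok=True)
    target = archive_dir / f"{year}_{project.name}"
    # Refuse to overwrite an earlier archive entry with the same name.
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(project), str(target))
    return target
```

For example, archive_project(Path("research/svm_on_steroids"), year=2013) would move the project to research/archive/2013_svm_on_steroids. This is a plain file-system move, so the change still has to be staged and committed in git afterwards.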

Project Folder Organization

All my project folders have at least the following three subfolders: doc, experiments and src. Organizing experiments and code for research is a topic of its own, and there was recently a good discussion of it by Ali Eslami and John Langford. For the documents in the doc folder I again tend to prefix the conference/journal name with the year.
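Creating the three standard subfolders for a new project can be sketched as follows. The function name create_project is hypothetical, and the .gitkeep placeholder files are a common convention rather than a git feature. I add them here only because git does not track empty directories.

```python
from pathlib import Path

# The three subfolders every project folder starts with.
SUBFOLDERS = ("doc", "experiments", "src")


def create_project(research_dir: Path, name: str) -> Path:
    """Create research_dir/name with doc, experiments and src subfolders.

    Hypothetical helper, shown for illustration only.
    """
    project = research_dir / name
    for sub in SUBFOLDERS:
        folder = project / sub
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so drop a placeholder file.
        (folder / ".gitkeep").touch()
    return project
```

Running create_project(Path("research"), "svm_on_steroids") would give a fresh project folder with the three subfolders in place.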

Flaws

The readings folder is steadily growing and is currently around 2GB in size. There is also no clear advantage to having a revision history for these files, as they never change. Still, git is fairly fast, so this does not slow things down too much.

The archiving policy described above could probably also be improved. Git hosting services such as GitHub and Bitbucket are probably a good way to go about it: if the project is not already one of your private repositories, just create a new repository and import all the content from the external source (or, alternatively, fork it). One advantage of moving content into a dedicated archive folder, however, is that it is available on the go.