File Organization in Research
I believe that organizing your files and folders well substantially simplifies many aspects of your PhD studies and your research. The setup I document here is probably most useful for Computer Science, and especially Machine Learning, but should also be relevant for related fields. Note that there is no single best solution to this problem.
Use a Git Repository
I keep most of my work in one large Git repository (named “phd”), which I push to my Bitbucket account (they offer free private repositories and have an awesome academic license!). If you are still using Subversion or do not use any revision control system, go and read up on Git immediately! My phd repository has roughly the following layout:
courses/
it/
publications.bib
readings/
research/
reviews/
software/
supervision/
teaching/
thesis/
writings/
courses/ stores all the course material of lectures I attended as a student during my studies.
it/ contains smaller code snippets, documentation on how to use some obscure university services, and some data backup scripts.
publications.bib contains my own publications in one single BibTeX file.
readings/ stores all my 1000+ PDFs of articles I read at one point.
research/ is where all the files directly related to my research go. When I start working on something new that is not closely related to a previous project, I just create a new subfolder in there, something like svm_on_steroids. Ideally you would start a separate Git repository for each new project, but while initially prototyping some ideas I found it convenient to just keep the files in the same Git repository. If I share code or write-ups with colleagues in some other SVN or Git repository, I just check out that repository into the research folder and put its name into .gitignore so that Git ignores the folder.
reviews/ is ordered by year at the top level, with the conference/journal name as a subfolder. I put all the material related to reviewing in there (papers, supplements, reviews).
software/ holds scientific software, third-party or my own, that I use in more than one research project.
supervision/ stores information related to master's student projects. This might include the project description, but also the code and the report they hand in.
teaching/ contains subfolders (which are usually external SVN repositories) storing material related to classes that I’m co-teaching. I always prefix the name/acronym of the lecture with the year, so that things stay in order.
thesis/ is self-explanatory.
writings/ contains talks, grant proposals/reports, and write-ups not related to any of the larger projects in the research folder.
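The .gitignore step mentioned for external checkouts in the research folder can be sketched as follows. This is only an illustration, not the author's actual tooling: the function name `ignore_external_checkout` and the example folder `shared_project` are hypothetical, and the entry is written with a leading and trailing slash so that it matches exactly that one directory relative to the repository root.

```python
from pathlib import Path


def ignore_external_checkout(repo_root: Path, checkout: Path) -> str:
    """Add a checkout located inside repo_root to .gitignore; return the entry.

    Hypothetical helper for illustration only.
    """
    # Leading slash: anchor the pattern at the repository root.
    # Trailing slash: match a directory only.
    entry = "/" + checkout.relative_to(repo_root).as_posix() + "/"
    gitignore = repo_root / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if entry not in lines:  # do not add the same entry twice
        lines.append(entry)
        gitignore.write_text("\n".join(lines) + "\n")
    return entry
```

For example, `ignore_external_checkout(Path("phd"), Path("phd/research/shared_project"))` would append `/research/shared_project/` to `phd/.gitignore`.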
Archive
In the research and teaching folders I have a special folder called archive. When I think I’m done with a project/course, I move its folder to the archive folder and prefix the folder name with the year of completion. If these subfolders are external Git or SVN repositories, I still tend to move everything into the phd Git repository (say by doing an svn export) in order to prevent any data loss in the future.
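The archiving step described above could be automated roughly like this. It is a minimal sketch under some assumptions: the helper name `archive_project` is made up, and the underscore between the year and the folder name is my choice, since the text only says that the year is used as a prefix. The function does a plain filesystem move and leaves staging and committing the change to you.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def archive_project(parent: Path, name: str, year: Optional[int] = None) -> Path:
    """Move parent/name to parent/archive/<year>_<name>; return the new path.

    Hypothetical helper; the "<year>_<name>" pattern is an assumption.
    """
    year = date.today().year if year is None else year
    source = parent / name
    target = parent / "archive" / f"{year}_{name}"
    if not source.is_dir():
        raise FileNotFoundError(f"no such project folder: {source}")
    if target.exists():
        raise FileExistsError(f"archive entry already exists: {target}")
    target.parent.mkdir(exist_ok=True)  # create archive/ on first use
    shutil.move(str(source), str(target))
    return target
```

Calling `archive_project(Path("research"), "svm_on_steroids", year=2013)` would move the project to `research/archive/2013_svm_on_steroids`.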
Project Folder Organization
All my project folders have at least the following three subfolders: doc, experiments and src. Organizing experiments and code for research is a topic of its own, and recently there was a good discussion of the topic by Ali Eslami and John Langford. For the documents in the doc folder I again tend to prefix the conference/journal name with the year.
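A new project folder with these three subfolders can be set up with a few lines of Python. This is only a sketch: `create_project` is a hypothetical helper, and the `.gitkeep` placeholder files are a common convention (not a Git feature) that I assume here because Git does not track empty directories.

```python
from pathlib import Path

# The three subfolders named in the text.
SUBFOLDERS = ("doc", "experiments", "src")


def create_project(research_dir: Path, name: str) -> Path:
    """Create research_dir/name with doc, experiments and src subfolders."""
    project = research_dir / name
    for sub in SUBFOLDERS:
        folder = project / sub
        folder.mkdir(parents=True, exist_ok=True)
        # Assumed convention: a placeholder file so the otherwise empty
        # folder can be committed to Git.
        (folder / ".gitkeep").touch()
    return project
```

For instance, `create_project(Path("research"), "svm_on_steroids")` creates `research/svm_on_steroids/` with the `doc`, `experiments` and `src` subfolders.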
Flaws
The readings folder is steadily increasing in size and is currently at around 2GB. Also, there is no clear advantage to having a revision history of these files, as they never change. Still, Git is fairly fast, so this does not slow things down too much.
The archiving policy described above should probably also be improved. Git hosting services such as GitHub and Bitbucket are probably a good way to go about it: if the project is not already one of your private repositories, just create a new repository and import all the content from the external source (or alternatively fork it). One advantage of moving content to a dedicated archive folder, however, is availability on the go.