Notetaking for data science

I was looking for a suitable method of taking notes during exploratory data analysis. I needed something that imposes enough structure to ensure that all the essential aspects are covered. Something that makes it impossible to make a silly mistake and omit important information.

At the same time, I did not want anything as formal as the laboratory notebooks that scientists keep. In my opinion, that would be overkill.

Obviously, the best kind of documentation is the code. It covers the reproducibility requirement and cannot be misinterpreted. It is almost perfect.

Sadly, the code does not explain why it was written. It does not reveal its purpose. It does not explain the decisions made by the author.

It is even worse when I read code I wrote six months ago. I see that I generated a chart and decided to remove outliers from the dataset. What if it took me three days to conclude that the outliers should be removed? What if I had to dig into the source code of the application that implements the business process that generates the data? Well, I would probably remember that after six months.

Nevertheless, we should not rely on our memory. People forget things. People quit. People recall things that did not happen or happened on a different occasion. We don’t even know that some of our “memories” are fake.

We need a decent and straightforward method of documenting what we have done with data. Fortunately, other people have already written down suggestions for producing such documentation.

Summary

In general, the guidelines and best practices I have found so far suggest:

  • keeping the input data separated from code and the results

  • using names that indicate the order of running files, such as 001_merge_input.py, 002_validate_data.py, etc.

  • writing down the purpose of the analysis, the author’s name and the name of the person who will be using the result

  • making a list of all the data sources (including the queries used to retrieve them)

  • noting the purpose of every step and the reason for it before writing the code

  • separating experiments: one hypothesis per file. Keep all the files, especially if the result is not what you expected.

  • if something is strange or something surprised you, write it down.

  • if you needed to ask someone for an explanation, write down both the question and the answer.

  • keep all your files in a version control system such as Git or Subversion

  • no manual steps: everything must be automated. If a script requires some initial state, it should either prepare that state or at least validate that the initial conditions are met.

  • keep track of the software environment (automatically generate a list of libraries you used and their versions)

  • keep all versions of the output and write down the date when you produced each of them.
    If you have sent a report to someone, keep it. Always. Even if you had to change the code and create the report again one day later, keep the old version.
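
The numbered file names from the list above also make automation easy. Here is a minimal sketch, assuming every step is a Python script whose name starts with a three-digit prefix and that all the scripts live in one directory. The function names ordered_scripts and run_all are my own, not part of any library.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(directory):
    """Return scripts named like 001_merge_input.py, sorted by file name.

    Zero-padded three-digit prefixes sort the same way alphabetically
    and numerically, so a plain sort gives the execution order.
    """
    return sorted(Path(directory).glob("[0-9][0-9][0-9]_*.py"))


def run_all(directory):
    """Run every numbered script in order and stop at the first failure."""
    for script in ordered_scripts(directory):
        print(f"Running {script.name}")
        # check=True raises CalledProcessError if the script exits with an error
        subprocess.run([sys.executable, str(script)], check=True)
```

Calling run_all on the analysis directory replays the whole analysis without any manual steps. Files that do not follow the naming scheme, such as helper modules, are ignored.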
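
The purpose, the author, the recipient and the data sources can be kept in a small structure, so that none of them gets forgotten. This is only a sketch of one possible layout. The class names AnalysisHeader and DataSource and all the field names are assumptions of mine, not an established format.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    name: str
    query: str  # the exact query or command used to retrieve the data


@dataclass
class AnalysisHeader:
    purpose: str
    author: str
    recipient: str  # the person who will be using the result
    sources: list = field(default_factory=list)

    def render(self):
        """Return the header as plain text, ready to paste at the top of the notes."""
        lines = [
            f"Purpose: {self.purpose}",
            f"Author: {self.author}",
            f"Recipient: {self.recipient}",
            "Data sources:",
        ]
        for source in self.sources:
            lines.append(f"  - {source.name}: {source.query}")
        return "\n".join(lines)
```

Because every field except the list of sources is required, creating a header without a purpose or a recipient fails immediately instead of leaving a silent gap in the notes.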
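
Two of the guidelines above, validating the initial conditions and keeping every dated version of the output, fit into a few lines of code. The sketch below assumes that the inputs are ordinary files and that putting the date in the file name is an acceptable way of recording it. Both function names are hypothetical.

```python
from datetime import date
from pathlib import Path


def check_inputs(paths):
    """Fail early, with a clear message, if any required input file is missing."""
    missing = [str(p) for p in map(Path, paths) if not p.is_file()]
    if missing:
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))


def dated_output_path(directory, stem, suffix, day=None):
    """Build a path such as <directory>/<date>_<stem><suffix>.

    An existing file is never overwritten: if the name is taken,
    a _v2, _v3, ... marker is added until a free name is found.
    """
    day = day or date.today()
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    candidate = directory / f"{day.isoformat()}_{stem}{suffix}"
    version = 2
    while candidate.exists():
        candidate = directory / f"{day.isoformat()}_{stem}_v{version}{suffix}"
        version += 1
    return candidate
```

A script would call check_inputs first and write its report to the path returned by dated_output_path, so a report that has already been sent to someone stays on disk even when the code changes the next day.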
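
Keeping track of the software environment can be automated with the Python standard library. This sketch uses importlib.metadata (available since Python 3.8) to list every installed distribution with its version. The function names and the layout of the snapshot file are my own choice. The output of pip freeze would serve the same purpose.

```python
import sys
from importlib import metadata


def environment_snapshot():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installations that lack a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_snapshot(path):
    """Write the Python version and the package list to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"# python {sys.version.split()[0]}\n")
        handle.write("\n".join(environment_snapshot()) + "\n")
```

Calling write_snapshot at the end of every run and storing the file next to the output makes it possible to tell later which library versions produced a given report.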
