Notetaking for data science

I was looking for a suitable method of taking notes during exploratory data analysis. I needed something that imposes enough structure to ensure that all the essential aspects are covered, something that makes it impossible to make a silly mistake and omit important information.

At the same time, I did not want anything as formal as the laboratory notebooks used by research scientists. In my opinion, that would be overkill.

Obviously, the best kind of documentation is the code. It covers the reproducibility requirement and cannot be misinterpreted. It is almost perfect.

Sadly, the code does not explain why it was written. It does not reveal its purpose. It does not explain the decisions made by the author.

It is even worse when I read my own code written six months ago. I see that I generated a chart and decided to remove outliers from the dataset. What if it took me three days to conclude that the outliers should be removed? What if I had to dig into the source code of the application that implements the business process which generates the data? I certainly would not remember that after six months.

Nevertheless, we should not rely on our memory. People forget things. People quit. People recall things that did not happen or happened on a different occasion. We don’t even know that some of our “memories” are fake.

We need a decent and straightforward method of documenting what we have done with data. Fortunately, some people have already written guidelines and best practices for such documentation. The advice I have found so far is summarized below.


Summary

In general, the guidelines and best practices I have found so far suggest:

  • keeping the input data separate from the code and the results

  • using file names that indicate the order of running them, such as 001_merge_input.py, 002_validate_data.py, etc.

  • writing down the purpose of the analysis, the author’s name, and the name of the person who will use the result

  • making a list of all the data sources (including the queries used to retrieve them)

  • noting the purpose of, and the reason for, every step before writing the code

  • separating experiments: one hypothesis per file, and keeping all the files, especially if the result is not what you expected

  • writing down anything that looks strange or surprises you

  • writing down every explanation you had to ask someone for

  • keeping all your files in a version control system such as Git or Subversion

  • avoiding manual steps: everything must be automated, and if a script requires some initial state, it should either prepare that state or at least validate whether the initial conditions are met

  • keeping track of the software environment (automatically generating a list of the libraries you used and their versions)

  • keeping all versions of the output and writing down the date when you produced them.
    If you have sent a report to someone, keep it. Always. Even if one day later you had to change the code and regenerate the report, keep the old version.
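The numbered file names recommended above also make full automation straightforward. A minimal sketch in Python, assuming the numbered scripts live in a single directory and take no command-line arguments:

```python
import subprocess
import sys
from pathlib import Path

def run_pipeline(script_dir="."):
    """Run numbered analysis scripts (001_*.py, 002_*.py, ...) in order.

    Sorting the matched file names reproduces the intended running order,
    and check=True stops the pipeline as soon as one step fails.
    """
    scripts = sorted(Path(script_dir).glob("[0-9][0-9][0-9]_*.py"))
    for script in scripts:
        print(f"Running {script.name}")
        subprocess.run([sys.executable, str(script)], check=True)
```

Because a failing step aborts the whole run, nobody can accidentally produce a report from half-processed data.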
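The "no manual steps" and "track the software environment" points can be covered by two small helpers. A sketch using only the standard library; the function and file names are my own, not from any particular tool:

```python
import importlib.metadata
import sys
from pathlib import Path

def check_preconditions(input_file):
    """Fail fast if the initial state the script expects is missing."""
    path = Path(input_file)
    if not path.exists():
        sys.exit(f"Missing input file: {path}")

def snapshot_environment(output="environment.txt"):
    """Record the interpreter version and every installed package version."""
    lines = [f"python=={sys.version.split()[0]}"]
    for dist in importlib.metadata.distributions():
        lines.append(f"{dist.metadata['Name']}=={dist.version}")
    Path(output).write_text("\n".join(sorted(lines)) + "\n")
```

A script would call check_preconditions on its expected inputs first, then snapshot_environment(), so every run validates its starting state and leaves a record of the libraries it used.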
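Keeping every dated version of the output can also be automated. A sketch of a hypothetical archive_report helper that copies each report under a date-stamped name before it is sent:

```python
import shutil
from datetime import date
from pathlib import Path

def archive_report(report_path, archive_dir="reports/archive"):
    """Copy a report under a date-stamped name so every version is kept."""
    report = Path(report_path)
    target_dir = Path(archive_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamped = target_dir / f"{date.today():%Y-%m-%d}_{report.name}"
    shutil.copy2(report, stamped)  # copy2 also preserves the timestamps
    return stamped
```

Running it after every report generation guarantees the old versions survive, even when the code changes the next day.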




Bartosz Mikulski * data scientist / software engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group