What data science and software engineering have in common is writing code. But while code is the main outcome of software engineering, data science projects typically end with models, results, and reports. Consequently, in data science the quality, structure, and delivery of code are often an afterthought at best.
The implicit expectation with data science projects is that the results reported at the end can be trusted.
This means that if someone asked you to re-run your or somebody else’s analysis, you would be able to obtain the same results, regardless of how much time has passed since you first performed the analysis.
Similarly, if you are developing a component for a product, the implicit expectation is that the component performs as well as reasonably possible within the requirements of the product.
These statements may seem obvious, but satisfying both expectations can be quite difficult.
If you don’t believe me, think about your past projects.
Have you ever struggled to run your old code or to figure out which version of your data or which hyperparameters you used to obtain a specific result?
This is the second article in a series where I talk about practical data science skills that, in my experience, are not taught in data science courses but will occupy much of your day-to-day work as a data scientist. This post is inspired by a course I taught at the University of Tennessee, Knoxville — DSE 511 — and a fantastic MIT course aptly called “The Missing Semester of Your CS Education.”
This post focuses on skills that will help you make your results more reliable and your code more reusable.