If you’ve gone through the process of learning how to code, you understand that it isn’t just about memorizing syntax. It’s about learning a new way of thinking.
First you learn the tools (syntax, data structures, algorithms, etc.). Then you’re given a problem, and you have to solve it in a way that uses those tools efficiently.
Data science is the same. Working in this field means you encounter problems on a daily basis, and I don’t just mean code bugs.
Examples of problems that data scientists need to solve:
- How can I detect outliers in this dataset?
- How can I forecast tomorrow’s energy consumption?
- How can I classify this image as a face or an object?
Data scientists use a variety of tools to tackle these problems: machine learning, statistics, visualization, and more. But if you want to find optimal solutions, you need an approach that keeps certain principles in mind.
Understand that data is the most important thing.
I know, that sounds really obvious. Let me explain.
One of the biggest mistakes made by newcomers to data science (and by non-technical people working with data scientists) is focusing too much on the wrong things, such as:
- Choosing the most complex models
- Tuning hyperparameters to excess
- Trying to solve every data problem with machine learning
The field of data science and ML develops rapidly. There’s always a new library, a faster technology, or a better model. But the most complicated, cutting-edge choice is not always the best choice. Many considerations go into selecting a model, including asking whether machine learning is even required.
I work in energy, and a big chunk of my work is outlier detection: sometimes to remove outliers before training a model, and sometimes to flag them for further human inspection.
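To make that concrete, here is a minimal sketch of one common outlier-detection approach, Tukey's IQR fences, applied to made-up energy readings (the function name and data are my own illustration, not the method used at any particular company):

```python
from statistics import quantiles

def flag_outliers_iqr(values, k=1.5):
    # Tukey's fences: flag points outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical hourly energy readings with one obvious spike.
readings = [10.2, 9.8, 10.5, 10.1, 55.0, 9.9, 10.3]
print(flag_outliers_iqr(readings))  # → [55.0]
```

Note that this is a statistics-only technique, no machine learning involved, which is exactly the point: a simple, explainable rule is often good enough, and a reviewer can see at a glance why a point was flagged.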