Are you looking to become a data scientist but don’t know where to start?
In this article, I want to provide you with a straightforward, no-nonsense learning roadmap that you can follow to break into the industry.
By the end, you’ll have a clear understanding of what is required and the best resources to use, which should make the process less overwhelming and help you land that data science job sooner!
A hill I am willing to die on is that statistics is the most important area you should know as a data scientist.
New machine learning trends come and go, technologies often get replaced, but statistics has stood the test of time for centuries.
According to Wikipedia:
Statistics is the discipline that concerns the collection, organisation, analysis, interpretation, and presentation of data.
Given the title is “data” scientist, I think it’s obvious how vital statistics is to our field.
Fortunately, you don’t need to have a PhD in causal inference or stochastic calculus to have the required statistics knowledge. The fundamentals are the most important and literally 90% of the job.
What To Learn
The areas you need to strongly grasp are:
- Summary Statistics — Mean, median, mode, variance, correlations, anything that allows you to summarise data to draw interesting conclusions.
- Visualisations — Learn to plot data with graphs such as bar charts, line graphs and pie charts. After all, a picture is worth a thousand words.
- Probability Distributions — Learn the most common ones like Normal, Poisson, Binomial and Gamma. These are the ones I use most frequently.
- Probability Theory — This area is quite big, but the main things to learn are: random variables, central limit theorem, sampling and maximum likelihood estimation.
- Hypothesis Testing — If you are going to work on any experiments, you need to understand how they are statistically run. This involves learning about confidence intervals, significance levels, the z-test, the t-test, and test statistics. You simply need to know how to run hypothesis testing.
- Bayesian Statistics — It’s well worth knowing some Bayesian statistics, as I find people throw this term around loosely in the field all the time without really understanding it. It’s a massive area, but as always, learn the fundamentals, such as Bayes’ theorem, conjugate priors, credible intervals, and Bayesian regression.
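To make a few of these ideas concrete, here is a minimal sketch using only Python's standard library. It computes summary statistics and runs a one-sample z-test with a 95% confidence interval. The data is made up for illustration, and the known population standard deviation (`sigma = 3`) and null value (`mu_0 = 50`) are assumptions, not anything from a real dataset.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample: daily order counts (made-up numbers for illustration).
sample = [52, 48, 55, 50, 47, 53, 51, 49, 54, 51]

# Summary statistics.
mean = statistics.mean(sample)
median = statistics.median(sample)
variance = statistics.variance(sample)  # sample variance (n - 1 denominator)

# One-sample z-test of H0: population mean == mu_0.
# Assumption: the population standard deviation is known (sigma = 3).
mu_0 = 50
sigma = 3
n = len(sample)
standard_error = sigma / math.sqrt(n)
z = (mean - mu_0) / standard_error  # the test statistic

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval for the population mean.
z_crit = NormalDist().inv_cdf(0.975)
ci = (mean - z_crit * standard_error, mean + z_crit * standard_error)

print(f"mean={mean}, median={median}, variance={variance:.2f}")
print(f"z={z:.3f}, p={p_value:.3f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

If the p-value is above your chosen significance level (commonly 0.05), you do not reject the null hypothesis. In practice the population standard deviation is rarely known, which is where the t-test comes in.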
How To Learn
As I mentioned at the beginning, I want this roadmap to be simple and prevent any analysis paralysis you may experience, so to learn nearly all the above, I recommend getting the Practical Statistics for Data Science (affiliate link) textbook.
However, it does not cover Bayesian statistics, and for that, I recommend the Think Bayes (affiliate link) textbook.
These two books are all you need; both are written specifically for data scientists and include Python examples.
Statistics is, by nature, a pretty applied field, but some of its concepts require pure maths knowledge to fully understand.
Additionally, when it comes to areas like machine learning, you need a good understanding of linear algebra and calculus to fully grasp what is happening under the hood.
What To Learn
Calculus
Calculus is how machine learning algorithms actually “learn.” Their “learning” is done through numerical continuous optimisation, and the areas you should learn are:
- What is a derivative, and what is it measuring?
- Learn the derivatives of standard functions like sine, cosine, exponential, tan, etc.
- What are turning points, maxima and minima?
- The chain and product rules. The chain rule in particular is the core of backpropagation, the process used to train neural networks.
- Understand partial derivatives and their use in multivariable calculus.
- What is integration, and what is it doing?
- Integration by parts and substitution.
- The integrals of standard functions like sine, natural log and polynomials.
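As a rough illustration of how derivatives drive "learning", the sketch below minimises a toy function with gradient descent. The function `f`, the learning rate and the step count are arbitrary choices for the example, not values from any particular algorithm or library.

```python
def f(x):
    """Toy loss function with its minimum at x = 3."""
    return (x - 3) ** 2


def df(x):
    """Analytical derivative of f, obtained with the chain rule."""
    return 2 * (x - 3)


def numerical_derivative(func, x, h=1e-6):
    """Central-difference approximation of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to move towards a minimum."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


x_min = gradient_descent(df, x0=0.0)
print(f"estimated minimum at x = {x_min:.6f}")
print(f"analytical vs numerical derivative at x = 1: "
      f"{df(1.0):.4f} vs {numerical_derivative(f, 1.0):.4f}")
```

The derivative tells you which direction is "downhill", and the learning rate controls how big a step you take. Training a neural network is the same idea with many parameters and partial derivatives instead of one.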
Linear Algebra
Linear algebra is a mathematical field that deals with vectors, matrices, and their transformations.
You should learn:
- Vectors: their magnitude, direction and components, plus operations such as the dot product and cross product.
- Matrices and their operations, including trace, inverse, transpose and matrix multiplication.
- Learn how to solve systems of linear equations through techniques like elimination, row reduction, and Cramer’s rule.
- Gain an understanding of eigenvalues and eigenvectors. These are the foundation of techniques like Principal Component Analysis, which helps reduce dimensionality in datasets.
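Here is a small pure-Python sketch of two items from the list: solving a 2x2 linear system with Cramer's rule, and finding the eigenvalues of a symmetric 2x2 matrix from its characteristic polynomial. The helper names and the example matrix are made up for illustration. In real work you would normally reach for a numerical library rather than hand-rolling this.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def solve_cramer_2x2(a, b):
    """Solve a @ x = b for a 2x2 matrix a using Cramer's rule."""
    d = det2(a)
    if d == 0:
        raise ValueError("matrix is singular, no unique solution")
    # Replace one column of a with b at a time and take determinants.
    x0 = det2([[b[0], a[0][1]], [b[1], a[1][1]]]) / d
    x1 = det2([[a[0][0], b[0]], [a[1][0], b[1]]]) / d
    return x0, x1


def eigenvalues_symmetric_2x2(m):
    """Roots of the characteristic polynomial: lambda^2 - trace*lambda + det = 0.

    For a symmetric matrix the discriminant is never negative,
    so both eigenvalues are real.
    """
    trace = m[0][0] + m[1][1]
    root = math.sqrt(trace ** 2 - 4 * det2(m))
    return (trace + root) / 2, (trace - root) / 2


A = [[2.0, 1.0], [1.0, 2.0]]
b = [3.0, 3.0]

solution = solve_cramer_2x2(A, b)
eigenvalues = eigenvalues_symmetric_2x2(A)
print("solution:", solution)
print("eigenvalues:", eigenvalues)
```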
How To Learn
In previous videos, I recommended some textbooks which, while useful, were quite dense and not practical for most people to get through in just a few months.
That’s why I now suggest taking the Mathematics for Machine Learning and Data Science Specialization on Coursera.
This course is tailored specifically for data science with exercises in Python. It skips the unnecessary theory and focuses on what you actually need for real-world work.
There are two, and only two, programming languages you need: Python and SQL.
What To Learn
Python
Keep it simple and learn the fundamentals:
- Variables and data types
- Boolean and comparison operators
- Control flow and conditionals
- For and while loops
- Functions and classes
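If you are wondering how little code it takes to touch every item above, here is a short sketch. The `Reading` class, the cities and the temperatures are all invented for the example.

```python
class Reading:
    """A single temperature reading (hypothetical example class)."""

    def __init__(self, city: str, celsius: float):
        self.city = city        # a string variable
        self.celsius = celsius  # a float variable

    def is_hot(self, threshold: float = 25.0) -> bool:
        # Comparison operator returning a boolean.
        return self.celsius > threshold


def describe(reading: Reading) -> str:
    """Control flow with if / elif / else."""
    if reading.celsius < 10:
        return "cold"
    elif reading.is_hot():
        return "hot"
    else:
        return "mild"


readings = [Reading("Oslo", 4.0), Reading("Rome", 28.5), Reading("Paris", 18.0)]

# For loop: label every reading.
labels = {}
for reading in readings:
    labels[reading.city] = describe(reading)

# While loop: count how many readings come before the first hot one.
index = 0
while index < len(readings) and not readings[index].is_hot():
    index += 1

print(labels, index)
```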
You also want to learn specific scientific computing libraries:
SQL
You want to learn all the fundamental functions needed for analysis in SQL. It’s quite a small language, so there aren’t many things to learn.
- SELECT * FROM (standard query)
- ALTER, INSERT, CREATE (modify tables)
- GROUP BY, ORDER BY
- WHERE, AND, OR, BETWEEN, IN, HAVING (filter tables)
- AVG, COUNT, MIN, MAX, SUM (aggregate functions)
- FULL JOIN, LEFT JOIN, RIGHT JOIN, INNER JOIN, UNION
- CASE (if statements)
- DATEADD, DATEDIFF, DATEPART (date and time functions; exact names vary by SQL dialect)
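To see several of these keywords working together, here is a sketch that runs SQL against an in-memory SQLite database through Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are made up. `COALESCE` is not in the list above but is used here to turn a missing total into 0. The date functions are left out because SQLite does not provide `DATEADD`, `DATEDIFF` or `DATEPART`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE and INSERT: build two small, made-up tables.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
cur.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ana"), (2, "Ben"), (3, "Cy")],
)
cur.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 30.0)],
)

# LEFT JOIN + GROUP BY + aggregates + CASE + ORDER BY.
summary_query = """
SELECT
    c.name,
    COUNT(o.id) AS n_orders,
    COALESCE(SUM(o.amount), 0) AS total_spent,
    CASE WHEN COALESCE(SUM(o.amount), 0) >= 100 THEN 'high' ELSE 'low' END AS tier
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY total_spent DESC
"""
summary_rows = cur.execute(summary_query).fetchall()

# WHERE filters rows before grouping; HAVING filters groups after aggregation.
repeat_query = """
SELECT customer_id, AVG(amount) AS avg_amount
FROM orders
WHERE amount > 20
GROUP BY customer_id
HAVING COUNT(*) >= 2
"""
repeat_rows = cur.execute(repeat_query).fetchall()

for row in summary_rows:
    print(row)
print(repeat_rows)

conn.close()
```

The `LEFT JOIN` keeps customers with no orders, which is why `COUNT(o.id)` rather than `COUNT(*)` is used: it counts only matching order rows.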
How To Learn
There are many introductory Python and SQL courses, and they all teach the same material. So, choose one and get going with it. You literally can’t go wrong here.
If you want a recommendation, then check out W3Schools or freeCodeCamp videos. I have used both and found them very good.
As well as Python and SQL, you need to invest some time learning other technologies that are used on the job.
What To Learn
There are so many tools, and every company is different, but these are the ones that remain consistent throughout:
- Git and GitHub — Virtually every company uses this for version control, so you need to learn it; there’s no way around it, I am afraid.
- Bash/Zsh — You will work in the terminal a lot, and the majority of companies rely on UNIX-like systems, so you need to be comfortable operating in the command line.
- Poetry / PyEnv / UV — Managing packages and Python versions is crucial in any real-world application, so it’s well worth getting familiar with these tools.
How To Learn
For git, I recommend this crash course from freeCodeCamp:
For learning terminal and bash shell scripting, I also recommend this video from freeCodeCamp.
And for learning PyEnv, Poetry and UV, check out these articles:
Right, time for the fun stuff!
Machine learning is a vast field, and we can’t learn everything, even if we tried our whole lives.
To be a data scientist, like I always say, we only need to know the fundamentals and a little bit of deep learning.
Forget learning LLMs, transformers, diffusion models, etc. That is not necessary for the majority of entry-level positions, and to be honest, for many jobs in general.
Focus on nailing the basics, as they carry over into everything else. To this day, I still use basic regression models, as do many senior machine learning engineers I work with.
It’s all about the application and understanding your problem, rather than trying to be flashy by using the latest state-of-the-art technology when it is not needed.
What To Learn
The key algorithms and concepts you should learn are:
- Linear, logistic and polynomial regression.
- Decision trees, random forests and gradient-boosted trees.
- Support vector machines.
- Regular neural networks.
- K-means clustering and K-nearest neighbours (the latter is a supervised method, not a clustering one).
- Regularisation, bias vs variance tradeoff and cross-validation.
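As a taste of the basics, here is a from-scratch sketch of simple linear regression (ordinary least squares) evaluated with k-fold cross-validation. The data is made up to roughly follow y = 2x + 1, and the fold assignment (every k-th point) is a simplification chosen to keep the example deterministic. Libraries such as scikit-learn offer more robust versions of both steps.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_cv(xs, ys, k=3):
    """Average held-out MSE across k folds (no shuffling, for determinism)."""
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [x for i, x in enumerate(xs) if i in test_idx]
        test_y = [y for i, y in enumerate(ys) if i in test_idx]
        slope, intercept = fit_line(train_x, train_y)
        scores.append(mse(test_x, test_y, slope, intercept))
    return sum(scores) / k


# Made-up data roughly following y = 2x + 1 with small deviations.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2, 16.8]

slope, intercept = fit_line(xs, ys)
cv_mse = k_fold_cv(xs, ys, k=3)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, cv_mse={cv_mse:.4f}")
```

The point of cross-validation is that each fold's error is measured on data the model did not see during fitting, which gives a more honest estimate of how it will generalise.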
How To Learn
The following two resources are all you need. So, work through them iteratively, and your machine learning knowledge will surpass that of most practitioners in the industry. Trust me.
The first ML course I took was the Machine Learning Specialisation by Andrew Ng, and I think it is probably the best one out there. You could get away with just doing this one on its own, as it’s that good.
The second one is probably the best machine learning book ever written: Hands-On ML with Scikit-Learn, Keras, and TensorFlow (affiliate link). If I had to give only one book to learn machine learning, this would be it!
In my opinion, this is optional, but I know many of you are interested in deep learning, so I have included it here for completeness.
I personally wouldn’t waste too much time here, as it can be easy to get lost in all the latest developments.
What To Learn
These deep learning concepts have stood the test of time, so they are well worth investing your learning in:
How To Learn
These are the resources I have used to learn deep learning, and they are all you need.
Deep Learning Specialization by Andrew Ng. — This is the follow-on course from the Machine Learning Specialisation and will teach all you need to know about deep learning, CNNs, and RNNs.
Again, the Hands-On ML with Scikit-Learn, Keras, and TensorFlow (affiliate link) textbook has an excellent deep learning section from chapter 14 onwards.
Finally, some of you may have heard of Andrej Karpathy. If you haven’t, he is probably one of the best AI researchers at the moment and has worked at Tesla and OpenAI.
Anyway, his Neural Networks: Zero to Hero YouTube course is phenomenal and teaches you how to build your own Generative Pre-trained Transformer (GPT) from scratch.
If you go through everything in this article, you will have excellent knowledge to enter the data science field.
However, having this knowledge is not enough; you need to build a solid portfolio to land a job.
That’s why I recommend checking out my previous article, where I explain the exact projects you need to build to secure a job as soon as possible.
See you there!
STOP Building Useless ML Projects – What Actually Works | Towards Data Science
How to find machine learning projects that will get you hired. (towardsdatascience.com)