Coding skills are just as essential as mathematics for thriving as a data scientist. Coding skills help develop your problem-solving and critical-thinking abilities. Python and SQL are the most important coding skills you must possess.
3.1 Python
Python is the most widely used programming language in data science due to its simplicity, versatility, and powerful libraries.
What will you have to do?
- Your first target should be learning basic data structures like strings, lists/arrays, and dictionaries, along with core Object-Oriented Programming (OOP) concepts like classes and objects. Become an expert in these two areas.
- Knowledge of advanced data structures like trees, graphs, and traversal algorithms is a plus point.
- You must be proficient in time and space complexity analysis. It’ll help you write efficient code in practice. Learning the basic sorting and searching algorithms can help you gain a sufficient understanding of time and space complexity.
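To make complexity analysis concrete, here is a minimal sketch in plain Python (no external libraries) contrasting linear search, which is O(n), with binary search, which is O(log n) on sorted data:

```python
def linear_search(items, target):
    """O(n) time: may have to scan every element."""
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1


def binary_search(sorted_items, target):
    """O(log n) time: halves the search space each step (input must be sorted)."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


data = list(range(0, 100, 2))       # sorted even numbers 0..98
print(linear_search(data, 42))      # → 21
print(binary_search(data, 42))      # → 21
print(binary_search(data, 7))       # → -1 (7 is not in the list)
```

Both functions return the same answer, but on a million sorted elements the binary search needs about 20 comparisons while the linear search may need a million. That gap is exactly what complexity analysis captures.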
Python has the best data science library collection. Two of the most essential libraries are —
- NumPy — This library supports efficient operations on vectors and matrices.
- Pandas/PySpark — Pandas is a powerful data frame library for data manipulation and analysis. It can handle structured data formats like `.csv`, `.parquet`, and `.xlsx`. Pandas dataframes support operations that simplify tasks like filtering, sorting, and aggregating data. The Pandas library is good for handling small datasets. The PySpark library is used to handle big data. It supports a variety of SQL operations (discussed later in the article), making it ideal for working with large datasets in distributed environments.
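As a minimal illustration of how the two libraries divide the work (assuming NumPy and Pandas are installed), NumPy handles vectorized math while Pandas handles labeled, tabular data:

```python
import numpy as np
import pandas as pd

# NumPy: element-wise operations on whole arrays, no Python loops needed
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([3, 1, 2])
revenue = prices * quantities        # → array([30., 20., 60.])

# Pandas: the same numbers as a labeled DataFrame
df = pd.DataFrame({"product": ["A", "B", "C"], "revenue": revenue})

# Filtering, sorting, and aggregating are one-liners
high = df[df["revenue"] > 25].sort_values("revenue", ascending=False)
print(high["product"].tolist())      # → ['C', 'A']
print(df["revenue"].sum())           # → 110.0
```

The same boolean-filter, sort, and aggregate patterns carry over almost verbatim to PySpark dataframes, just executed across a cluster instead of one machine.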
Beyond these, there are several other libraries you’ll encounter and use regularly —
- Scikit-learn — A go-to library for implementing machine learning algorithms, data preprocessing, and model evaluation.
- PyTorch — A deep learning framework widely used for building and training neural networks.
- Matplotlib and Seaborn — Libraries for data visualization, allowing you to create plots, charts, and graphs to visualize and understand data.
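To show how little code these libraries demand, here is a hedged sketch of the standard scikit-learn workflow (assuming scikit-learn is installed) on its built-in Iris toy dataset; the fit/score pattern is the same across most of its estimators:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a built-in toy dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier, then evaluate accuracy on the held-out data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```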
As a beginner, mastering every library isn’t a requirement. There are countless domain-specific libraries, like OpenCV, statsmodels, and Transformers, that you’ll pick up naturally through hands-on practice. Learning to use libraries is one of the easiest parts of data science and becomes second nature as you work on more projects. There’s no need to memorize functions — honestly, I still google various Pandas and PySpark functions all the time! I’ve seen many aspirants focus solely on libraries. While libraries are important, they’re just a small part of your toolkit.
3.2 SQL
SQL (Structured Query Language) is a fundamental tool for data scientists, especially when working with large datasets stored in relational databases. In many industries, data lives in relational databases, and SQL is how you talk to them, making it one of the most important skills to hone when starting your data science journey. SQL allows you to query, manipulate, and retrieve data efficiently, which is often the first step in any data science workflow. Whether you’re extracting data for exploratory analysis, joining multiple tables, or performing aggregate operations like counting, averaging, and filtering, SQL is the go-to language.
I had only a basic understanding of SQL queries when I started my career. That changed when I joined my current company, where I began using SQL professionally. I worked with industry-level big data, ran SQL queries to fetch data, and gained hands-on experience.
The following SQL statements and operations are important —
Basic —
- Extraction — The `select` statement is the most basic statement in SQL querying.
- Filtering — The `where` keyword is used to filter data as per conditions.
- Sorting — The `order by` keyword is used to sort the data in either `asc` (ascending) or `desc` (descending) order.
- Joins — As the name suggests, SQL joins help you combine multiple tables in your SQL database. SQL has different types of joins: `left`, `right`, `inner`, `outer`, etc.
- Aggregation Functions — SQL supports various aggregation functions such as `count()`, `avg()`, `sum()`, `min()`, and `max()`.
- Grouping — The `group by` keyword is often used with an aggregation function.
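All of these basics fit in one short script. Here is a minimal sketch using Python’s built-in `sqlite3` module with a made-up `orders`/`customers` schema (the table and column names are illustrative, not from any real system):

```python
import sqlite3

# In-memory database with two small illustrative tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (id integer, customer text, amount real);
    create table customers (name text, city text);
    insert into orders values
        (1, 'alice', 50.0), (2, 'bob', 20.0),
        (3, 'alice', 30.0), (4, 'carol', 70.0);
    insert into customers values
        ('alice', 'Pune'), ('bob', 'Delhi'), ('carol', 'Mumbai');
""")

# select + where + order by: extraction, filtering, sorting
rows = conn.execute(
    "select customer, amount from orders where amount > 25 order by amount desc"
).fetchall()
print(rows)    # → [('carol', 70.0), ('alice', 50.0), ('alice', 30.0)]

# inner join: combine the two tables on the customer name
joined = conn.execute(
    "select o.id, c.city from orders o "
    "inner join customers c on o.customer = c.name order by o.id"
).fetchall()
print(joined)  # → [(1, 'Pune'), (2, 'Delhi'), (3, 'Pune'), (4, 'Mumbai')]

# group by + aggregation: total amount and order count per customer
totals = conn.execute(
    "select customer, sum(amount), count(*) from orders "
    "group by customer order by customer"
).fetchall()
print(totals)  # → [('alice', 80.0, 2), ('bob', 20.0, 1), ('carol', 70.0, 1)]
```

The same statements run unchanged against most relational databases; only the connection setup differs.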
Advanced —
- Window Functions — Window functions are a powerful feature in SQL that allows you to perform calculations across a set of table rows related to the current row. Once you are proficient with the basic SQL queries mentioned above, familiarize yourself with window functions such as `row_number()`, `rank()`, `dense_rank()`, `lead()`, and `lag()`. Aggregation functions can also be used as window functions. The `partition by` keyword is used to partition the set of rows (called the window) before the window operation is performed.
- Common Table Expressions (CTEs) — CTEs make SQL queries more readable and modular, especially when working with complex subqueries or recursive queries. They are defined using the `with` keyword.
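Both concepts can be tried in the same stdlib setup. A hedged sketch, again via Python’s `sqlite3` with an invented `scores` table (note that window functions require SQLite 3.25 or newer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table scores (student text, subject text, score integer);
    insert into scores values
        ('alice', 'math', 90), ('bob', 'math', 85), ('carol', 'math', 85),
        ('alice', 'physics', 70), ('bob', 'physics', 95);
""")

# Window function: rank() within each subject, without collapsing rows.
# Note how bob and carol tie at rank 2 in math.
ranked = conn.execute("""
    select subject, student,
           rank() over (partition by subject order by score desc) as rnk
    from scores
    order by subject, rnk, student
""").fetchall()
print(ranked)

# CTE: name an intermediate result with `with`, then query it like a table
top = conn.execute("""
    with subject_best as (
        select subject, max(score) as best from scores group by subject
    )
    select s.subject, s.student
    from scores s
    join subject_best b on s.subject = b.subject and s.score = b.best
    order by s.subject
""").fetchall()
print(top)  # → [('math', 'alice'), ('physics', 'bob')]
```

Contrast the window query with a `group by`: grouping collapses each partition to one row, while a window function keeps every row and annotates it.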
You’ll often use Python’s PySpark library in conjunction with SQL. PySpark has APIs for all SQL operations and helps integrate SQL and Python. You can perform various SQL operations on PySpark dataframes in Python seamlessly!
3.3 Practice, Practice, Practice
- Rigorous practice is key to mastering coding skills, and platforms like LeetCode and GeeksForGeeks offer great tutorials and exercises to improve your Python skills.
- SQLZOO and w3schools are great platforms to start learning SQL.
- Kaggle is the best place to combine your ML and coding skills to solve ML problems. It’s important to get hands-on experience. Pick up any contest. Play with the dataset and apply the skills you learn from the lectures.
- Implementing ML algorithms without using special ML libraries like scikit-learn or PyTorch is a great self-learning exercise. Writing code from scratch for basic algorithms like PCA, gradient descent, and linear/logistic regression can help you enhance your understanding and coding skills.
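As one example of such an exercise, here is a minimal from-scratch linear regression trained by gradient descent, using only NumPy (no scikit-learn); the data is synthetic, generated from a known line so we can check the fit:

```python
import numpy as np

# Toy data from a known ground truth: y = 2x + 1, plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 1.0 + rng.normal(0, 0.01, size=100)

# Fit y = w*x + b by minimizing mean squared error with gradient descent
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    error = (w * X + b) - y          # residuals of current prediction
    grad_w = 2.0 * np.mean(error * X)  # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)      # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # converges close to the true parameters (2.0, 1.0)
```

Deriving `grad_w` and `grad_b` by hand, then watching the loop recover the true slope and intercept, teaches more about gradient descent than any number of `model.fit()` calls.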
During my Master’s in AI course at the Indian Institute of Science, Bengaluru, we had coding assignments where we implemented algorithms in C! Yes C! One of these assignments was about training a deep neural network for MNIST digits classification.
I built a deep neural network from scratch in C. I created a custom data structure for storing weights and wrote algorithms for gradient descent and backpropagation. I felt immense satisfaction when the C code ran successfully on my laptop’s CPU. My friend mocked me for doing this “impractical” exercise and argued that we have highly efficient libraries for such a task. Although my code was inefficient, writing the code from scratch deepened my understanding of the internal mechanics of deep neural networks.
You’ll eventually use libraries for your projects in academia and industry. However, as a beginner, jumping straight into libraries can prevent you from fully understanding the fundamentals.