When it comes to machine learning, exploratory data analysis (EDA) is one of the first things you need to do once you’ve collected and loaded your data into Python.
EDA involves:
- Summarizing data via descriptive statistics
- Visualizing data
- Identifying patterns, detecting anomalies, and generating hypotheses
Through EDA, data scientists gain a deeper understanding of their data, enabling them to assess data quality and prepare for more complex machine learning tasks.
But when you’re first starting out, it can be hard to know where to begin.
Here are five simple Python one-liners that can kickstart your EDA process.
1. df.info()
This is a must for every EDA process. In fact, it is always the first line of code I run after I’ve loaded my data into a pandas DataFrame (df).
It tells you:
- The names of columns
- How many non-null values are in each column
- The data types of the columns
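To see what that looks like in practice, here is a minimal sketch. The DataFrame below is hypothetical toy data made up for illustration (the column names and values are assumptions, not from a real dataset), with a couple of missing values added on purpose so the non-null counts differ between columns.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this example only.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dev"],
    "age": [34, np.nan, 29, 41],            # one missing value
    "city": ["Lima", "Oslo", None, "Pune"],  # one missing value
})

# The one-liner: prints column names, non-null counts, and dtypes.
df.info()

# Optional: df.info() prints rather than returns its summary,
# so pass a buffer if you want the text as a string (e.g. for logging).
buf = io.StringIO()
df.info(buf=buf)
info_text = buf.getvalue()
```

In this toy example, the age and city columns each report 3 non-null values out of 4 rows, which is a quick hint that those columns contain missing data worth a closer look.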