Exploratory data analysis and data visualization often include inspecting a dataset’s distribution. Doing so reveals the data’s range, central tendency, and skew, as well as any outliers or unusual groupings. Comparing subsets of the data can reveal even more. A well-built visualization of a dataset’s distribution makes these insights visible at a glance. This guide details several options for quickly creating clean, meaningful distribution visualizations in Python.
Visualizations covered:
- Histograms
- KDE (Density) Plots
- Joy Plots or Ridge Plots
- Box Plots
- Violin Plots
- Strip and Swarm Plots
- ECDF Plots
Data and Code:
This article uses completely synthetic weather data generated following the concepts in one of my previous articles. The data for this article and the full Jupyter notebook are available at this linked GitHub page. Feel free to download both and follow along, or reference the code blocks below.
The libraries, imports, and settings used in this article are as follows:
# Data Handling:
import pandas as pd
from pandas.api.types import CategoricalDtype

# Data Visualization Libraries:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from joypy import joyplot
# Display Configuration:
%config InlineBackend.figure_format='retina'
First, let’s load in and prepare the data, which is a simple synthetic weather dataframe showing various temperature readings for 3 cities across the 4 seasons.
# Load data:
df = pd.read_csv('weatherData.csv')

# Set season as a categorical data type:
season = CategoricalDtype(['Winter', 'Spring', 'Summer', 'Fall'])
df['Season'] = df['Season'].astype(season)
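To see the categorical conversion in isolation, here is a minimal sketch on a small made-up dataframe. The `demo` dataframe, its `Temp` column, and its values are hypothetical stand-ins rather than the article's actual data. It compares how the `Season` column sorts before and after the conversion.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical stand-in for the weather data; these values are made up.
demo = pd.DataFrame({
    'Season': ['Fall', 'Winter', 'Summer', 'Spring'],
    'Temp': [55.0, 30.0, 85.0, 60.0],
})

# As plain strings, the seasons sort alphabetically.
alphabetical = demo.sort_values('Season')['Season'].tolist()
print(alphabetical)  # ['Fall', 'Spring', 'Summer', 'Winter']

# Same conversion as above: the category list defines the order.
season = CategoricalDtype(['Winter', 'Spring', 'Summer', 'Fall'])
demo['Season'] = demo['Season'].astype(season)

# As a categorical, the seasons sort in the order the categories were listed.
calendar = demo.sort_values('Season')['Season'].tolist()
print(calendar)  # ['Winter', 'Spring', 'Summer', 'Fall']
```

Sorting follows the listed category order even though the dtype is not declared with `ordered=True`. That flag is only needed for comparisons and for `min`/`max` on the column.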
Note that the code sets the Season column to a categorical data type. This will…