Exploring Income Patterns with Python Pandas, Matplotlib, and Seaborn

Contents

The Project
The Dataset
Initializing the Coding Environment
Installing and Importing Relevant Libraries
Loading the Dataset & Basic Analysis
Data Cleaning
General Income Analysis
Income in relation to Education Analysis
Income in relation to Hours per Week Analysis
Income in relation to Gender Analysis
Income in relation to Workclass Analysis
Income in relation to Occupation Analysis
Income in relation to Age Analysis
Key Findings & Conclusion

When it comes to earnings and income, we tend to credit success to hard work and intelligence. Other times, we assume that certain people got lucky and, despite a sub-standard education or a lack of expertise, managed to succeed in their profession and earn comfortably. The truth, however, lies somewhere between these two extremes. Yes, some people do get lucky and become millionaires at a young age, but we also see people working hard to climb the career ladder, putting in effort wherever needed to grow professionally and thus increase their income.

In this article, we will use Python to explore the relationship of income with different factors, namely age, gender, profession, level of education, etc. Even in today’s age, when it comes to plotting graphs and deriving insights, it is very important that we know how to extract insights from raw data by combining human analysis with computing power. This requires certain prerequisite knowledge of Python fundamentals. By using Python and its powerful data processing libraries, we will identify some predictable patterns that help us derive insights into the factors that generally influence income, according to the dataset we will use!

The Project

In this project, we will dive into a census dataset with the help of Python, and use some of its powerful data analysis libraries, namely pandas, matplotlib, and seaborn, to uncover income patterns. With the help of data cleaning tools, data visualization, and exploratory analysis, we will convert this raw data into valuable insights about which factors influence income and to what extent. This is a beginner-to-intermediate level Python programming project that expects you to know Python fundamentals, especially how to import and use functions from different libraries for data exploration and analysis.

The Dataset

In this project, we will use the Adult Census Income Dataset, which is a real-world dataset derived from US census data. Although this dataset dates back to the 1990s, we can still use it to derive income patterns, keeping in mind that things have changed in 30 years, especially in terms of the gender gap, which was very pronounced at the time. This dataset contains demographic and employment-related information, including each person’s age, occupation, level of education, marital status, gender, working hours, etc., most of which is valuable for the purpose of our project. This dataset is publicly available and commonly used for educational and research projects.

Dataset: Adult Census Income Dataset
Source: UCI Machine Learning Repository (CC BY 4.0)
Original data derived from the U.S. Census Bureau database.

Now, let us get started!

Initializing the Coding Environment

Before we begin, let us make sure that we have our coding environment properly set up. For this, ensure that Python is installed on your system and open an IDE of your choice. I will be using PyCharm for its beginner-friendly nature and package accessibility.

First, let us create a new project called “Adults Income Pattern Analysis” and create a Python file main.py. This is where we will do our coding.

Installing and Importing Relevant Libraries

Next, let us install the relevant libraries/Python packages. We will be using the following libraries for data exploration and analysis:

Pandas – this is one of the most popular libraries that helps one work with tabular data like CSV files
Matplotlib – this Python library allows one to create graphs, charts, and other data visualizations
Seaborn – this is a library built on matplotlib, that extends data visualization to a large degree, making charts and graphs easier to make and prettier.

We will install the above libraries using the terminal option in PyCharm (search for how to install for your particular IDE).

pip install pandas matplotlib seaborn

Once the installation is complete, we will go forward and import these libraries into our main.py file.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Loading the Dataset & Basic Analysis

Now, we will load the dataset into a dataframe stored in a variable called df. Using read_csv() is the standard pandas way of loading a CSV file into a dataframe for later use:

df = pd.read_csv("https://huggingface.co/api/resolve-cache/datasets/scikit-learn/adult-census-income/fbeef6ec0e6fd88a5028b94683144000a6b380d5/adult.csv?%2Fdatasets%2Fscikit-learn%2Fadult-census-income%2Fresolve%2Fmain%2Fadult.csv=&etag=%225cf74ede1a6de37d85c96a61d30819a694dee749%22")
print(df.head())

df.head() (Image by Author)

As you can see, we first loaded the dataset from the URL in the code above, and then used the df.head() method to print the first 5 rows and get an idea of what the data looks like. We can see some column names: age, workclass, fnlwgt, hours.per.week, native.country and income. The ellipsis (...) between fnlwgt and hours.per.week indicates that other columns exist but could not be displayed in the limited width of the output screen. This can be solved with an extra line of code (we will look into this later).

Now, let us see how many rows and columns our dataset has. We will do this using the df.shape command that will output the number of rows and columns, so we get an idea of how big the dataset is that we are dealing with.

print(df.shape)

Lastly, let us get a detailed summary of the columns, including the type of data they store. Note that df.info() prints its report directly, so it does not need to be wrapped in print():

df.info()

As can be seen from the above results, our dataset is a mix of categorical and numerical features with the following column names:

Column Name	Meaning
`age`	Age of the individual in years
`workclass`	Type of employment/work sector: `Private` → private company employee; `Self-emp-not-inc` → self-employed (not incorporated); `Federal-gov`, `Local-gov`, `State-gov` → government employee; `Without-pay` → unpaid worker
`fnlwgt`	Final sampling weight assigned by the Census Bureau
`education`	Highest education level completed, e.g. `Bachelors`, `HS-grad`, `Masters`, `Doctorate`
`education.num`	Numerical representation of education level: `HS-grad` → 9; `Bachelors` → 13; `Masters` → 14
`marital.status`	Marital status of the person
`occupation`	Type of job/profession
`relationship`	Relationship status within the household, e.g. `Husband`, `Wife`, `Own-child`, `Not-in-family`
`race`	Race category
`sex`	Gender
`capital.gain`	Profit earned from investments/assets
`capital.loss`	Loss incurred from investments/assets
`hours.per.week`	Average working hours per week
`native.country`	Country of origin
`income`	Income category (target variable): `<=50K` or `>50K`
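The split between numerical and categorical features described above can also be listed programmatically. The following is a minimal sketch on a tiny made-up dataframe (the rows are illustrative placeholders, not census records); on the real data you would call the same method on df.

```python
import pandas as pd

# Made-up rows for illustration only; the column names follow the table above.
sample = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "hours.per.week": [40, 13],
    "income": ["<=50K", "<=50K"],
})

# Numeric columns (integers and floats).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical text here.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

Knowing which columns are categorical helps decide between count-based charts and distribution-based charts such as boxplots.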

Now, let us display the first 5 rows with all of their columns. Run the code below to get the first 5 rows without truncation:

print(df.head().to_string())

df.head() Complete Row (Image by Author)
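As an alternative to to_string(), pandas display options can lift the column limit. The sketch below uses pd.option_context on a made-up wide dataframe (wide_df and its column names are placeholders for illustration), so the setting only applies inside the with block.

```python
import pandas as pd

# Hypothetical stand-in for a wide dataframe: 1 row, 30 columns.
wide_df = pd.DataFrame({f"col_{i}": [i] for i in range(30)})

# Temporarily remove the column limit and widen the output;
# both options revert to their previous values when the block exits.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full_text = repr(wide_df.head())
    print(full_text)
```

To keep the setting for the whole script instead, pd.set_option("display.max_columns", None) can be called once after the imports.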

Data Cleaning

Now that we have a general overview of what the data looks like, we can move on to the analysis part. But before that, it is very important that our data is clean and valuable for deriving insights; in other words, we don’t want our data to be inaccurate or have missing values, which could skew the analysis. For this reason, we will clean the data by removing rows with missing values.

In this dataset, missing values are recorded as question marks (?). We will use the pandas replace() method to convert the question marks to NA and then drop the rows containing missing values from our dataframe using the dropna() method. So any row with a missing occupation, missing education information, or similar will be deleted from the dataframe. The following code also prints the number of rows before and after cleaning, so we can see how many rows get dropped:

print("Before cleaning:", len(df))
df = df.replace("?", pd.NA).dropna()
print("After cleaning:", len(df))

As can be seen, now that our dataset has been cleaned, we are left with only 30,162 rows from the initial 32,561 rows.
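Before dropping rows, it can be useful to see which columns actually contain the "?" placeholder. Here is a minimal sketch on a small made-up sample (the rows are invented for illustration); the same two steps should work on df.

```python
import pandas as pd

# Tiny made-up sample that mimics the "?" placeholder convention.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", "?", "Private", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
})

# Count "?" placeholders per column before converting them.
placeholder_counts = sample.isin(["?"]).sum()
print(placeholder_counts)

# Convert placeholders to missing values, then drop incomplete rows.
cleaned = sample.replace("?", pd.NA).dropna()
print("Rows kept:", len(cleaned))
```

If the placeholders turn out to be concentrated in one or two columns, that is worth knowing, since dropping rows removes the valid values in the other columns as well.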

General Income Analysis

Let us begin our data analysis from here! We know that the income category is stored in the column income, which is either <=50K (50k or less) or >50K (more than 50k). Also note that this is data from the 1994 census, so don’t be shocked at the numbers!

Let us visualize the data regarding the income:

# Income graph
sns.countplot(x="income", data=df)
plt.show()

We have used Python’s seaborn library to create a count plot of the income.

As can be seen from the above graph, most of the individuals earned less than 50k, while a smaller percentage earned above 50k. The graph not only highlights an imbalanced income distribution, but also provides a snapshot of the economic structure of the US during the early 1990s, where $50k was considered a relatively strong annual salary.
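The imbalance seen in the count plot can also be expressed as percentages. This is a small sketch using value_counts(normalize=True) on a made-up series (four invented values, not real census results); on the real data the same call would be applied to df["income"].

```python
import pandas as pd

# Made-up income labels for illustration only.
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

# normalize=True returns fractions that sum to 1; convert them to percentages.
shares = income.value_counts(normalize=True).mul(100).round(1)
print(shares)
```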

Income in relation to Education Analysis

Now, let us see how education level impacts income. This seems to be of particular interest because it is generally seen that people who have higher education tend to earn a lot more than people who have minimal or lesser education. Let’s check if the data verifies this:

# Education & Income Relationship
result = df.groupby("education")["income"].value_counts()
print(result)

Education & Income Relationship (Image by Author)

# Education & Income Relationship
result = df.groupby("education")["income"].value_counts().unstack()
# Plot
result.plot(kind="bar", figsize=(12,6))
# Labels and title
plt.title("Education vs Income")
plt.xlabel("Education Level")
plt.ylabel("Count")
plt.xticks(rotation=45)
# Show graph
plt.show()

The above results suggest that higher education is generally associated with higher income. Since this pattern is hard to read from the raw pandas groupby() output, we have used the pandas plot() method (which draws with matplotlib) to create a grouped bar chart that shows how the different education levels relate to the income range of individuals.

It is thus seen that:

Very few people with an education level lower than “some college” earn more than 50k.
Higher education strongly correlates with higher income, with most of the 50k+ individuals having a bachelor’s, a master’s, a college degree, etc.
High school education dominates the lower-income bracket, as shown by the tallest blue bar.
Professional degrees and doctorates have a taller orange bar than blue bar, implying that more of these individuals earn above 50k than below it, which makes sense as people are highly rewarded for technical specialty.
Interestingly, in the highly educated population, not everyone earns more than 50k, meaning there are other vital factors influencing income apart from just education level. Let us do further analysis with the other columns!
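Because the education groups differ greatly in size, raw counts can hide the share of high earners within each group. One way to check the observations above is a row-normalized cross-tabulation. The sketch below uses made-up rows (not census records) to show the idea; on the real data, df["education"] and df["income"] would be passed instead.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the census data.
sample = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "HS-grad", "HS-grad",
                  "Doctorate", "Doctorate"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# normalize="index" makes every row (education level) sum to 1,
# so each cell becomes the share of that group in an income category.
share = pd.crosstab(sample["education"], sample["income"], normalize="index") * 100
print(share.round(1))
```

The resulting table can be passed to plot(kind="bar", stacked=True) to compare groups on a common 0 to 100 scale.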

Income in relation to Hours per Week Analysis

Let us now see whether hard work is rewarded, that is, do people who work more hours tend to earn a higher income? The next few lines of code use a boxplot to check whether a relationship exists between income and hours per week:

# Boxplot of weekly working hours for each income category
sns.boxplot(x="income", y="hours.per.week", data=df)
plt.show()

As can be seen from the above boxplot, the population earning more than 50k has a higher median, a wider spread and more people working long hours. This supports our assumption that people who work more hours per week generally tend to have higher incomes than others. On the other hand, there is another interesting observation regarding the outliers of the left box: certain people work more than 70 hours and still earn less than 50k. This means that although higher-income individuals usually work more hours per week, long working hours alone do not guarantee a high salary. There are more factors than education level and hours per week, so let us jump to the next parameter!
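The values a boxplot draws can also be read as numbers. Below is a minimal sketch that summarizes weekly hours per income category with groupby() and agg(), using made-up rows (invented for illustration, not census results).

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
    "hours.per.week": [20, 40, 75, 40, 50, 60],
})

# One row per income category, one column per statistic.
summary = sample.groupby("income")["hours.per.week"].agg(["median", "mean", "max"])
print(summary)
```

In this toy sample the lower-income group has the lower median but the highest single value, which mirrors the kind of outlier discussed above.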

Income in relation to Gender Analysis

There has been a lot of discussion about the gender gap and pay discrimination between male and female employees working in the same role, but was this the case in the 1990s, and if so, to what extent?

Let us visualize this with the help of a bar chart:

result = df.groupby("sex")["income"].value_counts().unstack()

# Plot with custom colors
ax = result.plot(
    kind="bar",
    figsize=(10,6),
    color=["skyblue", "blue"]
)

# Add labels on bars
for container in ax.containers:
    ax.bar_label(container)

# Titles and labels
plt.title("Income Distribution by Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.xticks(rotation=0)

# Show graph
plt.show()

Income Distribution by Gender (Image by Author)

As can be seen from the above graph of income distribution by gender, there were significantly more males than females earning more than 50k. Also, most of the female population in the dataset falls into the less than 50k income category. We can conclude that the percentage of males earning a high income is greater than the percentage of females.

Income in relation to Workclass Analysis

Now, let us move forward and see how income varies across different work classes. For this purpose, let us first see the different work classes and then create a bar chart for visual analysis.

print(df["workclass"].unique())

# Select the 7 most common workclasses
top_workclasses = df["workclass"].value_counts().head(7).index

filtered_df = df[df["workclass"].isin(top_workclasses)]

# Create chart
plt.figure(figsize=(10,6))

sns.countplot(
    data=filtered_df,
    x="workclass",
    hue="income"
)

# Titles and labels
plt.title("Income Distribution Across Workclass Categories")
plt.xlabel("Workclass")
plt.ylabel("Number of Individuals")

plt.xticks(rotation=15)

plt.show()

Income Distribution across Workclass (Image by Author)

The bar chart above shows that most individuals in our dataset worked in the Private sector, making it the dominant employment category. Across almost all work classes, the number of people earning less than 50k is significantly higher than those earning more than 50k. Self-employed incorporated workers (Self-emp-inc) appear to have a relatively stronger high-income proportion compared to some government sectors.

Income in relation to Occupation Analysis

Now, let us see how occupation affected income through a simple print statement.

result = df[df["income"] == ">50K"]["occupation"].value_counts().head(10)
print(result)

In the code above, we have accessed the top 10 occupations by the number of people earning more than 50k. As can be seen in the output, the occupation with the most high earners was “Exec-managerial”, followed by “Prof-specialty”, etc. This suggests that executive roles, technical professions, and specialized skilled work have higher earning potential than others.
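Counting high earners favors occupations with many workers, so it can help to look at the share of high earners within each occupation as well. The sketch below does this with a boolean series and named aggregation; the rows are made up for illustration, and names such as by_occ and share_pct are placeholders rather than part of the original code.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the census data.
sample = pd.DataFrame({
    "occupation": ["Sales"] * 4 + ["Exec-managerial"] * 2,
    "income": ["<=50K", "<=50K", ">50K", ">50K", ">50K", ">50K"],
})

# True where the person earns more than 50k.
high = sample["income"].eq(">50K")

# Per occupation: number of high earners and total number of people.
by_occ = high.groupby(sample["occupation"]).agg(high_earners="sum", total="count")

# Share of high earners within each occupation, highest first.
by_occ["share_pct"] = by_occ["high_earners"] / by_occ["total"] * 100
by_occ = by_occ.sort_values("share_pct", ascending=False)
print(by_occ)
```

In this toy sample both occupations have the same number of high earners, yet their shares differ, which is exactly the distinction a count-only ranking cannot show.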

Income in relation to Age Analysis

Next, let us see how income was affected by age.

# Create figure
plt.figure(figsize=(10,6))

# Boxplot
sns.boxplot(
    x="income",
    y="age",
    data=df
)

# Titles and labels
plt.title("Age vs Income Pattern")
plt.xlabel("Income Category")
plt.ylabel("Age")

# Show graph
plt.show()

Age vs Income Patterns (Image by Author)

As can be seen in the image above, we have created a simple chart that shows 2 boxes, one representing the less than 50k income category and the other the greater than 50k category. We can see from the graph that younger people were mostly in the less than 50k category, with a few older outliers in this category as well. We can also see that the median age for high-income earners is noticeably higher, suggesting that income tends to increase with age and experience. This reflects how career growth, experience, and seniority often lead to higher salaries over time.
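Another way to look at the age pattern is to group ages into bands and compute the share of high earners in each band. This is a minimal sketch using pd.cut; the bin edges, the labels and the sample rows are assumptions chosen for illustration, not values from the article.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "age": [19, 24, 35, 42, 51, 58],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Assumed bin edges; intervals are right-inclusive: (16, 30], (30, 45], (45, 60].
bins = [16, 30, 45, 60]
labels = ["17-30", "31-45", "46-60"]
sample["age_band"] = pd.cut(sample["age"], bins=bins, labels=labels)

# Percentage of people earning more than 50k within each age band.
share = sample["income"].eq(">50K").groupby(sample["age_band"], observed=True).mean() * 100
print(share)
```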

Key Findings & Conclusion

In this article, we have analyzed the 1994 US census dataset to find trends in income with respect to different factors, namely: age, gender, education, occupation, workclass, hours per week, etc. The following were the key findings:

The majority of the population falls in the lower-income category.
Higher education generally increased the probability of a higher income.
Work hours do matter, but only slightly; working more hours per week does not guarantee a higher income!
Occupation is one of the strongest factors among higher-income individuals; certain job positions are strongly associated with increased income!
Income generally increases with age.

We have used Python’s pandas, matplotlib, and seaborn not only to clean the data, but also to analyze it with the help of plots and charts. We can conclude from this analysis that income is not determined by a single factor; rather, it is a combination of education, occupation, experience, and opportunity!