— I wasn’t actively looking for Polars.
I’ve been on a bit of a Pandas optimization journey lately. First, I wrote about why you should stop writing loops in Pandas and think in columns instead.
Then I went deeper: profiling real workflows, fixing vectorization mistakes, and cutting a 61-second pipeline down to 0.33 seconds using nothing but better Pandas and NumPy. That one surprised even me.
So I was in a good place with Pandas. I felt like I finally understood how to use it properly.
Then someone dropped a comment on one of my posts. Something along the lines of: “Have you tried Polars? It’s built for exactly this kind of thing.”
I’d seen the name floating around in the data community. There was buzz around it — something about speed, about a completely different way of thinking about data pipelines. But I’d never actually touched it.
That comment was enough to push me over the edge.
So I did what I always do. I got curious, I installed it, and I rewrote the exact same workflow from my last article, the one I’d already optimized in Pandas, in a tool I’d never used before.
What I found surprised me. Not just the speed numbers, but what Polars quietly teaches you about how data pipelines actually work.
Isn’t Pandas Enough?
Fair question.
In my last article, I took a slow Pandas pipeline and optimized it down to 0.33 seconds. Vectorized operations, correct data types, no unnecessary copies. The results were honestly better than I expected.
So why are we even talking about Polars?
Here’s the thing. Everything I did in that article was me doing the optimization manually. I had to know which operations were slow, why they were slow, and how to fix them. Polars does a lot of that thinking for you automatically, before it even runs your code.
On top of that, Polars is built on a completely different foundation than Pandas. It uses all your CPU cores by default. It manages memory differently. And it introduces a way of writing data pipelines that, once it clicks, changes how you think about the whole process.
Optimized Pandas is impressive. But it still has a ceiling. This article is about what’s on the other side of it.
The Workflow
To keep things consistent and useful for anyone who followed along with my last article, I'm using the same synthetic e-commerce dataset. One million rows. Nothing exotic. Just the kind of data you'd realistically encounter in the wild.
If you want to generate it yourself, here’s the setup code:
import pandas as pd
import numpy as np
import time
np.random.seed(42)
n = 1_000_000
regions = ['north', 'south', 'east', 'west']
categories = ['electronics', 'clothing', 'furniture', 'food', 'sports']
statuses = ['completed', 'returned', 'pending', 'cancelled']
df = pd.DataFrame({
    'order_id': np.arange(1000, 1000 + n),
    'order_date': pd.date_range(start='2022-01-01', periods=n, freq='1min'),
    'region': np.random.choice(regions, size=n),
    'category': np.random.choice(categories, size=n),
    'sales': np.random.randint(100, 10000, size=n),
    'quantity': np.random.randint(1, 20, size=n),
    'discount': np.round(np.random.uniform(0.0, 0.5, size=n), 2),
    'status': np.random.choice(statuses, size=n),
})
df.to_csv('large_sales_data.csv', index=False)
The pipeline we’re running on it is straightforward. The kind of thing that shows up in real workflows all the time:
- Fix data types upfront
- Calculate net revenue per order
- Flag high-value orders
- Aggregate total net revenue by region
Simple. Familiar. And already optimized in Pandas from the last article, which makes it the perfect benchmark for Polars.
The Pandas Version
I’m not going to show you the naive Pandas code here. I already did that in my last article — the version with three .apply() calls that took 61 seconds on this same dataset. If you haven’t read that one, it’s worth a look.
What I’m showing here is the optimized version. The best Pandas can do on this pipeline.
import pandas as pd
import numpy as np
import time
df = pd.read_csv('large_sales_data.csv')
start = time.time()
# Fix data types upfront
df['region'] = df['region'].astype('category')
df['category'] = df['category'].astype('category')
df['status'] = df['status'].astype('category')
# Vectorized revenue calculation
df['net_revenue'] = df['sales'] * df['quantity'] * (1 - 0.075)
# Vectorized flagging
df['order_flag'] = np.where(df['net_revenue'] > 50000, 'high', 'low')
# Aggregation
result = df.groupby('region')['net_revenue'].sum()
end = time.time()
print(f"Pandas runtime: {end - start:.2f} seconds")
print(result)
This is clean Pandas. Vectorized operations, correct data types, no unnecessary intermediate columns. Everything I learned from going down that rabbit hole.
And the result?
Pandas runtime: 0.31 seconds
That’s already really good. Genuinely impressive for a million rows.
But here’s the question I kept sitting with after seeing that number: what does it look like when the tool is doing the optimization for you, instead of you doing it manually?
That’s what we’re about to find out.
Installing Polars and First Impressions
Getting started was straightforward. If you’re on Google Colab like me, one line is all you need:
!pip install polars
import polars as pl
print(pl.__version__)
1.35.2
Done. No environment headaches, no dependency conflicts. That alone was a good start.
But then I opened the Polars documentation and immediately noticed something. The syntax looked familiar enough — DataFrames, columns, filtering — but the way you’re supposed to think about operations felt different.
In Pandas, you work with your data step by step. You do something, store the result, do something else. In Polars, you describe what you want as a single expression, and Polars figures out how to execute it.
I didn’t fully understand that yet. But I was about to.
The other thing that caught my eye immediately was this concept of lazy vs eager execution. In Pandas, every line of code runs the moment you write it. Polars gives you a choice: you can run eagerly like Pandas, or you can build up a full query plan first and let Polars optimize it before executing anything.
I didn’t know what that meant in practice yet. But I kept seeing it everywhere in the docs. So I decided the best way to understand it was to just rewrite my pipeline and see what happened.
The Eager Version
import polars as pl
import time
start = time.time()
result = (
    pl.read_csv('large_sales_data.csv')
    .with_columns([
        (pl.col('sales') * pl.col('quantity') * (1 - 0.075)).alias('net_revenue')
    ])
    .with_columns([
        pl.when(pl.col('net_revenue') > 50000)
        .then(pl.lit('high'))
        .otherwise(pl.lit('low'))
        .alias('order_flag')
    ])
    .group_by('region')
    .agg(pl.col('net_revenue').sum())
)
end = time.time()
print(f"Polars Eager runtime: {end - start:.2f} seconds")
print(result)
Polars Eager runtime: 0.83 seconds
The first thing I noticed was the method chaining. In Pandas, I was making separate assignments, doing one thing, storing the result, doing the next thing.
Here, everything flows as a single expression from start to finish. You’re describing the entire pipeline at once.
Let me break down the syntax quickly since some of it will look unfamiliar:
- pl.read_csv() — reads the CSV, same concept as pd.read_csv(). Nothing surprising.
- .with_columns([...]) — this is how Polars adds or transforms columns. The equivalent of df['new_col'] = ... in Pandas. You can compute multiple columns in a single call.
- pl.col('sales') — this is how you reference a column in Polars. Instead of df['sales'], you write pl.col('sales'). You're not grabbing the data directly — you're describing an operation on that column. That distinction matters more than it sounds.
- .alias('net_revenue') — just naming the result. Like saying "call this new column net_revenue."
- pl.when(...).then(...).otherwise(...) — Polars' version of np.where(). Reads almost like plain English: when this condition is true, return this value, otherwise return that one.
- .group_by('region').agg(...) — same concept as Pandas' .groupby(). Group the data, then define your aggregation. Different syntax, same idea.
Now here's the thing. That eager version ran in 0.83 seconds — actually slower than our optimized Pandas at 0.31 seconds. (To be fair, the Polars timer here also includes reading the CSV, which the Pandas version did before its timer started.) Still, if I'd stopped here, I'd have written Polars off entirely.
But I kept reading the docs. And I found something called lazy evaluation.
The Lazy Version
start = time.time()
result = (
    pl.scan_csv('large_sales_data.csv')
    .with_columns([
        (pl.col('sales') * pl.col('quantity') * (1 - 0.075)).alias('net_revenue')
    ])
    .with_columns([
        pl.when(pl.col('net_revenue') > 50000)
        .then(pl.lit('high'))
        .otherwise(pl.lit('low'))
        .alias('order_flag')
    ])
    .group_by('region')
    .agg(pl.col('net_revenue').sum())
    .collect()
)
end = time.time()
print(f"Polars Lazy runtime: {end - start:.2f} seconds")
print(result)
Polars Lazy runtime: 0.20 seconds
Spot the two differences from the eager version. pl.read_csv() became pl.scan_csv() — this tells Polars not to load anything yet, just start building a query plan. And .collect() was added at the end — that’s where you tell Polars “okay, now execute everything.”
Two changes. That’s it.
And just like that, 0.83 seconds became 0.20 seconds.
Polars lazy is 35% faster than our already optimized Pandas pipeline. Without me doing any manual optimization. Without me profiling anything. Without me knowing in advance which operations were bottlenecks.
Polars figured that out on its own.
That’s when I started paying attention.
Mental Model Shift #1 — Lazy vs Eager Execution
This is the one that changed how I think about data pipelines.
In Pandas, every line of code executes the moment you write it. You assign a column — it runs. You filter a DataFrame — it runs. You group and aggregate — it runs.
Each operation is independent, immediate, and unaware of what comes before or after it.
That’s eager execution. It’s intuitive. It feels natural because it matches how we think about writing code step by step.
Polars gives you a choice.
When you use pl.scan_csv() instead of pl.read_csv(), you’re telling Polars: don’t execute anything yet. Just start building a plan. Every .with_columns(), every .filter(), every .group_by() you chain after that isn’t running — it’s being recorded. You’re describing what you want, not triggering it.
Then when you call .collect() at the end, Polars takes that entire plan, looks at it as a whole, optimizes it, and then executes it in one efficient pass.
Think of it this way:
- Pandas is like following a recipe step by step. Chop the onions. Put them in the pan. Add the garlic. Each instruction happens immediately, in order, one at a time.
- Polars lazy is like a chef who reads the entire recipe first, figures out the most efficient way to prepare everything, and then executes it in the optimal order. Same result. Less wasted effort.
That optimization step is the key. Before Polars runs a single line of your pipeline it asks: what columns do we actually need? What rows can we eliminate early? What operations can be parallelized? What work can be skipped entirely?
You can actually see the query plan Polars builds before it executes.
Run this:
lazy_query = (
    pl.scan_csv('large_sales_data.csv')
    .with_columns([
        (pl.col('sales') * pl.col('quantity') * (1 - 0.075)).alias('net_revenue')
    ])
    .with_columns([
        pl.when(pl.col('net_revenue') > 50000)
        .then(pl.lit('high'))
        .otherwise(pl.lit('low'))
        .alias('order_flag')
    ])
    .group_by('region')
    .agg(pl.col('net_revenue').sum())
)
print(lazy_query.explain())
That .explain() call shows you exactly what Polars is planning to do before it does it. It’s the optimizer’s thinking made visible.
This is something Pandas simply doesn’t have.
In Pandas, the optimization is your responsibility. You have to know which operations are expensive, profile your code, and restructure it manually — which is exactly what I did in my last article. In Polars lazy mode, the optimizer handles that for you.
That’s not a small difference. That’s a fundamentally different relationship between you and your data pipeline.
Mental Model Shift #2 — Query Optimization
Once I understood lazy evaluation, I started wondering — okay, but what exactly is Polars optimizing? What is it actually doing differently under the hood?
Turns out there are two big ones worth knowing about. And once you understand them, you start seeing why Polars is faster not just on this pipeline, but on almost any non-trivial workflow.
Predicate Pushdown
Let’s say you add a filter to our pipeline — you only want completed orders:
result = (
    pl.scan_csv('large_sales_data.csv')
    .with_columns([
        (pl.col('sales') * pl.col('quantity') * (1 - 0.075)).alias('net_revenue')
    ])
    .filter(pl.col('status') == 'completed')
    .group_by('region')
    .agg(pl.col('net_revenue').sum())
    .collect()
)
In eager mode — Pandas or Polars eager — this is what happens: load all one million rows, compute net revenue for all of them, then filter down to completed orders. You did expensive work on rows you were going to throw away anyway.
Polars lazy does something smarter. It looks at the entire query plan and says: there’s a filter here. Let me apply that filter as early as possible — ideally before loading data, or right after. That way I’m only doing the expensive computations on the rows that actually matter.
That’s predicate pushdown. Push the filter down to the earliest possible point in the pipeline. Less data processed. Less memory used. Faster result.
Projection Pruning
Same idea, but for columns instead of rows.
Our CSV has eight columns. But our final result only needs sales, quantity, region, and status. Polars looks at the query plan, figures out which columns are actually needed to produce the final output, and only loads those. The rest are ignored entirely.
In Pandas, you load everything first and figure out what you need later. Polars figures out what it needs before loading anything.
These two optimizations — predicate pushdown and projection pruning — are what a database query optimizer does. If you’ve ever written SQL and wondered why the database is fast even on massive tables, this is a big part of why.
And that’s the mental shift here. When you write a Polars lazy pipeline, you’re not really writing a script anymore. You’re writing a query. You’re describing what you want, and Polars — like a database engine — figures out the most efficient way to get it.
That’s a different way of thinking. And once it clicks, you start approaching data pipelines differently. Instead of asking “what should I do next?”, you start asking “what do I actually need at the end?” and working backwards from there.
Mental Model Shift #3 — Columnar Memory
There’s one more piece to why Polars is fast. And it lives at a level below your code.
It's tempting to say Pandas is row-oriented and leave it there, but that's not quite accurate. Under the hood, Pandas stores data in NumPy arrays, grouped internally into 2D blocks by dtype. A plain numeric column is already contiguous, which is exactly why vectorized Pandas is fast at all. The costs hide elsewhere: mixed-dtype frames get consolidated into shared blocks, which triggers copies. String columns are stored as scattered Python objects. And many operations materialize full intermediate arrays along the way.
Polars uses a columnar memory format called Apache Arrow, designed from the ground up for analytics. Every column, strings included, lives in its own tightly packed, contiguous buffer. All the revenue values sit next to each other in memory. All the region values sit next to each other. When you compute something on a column, everything you need is already in one contiguous block of memory, with no consolidation step and no Python objects in the way.
Why does that matter? Two reasons.
First, modern CPUs are built to process contiguous blocks of memory extremely efficiently. When your data is laid out in tightly packed columns, the CPU can tear through it using vectorized instructions, processing multiple values in a single operation. Chasing pointers through scattered Python objects breaks that optimization completely.
Second, that applies to text too, which is usually where Pandas struggles most. Arrow packs all the string bytes of a column into one buffer with an array of offsets alongside, so filtering or grouping on a text column doesn't mean bouncing between millions of heap-allocated Python objects.
You don’t have to do anything to get this benefit. It’s just how Polars is built.
Think of it this way. Pandas is like a filing cabinet that has grown organically over the years: the numeric folders are neatly stacked, but related folders share drawers, some records are sticky notes pointing to pages in another room, and reorganizing anything means photocopying whole drawers.
Polars is like a cabinet designed for exactly one job: one drawer per column, every page packed in order, nothing pointing elsewhere. If you need to analyze payment histories across all customers, you open one drawer and read straight through.
That’s the columnar memory model. And combined with lazy evaluation and query optimization, it’s why Polars can do in 0.20 seconds what takes Pandas 0.31 seconds — even after Pandas has been carefully optimized by hand.
Where Pandas Still Wins
I want to be straight with you here. This article isn’t a Pandas obituary.
After going through this entire exercise, there are still situations where I’d reach for Pandas without hesitation.
For quick exploration and ad hoc analysis, Pandas is hard to beat. The syntax is familiar, the ecosystem is massive, and when you’re just poking around a dataset trying to understand it, the performance difference doesn’t matter. You’re not running a pipeline — you’re thinking out loud.
For small datasets, Polars’ advantages largely disappear. The overhead of building a query plan only pays off when there’s enough data to make optimization worthwhile. On a few thousand rows, just use Pandas.
For visualization and integrations, Pandas is deeply woven into the Python data ecosystem. Matplotlib, Seaborn, Scikit-learn, Statsmodels — they all speak Pandas natively. Polars is catching up, but Pandas is still the common language.
And honestly, for teams and collaborators, familiarity matters. If everyone on your team knows Pandas and nobody knows Polars, introducing it has a real cost.
I’m not replacing Pandas. I’m becoming more intentional about when I use it.
Small dataset, quick exploration, visualization, team familiarity — Pandas. Large dataset, repeated pipeline, production workflow, performance matters — Polars.
Both tools. Right situations.
Conclusion
Here’s what I came in expecting: a speed comparison. Run the same code in two libraries, show the numbers, done.
Here’s what I actually got: a different way of thinking about data.
The speed numbers are real — we went from a 61-second broken Pandas pipeline in my last article, to 0.31 seconds with optimized Pandas, to 0.20 seconds with Polars lazy evaluation. That’s a journey worth documenting.
But the more lasting thing is what Polars quietly teaches you along the way. That filters should be applied as early as possible. That you should only load what you need. That describing what you want and letting an optimizer figure out the how is a legitimate — and powerful — way to write data pipelines.
Those aren’t just Polars ideas. They’re good data engineering ideas. And I wouldn’t have sat down to really understand them if a comment on one of my posts hadn’t pushed me toward a tool I’d never used before.
Pandas didn’t become worse through any of this. My expectations just got bigger.
If you’ve got a workflow that feels slower than it should — even after you’ve cleaned up the code — it might be worth asking whether the tool itself has a ceiling. Sometimes the next step isn’t writing better code. It’s writing the same code in something built differently.
What workflows are you running that might be worth a rewrite?