Using Causal Inference to Estimate the Impact of Tube Strikes on Cycling Usage in London


Transport for London (TfL) is a statutory body responsible for London’s public transport network, managing buses, the Underground, the Docklands Light Railway, the Overground, and major roads. Its ‘Open Data’ policy means that much of its internal data is shared with the public, which TfL says currently powers over 600 apps for Londoners.

One interesting data source they share with the public is Santander Cycles usage data (the bikes are known colloquially as Boris Bikes). Every bike journey is recorded, and the data runs from 2015 all the way up to 2025, arranged in unwieldy weekly CSV files for download: https://cycling.data.tfl.gov.uk/#!usage-stats%2F. Each row is one bike journey, starting from a particular bike station. Altogether that is 144 weekly CSVs, around 800 bike stations, and 9.2 million station-hours. See an example of the data below.

| Start Date       | StartStation Name                | End Date         | EndStation Name                     |   Duration |
|:-----------------|:---------------------------------|:-----------------|:------------------------------------|-----------:|
| 10/01/2016 00:00 | Drury Lane, Covent Garden        | 10/01/2016 00:04 | Frith Street, Soho                  |        240 |
| 10/01/2016 00:00 | Pott Street, Bethnal Green       | 10/01/2016 00:05 | Victoria Park Road, Hackney Central |        300 |
| 10/01/2016 00:00 | Harrington Square 2, Camden Town | 10/01/2016 00:20 | Baylis Road, Waterloo               |       1200 |
| 10/01/2016 00:01 | Canton Street, Poplar            | 10/01/2016 00:14 | Hewison Street, Old Ford            |        780 |
| 10/01/2016 00:01 | Cephas Street, Bethnal Green     | 10/01/2016 00:11 | Brick Lane Market, Shoreditch       |        600 |

We can take each row and aggregate this data up so we can see the seasonality trends across a few years:

Image by author
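As a minimal sketch of that journey-to-aggregate step (using a tiny hand-made sample in place of the real CSVs – filenames and schemas vary across releases), we can count trips per week with pandas:

```python
import io
import pandas as pd

# Hypothetical sample mirroring the raw TfL journey rows shown above
raw = io.StringIO(
    "Start Date,StartStation Name,End Date,EndStation Name,Duration\n"
    "10/01/2016 00:00,\"Drury Lane, Covent Garden\",10/01/2016 00:04,\"Frith Street, Soho\",240\n"
    "10/01/2016 00:00,\"Pott Street, Bethnal Green\",10/01/2016 00:05,\"Victoria Park Road, Hackney Central\",300\n"
    "17/01/2016 08:30,\"Canton Street, Poplar\",17/01/2016 08:43,\"Hewison Street, Old Ford\",780\n"
)
journeys = pd.read_csv(raw)
journeys["Start Date"] = pd.to_datetime(journeys["Start Date"], format="%d/%m/%Y %H:%M")

# One row per journey -> weekly trip counts, the series behind the seasonality plot
weekly = journeys.set_index("Start Date").resample("W").size()
print(weekly)
```

The same `resample` call over all 144 files produces the multi-year seasonality series plotted above.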

This dataset now gives us a glimpse into the bike usage across London (this data doesn’t contain every bike journey in London, but we can expect that Boris Bike usage is related to overall bike usage). For a Causal Data Science enthusiast, the natural next question is: how can we use this dataset to answer some interesting causal questions? What events occur that have a large impact on cycle trips? What are some common large scale disruptions that cause people to not be able to take the tube? How do workers show the value of their labour to their employers by withholding it? Strikes!

In this article I will examine the causal impact of major tube strikes on cycle usage in London. Historical strikes are surprisingly hard to pin down online, but luckily for me there is an FOI request into strike action, which gives us dates of strike action at a line level between 2014 and 2018.

As the data starts as one row per bike journey across all bike stations in London, we have some work to do to get it into a usable format. We convert the 144 weekly CSVs to Parquet files to ease memory constraints, then combine them into one big dataframe and group by bike station and hour.

|   station_id | trips_start         |   ts |
|-------------:|:--------------------|-----:|
|            1 | 2016-01-10 09:00:00 |    4 |
|            1 | 2016-01-10 10:00:00 |    1 |
|            1 | 2016-01-10 11:00:00 |    2 |
|            1 | 2016-01-10 12:00:00 |    2 |
|            1 | 2016-01-10 13:00:00 |    2 |
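A sketch of this step under stated assumptions (the file paths are hypothetical, and we assume a `station_id` column has already been derived from the station names – the real pipeline also handles schema drift across releases):

```python
from pathlib import Path
import pandas as pd

def csvs_to_parquet(csv_dir: str, parquet_dir: str) -> None:
    """Convert each weekly CSV to Parquet once, so later passes are cheap."""
    out = Path(parquet_dir)
    out.mkdir(exist_ok=True)
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        pd.read_csv(csv_path).to_parquet(out / (csv_path.stem + ".parquet"))

def to_station_hour(journeys: pd.DataFrame) -> pd.DataFrame:
    """Collapse journey-level rows to trips started per station per hour."""
    df = journeys.copy()
    df["trips_start"] = pd.to_datetime(
        df["Start Date"], format="%d/%m/%Y %H:%M"
    ).dt.floor("h")
    return (
        df.groupby(["station_id", "trips_start"])
          .size()
          .rename("ts")       # ts = trip count, matching the table above
          .reset_index()
    )
```

Flooring the start time to the hour and counting rows is what produces the `ts` column in the station-hour table.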

TfL also provide coordinates for each bike station. We join each station’s coordinates to its corresponding H3 cell. H3 is a hexagonal grid system developed at Uber that is useful for many spatial analysis tasks. The plot below shows how bike trips are distributed across London.

Image by author

We can now aggregate the trip data up to H3 cell-day level along with some confounders that we think also have an impact on cycling usage in London. These include weather and seasonality features.

import h3
import pandas as pd

# Process in chunks to avoid a memory spike
chunk_size = 100_000
h3_cells = []
for i in range(0, len(bf), chunk_size):
    chunk = bf.iloc[i:i+chunk_size]
    h3_cells.extend([h3.latlng_to_cell(lat, lon, 8) for lat, lon in zip(chunk["lat"], chunk["lon"])])
    print(f"  Processed {min(i+chunk_size, len(bf)):,} / {len(bf):,}")

bf["h3_cell"] = h3_cells

# Aggregate to cell-day
bf["day"] = pd.to_datetime(bf["trips_start"]).dt.date

cell_day = (
    bf.groupby(["h3_cell", "day"])
    .agg(
        total_trips           = ("ts", "sum"),
        frac_exposed          = ("strike_exposed", "mean"),
        n_stations            = ("station_id", "nunique"),
        temperature_2m        = ("temperature_2m", "mean"),
        precipitation         = ("precipitation", "mean"),
        is_weekend            = ("is_weekend", "first"),
        is_bank_holiday       = ("is_bank_holiday", "first"),
        is_school_holiday     = ("is_school_holiday", "first"),
        days_to_next_strike   = ("days_to_next_strike", "first"),
        days_since_last_strike= ("days_since_last_strike", "first"),
        month                 = ("month", "first"),
        year                  = ("year", "first"),
        doy                   = ("doy", "first"),
        lat                   = ("lat", "mean"),
        lon                   = ("lon", "mean"),
    )
    .reset_index()
)

This means that every row of our dataset now contains all Santander bike trips for each day and each H3 cell. We have 172 cells observed across 1,192 days.

We also filtered so that only cells with at least one tube stop within 500m were included – this is necessary to satisfy the Positivity Assumption, which states that every unit must have a non-zero probability of both treatment and control. A cell with no tube stops within 500m effectively has zero probability of ever being strike-exposed (we can reasonably assume that a commuter who can’t use the tube because of strikes would walk up to 500m to use a Santander bike, but not much further).

cell_day = cell_day[cell_day["n_tube_within_500m"] >= 1].copy()

This gives us a cell-day dataset with 62 H3 cells, 66,039 rows and 98.4% of cells ever treated.

Next we can define our outcome and treatment variables. As each cell has a differing level of expected bike usage, we make our outcome variable relative to each cell’s capacity – the total trips for each cell on each day divided by the number of bike stations in that cell. We take the log so that our coefficient tells us about proportional changes rather than absolute ones (and so that the statistical assumptions of the regression are better satisfied), and we add one so that quiet cell-days with zero recorded trips are included in the analysis rather than silently dropped.

\[
Y_{i,t} = \log\left(1 + \frac{\text{Total Bike Trips in cell } i \text{ on day } t}{\text{Number of Bike Stations in cell } i}\right)
\]

We can calculate the outcome variable in python with the following code.

cell_day["y_per_station_log1p"] = np.log1p(cell_day["total_trips"] / cell_day["n_stations"])

Defining the treatment variable for strike exposure isn’t as simple. We know which tube lines were striking on each day – but this information doesn’t neatly map to each cell, as each tube line snakes across London. When pondering what happens to bike usage when tube lines are not operational, it helps to first decide when bike stations are “near” tube stations affected by strikes. We define a bike station to be affected by a strike if it is within 400m of a tube station that serves one of the striking lines.

We then define an H3 cell to be strike-affected if any bike station within it is strike-affected. This is now our treatment variable.

\[
T_{i,t} =
\begin{cases}
1, & \text{if cell } i \text{ is strike-exposed on day } t \\
0, & \text{otherwise}
\end{cases}
\]

To construct this treatment variable for our dataset, we first create a strike-affected column in our station-level data. We do this using the following function, which takes our station-hour data, a dataframe telling us which lines were striking on each day, and a dataframe telling us which stations are served by each striking line.

def attach_strikes_to_base(
    base:             pd.DataFrame,
    strikes_daily:    pd.DataFrame,
    station_line_map: pd.DataFrame,
) -> pd.DataFrame:
    """
    Attach a binary strike_exposed indicator to the station-hour panel.

    A station-hour is treated (strike_exposed = 1) if any Underground line
    serving that station is on strike on that day.

    base must have columns: station_id, trips_start (datetime), ts (numeric trip count).
    """
    df          = base.copy()
    df["date"]  = pd.to_datetime(df["trips_start"]).dt.floor("D")

    station_day_treat = (
        strikes_daily
        .merge(station_line_map[["station_id", "affected_line"]], on="affected_line", how="inner")
        .drop_duplicates(subset=["station_id", "date"])
        .assign(strike_exposed=1)
        [["station_id", "date", "strike_exposed"]]
    )

    df = df.merge(station_day_treat, on=["station_id", "date"], how="left")
    df["strike_exposed"] = df["strike_exposed"].fillna(0).astype(int)
    return df.drop(columns=["date"])

When we aggregate the station-hour dataframe to cell-day level we take the mean of the strike_exposed column into a new column, frac_exposed, and any cell-day with a positive frac_exposed becomes treated.

cell_day["treated"]          = (cell_day["frac_exposed"] > 0).astype(int)

More detail on the data wrangling can be found on https://github.com/stucsk99/tfl_bike_casual/blob/main/01_data_pipeline.ipynb

Now we’ve defined our outcome and treatment variables, let’s take a step back and talk about the underlying causal theory that underpins all the results that we’ll arrive at in this article.

What is the question we want to ask?

The causal mechanism underlying our analysis is substitution. When a tube line strikes, commuters who would normally travel underground are displaced and must find an alternative. We argue that for commuters near major interchange stations, Santander Bikes represent the most accessible alternative: they are available without pre-registration, priced for short journeys, and physically present at the stations where displaced commuters emerge. This substitution story is what connects our treatment variable to our outcome through a credible causal pathway rather than mere correlation.

Strike occurs → tube commuters cannot travel → those commuters look for alternatives → some walk to a nearby Santander dock → bike trips increase. Each arrow in that chain is a step in the mechanism. Without it, even a statistically significant result is just a correlation with a story attached. With it, you have a reason to believe the effect is real.

The causal mechanism we’re describing corresponds to the following structural causal model.

Image by author

Because strike timing is determined by labour negotiations rather than by anything related to cycling demand, we have good reason to believe that strike days are not systematically different from non-strike days in ways that would independently affect bike usage. A strike called on a Tuesday in January is not called because January Tuesdays are unusually good or bad for cycling – it is called because a wage negotiation broke down. This makes the counterfactual comparison credible: the bike usage we observe on comparable non-strike days is a reasonable approximation of what would have happened on strike days had the strike not occurred.

Now that we have our causal mechanism stated, we can carry on with our causal analysis. But before we do that, let’s go through some of the important building blocks of causal inference – the potential outcomes framework.

Potential Outcomes

The fundamental problem of causal inference is that we don’t observe the counterfactual outcomes – we never know what would have happened to bike usage on a strike day, if that strike had not happened. This is by definition unobservable.

In an ideal world, we would observe both potential outcomes for each unit: \(Y_{i,t}(0)\), the potential outcome if cell \(i\) had not experienced a strike on day \(t\), and \(Y_{i,t}(1)\), the potential outcome if it did experience a strike. From here we can define the individual treatment effect for cell \(i\) on day \(t\), which is the difference between the two potential outcomes:

\[
\tau_{i,t} = Y_{i,t}(1) - Y_{i,t}(0)
\]

We would love to know this quantity for each observation but, as mentioned above, we only ever observe one of the two potential outcomes. The logical next step is to average this effect over all units. This is the Average Treatment Effect (ATE):

\[
ATE = E[Y_{i,t}(1) - Y_{i,t}(0)] = E[\tau_{i,t}]
\]

This is the expected treatment effect for a randomly selected unit from the full population. In our setting, it answers: for a randomly selected cell-day in our panel, what is the expected change in log bike trips per station if that cell-day were to become strike-exposed?

We can also define another treatment effect: The Average Treatment Effect on the Treated (ATT):

\[
ATT = E[Y_{i,t}(1) - Y_{i,t}(0) | D_i = 1] = E[\tau_{i,t} | D_i = 1]
\]

Where \(D_i \in \{0, 1\}\) is the treatment indicator. This shifts focus onto units that were actually treated: for a cell-day that was actually strike-exposed, what was the causal effect of that exposure?
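A toy example makes the ATE/ATT distinction concrete. Here we pretend – as we never can in reality – to observe both potential outcomes for four hypothetical cell-days:

```python
import pandas as pd

# Hypothetical cell-days where we pretend to see BOTH potential outcomes
toy = pd.DataFrame({
    "Y0": [2.0, 2.0, 1.0, 1.0],   # outcome without a strike
    "Y1": [2.6, 2.4, 1.1, 1.1],   # outcome with a strike
    "D":  [1,   1,   0,   0],     # which cell-days were actually treated
})
toy["tau"] = toy["Y1"] - toy["Y0"]

ate = toy["tau"].mean()                      # average effect over everyone
att = toy.loc[toy["D"] == 1, "tau"].mean()   # average effect over the treated only
print(ate, att)
```

Here the treated units happen to respond more strongly, so the ATT (0.5) exceeds the ATE (0.3) – the two estimands answer genuinely different questions.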

Naive Treatment Effect

Before we get into how we estimate these quantities using robust causal methods, we can first illustrate what goes wrong when we estimate the ATE naively. The simplest possible approach is to estimate the ATE as the difference in sample means between the treated and control observations. That is,

\[
\tau^{naive} = \overline{Y}_{D=1} - \overline{Y}_{D=0}
\]

naive_diff = np.expm1(
    cell_day.loc[cell_day["treated"] == 1, "y_per_station_log1p"].mean()
    - cell_day.loc[cell_day["treated"] == 0, "y_per_station_log1p"].mean()
) * 100
print(f"Naive diff   : {naive_diff:+.1f}%")

In our data, this gives a naive difference of +5.5%. Cells with any strike exposure have substantially higher log bike trips per station than cells without. But this is not a credible causal estimate. We can decompose the naive difference algebraically to see exactly what it is estimating:

\[
\overline{Y}_{D=1} - \overline{Y}_{D=0} = \underbrace{E[Y_{i,t}(1) - Y_{i,t}(0) | D_i = 1]}_{ATT} + \underbrace{E[Y_{i,t}(0) | D_i = 1] - E[Y_{i,t}(0) | D_i = 0]}_{\text{selection bias}}
\]

The first term is the ATT, what we want. The second term is selection bias – the difference in control potential outcomes between treated and untreated units. In our case, this bias is likely positive: cells that are strike-exposed are near tube lines, which means they are in denser, more central areas of London that have higher baseline bike usage regardless of any strike. The naive estimate conflates the effect of strikes with the pre-existing advantage of centrally located cells.

Eliminating this selection bias is the entire job of the methods that follow.
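A small simulation (entirely synthetic numbers, not our TfL data) shows the decomposition in action: central cells get both a higher baseline and a higher treatment probability, and the naive difference swamps the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Central cells have higher baseline usage AND are more likely to be treated
central = rng.random(n) < 0.5
D = rng.random(n) < np.where(central, 0.8, 0.1)

tau = 0.04                                        # true (small) strike effect
Y0 = 1.0 + 1.0 * central + rng.normal(0, 0.1, n)  # control potential outcome
Y = Y0 + tau * D                                  # observed outcome

naive = Y[D].mean() - Y[~D].mean()
selection_bias = Y0[D].mean() - Y0[~D].mean()
print(f"naive: {naive:.3f}, ATT: {tau}, selection bias: {selection_bias:.3f}")
```

The naive difference equals the ATT plus the selection bias exactly, and here the bias term is more than ten times larger than the true effect – precisely the failure mode the decomposition above warns about.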

Panel Data

Our dataset has a structure that is particularly well suited to addressing selection bias: it is a panel. A panel dataset observes the same units repeatedly over time. Our specific panel has the following structure:

\[
\{ X_{i,t}, D_{i,t}, Y_{i,t} \}
\]

Where \(i = 1, \dots, N\) indexes our H3 cells and \(t = 1, \dots, T\) indexes the days observed in our dataset. Here \(N = 62\) cells and \(T = 1{,}192\) days, giving at most \(N \times T\) observations (66,039 rows in practice, since not every cell records trips on every day).

The key insight that panel data provides is this: if we observe the same cell on multiple days, we can separate the time-invariant component of that cell’s outcome from the day-specific variation.  A cell near Bank station is always going to be busier than a cell near Pimlico – that is a permanent feature of the cell’s location, not something that changes with strikes. Panel methods let us account for this permanent feature without ever having to measure it directly.

We can use the inherent set up of the panel data to model the treatment effect using a two way fixed effects model. This is a generalisation of a traditional Difference in Differences method. This model is set up in the following way:

\[
Y_{i,t} = \alpha_{i} + \lambda_{t} + \tau D_{i,t} + \beta X_{i,t} + \epsilon_{i,t}
\]

Where \(Y_{i,t}\) is our outcome variable for cell \(i\) on day \(t\), \(\alpha_{i}\) is the fixed effect for cell \(i\), \(\lambda_{t}\) is the fixed effect for day \(t\), \(\tau\) is the causal treatment effect, \(D_{i,t}\) is the treatment indicator, \(\beta\) are the coefficients for covariates \(X_{i,t}\), and \(\epsilon_{i,t}\) are our errors.

In this model, the two fixed effects \(\alpha_{i}\) and \(\lambda_{t}\) act as dummy variables for each cell and each day. The cell fixed effect absorbs all time-invariant cell characteristics (all the geographical features of cell \(i\) that don’t change over time) and the date fixed effect absorbs all cell-invariant variation (day-specific shocks common to every cell). This is equivalent to demeaning within each cell and within each date, which removes all time-invariant cell characteristics and common day-level shocks.
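We can verify this equivalence numerically on a small synthetic balanced panel: two-way demeaning followed by a simple slope calculation recovers the true \(\tau\) despite large cell and day effects:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, tau = 6, 50, 0.04

alpha = rng.normal(0, 1, N)[:, None]              # cell fixed effects
lam   = rng.normal(0, 1, T)[None, :]              # day fixed effects
D     = (rng.random((N, T)) < 0.1).astype(float)  # treatment indicator
Y     = alpha + lam + tau * D + rng.normal(0, 0.001, (N, T))

def within(M):
    # Balanced-panel two-way demeaning: subtract cell means and day means,
    # then add back the grand mean
    return M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()

# By Frisch-Waugh-Lovell, regressing demeaned Y on demeaned D gives the TWFE tau
tau_hat = (within(D) * within(Y)).sum() / (within(D) ** 2).sum()
print(tau_hat)
```

Despite cell and day effects that are orders of magnitude larger than the treatment effect, the within estimator lands almost exactly on 0.04 – the fixed effects have been differenced away without ever being measured.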

We can run this regression analysis using the ols function from the statsmodels.formula.api library:

import statsmodels.formula.api as smf

twfe = smf.ols(
    """y_per_station_log1p ~ treated
       + temperature_2m + precipitation
       + is_weekend + is_bank_holiday + is_school_holiday
       + days_to_next_strike + days_since_last_strike
       + C(h3_cell) + C(date_str)""",
    data=cell_day,
).fit(
    cov_type="cluster",
    cov_kwds={"groups": cell_day["h3_cell"]},
)

Note that we can’t rely on ordinary OLS standard errors, as observations from the same cell across different days are correlated. If we ignored this correlation, we would systematically understate the uncertainty in \(\epsilon_{i,t}\), producing confidence intervals that are too narrow and p-values that are too small. The standard solution is to cluster the errors at the cell level. This allows arbitrary correlation between the residuals \(\epsilon_{i,s}\) and \(\epsilon_{i,t}\) for the same cell \(i\) across any two dates \(s\) and \(t\), while maintaining the assumption of independence across cells.
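A quick Monte Carlo (synthetic data, with a cluster-level regressor and a shared within-cluster error component, not our TfL panel) illustrates how badly classical OLS standard errors can understate the truth in this situation:

```python
import numpy as np

rng = np.random.default_rng(2)
G, T, reps = 30, 50, 500          # clusters (cells), periods (days), simulations
slopes, classical_se = [], []

for _ in range(reps):
    # Regressor varies only at the cluster level; errors share a cluster component
    x = np.repeat(rng.normal(0, 1, G), T)
    e = np.repeat(rng.normal(0, 1, G), T) + rng.normal(0, 1, G * T)
    y = e                          # true slope is zero
    xc = x - x.mean()
    b = (xc * y).sum() / (xc ** 2).sum()
    resid = y - y.mean() - b * xc
    se = np.sqrt((resid ** 2).sum() / (len(y) - 2) / (xc ** 2).sum())
    slopes.append(b)
    classical_se.append(se)

# True sampling spread of the slope vs what classical OLS claims it is
ratio = np.std(slopes) / np.mean(classical_se)
print(f"true sd / classical SE ratio: {ratio:.1f}")
```

In this set-up the slope’s true sampling variability is several times what the classical formula reports, which is exactly why the article clusters at the cell level.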

Results

Our TWFE method gives us an increase of 3.95% in Santander bike usage on strike days, with a p-value of 0.097.
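For reference, percentage effects like this come from transforming the fitted coefficient back off the log scale, just as the naive-difference snippet did earlier (the 0.0387 below is a hypothetical coefficient chosen purely to illustrate the conversion):

```python
import numpy as np

coef = 0.0387  # hypothetical log1p-scale coefficient on `treated` from a TWFE fit
pct = np.expm1(coef) * 100
print(f"Estimated effect: {pct:+.2f}%")
```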

Before we dive deeper into these results, we first focus on some changes we made to our data to tighten the causal mechanisms that we want to understand.

Having established that every cell in our analysis must have at least one tube station within 500 metres – our positivity condition – we apply a stronger restriction motivated by the causal mechanism itself. Not all tube stations generate equal commuter displacement when they strike. The 42 stations we focus on are the major interchange stations of central London: Bank, Liverpool Street, King’s Cross, Waterloo, Victoria, and their neighbours. These are the stations where thousands of commuters converge each morning, where Santander Bike docks are densest, and where the substitution from tube to bike is most frictionless – a displaced commuter walks out of a closed station and finds a rack of bikes within metres.

At more peripheral stations, even where a Santander dock exists nearby, the displacement mechanism is weaker. Fewer commuters are purely tube-dependent, and the walking distance to a bike dock is more likely to exceed what a time-pressured commuter will tolerate. Restricting to the 32 cells within 800 metres of these 42 major interchange stations is therefore a deliberate focus on the geographic population where both the demand shock from the strike and the supply response from the bike network are sufficiently concentrated for the substitution effect to be detectable.

# Get centroids of all unique cells in cell_day
unique_cells = cell_day["h3_cell"].unique()
cell_centroids = pd.DataFrame([
    {"h3_cell": c,
     "lat": h3.cell_to_latlng(c)[0],
     "lon": h3.cell_to_latlng(c)[1]}
    for c in unique_cells
])

# Build KD-tree over the 42 station coordinates
# (Euclidean distance on raw lat/lon radians ignores the cos(latitude)
#  compression of longitude, so this is only an approximate 800m filter)
station_coords = np.radians(CENTRAL_42[["lat", "lon"]].values)
tree           = cKDTree(station_coords)

# Query each cell centroid
cell_coords    = np.radians(cell_centroids[["lat", "lon"]].values)
radius_rad     = 0.8 / 6371.0   # 800m in radians

# For each cell, find distance to nearest of the 42 stations
nearest_dist_rad, _ = tree.query(cell_coords, k=1)
cell_centroids["dist_to_central_42_km"] = nearest_dist_rad * 6371.0
cell_centroids["near_central_42"]       = nearest_dist_rad <= radius_rad

central_cells = set(
    cell_centroids.loc[cell_centroids["near_central_42"], "h3_cell"]
)

# ── Filter ─────────────────────────────────────────────────────
cell_day_central = cell_day[
    cell_day["h3_cell"].isin(central_cells)
].copy()

Days that fall 300 days away from any strike have very different seasonal characteristics from strike days, and little causal relevance to the comparison. Including them forces the date fixed effects to span a wide seasonal range, and the cell fixed effects are estimated from periods that are not directly relevant to the comparison. By restricting to a local window of 45 days around each strike date we create a cleaner experiment: the control days look more like the counterfactual for the treated days, and seasonal confounding is reduced.

sub = cell_day_central[cell_day_central["days_to_nearest"] <= 45].copy()

We now have three different versions of the basefile, each with an increasingly favourable signal-to-noise ratio.

|   Basefile Version                        | Rows       |  Treatment % |
|------------------------------------------:|:-----------|-------------:|
|       Only cells within 500m of tube stop | 66,039     |    0.82      |
|      Only cells close to Central Stations | 34,590     |    0.94      |
|   Only days within 45 days of strike days | 16,799     |    1.95      |

The plot below shows the TWFE estimates across the different basefile specifications. The most causally focused set-up of our panel data achieves an estimated treatment effect of 3.95%, with a p-value of 0.097.

Image by author

Our p-value is above the standard 0.05 threshold. This means that, if strikes truly had no effect, a result at least as large as our 3.95% would be observed by chance roughly 9.7% of the time. Even though our p-value misses the standardly used benchmark, our three estimates are consistently positive, and the width of the confidence interval reflects the limited number of strike events in the FOI data, not the absence of an effect.

Causal Inference Assumptions

Before getting too carried away with these results, we have to stop and consider the assumptions that must hold for the TWFE estimate to have a causal interpretation.

Positivity/Overlap requires that every unit has a non-zero chance of being treated. We have addressed this by ensuring every cell in the panel has at least one tube stop within 500m.

Parallel trends requires that in the absence of strikes, treated and control cells would have experienced the same time trend in bike usage. This is plausible in our setting because strike timing is determined by labour negotiation dynamics — the decision to strike on a particular date is driven by bargaining outcomes between TfL management and unions, not by anything related to the underlying trajectory of bike usage. 

No anticipation requires that cells do not change their behaviour before treatment occurs — that the announcement of a strike does not itself alter bike usage in the days before the strike. This is partially addressed by the inclusion of days_to_next_strike as a covariate in the controlled specification, which captures any systematic pre-strike trend. We note that for truly unannounced strikes, the no-anticipation assumption is automatically satisfied.

SUTVA (Stable Unit Treatment Value Assumption, Rubin 1980) requires that the potential outcomes of one cell do not depend on the treatment status of other cells. This is the assumption most likely to be violated in our setting: a strike displaces commuters across a wide geographic area, potentially affecting bike usage at cells beyond those directly adjacent to striking lines. SUTVA violations will attenuate our estimate toward zero, meaning our +3.95% should be interpreted as a lower bound on the true effect for the most directly exposed cells.

Closing Remarks

This article set out to answer a simple question: do London tube strikes push commuters onto Santander Bikes? The answer, based on a two-way fixed effects analysis of four years of TfL open data, is yes, but arriving at that answer was considerably less straightforward than the clean result might suggest.

Working with real-life data is never straightforward. To get the trip data into a usable format, I had to parse 144 weekly CSVs, reconcile inconsistent column schemas across data releases, correct a silent naming mismatch between strike line identifiers, and rebuild the spatial mapping between bike stations and tube stops several times.

This was all before considering the different causal assumptions necessary to build a credible argument. Coming from an ML background, I also spent a non-trivial amount of time investigating meta-learners (S-, T-, and X-learners, a set of predictive machine learning methods for estimating treatment effects) for this problem. These would have given us richer insight – the conditional average treatment effect, or CATE, which tells us how the treatment effect varies across London.

I learned the hard way that the tool did not fit the problem. Panel data with a recurring binary treatment and a strong geographic identification story wants a fixed effects regression, not a cross-sectional ML estimator.
