A learner messaged me about a wrong answer.
She had asked the tutor about a concept from one of my Generative AI tutorials. The response looked fine. But it wasn’t. I had already rewritten that content two months earlier. My RAG system pulled a version from six months ago — not obviously wrong, just wrong enough to mislead.
She thought she had misunderstood. She hadn’t. My own system was teaching her from lessons I had already replaced.
I’m building a RAG-powered assistant for EmiTechLogic, my tech education platform — turning a content library into a system that generates answers directly from my own articles. I wrote about the initial architecture here. That part was manageable. The real challenge begins when real learners hit a live system.
When I pulled the retrieval logs, I saw exactly what happened. Both versions were in the vector store. The old one ranked first because it had more matching tokens and a higher cosine similarity score. The updated version came in second. Sometimes third.
I expected the newer document to win automatically. That’s not how cosine similarity works.
The system was doing exactly what it was designed to do, which turned out to be the problem.
The pattern held across other queries too. Python tutorials I had updated, model comparison guides I had revised. Old versions kept surfacing first. The AI tool I was building was quietly teaching people from lessons I had already replaced.
Here’s what that looked like in practice, same query, same corpus, naive RAG:
QUERY: What are the API rate limits? Will I get a 429 error?
NAIVE RAG
1. [policy_v1] age=540d | EXPIRED | sim=0.447
"API rate limits are set to 100 requests per minute..."
2. [announcement_today] age=0d | valid | sim=0.329
3. [tutorial_old] age=600d | EXPIRED | sim=0.303
A 540-day-old expired document was sitting at the top. The live announcement from 48 hours ago was ranked second. The retriever didn’t care about freshness. It only matched words.
I assumed freshness would be handled somewhere in the pipeline. It wasn’t. Nobody had thought to add it.
This article is about how I fixed that. I built a temporal layer that sits between the vector search results and the LLM and makes the system care about time.
TL;DR
If you’re short on time: vector search has no concept of when something was true. I fixed this by adding a reranking step between the retriever and the LLM — one that hard-removes expired facts, boosts active time-bounded signals, and uses exponential decay to prefer newer documents. The tricky part was making sure “fresh” didn’t override “relevant.”
The one-line version: naive RAG finds what’s similar, temporal RAG finds what’s still true.
Complete code: https://github.com/Emmimal/temporal-rag/
Who this is for
Any RAG system where the knowledge base changes over time. If your system has ever given a confident answer from a document you had already updated, deprecated, or replaced — this is for you.
It matters most for API and product documentation, incident and outage management, customer support knowledge bases, internal wikis and policy systems, and education platforms where content evolves.
Skip it if your knowledge base is static and never changes. Skip it if your content has no concept of expiry, versions, or time-bounded signals. Skip it if a stale answer carries no real consequence.
Why Vector Search Has No Sense of Time
The standard RAG pipeline embeds documents, embeds the query, finds the closest matches, and sends them to the model [1, 2]. That works fine if your information never changes. But if you are constantly publishing new guides and rewriting old ones, this fails silently. You might not even notice until a user complains.
The vector store just knows the angle between the vectors [10]. It has no idea which document is six months old and which one I published last week.
The usual fixes are deleting old documents or adding metadata filters. I tried both. They helped for about two weeks, and then I updated my content again and the same problem returned. Down-weighting old documents isn’t enough either: a document with a 20% penalty can still rank first if its word overlap is strong enough.
When I looked closer, I realized this wasn’t one big problem. It was actually three separate problems, and each one needs a different fix.
I had been collapsing all three into one bucket called “stale content” and applying the same fix to all of them. That’s why nothing was sticking.
Three Time Problems, Three Different Fixes
1. Expiration: a fact that is now false
Some documents have an expiry date. Showing them after that date isn’t a freshness issue. It’s a lie. You can’t just down-rank these. You have to remove them completely before the model ever sees them.
2. Temporality: facts that are only true right now
Some information matters intensely for a short window. A live notice about a site outage or a 48-hour policy change isn’t just extra context. It is the most important document in your knowledge base while its window is open. An hour after it closes, it is false.
3. Versioning: a fact that has been replaced
This was my biggest problem. When I updated a document, both versions stayed in the vector store. The old one kept winning because it had more matching words. The fix here is neither removal nor boosting. Let time decay handle it. The newer document should naturally outscore the older one when recency is part of the ranking signal.
| Problem | Nature | Wrong fix | Right fix |
|---|---|---|---|
| Expiration | Fact is now false | Down-weight | Hard remove before ranking |
| Temporality | Fact is active and urgent | Treat as normal | Boost while window is open |
| Versioning | Fact is superseded | Hard remove | Time decay ranks newer higher |
I kept seeing the same pattern: old documents, expired documents, and temporary alerts were all treated like the same problem. The fix, in practice, behaves more like a collection of rules than a single temporal retrieval model, because each of those three problems needs its own rule.
How This Relates to Existing Research
I looked at existing approaches — graph-based retrieval, timestamped embeddings, recency priors baked into the retriever itself. Time-aware language models bake temporal signals directly into the model weights [5], while internet-augmented approaches fetch live documents at query time [3]. RealTime QA [4] frames the problem as one of answer currency rather than retrieval ranking. All of them required rebuilding infrastructure I did not have. I needed something I could drop into the system I already had running.
So I built a post-retrieval layer instead — a reranking step [6] applied downstream of dense passage retrieval. No retriever changes. No new embedding model. No new infrastructure. All it requires is a timestamp on each document and one reranking step at query time.
I needed something running within days on a live platform, not a rebuild. This was that.
What I Built: A Temporal Layer
What I ended up building was a temporal layer that sits between the retriever and the LLM. The retriever stays unchanged. It still pulls the top 20 candidates by cosine similarity. The temporal layer receives those candidates, reclassifies them, and reranks them before any reach the model.
That gap between retriever and LLM is where all the real work happens.
The Core Design: Two Orthogonal Axes
The key design decision is two independent classification axes, not one.
Axis 1: Validity State (3 States)
EXPIRED -> was true, is no longer true. Hard remove before ranking.
VALID -> true with no active time constraint. Normal scoring.
TEMPORAL -> true within a currently active time window. Boost.
Most systems run on two states: valid and expired. What I was missing was a separate TEMPORAL state for active time-bound signals. A maintenance notice isn’t the same as a permanent rule. It’s urgent and needs to surface first. Once maintenance is over, the notice moves to EXPIRED and is removed.
You can find the full code for how this works in my project folder. Here is a simplified version of the main logic:
# TEMPORAL state is gated on document kind.
# Only EVENT documents reach TEMPORAL — not VERSIONED, not STATIC.
if self.kind == DocumentKind.EVENT:
    return ValidityState.TEMPORAL
return ValidityState.VALID  # VERSIONED docs with valid_from are still just VALID
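If you want to see how that gate fits together with the expiry check, here is a self-contained sketch of the whole classification. Field names like valid_from and valid_until mirror the metadata described later in this article; treat it as a simplified illustration, not the exact repository code.

from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DocumentKind(Enum):
    STATIC = "static"        # timeless facts: definitions, math, reference material
    VERSIONED = "versioned"  # policies, tutorials, specs that get replaced
    EVENT = "event"          # announcements, outages, anything with a time window


class ValidityState(Enum):
    EXPIRED = "expired"
    VALID = "valid"
    TEMPORAL = "temporal"


@dataclass
class Document:
    doc_id: str
    kind: DocumentKind
    created_at: datetime
    valid_from: Optional[datetime] = None
    valid_until: Optional[datetime] = None

    def validity_state(self, now: Optional[datetime] = None) -> ValidityState:
        now = now or datetime.now(timezone.utc)
        # Past its expiry: hard remove before ranking, whatever its kind.
        if self.valid_until is not None and now > self.valid_until:
            return ValidityState.EXPIRED
        # TEMPORAL is gated on kind: only EVENT docs inside an open window qualify.
        if self.kind == DocumentKind.EVENT and self.valid_until is not None:
            return ValidityState.TEMPORAL
        return ValidityState.VALID  # VERSIONED docs with valid_from are still just VALID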
The complete implementation with all edge cases is linked in the “Run It Yourself” section.
Axis 2: Document Kind (3 Types)
STATIC -> timeless fact (definitions, math, reference material)
VERSIONED -> replaced by newer information (policies, tutorials, specs)
EVENT -> true only within a time window (announcements, outages)
This distinction matters a lot. Without it, my first version classified a new company policy as a temporary event and boosted it to the top of every search. A policy update behaves differently from a live outage notice, even if both are recent: the policy should be ranked normally and lose points slowly over time.
The fix: only EVENT documents (news, alerts, outages) can receive the temporal boost. VERSIONED and STATIC documents never get it.
policy_v2: kind=VERSIONED state=valid window=supersedes policy_v1
announcement_today: kind=EVENT state=temporal window=42h remaining
Even with identical timestamps, those documents behave differently because they represent different kinds of information.

Validity state decides what to do with a document; document kind decides why. Image by Author
The Scoring Formula
The final score for each document combines vector similarity with temporal signals:
final_score = semantic_penalty
× [(1 − w) × vector_score
+ w × (decay_score × recency_score
× validity_multiplier × event_relevance_multiplier)]
Where:
vector_score: cosine similarity, normalized to fall between 0 and 1 relative to the candidate pool.
decay_score: exponential decay based on document age, a technique applied to document freshness ranking in information retrieval [10].
decay = 0.5 ^ (age_in_days / half_life_days)
You can also change how fast the score drops based on the document type. For example, news fades away in just 7 days, while legal documents stay strong for 365 days.
recency_score: a relative comparison within the current pool. The newest document gets the top score, the oldest gets the bottom. This ensures the system always rewards the freshest option available in the pool, even when every candidate is old in absolute terms.
validity_multiplier — applied based on validity state:
EXPIRED -> 0.0 (safety net; should already be filtered)
VALID -> 1.0 (normal)
TEMPORAL -> 1.2 (boost for active EVENT signals)
event_relevance_multiplier — applied to EVENT documents only
EVENT + TEMPORAL + raw_cosine >= floor -> 1.0 (full boost)
EVENT + TEMPORAL + raw_cosine < floor -> 0.5 (boost halved)
semantic_penalty — applied to all document kinds:
normalized_score >= min_threshold -> 1.0 (no penalty)
normalized_score < min_threshold -> 0.3 (relevance penalty)
w is temporal_weight — the balance between semantic relevance and temporal signals. I run it at 0.40 on my platform’s tutor, meaning 60% of the score still comes from meaning, 40% from time.

60% semantic match, 40% time, with freshness never allowed to override relevance. Image by Author.
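For reference, here is roughly what that formula looks like as a single function. The function name, the default half-life, and the 0.05 relevance threshold are illustrative choices for this sketch, not values the system depends on.

def temporal_score(
    normalized_score: float,            # cosine similarity, normalized to [0, 1] within the pool
    age_days: float,
    recency_score: float,               # relative freshness rank within the pool, in [0, 1]
    validity_multiplier: float,         # 0.0 expired, 1.0 valid, 1.2 temporal
    event_relevance_multiplier: float,  # 1.0 or 0.5, EVENT documents only
    half_life_days: float = 90.0,
    temporal_weight: float = 0.40,      # w: 60% meaning, 40% time
    min_relevance_threshold: float = 0.05,
) -> float:
    # Exponential decay: the score halves every half_life_days.
    decay_score = 0.5 ** (age_days / half_life_days)
    # Relevance penalty: weak semantic matches get crushed regardless of freshness.
    semantic_penalty = 1.0 if normalized_score >= min_relevance_threshold else 0.3
    temporal_component = (
        decay_score * recency_score * validity_multiplier * event_relevance_multiplier
    )
    return semantic_penalty * (
        (1.0 - temporal_weight) * normalized_score
        + temporal_weight * temporal_component
    )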
The Failure That Revealed the EVENT Relevance Gate
After the first version was running, I noticed a new problem. A user asked about “engineering team health,” but the top result was a notice about website maintenance.
The notice was new. But it had nothing to do with the question. It won simply because it was the freshest thing in the system. Being new isn’t enough. The document also needs to be relevant.
Without some relevance gating, fresh alerts started showing up in unrelated queries.
So I added a hard requirement: an event only gets its boost if its raw cosine score clears a minimum floor. If the content doesn’t talk about the right topic, the recency advantage disappears.
def _event_relevance_multiplier(self, doc, state, raw_vector_score) -> float:
    if doc.kind != DocumentKind.EVENT:
        return 1.0
    if state != ValidityState.TEMPORAL:
        return 1.0
    floor = self.config.event_min_raw_vector_score
    return 1.0 if raw_vector_score >= floor else 0.5
Why raw cosine and not normalized? Because it acts as an absolute ruler.
Normalized scores are relative. If all your results are weak, the “least bad” one might still score 80%. That’s dangerous. Raw cosine doesn’t care about the other documents. If a query about “team health” has almost nothing in common with a “technical update,” the score stays near zero regardless.
reason: EVENT signal present but low query relevance
(raw sim 0.101 < 0.2) — temporal boost halved
Threshold calibration note: the floor you choose depends on the embedding model behind your retriever.
- TF-IDF / sparse embeddings: use a floor around 0.20. Word-match scores are naturally lower.
- Dense models like text-embedding-3-small or all-MiniLM-L6-v2 [7]: use 0.35 to 0.50. These models score higher by default, so the floor needs to move up.
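In code, the calibration is just one config value. A hypothetical lookup, with the floor chosen per embedding backend; the numbers are starting points to tune, not measured constants.

# Hypothetical calibration table: one EVENT relevance floor per embedding backend.
EVENT_FLOOR_BY_EMBEDDER = {
    "tfidf": 0.20,                  # sparse word-match scores run low
    "all-MiniLM-L6-v2": 0.40,       # dense models score higher in absolute terms
    "text-embedding-3-small": 0.45,
}

def event_min_raw_vector_score(embedder_name: str) -> float:
    # Fall back to a conservative dense-model floor if the backend is unknown.
    return EVENT_FLOOR_BY_EMBEDDER.get(embedder_name, 0.35)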
Four Scenarios: Before and After
These are the actual outputs from running demo.py on the same queries, two ways: naive RAG and temporal RAG.
Scenario 1 — API Rate Limits (expired answer is dangerous)
QUERY: What are the API rate limits? Will I get a 429 error?
NAIVE RAG
1. [policy_v1] age=540d | EXPIRED | sim=0.447
2. [announcement_today] age=0d | valid | sim=0.329
3. [tutorial_old] age=600d | EXPIRED | sim=0.303
TEMPORAL RAG
[announcement_today]
age : 0.3 days | kind: EVENT | state: temporal (active)
window : 42h remaining
reason : Active EVENT signal (42h remaining) — overrides static sources
FINAL SCORE : 1.079
[policy_v2]
age : 175.0 days | kind: VERSIONED | state: ✓ valid
reason : Latest version — supersedes policy_v1
FINAL SCORE : 0.573
[news_recent]
age : 30.0 days | kind: STATIC | state: ✓ valid
reason : Fresh, open-ended fact — high confidence
FINAL SCORE : 0.509
removed : ['policy_v1', 'tutorial_old']
surfaced : ['policy_v2', 'news_recent']
Naive RAG tells the user they’ll hit 429 errors at 100 requests per minute. The actual limit is 1,000. Temporal RAG leads with the live maintenance announcement (rate limiting is currently suspended) and follows with the current policy.
Scenario 2: LLM Scaling Research
QUERY: Do larger language models keep improving with scale?
NAIVE RAG
1. [tutorial_old] age=600d | EXPIRED | sim=0.226
2. [research_2022] age=730d | valid | sim=0.141
3. [research_2026] age=120d | valid | sim=0.136
TEMPORAL RAG
[research_2026] STATIC ✓ valid score=0.662
reason: Stale — semantically relevant but low freshness weight
[research_2022] STATIC ✓ valid score=0.600
reason: Stale — semantically relevant but low freshness weight
[news_old] STATIC ✓ valid score=0.476
reason: Stale — semantically relevant but low freshness weight
removed : tutorial_old
surfaced: news_old
Naive RAG ranks a dead document first by word overlap. Temporal RAG removes it and puts the 2026 research at the top, where it belongs. The corpus documents in this scenario reflect the real shift in scaling research: the earlier plateau finding [8] was later revised by compute-optimal scaling studies [9].
Scenario 3 — Company Health (one story vs the full picture)
QUERY: What is the current state of the engineering team and company health?
NAIVE RAG
1. [news_old] age=400d | valid | sim=0.600
2. [tutorial_new] age=85d | valid | sim=0.385
3. [tutorial_old] age=600d | EXPIRED | sim=0.304
TEMPORAL RAG
[news_old] STATIC ✓ valid score=0.602
reason: Stale — semantically relevant but low freshness weight
[news_recent] STATIC ✓ valid score=0.543
reason: Fresh, open-ended fact — high confidence
[tutorial_new] VERSIONED ✓ valid score=0.519
reason: Latest version — supersedes tutorial_old
removed : tutorial_old
surfaced : news_recent
The live announcement didn’t appear here because it failed the relevance gate. Its raw cosine was 0.165, below the 0.20 floor. But both news articles showed up, which is exactly right. The LLM can now read both and understand how things have changed over time. Naive RAG only surfaced the old story and two unrelated guides.
Scenario 4 — Live Outages (urgent signal buried)
QUERY: Are there any current API outages or limit suspensions I should know about?
NAIVE RAG
1. [policy_v1] age=540d | EXPIRED | sim=0.390
2. [policy_v2] age=175d | valid | sim=0.267
3. [announcement_today] age=0d | valid | sim=0.101
TEMPORAL RAG
[policy_v2] VERSIONED ✓ valid score=0.641
reason: Latest version — supersedes policy_v1
[announcement_today] EVENT temporal score=0.465
reason: EVENT signal present but low query relevance
(raw sim 0.101 < 0.2) — temporal boost halved
[news_recent] STATIC ✓ valid score=0.082
reason: Penalized: normalized vector score 0.000 below relevance threshold
removed : policy_v1
Naive RAG buries the live update at position 3 behind an expired policy. Temporal RAG moves it to position 2. It didn’t reach first because the word overlap between “outages” and “upgrades” was low. With dense embeddings instead of TF-IDF, it would have taken the top spot easily.
What broke next — and how I fixed it
Once the core temporal layer was working, real queries surfaced more surprises. Here’s what broke next.
When a document is too old to stand alone but too useful to drop
Some documents weren’t wrong, just old enough that I didn’t want them answering alone. So I added a third action between retrieving a document solo and dropping it: pairing. Weak documents get retrieved only if a fresher source comes with them. Invalid ones never reach the model.
[Invalid] research_old decay=0.100 → DO NOT RETRIEVE
[Weak] research_weak decay=0.351 → PAIR WITH research_fresh (gain=+0.540)
[Good] research_fresh decay=0.891 → RETRIEVE
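The decision itself is a small threshold check on the decay score. A sketch consistent with the log above; the 0.3 and 0.6 cutoffs are illustrative, not the repository’s exact values.

def retrieval_action(decay: float, weak_floor: float = 0.3, good_floor: float = 0.6) -> str:
    # Decide how a document may be used, based only on its decay score.
    if decay < weak_floor:
        return "DO NOT RETRIEVE"           # too stale to trust at all
    if decay < good_floor:
        return "PAIR WITH FRESHER SOURCE"  # usable only alongside a newer document
    return "RETRIEVE"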
When the score looks good but the answer isn’t certain
A high score doesn’t mean high confidence. When two documents score 0.73 and 0.72 but contradict each other, the system shouldn’t act certain. I added confidence tiers that check the margin and flag conflicts — a close race or contradiction drops the result to LOW regardless of the raw score.
policy_v3 — clear winner: confidence 0.7485 → HIGH
policy_v3 — conflict, narrow margin: confidence 0.4727 → LOW
math_theorem: confidence 0.6992 → MEDIUM
The second policy_v3 row is the one that matters: score went up from the conflict boost, confidence went down because the conflict is a warning signal.
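The tiering itself is a couple of comparisons on the margin between the top two candidates plus the conflict flag. A sketch, with cutoffs I picked for illustration rather than the calibrated ones:

def confidence_tier(top_score: float, runner_up_score: float, has_conflict: bool) -> str:
    # A close race or a detected contradiction caps confidence at LOW,
    # no matter how high the raw score is.
    margin = top_score - runner_up_score
    if has_conflict or margin < 0.05:
        return "LOW"
    if top_score >= 0.70 and margin >= 0.15:
        return "HIGH"
    return "MEDIUM"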
Knowing why something was rejected
When the system rejects a document I want to know exactly which rule fired and on which query. I added a failure log keyed by query_id.
Failure summary (3 rejections — query_id=d211ffdc)
EXPIRED_VERSIONED_DOC × 1 doc=expired_policy
STALE_STATIC_DOC × 1 doc=stale_reference
BELOW_RELEVANCE_GATE × 1 doc=fresh_irrelevant
Codes in use: EXPIRED_VERSIONED_DOC, STALE_STATIC_DOC, HARD_EXPIRED_EVENT, BELOW_RELEVANCE_GATE, OUT_OF_TIME_RANGE, PAIR_PARTNER_NOT_FOUND. This is what I open first when something surfaces the wrong document.
When the fact changed significantly between versions
Replacing “100 requests per minute” with “1,000 requests per minute” is not a wording change. I added conflict severity detection that boosts the winner’s score and simultaneously lowers its confidence — so the right answer surfaces but the model stays cautious.
'100' → '5000' severity=0.980 boost=+0.196 conf_pen=-0.098 (50× — severe)
'1000' → '500' severity=0.500 boost=+0.100 conf_pen=-0.050
'1000' → '1000' severity=0.000 boost=0 conf_pen=0
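The severity numbers in that log are consistent with a simple ratio rule: how far apart the old and new values are relative to the larger one, with the boost and confidence penalty scaled from it. Here is that reading as code, assuming the numeric values have already been extracted from both versions:

def conflict_severity(old_value: float, new_value: float) -> float:
    # 0.0 for identical values, approaching 1.0 as the change grows (100 -> 5000 gives 0.98).
    lo, hi = sorted((abs(old_value), abs(new_value)))
    if hi == 0:
        return 0.0
    return 1.0 - (lo / hi)


def conflict_adjustments(severity: float) -> tuple[float, float]:
    # Boost the winner so the newer fact surfaces, but take confidence back
    # so the model stays cautious about a fact that changed sharply.
    return 0.2 * severity, -0.1 * severity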
When the user specifies a time range
A learner typed “show me research from 2021 to 2023.” The system returned the three most recent documents — none from that range. Temporal decay made it worse, ranking newer documents higher when older ones were exactly what was asked for.
I added a time-range parser that applies a strict filter when the query signals a date window, and steps aside entirely when it doesn’t. I did not want it to guess.
'Show me research from 2021-2023' → kept: research_2022
'What were the findings in 2019?' → kept: research_2019
'Latest embeddings research' → no filter, all docs pass
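The parser itself only needs to find explicit four-digit years; when none are present it returns nothing and the filter stays out of the way. A minimal sketch of that behavior (the repository version handles more phrasings):

import re
from typing import Optional, Tuple

YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")

def parse_year_range(query: str) -> Optional[Tuple[int, int]]:
    # Return (start_year, end_year) when the query names explicit years, else None.
    years = [int(m.group(0)) for m in YEAR_PATTERN.finditer(query)]
    if not years:
        return None  # no date signal: do not guess, apply no filter
    return min(years), max(years)

print(parse_year_range("Show me research from 2021-2023"))  # (2021, 2023)
print(parse_year_range("What were the findings in 2019?"))   # (2019, 2019)
print(parse_year_range("Latest embeddings research"))        # None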
When the query tells you how much recency should matter
“What is the current rate limit?” needs the freshest answer available. “How does cosine similarity work?” doesn’t care if I wrote it three years ago. I was applying the same temporal weight to both. The weight now adjusts based on signal words in the query.
'What is the current rate limit?' → temporal_weight: 0.70
'Has the rate limit changed recently?' → temporal_weight: 0.55
'How does cosine similarity work?' → temporal_weight: 0.20 (baseline)
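The adjustment is keyword-driven rather than learned. A sketch that reproduces the three examples above; the word lists are a per-domain judgment call, and the 0.20 baseline is the one used in these examples rather than the platform-wide 0.40.

STRONG_RECENCY_WORDS = {"current", "currently", "now", "today", "latest"}
SOFT_RECENCY_WORDS = {"recently", "changed", "new", "updated"}

def query_temporal_weight(query: str, baseline: float = 0.20) -> float:
    # Raise the temporal weight only when the query itself asks for fresh information.
    words = set(query.lower().replace("?", "").split())
    if words & STRONG_RECENCY_WORDS:
        return 0.70  # "What is the current rate limit?"
    if words & SOFT_RECENCY_WORDS:
        return 0.55  # "Has the rate limit changed recently?"
    return baseline  # "How does cosine similarity work?"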
Seeing inside the system — and keeping version conflicts out of context
I wanted to know not just where each document ranked, but what to do with it. The freshness report gives kind-aware advice per document:
fresh_event [EVENT] grade: A → Verify before serving, window closes soon
current_policy [VERSIONED] grade: D → Check for a newer version
math_theorem [STATIC] grade: F → May have been superseded
The final problem was subtler. Even with good reranking, the LLM produced hedged or averaged answers when v1 and v3 of the same policy both ended up in context. It doesn’t know which version to trust — it tries to reconcile everything it sees. What solved it was deduplicating by version chain before documents reached the temporal layer at all.
Input: policy_v1 (v1), policy_v2 (v2), policy_v3 (v3)
policy_v1 — EXPIRED → removed
policy_v2 — superseded by v3 → removed
policy_v3 — kept ✓
Result: ['policy_v3']
Policy v3 goes in. The conflict never comes up.
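The dedup pass is a grouping step, not a ranking one. A sketch, assuming each versioned document carries a chain key (for example, the policy slug), an integer version, and an expiry flag; documents outside any chain pass through untouched.

from collections import defaultdict

def dedupe_version_chains(docs: list) -> list:
    # Keep only the newest non-expired document in each version chain.
    chains = defaultdict(list)
    passthrough = []
    for doc in docs:
        if getattr(doc, "chain_id", None) is None:
            passthrough.append(doc)  # STATIC and EVENT docs are not part of a chain
        elif not doc.is_expired:
            chains[doc.chain_id].append(doc)
    survivors = [max(group, key=lambda d: d.version) for group in chains.values()]
    return passthrough + survivors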
Not All Content Decays at the Same Rate
One thing became clear quickly when I applied this to the platform: a single half-life value doesn’t work for all content types. A breaking update and a mathematical definition age very differently, and treating them the same way was quietly sabotaging the rankings.
breaking_news: half_life=1d, temporal_weight=0.70
news: half_life=7d, temporal_weight=0.55
policy: half_life=90d, temporal_weight=0.45
research: half_life=180d, temporal_weight=0.35
legal: half_life=365d, temporal_weight=0.25
reference: half_life=1825d, temporal_weight=0.10
mathematics: half_life=36500d, temporal_weight=0.01

A breaking update and a mathematical definition are both “old” after a year, but only one of them is wrong. Image by Author.
For breaking news, being new is basically the whole point. For a math proof, age doesn’t matter — a theorem from 70 years ago is just as valid as one from last week. On EmiTechLogic I group my content into bands: tutorials use the “policy” setting since newer is usually better, and reference material uses the “reference” setting since the facts don’t expire. Getting this distinction right is what actually made the whole thing work.
There is one more constraint layered on top of half-life: a decay floor. Without it, a math theorem from 1954 gets a decay score near zero — not because it’s wrong, but simply because it’s old. The temporal component then drags its final score down even when the semantic match is strong. The floor prevents that. In the implementation, DECAY_FLOORS maps a (doc_type, kind) pair to a minimum decay value — mathematics/STATIC floors at 0.95, reference/STATIC at 0.70, research/STATIC at 0.10. Documents without a floor entry decay freely; documents with one never drop below their minimum. A cosine-similarity winner that happens to be old still competes on meaning rather than losing automatically on age.
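In code the floor is a lookup plus a max(). The mapping below uses the values mentioned above; the key shape is simplified for this sketch.

# (doc_type, kind) -> minimum decay score; pairs without an entry decay freely.
DECAY_FLOORS = {
    ("mathematics", "STATIC"): 0.95,
    ("reference", "STATIC"): 0.70,
    ("research", "STATIC"): 0.10,
}

def floored_decay(raw_decay: float, doc_type: str, kind: str) -> float:
    # Stop timeless content from being dragged down purely by age.
    return max(raw_decay, DECAY_FLOORS.get((doc_type, kind), 0.0))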
The implementation cost is lower than you’d expect. The temporal reranking step adds roughly 15 to 30 milliseconds per search — negligible next to the 1 to 4 seconds LLM inference typically takes. You don’t need to change your search engine, your data, or your embedding model. The entire temporal layer is a pure Python post-processing step that runs downstream of whatever vector search you’re already using.
The only real upfront requirement is metadata on your documents. At minimum, every document needs a created_at timestamp. valid_from, valid_until, and kind give you the best results, but they’re optional — documents without any metadata fall back to STATIC/VALID with standard time-decay scoring, which is already better than nothing. On my platform I automated the tagging entirely. The system now distinguishes between an update, an alert, and a permanent fact without me labeling anything manually.
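Concretely, the per-document metadata is small. A hypothetical example of the full set of fields, with only created_at being mandatory:

from datetime import datetime, timezone

doc_metadata = {
    "doc_id": "policy_v2",
    "created_at": datetime(2025, 6, 1, tzinfo=timezone.utc),  # required
    "valid_from": datetime(2025, 6, 1, tzinfo=timezone.utc),  # optional
    "valid_until": None,                                       # optional: no known expiry
    "kind": "VERSIONED",                                       # optional: falls back to STATIC
}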
What This Does Not Solve
A few honest caveats before you build this.
Implicit expiration is the one I still haven’t fully solved. Most documents don’t announce when they go stale — a tutorial for a deprecated endpoint has no expiry date, so the system can’t know it’s rotting. My heuristic rules catch the obvious cases, but edge cases slip through, and I find them the same way I found the original problem: a learner gets an answer that’s quietly wrong.
Conflicting sources are outside the temporal layer’s scope entirely. It surfaces the most recent and relevant documents — resolving disagreements between them is the LLM’s problem, not the retriever’s.
Calibration is model-specific in ways that will bite you. The 0.20 raw cosine floor is tuned for TF-IDF. Dense models like text-embedding-3-small score higher in absolute terms, so that floor needs to move to 0.35–0.50. Test against your own queries before you trust any threshold I’ve listed.
The half-life profiles are starting points, not constants. What “stale” means for a legal team is not what it means for a news site. Run the system on real queries from your domain and tune from there.
The Takeaway
The problem isn’t that RAG systems retrieve wrong documents — it’s that they have no concept of when a document was true, only how similar it is to the query.
Two axes drove the whole design — the kind axis was the one I almost missed entirely. Validity state — whether a document is expired (remove it), valid (score normally), or temporal (boost it while its window is active). Document kind — whether it is a timeless fact (STATIC), something that has been replaced (VERSIONED), or something that is only true within a time window (EVENT).
Without the kind axis, a versioned policy with an effective date looks identical to a time-bounded event and gets mislabeled. The system produces the wrong result for a right-sounding reason. That’s the hardest class of bug to catch in production, because nothing looks broken.
The semantic threshold closes the last gap. Fresh-but-irrelevant documents can take over ranking when temporal scores are high. A minimum raw cosine floor for EVENT documents makes sure freshness never fully overrides relevance.
Similarity alone wasn’t enough anymore. I needed the retriever to care about whether the information was still valid.
Run It Yourself
The full implementation (temporal_rag.py, demo.py, and advanced.py) is available at https://github.com/Emmimal/temporal-rag/.
The repository includes the complete validity_state implementation, all decay profiles, the SequenceAwareRetriever, and the freshness report API. The demo runs without any API key using a deterministic TF-IDF embedder so you can reproduce the exact output shown above on any machine.
git clone https://github.com/Emmimal/temporal-rag/
cd temporal-rag
pip install numpy
python demo.py
References
Foundational RAG
[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://arxiv.org/abs/2005.11401
[2] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2024). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997. https://doi.org/10.48550/arXiv.2312.10997
Temporal Reasoning in Language Models
[3] Lazaridou, A., Gribovskaya, E., Stokowiec, W., & Grigorev, N. (2022). Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115. https://doi.org/10.48550/arXiv.2203.05115
[4] Kasai, J., Sakaguchi, K., Takahashi, Y., Le Bras, R., Asai, A., Yu, X., Radev, D., Smith, N. A., Choi, Y., & Inui, K. (2022). RealTime QA: What’s the answer right now? arXiv preprint arXiv:2207.13332. https://doi.org/10.48550/arXiv.2207.13332
[5] Dhingra, B., Cole, J. R., Eisenschlos, J. M., Gillick, D., Eisenstein, J., & Cohen, W. W. (2022). Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10, 257–273. https://doi.org/10.1162/tacl_a_00459
Dense Retrieval and Reranking
[6] Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085. https://doi.org/10.48550/arXiv.1901.04085
[7] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–3992. https://doi.org/10.18653/v1/D19-1410
Scaling Laws (referenced in Scenario 2)
[8] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. https://doi.org/10.48550/arXiv.2001.08361
[9] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., & Sifre, L. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. https://doi.org/10.48550/arXiv.2203.15556
Information Retrieval Fundamentals
[10] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/
Disclosure
All code in this article was written by me and is original work, developed and tested on Python 3.12.6. Benchmark numbers and retrieval outputs are from actual demo runs on my local machine (Windows 11, CPU only) and are reproducible by cloning the repository and running demo.py and advanced.py. The temporal layer, scoring formulas, document classification system, and all design decisions are independent implementations not derived from any cited codebase. The demo runs without any API key using a deterministic TF-IDF embedder; numpy is the only external dependency required to reproduce all outputs shown. I have no financial relationship with any tool, library, or company mentioned in this article.