Context Payload Optimization for ICL-Based Tabular Foundation Models



The past couple of years have seen a surge of investment in open‑source and commercial tabular foundation models built around in‑context learning (ICL). In 2025, for example, the software giant SAP released the SAP-RPT-1 suite of models, targeting ERP-centric tasks in areas such as financial planning, sales and procurement order processing, and supply chain management. Unlike traditional supervised machine learning – where models are trained and fine-tuned for specific tasks – ICL allows a single, generically pretrained model to adapt on the fly using relatively small amounts of task-specific data provided in the context payload, which acts as a kind of ephemeral training set.

While the shift to ICL eliminates the need for costly (re)training of task-specific tabular models, it introduces an important accuracy-latency trade‑off at inference time, especially for centrally hosted models like SAP-RPT-1. On the one hand, the time required to send the context payload to the model server, and for the model to interpret and learn from that payload, directly contributes to overall response latency. Smaller payloads can reduce latency. On the other hand, the model may need to infer complex schemas and data distributions from heterogeneous contextual data that potentially contains outliers, missing values, and long-tail patterns. Accurate predictions typically depend on large, well-curated context payloads. In practice, this means finding ways to distill the context payload to reduce response time without degrading the model’s predictive performance. Secondary trade‑offs involve factors such as model service throughput, response stability, and the monetary cost of model usage. All these challenges make context payload optimization a central architectural concern in ICL‑based workflows.

In the following sections, we will examine the inference‑time trade‑offs entailed by ICL-based tabular foundation models in more detail, outline practical strategies for optimizing context payloads, and demonstrate the use of KNN‑based context prefiltering as a payload optimization technique with an end-to-end example in Python.

Inference-Time Trade-Offs

An effective approach to analyzing the inference‑time trade-offs of ICL‑based tabular foundation models is to apply the so-called “iron triangle” framework discussed in this previous article. There, we showed how customers and users of AI systems must navigate the inherent tensions between response quality, inference cost, and latency, which is an inference‑time analog of the classic, design-time “triple constraint” in project management. Crucially, improving any one of these dimensions typically puts pressure on the others: higher‑quality responses tend to be more computationally intensive, which increases both latency and cost; reducing latency often requires sacrificing quality or paying more for faster hardware; and lowering cost usually means accepting slower or lower‑quality AI responses.

We encounter this same triangular tension in the context of ICL‑based tabular foundation models. The primary trade‑off is the need to balance response quality (measured in terms of precision, recall, etc.) against latency. Consider a real‑time fraud detection system deployed at ATMs: both precision and speed are critical, yet they pull the system in different directions when it comes to constructing the context payload. Larger, richer payloads give the AI model more examples from which to infer the underlying schema, recognize rare and long‑tail patterns, and thus deliver higher‑quality predictions. At the same time, each additional row or feature increases the volume of data that must be sent to the model server and interpreted during inference, which can introduce a measurable overhead to the end-to-end response time. In real-time applications, even a small increase in payload size can noticeably degrade system responsiveness and, ultimately, damage user experience.

Furthermore, a number of related, secondary trade‑offs emerge in practice. A larger context payload not only slows down inference but also consumes more tokens. Under token-based billing, this creates a tension between response latency and the monetary cost of model usage for customers, which becomes especially salient for centrally hosted models like SAP-RPT-1. A larger payload can increase the compute time per request, creating a latency-throughput trade‑off that may force the AI system’s development team to make tough scaling decisions. There is also a potential quality-stability trade‑off: increasing the volume and variety of the context data can improve predictive accuracy but may reduce determinism by introducing noise and making outputs more sensitive to small variations in the data. Finally, more sophisticated payload selection methods such as KNN-based retrieval can improve prediction quality but also increase payload construction time, adding to the overall latency.

Context Payload Optimization Strategies

In general, strategies to optimize the context payload span two orthogonal dimensions: the method and the moment of optimization. The method of optimization determines how exactly the payload is curated, i.e., the specific filtering, clustering, or embedding techniques used to compress the rows in the raw context. The moment of optimization concerns when and where the optimization is carried out, e.g., whether it is precomputed offline or derived on the fly at inference time, and whether this is done by the client or the model service. Choosing a particular moment for constructing the optimized payload can have important consequences for inference latency and maintainability. The method and moment of payload optimization should be aligned with the scope, budget, latency threshold, and quality requirements of a given AI use case.

Methods of Optimization

We can broadly distinguish between task-agnostic and task-aware methods of payload optimization. Task‑agnostic methods rely on techniques such as random sampling and recency‑based sampling, which do not require knowledge of the specific prediction task or the semantic structure of the data. Random sampling is easy to implement, fast, and unbiased, making it a useful baseline or fallback strategy. However, it may inadvertently discard rows that capture rare yet important patterns crucial for model performance. Recency‑based sampling assumes that timestamps are recorded in the data, and retrieves the most recent rows, which can be valuable for data distributions that are time‑bound (e.g., seasonal) or susceptible to temporal drift. However, recency-based sampling ignores the broader structure of the dataset and may overweight short‑term noise. Overall, task‑agnostic methods offer simplicity and speed but provide limited control over the representativeness and relevance of the resulting payload.
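As a minimal sketch, both task‑agnostic strategies can be expressed in a few lines of pandas. The `max_rows` budget and the `timestamp` column in the toy data below are illustrative assumptions, not part of any model API:

```python
import pandas as pd
import numpy as np

def random_sample_context(df: pd.DataFrame, max_rows: int, seed: int = 0) -> pd.DataFrame:
    """Task-agnostic baseline: uniform random sample of the raw context."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)

def recency_sample_context(df: pd.DataFrame, max_rows: int, ts_col: str = "timestamp") -> pd.DataFrame:
    """Task-agnostic: keep the most recent rows (assumes a timestamp column exists)."""
    return df.sort_values(ts_col, ascending=False).head(max_rows)

# Toy data with a hypothetical hourly timestamp column
df = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=1000, freq="h"),
    "value": np.arange(1000),
})

print(len(random_sample_context(df, 100)))                   # 100
print(int(recency_sample_context(df, 100)["value"].min()))   # 900
```

Note how the recency-based variant keeps only the newest 100 rows, which is exactly the behavior that can overweight short-term noise.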

By contrast, task‑aware methods can incorporate information about the prediction task, the query rows, and the underlying data distribution to select the most relevant rows for the context payload. A common approach is K‑nearest neighbors (KNN) sampling, which identifies rows in the historical data that are similar to the query rows. This can yield highly relevant contextual data and strong empirical performance, but requires a distance metric (e.g., cosine) and auxiliary models to vectorize or embed the data, and can thus be computationally expensive at scale. Another class of techniques uses clustering algorithms (e.g., K‑means, hierarchical clustering, DBSCAN) to draw representative samples from clusters pertaining to the query rows. This can ensure sufficient coverage of diverse patterns in the data while avoiding redundancy, though it typically requires offline computation of clusters and periodic re-computation to ensure that the clusters remain up to date.
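A minimal sketch of cluster-based representative sampling using scikit-learn's KMeans; the encoded context matrix, cluster count, and row budget are illustrative assumptions (the KNN variant is demonstrated end to end in the hands-on section):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical encoded context (rows x features) and an overall row budget
X_context = rng.normal(size=(500, 8))
budget = 60
n_clusters = 6

# Offline step: cluster the historical context rows
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_context)

# Draw a roughly equal number of representatives from each cluster
selected = []
per_cluster = budget // n_clusters
for c in range(n_clusters):
    members = np.flatnonzero(km.labels_ == c)
    take = min(per_cluster, len(members))
    selected.extend(rng.choice(members, size=take, replace=False))

selected = np.unique(selected)
print(f"Selected {len(selected)} representative rows (budget: {budget})")
```

Stratifying the draw across clusters is what provides coverage of diverse patterns; a production variant would sample only from clusters relevant to the query rows.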

More sophisticated task‑aware methods are also possible. For example, the raw context and query rows can be embedded in a low-dimensional vector space, encoded in the request, and decoded in the response of the foundation model API; this amounts to a form of lossy compression that sacrifices some accuracy for the latency and cost benefits of a smaller payload. Retrieval‑augmented generation (RAG) techniques can further enrich the payload with domain‑specific grounding to boost response relevance.
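To make the lossy-compression idea concrete, the sketch below uses PCA to project numeric context rows into a low-dimensional space and measures the payload size ratio and reconstruction error. This is purely illustrative: real model APIs (including SAP-RPT-1) may not accept pre-embedded payloads, and the synthetic data is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical numeric context: 400 rows x 32 features with low-rank structure
latent = rng.normal(size=(400, 4))
X = latent @ rng.normal(size=(4, 32)) + 0.01 * rng.normal(size=(400, 32))

pca = PCA(n_components=4).fit(X)
Z = pca.transform(X)              # compressed payload: 400 x 4
X_hat = pca.inverse_transform(Z)  # lossy reconstruction

compression = Z.size / X.size
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"payload size ratio: {compression:.3f}, relative error: {err:.4f}")
```

The payload shrinks to one eighth of its original size, at the cost of a small reconstruction error; with less structured data, the same compression ratio would lose more information.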

In sum, task‑aware methods generally produce higher‑quality context payloads but come with greater engineering and computational overhead.

Moments of Optimization

One key moment-related decision is about whether some of the payload optimization steps can be pre-computed offline (i.e., the “when”). For example, a curated, “golden” dataset can be pre-computed from historical data, optimized for informational density, and enriched with metadata (e.g., cluster IDs, hashtags, etc.). Relevant rows can be selected from this leaner, golden dataset to quickly construct and send the context payload at inference time. Golden datasets are well-suited for stable schemas and repetitive tasks (e.g., auto-completion of common sales orders in the ERP domain), but their curation and maintenance can create additional overhead for the development team. In contrast, on‑the‑fly optimization derives the payload at inference time based on the current query rows and available historical data. This approach is more adaptive but can increase the compute cost and latency for each inference call. On‑the‑fly optimization also does not necessarily reduce the development team’s overhead – the savings from not maintaining a golden dataset may be offset by the prompt engineering effort required to optimize the context payload dynamically.
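The golden-dataset pattern can be sketched as an offline curation step plus a fast online lookup. The cluster-ID metadata, per-cluster core size, and synthetic data below are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# --- Offline: curate a "golden" dataset enriched with cluster-ID metadata ---
X_hist = rng.normal(size=(1000, 6))
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_hist)
golden = pd.DataFrame(X_hist)
golden["cluster_id"] = km.labels_

# Keep only a dense core per cluster (the 25 rows closest to each centroid)
core_rows = []
for c in range(8):
    members = golden[golden["cluster_id"] == c]
    d = np.linalg.norm(members.iloc[:, :6].to_numpy() - km.cluster_centers_[c], axis=1)
    core_rows.append(members.iloc[np.argsort(d)[:25]])
golden_core = pd.concat(core_rows)

# --- Online: select golden rows matching the query's cluster ---
x_query = rng.normal(size=(1, 6))
q_cluster = km.predict(x_query)[0]
payload_rows = golden_core[golden_core["cluster_id"] == q_cluster]
print(f"Payload rows from golden core: {len(payload_rows)}")
```

At inference time, the payload is assembled with a single metadata filter rather than a full similarity search, which is where the latency savings come from; the offline clustering must be refreshed as the data drifts.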

Another moment-related decision concerns whether the optimization occurs on the client or service side (i.e., the “where”). Client‑side optimization gives the consuming application full control, allowing bespoke preprocessing, local caching, and easier debugging. But it also makes each client responsible for implementing and maintaining its own optimization logic – an effort that may be duplicated across applications and teams. Client‑side processing also requires sufficient compute resources, which may be hard for applications running on resource‑constrained IoT or edge devices. Service‑side optimization, by contrast, benefits from economies of scale: with sufficient usage across clients, the AI service provider can justify more sophisticated algorithms and higher‑end hardware than any single client would deploy on its own. The provider can also leverage deep, model‑specific expertise and visibility into how the model performs across multiple client environments – compounding over time – to develop a more refined and harmonized strategy. Service‑side processing also simplifies governance, since software updates, privacy controls, audit logging, and compliance checks can be enforced uniformly. Downsides include reduced transparency for clients, higher load on the provider’s infrastructure, and the ongoing cost to the AI service provider of developing and maintaining the optimization logic.

Of course, ICL-based tabular AI workflows can also adopt a hybrid strategy that combines the strengths of different options. One useful pattern consists of coarse client‑side filtering to reduce the payload to a manageable size (e.g., selecting the top‑K nearest neighbors or applying some other simple heuristics), paired with fine‑grained service‑side pruning using model‑aware signals to refine the final context before inference. Hybrid approaches can strike a good balance between transparency, flexibility, governance, and performance.
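The hybrid pattern can be sketched as two stages. The service-side stage here is hypothetical: real model-aware signals are internal to the provider, so a tighter distance cut stands in for them, and the `K_COARSE`/`K_FINE` budgets are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X_context = rng.normal(size=(5000, 8))
x_query = rng.normal(size=(1, 8))

# --- Client side: coarse top-K prefilter to cap the payload size ---
K_COARSE = 200
d = pairwise_distances(x_query, X_context)[0]
coarse_idx = np.argsort(d)[:K_COARSE]

# --- Service side (hypothetical): fine-grained pruning with model-aware
# signals; approximated here by a tighter distance cut on the coarse set ---
K_FINE = 50
fine_idx = coarse_idx[np.argsort(d[coarse_idx])[:K_FINE]]

print(f"coarse: {len(coarse_idx)} rows, fine: {len(fine_idx)} rows")
```

The client never transmits more than `K_COARSE` rows, while the provider retains control over the final context actually fed to the model.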

Hands-On Demo: KNN‑Based Context Prefiltering

In the following example Python code, we will use the Solar Flare dataset and the playground version of the SAP-RPT-1 model. See this article for an introduction to the model API.

Setup

First, install the necessary third-party packages using the requirements.txt file:

pandas
numpy
requests
scikit-learn
ucimlrepo

Next, create a file called demo.py and add the following import statements:

import pandas as pd
import numpy as np
import time
import json
import requests
import sys
import os
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import pairwise_distances
from ucimlrepo import fetch_ucirepo

Add these configuration parameters:

EXPERIMENT_ORDER = ["without_prefiltering", "with_prefiltering"]

API_URL = "https://rpt.cloud.sap/api/predict"
ACCESS_TOKEN_PATH = "access_token.json"  # File containing your API token

with open(ACCESS_TOKEN_PATH, "r") as f:
    token = json.load(f)["access_token"]

n_test_rows = 20  # Number of query rows to use
mask_proportion = 0.3  # Proportion of column values to mask (simulating a prediction scenario)
max_masked_columns = 4  # Playground model limitation
random_seed = 3  # Ensure reproducibility
rng = np.random.default_rng(random_seed)  # Create a random number generator

ctx_max_rows = 600  # Max rows allowed in context window

Add this code to enable output logging:

class Tee:
    """A simple stdout tee: Prints to console and writes to a log file."""
    def __init__(self, logfile_path):
        self.terminal = sys.stdout
        self.log = open(logfile_path, "a", encoding="utf-8")

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

    def flush(self):
        self.terminal.flush()
        self.log.flush()

script_dir = os.path.dirname(os.path.abspath(__file__))

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

log_filename = f"log_knn_seed{random_seed}_{''.join([x[0] for x in EXPERIMENT_ORDER])}_{timestamp}.log"

log_path = os.path.join(script_dir, log_filename)

sys.stdout = Tee(log_path)

print(f"Logging enabled. Output is being written to: {log_path}\n")

Next, we will add helper functions for diagnostics, constructing the SAP-RPT-1 model payload, calling the model, and exporting the prediction results to a CSV file.

An example function for computing feature statistics of the dataset:

def compute_feature_stats(df, random_seed):
    """
    Computes cardinality and HHI concentration metric for each feature.
    Saves results to: feature_stats_knn_seed<seed>_<timestamp>.csv
    """
    stats = []

    for col in df.columns:
        if col == "id":
            continue

        cardinality = df[col].nunique()

        # Normalized value counts
        vc = df[col].value_counts(normalize=True)

        # Herfindahl-Hirschman Index
        # HHI = 1.0 implies perfectly concentrated (only one value appears)
        # HHI = 0.01 implies very uniform distribution
        # Higher HHI implies higher feature concentration
        hhi = float((vc ** 2).sum())

        # Dominant category proportion (share of most common feature value)
        max_prop = float(vc.max())

        stats.append({
            "feature": col,
            "cardinality": cardinality,
            "hhi": hhi,
            "max_proportion": max_prop
        })

    stats_df = pd.DataFrame(stats)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"feature_stats_knn_seed{random_seed}_{timestamp}.csv"

    stats_df.to_csv(filename, index=False)
    print(f"Saved feature stats to {filename}\n")

Functions for constructing the SAP-RPT-1 model payload by simulating a prediction scenario, and safely calling the model API:

def mask_row_values(row, allowed_mask_columns, p, rng):
    row = row.copy()
    mask_candidates = [c for c in allowed_mask_columns if rng.random() < p]
    for c in mask_candidates:
        row[c] = "[PREDICT]"
    return row


def build_payload(df, index_column="id"):
    return {"rows": df.to_dict(orient="records"), "index_column": index_column}


def safe_call_rpt1(payload, token):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}"
    }

    try:
        response = requests.post(API_URL, json=payload, headers=headers)

        try:
            response_json = response.json()
        except ValueError:
            print("\nNon-JSON response from RPT-1:")
            print(response.text)
            return False, {"error": "Non-JSON response"}

        if "error" in response_json:
            print("\nRPT-1 API returned an error:")
            print(json.dumps(response_json, indent=2))
            return False, response_json

        if "aiApiResponsePayload" not in response_json:
            print("\nMissing aiApiResponsePayload:")
            print(json.dumps(response_json, indent=2))
            return False, response_json

        payload = response_json["aiApiResponsePayload"]

        if "predictions" not in payload:
            print("\nMissing predictions in aiApiResponsePayload:")
            print(json.dumps(response_json, indent=2))
            return False, response_json

        return True, response_json

    except requests.exceptions.RequestException as e:
        print("\nHTTP request failed:")
        print(str(e))
        return False, {"error": str(e)}

Functions for prediction post-processing:

def flatten_predictions(pred_list):
    flat = {}
    for entry in pred_list:
        row = {}
        for key, value in entry.items():
            if key == "id":
                row["id"] = str(value)
            else:
                if isinstance(value, list) and len(value) > 0:
                    row[key] = value[0].get("prediction")
                else:
                    row[key] = None
        flat[row["id"]] = row
    return pd.DataFrame(flat.values()).set_index("id")


def evaluate_accuracy(pred_df, true_df, masked_df):
    correct = 0
    total = 0
    for idx in masked_df.index:
        for col in masked_df.columns:
            # Does not count predictions for unmasked columns
            if masked_df.loc[idx, col] == "[PREDICT]":
                total += 1
                if pred_df.loc[idx, col] == true_df.loc[idx, col]:
                    correct += 1
    return correct, total, correct / total if total > 0 else np.nan


def export_predictions_dynamic(true_rows, masked_rows, pred_df, filename):
    """
    Export a NaN-free CSV where:
      - masked columns get model predictions
      - unmasked columns keep their true values
      - pred_df is aligned to true_rows by id
    """

    # Ensure pred_df uses string ids, matching the index of true_rows
    pred_df = pred_df.copy()
    pred_df.index = pred_df.index.astype(str)

    # Reindex pred_df to match true_rows
    pred_df = pred_df.reindex(true_rows.index)

    # Start with true rows
    merged = true_rows.reset_index().copy()

    # Align mask by id
    masked_by_id = masked_rows.copy()

    # Add prediction columns dynamically
    for col in pred_df.columns:
        pred_col = f"pred_{col}"

        # Start with true values
        merged[pred_col] = merged[col]

        # Overwrite only where masked; .to_numpy() avoids index alignment issues
        mask = masked_by_id[col] == "[PREDICT]"
        merged.loc[mask.values, pred_col] = pred_df.loc[mask.values, col].to_numpy()

    # Save CSV
    merged.to_csv(
        filename,
        index=False,
        encoding="utf-8",
        quoting=1
    )

    print(f"Saved results to {filename}\n")

Next, load and prepare the Solar Flare dataset:

solar_flare_data = fetch_ucirepo(id=89)

df = pd.concat([solar_flare_data.data.features, solar_flare_data.data.targets], axis=1)

df.columns = [
    "zurich_class",
    "spot_size",
    "spot_dist",
    "activity",
    "evolution",
    "prev24_fac",
    "hist_complex",
    "region_complex",
    "area",
    "area_largest_spot",
    "c_class",
    "m_class",
    "x_class",
]

if "id" not in df.columns:
    df["id"] = df.index.astype(str)

# Convert numeric codes to words to force categorical behavior
replacement_map = {"0": "zero", "1": "one", "2": "two", "3": "three"}
for col in df.columns:
    if col != "id":
        df[col] = df[col].astype(str)
        df[col] = df[col].replace(replacement_map)

Save feature statistics:

compute_feature_stats(df, random_seed)

Now add code to simulate the prediction scenario. First, split the Solar Flare dataset into context and query/test rows:

df_test_rows = df.sample(n=n_test_rows, random_state=random_seed)

# Drop the sampled rows from the context before resetting any indices;
# resetting df_test_rows first would make drop() remove the wrong rows
df_context_full = df.drop(df_test_rows.index).reset_index(drop=True)

df_test_rows = df_test_rows.reset_index(drop=True)

Then randomly mask some columns in the query/test rows:

all_columns = [c for c in df.columns if c != "id"]

allowed_mask_columns = rng.choice(all_columns, size=max_masked_columns, replace=False)

df_test_rows_masked = df_test_rows.apply(
    lambda row: mask_row_values(row, allowed_mask_columns, mask_proportion, rng),
    axis=1
)

df_test_rows_masked["id"] = df_test_rows["id"]

Prefiltering Logic

Add the following code to derive an optimized set of context rows (df_context_prefiltered) on the fly using KNN-based prefiltering:

start_prefilter = time.time()

n_test = df_test_rows.shape[0]
budget_per_row = max(1, (ctx_max_rows - n_test) // n_test)

print(f"Context max rows: {ctx_max_rows}")
print(f"Number of test rows: {n_test}")
print(f"KNN budget per test row: {budget_per_row}\n")

# Encode using LabelEncoder (can use more sophisticated vectorizers and embedding models in practice)
encoders = {}
df_context_enc = df_context_full.copy()
df_test_enc = df_test_rows.copy()

for col in df_context_full.columns:
    if col == "id":
        continue
    le = LabelEncoder()
    # Fit on the union of context and test values to avoid unseen-label errors
    le.fit(pd.concat([df_context_full[col], df_test_rows[col]]).astype(str))
    df_context_enc[col] = le.transform(df_context_full[col].astype(str))
    df_test_enc[col] = le.transform(df_test_rows[col].astype(str))
    encoders[col] = le

X_context = df_context_enc.drop(columns=["id"]).to_numpy()
X_test = df_test_enc.drop(columns=["id"]).to_numpy()

selected_indices = []
for x_test in X_test:
    dists = pairwise_distances([x_test], X_context)[0]
    nearest = np.argsort(dists)[:budget_per_row]
    selected_indices.extend(nearest)

df_context_prefiltered = (
    df_context_full.iloc[selected_indices]
    .drop_duplicates()
    .reset_index(drop=True)
)

end_prefilter = time.time()
prefilter_time = end_prefilter - start_prefilter

print(f"Prefiltering time: {prefilter_time:.3f} seconds")
print(
    f"Prefiltered rows: {len(df_context_prefiltered)} "
    f"({100 * len(df_context_prefiltered) / len(df_context_full):.2f}% of full context)\n"
)

Running Experiments

Add the following functions to call the model with and without context optimization (i.e., KNN-based prefiltering).

def run_without_prefiltering():
    print("=== CASE 1: NO PREFILTERING ===")
    
    start = time.time()

    df_context_without_prefiltering = pd.concat(
        [df_context_full, df_test_rows_masked], ignore_index=True
    )

    payload = build_payload(df_context_without_prefiltering)

    success, response = safe_call_rpt1(payload, token)

    end = time.time()

    inference_time = end - start
    print(f"Case 1 inference time: {inference_time:.3f} seconds")

    acc = np.nan
    if success:
        pred_df = flatten_predictions(response["aiApiResponsePayload"]["predictions"])
        pred_df = pred_df.astype(str)

        true_rows = df_test_rows.set_index("id")
        masked_rows = df_test_rows_masked.set_index("id")

        correct, total, acc = evaluate_accuracy(pred_df, true_rows, masked_rows)
        print(f"Case 1 accuracy: {correct}/{total} = {acc:.3f}\n")

        # Use helper for NaN-free export
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"results_knn_seed{random_seed}_c_{timestamp}.csv"
        export_predictions_dynamic(true_rows, masked_rows, pred_df, filename)

    else:
        print("Skipping accuracy evaluation.\n")

    return inference_time, acc


def run_with_prefiltering():
    print("=== CASE 2: KNN-BASED PREFILTERING ===")
    
    start = time.time()
    
    df_context_with_prefiltering = pd.concat(
        [df_context_prefiltered, df_test_rows_masked], ignore_index=True
    )

    payload = build_payload(df_context_with_prefiltering)

    success, response = safe_call_rpt1(payload, token)

    end = time.time()

    inference_time = end - start
    print(f"Case 2 inference time (RPT-1 call): {inference_time:.3f} seconds")

    acc = np.nan
    if success:
        pred_df = flatten_predictions(response["aiApiResponsePayload"]["predictions"])
        pred_df = pred_df.astype(str)

        true_rows = df_test_rows.set_index("id")
        masked_rows = df_test_rows_masked.set_index("id")

        correct, total, acc = evaluate_accuracy(pred_df, true_rows, masked_rows)
        print(f"Case 2 accuracy: {correct}/{total} = {acc:.3f}\n")

        # Use helper for NaN-free export
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"results_knn_seed{random_seed}_t_{timestamp}.csv"
        export_predictions_dynamic(true_rows, masked_rows, pred_df, filename)

    else:
        print("Skipping accuracy evaluation.\n")

    return inference_time, acc

Finally, run the experiments and print/log the results:

def run_experiments(order):
    results = {}
    for exp in order:
        if exp == "without_prefiltering":
            results["without_prefiltering"] = run_without_prefiltering()
        elif exp == "with_prefiltering":
            results["with_prefiltering"] = run_with_prefiltering()
        else:
            print(f"Unknown experiment type: {exp}")
    return results

print("=== RUNNING EXPERIMENTS ===\n")
results = run_experiments(EXPERIMENT_ORDER)

print("\n=== FINAL RESULTS ===")
print(results)

Note that the first call to the model API may take noticeably longer because the service needs to warm up. This can involve loading the model into memory, initializing runtime kernels, and establishing network connections. Subsequent calls reuse the initialized state and thus tend to run faster. Changing the order of experiments will shift which one absorbs the initial warm‑up cost. To see this in action, try changing the order of experiments in the EXPERIMENT_ORDER configuration parameter (e.g., running the experiment with prefiltering before the one without prefiltering).

The Wrap

As ICL‑based tabular foundation models become more widely adopted, the locus of optimization will shift from traditional supervised model training to context payload construction. The quality, cost, and latency characteristics of an ICL‑based system depend less on how the foundation model was trained and far more on how effectively the context payload is leveraged at inference time. This shift will likely push organizations toward repeatable, reusable patterns for managing context payloads. Just as the industry eventually standardized around feature stores, data pipelines, and prompt‑engineering conventions, we can expect a similar consolidation of best practices for context payload design. Over time, these patterns could become part of the shared vocabulary for development teams working with ICL-based tabular foundation models, elevating context optimization to a first‑class architectural concern.
