Architecting GPUaaS for Enterprise AI On-Prem



Contents

  • Initial Setup
  • Bootstrapping the Node
  • Configuring GPUs with the NVIDIA GPU Operator
  • Foundational Storage
  • System Architecture
  • The Scheduling Plane
  • The Control Plane
  • The Runtime Plane
  • GPU Scheduling
  • MIG & Time-slicing
  • Two Allocation Modes
  • GPU “Tokenomics”
  • What Real Deployments Look Like
  • Quantifying It
  • Conclusion

AI is evolving rapidly, and software engineers no longer need to memorize syntax. However, thinking like an architect and understanding the technology that enables systems to run securely at scale is becoming increasingly valuable. I also want to reflect on being in my role for a year now as an AI Solutions Engineer at Cisco.

I work with customers daily across different verticals: healthcare, financial services, manufacturing, and law firms. They are all trying to answer largely the same set of questions: What’s our AI strategy? What use cases actually fit our data? Cloud vs. on-prem vs. hybrid? How much will it cost — not just today, but at scale? How do we secure it? These are the real practical constraints that show up immediately once you try to operationalize AI beyond a POC.

Recently, we added a Cisco UCS C845A to one of our labs. It has 2x NVIDIA RTX PRO 6000 Blackwell GPUs, 3.1TB NVMe, ~127 allocatable CPU cores, and 754GB RAM. I decided to build a shared internal platform on top of it, giving teams a consistent, self-service environment to run experiments, validate ideas, and build hands-on GPU experience. I deployed the platform as a Single Node OpenShift (SNO) cluster and layered a multi-tenant GPUaaS experience on top. Users reserve capacity through a calendar UI, and the system provisions an isolated ML environment prebuilt with PyTorch/CUDA, JupyterLab, VS Code, and more. Within that environment, users can run on-demand inference, iterate on model training and fine-tuning, and prototype production-grade microservices.

This post walks through the architecture — how scheduling decisions are made, how tenants are isolated, and how the platform manages itself. The decisions that went into this lab platform are the same ones any organization faces when they’re serious about AI in production. This is the foundation for enterprise AI at scale. Multi-agent architectures, self-service experimentation, secure multi-tenancy, cost-predictable GPU compute: it all starts with getting the platform layer right.

High-level platform architecture diagram. Image created by author.

Initial Setup

Before there’s a platform, there’s a bare metal server and a blank screen.

Bootstrapping the Node

The node ships with no operating system. When you power it on you’re dropped into a UEFI shell. For OpenShift, installation typically starts in the Red Hat Hybrid Cloud Console via the Assisted Installer. The Assisted Installer handles cluster configuration through a guided setup flow, and once complete, generates a discovery ISO — a bootable RHEL CoreOS image preconfigured for your environment. Map the ISO to the server as virtual media through the Cisco IMC, set boot order, and power on. The node will phone home to the console, and you can kick off the installation process. The node writes RHCOS to NVMe and bootstraps. Within a few hours you have a running cluster.

This workflow assumes internet connectivity, pulling images from Red Hat’s registries during install. That’s not always an option. Many of the customers I work with operate in air-gapped environments where nothing touches the public internet. The process there is different: generate ignition configs locally, download the OpenShift release images and operator bundles ahead of time, mirror everything into a local Quay registry, and point the install at that. Both paths get you to the same place. The assisted install is much easier. The air-gapped path is what production looks like in regulated industries.
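For the disconnected path, the mirroring step is typically driven by an oc-mirror ImageSetConfiguration. Here is a minimal sketch, assuming an internal Quay registry at quay.internal.example.com and an OpenShift 4.16 release channel; the operator package names and catalogs below are assumptions, so verify them against your own catalogs before mirroring:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  registry:
    # Assumed internal registry; this repo tracks what has already been mirrored
    imageURL: quay.internal.example.com/mirror/oc-mirror-metadata
mirror:
  platform:
    channels:
      - name: stable-4.16               # assumed release channel
        type: ocp
  operators:
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.16
      packages:
        - name: gpu-operator-certified  # NVIDIA GPU Operator
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
      packages:
        - name: lvms-operator           # LVM Storage
  additionalImages:
    - name: docker.io/vllm/vllm-openai:latest   # pre-stage large inference images too

Running oc mirror --config=imageset.yaml docker://quay.internal.example.com/mirror pushes everything into the local registry and generates the image content source manifests (ICSP or IDMS, depending on the oc-mirror version) that point the installation at the mirrored content.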

Configuring GPUs with the NVIDIA GPU Operator

Once the GPU Operator is installed (it is installed automatically when you use the Assisted Installer), I configured how the two RTX PRO 6000 Blackwell GPUs are presented to workloads through two ConfigMaps in the nvidia-gpu-operator namespace.

The first — custom-mig-config — defines physical partitioning. In this case it is a mixed strategy: GPU 0 is partitioned into four 1g.24gb MIG slices (~24GB of dedicated memory each), while GPU 1 stays whole for workloads that need the full ~96GB. MIG partitioning is real hardware isolation. You get dedicated memory, compute units, and L2 cache per slice. Workloads see MIG instances as separate physical devices.
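As a sketch, that geometry can be expressed in the mig-parted format the GPU Operator consumes. The profile name below is an assumption; the ClusterPolicy's MIG manager is pointed at this ConfigMap and the node is labeled nvidia.com/mig.config=four-slices-plus-whole to apply it:

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: nvidia-gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      four-slices-plus-whole:        # assumed profile name
        - devices: [0]               # GPU 0: carved into four 1g.24gb slices
          mig-enabled: true
          mig-devices:
            "1g.24gb": 4
        - devices: [1]               # GPU 1: MIG disabled, stays a whole ~96GB device
          mig-enabled: false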

The second — device-plugin-config — configures time-slicing, which allows multiple pods to share the same GPU or MIG slice through rapid context switching. I set 4 replicas per whole GPU and 2 per MIG slice. This is what enables running multiple inference containers side by side within a single session.
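The time-slicing side is a second sketch along the same lines; the data key is an assumption and is referenced from the ClusterPolicy's device plugin config (or a per-node nvidia.com/device-plugin.config label), with replica counts matching the numbers above:

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: nvidia-gpu-operator
data:
  blackwell-sharing: |                    # assumed config key
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu          # the whole GPU advertises 4 virtual slots
            replicas: 4
          - name: nvidia.com/mig-1g.24gb  # each MIG slice advertises 2
            replicas: 2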

Foundational Storage

The 3.1TB NVMe is managed by the LVM Storage Operator (lvms-vg1 StorageClass). I created two PVCs as part of the initial provisioning process — a volume backing PostgreSQL and persistent storage for OpenShift’s internal image registry.
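Provisioning against that StorageClass is just a PVC; a minimal sketch, with the namespace and size as illustrative placeholders rather than the exact manifests used here:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: gpuaas-platform       # assumed platform namespace
spec:
  accessModes:
    - ReadWriteOnce                # LVMS carves node-local volumes from the NVMe volume group
  storageClassName: lvms-vg1
  resources:
    requests:
      storage: 50Gi                # illustrative size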

With the OS installed, the network prerequisites met (DNS, IP allocation, and all required A records; not covered in this article), the GPUs partitioned, and storage provisioned, the cluster is ready for the application layer.

System Architecture

This leads us into the main topic: the system architecture. The platform separates into three planes (scheduling, control, and runtime), with the PostgreSQL database as the single source of truth.

In the platform management namespace, there are four always-on deployments:

  • Portal app: a single container running the React UI and FastAPI backend
  • Reconciler (controller): the control loop that continuously converges cluster state to match the database
  • PostgreSQL: persistent state for users, reservations, tokens, and audit history
  • Cache daemon: a node-local service that pre-stages large model artifacts / inference engines so users can start quickly (pulling a 20GB vLLM image over corporate proxy can take hours)

A quick note on the development lifecycle, because it’s easy to overcomplicate shipping Kubernetes systems. I write and test code locally, but the images are built in the cluster using OpenShift BuildConfigs and pushed to the internal registry. The deployments themselves just point at those images.

The first time a component is introduced, I apply the manifests to create the Deployment/Service/RBAC. After that, most changes are just a new in-cluster image build followed by a restart, so the Deployment pulls the updated image and rolls forward:

oc rollout restart deployment/<deployment-name> -n <namespace>

That’s the loop: commit → in-cluster build → internal registry → restart/rollout.
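Concretely, assuming a binary-source BuildConfig named portal in a namespace called gpuaas-platform (both names are placeholders), the loop is two commands:

# Build a new image in-cluster from the local source tree and stream the build log
oc start-build portal --from-dir=. --follow -n gpuaas-platform

# Restart the Deployment so it rolls forward onto the freshly pushed image
oc rollout restart deployment/portal -n gpuaas-platform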

The Scheduling Plane

This is the user-facing entry point. Users see the resource pool (GPUs, CPU, memory), pick a time window, choose their GPU allocation mode (more on this later), and submit a reservation.

GPUs are expensive hardware with a real cost per hour whether they’re in use or not. The reservation system treats calendar time and physical capacity as a combined constraint. The same way you’d book a conference room, except this room has 96GB of VRAM and costs considerably more per hour.

Under the hood, the system queries overlapping reservations against pool capacity using advisory locks to prevent double booking. Essentially it is just adding up reserved capacity and subtracting it from total capacity. Each reservation tracks through a lifecycle: APPROVED → ACTIVE → COMPLETED, with CANCELED and FAILED as terminal states.

The FastAPI server itself is intentionally thin. It validates input, persists the reservation, and returns. It never talks to the Kubernetes API.

The Control Plane

At the heart of the platform is the controller. It is Python-based and runs in a continuous loop on a 30-second cadence. You can think of it like a cron job in terms of timing, but architecturally it’s a Kubernetes-style controller responsible for driving the system toward a desired state.

The database holds the desired state (reservations with time windows and resource requirements). The reconciler reads that state, compares it against what actually exists in the Kubernetes cluster, and converges the two. There are no concurrent API calls racing to mutate cluster state; just one deterministic loop making the minimum set of changes needed to reach the desired state. If the reconciler crashes, it restarts and continues exactly where it left off, because the source of truth (desired state) remains intact in the database.

Each reconciliation cycle evaluates four concerns in order:

  1. Stop expired or canceled sessions and delete the namespace (which cascades cleanup of all resources inside it).
  2. Repair failed sessions and remove orphaned resources left behind by partially completed provisioning.
  3. Start eligible sessions when their reservation window arrives — provision, configure, and hand the workspace to the user.
  4. Maintain the database by expiring old tokens and enforcing audit log retention.

Starting a session is a multi-step provisioning sequence, and every step is idempotent, meaning it is designed to be safely re-run if interrupted midway:

Controller in depth. Image created by author.

The reconciler is the only component that talks to the Kubernetes API.

Garbage collection is also baked into the same loop. At a slower cadence (~5 minutes), the reconciler sweeps for cross-namespace orphans such as stale RBAC bindings, leftover OpenShift security context constraint entries, namespaces stuck in Terminating, or namespaces that exist in the cluster but have no matching database record.

The design assumption throughout is that failure is normal. For example, we had a power supply failure on the node that took the cluster down mid-session, and when it came back, the reconciler resumed its loop, detected the state discrepancies, and self-healed without manual intervention.

The Runtime Plane

When a reservation window starts, the user opens a browser and lands in a full VS Code workspace (code-server) pre-loaded with the entire AI/ML stack and kubectl access scoped to their session namespace.

Workspace screenshot. Image taken by author.

Popular inference engines such as vLLM, Ollama, TGI, and Triton are already cached on the node, so deploying a model server is a one-liner that starts in seconds. There’s 600GB of persistent NVMe-backed storage allocated to the session, including a 20GB home directory for notebooks and scripts and a 300GB model cache.
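In practice that one-liner amounts to an oc apply against something like this minimal Pod spec. It is a sketch: the model name, memory size, and GPU request are placeholders, and the image is assumed to be the vLLM build already cached on the node:

apiVersion: v1
kind: Pod
metadata:
  name: vllm-server
spec:
  containers:
    - name: vllm
      image: docker.io/vllm/vllm-openai:latest     # pre-staged by the cache daemon
      args: ["--model", "Qwen/Qwen2.5-7B-Instruct", "--max-model-len", "8192"]
      ports:
        - containerPort: 8000                      # OpenAI-compatible API endpoint
      resources:
        limits:
          nvidia.com/gpu: "1"                      # one time-sliced slot
          memory: 24Gi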

Each session is a fully isolated Kubernetes namespace: its own blast-radius boundary with dedicated resources and zero visibility into any other tenant’s environment. The reconciler provisions namespace-scoped RBAC granting full admin powers within that boundary, enabling users to create and delete pods, deployments, services, routes, secrets — whatever the workload requires. But there’s no cluster-level access. Users can read their own ResourceQuota to see their remaining budget, but they can’t modify it.

ResourceQuota enforces a hard ceiling on everything. A runaway training job can’t OOM the node. A rogue container can’t fill the NVMe. LimitRange injects sane defaults into every container automatically, so users can kubectl run without specifying resource requests. A proxy ConfigMap is injected into the namespace so user-deployed containers get corporate network egress without manual configuration.
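For reference, this is roughly the shape of the injected LimitRange; the default values here are illustrative, not the platform's actual numbers:

apiVersion: v1
kind: LimitRange
metadata:
  name: session-defaults
spec:
  limits:
    - type: Container
      default:              # applied as limits when a container specifies none
        cpu: "2"
        memory: 8Gi
      defaultRequest:       # applied as requests when a container specifies none
        cpu: 500m
        memory: 1Gi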

Users deploy what they want (inference servers, databases, custom services), and the platform handles the guardrails.

When the reservation window ends, the reconciler deletes the namespace and everything inside it.

GPU Scheduling

Node multi-tenancy diagram. Image created by author.

Now the fun part — GPU scheduling and actually running hardware-accelerated workloads in a multi-tenant environment.

MIG & Time-slicing

We covered the MIG configuration in the initial setup, but it’s worth revisiting from a scheduling perspective. GPU 0 is partitioned into four 1g.24gb MIG slices — each with ~24GB of dedicated memory, enough for most 7B–14B parameter models. GPU 1 stays whole for workloads that need the full ~96GB VRAM for model training, full-precision inference on 70B+ models, or anything that simply doesn’t fit in a slice.

The reservation system tracks these as distinct resource types. Users book either nvidia.com/gpu (whole) or nvidia.com/mig-1g.24gb (up to four slices). The ResourceQuota for each session hard denies the opposite type. If you reserved a MIG slice, you physically cannot request a whole GPU, even if one is sitting idle. In a mixed MIG environment, letting a session accidentally consume the wrong resource type would break the capacity math for every other reservation on the calendar.
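In ResourceQuota terms, pinning a single-slice reservation looks roughly like this; the quantities are illustrative and already account for the time-slicing replicas described next:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: session-gpu-quota
spec:
  hard:
    requests.nvidia.com/mig-1g.24gb: "2"   # the reserved slice, exposed as two time-sliced slots
    requests.nvidia.com/gpu: "0"           # whole-GPU requests are denied outright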

In our time-slicing configuration, one whole GPU appears as four schedulable resources, and each MIG slice appears as two.
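That multiplication shows up directly in the node's advertised capacity; trimmed down and purely illustrative, the allocatable section reads something like:

allocatable:
  nvidia.com/gpu: "4"             # GPU 1 (whole) x 4 time-slicing replicas
  nvidia.com/mig-1g.24gb: "8"     # 4 slices on GPU 0 x 2 replicas each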

What that means is a user reserves one physical GPU and can run up to four concurrent GPU-accelerated containers within their session — a vLLM instance serving gpt-oss, an Ollama instance with Mistral, a TGI server running a reranker, and a custom service orchestrating across all three.

Two Allocation Modes

At reservation time, users choose how their GPU budget is initially distributed between the workspace and user-deployed containers.

Interactive ML — The workspace pod gets a GPU (or MIG slice) attached directly. The user opens Jupyter, imports PyTorch, and has immediate CUDA access for training, fine-tuning, or debugging. Additional GPU pods can still be spawned via time-slicing, but the workspace is consuming one of the virtual slots.

Inference Containers — The workspace is lightweight with no GPU attached. All time-sliced capacity is available for user-deployed containers. With a whole-GPU reservation, that’s four full slots for inference workloads.

There is a real throughput trade-off with time-slicing: workloads share VRAM and compute bandwidth. For development, testing, and validating multi-service architectures (exactly what this platform is for), it’s the right trade-off. For production latency-sensitive inference where every millisecond of p99 matters, you’d use dedicated slices 1:1 or whole GPUs.

GPU “Tokenomics”

One of the first questions in the introduction was: How much will it cost — not just today, but at scale? To answer that, you have to start with what the workload actually looks like in production.

What Real Deployments Look Like

When I work with customers on their inference architecture, nobody is running a single model behind a single endpoint. The pattern that keeps emerging is a fleet of models sized to the task. You have a 7B-parameter model handling simple classification and extraction that runs comfortably on a MIG slice. A 14B model doing summarization and general-purpose chat. A 70B model for complex reasoning and multi-step tasks, and maybe a 400B model for the hardest problems where quality is non-negotiable. Requests get routed to the appropriate model based on complexity, latency requirements, or cost constraints. You’re not paying 70B-class compute for a task a 7B can handle.

In multi-agent systems, this gets more interesting. Agents subscribe to a message bus and sit idle until called upon — a pub-sub pattern where context is shared to the agent at invocation time and the pod is already warm. There’s no cold start penalty because the model is loaded and the container is running. An orchestrator agent evaluates the inbound request, routes it to a specialist agent (retrieval, code generation, summarization, validation), collects the results, and synthesizes a response. Four or five models collaborating on a single user request, each running in its own container within the same namespace, communicating over the internal Kubernetes network.

Network policies add another dimension. Not every agent should have access to every tool. Your retrieval agent can talk to the vector database. Your code-execution agent can reach a sandboxed runtime. But the summarization agent has no business touching either; it receives context from the orchestrator and returns text. Network policies enforce these boundaries at the cluster level, so tool access is controlled by infrastructure, not application logic.
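Here is a sketch of that boundary for the retrieval agent, assuming hypothetical app labels and a Qdrant-style vector database listening on 6333: egress is allowed only to the vector database, plus DNS.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: retrieval-agent-egress
spec:
  podSelector:
    matchLabels:
      app: retrieval-agent          # assumed label on the agent pod
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: vector-db        # assumed label on the vector database
      ports:
        - protocol: TCP
          port: 6333
    - ports:                        # DNS so the agent can resolve in-cluster services
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53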

This is the workload profile the platform was designed for. MIG slicing lets you right-size GPU allocation per model; a 7B doesn’t need 96GB of VRAM. Time-slicing lets multiple agents share the same physical device. Namespace isolation keeps tenants separated while agents within a session communicate freely. The architecture directly supports these patterns.

Quantifying It

To move from architecture to business case, I developed a tokenomics framework that reduces infrastructure cost to a single comparable unit: cost per million tokens. Each token carries its amortized share of hardware capital (including workload mix and redundancy), maintenance, power, and cooling. The numerator is your total annual cost. The denominator is how many tokens you actually process, which is entirely a function of utilization.
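Written out (the symbols are my shorthand, not a standard notation), the framework reduces to:

\text{cost per 1M tokens} = \frac{C_{\text{annual}}}{T_{\text{annual}}} \times 10^{6},
\qquad
T_{\text{annual}} = R \times U \times 31{,}536{,}000

where C_annual is the total annual cost (amortized hardware, maintenance, power, cooling), R is sustained throughput in tokens per second at full load, U is average utilization, and 31,536,000 is the number of seconds in a year. Because the fixed costs sit entirely in the numerator, doubling U halves the unit cost under this simplified model, which is the intuition behind the comparison below.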

Utilization is the most powerful lever on per-token cost. It doesn’t reduce what you spend; the hardware and power bills are fixed. What it does is spread those fixed costs across more processed tokens. A platform running at 80% utilization produces tokens at nearly half the unit cost of one running at 40%. Same infrastructure, dramatically different economics. This is why the reservation system, MIG partitioning, and time-slicing matter beyond UX — they exist to keep expensive GPUs processing tokens during as many available hours as possible.

Because the framework is algebraic, you can also solve in the other direction. Given a known token demand and a budget, solve for the infrastructure required and immediately see whether you’re over-provisioned (burning money on idle GPUs), under-provisioned (queuing requests and degrading latency), or right-sized.

For the cloud comparison, providers have already baked their utilization, redundancy, and overhead into per-token API pricing. The question becomes: at what utilization does your on-prem unit cost drop below that rate? For consistent enterprise GPU demand (the kind of steady-state inference traffic these multi-agent architectures generate), on-prem wins.

However, for testing, demos, and POCs, cloud is cheaper.

Engineering teams often need to justify spend to finance with clear, defensible numbers. The tokenomics framework bridges that gap.

Conclusion

At the beginning of this post I listed the questions I hear from customers constantly — AI strategy, use-cases, cloud vs. on-prem, cost, security. They all eventually require the same thing: a platform layer that can schedule GPU resources, isolate tenants, and give teams a self-service path from experiment to production without waiting on infrastructure.

That’s what this post walked through. Not a product and not a managed service, but an architecture built on Kubernetes, PostgreSQL, Python, and the NVIDIA GPU Operator — running on a single Cisco UCS C845A with two NVIDIA RTX PRO 6000 Blackwell GPUs in our lab. It’s a practical starting point that addresses scheduling, multi-tenancy, cost modeling, and the day-2 operational realities of keeping GPU infrastructure reliable.

Scale this to multiple Cisco AI Pods and the scheduling plane, reconciler pattern, and isolation model carry over directly. The foundation is the same.

If you’re working through these same decisions (how to schedule GPUs, how to isolate tenants, how to build the business case for on-prem AI infrastructure), I’d welcome the conversation.
