AI · Cloud Infrastructure · March 13, 2026
§ AI · AWS × Cerebras · Disaggregated Inference

AWS Just Put the Largest Chip Ever Made Inside Bedrock. Cerebras Will Run the Decode. Trainium Will Run the Prefill.

On March 13, 2026, AWS and Cerebras Systems announced a cross-vendor disaggregated inference architecture inside Amazon Bedrock — the first hyperscale deployment to put two competing custom AI chips on the same model and route the workload to whichever one is faster at each step. AWS Trainium handles prefill (compute-bound prompt processing). The Cerebras CS-3 — the system built around the wafer-scale WSE chip, the largest single piece of silicon ever manufactured — handles decode (memory-bandwidth-bound output generation). The two are stitched together inside AWS data centers via Elastic Fabric Adapter networking on the Nitro System. AWS’s claim: an “order of magnitude faster” than what Bedrock customers can get today. Service rolls out in the second half of 2026.

3,000
Output tokens per second
Cerebras inference ceiling on supported open-source LLMs
25×
Faster than leading GPU at the decode stage
Cerebras CS-3 vs Nvidia · per Cerebras benchmarks
5×
More high-speed token capacity in the same hardware footprint
AWS disaggregated architecture vs all-GPU
H2 2026
Service launch window
Amazon Bedrock + AWS Marketplace · open LLMs + Amazon Nova
Editorial illustration: a 12-inch silicon wafer (the Cerebras Wafer-Scale Engine) glowing edge-on against a dark data-center backdrop, cyan light streams flowing out into rows of AWS server racks.
Editorial illustration · The Cerebras WSE arrives in AWS data centers — Civic Intelligence
§ 01 / The Architecture — Disaggregated Inference

Two chips. Two different bottlenecks. One token stream.

Inference on a transformer model is two different workloads pretending to be one. First the model has to prefill — read the entire prompt, compute the key/value cache, and warm up the attention state. That phase is highly parallel and compute-bound: lots of matrix multiplications, modest memory traffic per FLOP. Then the model has to decode — emit output tokens one at a time, each one re-reading the entire key/value cache. That phase is serial and memory-bandwidth-bound: modest compute per token, enormous memory traffic. On a single GPU you pay for the worse of the two.
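To make the asymmetry concrete, here is a toy single-head attention pass in plain NumPy. It is a sketch of the two phases under illustrative shapes and names, not anyone's production kernel: prefill is one large batched matmul, while every decode step re-reads a cache that only grows.

```python
# Toy single-head attention illustrating the two inference phases.
# Shapes and names are illustrative, not any vendor's implementation.
import numpy as np

D = 128                                       # head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))

def prefill(prompt_emb):
    """Compute-bound: one batched matmul over the whole prompt
    builds the KV cache in a single parallel pass."""
    return prompt_emb @ Wk, prompt_emb @ Wv   # (prompt_len, D) each

def decode_step(x, K, V):
    """Memory-bandwidth-bound: each new token re-reads the entire
    KV cache for a small amount of arithmetic."""
    q = x @ Wq
    scores = K @ q / np.sqrt(D)               # touches every cached row
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V                          # attention output, (D,)

prompt = rng.standard_normal((1024, D))
K, V = prefill(prompt)                        # phase 1: parallel prefill
x = rng.standard_normal(D)
for _ in range(16):                           # phase 2: serial decode
    x = decode_step(x, K, V)
    K = np.vstack([K, x @ Wk])                # cache grows every step
    V = np.vstack([V, x @ Wv])
```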

AWS × Cerebras disaggregated inference — the workload split
Prefill · AWS Trainium
Reads the prompt. Builds the KV cache. Highly parallel matrix-multiply work — Trainium's dense compute cores are optimized for exactly this. AWS's purpose-built training chip earns its keep on the fastest part of inference.
Handoff · Elastic Fabric Adapter (EFA)
AWS's high-speed RDMA fabric on the Nitro System. The KV cache is shipped from Trainium to the Cerebras CS-3 over EFA at low latency — fast enough that the cross-chip handoff doesn't dominate the user-perceived response time.
Decode · Cerebras CS-3 / WSE
Emits tokens. Memory-bandwidth-bound — and the WSE is the largest chip ever made, with all model weights stored on-chip in SRAM. Cerebras claims thousands of times more memory bandwidth than the fastest GPU, and tokens-per-second numbers (969 on Llama 3.1 405B; 2,600 on Llama 4 Scout vs 137 on leading GPU) that the GPU economy can't currently match.
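The control flow is simple to state even if the engineering is not. Below is a minimal sketch of the three-step pipeline; the worker classes and the toy serialization are hypothetical stand-ins, since AWS has not published the actual Trainium-to-CS-3 orchestration layer.

```python
# Sketch of the disaggregated serving loop. PrefillWorker, DecodeWorker,
# and KVCache are hypothetical stand-ins for plumbing AWS has not published.
from dataclasses import dataclass

@dataclass
class KVCache:
    keys: bytes                          # serialized key tensor (toy)
    values: bytes                        # serialized value tensor (toy)

class PrefillWorker:                     # stands in for a Trainium pool
    def run(self, prompt: str) -> KVCache:
        blob = prompt.encode()           # toy: real KV tensors go here
        return KVCache(keys=blob, values=blob)

class DecodeWorker:                      # stands in for a Cerebras CS-3
    def stream(self, cache: KVCache):
        for tok in cache.keys.split():   # toy: one "token" per word
            yield tok.decode()

def serve(prompt: str, prefill: PrefillWorker, decode: DecodeWorker):
    cache = prefill.run(prompt)          # 1. compute-bound prefill
    # 2. here the KV cache crosses the EFA fabric; the transfer must
    #    stay small next to end-to-end latency for the split to pay off
    yield from decode.stream(cache)      # 3. memory-bound decode

for token in serve("two chips one token stream", PrefillWorker(), DecodeWorker()):
    print(token)
```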
Why It's Novel
AWS and Cerebras are not the first to disaggregate prefill from decode — research groups and individual frontier labs have been doing it inside their own data centers for two years. They are the first to do it across vendors at hyperscale, inside a public-cloud product surface. A Bedrock customer hitting an inference endpoint in H2 2026 will not know — and arguably shouldn’t need to know — that prefill runs on Amazon’s silicon and decode runs on Cerebras’s. They will know the latency dropped by an order of magnitude and the cost per output token came down with it.
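If the routing really is invisible, the customer-facing call should look exactly like Bedrock does today. A minimal sketch using boto3's existing Bedrock runtime Converse API; the model ID and region are illustrative placeholders, and nothing in the request names, or could name, the silicon underneath.

```python
# Standard Bedrock runtime call; a disaggregated backend, if AWS makes
# it automatic, would be invisible at this layer. Model ID and region
# are illustrative placeholders, not confirmed launch models/regions.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="meta.llama3-1-405b-instruct-v1:0",   # illustrative model ID
    messages=[{"role": "user", "content": [{"text": "Summarize EFA."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```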
§ 02 / What the Executives Actually Said

“Each system does what it’s best at.”

Each system does what it's best at. The result will be inference that's an order of magnitude faster and higher performance than today.

David Brown · VP, Compute & Machine Learning Services, AWS · March 13, 2026

Every enterprise around the world will be able to benefit from blisteringly fast inference within their AWS environment.

Andrew Feldman · Founder & CEO, Cerebras Systems · March 13, 2026

The unsaid context behind both quotes: AWS already has the world’s largest enterprise inference book of business via Bedrock, but its in-house silicon (Inferentia, Trainium) has been outclassed at the decode stage by both Nvidia and Cerebras’s wafer-scale engine. Anthropic — Amazon’s primary Trainium training partner — and OpenAI — which has committed to two gigawatts of Trainium capacity — give AWS the prefill workload at hyperscale already. Adding Cerebras for decode is how AWS closes the speed gap to the GPU economy without acknowledging it had one.

§ 03 / The Chip — Why a Whole Wafer

One chip, twelve inches across. All the model weights, on-chip.

The Cerebras Wafer-Scale Engine is the largest single chip ever manufactured. Where TSMC normally cuts ~80 separate die out of a 12-inch wafer, Cerebras keeps the wafer whole and treats it as one chip. The result is enough on-chip SRAM to store the weights of a frontier model without leaving the die. That eliminates the off-chip-memory round trip that gates GPU decode speed. Cerebras’s public benchmarks — Llama 3.1 405B at 969 output tokens/second, Llama 4 Scout at 2,600 tokens/second versus 137 on the leading GPU — are direct consequences of that architectural choice.
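The arithmetic behind those numbers is not mysterious: at batch size one, decode speed is roughly memory bandwidth divided by bytes moved per token. A back-of-envelope sketch follows, with the caveats that the weight size assumes 8-bit precision, the bandwidth inputs are published headline specs rather than delivered numbers, and a 405B model in practice spans multiple wafers.

```python
# Back-of-envelope: batch-1 decode tokens/sec ~ bandwidth / bytes per token.
# Ignores KV-cache traffic, batching, and speculative decoding; inputs are
# illustrative headline figures, not measured or guaranteed numbers.
def decode_tokens_per_sec(weight_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    return bandwidth_bytes_per_sec / weight_bytes   # every token streams all weights

weights_405b_8bit = 405e9    # ~405 GB at one byte per parameter (assumption)
hbm_gpu = 3.35e12            # ~3.35 TB/s, HBM3-class GPU headline spec
wafer_sram = 21e15           # 21 PB/s, Cerebras's published WSE-3 figure

print(f"GPU-class ceiling:   {decode_tokens_per_sec(weights_405b_8bit, hbm_gpu):,.0f} tok/s")
print(f"Wafer-scale ceiling: {decode_tokens_per_sec(weights_405b_8bit, wafer_sram):,.0f} tok/s")
```

The point is the ratio, not the absolute values: both outputs are ceilings, Cerebras's measured 969 tokens/second on Llama 3.1 405B sits well below its ceiling, and the single-digit GPU-class figure is why GPU serving leans so heavily on batching.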

What changes inside Bedrock is access. Cerebras already powers fast inference for OpenAI, Cognition, Mistral, and Meta on workloads where token throughput is the bottleneck — particularly agentic coding, where a single user query can trigger 15× the token volume of a normal chat exchange because the model is autonomously writing, reviewing, and revising code in a loop. Until March 2026 you had to call Cerebras directly to get those speeds. After H2 2026 you call Bedrock.

§ 04 / What It Means for the Cloud Inference Market
The Strategic Frame
Inference is now where the AI margin lives. Training spend is front-loaded and gets amortized over a model’s lifetime; inference spend recurs per query, scales with usage, and dwarfs training cost over any frontier model’s deployment. The major hyperscalers — AWS, Microsoft Azure, Google Cloud — are now competing primarily on tokens per second per dollar, not on training throughput. AWS’s move with Cerebras is the cleanest public statement yet that the company has decided it cannot close the decode-speed gap on its own silicon and will not wait. Azure has a deep Nvidia partnership and exclusive first-look on certain GPU generations. Google has TPUs and Gemini. AWS now has Trainium plus Cerebras’s wafer-scale weapon, with the AWS sales motion in front of both. That is a meaningful shift in the cloud-inference balance of power, and it landed in March with surprisingly little market drama.
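The metric itself reduces to one line of arithmetic. A sketch with a deliberately invented hourly price, since per-token pricing for the disaggregated path is precisely what has not been disclosed (see § 05); the throughput figures are the Llama 4 Scout benchmarks cited above.

```python
# Cost per million output tokens from hourly price and sustained throughput.
# The $100/hour figure is an invented placeholder, not a quoted price;
# the throughputs are the cited Llama 4 Scout benchmark numbers.
def usd_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    return usd_per_hour / (tokens_per_sec * 3600) * 1e6

for label, tps in [("GPU-class decode (137 tok/s)", 137),
                   ("Wafer-scale decode (2,600 tok/s)", 2600)]:
    print(f"{label}: ${usd_per_million_tokens(100.0, tps):,.2f} per M output tokens")
```

At any shared hourly price, the roughly 19× throughput gap is the entire cost story — which is why the pricing question in § 05 matters more than the benchmark question.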
§ 05 / What We Know · What We Don't
Confirmed
  • Date of announcement: March 13, 2026. Joint press release from AWS and Cerebras Systems.
  • Service surface: Amazon Bedrock + AWS Marketplace. Models: open-source LLMs + Amazon Nova.
  • Architecture: disaggregated inference. AWS Trainium (prefill) + Cerebras CS-3/WSE (decode), connected via Elastic Fabric Adapter on the AWS Nitro System.
  • Performance claim: 'order of magnitude faster' than current Bedrock inference (David Brown, AWS); up to 3,000 output tokens/sec; 25× faster than leading GPU at decode (Cerebras benchmarks).
  • Capacity claim: 5× more high-speed token capacity in the same hardware footprint vs all-GPU.
  • Existing Cerebras customers: OpenAI, Cognition, Mistral, Meta — primarily for agentic coding workloads where token throughput is the bottleneck.
  • Existing Trainium customers: Anthropic (primary AWS training partner); OpenAI (committed to 2 gigawatts of Trainium capacity).
  • Launch window: H2 2026 (per AWS), though the March 13 announcement also said 'in the next couple of months'; the two stated timelines have not been reconciled.
Still Unknown
  • Pricing per output token. Neither AWS nor Cerebras has disclosed how the disaggregated workload will be priced relative to all-Trainium or third-party GPU inference inside Bedrock.
  • Specific Bedrock regions where the service will launch. AWS has said only 'global data center footprint' over time.
  • Which open-source LLMs will be available at GA. Llama 3.1 405B and Llama 4 Scout are confirmed Cerebras-supported but not confirmed for Bedrock day one.
  • How the latency arbitrage interacts with Bedrock's existing per-model SLAs and quotas.
  • Whether Bedrock customers will be able to opt into the Cerebras path explicitly, or whether the disaggregation is invisible (AWS routes the workload at request time).
  • Whether other custom-silicon inference vendors (Groq, SambaNova, Tenstorrent) get equivalent Bedrock integration, or whether Cerebras has an exclusivity window.
§ 06 / The Bottom Line
Why This Matters
Nearly two months after the announcement, the market is still under-pricing how much this changes. AWS just used Cerebras to turn its biggest weakness — slower decode than the GPU economy — into a marketing claim that decode is now an order of magnitude faster than the GPU economy. Cerebras, in turn, just got the AWS sales force in front of every Bedrock customer without giving up its own platform. The cloud-inference market in 2027 is going to look meaningfully different from the Nvidia-CUDA-everywhere world it’s in today, and March 13, 2026 is the date historians will mark as the start.
§ 07 / Sources
Last updated: May 4, 2026 · 11:00 PM ET