Why did OpenAI deploy Codex on Cerebras chips?

Question

Hans Steiner · Accepted Answer

The technical shift

OpenAI moved a production coding model onto hardware from Cerebras, a vendor known for very large, wafer‑scale accelerators. The company launched a stripped-down version of its Codex-style coding model that runs on Cerebras’ CS3-style accelerators to cut inference latency and speed up interactive code generation. Early reports put the new deployment’s throughput and responsiveness far ahead of previous setups, with claims of order-of-magnitude gains in coding throughput.

Why they did it

There are three core drivers behind the change:

Performance: Developers expect near-instant responses when using AI tools to write or iterate on code. Faster hardware reduces wait times and improves interactivity.
Cost and efficiency: Alternative accelerators can shift the economics of inference, especially for workloads where latency matters more than raw model scale.
Supplier diversification: Relying exclusively on one major GPU vendor concentrates risk; adopting non‑Nvidia silicon spreads procurement and technical risk.

Consequences for the AI stack

The move signals a maturing inference market where large model makers match models to specialized silicon to optimize user experience and cost. Short-term impacts include tighter competition among accelerator vendors and pressure on incumbents to respond with improved latency and pricing. For developers, the change promises faster coding assistance and a more fluid edit–run cycle. Longer term, expect more experimentation with heterogeneous hardware—different chips for different workloads—alongside trade-offs: some designs may favor raw throughput while others trade off model size or certain fidelity constraints to hit latency targets.

OpenAI’s deployment is notable not just for the speed claims but because it shows major model providers are now comfortable shipping production services on non‑Nvidia hardware at scale.