Key facts
  • Marketplace launched June 2026
  • Trainium3 UltraServer (144 chips): up to 362 MXFP8 PFLOPs, 20.7 TB HBM3e
  • Inferentia2: 40% lower inference cost vs GPUs
  • Token fees: $0.20 per million tokens (training) and $0.04 per million tokens (inference) on Trainium3; $0.03 per million tokens (inference) on Inferentia2
  • Supports Neuron SDK, EKS, ECS, and Slurm

Amazon AI Chip Marketplace: How to Integrate AWS-Designed Accelerators into Your Own Data Center Deployments

Amazon announced a new AI chip marketplace in June 2026 that lets companies buy its custom Trainium and Inferentia accelerators for on-premises use. The move follows Andy Jassy’s shareholder letter, which said the chips could generate a $50 billion run-rate if sold outside AWS. This article shows how to buy the chips, connect them to existing infrastructure, and decide which accelerator fits your workload.

Why Amazon Is Opening Its Chip Business

Amazon’s AI chief Peter DeSantis told Bloomberg that demand for Trainium and Inferentia has outpaced AWS capacity. By selling racks to third parties, Amazon can monetize excess silicon while keeping its cloud services revenue stream. The marketplace also gives enterprises a way to avoid Nvidia-centric lock-in and to use the same silicon that powers AWS SageMaker HyperPod.

In practice, the marketplace works like a traditional hardware procurement channel. You place an order through the AWS console, receive a shipment of rack-mount servers pre-populated with the chosen accelerator, and then install the Neuron SDK on your management nodes. Amazon handles firmware updates and provides a warranty that matches its cloud-grade SLAs.

Early adopters such as Uber and Anthropic have already deployed Trainium3 in their private clusters. According to a June 2026 interview with Uber’s head of infrastructure, the chips cut training costs for large language models by roughly 30% compared with Nvidia H100 GPUs.

What’s Inside the Marketplace

The marketplace lists three product families:

  • Trainium3 UltraServer – purpose-built for high-throughput training. Supports up to 144 chips per rack, for up to 362 MXFP8 PFLOPs in aggregate.
  • Inferentia2 – optimized for inference at scale. Claims 40% lower cost per token than comparable GPUs.
  • Neuron-Ready Server Kit – a general-purpose Graviton 3 (Arm-based) server with a PCIe slot for a single accelerator, aimed at labs and edge sites.

All three use the Neuron SDK, which compiles TensorFlow, PyTorch, and Hugging Face models directly to the accelerator ISA. The SDK also provides a monitoring daemon (Neuron Monitor) that reports utilization, temperature, and error rates as CloudWatch metrics.

Pricing is listed in the console. For example, a Trainium3 UltraServer (144 chips) starts at $1.2 million per rack, plus a usage-based fee of $0.20 per million tokens for training and $0.04 per million tokens for inference. Inferentia2 racks start at $850,000, with an inference fee of $0.03 per million tokens.
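To see how the two pricing components combine, the arithmetic can be sketched in a few lines of Python. The rack price and token fee are the figures quoted above; the function name and the one-trillion-token workload are illustrative assumptions, and the sketch ignores power, cooling, and staffing costs.

```python
def total_cost_usd(rack_price_usd, fee_per_million_tokens_usd, tokens):
    """Rack purchase price plus the usage-based token fee."""
    return rack_price_usd + fee_per_million_tokens_usd * tokens / 1_000_000


# Figures quoted in the article.
TRAINIUM3_RACK_USD = 1_200_000
TRAIN_FEE_USD = 0.20  # per million training tokens

# Hypothetical workload size, chosen only to illustrate the arithmetic.
tokens = 1_000_000_000_000  # 1 trillion training tokens

cost = total_cost_usd(TRAINIUM3_RACK_USD, TRAIN_FEE_USD, tokens)
print(f"Trainium3 rack + 1T training tokens: ${cost:,.0f}")
```

At this volume the token fee is small next to the rack price, so the hardware purchase dominates until token counts grow by another order of magnitude or more.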

Step-by-Step Integration Guide

Below is a practical workflow that most enterprises can follow. The steps assume you already have a Kubernetes or Slurm cluster ready.

1. Sign in to the AWS Management Console → AI Chip Marketplace.
2. Choose the accelerator family (Trainium3 or Inferentia2) and the rack size.
3. Select delivery location – Amazon ships the rack to your data-center address.
4. Receive the rack and connect power, cooling, and network (10 GbE or 100 GbE).
5. Install the Neuron SDK on your control plane:
   curl -s https://aws-neuron-sdk.amazonaws.com/install.sh | bash
6. Register the hardware with the Neuron Manager service:
   neuron-register --rack-id 
7. Update your job scheduler to use the "neuron" resource label.
8. Deploy a test model (e.g., BERT-base) using the provided Docker image:
   docker run -e NEURON_CORE=all amazon/neurons-bert:latest
9. Monitor metrics in CloudWatch → Neuron namespace.
10. Scale out by adding more racks or mixing Trainium and Inferentia as needed.
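
For Kubernetes users, steps 7 and 8 could be combined roughly as follows. This is a minimal sketch, not an Amazon-provided tool: it builds a Pod manifest that pins the test container to nodes carrying a "neuron" label. The label key `accelerator` and the pod name are hypothetical, so substitute whatever your cluster actually uses; the image name is the one from step 8.

```python
import json


def neuron_pod_manifest(name, image, label_key="accelerator", label_value="neuron"):
    """Build a Kubernetes Pod manifest pinned to nodes carrying the given label.

    label_key is a hypothetical node label; use whichever key your cluster
    applies to the accelerator nodes.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Only schedule onto nodes labeled <label_key>=<label_value>.
            "nodeSelector": {label_key: label_value},
            "restartPolicy": "Never",
            "containers": [{"name": name, "image": image}],
        },
    }


manifest = neuron_pod_manifest("bert-smoke-test", "amazon/neurons-bert:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, which accepts JSON as well as YAML manifests.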

Most customers report that the entire process from order to first training job takes 2-3 weeks, far faster than a typical custom-silicon procurement cycle.

Because the chips use the same Neuron runtime as AWS, you can move workloads between on-prem and cloud with minimal code changes. The SDK automatically detects whether it is running on a Trainium-enabled EC2 instance or a Neuron-ready on-prem server.
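
The SDK's own detection logic is not described here. If your application code also has to run on machines with no accelerator at all, such as developer laptops or CI runners, a simple presence check can be sketched as below. It assumes the Neuron driver exposes device nodes named `/dev/neuron0`, `/dev/neuron1`, and so on, as it does on AWS Neuron instances; whether the on-prem servers follow the same convention is an assumption, and the function names are illustrative.

```python
import glob
import os


def find_neuron_devices(dev_dir="/dev"):
    """Return the sorted device nodes matching neuron* under dev_dir."""
    return sorted(glob.glob(os.path.join(dev_dir, "neuron*")))


def pick_backend(dev_dir="/dev"):
    """Choose 'neuron' when device nodes are present, otherwise fall back to 'cpu'."""
    return "neuron" if find_neuron_devices(dev_dir) else "cpu"


print(pick_backend())
```

Taking the device directory as a parameter keeps the check testable on machines without the hardware.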

Performance and Cost Comparison

Feature                    | Trainium3 UltraServer                    | Inferentia2 Rack               | Nvidia H100 (PCIe)
Peak training performance  | 362 MXFP8 PFLOPs (144 chips)             | -                              | 60 MXFP8 PFLOPs (8 GPU)
Peak inference performance | -                                        | 1.8 TFLOPs per chip (96 chips) | 2.5 TFLOPs per GPU (8 GPU)
Memory bandwidth           | 706 TB/s aggregate                       | 320 TB/s aggregate             | 1.2 TB/s per GPU
Power per rack             | 12 kW                                    | 9 kW                           | 15 kW
Token cost (train)         | $0.20 / M tokens                         | -                              | $0.45 / M tokens (estimated)
Token cost (infer)         | $0.04 / M tokens                         | $0.03 / M tokens               | $0.07 / M tokens
Warranty / SLA             | 99.9% uptime, 3-year parts               | 99.9% uptime, 3-year parts     | 2-year parts, 99.5% uptime
Software stack             | Neuron SDK, SageMaker HyperPod, EKS, ECS | Neuron SDK, SageMaker, ECS     | CUDA, cuDNN, TensorRT

The table shows that Trainium3 leads on raw training throughput and token cost, while Inferentia2 offers the best inference price point. Nvidia H100 still holds an advantage in mixed-precision flexibility, but its higher power draw and token cost make it less attractive for pure-training workloads at scale.
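
The relative differences behind that summary can be worked out directly from the table. The sketch below uses only the table's figures; note that the H100 training fee is marked "estimated" there, and the performance-per-kilowatt line assumes the 8-GPU throughput figure and the 15 kW power figure describe the same rack.

```python
def percent_lower(candidate, baseline):
    """How much lower candidate is than baseline, as a percentage."""
    return (baseline - candidate) / baseline * 100


# Token fees in USD per million tokens, taken from the comparison table.
train_saving = percent_lower(0.20, 0.45)  # Trainium3 vs H100, training
infer_saving = percent_lower(0.03, 0.07)  # Inferentia2 vs H100, inference

# Peak training PFLOPs per kW of rack power.
trainium_pflops_per_kw = 362 / 12
h100_pflops_per_kw = 60 / 15

print(f"training token cost:  {train_saving:.1f}% lower on Trainium3")
print(f"inference token cost: {infer_saving:.1f}% lower on Inferentia2")
print(f"PFLOPs per kW: Trainium3 {trainium_pflops_per_kw:.1f}, H100 {h100_pflops_per_kw:.1f}")
```

These are peak and list-price numbers, so they bound the comparison rather than predict what a given model will achieve.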

Network and Fabric Considerations

Amazon ships the racks with its proprietary NeuronSwitch fabric. The switch provides an all-to-all bandwidth of 200 Gbps per chip, double the bandwidth of the previous generation. To integrate the fabric with existing top-of-rack (ToR) switches, you use standard 100 GbE QSFP-DD ports. Amazon recommends a leaf-spine architecture with at least 10 GbE uplinks to avoid bottlenecks during multi-rack training runs.
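
A quick way to reason about uplink sizing is to compare the aggregate in-rack fabric bandwidth with the bandwidth leaving the rack. The sketch below is a rough estimate only: the chip count, per-chip bandwidth, and 100 GbE port speed come from this article, while the number of uplinks per rack is a hypothetical value, since no port count is given.

```python
def oversubscription_ratio(chips, gbps_per_chip, uplinks, gbps_per_uplink):
    """Aggregate in-rack fabric bandwidth divided by total uplink bandwidth."""
    if uplinks <= 0 or gbps_per_uplink <= 0:
        raise ValueError("uplink count and speed must be positive")
    return (chips * gbps_per_chip) / (uplinks * gbps_per_uplink)


CHIPS_PER_RACK = 144   # Trainium3 UltraServer, from the article
FABRIC_GBPS = 200      # per-chip NeuronSwitch bandwidth, from the article
UPLINK_GBPS = 100      # 100 GbE ports toward the ToR switches, from the article
UPLINKS_PER_RACK = 16  # hypothetical; the article does not give a port count

ratio = oversubscription_ratio(CHIPS_PER_RACK, FABRIC_GBPS, UPLINKS_PER_RACK, UPLINK_GBPS)
print(f"in-rack fabric: {CHIPS_PER_RACK * FABRIC_GBPS / 1000:.1f} Tbps")
print(f"oversubscription toward the spine: {ratio:.0f}:1")
```

A high ratio is not a problem while most traffic stays inside the rack; it matters once training jobs span racks and collective operations cross the uplinks.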

In practice, customers have found that a single spine switch can handle up to eight Trainium3 racks before latency spikes appear. Adding a second spine layer restores linear scaling. The Neuron SDK includes a “fabric-aware” scheduler that automatically places tensor shards on the least-loaded link.

If you already run an Ethernet-based fabric, you do not need to replace it. The NeuronSwitch is compatible with standard L2/L3 protocols, and Amazon provides a firmware bundle that translates Neuron-specific flow control into standard Ethernet pause frames.

Security and Compliance

All Amazon-branded servers ship with Nitro security chips that isolate the accelerator firmware from the host OS. The firmware is signed by Amazon’s key hierarchy and verified at boot. For regulated industries, Amazon offers a FedRAMP-High compliant configuration that includes hardware root of trust and immutable boot logs.

Data in transit between the accelerator and the host CPU is encrypted with AES-256, enforced by the Nitro card. This matches the encryption level used in AWS data centers, which makes it easier to meet PCI-DSS and HIPAA requirements for on-prem deployments.

Amazon also provides a compliance report (SOC 2 Type II) for the marketplace hardware, which can be downloaded from the console under “Compliance Documents.”

Who Should Use This?

Enterprises with large-scale training pipelines – Companies that run foundation-model training on-prem can cut token costs by up to 55% versus Nvidia GPUs.

Inference-heavy SaaS providers – Services that need to serve billions of tokens per day benefit from Inferentia’s lower per-token price and deterministic latency.

Regulated sectors (finance, health) – The Nitro-based security model and FedRAMP-High certification let you keep data on-prem while still using the same silicon that powers AWS.

Hybrid-cloud architects – If you already run workloads on AWS, the Neuron SDK lets you move models between cloud and on-prem without rewriting code.

Potential Pitfalls and How to Mitigate Them

One risk is supply-chain lead time. Although Amazon promises a 2-week delivery for standard racks, custom configurations can take up to 6 weeks. To avoid delays, order a baseline rack early and add extra capacity later.

Another concern is software lock-in. The Neuron SDK is tightly coupled to Trainium and Inferentia, so moving to a different accelerator later may require recompilation. Mitigate this by keeping a CI pipeline that builds both Neuron and CUDA binaries for each model.

Finally, power and cooling can be a surprise. A full Trainium3 UltraServer draws 12 kW, so you need adequate rack-level power distribution and hot-aisle containment. Amazon provides a thermal-design guide that helps you size your HVAC system.
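
Before consulting the thermal-design guide, a first-pass estimate of heat load and feed current can be sketched as follows. The 12 kW figure is from this article; the 208 V three-phase feed and unity power factor are assumptions for illustration, so substitute your facility's actual values. The conversion factor (1 kW is about 3,412 BTU/h) and the balanced three-phase current formula are standard.

```python
import math

BTU_PER_HOUR_PER_KW = 3412.14  # 1 kW of electrical load is about 3,412 BTU/h of heat


def cooling_btu_per_hour(load_kw):
    """Heat that must be removed, assuming all electrical power becomes heat."""
    return load_kw * BTU_PER_HOUR_PER_KW


def three_phase_amps(load_kw, line_voltage, power_factor=1.0):
    """Line current for a balanced three-phase load: I = P / (sqrt(3) * V * PF)."""
    return load_kw * 1000 / (math.sqrt(3) * line_voltage * power_factor)


RACK_KW = 12  # full Trainium3 UltraServer, from the article

print(f"heat load: {cooling_btu_per_hour(RACK_KW):,.0f} BTU/h")
# 208 V is an assumed feed voltage, not a figure from the article.
print(f"line current at 208 V three-phase: {three_phase_amps(RACK_KW, 208):.1f} A")
```

Breakers and PDUs are normally sized with headroom above the computed current, so treat the result as a lower bound.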

Future Outlook

Amazon plans to release Trainium4 in early 2027, promising a 1.5× jump in FLOPs and a new 2-nm process node. The marketplace will likely expand to include a “pay-as-you-go” token-only model, letting customers buy compute without owning the hardware. If the $50 billion run-rate forecast holds, the marketplace could become a major competitor to Nvidia’s DGX sales channel.

For now, the marketplace offers a concrete path for enterprises that want the performance of AWS-grade silicon without moving all workloads to the cloud. By following the integration steps above, you can start testing Trainium or Inferentia in your own data center within weeks.

Conclusion

The Amazon AI chip marketplace opens a new avenue for on-prem AI acceleration. With transparent pricing, a unified Neuron software stack, and enterprise-grade security, the offering is ready for production today. Whether you need massive training power, low-cost inference, or a hybrid-cloud bridge, the marketplace gives you a choice that was previously limited to Nvidia’s GPU ecosystem. Start with a pilot rack, measure token cost, and decide if scaling to a full UltraCluster is the right move for your organization.