Today, Amazon Bedrock introduces new service tiers that give you more control over your AI workload costs while maintaining the performance levels your applications need.
Working with customers building AI applications, I’ve seen firsthand how different workloads require different performance and cost trade-offs. Many organizations running AI workloads face challenges balancing performance requirements with cost optimization. Some applications need rapid response times for real-time interactions, whereas others can process data more gradually. With these challenges in mind, today we’re announcing additional pricing options that give you more flexibility in matching your workload requirements with cost optimization.
Amazon Bedrock now offers three service tiers for your workloads: Priority, Standard, and Flex. Each tier is designed to match specific response time requirements. Some applications, such as financial trading systems, demand the fastest response times; others need rapid responses to support business processes like content generation; and applications such as content summarization can process data more gradually.
The Priority tier processes your requests ahead of other tiers, providing preferential compute allocation for mission-critical applications like customer-facing chat-based assistants and real-time language translation services, though at a premium price point. The Standard tier provides consistent performance at regular rates for everyday AI tasks, ideal for content generation, text analysis, and routine document processing. For workloads that can handle longer latency, the Flex tier offers a more cost-effective option with lower pricing, which is well suited for model evaluations, content summarization, and multistep analysis and agentic workflows.
You can now optimize your spending by matching each workload to the most appropriate tier. For example, if you’re running a customer service chat-based assistant that needs quick responses, you can use the Priority tier to get the fastest processing times. For content summarization tasks that can tolerate longer processing times, you can use the Flex tier to reduce costs while maintaining reliable performance. For most models that support the Priority tier, you can realize up to 25% better output tokens per second (OTPS) compared to the Standard tier.
Check the Amazon Bedrock documentation for an up-to-date list of models supported for each service tier.
Choosing the right tier for your workload
Here is a mental model to help you choose the right tier for your workload.
| Category | Recommended service tier | Description |
|---|---|---|
| Mission-critical | Priority | Requests are handled ahead of other tiers. Lower latency responses for user-facing apps (for example, customer service chat assistants, real-time language translation, interactive AI assistants) |
| Business-standard | Standard | Responsive performance for important workloads (for example, content generation, text analysis, routine document processing) |
| Business-noncritical | Flex | Cost-efficient for less urgent workloads (for example, model evaluations, content summarization, multistep agentic workflows) |
Start by reviewing your current usage patterns with application owners. Next, identify which workloads need immediate responses and which ones can process data more gradually. You can then begin routing a small portion of your traffic through different tiers to test performance and cost benefits, as in the sketch below.
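As a starting point, here is a minimal sketch of how you might encode that mental model in a thin routing layer. The `WORKLOAD_TIERS` mapping and `choose_tier` helper are hypothetical names for illustration; the tier values match the `service_tier` parameter shown later in this post.

```python
import random

# Hypothetical mapping from workload categories to service tiers,
# following the mental model in the table above.
WORKLOAD_TIERS = {
    "customer_chat": "priority",      # mission-critical, user-facing
    "content_generation": "default",  # business-standard
    "summarization": "flex",          # business-noncritical
}

def choose_tier(workload: str, canary_fraction: float = 0.0) -> str:
    """Return the service tier for a workload, optionally routing a
    small slice of standard traffic through Flex as an experiment."""
    tier = WORKLOAD_TIERS.get(workload, "default")
    if tier == "default" and random.random() < canary_fraction:
        return "flex"
    return tier

print(choose_tier("summarization"))             # flex
print(choose_tier("content_generation", 0.05))  # default, ~5% flex
```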
The AWS Pricing Calculator helps you estimate costs for the different service tiers: enter your expected workload for each tier to build a budget based on your specific usage patterns.
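If you prefer to sanity-check the numbers in code, a back-of-envelope estimate can look like the following. The per-1,000-token prices below are placeholders for illustration only, not actual Amazon Bedrock rates; check the Amazon Bedrock pricing page for the real prices for your model and Region.

```python
# Hypothetical per-1,000-token prices (USD) for illustration only;
# look up actual rates on the Amazon Bedrock pricing page.
PRICE_PER_1K_TOKENS = {
    "priority": {"input": 0.0040, "output": 0.0200},
    "default":  {"input": 0.0030, "output": 0.0150},
    "flex":     {"input": 0.0020, "output": 0.0100},
}

def monthly_cost(tier: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly spend for a workload on a given tier."""
    price = PRICE_PER_1K_TOKENS[tier]
    per_request = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
    return requests * per_request

# Example: one million summarization requests per month,
# 2,000 input tokens and 500 output tokens each.
for tier in ("priority", "default", "flex"):
    print(f"{tier:>8}: ${monthly_cost(tier, 1_000_000, 2_000, 500):,.0f}/month")
```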
To monitor your usage and costs, you can use the AWS Service Quotas console or turn on model invocation logging in Amazon Bedrock and observe the metrics with Amazon CloudWatch. These tools provide visibility into your token usage and help you track performance across different tiers.
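For example, here is a minimal boto3 sketch that sums token usage for one model over the last 24 hours. It assumes the AWS/Bedrock CloudWatch namespace and its InputTokenCount and OutputTokenCount metrics; adjust the model ID and Region to your setup.

```python
from datetime import datetime, timedelta, timezone

import boto3

# Sum per-model token usage over the last 24 hours in hourly buckets.
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")
end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

for metric in ("InputTokenCount", "OutputTokenCount"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName=metric,
        Dimensions=[{"Name": "ModelId", "Value": "openai.gpt-oss-20b-1:0"}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    total = sum(point["Sum"] for point in stats["Datapoints"])
    print(f"{metric}: {total:,.0f} tokens in the last 24 hours")
```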

You can start using the new service tiers today. You choose the tier on a per-API call basis. Here is an example using the OpenAI Chat Completions API, but you can pass the same service_tier parameter in the body of the InvokeModel, InvokeModelWithResponseStream, Converse, and ConverseStream APIs (for supported models):
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://bedrock-runtime.us-west-2.amazonaws.com/openai/v1",
    api_key=os.environ["AWS_BEARER_TOKEN_BEDROCK"],  # Amazon Bedrock API key
)

completion = client.chat.completions.create(
    model="openai.gpt-oss-20b-1:0",
    messages=[
        {
            "role": "developer",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": "Hello!",
        },
    ],
    service_tier="priority",  # options: "priority" | "default" | "flex"
)
print(completion.choices[0].message)
```
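And here is a sketch of the same choice with the native InvokeModel API through boto3. The request body schema is model-specific, so treat the message format below as an assumption based on the OpenAI-style example above; the service_tier field goes in the request body as described earlier.

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

# The body schema is model-specific; this sketch assumes the OpenAI-style
# messages format for openai.gpt-oss-20b-1:0, with service_tier in the body.
response = bedrock_runtime.invoke_model(
    modelId="openai.gpt-oss-20b-1:0",
    body=json.dumps({
        "messages": [
            {"role": "user", "content": "Summarize the plot of Hamlet."},
        ],
        "service_tier": "flex",  # summarization tolerates longer latency
    }),
)
print(json.loads(response["body"].read()))
```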
To learn more, check out the Amazon Bedrock User Guide or contact your AWS account team for detailed planning assistance.
I’m looking forward to hearing how you use these new pricing options to optimize your AI workloads. Share your experience with me online on social networks or connect with me at AWS events.
— seb
