
The right way to rate-limit AI API calls in your SaaS

Most articles about rate limiting AI API calls focus on request counts. Limit each user to 100 requests per day, done.

This is the wrong abstraction.

Why request limits are wrong

A request to gpt-4.1-nano might cost $0.0001. A request to gpt-5.5 might cost $0.05. Treating these as equivalent — "one request" — means your rate limiter is measuring the wrong thing.

A user who hits your 100-request-per-day limit with gpt-4.1-nano has cost you about $0.01. A user who sends 10 requests to gpt-5.5 has cost you $0.50, which is 50× as much of your AI budget for a tenth of the requests. Request-count limits create the wrong incentives and protect you from the wrong thing.

Spend-based limits are the right primitive. Define a budget per user, track actual USD cost in real time, and block users once they've spent their allocation.
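To make the comparison concrete, here is a small sketch that prices both users with the illustrative per-request costs from above. The names `PER_REQUEST_COST_USD` and `spendFor` are made up for this example, and real costs vary per request with token counts rather than being flat.

```typescript
// Illustrative per-request costs taken from the example above (not real pricing).
const PER_REQUEST_COST_USD: Record<string, number> = {
  "gpt-4.1-nano": 0.0001,
  "gpt-5.5": 0.05,
};

// Hypothetical helper: total spend for a number of requests to one model.
function spendFor(model: string, requests: number): number {
  const perRequest = PER_REQUEST_COST_USD[model];
  if (perRequest === undefined) throw new Error(`unknown model: ${model}`);
  return perRequest * requests;
}

const nanoUser = spendFor("gpt-4.1-nano", 100); // hits the 100-request limit
const bigUser = spendFor("gpt-5.5", 10); // uses a tenth of the request limit

// A request counter blocks the first user; a spend counter sees the second one.
console.log(`nano user: $${nanoUser.toFixed(2)}, gpt-5.5 user: $${bigUser.toFixed(2)}`);
console.log(`ratio: ${Math.round(bigUser / nanoUser)}x`);
```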

The naive implementation

Here's what most developers end up building:

async function checkAndTrackSpend(userId: string, model: string, tokens: number): Promise<void> {
  const cost = calculateCost(model, tokens)
  const key = `spend:${userId}:${currentMonth()}`

  // Some Redis clients return INCRBYFLOAT results as strings, so coerce to a number
  const spend = Number(await redis.incrbyfloat(key, cost))
  await redis.expire(key, 60 * 60 * 24 * 32) // ~1 month

  if (spend > USER_BUDGET) {
    await blockUser(userId)
  }
}

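This works, but notice when it runs: the budget check happens only after the AI call completes, so the call that pushes a user over budget still goes through. For a user on a $5/month budget, that last call might cost $2, an overage of up to 40%. Nothing in this function stops a call before it is made, so enforcement depends entirely on whatever blockUser does elsewhere.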

The right architecture: intercept before the call

A proper spend-based limit has two phases:

Phase 1 — Pre-call intercept. Before sending anything to the AI provider, check whether the user is currently blocked. This is a fast Redis lookup (typically <5ms) that prevents any call from going through for an already-blocked user.

Phase 2 — Post-call accounting. After the response comes back with actual token counts, update the running spend total. If the new total exceeds the budget, set the blocked flag for subsequent calls.

The key insight: you can't prevent the call that crosses the budget from going through (you don't know the final token count before making the call), but you can prevent all subsequent calls once a user is blocked. For most use cases, this is acceptable: the overage is bounded by a single call's cost, or by the cost of the calls already in flight if a user sends requests concurrently.

// Phase 1: check before calling.
// The flag is keyed by month so it resets together with the monthly spend key.
const month = currentMonth()
const blockedKey = `blocked:${userId}:${month}`
if (await redis.get(blockedKey)) throw new BudgetExceededError()

// Make the AI call
const result = await openai.chat.completions.create(params)

// Phase 2: update spend after, using the token counts the provider reports
const cost = calculateCost(result.model, result.usage)
const newSpend = Number(await redis.incrbyfloat(`spend:${userId}:${month}`, cost))
if (newSpend > budget) {
  await redis.set(blockedKey, '1')
}
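One way to package the two phases is a small wrapper around the AI call. The sketch below is an illustration under stated assumptions, not a reference implementation: `SpendStore`, `MemoryStore`, and `withBudget` are hypothetical names, the in-memory store stands in for Redis so the flow can be exercised locally, and the blocked flag is keyed by month so that it resets along with the budget.

```typescript
// Minimal storage interface; a thin wrapper over a Redis client could implement it.
interface SpendStore {
  isBlocked(userId: string, month: string): Promise<boolean>;
  addSpend(userId: string, month: string, usd: number): Promise<number>; // returns new total
  block(userId: string, month: string): Promise<void>;
}

class BudgetExceededError extends Error {}

// In-memory stand-in for Redis, for local testing only.
class MemoryStore implements SpendStore {
  private spend = new Map<string, number>();
  private blocked = new Set<string>();

  async isBlocked(userId: string, month: string): Promise<boolean> {
    return this.blocked.has(`${userId}:${month}`);
  }
  async addSpend(userId: string, month: string, usd: number): Promise<number> {
    const key = `${userId}:${month}`;
    const total = (this.spend.get(key) ?? 0) + usd;
    this.spend.set(key, total);
    return total;
  }
  async block(userId: string, month: string): Promise<void> {
    this.blocked.add(`${userId}:${month}`);
  }
}

// Runs `call` only if the user is not blocked, then records what the call cost.
async function withBudget<T>(
  store: SpendStore,
  userId: string,
  month: string,
  budgetUsd: number,
  call: () => Promise<{ result: T; costUsd: number }>,
): Promise<T> {
  // Phase 1: refuse before spending anything.
  if (await store.isBlocked(userId, month)) throw new BudgetExceededError();

  const { result, costUsd } = await call();

  // Phase 2: account for the actual cost and block future calls if over budget.
  const total = await store.addSpend(userId, month, costUsd);
  if (total > budgetUsd) await store.block(userId, month);
  return result;
}
```

With a $5 budget and calls costing $3 each, the first two calls succeed (the second one crosses the budget) and the third is rejected before it reaches the provider.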

What you have to build to do this yourself

Implementing this properly requires:

  • Redis infrastructure — managed Redis with low latency in every region you serve (Upstash is the right choice)
  • Model pricing table — current per-token costs for every model you support, kept up to date
  • Token-to-cost calculation — handling the OpenAI/Anthropic field name differences (prompt_tokens vs input_tokens)
  • Streaming support — extracting token usage from the final chunk, which differs between providers
  • Blocked-user persistence — the blocked flag needs to survive server restarts and clear when the budget period rolls over
  • Dashboard — you want to see per-user spend, not just total API costs
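As a rough illustration of the pricing-table and token-to-cost items above, the sketch below normalizes the two usage shapes and multiplies by a per-model price. The model names and prices are placeholders, not real pricing, and `normalizeUsage` and `PRICING` are hypothetical names. The field names are the ones mentioned in the list: `prompt_tokens`/`completion_tokens` for OpenAI and `input_tokens`/`output_tokens` for Anthropic.

```typescript
// Usage shapes as reported by the two providers.
interface OpenAIUsage { prompt_tokens: number; completion_tokens: number }
interface AnthropicUsage { input_tokens: number; output_tokens: number }
interface NormalizedUsage { inputTokens: number; outputTokens: number }

// Map either provider's usage object onto one internal shape.
function normalizeUsage(usage: OpenAIUsage | AnthropicUsage): NormalizedUsage {
  if ("prompt_tokens" in usage) {
    return { inputTokens: usage.prompt_tokens, outputTokens: usage.completion_tokens };
  }
  return { inputTokens: usage.input_tokens, outputTokens: usage.output_tokens };
}

// PLACEHOLDER prices in USD per 1M tokens; replace with the providers' current rates.
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "example-small": { inputPerM: 0.1, outputPerM: 0.4 },
  "example-large": { inputPerM: 5, outputPerM: 15 },
};

// Cost in USD for one call, given the model name and the reported usage.
function calculateCost(model: string, usage: OpenAIUsage | AnthropicUsage): number {
  const price = PRICING[model];
  if (!price) throw new Error(`no pricing for model: ${model}`);
  const { inputTokens, outputTokens } = normalizeUsage(usage);
  return (inputTokens * price.inputPerM + outputTokens * price.outputPerM) / 1e6;
}

// 1,000 input tokens and 500 output tokens on the placeholder large model.
console.log(calculateCost("example-large", { prompt_tokens: 1000, completion_tokens: 500 }));
```

Throwing on an unknown model is a deliberate choice here: silently pricing a call at $0 would let spend go untracked. Providers may also return a versioned model name in the response, so the lookup key may need normalizing before it hits the table.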

This is a week or two of work if you're doing it properly, and it's not directly related to your product.
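To give a sense of the streaming item in the list above: with OpenAI's chat completions API, usage for a streamed response arrives in a final chunk when the request sets `stream_options: { include_usage: true }`, and other providers report it through different events. The sketch below uses a deliberately simplified chunk shape (`StreamChunk` is an assumption, not a provider type) and keeps the last usage report it sees.

```typescript
interface Usage { prompt_tokens: number; completion_tokens: number }

// Simplified chunk shape for illustration; real provider chunks carry more fields.
interface StreamChunk { text?: string; usage?: Usage | null }

// Consume a stream, accumulating text and remembering the latest usage report.
async function collectStream(
  chunks: AsyncIterable<StreamChunk>,
): Promise<{ text: string; usage: Usage | null }> {
  let text = "";
  let usage: Usage | null = null;
  for await (const chunk of chunks) {
    if (chunk.text) text += chunk.text;
    if (chunk.usage) usage = chunk.usage; // usually only present on the final chunk
  }
  return { text, usage };
}

// Fake stream for local testing: two text chunks, then a usage-only chunk.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
  yield { text: "Hel" };
  yield { text: "lo", usage: null };
  yield { usage: { prompt_tokens: 12, completion_tokens: 2 } };
}
```

If the stream is aborted before the final chunk, `usage` stays `null`, so the accounting step needs a fallback such as an estimate from the text received so far.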

The alternative

Nasca handles all of this. You define spend budgets per tier in the dashboard, integrate the SDK in three lines, and catch NascaBlockedError to show your upgrade prompt. The Redis infrastructure, pricing tables, intercept logic, and streaming support are all handled.

It's free for the first 250 users, which covers most apps in the early stages when this infrastructure would otherwise be a distraction from actually building the product.

Get started →