Issues Rate Limits

Provider rate limits restrict the number of requests or tokens you can consume within a time window. When you exceed these limits, the provider returns an HTTP 429 response and your request fails. Polyglot provides built-in retry policies to handle these transient failures, but sustained rate limiting requires application-level strategies.

Symptoms¶

HTTP status code 429 (Too Many Requests)
Error messages containing "rate limit exceeded," "too many requests," or "quota exceeded"
Requests that work in isolation but fail under load

Use the Built-In Retry Policy¶

Polyglot can automatically retry failed requests with exponential backoff and jitter. Retries are opt-in and explicit -- you must attach an InferenceRetryPolicy to the inference builder:

<?php

use Cognesy\Messages\Messages;
use Cognesy\Polyglot\Inference\Config\InferenceRetryPolicy;
use Cognesy\Polyglot\Inference\Inference;

$text = Inference::using('openai')
    ->withRetryPolicy(new InferenceRetryPolicy(
        maxAttempts: 4,
        baseDelayMs: 250,
        maxDelayMs: 8000,
        jitter: 'full',
    ))
    ->withMessages(Messages::fromString('What is the capital of France?'))
    ->get();

Retry Policy Parameters¶

Parameter	Default	Description
`maxAttempts`	`1`	Total number of attempts (1 means no retries)
`baseDelayMs`	`250`	Base delay in milliseconds before the first retry
`maxDelayMs`	`8000`	Maximum delay cap in milliseconds
`jitter`	`'full'`	Jitter strategy: `none`, `full`, or `equal`
`retryOnStatus`	`[408, 429, 500, 502, 503, 504]`	HTTP status codes that trigger a retry
`retryOnExceptions`	`[TimeoutException, NetworkException]`	Exception classes that trigger a retry

The delay between retries uses exponential backoff: baseDelayMs * 2^(attempt - 1), capped at maxDelayMs. The jitter strategy adds randomness to avoid thundering herd problems:

none -- no randomness, uses the exact computed delay
full -- random delay between 0 and the computed delay
equal -- half the computed delay plus a random value up to half the computed delay

Length Recovery¶

The retry policy also supports automatic recovery when a response is truncated due to token limits:

<?php

use Cognesy\Messages\Messages;
use Cognesy\Polyglot\Inference\Config\InferenceRetryPolicy;
use Cognesy\Polyglot\Inference\Inference;

$text = Inference::using('openai')
    ->withRetryPolicy(new InferenceRetryPolicy(
        maxAttempts: 3,
        lengthRecovery: 'continue',       // or 'increase_max_tokens'
        lengthMaxAttempts: 2,
        lengthContinuePrompt: 'Continue.',
        maxTokensIncrement: 512,
    ))
    ->withMessages(Messages::fromString('Write a detailed essay about climate change.'))
    ->get();

Retry Policy for Embeddings¶

Embeddings requests use a separate policy class with the same interface:

<?php

use Cognesy\Polyglot\Embeddings\Config\EmbeddingsRetryPolicy;

$retryPolicy = new EmbeddingsRetryPolicy(
    maxAttempts: 3,
    baseDelayMs: 500,
    maxDelayMs: 10000,
    jitter: 'full',
);

Application-Level Throttling¶

When retries alone are not enough, implement request throttling in your application to stay within the provider's rate limits:

<?php

use Cognesy\Messages\Messages;
use Cognesy\Polyglot\Inference\Inference;

class RateLimiter
{
    private float $lastRequestTime = 0;
    private float $minTimeBetweenRequests;

    public function __construct(int $requestsPerMinute = 60) {
        $this->minTimeBetweenRequests = 60.0 / $requestsPerMinute;
    }

    public function waitIfNeeded(): void {
        $elapsed = microtime(true) - $this->lastRequestTime;

        if ($elapsed < $this->minTimeBetweenRequests) {
            usleep((int) (($this->minTimeBetweenRequests - $elapsed) * 1_000_000));
        }

        $this->lastRequestTime = microtime(true);
    }
}

$limiter = new RateLimiter(requestsPerMinute: 30);

for ($i = 0; $i < 10; $i++) {
    $limiter->waitIfNeeded();

    $text = Inference::using('openai')
        ->withMessages(Messages::fromString("This is request $i"))
        ->get();

    echo "Response $i: $text\n";
}

Batch Requests to Reduce Volume¶

Instead of making many small requests, combine related questions into a single prompt when the use case allows:

<?php

use Cognesy\Messages\Messages;
use Cognesy\Polyglot\Inference\Inference;

// Instead of N separate requests...
$questions = [
    'What is the capital of France?',
    'What is the capital of Germany?',
    'What is the capital of Japan?',
];

// ...combine them into one request
$batchPrompt = "Answer each question on its own line:\n";
foreach ($questions as $i => $q) {
    $batchPrompt .= ($i + 1) . ". $q\n";
}

$text = Inference::using('openai')
    ->withMessages(Messages::fromString($batchPrompt))
    ->get();

This reduces the number of API calls from N to 1, dramatically lowering rate limit pressure.

Additional Strategies¶

Switch providers or models. Different providers and models have different rate limits. If one provider is heavily throttled, route some requests to another.
Upgrade your API plan. Most providers offer higher rate limits on paid tiers.
Cache responses. If the same prompts recur frequently, cache the results to avoid redundant API calls.
Use off-peak hours. Some providers have lower contention during off-peak hours, reducing the likelihood of rate limiting.
Monitor usage. Track your request volume and token consumption to anticipate rate limit issues before they affect users.