Skip to content

Optimization

Embeddings requests are typically fast and inexpensive, but at scale the details matter. This page covers the key patterns for keeping your embeddings pipeline efficient and reliable.

Batch Inputs

The single most impactful optimization is batching. Instead of making one request per document, send multiple texts in a single call. This reduces HTTP overhead and is often cheaper per token:

<?php

use Cognesy\Polyglot\Embeddings\Embeddings;

$response = Embeddings::using('openai')
    ->withInputs([
        'Document one',
        'Document two',
        'Document three',
    ])
    ->get();

$vectors = $response->toValuesArray();

Each provider has a maximum number of inputs per request (configured as maxInputs in the preset). For OpenAI this defaults to 2048; for Cohere it is 96. When processing large datasets, chunk your documents to stay within these limits.

Processing Large Datasets

When you have more documents than a single batch can handle, process them in chunks:

<?php

use Cognesy\Polyglot\Embeddings\Embeddings;

$embeddings = Embeddings::using('openai');
$allDocuments = [/* hundreds or thousands of documents */];

$batchSize = 25; // Stay well within provider limits
$vectors = [];

for ($i = 0; $i < count($allDocuments); $i += $batchSize) {
    $batch = array_slice($allDocuments, $i, $batchSize);

    try {
        $response = $embeddings->withInputs($batch)->get();
        $vectors = array_merge($vectors, $response->toValuesArray());

        $batchNum = (int) floor($i / $batchSize) + 1;
        $totalBatches = (int) ceil(count($allDocuments) / $batchSize);
        echo "Processed batch {$batchNum} of {$totalBatches}\n";
    } catch (\Exception $e) {
        echo "Error processing batch: " . $e->getMessage() . "\n";
    }

    // Small delay to avoid hitting rate limits
    usleep(100_000); // 100ms
}

echo "Processed " . count($vectors) . " embeddings in total.\n";

Retry Policies

Network failures and rate limits are inevitable in production. Polyglot provides an EmbeddingsRetryPolicy that implements exponential backoff with configurable jitter:

<?php

use Cognesy\Polyglot\Embeddings\Config\EmbeddingsRetryPolicy;
use Cognesy\Polyglot\Embeddings\Embeddings;

$response = Embeddings::using('openai')
    ->withInputs(['Document one'])
    ->withRetryPolicy(new EmbeddingsRetryPolicy(
        maxAttempts: 3,
        baseDelayMs: 250,
        maxDelayMs: 8000,
        jitter: 'full',
        retryOnStatus: [408, 429, 500, 502, 503, 504],
    ))
    ->get();

Retry Policy Parameters

Parameter Default Description
maxAttempts 1 Total number of attempts (1 = no retries)
baseDelayMs 250 Base delay in milliseconds before the first retry
maxDelayMs 8000 Maximum delay cap in milliseconds
jitter 'full' Jitter strategy: 'none', 'full', or 'equal'
retryOnStatus [408, 429, 500, 502, 503, 504] HTTP status codes that trigger a retry
retryOnExceptions [TimeoutException, NetworkException] Exception classes that trigger a retry

The delay for each attempt is calculated as baseDelayMs * 2^(attempt-1), capped at maxDelayMs, then jitter is applied:

  • none -- Exact calculated delay, no randomization.
  • full -- Random value between 0 and the calculated delay. Best for reducing thundering herd.
  • equal -- Half the calculated delay plus a random value up to half. A middle ground.

Important: Set maxAttempts to at least 3 in production to handle transient failures gracefully. The default of 1 means no retries.

Caching Embeddings

Embedding the same text repeatedly is wasteful. For applications that frequently re-embed identical strings (such as search queries or template documents), a caching layer pays for itself quickly:

<?php

use Cognesy\Polyglot\Embeddings\Embeddings;

class CachedEmbeddings
{
    private Embeddings $embeddings;
    /** @var array<string, float[]> */
    private array $cache = [];

    public function __construct(?Embeddings $embeddings = null)
    {
        $this->embeddings = $embeddings ?? Embeddings::using('openai');
    }

    /**
     * Get the embedding for a single text, using cache when available.
     *
     * @return float[]
     */
    public function embed(string $text, array $options = []): array
    {
        $key = $this->cacheKey($text, $options);

        if (isset($this->cache[$key])) {
            return $this->cache[$key];
        }

        $vector = $this->embeddings
            ->withInputs($text)
            ->withOptions($options)
            ->first()
            ->values();

        $this->cache[$key] = $vector;

        return $vector;
    }

    /**
     * Embed multiple texts, fetching only uncached ones from the API.
     *
     * @param string[] $texts
     * @return float[][]
     */
    public function embedMany(array $texts, array $options = []): array
    {
        $results = [];
        $uncachedTexts = [];
        $uncachedIndices = [];

        foreach ($texts as $i => $text) {
            $key = $this->cacheKey($text, $options);
            if (isset($this->cache[$key])) {
                $results[$i] = $this->cache[$key];
            } else {
                $uncachedTexts[] = $text;
                $uncachedIndices[] = $i;
            }
        }

        if ($uncachedTexts !== []) {
            $response = $this->embeddings
                ->withInputs($uncachedTexts)
                ->withOptions($options)
                ->get();

            foreach ($response->toValuesArray() as $j => $vector) {
                $i = $uncachedIndices[$j];
                $results[$i] = $vector;
                $this->cache[$this->cacheKey($texts[$i], $options)] = $vector;
            }
        }

        ksort($results);
        return $results;
    }

    private function cacheKey(string $text, array $options): string
    {
        return md5($text . serialize($options));
    }
}

Usage:

<?php

$cached = new CachedEmbeddings(Embeddings::using('openai'));

// First call hits the API
$vector = $cached->embed('What is machine learning?');

// Second call returns from cache instantly
$vector = $cached->embed('What is machine learning?');

// Batch with partial cache hits
$vectors = $cached->embedMany([
    'What is machine learning?',  // cached
    'How do neural networks work?', // API call
]);

Tip: For persistent caching across requests, replace the in-memory array with Redis, Memcached, or a database-backed store.

Choosing the Right Model

Model selection has a direct impact on both cost and quality. Here are the key trade-offs:

Factor Smaller Models Larger Models
Dimensions Fewer (e.g., 256-1536) More (e.g., 3072)
Speed Faster response times Slower response times
Cost Lower per-token cost Higher per-token cost
Quality Good for general use Better for nuanced similarity
Storage Less memory per vector More memory per vector

Some providers (like OpenAI's text-embedding-3 models) support requesting a specific number of dimensions, letting you trade precision for storage efficiency:

<?php

use Cognesy\Polyglot\Embeddings\Embeddings;

// Full-dimension embedding (3072 dimensions)
$full = Embeddings::using('openai')
    ->withModel('text-embedding-3-large')
    ->withInputs('Sample text')
    ->first();

// Reduced-dimension embedding (256 dimensions, less storage)
$compact = Embeddings::using('openai')
    ->withModel('text-embedding-3-large')
    ->withInputs('Sample text')
    ->withOptions(['dimensions' => 256])
    ->first();

echo "Full: " . count($full->values()) . " dimensions\n";
echo "Compact: " . count($compact->values()) . " dimensions\n";

Best Practices

Batch whenever possible. A single request with 100 texts is faster and cheaper than 100 individual requests.

Set retry policies in production. Rate limits (HTTP 429) and transient server errors are common. Configure at least 3 attempts with jitter to handle them gracefully.

Cache aggressively. Embeddings for the same text and model are deterministic. Cache them to avoid redundant API calls and reduce latency.

Monitor token usage. Use the usage() method on responses to track consumption and detect unexpected spikes:

<?php

use Cognesy\Polyglot\Embeddings\Embeddings;

$response = Embeddings::using('openai')
    ->withInputs($documents)
    ->get();

$usage = $response->usage();
echo "Tokens used: " . $usage->total() . "\n";

Match dimensions to your storage. If you are storing millions of vectors, reducing dimensions from 3072 to 256 can cut storage costs by over 90% with only modest quality loss.