Request Response

Polyglot normalizes all provider interactions into a small set of data objects. These objects are immutable: every mutation returns a new instance, so they are safe to pass around and branch from.

InferenceRequest

InferenceRequest encapsulates everything needed for an LLM call. It stores the conversation messages, model selection, tools, response format, options, and caching/retry configuration.

Namespace: Cognesy\Polyglot\Inference\Data\InferenceRequest

Key Properties

Property	Type	Description
`id`	`InferenceRequestId`	Unique identifier, auto-generated
`createdAt`	`DateTimeImmutable`	Timestamp of creation
`updatedAt`	`DateTimeImmutable`	Timestamp of last mutation
`messages`	`Messages`	The conversation messages
`model`	`string`	Model identifier
`tools`	`ToolDefinitions`	Tool/function definitions
`toolChoice`	`ToolChoice`	Tool selection strategy
`responseFormat`	`ResponseFormat`	Structured output format
`options`	`array`	Additional options (e.g. `stream`, `max_tokens`, `temperature`)
`cachedContext`	`CachedInferenceContext`	Shared context for prompt caching
`responseCachePolicy`	`ResponseCachePolicy`	Controls response caching behavior
`retryPolicy`	`?InferenceRetryPolicy`	Retry configuration

Reading Values

$request->messages();            // Messages -- the message list
$request->model();               // string
$request->isStreamed();          // bool -- checks options['stream']
$request->tools();               // ToolDefinitions
$request->toolChoice();          // ToolChoice
$request->responseFormat();      // ResponseFormat
$request->options();             // array
$request->cachedContext();       // ?CachedInferenceContext
$request->responseCachePolicy(); // ResponseCachePolicy
$request->retryPolicy();         // ?InferenceRetryPolicy
$request->id();                  // InferenceRequestId

Predicate methods are also available: hasMessages(), hasModel(), hasTools(), hasToolChoice(), hasResponseFormat(), hasNonTextResponseFormat(), hasTextResponseFormat(), hasOptions().

Modifying a Request

All mutators return a new instance, preserving the original request ID and creation timestamp:

$updated = $request
    ->withMessages(Messages::fromString('New prompt'))
    ->withModel('gpt-4.1')
    ->withStreaming(true)
    ->withOptions(['temperature' => 0.7])
    ->withTools($toolDefinitions)
    ->withToolChoice('auto')
    ->withResponseFormat(['type' => 'json_object'])
    ->withRetryPolicy(new InferenceRetryPolicy(maxAttempts: 3))
    ->withResponseCachePolicy(ResponseCachePolicy::Memory);

The with(...) method allows setting multiple fields in a single call:

$updated = $request->with(
    messages: Messages::fromString('New prompt'),
    model: 'gpt-4.1',
    options: ['temperature' => 0.7],
);
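
To make the "returns a new instance" behavior concrete, here is a minimal, self-contained sketch of the immutable with-er pattern. `SketchRequest` and its fields are hypothetical stand-ins written for this example; they are not Polyglot classes, and the real InferenceRequest may be implemented differently.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for illustration only; not the Polyglot InferenceRequest.
final class SketchRequest
{
    public function __construct(
        public readonly string $id,
        public readonly DateTimeImmutable $createdAt,
        public readonly string $model = '',
        public readonly array $options = [],
    ) {}

    // Returns a copy with a different model; identity fields are carried over.
    public function withModel(string $model): self
    {
        return new self($this->id, $this->createdAt, $model, $this->options);
    }

    // Returns a copy with different options; the original is left untouched.
    public function withOptions(array $options): self
    {
        return new self($this->id, $this->createdAt, $this->model, $options);
    }
}

$original = new SketchRequest(id: 'req-1', createdAt: new DateTimeImmutable());
$updated = $original->withModel('gpt-4.1')->withOptions(['temperature' => 0.7]);

var_dump($original->model === '');         // bool(true): the original is unchanged
var_dump($updated->model === 'gpt-4.1');   // bool(true)
var_dump($updated->id === $original->id);  // bool(true): same request ID
```

Because every step produces a fresh object, a base request can be branched into several variants without one variant affecting another.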

Cached Context

The cached context mechanism allows you to separate stable parts of a prompt (system messages, tool definitions, response format) from the dynamic parts (user messages). When withCacheApplied() is called, the cached context is merged into the request:

$request = new InferenceRequest(
    messages: Messages::fromString('What is 2+2?'),
    cachedContext: new CachedInferenceContext(
        messages: [['role' => 'system', 'content' => 'You are a math tutor.']],
        tools: $toolDefinitions,
        responseFormat: ['type' => 'json_object'],
    ),
);

// Merges cached messages before request messages,
// cached tools/format used if request has none
$merged = $request->withCacheApplied();

After the cache is applied, the merged request carries an empty cached context, which prevents the context from being applied a second time.
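
The merge rules described in this section can be sketched with plain arrays. This is an illustration under assumptions, not the library's `withCacheApplied()` implementation: the function name `applyCachedContextSketch` and the array layout are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch of the merge rules, using plain arrays instead of
// the Polyglot data objects.
function applyCachedContextSketch(array $request): array
{
    $cached = $request['cachedContext'] ?? [];

    return [
        // Cached (stable) messages come first, then the request's own messages.
        'messages' => array_merge($cached['messages'] ?? [], $request['messages'] ?? []),
        // Cached tools and response format are used only when the request has none.
        'tools' => !empty($request['tools']) ? $request['tools'] : ($cached['tools'] ?? []),
        'responseFormat' => !empty($request['responseFormat'])
            ? $request['responseFormat']
            : ($cached['responseFormat'] ?? []),
        // Emptied so that applying the cache a second time changes nothing.
        'cachedContext' => [],
    ];
}

$request = [
    'messages' => [['role' => 'user', 'content' => 'What is 2+2?']],
    'tools' => [],
    'responseFormat' => [],
    'cachedContext' => [
        'messages' => [['role' => 'system', 'content' => 'You are a math tutor.']],
        'responseFormat' => ['type' => 'json_object'],
    ],
];

$merged = applyCachedContextSketch($request);
$again = applyCachedContextSketch($merged);

var_dump($merged['messages'][0]['role']); // string(6) "system"
var_dump($again === $merged);             // bool(true): second application is a no-op
```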

Serialization

Requests can be serialized to and from arrays for storage or transport:

$array = $request->toArray();
$restored = InferenceRequest::fromArray($array);

PendingInference

PendingInference is a lazy handle for a single inference operation. It does not execute the request until you access the results. This enables the fluent Inference API to defer execution to the moment of consumption.

Namespace: Cognesy\Polyglot\Inference\PendingInference

Consuming Results

// Get plain text content
$text = $pending->get();

// Get the full response object
$response = $pending->response();

// Stream the response (requires streaming to be enabled)
$stream = $pending->stream();

// Extract JSON from the response content
$json = $pending->asJson();          // string
$data = $pending->asJsonData();      // array

// Extract tool call arguments as JSON
$json = $pending->asToolCallJson();      // string
$data = $pending->asToolCallJsonData();  // array

// Check if streaming is enabled for this request
$isStreamed = $pending->isStreamed();

The underlying InferenceExecutionSession handles retry logic, event dispatching, and response caching. Once execution completes, the response is cached for the lifetime of the PendingInference instance.

Important: Calling stream() on a non-streaming request will throw an InvalidArgumentException. Enable streaming via withStreaming(true) on the facade before calling create().
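
The deferred-execution idea can be shown with a small, self-contained sketch. `LazyHandleSketch` is a hypothetical class written for this example and is not the Polyglot PendingInference; it only mimics two documented behaviors: the call runs on first access and is cached afterwards, and `stream()` refuses to run when streaming is not enabled.

```php
<?php
declare(strict_types=1);

// Hypothetical lazy, memoizing handle; not the Polyglot PendingInference.
final class LazyHandleSketch
{
    private ?string $cached = null;

    /** @param Closure(): string $execute Performs the (expensive) call. */
    public function __construct(
        private readonly Closure $execute,
        private readonly bool $streamed = false,
    ) {}

    // Runs the call on first access, then serves the stored result.
    public function get(): string
    {
        return $this->cached ??= ($this->execute)();
    }

    // Mirrors the documented guard: streaming must be enabled up front.
    public function stream(): iterable
    {
        if (!$this->streamed) {
            throw new InvalidArgumentException('Streaming is not enabled for this request.');
        }
        return [$this->get()];
    }
}

$calls = 0;
$pending = new LazyHandleSketch(function () use (&$calls): string {
    $calls++;
    return 'Hello';
});

$callsBeforeGet = $calls;  // 0: creating the handle did not run anything
$first = $pending->get();  // runs the closure
$second = $pending->get(); // served from the cached result

var_dump($calls); // int(1)
```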

InferenceResponse

InferenceResponse is a final readonly value object that normalizes the provider's result into a consistent shape.

Namespace: Cognesy\Polyglot\Inference\Data\InferenceResponse

Reading the Response

$response->content();           // string -- the generated text
$response->reasoningContent();  // string -- chain-of-thought (if available)
$response->toolCalls();         // ToolCalls collection
$response->usage();             // InferenceUsage object with token counts
$response->finishReason();      // InferenceFinishReason enum
$response->responseData();      // HttpResponse -- the raw HTTP response
$response->isPartial();         // bool -- true for intermediate streaming results

Predicate methods: hasContent(), hasReasoningContent(), hasToolCalls(), hasFinishReason().

JSON Extraction

The response provides convenience methods for extracting structured data:

// Find JSON in the response content
$json = $response->findJsonData();           // Json object
$data = $response->findJsonData()->toArray(); // array
$str = $response->findJsonData()->toString(); // string

// Extract tool call arguments
$json = $response->findToolCallJsonData();  // Json object

When a response has a single tool call, findToolCallJsonData() returns the arguments of that call. When there are multiple tool calls, it returns an array of all tool call data.

Reasoning Content Fallback

Some providers embed reasoning in <think> tags within the content rather than in a dedicated field. The withReasoningContentFallbackFromContent() method handles this:

$response = $response->withReasoningContentFallbackFromContent();
// Now $response->reasoningContent() contains the extracted reasoning
// And $response->content() has the <think> tags removed

This is a no-op if the response already has dedicated reasoning content or if no <think> tags are present.
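
A rough sketch of how such a fallback could work is shown here. `splitThinkTagsSketch` is a hypothetical helper, not the library's implementation, and it assumes a single `<think>...</think>` block; the real method may handle more cases.

```php
<?php
declare(strict_types=1);

// Hypothetical helper sketching the fallback. Returns [content, reasoning].
// Only the first <think> block is extracted.
function splitThinkTagsSketch(string $content, string $reasoning = ''): array
{
    // No-op when dedicated reasoning content already exists.
    if ($reasoning !== '') {
        return [$content, $reasoning];
    }
    // No-op when the content has no <think>...</think> block.
    if (preg_match('/<think>(.*?)<\/think>/s', $content, $match) !== 1) {
        return [$content, $reasoning];
    }
    // Remove the tagged block from the visible content.
    $stripped = trim(str_replace($match[0], '', $content));

    return [$stripped, trim($match[1])];
}

[$content, $reasoning] = splitThinkTagsSketch("<think>2+2 is 4</think>\nThe answer is 4.");

var_dump($content);   // string(16) "The answer is 4."
var_dump($reasoning); // string(8) "2+2 is 4"
```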

Finish Reason

The finishReason() method returns an InferenceFinishReason enum. The hasFinishedWithFailure() method checks whether the response ended with an error, content filter, or length limit:

if ($response->hasFinishedWithFailure()) {
    // Handle error, content_filter, or length finish reasons
}
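
The failure check can be pictured as a predicate on a backed enum. The enum below is a hypothetical sketch: the case names and string values are assumptions chosen to match the three failure categories named above, and they are not taken from InferenceFinishReason.

```php
<?php
declare(strict_types=1);

// Hypothetical enum for illustration; case names and values are assumptions.
enum FinishReasonSketch: string
{
    case Stop = 'stop';
    case ToolCalls = 'tool_calls';
    case Length = 'length';
    case ContentFilter = 'content_filter';
    case Error = 'error';

    // True for the three categories the text above treats as failures.
    public function isFailure(): bool
    {
        return match ($this) {
            self::Length, self::ContentFilter, self::Error => true,
            default => false,
        };
    }
}

var_dump(FinishReasonSketch::from('length')->isFailure()); // bool(true)
var_dump(FinishReasonSketch::from('stop')->isFailure());   // bool(false)
```

Treating a length limit as a failure is a deliberate choice in this kind of check: a truncated answer is often unusable for structured output, even though the provider did not report an error.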

Serialization

Responses support round-trip serialization:

$array = $response->toArray();
$restored = InferenceResponse::fromArray($array);

PartialInferenceDelta

During streaming, the driver emits PartialInferenceDelta objects for each SSE event. Each delta carries only the incremental change from that event.

Namespace: Cognesy\Polyglot\Inference\Data\PartialInferenceDelta

Fields

Field	Type	Description
`contentDelta`	`string`	Incremental text content
`reasoningContentDelta`	`string`	Incremental reasoning content
`toolId`	`ToolCallId|string|null`	Tool call identifier
`toolName`	`string`	Tool name (first delta of a tool call)
`toolArgs`	`string`	Incremental tool call arguments
`finishReason`	`string`	Set on the final delta
`usage`	`?InferenceUsage`	Token usage (typically on the last delta)
`usageIsCumulative`	`bool`	Whether usage represents total (true) or incremental (false)
`responseData`	`?HttpResponse`	Raw response data for this event
`value`	`mixed`	Optional provider-specific value

The InferenceStream accumulates these deltas internally using InferenceStreamState and assembles the final InferenceResponse when the stream completes. A VisibilityTracker ensures that only deltas with meaningful content changes are yielded to the caller.
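
The accumulation step can be approximated with a small accumulator. `DeltaAccumulatorSketch` is hypothetical and much simpler than InferenceStreamState and VisibilityTracker, whose actual logic is not shown in this document. It represents deltas as plain arrays keyed by the field names from the table and assumes that `toolId` arrives with the first delta of each tool call.

```php
<?php
declare(strict_types=1);

// Hypothetical accumulator; the library's InferenceStreamState may differ.
final class DeltaAccumulatorSketch
{
    public string $content = '';
    public string $finishReason = '';
    /** @var array<string, array{name: string, args: string}> keyed by tool call ID */
    public array $toolCalls = [];
    private string $currentToolId = '';

    // Applies one delta and returns true when it changed something visible.
    public function apply(array $delta): bool
    {
        $visible = false;
        if (($delta['contentDelta'] ?? '') !== '') {
            $this->content .= $delta['contentDelta'];
            $visible = true;
        }
        if (($delta['toolId'] ?? null) !== null) {
            // Later argument fragments attach to the most recent tool call ID.
            $this->currentToolId = (string) $delta['toolId'];
            if (!isset($this->toolCalls[$this->currentToolId])) {
                $this->toolCalls[$this->currentToolId] = ['name' => $delta['toolName'] ?? '', 'args' => ''];
                $visible = true;
            }
        }
        if (($delta['toolArgs'] ?? '') !== '' && $this->currentToolId !== '') {
            $this->toolCalls[$this->currentToolId]['args'] .= $delta['toolArgs'];
            $visible = true;
        }
        if (($delta['finishReason'] ?? '') !== '') {
            $this->finishReason = $delta['finishReason'];
            $visible = true;
        }
        return $visible;
    }
}

$acc = new DeltaAccumulatorSketch();
$deltas = [
    ['contentDelta' => 'Hel'],
    ['contentDelta' => 'lo'],
    [], // an event that carries no changes
    ['toolId' => 'call_1', 'toolName' => 'add'],
    ['toolArgs' => '{"a":2,'],
    ['toolArgs' => '"b":2}'],
    ['finishReason' => 'stop'],
];

$visibleCount = 0;
foreach ($deltas as $delta) {
    if ($acc->apply($delta)) {
        $visibleCount++;
    }
}

var_dump($acc->content);                     // string(5) "Hello"
var_dump($acc->toolCalls['call_1']['args']); // string(13) "{"a":2,"b":2}"
var_dump($visibleCount);                     // int(6): the empty event was skipped
```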

InferenceUsage

The InferenceUsage object tracks token consumption across several categories.

Namespace: Cognesy\Polyglot\Inference\Data\InferenceUsage

$usage = $response->usage();

$usage->inputTokens;       // int -- prompt tokens
$usage->outputTokens;      // int -- completion tokens
$usage->cacheWriteTokens;  // int -- tokens written to cache
$usage->cacheReadTokens;   // int -- tokens read from cache
$usage->reasoningTokens;   // int -- tokens used for reasoning

// Aggregate accessors
$usage->total();   // sum of all token categories
$usage->input();   // input tokens only
$usage->output();  // output + reasoning tokens
$usage->cache();   // cache write + cache read tokens

// String representation
$usage->toString(); // "Tokens: 150 (i:100 o:40 c:0 r:10)"

Cost Calculation

Cost is calculated externally using a calculator rather than through methods on the usage object. Pricing is specified in USD per 1 million tokens:

use Cognesy\Polyglot\Inference\Data\InferencePricing;
use Cognesy\Polyglot\Pricing\FlatRateCostCalculator;

$calculator = new FlatRateCostCalculator();
$cost = $calculator->calculate($usage, new InferencePricing(
    inputPerMToken: 0.15,
    outputPerMToken: 0.60,
));

// The Cost value object
$cost->total;          // float -- total cost in USD
$cost->breakdown;      // array -- per-category breakdown
$cost->toString();     // string representation
$cost->toArray();      // array representation
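
For a sense of the per-million-token arithmetic, here is a standalone sketch. `flatRateCostSketch` is a hypothetical function, not the FlatRateCostCalculator source; it covers only input and output tokens and ignores cache and reasoning pricing.

```php
<?php
declare(strict_types=1);

// Hypothetical flat-rate arithmetic. Prices are in USD per 1,000,000 tokens.
function flatRateCostSketch(
    int $inputTokens,
    int $outputTokens,
    float $inputPerMToken,
    float $outputPerMToken,
): array {
    // Scale token counts to millions, then multiply by the per-million price.
    $inputCost = $inputTokens / 1_000_000 * $inputPerMToken;
    $outputCost = $outputTokens / 1_000_000 * $outputPerMToken;

    return [
        'input' => $inputCost,
        'output' => $outputCost,
        'total' => $inputCost + $outputCost,
    ];
}

// 2M input tokens at $0.15/M plus 0.5M output tokens at $0.60/M.
$cost = flatRateCostSketch(2_000_000, 500_000, 0.15, 0.60);

printf("%.2f\n", $cost['total']); // 0.60
```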

Accumulation

Usage and cost can be accumulated across multiple requests:

$total = $usage1->withAccumulated($usage2);
$totalCost = $cost1->withAccumulated($cost2);
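
Accumulation on an immutable value object amounts to a per-category sum that returns a new object. The class below is a hypothetical sketch: `UsageSketch` borrows the property names listed for InferenceUsage but is not the library class.

```php
<?php
declare(strict_types=1);

// Hypothetical value object; not the Polyglot InferenceUsage class.
final class UsageSketch
{
    public function __construct(
        public readonly int $inputTokens = 0,
        public readonly int $outputTokens = 0,
        public readonly int $cacheWriteTokens = 0,
        public readonly int $cacheReadTokens = 0,
        public readonly int $reasoningTokens = 0,
    ) {}

    // Sum of all token categories.
    public function total(): int
    {
        return $this->inputTokens + $this->outputTokens
            + $this->cacheWriteTokens + $this->cacheReadTokens
            + $this->reasoningTokens;
    }

    // Returns a new object with per-category sums; both operands stay unchanged.
    public function withAccumulated(self $other): self
    {
        return new self(
            $this->inputTokens + $other->inputTokens,
            $this->outputTokens + $other->outputTokens,
            $this->cacheWriteTokens + $other->cacheWriteTokens,
            $this->cacheReadTokens + $other->cacheReadTokens,
            $this->reasoningTokens + $other->reasoningTokens,
        );
    }
}

$usage1 = new UsageSketch(inputTokens: 100, outputTokens: 40, reasoningTokens: 10);
$usage2 = new UsageSketch(inputTokens: 50, outputTokens: 20, cacheReadTokens: 30);
$total = $usage1->withAccumulated($usage2);

echo $total->total(), "\n";  // 250
echo $usage1->total(), "\n"; // 150: the first operand is unchanged
```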

Embeddings Data Objects

EmbeddingsRequest

Holds the input texts, model, options, and retry policy for an embeddings call:

$request = new EmbeddingsRequest(
    input: ['Hello world', 'Another text'],
    model: 'text-embedding-3-small',
    options: ['dimensions' => 256],
);

$request->inputs();     // array of strings
$request->model();      // string
$request->options();    // array
$request->hasInputs();  // bool
$request->retryPolicy(); // ?EmbeddingsRetryPolicy

// Mutators return new instances; $request itself is unchanged
$updated = $request->withInputs('New text');
$updated = $request->withModel('text-embedding-3-large');
$updated = $request->withOptions(['dimensions' => 1024]);

EmbeddingsResponse

Normalizes the provider's embeddings result:

$response->vectors();       // Vector[] -- all embedding vectors
$response->first();         // ?Vector -- first vector
$response->last();          // ?Vector -- last vector
$response->all();           // Vector[] -- alias for vectors()
$response->usage();         // InferenceUsage
$response->toValuesArray(); // array of float arrays
$response->split($index);   // [Vector[], Vector[]] -- split at index

PendingEmbeddings

A lazy handle similar to PendingInference. Calling get() triggers the HTTP request and returns an EmbeddingsResponse. The response is cached after the first call. Retry logic is handled internally based on the EmbeddingsRetryPolicy attached to the request, using the same exponential backoff pattern as inference retries.

$pending = $embeddings->withInputs('Hello world')->create();
$response = $pending->get();      // triggers HTTP call
$request = $pending->request();   // access the original request
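
The exponential backoff pattern mentioned above can be sketched as a generic retry helper. Everything in this block is an assumption made for illustration: the function name, the parameter names (`maxAttempts`, `baseDelayMs`, `sleep`), and the doubling schedule are not taken from EmbeddingsRetryPolicy, whose actual fields are not described in this document.

```php
<?php
declare(strict_types=1);

// Hypothetical retry helper with exponential backoff.
// $sleep is injectable so the delays can be observed without waiting.
function retryWithBackoffSketch(
    callable $operation,
    int $maxAttempts = 3,
    int $baseDelayMs = 100,
    ?callable $sleep = null,
): mixed {
    $sleep ??= static fn (int $ms) => usleep($ms * 1000);

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation($attempt);
        } catch (RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // attempts exhausted: surface the last error
            }
            // The delay doubles after each failure: base, 2x base, 4x base, ...
            $sleep($baseDelayMs * 2 ** ($attempt - 1));
        }
    }
}

$delays = [];
$result = retryWithBackoffSketch(
    function (int $attempt): string {
        if ($attempt < 3) {
            throw new RuntimeException('transient failure');
        }
        return 'ok';
    },
    maxAttempts: 3,
    baseDelayMs: 100,
    // Record the delay instead of sleeping, to keep the example fast.
    sleep: function (int $ms) use (&$delays): void {
        $delays[] = $ms;
    },
);

var_dump($result); // string(2) "ok"
var_dump($delays === [100, 200]); // bool(true)
```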