# Lifecycle
Understanding the request lifecycle helps when debugging provider issues, implementing custom drivers, or hooking into events for observability. This page traces the complete flow for both inference and embeddings operations.
## Inference Lifecycle
### 1. Request Construction
The lifecycle begins when the application builds an InferenceRequest through the Inference facade:
```php
$inference = Inference::using('openai')
    ->withMessages(Messages::fromString('Explain PHP generics.'))
    ->withModel('gpt-4.1-nano')
    ->withMaxTokens(1024);
```
At this point, no HTTP call has been made. The facade holds an InferenceRequestBuilder that accumulates parameters. Every with*() call returns a new immutable copy, so the original instance is never modified.
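The copy-on-write behavior can be sketched with the builder calls shown above:

```php
// Each with*() call returns a new instance; earlier handles are untouched.
$base = Inference::using('openai')
    ->withMessages(Messages::fromString('Explain PHP generics.'));

$small = $base->withModel('gpt-4.1-nano'); // new copy with a different model
$long  = $base->withMaxTokens(4096);       // another independent copy

// $base still carries neither the model nor the max-tokens override.
```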
### 2. Creating a Pending Handle
Calling create() (or a shortcut like get() or response()) builds the InferenceRequest and passes it to the runtime:
The InferenceRuntime wraps the request in an InferenceExecution object and returns a PendingInference handle. Execution is still deferred -- no HTTP call has been sent yet.
The InferenceExecution tracks the full lifecycle state: the original request, retry attempts, usage accumulation, and the final response.
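The deferred handoff can be sketched as follows, using only the calls described on this page:

```php
// Builds the InferenceRequest and wraps it -- still no HTTP traffic.
$pending = $inference->create(); // returns PendingInference

// Nothing executes until the handle is read:
$response = $pending->response(); // first read triggers the HTTP call
```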
### 3. Triggering Execution
The HTTP call is triggered only when you read from the PendingInference:
```php
$text = $pending->get();          // triggers execution, returns content string
$response = $pending->response(); // triggers execution, returns InferenceResponse
$stream = $pending->stream();     // triggers execution (streaming mode)
```
Internally, PendingInference delegates to InferenceExecutionSession, which orchestrates the full lifecycle.
### 4. The Execution Session
The InferenceExecutionSession is the heart of the lifecycle. It performs these steps for a non-streaming request:
- Dispatches `InferenceStarted` -- signals the beginning of the operation, including the execution ID, request details, and whether streaming is enabled
- Dispatches `InferenceAttemptStarted` -- signals the beginning of an attempt with the attempt number and model
- Calls the driver -- `driver->makeResponseFor($request)` triggers the full request-response cycle:
    - The driver's request adapter converts `InferenceRequest` into an `HttpRequest`
    - The HTTP client sends the request to the provider
    - The driver's response adapter normalizes the raw `HttpResponse` into an `InferenceResponse`
- Checks the response -- if the finish reason indicates a failure (error, content filter, or length limit), the session handles it according to the retry policy
- Dispatches success events:
    - `InferenceResponseCreated` -- the response is ready
    - `InferenceAttemptSucceeded` -- the attempt completed, including finish reason and usage
    - `InferenceUsageReported` -- token usage (`InferenceUsage`) is reported with the model name
    - `InferenceCompleted` -- the entire operation is done, including total attempt count and timing
- Returns `InferenceResponse` to the caller
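In simplified form, the happy path of the session reads roughly like this (an illustrative sketch assembled from the steps above, not the actual implementation):

```php
// Illustrative sketch of InferenceExecutionSession's non-streaming happy path.
$events->dispatch(new InferenceStarted(/* execution id, request, isStreamed */));
$events->dispatch(new InferenceAttemptStarted(/* attempt number, model */));

// Request adapter -> HTTP client -> response adapter, all inside the driver:
$response = $driver->makeResponseFor($request);

// (Failure finish reasons would be routed through the retry policy here.)
$events->dispatch(new InferenceResponseCreated(/* response */));
$events->dispatch(new InferenceAttemptSucceeded(/* finish reason, usage */));
$events->dispatch(new InferenceUsageReported(/* InferenceUsage, model */));
$events->dispatch(new InferenceCompleted(/* attempt count, timing */));

return $response;
```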
Cost calculation is performed externally using a FlatRateCostCalculator with InferencePricing data from the LLMConfig, rather than being attached to the usage object in the pipeline.
### 5. Retry Handling
If the request fails with a retryable error (transient HTTP status, timeout, network error, or provider-classified retriable exception), the session:
- Records the failure on the execution object
- Dispatches `InferenceAttemptFailed` -- with the error details, HTTP status code, partial usage, and `willRetry: true`
- Waits for the configured delay (exponential backoff with optional jitter)
- Dispatches a new `InferenceAttemptStarted` and retries
If all attempts are exhausted, the session dispatches InferenceCompleted with isSuccess: false and throws the terminal error.
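The delay between attempts follows the familiar exponential-backoff-with-jitter pattern; a generic, self-contained sketch (not the library's actual code):

```php
// Generic exponential backoff with full jitter -- illustrative only.
function backoffDelayMs(int $attempt, int $baseMs = 500, int $maxMs = 30_000): int
{
    $exp = min($maxMs, $baseMs * (2 ** ($attempt - 1))); // 500, 1000, 2000, ...
    return random_int(0, $exp); // full jitter: uniform in [0, $exp]
}
```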
Length-limit recovery has special handling. When a response finishes with Length as the finish reason and the retry policy allows length recovery, the session can:
- `'continue'` -- append the partial response as an assistant message, add a continuation prompt, and retry
- `'increase_max_tokens'` -- increase the `max_tokens` option by the configured increment and retry

This is independent of the regular retry count and is controlled by `lengthMaxAttempts`.
### 6. Cached Context
If the request includes a CachedInferenceContext, the driver applies it before sending. Cached context allows you to pre-configure messages, tools, tool choice, and response format that are prepended to or merged with the request's own values. This is particularly useful for system prompts or shared tool definitions that remain constant across calls.
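As an illustration, cached context might be attached like this (a sketch only; the `withCachedContext()` method name and its arguments are assumptions, not confirmed API):

```php
// Sketch only -- withCachedContext() and its parameters are assumed names.
$inference = Inference::using('anthropic')
    ->withCachedContext(
        messages: [['role' => 'system', 'content' => $sharedSystemPrompt]],
    )
    ->withMessages(Messages::fromString('First user question'));
```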
## Streaming Lifecycle
When streaming is enabled, the flow diverges after the HTTP request is sent:
- `PendingInference::stream()` validates that streaming was requested, then creates an `InferenceStream`
- The driver produces an iterable of `PartialInferenceDelta` objects from the SSE event stream via `driver->makeStreamDeltasFor($request)`
- The `InferenceStream` tracks visibility state through a `VisibilityTracker` and yields only deltas with meaningful changes (filtering out empty or duplicate deltas)
```php
$stream = $inference->withMessages(Messages::fromString('Hello'))->stream();

foreach ($stream->deltas() as $delta) {
    echo $delta->contentDelta; // incremental text
}

$finalResponse = $stream->final(); // assembled InferenceResponse
```
### Stream Events
The stream dispatches events as deltas arrive:
- `StreamFirstChunkReceived` -- when the first visible delta arrives, including the request start time for TTFC measurement
- `PartialInferenceDeltaCreated` -- for each visible delta
- `InferenceResponseCreated` -- when the stream finishes and the final response is assembled from accumulated state
### Stream Processing
The stream supports functional-style processing through `map()`, `reduce()`, and `filter()`:

```php
// Map deltas to extracted values
$contents = $stream->map(fn($delta) => $delta->contentDelta);

// Reduce deltas into a single value
$fullText = $stream->reduce(fn($carry, $delta) => $carry . $delta->contentDelta, '');

// Filter deltas
$toolDeltas = $stream->filter(fn($delta) => $delta->toolName !== '');

// Collect all visible deltas
$allDeltas = $stream->all();
```
### Delta Callback
You can register a callback that fires for every visible delta:
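A sketch of what registration could look like (the `onEachDelta()` method name is a hypothetical placeholder, not confirmed API):

```php
// Hypothetical registration method -- the name is illustrative only.
$stream->onEachDelta(function ($delta) {
    // Invoked once per visible delta as it arrives.
    error_log('delta: ' . $delta->contentDelta);
});
```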
### Stream Finalization
Calling final() on a stream that has not been fully consumed will drain the remaining deltas first, ensuring the final response is complete. A stream can only be consumed once -- calling deltas() a second time throws a LogicException.
The final response assembled from the stream goes through the same event dispatch as a synchronous response.
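The single-consumption rule can be sketched using only the calls shown above:

```php
$stream = $inference->withMessages(Messages::fromString('Hello'))->stream();

// final() drains any unconsumed deltas before assembling the response.
$response = $stream->final();

// The deltas can only be iterated once; a second pass throws.
try {
    foreach ($stream->deltas() as $delta) { /* never reached */ }
} catch (\LogicException $e) {
    // stream was already consumed
}
```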
## Embeddings Lifecycle
The embeddings lifecycle is simpler since streaming is not involved:
- `Embeddings` builds an `EmbeddingsRequest` from the configured inputs, model, and options
- `create()` returns `PendingEmbeddings` -- a lazy handle that holds the request, driver, and event dispatcher
- `get()` triggers execution:
    - The driver's `handle()` method sends the HTTP request
    - The response body is decoded and passed to `driver->fromData()` to build an `EmbeddingsResponse`
    - `EmbeddingsResponseReceived` is dispatched
    - `EmbeddingsResponse` is returned -- containing vectors and usage
```php
$response = Embeddings::using('openai')
    ->withInputs(['Hello', 'World'])
    ->get();

$vectors = $response->vectors(); // Vector[]
$first = $response->first();     // first Vector
$usage = $response->usage();     // InferenceUsage
```
Retry logic is handled internally by PendingEmbeddings based on the EmbeddingsRetryPolicy attached to the request. The retry loop follows the same exponential backoff pattern as inference retries.
## Response Caching
Both the inference and embeddings lifecycles support response caching. When ResponseCachePolicy is set on the request, the InferenceExecutionSession caches the response after the first successful execution. Subsequent calls to response() or get() on the same PendingInference return the cached result without making another HTTP call.
```php
use Cognesy\Polyglot\Inference\Enums\ResponseCachePolicy;

$pending = $inference
    ->withMessages(Messages::fromString('Hello'))
    ->withResponseCachePolicy(ResponseCachePolicy::Memory)
    ->create();

$first = $pending->response();  // makes HTTP call
$second = $pending->response(); // returns cached response
```
For streaming, the stream itself cannot be replayed -- calling deltas() a second time will throw a LogicException. However, final() always returns the assembled response, which is stored in the execution object.