Overview
Polyglot is built on a modular, layered architecture that separates concerns and promotes extensibility. Each layer has a clear responsibility, and dependencies flow in one direction -- from the public API down to the HTTP transport.
Understanding these layers will help you extend the library, contribute to its development, or build your own integrations with new LLM providers.
The Four Layers¶
Public Layer¶
This is what application code usually touches. Two facade classes provide a unified interface for all provider interactions:
Inference-- for chat completions and text generationEmbeddings-- for generating vector embeddings
These facades build request objects, delegate execution to runtimes, and return normalized responses regardless of the underlying provider. Both facades follow an immutable, fluent interface pattern -- every method that modifies state returns a new instance, so you can safely branch configurations from a shared base.
Runtime Layer¶
Runtimes assemble the moving parts needed for a provider call. They wire together the configuration, driver, HTTP client, and event dispatcher, and they own the execution lifecycle including retry logic and response caching.
The key classes are:
InferenceRuntime-- coordinates inference execution and createsPendingInferencehandlesEmbeddingsRuntime-- coordinates embeddings execution and createsPendingEmbeddingshandles
Each runtime can be constructed from a config object, a provider, or injected directly. When no HTTP client is provided, the runtime builds a default one via HttpClientBuilder. Runtimes also expose onEvent() and wiretap() methods for hooking into the event system.
Request and Response Layer¶
Requests and responses are normalized into package data objects that are provider-agnostic:
InferenceRequest-- messages, model, tools, tool choice, response format, options, cached context, retry policy, response cache policyInferenceResponse-- content, reasoning content, tool calls, usage, finish reason, raw HTTP response dataPartialInferenceDelta-- a single streaming event delta with content, reasoning content, tool call fragments, finish reason, and usageEmbeddingsRequest-- input texts, model, options, retry policyEmbeddingsResponse-- vectors and usage
These objects isolate your application from provider-specific response shapes. Both request types support immutable with*() mutators for building modified copies.
Driver Layer¶
Drivers translate Polyglot requests into provider-native HTTP payloads and normalize the results back. Each driver implements CanProcessInferenceRequest (for inference) or CanHandleVectorization (for embeddings) and is composed of smaller adapter responsibilities:
- Request adapters (
CanTranslateInferenceRequest) -- convertInferenceRequestinto anHttpRequest - Response adapters (
CanTranslateInferenceResponse) -- convert rawHttpResponsedata intoInferenceResponseor stream ofPartialInferenceDelta - Message formatters (
CanMapMessages) -- map typedMessagesto provider-specific structures, composing aMessageMapperutility for iteration - Body formatters (
CanMapRequestBody) -- assemble the full request body with mode-specific adjustments - Usage formatters (
CanMapUsage) -- extract token usage from provider responses
Most inference drivers extend BaseInferenceRequestDriver, which provides the standard HTTP execution flow and stream handling. Provider-specific classes like OpenAIDriver, AnthropicDriver, and GeminiDriver compose the appropriate adapters and formatters for their API.
How the Layers Connect¶
+---------------------+ +---------------------+
| Inference | | Embeddings | Public Layer
+---------------------+ +---------------------+
| |
+---------------------+ +---------------------+
| InferenceRuntime | | EmbeddingsRuntime | Runtime Layer
+---------------------+ +---------------------+
| |
+---------------------+ +---------------------+
| InferenceRequest | | EmbeddingsRequest |
| PendingInference | | PendingEmbeddings | Request/Response
| InferenceResponse | | EmbeddingsResponse | Layer
+---------------------+ +---------------------+
| |
+---------------------+ +---------------------+
| Inference Drivers | | Embeddings Drivers | Driver Layer
| (OpenAI, Anthropic, | | (OpenAI, Cohere, |
| Gemini, etc.) | | Gemini, etc.) |
+---------------------+ +---------------------+
| |
+------------------------------------------------+
| HTTP Client (shared) | Transport
+------------------------------------------------+
The public facade creates a request and hands it to the runtime. The runtime delegates to a driver, which translates the request into an HTTP call and normalizes the response. Events are dispatched at each stage for observability. The result flows back up as a normalized data object.
Key Design Decisions¶
Immutability. Both the public facades and the request/response objects are immutable. Calling withMessages() or withModel() always returns a new instance rather than modifying the original. This makes it safe to reuse a configured Inference or Embeddings instance across multiple concurrent calls.
Lazy execution. Calling create() on a facade returns a PendingInference or PendingEmbeddings handle without triggering the HTTP call. Execution is deferred until the application reads from the handle via get(), response(), or stream().
Driver registry. Inference drivers are resolved through InferenceDriverRegistry, which maps string names (like 'openai' or 'anthropic') to driver factory functions. Embeddings drivers use EmbeddingsDriverFactory with a similar pattern. Both support registering custom drivers at runtime.
Provider-agnostic data. The InferenceResponse and EmbeddingsResponse objects present a uniform shape regardless of which provider produced them. Provider-specific details are accessible through responseData() when needed, but the primary accessors (content(), toolCalls(), usage(), etc.) work identically across all providers.