# JSON Extraction

LLMs do not always return clean JSON. Responses may arrive wrapped in markdown code blocks, surrounded by explanatory text, or with minor formatting errors such as trailing commas or unbalanced braces. Instructor includes a multi-strategy extraction pipeline that handles these edge cases transparently.

## Extraction Pipeline

When processing an LLM response, Instructor tries multiple extraction strategies in order until one succeeds.

### 1. Direct JSON Parsing

The response content is parsed directly as JSON. This handles the common case where the LLM returns a well-formed JSON object.

LLM response:
{"name": "John", "age": 30}

Result: Parsed successfully
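
As a rough illustration of this first strategy, the sketch below decodes the content strictly and reports failure instead of guessing. The function name `tryDirectJson` is hypothetical and is not part of Instructor's API; the built-in extractor may differ in detail.

```php
<?php
// Hypothetical helper for illustration only; not Instructor's implementation.
function tryDirectJson(string $content): ?array
{
    try {
        // JSON_THROW_ON_ERROR turns syntax errors into a JsonException.
        $decoded = json_decode(trim($content), true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        return null; // not valid JSON; a later strategy would take over
    }

    // Scalars such as 42 or "text" are valid JSON but carry no structured data.
    return is_array($decoded) ? $decoded : null;
}

$clean = tryDirectJson('{"name": "John", "age": 30}');      // ['name' => 'John', 'age' => 30]
$wrapped = tryDirectJson('Here is the data: {"name": "John"}'); // null
```

Returning `null` rather than throwing keeps the sketch short; a chain of strategies would simply move on to the next one.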

### 2. Markdown Code Block Extraction

Extracts JSON from fenced code blocks. Some providers (particularly Claude) tend to wrap JSON responses in markdown.

LLM response:
Here's the data you requested:

```json
{"name": "John", "age": 30}
```

Result: Content extracted from between the `` ```json `` and `` ``` `` markers
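
One way to picture this strategy is a single regular expression that captures whatever sits between the opening and closing fence. This is a minimal sketch under that assumption; `extractFromMarkdownFence` is a hypothetical name, and the actual `MarkdownBlockExtractor` may use a different pattern.

```php
<?php
// Hypothetical helper for illustration only; not Instructor's implementation.
function extractFromMarkdownFence(string $content): ?string
{
    // Opening fence with an optional "json" tag, then a lazy capture up to the closing fence.
    $pattern = '/```(?:json)?\s*(.*?)\s*```/si';

    if (preg_match($pattern, $content, $matches) !== 1) {
        return null; // no fenced block found
    }

    return $matches[1];
}

$response = "Here's the data you requested:\n\n```json\n"
    . '{"name": "John", "age": 30}'
    . "\n```";

$json = extractFromMarkdownFence($response); // '{"name": "John", "age": 30}'
```

The extracted string still has to be decoded as JSON afterwards; the sketch only isolates the candidate text.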

### 3. Bracket Matching

Finds the first `{` and last `}` in the response to extract JSON from surrounding text.

```text
LLM response:
The user data is {"name": "John", "age": 30} as extracted from the text.

Result: JSON extracted from first { to last }
```
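
The first-to-last bracket rule can be sketched with two string searches. The helper name `extractBetweenBrackets` is hypothetical and the code is an illustration of the idea, not the library's `BracketMatchingExtractor`.

```php
<?php
// Hypothetical helper for illustration only; not Instructor's implementation.
function extractBetweenBrackets(string $content): ?string
{
    $start = strpos($content, '{');   // first opening brace
    $end = strrpos($content, '}');    // last closing brace

    if ($start === false || $end === false || $end < $start) {
        return null; // no usable brace pair
    }

    return substr($content, $start, $end - $start + 1);
}

$text = 'The user data is {"name": "John", "age": 30} as extracted from the text.';
$json = extractBetweenBrackets($text); // '{"name": "John", "age": 30}'
```

Because this approach always stretches to the last `}`, a stray closing brace in the trailing text would be swept into the result, which is the kind of case a depth-aware matcher handles better.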

### 4. Smart Brace Matching

Handles complex cases with nested braces and escaped quotes inside string values.

LLM response:
Here is {"user": {"name": "John \"The Great\"", "age": 30}} extracted.

Result: Correctly handles nested braces and escaped quotes
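
A depth-counting scanner is one plausible way to get this behavior: it tracks whether the cursor is inside a string, skips escaped characters, and stops at the brace that closes the first object. The function `extractBalancedObject` below is a hypothetical sketch, not the code of `SmartBraceExtractor`.

```php
<?php
// Hypothetical helper for illustration only; not Instructor's implementation.
function extractBalancedObject(string $content): ?string
{
    $start = strpos($content, '{');
    if ($start === false) {
        return null;
    }

    $depth = 0;
    $inString = false;
    $escaped = false;
    $length = strlen($content);

    for ($i = $start; $i < $length; $i++) {
        $char = $content[$i];

        if ($inString) {
            if ($escaped) {
                $escaped = false;          // this character was escaped; ignore it
            } elseif ($char === '\\') {
                $escaped = true;           // next character is escaped
            } elseif ($char === '"') {
                $inString = false;         // unescaped quote ends the string
            }
            continue;                      // braces inside strings do not count
        }

        if ($char === '"') {
            $inString = true;
        } elseif ($char === '{') {
            $depth++;
        } elseif ($char === '}') {
            $depth--;
            if ($depth === 0) {
                return substr($content, $start, $i - $start + 1);
            }
        }
    }

    return null; // the object never closed
}

$text = 'Here is {"user": {"name": "John \"The Great\"", "age": 30}} extracted.';
$json = extractBalancedObject($text);
```

Unlike the first-to-last approach, this scanner ignores any braces that appear after the object has closed or inside string values.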

## Resilient Parsing

After extraction, if the standard `json_decode` call fails, Instructor applies automatic repairs before parsing:

- Balance quotes: adds missing closing quotes
- Remove trailing commas: fixes `{"a": 1,}` patterns
- Balance braces: adds missing `}` or `]` characters

This is especially valuable during streaming, where partial JSON chunks arrive before the response is complete. A dedicated partial JSON parser handles incomplete data by filling in null values for missing fields.
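
The three repairs listed above can be sketched in a single pass over the text. This is a simplified illustration under stated assumptions: `repairJson` is a hypothetical name, the code is not Instructor's resilient parser, and it does not fill in `null` for a key that has no value yet, which the partial JSON parser is described as doing.

```php
<?php
// Hypothetical helper for illustration only; not Instructor's implementation.
function repairJson(string $json): string
{
    $out = '';
    $stack = [];          // closing characters still owed, innermost last
    $inString = false;
    $escaped = false;
    $length = strlen($json);

    for ($i = 0; $i < $length; $i++) {
        $char = $json[$i];

        if ($inString) {
            $out .= $char;
            if ($escaped) {
                $escaped = false;
            } elseif ($char === '\\') {
                $escaped = true;
            } elseif ($char === '"') {
                $inString = false;
            }
            continue;
        }

        if ($char === '"') {
            $inString = true;
        } elseif ($char === '{') {
            $stack[] = '}';
        } elseif ($char === '[') {
            $stack[] = ']';
        } elseif ($char === '}' || $char === ']') {
            // Remove a trailing comma that sits directly before a closer.
            $out = preg_replace('/,\s*\z/', '', $out);
            array_pop($stack);
        }
        $out .= $char;
    }

    if ($inString) {
        $out .= '"';                                  // balance quotes
    }
    $out = preg_replace('/,\s*\z/', '', $out);        // comma left at the cut-off point

    return $out . implode('', array_reverse($stack)); // balance braces and brackets
}

$fixedComma = repairJson('{"a": 1,}');
$fixedBraces = repairJson('{"name": "John", "tags": ["a", "b"');
$fixedQuote = repairJson('{"name": "Jo');
```

Each repaired string decodes with a plain `json_decode` call, which is the point of running the repairs before giving up on a response or a streamed chunk.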

## Default Extractors

The built-in extractor chain includes these extractors, tried in order:

| Extractor | Purpose |
|-----------|---------|
| `DirectJsonExtractor` | Parse content directly as JSON |
| `ResilientJsonExtractor` | Handle malformed JSON (trailing commas, unbalanced braces) |
| `MarkdownBlockExtractor` | Extract from `` ```json `` fenced blocks |
| `BracketMatchingExtractor` | Find first `{` to last `}` |
| `SmartBraceExtractor` | Handle nested braces and escaped quotes in strings |

Most responses succeed on the first strategy. The subsequent strategies add negligible overhead and only activate when needed.

## Custom Extractors

You can replace the default extractor with your own by calling `withExtractor()` on the `StructuredOutputRuntime`. Use `ResponseExtractor::fromExtractors()` to compose multiple extractors into a chain.

use Cognesy\Instructor\Extraction\Contracts\CanExtractResponse;
use Cognesy\Instructor\Extraction\Data\ExtractionInput;
use Cognesy\Instructor\Extraction\Exceptions\ExtractionException;

class XmlCdataExtractor implements CanExtractResponse
{
    public function extract(ExtractionInput $input): array
    {
        // Capture the CDATA payload; the /s flag lets the payload span multiple lines.
        if (!preg_match('/<!\[CDATA\[(.*?)\]\]>/s', $input->content, $matches)) {
            throw new ExtractionException('No CDATA found');
        }

        $json = trim($matches[1]);

        // Decode strictly so malformed JSON surfaces as an ExtractionException.
        try {
            $decoded = json_decode($json, associative: true, flags: JSON_THROW_ON_ERROR);
        } catch (\JsonException $e) {
            throw new ExtractionException('Invalid JSON in CDATA', $e);
        }

        // Scalars are valid JSON but cannot populate a response model.
        if (!is_array($decoded)) {
            throw new ExtractionException('Expected object or array in CDATA');
        }

        return $decoded;
    }

    public function name(): string
    {
        return 'xml_cdata';
    }
}

### Using Custom Extractors

Custom extractors are configured on the runtime and apply to both synchronous and streaming responses.

use Cognesy\Instructor\StructuredOutput;
use Cognesy\Instructor\StructuredOutputRuntime;
use Cognesy\Instructor\Extraction\Extractors\DirectJsonExtractor;
use Cognesy\Instructor\Extraction\ResponseExtractor;

$runtime = StructuredOutputRuntime::fromDefaults()
    ->withExtractor(ResponseExtractor::fromExtractors(
        new DirectJsonExtractor(),
        new XmlCdataExtractor(),
    ));

$result = (new StructuredOutput($runtime))
    ->with(messages: 'Extract user data', responseModel: User::class)
    ->get();

The extractors are tried in the order you provide them. When an extractor throws an `ExtractionException`, the next extractor in the chain is attempted. If all extractors fail, Instructor returns an empty result, triggers a validation error, and initiates the retry mechanism (if configured).
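
The fallback behavior described here can be pictured with plain closures. Everything in this sketch is a stand-in for illustration: `SketchExtractionException`, `runExtractorChain`, and the two closures are hypothetical names and do not reproduce `ResponseExtractor` or the library's exception class.

```php
<?php
// Stand-in exception for illustration; the real ExtractionException belongs to Instructor.
class SketchExtractionException extends RuntimeException {}

/**
 * Try each extractor in order and return the first successful result.
 *
 * @param list<callable(string): array> $extractors
 */
function runExtractorChain(array $extractors, string $content): array
{
    foreach ($extractors as $extract) {
        try {
            return $extract($content);
        } catch (SketchExtractionException $e) {
            continue; // this strategy failed; fall through to the next one
        }
    }

    return []; // every extractor failed: empty result
}

$direct = function (string $content): array {
    $decoded = json_decode($content, true);
    if (!is_array($decoded)) {
        throw new SketchExtractionException('Not valid JSON');
    }
    return $decoded;
};

$brackets = function (string $content) use ($direct): array {
    $start = strpos($content, '{');
    $end = strrpos($content, '}');
    if ($start === false || $end === false || $end < $start) {
        throw new SketchExtractionException('No braces found');
    }
    return $direct(substr($content, $start, $end - $start + 1));
};

$chain = [$direct, $brackets];
$found = runExtractorChain($chain, 'Result: {"name": "John"} done'); // ['name' => 'John']
$empty = runExtractorChain($chain, 'no json here');                   // []
```

Signalling failure with an exception lets each extractor stay unaware of the others, so reordering the chain only means reordering the array.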

## Error Handling

When extraction fails across all strategies, Instructor follows this sequence:

1. Returns an empty array from the extraction pipeline
2. Triggers a validation error on the deserialized object
3. If retries are configured, sends the error feedback to the LLM for self-correction
4. Repeats until the retry limit is reached or extraction succeeds