JSON Extraction Strategies¶
InstructorPHP uses multiple strategies to extract JSON from LLM responses, handling edge cases where the LLM returns JSON wrapped in markdown, embedded in surrounding text, or malformed.
Extraction Pipeline¶
When processing an LLM response, InstructorPHP tries multiple extraction strategies in order:
1. Direct Parsing (Try As-Is)¶
Attempts to parse the response directly as JSON.
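A minimal sketch of the direct-parse step, assuming it amounts to a strict json_decode call on the raw response. The helper name tryDirectParse is hypothetical and is not part of InstructorPHP.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of InstructorPHP): returns the decoded
 * value if $response is valid JSON as-is, or null if it is not.
 */
function tryDirectParse(string $response): ?array
{
    try {
        $decoded = json_decode(trim($response), true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException) {
        return null; // not valid JSON, so the next strategy gets a turn
    }
    // Structured output needs an object or array, not a bare scalar.
    return is_array($decoded) ? $decoded : null;
}

var_dump(tryDirectParse('{"name": "John", "age": 30}') !== null);   // bool(true)
var_dump(tryDirectParse('Here you go: {"name": "John"}') !== null); // bool(false)
```

The second call fails because of the leading prose, which is the case the later strategies exist to handle.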
2. Markdown Code Block Extraction¶
Extracts JSON from markdown fenced code blocks:
✅ Extracts the content between the opening json fence and the closing fence
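The markdown step can be approximated with a single regular expression. This is a sketch under assumptions, not the library's findByMarkdown implementation: the function name extractFencedJson and the exact pattern are illustrative.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not InstructorPHP's findByMarkdown): returns the
 * contents of every fenced block labelled "json" (or unlabelled) in $text.
 *
 * @return string[]
 */
function extractFencedJson(string $text): array
{
    // The three-backtick fence is built at runtime so this listing can itself
    // sit inside a fenced block without terminating it early.
    $fence = str_repeat('`', 3);
    // Opening fence, optional "json" label, newline, then a lazy capture
    // that stops at the next fence.
    $pattern = '/' . $fence . '(?:json)?\s*\n(.*?)' . $fence . '/si';
    preg_match_all($pattern, $text, $matches);
    return array_map('trim', $matches[1]);
}

$fence = str_repeat('`', 3);
$response = "Here is the data:\n{$fence}json\n{\"name\": \"John\", \"age\": 30}\n{$fence}\nAnything else?";
print_r(extractFencedJson($response));
```

Returning every fenced block, rather than only the first, lets the caller try each candidate in turn until one parses.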
3. Bracket Matching¶
Finds first { and last } to extract JSON:
LLM response:
The user data is {"name": "John", "age": 30} as extracted from the text.
✅ Extracts from first { to last }
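As an illustration, the first-to-last bracket span can be taken with two string searches. The helper name extractByBrackets is hypothetical; the library's findByBrackets may differ in detail.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the substring from the first "{" to the
 * last "}" in $text, or null if no such pair exists.
 */
function extractByBrackets(string $text): ?string
{
    $start = strpos($text, '{');
    $end = strrpos($text, '}');
    if ($start === false || $end === false || $end < $start) {
        return null;
    }
    return substr($text, $start, $end - $start + 1);
}

echo extractByBrackets('The user data is {"name": "John", "age": 30} as extracted from the text.');
// prints {"name": "John", "age": 30}
```

This approach is cheap but over-captures when the surrounding prose contains braces of its own, which is why a brace-aware pass follows it.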
4. Smart Brace Matching¶
Handles nested braces and escaped quotes:
LLM response:
Here is {"user": {"name": "John \"The Great\"", "age": 30}} extracted.
✅ Correctly handles:
- Nested braces
- Escaped quotes
- String boundaries
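The three behaviours listed above can be sketched as a character scan that tracks nesting depth and string state. This is an illustrative sketch, not the library's findJSONLikeStrings; the name findBalancedObject is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the first balanced {...} span in $text,
 * ignoring braces that appear inside JSON string literals.
 */
function findBalancedObject(string $text): ?string
{
    $start = strpos($text, '{');
    if ($start === false) {
        return null;
    }
    $depth = 0;
    $inString = false;
    $escaped = false;
    for ($i = $start, $n = strlen($text); $i < $n; $i++) {
        $ch = $text[$i];
        if ($inString) {
            if ($escaped) {
                $escaped = false;      // character after a backslash is literal
            } elseif ($ch === '\\') {
                $escaped = true;
            } elseif ($ch === '"') {
                $inString = false;     // unescaped quote closes the string
            }
            continue;
        }
        if ($ch === '"') {
            $inString = true;
        } elseif ($ch === '{') {
            $depth++;
        } elseif ($ch === '}') {
            $depth--;
            if ($depth === 0) {
                return substr($text, $start, $i - $start + 1);
            }
        }
    }
    return null; // braces never balanced
}

echo findBalancedObject('Here is {"user": {"name": "John \"The Great\"", "age": 30}} extracted.');
```

Unlike first-to-last bracket matching, the scan stops at the brace that closes the first object, so trailing text containing braces does not widen the match.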
Parsing Strategies¶
After extraction, multiple parsers attempt to handle malformed JSON:
1. Standard JSON Parser¶
Native json_decode with strict error handling.
2. Resilient Parser¶
Applies automatic repairs before parsing:
- Balance quotes - Adds missing closing quotes
- Remove trailing commas - Fixes {"a": 1,}
- Balance braces - Adds missing } or ]
Malformed JSON:
{"name": "John", "age": 30
Resilient parser repairs:
{"name": "John", "age": 30} // ✅ Added missing }
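The repairs described above could look roughly like this. It is a sketch under assumptions and not the actual ResilientJsonParser; the function names repairJson and dropTrailingComma are hypothetical.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: removes one trailing comma (and the whitespace after it). */
function dropTrailingComma(string $s): string
{
    $trimmed = rtrim($s);
    return str_ends_with($trimmed, ',') ? substr($trimmed, 0, -1) : $s;
}

/**
 * Hypothetical repair routine (not the library's ResilientJsonParser):
 * closes an unterminated string, drops trailing commas, and appends
 * missing closing brackets, tracking string boundaries as it goes.
 */
function repairJson(string $json): string
{
    $out = '';
    $stack = [];        // closers still owed, innermost last
    $inString = false;
    $escaped = false;
    for ($i = 0, $n = strlen($json); $i < $n; $i++) {
        $ch = $json[$i];
        if ($inString) {
            $out .= $ch;
            if ($escaped) {
                $escaped = false;
            } elseif ($ch === '\\') {
                $escaped = true;
            } elseif ($ch === '"') {
                $inString = false;
            }
            continue;
        }
        if ($ch === '"') {
            $inString = true;
        } elseif ($ch === '{') {
            $stack[] = '}';
        } elseif ($ch === '[') {
            $stack[] = ']';
        } elseif ($ch === '}' || $ch === ']') {
            array_pop($stack);
            $out = dropTrailingComma($out);   // comma directly before a closer
        }
        $out .= $ch;
    }
    if ($inString) {
        $out .= '"';                          // balance quotes
    }
    $out = dropTrailingComma($out);           // comma left dangling at the end
    return $out . implode('', array_reverse($stack)); // balance braces
}

echo repairJson('{"name": "John", "age": 30'), PHP_EOL;
echo repairJson('{"name": "Bob", "age": 35, "active": true,}'), PHP_EOL;
```

Tracking string state matters here: a comma or brace inside a quoted value must be left alone, so a plain search-and-replace would not be safe.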
3. Partial JSON Parser¶
Handles incomplete JSON during streaming.
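To illustrate the idea, the sketch below completes a truncated fragment after each streamed chunk and keeps the latest snapshot that decodes. It is an assumption-laden illustration, not the library's PartialJsonParser; completePartialJson and the sample chunks are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch (not the library's PartialJsonParser): closes any open
 * string and any open objects/arrays so a truncated fragment may decode.
 */
function completePartialJson(string $fragment): string
{
    $closers = [];
    $inString = false;
    $escaped = false;
    foreach (str_split($fragment) as $ch) {
        if ($inString) {
            if ($escaped) {
                $escaped = false;
            } elseif ($ch === '\\') {
                $escaped = true;
            } elseif ($ch === '"') {
                $inString = false;
            }
        } elseif ($ch === '"') {
            $inString = true;
        } elseif ($ch === '{') {
            $closers[] = '}';
        } elseif ($ch === '[') {
            $closers[] = ']';
        } elseif ($ch === '}' || $ch === ']') {
            array_pop($closers);
        }
    }
    $completed = $fragment . ($inString ? '"' : '');
    $completed = rtrim(rtrim($completed), ',');   // drop a dangling comma
    return $completed . implode('', array_reverse($closers));
}

// Simulated stream: decode a best-effort snapshot after every chunk.
$chunks = ['{"name": "Jo', 'hn", "tags": ["a", ', '"b"], "age": 30}'];
$buffer = '';
$snapshot = null;
foreach ($chunks as $chunk) {
    $buffer .= $chunk;
    $decoded = json_decode(completePartialJson($buffer), true);
    if (is_array($decoded)) {
        $snapshot = $decoded;   // keep the latest snapshot that decoded
    }
    echo json_encode($snapshot), PHP_EOL;
}
```

Fragments that end mid-key or mid-literal still fail to decode in this sketch; the loop simply keeps the previous snapshot until more data arrives.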
Implementation Details¶
Location: packages/utils/src/Json/JsonParser.php
class JsonParser {
    public function findCompleteJson(string $input): string {
        $extractors = [
            fn($text) => [$text],                            // Direct
            fn($text) => $this->findByMarkdown($text),       // Markdown
            fn($text) => [$this->findByBrackets($text)],     // Brackets
            fn($text) => $this->findJSONLikeStrings($text),  // Smart braces
        ];
        foreach ($extractors as $extractor) {
            foreach ($extractor($input) as $candidate) {
                if ($parsed = $this->tryParse($candidate)) {
                    return json_encode($parsed);
                }
            }
        }
        return '';
    }

    private function tryParse(string $maybeJson): mixed {
        $parsers = [
            fn($json) => json_decode($json, true, 512, JSON_THROW_ON_ERROR),
            fn($json) => (new ResilientJsonParser($json))->parse(),
            fn($json) => (new PartialJsonParser)->parse($json),
        ];
        // ... try each parser
    }
}
Why This Matters¶
LLMs don't always return clean JSON:
- Claude sometimes wraps in markdown
- GPT-4 may add explanations
- Gemini might include partial responses during streaming
- Custom prompts can lead to unexpected formats
Trying several extraction and parsing strategies in sequence lets InstructorPHP recover usable JSON across these cases.
Common Scenarios¶
Scenario 1: LLM Adds Explanation¶
LLM response:
Based on the text, I extracted the following information:
{"name": "John Doe", "age": 30, "email": "john@example.com"}
This represents the user data found in the document.
✅ Strategy 3 (Bracket Matching) extracts the JSON successfully
Scenario 2: Markdown Wrapped Response¶
I've extracted the user information as requested.
✅ Strategy 2 (Markdown Extraction) handles this case
Scenario 3: Malformed JSON¶
LLM response:
{"name": "Bob", "age": 35, "active": true,}
✅ Resilient Parser removes the trailing comma and parses successfully
Scenario 4: Streaming Partial Response¶
✅ Partial Parser completes the truncated response so it can be parsed
Error Handling¶
If all strategies fail, InstructorPHP:
- Returns an empty string from findCompleteJson()
- Triggers a validation error
- Initiates retry mechanism (if configured)
- Provides error feedback to LLM for self-correction
Performance Considerations¶
Extraction overhead:
- Direct parsing: ~0.1ms
- Markdown extraction: ~0.5ms (regex)
- Bracket matching: ~0.2ms (string ops)
- Smart brace matching: ~1-2ms (character iteration)
Most responses succeed on the first strategy (direct parsing).
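The figures above depend on hardware and response size, so it can be worth measuring on your own data. The following is a rough timing sketch; averageMs is a hypothetical helper, and the two measured closures are stand-ins for the direct and bracket strategies rather than the library's own code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical timing helper: average wall-clock cost of $fn in milliseconds.
 */
function averageMs(callable $fn, int $iterations = 1000): float
{
    $start = hrtime(true);                 // nanoseconds, monotonic clock
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / $iterations / 1_000_000;
}

$response = 'The user data is {"name": "John", "age": 30} as extracted from the text.';

// Stand-in for direct parsing (fails on this input, which is still a fair cost to measure).
$direct = averageMs(fn() => json_decode($response, true));

// Stand-in for bracket matching followed by a parse.
$brackets = averageMs(function () use ($response) {
    $start = strpos($response, '{');
    $end = strrpos($response, '}');
    return json_decode(substr($response, $start, $end - $start + 1), true);
});

printf("direct: %.4f ms, brackets: %.4f ms\n", $direct, $brackets);
```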
Custom Content Extractors¶
You can add custom extractors to handle non-standard response formats:
use Cognesy\Instructor\Extraction\Contracts\CanExtractResponse;
use Cognesy\Instructor\Extraction\Data\ExtractionInput;
use Cognesy\Instructor\Extraction\Exceptions\ExtractionException;
class XmlCdataExtractor implements CanExtractResponse
{
    public function extract(ExtractionInput $input): array
    {
        if (preg_match('/<!\[CDATA\[(.*?)\]\]>/s', $input->content, $matches)) {
            $json = trim($matches[1]);
            try {
                $decoded = json_decode($json, associative: true, flags: JSON_THROW_ON_ERROR);
            } catch (\JsonException $e) {
                throw new ExtractionException('Invalid JSON in CDATA', $e);
            }
            if (!is_array($decoded)) {
                throw new ExtractionException('Expected object or array in CDATA');
            }
            return $decoded;
        }
        throw new ExtractionException('No CDATA found');
    }

    public function name(): string
    {
        return 'xml_cdata';
    }
}
Using Custom Extractors¶
Custom extractors work for both sync and streaming responses:
use Cognesy\Instructor\StructuredOutput;
use Cognesy\Instructor\Extraction\Extractors\DirectJsonExtractor;
$result = (new StructuredOutput)
    ->withExtractors(
        new DirectJsonExtractor(),
        new XmlCdataExtractor(),
    )
    ->withResponseClass(User::class)
    ->with(messages: 'Extract user')
    ->get();
The same extractors are automatically used for streaming:
$stream = (new StructuredOutput)
    ->withExtractors(
        new DirectJsonExtractor(),
        new XmlCdataExtractor(),
    )
    ->withResponseClass(User::class)
    ->with(messages: 'Extract user')
    ->stream();
See: Output Formats - Pluggable Extraction for comprehensive documentation and examples.
Related Documentation¶
- Response Models - How schemas are processed
- Validation - What happens after extraction
- Retry Mechanisms - Error handling and retries