Methodology
How the pipeline works, what each model and method does, and known limitations that affect the quality of extracted priorities.
Pipeline Architecture
The pipeline processes governance documents through three stages, implemented as DSPy modules that can be optimised with GEPA (reflective prompt evolution).
- Extract — Each document chunk is processed independently. The model identifies governance priorities, stakeholders, sentiment, and supporting evidence quotes from that section.
- Aggregate — All chunk-level extractions are merged into a unified, deduplicated, ranked list of priorities. The model resolves overlaps and assigns importance scores.
- Synthesise — The ranked priorities are turned into an executive summary, actionable recommendations, and deliberation questions.
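The data flow through the three stages can be sketched with plain functions over toy records. This is an illustration of the shapes passed between stages only — the function names and record fields are hypothetical, not the project's actual DSPy signatures, and the "model" logic is faked:

```python
# Illustrative data flow for the three-stage pipeline.
# Record shapes and ranking logic are hypothetical stand-ins for LLM calls.

def extract(chunk: str) -> list[dict]:
    """Extract: identify priorities and evidence from one chunk, independently."""
    # A real implementation calls an LLM; here we fake one priority per chunk.
    return [{"priority": chunk.split(".")[0], "evidence": [chunk]}]

def aggregate(extractions: list[dict]) -> list[dict]:
    """Aggregate: merge, deduplicate, and rank chunk-level priorities."""
    seen: dict[str, dict] = {}
    for item in extractions:
        merged = seen.setdefault(item["priority"],
                                 {"priority": item["priority"], "evidence": []})
        merged["evidence"].extend(item["evidence"])
    # Rank by evidence count as a stand-in for model-assigned importance scores.
    return sorted(seen.values(), key=lambda p: len(p["evidence"]), reverse=True)

def synthesise(unified: list[dict]) -> str:
    """Synthesise: turn ranked priorities into an executive summary."""
    return "Top priorities: " + "; ".join(p["priority"] for p in unified)

chunks = ["Improve transparency. More detail...",
          "Improve transparency. Again...",
          "Fund audits. Detail..."]
extractions = [e for c in chunks for e in extract(c)]
unified = aggregate(extractions)
print(synthesise(unified))  # Top priorities: Improve transparency; Fund audits
```

The key structural point is that Extract sees each chunk in isolation, so deduplication can only happen in Aggregate — which is where several of the issues below originate.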
Extraction Methods
Two DSPy module types are used, producing meaningfully different extraction behaviour:
dspy.ChainOfThought (CoT)
Explicitly adds a "reasoning" field to the prompt and asks the model to think step-by-step before producing output. Works with any general-purpose LLM. The reasoning is part of the structured output and visible in the response.
Used with: GPT-4o-mini, Claude Sonnet 4, GPT-5.4, Hermes 3 70B
dspy.RLM (Reasoning Language Model)
Relies on the model's native internal reasoning capabilities rather than explicitly prompting for chain-of-thought. Designed for reasoning-native models (o1, o3-mini, DeepSeek R1) that have built-in extended thinking. The reasoning happens internally and is extracted from the API response.
Used with: o3-mini
Model Diagnostics
How each model performed on the current sample. "Evidence match rate" measures what percentage of unified priorities can be traced back to supporting quotes from the source documents.
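The current implementation of this metric is plain substring matching. A minimal sketch (function name and inputs are illustrative, not the project's actual code):

```python
def evidence_match_rate(unified_priorities: list[str], evidence_quotes: list[str]) -> float:
    """Fraction of unified priorities whose text appears verbatim inside at
    least one chunk-level evidence quote (case-insensitive substring match)."""
    if not unified_priorities:
        return 0.0
    matched = sum(
        1 for p in unified_priorities
        if any(p.lower() in q.lower() for q in evidence_quotes)
    )
    return matched / len(unified_priorities)

quotes = ["The board must improve budget transparency next year."]
print(evidence_match_rate(["improve budget transparency"], quotes))  # 1.0
print(evidence_match_rate(["greater fiscal openness"], quotes))      # 0.0 — rephrased, so the match fails
```

Substring matching is exact by design, which is also its weakness: any model that rephrases a priority during aggregation scores 0 for that priority even when supporting evidence exists.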
| Model | Method | Chunks processed | Chunk-level priorities | Unified priorities | Evidence quotes | Evidence match |
|---|---|---|---|---|---|---|
| gpt-5.4 | CoT | 316 | 2128 | 14 | 3860 | 0% |
Known Issues
GPT-5.4: Evidence traceability broken (0% match)
GPT-5.4 completely rephrases priorities during the Aggregate step. While it extracts 52 chunk-level priorities with 68 evidence quotes, the 10 unified priorities use entirely new language that doesn't match any extraction. Supporting quotes cannot be displayed for any GPT-5.4 priority.
Fix: Use semantic similarity (embeddings) instead of substring matching to connect unified priorities back to chunk-level evidence. The same approach already used for cross-model clustering.
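A dependency-free sketch of the idea: score each (priority, quote) pair by similarity and accept the best match above a threshold. A real fix would use embedding vectors from an embedding model; here cosine over bag-of-words token counts stands in, and the threshold of 0.3 is an arbitrary illustrative choice:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def bow(text: str) -> Counter:
    """Toy stand-in for an embedding: lower-cased token counts."""
    return Counter(text.lower().split())

def best_evidence(priority: str, quotes: list[str], threshold: float = 0.3):
    """Link a (possibly rephrased) priority to its closest chunk-level quote,
    if any quote clears the similarity threshold."""
    scored = [(cosine(bow(priority), bow(q)), q) for q in quotes]
    score, quote = max(scored)
    return quote if score >= threshold else None

quotes = ["we must improve transparency in budget reporting",
          "schools need more funding for audits"]
# Substring matching fails on this rephrasing; similarity still links it:
print(best_evidence("improve budget transparency", quotes))
```

Because the linking is score-based rather than exact, the same function handles the fully rephrased GPT-5.4 output and the partially rephrased Claude output without special cases.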
RLM / o3-mini: Hyper-synthesis loses granularity
o3-mini with dspy.RLM produces only 3 highly abstract priorities from 10 diverse source chunks. The extended reasoning collapses all nuance into macro themes, losing the specific governance insights that make the pipeline valuable. The extraction step also consistently hits the RLM max iteration limit, suggesting the reasoning chains are too long for structured output extraction.
Fix: Experiment with RLM configuration — higher max_tokens, custom instructions to preserve granularity, or a hybrid approach using RLM for extraction but ChainOfThought for aggregation.
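One way to express the hybrid approach is a per-stage module map that the pipeline resolves when it is built. The stage names, registry shape, and `max_iters` knob below are all hypothetical illustrations of the idea, not the project's actual configuration:

```python
# Hypothetical per-stage configuration for the hybrid fix: a reasoning-native
# module for chunk extraction, explicit chain-of-thought for the later stages.
STAGE_MODULES = {
    "extract":    {"module": "RLM", "max_iters": 40},   # raised iteration cap (assumed knob)
    "aggregate":  {"module": "ChainOfThought"},         # CoT for the structured merge step
    "synthesise": {"module": "ChainOfThought"},
}

def module_for(stage: str) -> str:
    """Look up which module type a pipeline stage should be built with."""
    return STAGE_MODULES[stage]["module"]

print(module_for("extract"), module_for("aggregate"))  # RLM ChainOfThought
```

Keeping the mapping in one place makes the RLM-vs-CoT experiments a configuration change rather than a code change.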
Claude Sonnet 4: Partial evidence match (70%)
Claude produces well-synthesised priorities but rephrases 3 of the 10 enough that substring matching fails. The evidence exists but is unreachable for those priorities.
Fix: Same as GPT-5.4 — semantic matching would resolve this.
Hermes 3 70B: Insufficient aggregation
Produces 46 priorities from 40 chunk-level extractions — the aggregation step barely deduplicates or synthesises. Many priorities are near-verbatim from individual chunks, including formatting artefacts (trailing quotes, truncated text). High evidence match rate (96%) but low synthesis quality.
Fix: GEPA optimisation should improve aggregation prompts. May also benefit from a stronger model for the aggregate step specifically.
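Until the prompts improve, a cheap deterministic pre-pass can collapse near-verbatim duplicates and strip the formatting artefacts before (or after) the model's aggregation. A sketch using token-set Jaccard similarity — the 0.6 threshold and whitespace tokenisation are illustrative choices, not tuned values:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two priority strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def dedupe(priorities: list[str], threshold: float = 0.6) -> list[str]:
    """Greedy near-duplicate removal: keep a priority only if it is not
    too similar to any priority already kept."""
    kept: list[str] = []
    for p in priorities:
        # Strip the formatting artefacts noted above (e.g. trailing quotes).
        p = p.strip().strip('"')
        if all(jaccard(p, k) < threshold for k in kept):
            kept.append(p)
    return kept

raw = ['Increase community consultation',
       'Increase community consultation"',   # near-verbatim duplicate with artefact
       'Fund independent audits']
print(dedupe(raw))  # ['Increase community consultation', 'Fund independent audits']
```

This only removes surface-level duplicates; genuine synthesis (merging related but differently worded priorities) still has to come from a better aggregation prompt or a stronger model.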
Small sample size
Current comparison uses 10 chunks from 10 sources (out of 154 total chunks from 21 sources). Results may not be representative of the full corpus. This is intentional — the workflow is to iterate on a small sample first, then expand once the best model/method is identified.
No transcript-specific handling
The pipeline treats all documents identically. Transcripts, threads, and formal reports may need different chunking strategies and extraction prompts to capture their distinct structures.