Methodology

This section describes how the pipeline works, what each model and method does, and the known limitations that affect the quality of extracted priorities.

Pipeline Architecture

The pipeline processes governance documents through three stages, implemented as DSPy modules that can be optimised with GEPA (reflective prompt evolution).

Pipeline flow: Source Documents → Ingest & Chunk → 1. Extract → 2. Aggregate → 3. Synthesise
  1. Extract — Each document chunk is processed independently. The model identifies governance priorities, stakeholders, sentiment, and supporting evidence quotes from that section.
  2. Aggregate — All chunk-level extractions are merged into a unified, deduplicated, ranked list of priorities. The model resolves overlaps and assigns importance scores.
  3. Synthesise — The ranked priorities are turned into an executive summary, actionable recommendations, and deliberation questions.
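The three stages above can be sketched as a minimal stdlib data-flow example. This is illustrative only: the real pipeline implements each stage as an LLM-backed DSPy module, and the stub logic, field names, and scoring here are assumptions, not the pipeline's actual behaviour.

```python
from collections import defaultdict

def extract(chunk: str) -> list[dict]:
    """Stage 1 (stubbed): one pass per chunk. Here, any sentence containing
    'should' is treated as a priority, with the sentence as its evidence."""
    return [
        {"priority": s.strip(), "evidence": s.strip()}
        for s in chunk.split(".")
        if "should" in s
    ]

def aggregate(extractions: list[list[dict]]) -> list[dict]:
    """Stage 2: merge chunk-level priorities, deduplicate, rank by frequency."""
    merged = defaultdict(list)
    for chunk_priorities in extractions:
        for p in chunk_priorities:
            merged[p["priority"].lower()].append(p["evidence"])
    ranked = sorted(merged.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [{"priority": k, "evidence": v, "score": len(v)} for k, v in ranked]

def synthesise(unified: list[dict]) -> str:
    """Stage 3: turn the ranked list into a one-line summary."""
    top = "; ".join(p["priority"] for p in unified[:3])
    return f"Top priorities: {top}"

chunks = [
    "The board should publish minutes. Budget figures were noted.",
    "The board should publish minutes. Other business followed.",
]
unified = aggregate([extract(c) for c in chunks])
print(synthesise(unified))  # the duplicated priority is merged, score 2
```

The point of the sketch is the shape of the data between stages: Extract fans out per chunk, Aggregate collapses duplicates into scored priorities, Synthesise consumes only the ranked list.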

Extraction Methods

Two DSPy module types are used, producing meaningfully different extraction behaviour:

dspy.ChainOfThought (CoT)

Explicitly adds a "reasoning" field to the prompt and asks the model to think step-by-step before producing output. Works with any general-purpose LLM. The reasoning is part of the structured output and visible in the response.

Used with: GPT-4o-mini, Claude Sonnet 4, GPT-5.4, Hermes 3 70B

dspy.RLM (Reasoning Language Model)

Relies on the model's native internal reasoning capabilities rather than explicitly prompting for chain-of-thought. Designed for reasoning-native models (o1, o3-mini, DeepSeek R1) that have built-in extended thinking. The reasoning happens internally and is extracted from the API response.

Used with: o3-mini

Model Diagnostics

How each model performed on the current sample. "Evidence match rate" measures what percentage of unified priorities can be traced back to supporting quotes from the source documents.
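As described in the Known Issues below, the current matching is verbatim substring lookup. A sketch of how the rate could be computed under that scheme (the function name and data shapes are assumptions):

```python
def evidence_match_rate(unified_priorities: list[str],
                        chunk_evidence: list[str]) -> float:
    """Fraction of unified priorities whose text appears verbatim
    (case-insensitive substring) in the pool of chunk-level evidence."""
    pool = " ".join(q.lower() for q in chunk_evidence)
    if not unified_priorities:
        return 0.0
    matched = sum(1 for p in unified_priorities if p.lower() in pool)
    return matched / len(unified_priorities)

evidence = ["the council should improve transparency in procurement"]
priorities = [
    "improve transparency in procurement",  # verbatim -> matches
    "strengthen procurement openness",      # rephrased -> misses
]
print(evidence_match_rate(priorities, evidence))  # 0.5
```

Note why rephrasing during aggregation drives this metric to zero: a priority that preserves the meaning but not the wording of its evidence can never match.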

Model    Method  Chunks  Extracted  Unified  Evidence  Evidence Match
gpt-5.4  CoT     316     2128       14       3860      14%

Known Issues

GPT-5.4: Evidence traceability broken (0% match) [High, Open]

GPT-5.4 completely rephrases priorities during the Aggregate step. While it extracts 52 chunk-level priorities with 68 evidence quotes, the 10 unified priorities use entirely new language that doesn't match any extraction. Supporting quotes cannot be displayed for any GPT-5.4 priority.

Fix: Use semantic similarity (embeddings) instead of substring matching to connect unified priorities back to chunk-level evidence; the same approach is already used for cross-model clustering.

RLM / o3-mini: Hyper-synthesis loses granularity [High, Open]

o3-mini with dspy.RLM produces only 3 highly abstract priorities from 10 diverse source chunks. The extended reasoning collapses all nuance into macro themes, losing the specific governance insights that make the pipeline valuable. The extraction step also consistently hits the RLM max iteration limit, suggesting the reasoning chains are too long for structured output extraction.

Fix: Experiment with RLM configuration — higher max_tokens, custom instructions to preserve granularity, or a hybrid approach using RLM for extraction but ChainOfThought for aggregation.

Claude Sonnet 4: Partial evidence match (70%) [Medium, Open]

Claude produces well-synthesised priorities but rephrases 3 of its 10 enough that substring matching fails. The evidence exists but is unreachable for those priorities.

Fix: Same as GPT-5.4 — semantic matching would resolve this.

Hermes 3 70B: Insufficient aggregation [Medium, Open]

Produces 46 priorities from 40 chunk-level extractions — the aggregation step barely deduplicates or synthesises. Many priorities are near-verbatim from individual chunks, including formatting artefacts (trailing quotes, truncated text). High evidence match rate (96%) but low synthesis quality.
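Alongside better prompts, a deterministic near-duplicate filter over the aggregated list could catch this failure mode directly. A sketch (the `dedupe` helper, artefact stripping, and similarity threshold are illustrative assumptions, not part of the pipeline):

```python
from difflib import SequenceMatcher

def dedupe(priorities: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a priority only if it is not highly similar to one already kept.
    Threshold is illustrative; tune against real aggregation output."""
    kept: list[str] = []
    for p in priorities:
        cleaned = p.strip().strip('"')  # strip trailing-quote artefacts
        if not any(SequenceMatcher(None, cleaned.lower(), k.lower()).ratio()
                   >= threshold for k in kept):
            kept.append(cleaned)
    return kept

raw = [
    'Improve budget transparency',
    '"Improve budget transparency',      # near-verbatim chunk duplicate
    'Publish meeting minutes promptly',
]
print(dedupe(raw))  # ['Improve budget transparency', 'Publish meeting minutes promptly']
```

This would not replace LLM synthesis, but it bounds the worst case: 46 outputs from 40 inputs becomes impossible once near-verbatim repeats are collapsed first.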

Fix: GEPA optimisation should improve aggregation prompts. May also benefit from a stronger model for the aggregate step specifically.

Small sample size [Low, By design]

Current comparison uses 10 chunks from 10 sources (out of 154 total chunks from 21 sources). Results may not be representative of the full corpus. This is intentional — the workflow is to iterate on a small sample first, then expand once the best model/method is identified.

No transcript-specific handling [Low, Planned]

The pipeline treats all documents identically. Transcripts, threads, and formal reports may need different chunking strategies and extraction prompts to capture their distinct structures.
