
Section mapper

Companion to normalize_batch for structural templates that ship with named-but-empty sections (OBJETIVO, APLICAÇÃO, ...) and rely on heading hierarchy + tables + cell layout instead of explicit {{X}} tokens.

Two modes ship side-by-side:

  • rules engine (mode="rules") — deterministic, free, zero LLM calls. Hardcoded heuristics tuned to Brazilian-PT industrial procedures (Engeman, NR-12 / NR-13). DocStream parity on the first real-world Engeman pair.
  • LLM-driven mapper (mode="llm" / "hybrid") — vendor-agnostic. ONE multimodal LLM call (template rendered as PNG + structural JSON + source content) returns a complete MappingPlan covering header substitutions, section content, paragraph rewrites, table data, and cell-level fills. Validated against:
  • The original Engeman pair (PT-BR industrial).
  • Five synthetic adversarial pairs (English corporate, ABNT academic, bilingual gov form, legal contract, mega-table layout).
  • Two real-world templates downloaded from public Brazilian institution sites (UNIFAP POP — universidade federal; Corentocantins POP — regional nursing council).

When to use it

Use engine.section_mapper.map_sections instead of normalize_batch when:

  • The template has no {{placeholder}} markers — only headings + empty body slots + empty tables.
  • The source carries the same heading taxonomy as the target, possibly under different wording (DESCRIÇÃO → SISTEMÁTICA, ESCOPO → APLICAÇÃO).
  • You want sub-section markers (6.1., 6.2.1.), list markers (a., b., •), and the template's header (document code, author, approver, date, title) populated automatically from the source.

End-to-end pipeline

template.docx ──┬─→ parse_docx ──→ list[DocxSection] (paragraph indices)
                ├─→ detect_default_specs_with_source(template, source) ──→ list[TableSpec]
                └─→ fill_template_header(output, metadata)

source.docx ────┬─→ parse_docx_source ──→ list[TextSection] (numbering resolved)
                └─→ extract_source_metadata ──→ HeaderMetadata
similarity (string / embeddings / llm) ──→ list[HeadingMatch]
        └─→ _build_content_map ──→ dict[target_name -> joined source content]
                ├─→ render_section_content (line-kind aware: subheading bold, nota italic)
                ├─→ fill_tables (header-set match, sub-header row writing)
                ├─→ prune empty body slots + collapse empty paragraph runs
                └─→ fill_template_header (XXXX → IT.PRO.URE.387.0005, TITULO → ...)
                        └─→ SectionMappingReport

map_sections_async is the same flow with the llm similarity tier wired as a final fallback when string + embeddings still under-cover the target.

Modules

| Module | Responsibility |
| --- | --- |
| engine.section_mapper.parser | Heading detection from .docx + plain text. parse_docx (template), parse_docx_source (source with auto-numbering resolution). |
| engine.section_mapper.numbering | NumberingResolver reads word/numbering.xml, walks <w:numPr> paragraphs, returns the rendered marker. |
| engine.section_mapper.similarity | 3-tier matcher: string (zero deps) → embeddings (optional) → llm (when provider supplied). |
| engine.section_mapper.renderer | Multi-line content insertion preserving formatting. Sub-heading detection + bold + spacing. |
| engine.section_mapper.table_filler | Header-set table fill with optional subheaders for templates that have repeated primary headers. |
| engine.section_mapper.auto_tables | Walks template + source; synthesizes TableSpec for canonical empty tables (Histórico Rev/Data/Alteração, Atividades / Responsabilidade). |
| engine.section_mapper.header_filler | Extracts metadata from source header + revision-history table; substitutes XXXX / Rev. 00 / Elaborado: / Aprovado: / Data: / (TITULO) placeholders in the template header. |
| engine.section_mapper.orchestrator | map_sections and map_sections_async glue + SectionMappingReport. |

Parser — heading detection

A heading is detected when a paragraph either:

  1. Has a Word Heading <N> style.
  2. Matches the numbered-heading pattern (1. OBJETIVO, 3.2. Etapas...).
  3. Matches the all-caps unnumbered pattern (OBJETIVO, NORMAS E DOCUMENTOS DE REFERÊNCIA).

Hardening guards (each documented with a regression test):

  • 2+ separators (FAFEN-SE/PR/AM) — rejected.
  • Single-word ≤4 letters (PE, NA, CFM) — rejected.
  • All-caps sentences > 60 chars — rejected.
  • Lines containing : (label syntax EMPRESA: ACME) — rejected.
  • Lines ending in digit (form-field PROTOCOLO 12345) — rejected.
  • Parenthesized labels ((TITULO)) — rejected.
  • Single-token version labels (REV.02, VERSAO_1.0) — rejected.
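The guards above can be sketched as a single predicate for the all-caps unnumbered pattern. This is an illustrative approximation, not the parser's actual internals — the function name and exact thresholds are assumptions:

```python
import re

def is_heading_candidate(text: str) -> bool:
    """Sketch of the all-caps heading guards; thresholds mirror the list above."""
    text = text.strip()
    if not text or not text.isupper():
        return False
    if len(text) > 60:                               # all-caps sentence
        return False
    if text.count("/") + text.count("-") >= 2:       # FAFEN-SE/PR/AM
        return False
    if ":" in text:                                  # EMPRESA: ACME
        return False
    if text[-1].isdigit():                           # PROTOCOLO 12345
        return False
    if text.startswith("(") and text.endswith(")"):  # (TITULO)
        return False
    words = text.split()
    if len(words) == 1:
        if len(text.rstrip(".")) <= 4:               # PE, NA, CFM
            return False
        if re.search(r"\d", text):                   # REV.02, VERSAO_1.0
            return False
    return True
```

Each rejection branch corresponds to one regression-tested guard in the list.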

PDFs commonly emit each heading twice — once in the table of contents, once in the body. The orchestrator deduplicates by richest content per heading name, dropping TOC lines.

Numbering resolver

When the source is a .docx, plain-text extraction loses Word's auto-numbering: <w:numPr> references word/numbering.xml and the marker is rendered at display time, never written into <w:t>. The resolver fixes that:

from pathlib import Path

from docx import Document

from engine.section_mapper.numbering import load_resolver_from_docx, extract_num_pr

doc = Document("dados.docx")
resolver = load_resolver_from_docx(Path("dados.docx"))
for p in doc.paragraphs:
    np = extract_num_pr(p._p.xml)
    if np:
        marker = resolver.marker_for(*np)  # "1.", "5.2.", "a.", "•", ...

State is per-numId; advancing one level resets every deeper level. Faithful to numFmt (decimal, lowerLetter, upperLetter, lowerRoman, upperRoman).

Bullet-as-letters heuristic (default on)

bullet_as_letters=True (default) renders bullets at ilvl=0 as Excel-style letters (a., b., ..., z., aa.). Industrial documents use Wingdings/Symbol bullets internally but expect lettered output. Set bullet_as_letters=False for strictly faithful rendering ("•" for every bullet level).

reset_bullet_counters() is called by the parser whenever a structural decimal heading advances, so each sub-section restarts its lettering at a. instead of continuing across boundaries.
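The Excel-style lettering is bijective base-26 (a. … z., then aa.). A minimal sketch of that marker function (the name is illustrative):

```python
def letter_marker(index: int) -> str:
    """Render a 0-based bullet index as an Excel-style letter marker:
    0 -> 'a.', 25 -> 'z.', 26 -> 'aa.' (bijective base-26)."""
    letters = ""
    n = index + 1
    while n:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("a") + rem) + letters
    return letters + "."
```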

Similarity matcher

Three tiers, ordered by cost:

| Tier | Deps | Speed | Use when |
| --- | --- | --- | --- |
| string | none | µs | Source and target use the same vocabulary; synonym table covers wording variants |
| embeddings | pip install "template-engine-ia[embeddings]" (sentence-transformers, ~80 MB) | ms | Wording diverges across templates (cross-vendor docs) |
| llm | provider supplied | s + $ | Long-tail mappings the heuristics still miss |

Default mode is "auto": string first; falls back to embeddings (when installed) when target coverage is < 60%; the async path adds llm as final tier when a provider is supplied and embeddings still under-cover.

The synonym table covers the common Brazilian-Portuguese industrial taxonomy:

| Canonical | Variants |
| --- | --- |
| OBJETIVO | FINALIDADE, PROPOSITO, FINALIDADES |
| APLICACAO | ESCOPO, AMBITO, ABRANGENCIA, ALCANCE |
| SISTEMATICA | DESCRICAO, PROCEDIMENTO, METODOLOGIA, DETALHAMENTO, EXECUCAO, PROCESSO |
| RESPONSABILIDADE | RESPONSABILIDADES, ATRIBUICOES, REGISTROS, RESPONSABILIDADES E AUTORIDADES |
| HISTORICO | HISTORICO DE REVISOES, CONTROLE DE REVISOES, REVISOES, HISTORICO DE REVISAO |
| DEFINICOES | TERMOS E DEFINICOES, GLOSSARIO, DEFINICOES SIGLAS |
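The string tier boils down to accent-stripped, numbering-stripped lookup through this table. A minimal sketch with a truncated synonym map (the real table is larger; names here are illustrative):

```python
import unicodedata

# abbreviated version of the synonym table above, variant -> canonical
SYNONYMS = {
    "FINALIDADE": "OBJETIVO", "PROPOSITO": "OBJETIVO",
    "ESCOPO": "APLICACAO", "ABRANGENCIA": "APLICACAO",
    "DESCRICAO": "SISTEMATICA", "PROCEDIMENTO": "SISTEMATICA",
}

def canonical(heading: str) -> str:
    """Strip accents and leading numbering, then map through the synonym table."""
    text = unicodedata.normalize("NFKD", heading).encode("ascii", "ignore").decode()
    text = text.upper().strip().lstrip("0123456789. ")
    return SYNONYMS.get(text, text)
```

Two headings match when their canonical forms are equal, which is why ESCOPO content lands under APLICAÇÃO without any LLM call.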

Renderer

Inserts source content under the matched template heading:

  1. Find the heading paragraph in the template.
  2. Locate the first empty body paragraph below it (the anchor).
  3. Drop <w:jc> from the anchor's pPr so a multi-line block does not render as justified columns.
  4. Set the anchor's text to line 1 of the content.
  5. For each remaining line, clone the anchor's <w:p>, clear inner <w:t>, set line N, insert via addnext so paragraph order is preserved.
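Steps 4-5 can be sketched directly against WordprocessingML with lxml (the drop-<w:jc> step is omitted for brevity; the function name is illustrative, not the renderer's actual API):

```python
import copy

from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def fill_anchor(anchor, lines):
    """Write line 1 into the anchor paragraph, then clone the anchor once per
    remaining line, inserting each clone with addnext so order is preserved."""
    anchor.find("w:r/w:t", namespaces=NS).text = lines[0]
    prev = anchor
    for line in lines[1:]:
        clone = copy.deepcopy(anchor)          # keeps the anchor's formatting
        clone.find("w:r/w:t", namespaces=NS).text = line
        prev.addnext(clone)                    # sibling insert, document order kept
        prev = clone
```

Cloning the anchor (rather than creating bare paragraphs) is what preserves the template's run formatting on every inserted line.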

Line-kind decoration (Phase 2)

Each inserted line is classified by its prefix and decorated:

| Prefix | Kind | Decoration |
| --- | --- | --- |
| ^\d+\.\d+\.?\s (e.g. 6.1. Foo) | sub-heading | bold + black + before=240/after=120 twips |
| ^\d+\.\d+\.\d+\.?\s (e.g. 6.2.1.) | sub-sub-heading | bold + black + before=180/after=80 |
| ^Nota\s*\d*[:.]\s | nota | italic |
| anything else | body | unchanged |

Decoration is applied via direct formatting only — no <w:pStyle> reference — because Word's default Título 2/Título 3 styles render blue, which is wrong for industrial-procedure documents that expect black bold sub-headings.
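The classification itself is a first-match walk over the prefix patterns from the table above (the function name is illustrative):

```python
import re

def classify_line(line: str) -> str:
    """First-match classification of an inserted content line by its prefix."""
    if re.match(r"^\d+\.\d+\.\d+\.?\s", line):   # 6.2.1. ...
        return "sub-sub-heading"
    if re.match(r"^\d+\.\d+\.?\s", line):        # 6.1. ...
        return "sub-heading"
    if re.match(r"^Nota\s*\d*[:.]\s", line):     # Nota 1: ...
        return "nota"
    return "body"
```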

Empty-paragraph cleanup

After insertion, two passes prevent the visual gaps the template's blank slots would otherwise leave:

  • Prune unused body slots: walk siblings of every filled anchor; delete empty paragraphs up to the next heading.
  • Collapse empty runs: walk the document body once; collapse any run of 2+ consecutive empty paragraphs to a single empty. Paragraphs inside table cells are left alone (cell layout depends on paragraph count).
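The collapse pass has simple run-length semantics; viewed over a flat list of paragraph texts it looks like this (a sketch — the real pass walks XML elements and skips table cells, both omitted here):

```python
def collapse_empty_runs(paragraph_texts):
    """Collapse every run of 2+ consecutive empty paragraphs into one empty."""
    out = []
    for text in paragraph_texts:
        if text == "" and out and out[-1] == "":
            continue  # second+ empty in a row: drop it
        out.append(text)
    return out
```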

Section-aware post-transforms (Phase 2)

After parsing the source, two section-name-driven content transforms run:

  • Sections named NORMAS / REGISTROS / ANEXOS / DOCUMENTOS DE REFERÊNCIA: every line without a marker gets a leading "• " (reference list auto-bullet).
  • Sections named DEFINIÇÕES: leading "term: " (up to 3 short tokens) is converted to "term – " (en-dash).
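Both transforms are line-local and can be sketched as pure functions (names and the exact "already marked" pattern are illustrative assumptions):

```python
import re

# lines already carrying a marker: "1.", "a.", "•", "-" followed by a space
MARKED = re.compile(r"^(\d+[.)]|[a-z][.)]|•|-)\s")

def auto_bullet(lines):
    """NORMAS / REGISTROS / ANEXOS: prefix unmarked lines with '• '."""
    return [l if MARKED.match(l) else "• " + l for l in lines]

def dash_definitions(lines):
    """DEFINIÇÕES: 'term: meaning' -> 'term – meaning' when term is <= 3 tokens."""
    out = []
    for l in lines:
        head, sep, rest = l.partition(":")
        if sep and 1 <= len(head.split()) <= 3:
            out.append(head.strip() + " – " + rest.strip())
        else:
            out.append(l)
    return out
```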

Tables

fill_tables(template, output, specs) matches each TableSpec to a template table by header set (order-insensitive). Each spec's rows populate empty rows; extra rows are appended.

TableSpec extras:

  • subheaders: list[str] | None — when the template's primary header row repeats values (["Atividades", "Responsabilidade", "Responsabilidade"]), supplying ["", "Gerente Setorial", "Supervisores"] writes those into row 1 and uses them for column mapping.
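The header-set match reduces to set equality over normalized header cells. A sketch (note that set semantics collapse repeated headers, which is exactly why repeated primary headers need the subheaders escape hatch):

```python
def match_table(spec_headers, template_tables):
    """Return the index of the first template table whose header row equals
    the spec's headers as a set (order-insensitive), or None."""
    want = {h.strip().lower() for h in spec_headers}
    for i, headers in enumerate(template_tables):
        if {h.strip().lower() for h in headers} == want:
            return i
    return None
```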

Auto-tables

detect_default_specs_with_source(template, source) synthesizes specs without manual configuration:

  • Histórico de Revisões (Rev. | Data | Alteração): extracts the source's revision-history table (matching any of VERSÃO|DATA|AUTOR|ALTERAÇÕES columns), renumbers from 00, appends a "Migração para o novo modelo padrão" row dated today.
  • Atribuições e Responsabilidades (Atividades | Responsabilidade | Responsabilidade): extracts source paragraphs under Compete à gerência / Compete aos supervisores (or wording variants); each child paragraph becomes a row tagged X in the correct column. Bucket boundaries are detected via <w:numPr> ilvl so the extractor doesn't spill into the next top-level section.

When an auto-table fills the data for a target section (Responsabilidade / Histórico), the orchestrator drops the prose body for that section so the same info doesn't appear twice.

Header filler

extract_source_metadata(source_path) reads the source .docx and gathers:

| Field | Source |
| --- | --- |
| document_code | source word/header*.xml, dotted-decimal code reassembled across run fragmentation (IT.PRO. + U + RE + .387.0005) |
| title | source header, longest all-caps multi-word run that is not the company name or document code |
| version | source header, Ver.: NN / Rev. NN |
| author | source body's revision-history table, AUTOR / REVISOR column, first non-empty data row |
| approver | source header, Aprovador (es): <name> (cut at next page indicator / date) |
| source_date | source body's revision-history table, DATA column, first non-empty data row |

fill_template_header(output_path, metadata) walks every word/header*.xml inside the output docx zip and substitutes:

| Placeholder | Replacement |
| --- | --- |
| XXXX | metadata.document_code |
| Rev. 00 | Rev. <version> |
| Elaborado: | Elaborado: <author> |
| Aprovado: | Aprovado: <approver> |
| Data: | Data: <today_iso> |
| TITULO | metadata.title |

When source metadata for a placeholder is missing, the placeholder is left in place so a downstream reviewer can spot the gap.
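The substitution table, including the leave-in-place rule, can be sketched as a pure function over a single header XML string (the real filler walks every word/header*.xml in the output zip; names here are illustrative):

```python
def substitute_header(xml_text: str, meta: dict) -> str:
    """Apply the placeholder table; skip any rule whose metadata is missing so
    the placeholder survives for a downstream reviewer to spot."""
    rules = [
        ("XXXX", meta.get("document_code")),
        ("Rev. 00", f"Rev. {meta['version']}" if meta.get("version") else None),
        ("Elaborado:", f"Elaborado: {meta['author']}" if meta.get("author") else None),
        ("Aprovado:", f"Aprovado: {meta['approver']}" if meta.get("approver") else None),
        ("TITULO", meta.get("title")),
    ]
    for placeholder, replacement in rules:
        if replacement:
            xml_text = xml_text.replace(placeholder, replacement)
    return xml_text
```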

Document-code reassembly

Source headers fragment a code across many <w:t> runs (IT.PRO. + U + RE + .387.0005) AND glue a company tag in without a word boundary (...TRABALHOIT.PRO.URE.387.0005...).

The extractor builds two flavors of the flat header text — glued (no spacing between runs, so dotted codes stay intact) and spaced (single space between runs, so titles like PARTIDA DA ÁREA DE SÍNTESE followed by Ver.: don't merge into SÍNTESEVer). The prefix [A-Z]{2,3}\.[A-Z]{2,5}\. is located in spaced flavor; a state-machine walk over glued flavor consumes the full code, stopping at the first invalid letter↔digit transition (...0005PARTIDA ends the code at 0005).

Quick example

from pathlib import Path

from engine.section_mapper import map_sections

report = map_sections(
    template_path=Path("template.docx"),
    source_path=Path("source.docx"),
    output_path=Path("output.docx"),
    # similarity_mode="auto" + auto_tables=True are the defaults
)

print(f"mapped {report.mapped_count} sections")
print(f"tables filled: {report.tables_filled}")
print(f"unmapped source: {report.unmapped_source_headings}")
print(f"unfilled target: {report.unfilled_target_headings}")
print(f"orphan placeholders: {report.orphan_paragraphs}")

SectionMappingReport.to_dict() returns a JSON-serializable summary suitable for audit logs.

Operating modes

| Mode | When | Cost (Gemini Flash 2.5) |
| --- | --- | --- |
| rules (default in map_sections) | PT-BR / Engeman style; bit-for-bit reproducibility | $0.0000 |
| llm (map_sections_async(mode="llm", llm=...)) | any vendor / language; needs provider | ~$0.001 |
| hybrid (mode="hybrid", llm=...) | rules first, LLM tops up gaps | ~$0.001 when gaps |

LLM mode end-to-end

import asyncio
from pathlib import Path

from engine.llm.openai_provider import OpenAIProvider
from engine.section_mapper import map_sections_async

async def main() -> None:
    provider = OpenAIProvider(api_key="sk-...", model="gpt-4o", timeout=300.0)
    report = await map_sections_async(
        template_path=Path("template.docx"),
        source_path=Path("source.docx"),
        output_path=Path("output.docx"),
        mode="llm",
        llm=provider,
    )
    print(f"sections in plan: {len(report.matches)}")
    print(f"tables filled: {report.tables_filled}")

asyncio.run(main())

The LLM call returns a MappingPlan covering every detected placeholder (header + body), every template heading, and every empty table. Failure paths fall back to an empty plan so callers can chain a rules-mode retry.

Cross-vendor validation

Five fixture pairs and two real-world public templates exercise the LLM mapper:

| Pair | Domain | Language | Notable shape |
| --- | --- | --- | --- |
| A — Engeman (dados.docx) | industrial procedure | PT-BR | XXXX / (TITULO) / Elaborado: / Atividades \| Responsabilidade \| Responsabilidade |
| B — English corporate | corporate procedure | EN | {{DOC_CODE}} / [Title] / Author: / Activity \| Owner |
| C — ABNT academic | thesis | PT-BR | Title-case <<TITULO_DO_TRABALHO>> / §§§§§ / nested 1.2.1 / __/__/____ |
| D — Bilingual gov form | government form | PT-BR / EN | [______] / < nome > / ___.___.___-__ masks |
| E — Legal contract | contract | PT-BR | parties block (multi-placeholder), numbered clauses 1-6 |
| UNIFAP POP (real-world) | university procedure | PT-BR | Title-case Descrição / Objetivos / XXXXXXXX / contact table |
| Corentocantins POP (real-world) | nursing-council POP | PT-BR | mega-table 20×8 with merged cells |

Result against gpt-4o (mode=llm):

| Pair | Sections | Header subs | Tables | Cell fills | Orphans |
| --- | --- | --- | --- | --- | --- |
| A | 7/7 | 6 | 2 | n/a | 0 |
| B | 7/7 | 5 | 2 | n/a | 0 |
| C | 6/9 | 4 | 2 | n/a | 0 |
| D | 5/5 | 8 | 1 | n/a | 1 |
| E | 7/7 | 7 | 1 | n/a | 0 |
| UNIFAP | 14 plan keys | 12 | 1 | covered | 0 |
| Corentocantins | 4 sections | 5 | 0 | partial (title + procedure rows) | 0 |

Regenerate via:

python scripts/build_vendor_b_fixtures.py
python scripts/build_adversarial_fixtures.py
python scripts/build_real_world_source.py
python scripts/run_adversarial_llm.py
python scripts/run_real_world_llm.py

Multimodal vision

The LLM call attaches PNG renders of the template (up to 3 pages) so the model can SEE merged cells, table geometry, embedded logos. Pipeline:

template.docx ──→ docx2pdf (Word COM / Pages) ──→ template.pdf
template.pdf ──→ PyMuPDF (fitz), per page ──→ base64 PNG data URLs
PNG data URLs ──→ multipart user message ──→ OpenAI vision (gpt-4o)

engine.section_mapper.template_renderer.render_pages(docx_path, max_pages=3) returns list[PageImage]. Both docx2pdf and pymupdf are optional — when missing, the orchestrator falls back to text-only mode (logged at info level). Install via:

pip install docx2pdf pymupdf

Cell-level fills

Mega-table layouts (Corentocantins-style POPs) carry the entire document as one big table. Heading cells and body slot cells live in the same table grid. The rules engine renderer doesn't see inside cells.

LLM-driven mapper adds:

  • TemplateCell(table_index, row, col, text, is_fillable) — every cell of every body table is profiled with a fillability heuristic (empty / imperative-instruction text / XX mask / parenthesised hint / label-no-value / known template defaults like Fulano de Tal).
  • MappingPlan.cell_fills: list[CellFill] — LLM addresses each fillable cell by (table_index, row, col).
  • auto_renderer._apply_cell_fills — writes via cell.text while preserving the first paragraph's run formatting. Mirrors the fill across every sibling cell in the same row that shared the original text (merged-column groups).
  • A deduplicated FILLABLE CELLS YOU MUST ADDRESS checklist is appended to the prompt, grouping merged columns into one logical entry per row, so the LLM no longer thinks it filled them.

Plan validation + retry

After the initial LLM call, _detect_plan_gaps reports:

  • placeholders the LLM left empty
  • template headings empty in the plan when the source mentions a matching keyword
  • empty template tables not addressed in table_data

When gaps exist, a focused retry prompt lists exactly what's missing and asks the LLM to fill ONLY those slots. _merge_plans overlays the retry without erasing existing values. max_retries=1 by default.
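The overlay semantics of _merge_plans reduce to "fill gaps, never erase" — sketched here over plain dicts (the real function operates on MappingPlan objects field by field):

```python
def merge_plans(base: dict, retry: dict) -> dict:
    """Overlay the retry plan onto the base plan: fill empty or missing slots,
    never overwrite a value the first call already produced."""
    merged = dict(base)
    for key, value in retry.items():
        if value and not merged.get(key):
            merged[key] = value
    return merged
```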

Plan cache

engine.section_mapper.plan_cache persists every successful MappingPlan to ${XDG_CACHE_HOME:-~/.cache}/template-engine/plans/ keyed by sha256(template_bytes) + sha256(source_bytes) + PROMPT_VERSION. Same template + source pair → 0 LLM calls. Override the cache directory via TEMPLATE_ENGINE_CACHE_DIR=/path.
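The cache key is a pure function of the two byte streams plus the prompt version — sketched below (the key format and the PROMPT_VERSION value are assumptions; only the three inputs are stated by the docs):

```python
import hashlib

PROMPT_VERSION = "v1"  # assumption: a constant bumped whenever the prompt changes

def plan_cache_key(template_bytes: bytes, source_bytes: bytes) -> str:
    """sha256(template) + sha256(source) + PROMPT_VERSION: any byte change in
    either document, or a prompt revision, invalidates the cached plan."""
    t = hashlib.sha256(template_bytes).hexdigest()
    s = hashlib.sha256(source_bytes).hexdigest()
    return f"{t}-{s}-{PROMPT_VERSION}"
```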

Real-run benchmark (Vendor E, gpt-4o):

| Run | Wall time | LLM calls |
| --- | --- | --- |
| First | ~20 s | 1 (call) + 0-1 (retry) |
| Second (cache hit) | 4.6 s | 0 |

CLI --no-cache skips the cache for one-off runs.

Polymorphic source input

profile_source accepts:

  • Path / str — file path on disk
  • bytes / bytearray — raw docx bytes (written to a temp file)
  • BytesIO / any io.IOBase — read & buffered
  • URL strings (http:// / https://) — downloaded via urllib
  • existing SourceStructure — passed through (idempotent)

The same applies to source paths inside map_sections_async.
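The coercion logic can be sketched as one normalizing function (the SourceStructure passthrough is omitted and the function name is illustrative; everything funnels down to a path on disk):

```python
import io
import tempfile
import urllib.request
from pathlib import Path

def _to_temp(data: bytes) -> Path:
    """Write raw docx bytes to a temp file and return its path."""
    f = tempfile.NamedTemporaryFile(suffix=".docx", delete=False)
    f.write(data)
    f.close()
    return Path(f.name)

def coerce_source(src) -> Path:
    """Normalize the accepted input shapes down to a file path on disk."""
    if isinstance(src, Path):
        return src
    if isinstance(src, str):
        if src.startswith(("http://", "https://")):
            return _to_temp(urllib.request.urlopen(src).read())
        return Path(src)
    if isinstance(src, (bytes, bytearray)):
        return _to_temp(bytes(src))
    if isinstance(src, io.IOBase):
        return _to_temp(src.read())
    raise TypeError(f"unsupported source type: {type(src)!r}")
```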

CLI command

template-engine map-sections \
    --template ./template.docx \
    --source ./source.docx \
    --output ./output.docx \
    --provider openai --api-key "$OPENAI_API_KEY" --model gpt-4o

Auto-picks mode="llm" when a provider is supplied, "rules" otherwise. --no-cache disables the plan cache. --json <path> emits the SectionMappingReport.

Smart-default mode

map_sections_async(... mode=None) (default) auto-picks:

  • llm provider supplied → mode="llm"
  • no provider → mode="rules"

Callers don't need to remember mode flags.

Limits

See REAL-WORLD-LIMITS.md for the full list. Honest call-outs:

rules engine (rules mode)

  • Scanned PDFs are not OCR'd. Use .docx source whenever possible.
  • Multi-column PDFs interleave columns at extraction time; convert to single-column first.
  • Source tables (other than the canonical Histórico / Responsabilidade) come through as flattened text.
  • Synonym table is Brazilian-Portuguese specific. Install the [embeddings] extra for cross-language matching, or supply an LLM provider for the long tail.
  • Sub-section hierarchy (3.2.1.) is preserved as text prefix, not as nested heading anchors.

LLM-driven mapper (LLM mode)

  • Determinism lost — gpt-4o varies slightly across runs. The plan cache mitigates this for repeated pairs but not for first runs.
  • Cost — ~$0.05/doc with gpt-4o, ~$0.001/doc with Gemini Flash 2.5. Cache makes follow-up runs free.
  • Multimodal optional — when docx2pdf (Word COM) or pymupdf is missing, the orchestrator falls back to text-only mode (the only signal is an info-level log line). Install both for best mega-table coverage.
  • Token-window cap — template JSON capped at 30 000 chars, source JSON at 60 000 chars. Very large templates may be truncated.
  • Mega-table body slots with imperative hints still partially resist replacement (Corentocantins rows 2-7). The model preserves cells whose current text combines a numbered heading with a parenthesised hint. Closing this is a prompt-tightening target.

Universal

  • Real-world template variance is endless. Five vendors covered now (synthetic + real). Every new vendor is a new failure-mode discovery exercise; bugs reproduce per-template, not in the abstract.
  • No CI integration test for mode="llm" — the pipeline calls a paid API. Production use requires a smoke run against your own corpus.