# Section mapper

Companion to `normalize_batch` for structural templates that ship with named-but-empty sections (OBJETIVO, APLICAÇÃO, ...) and rely on heading hierarchy + tables + cell layout instead of explicit `{{X}}` tokens.
Two modes ship side-by-side:

- **rules engine** (`mode="rules"`) — deterministic, free, zero LLM calls. Hardcoded heuristics tuned to Brazilian-PT industrial procedures (Engeman, NR-12 / NR-13). DocStream parity on the first real-world Engeman pair.
- **LLM-driven mapper** (`mode="llm"` / `"hybrid"`) — vendor-agnostic. ONE multimodal LLM call (template rendered as PNG + structural JSON + source content) returns a complete `MappingPlan` covering header substitutions, section content, paragraph rewrites, table data, and cell-level fills. Validated against:
    - the original Engeman pair (PT-BR industrial);
    - five synthetic adversarial pairs (English corporate, ABNT academic, bilingual gov form, legal contract, mega-table layout);
    - two real-world templates downloaded from public Brazilian institution sites (UNIFAP POP — federal university; Corentocantins POP — regional nursing council).
## When to use it

Use `engine.section_mapper.map_sections` instead of `normalize_batch` when:

- The template has no `{{placeholder}}` markers — only headings + empty body slots + empty tables.
- The source carries the same heading taxonomy as the target (possibly under different wording: `DESCRIÇÃO` ↔ `SISTEMÁTICA`, `ESCOPO` ↔ `APLICAÇÃO`).
- You want sub-section markers (`6.1.`, `6.2.1.`), list markers (`a.`, `b.`, `•`), and the template's header (document code, author, approver, date, title) populated automatically from the source.
## End-to-end pipeline

```
template.docx ──┬─→ parse_docx ──→ list[DocxSection] (paragraph indices)
                │
                ├─→ detect_default_specs_with_source(template, source) ──→ list[TableSpec]
                │
                └─→ fill_template_header(output, metadata)

source.docx ────┬─→ parse_docx_source ──→ list[TextSection] (numbering resolved)
                │
                └─→ extract_source_metadata ──→ HeaderMetadata
        │
        ▼
similarity (string / embeddings / llm) ─→ list[HeadingMatch]
        │
        ▼
_build_content_map ─→ dict[target_name -> joined source content]
        │
        ▼
render_section_content (line-kind aware: subheading bold, nota italic)
        │
        ▼
fill_tables (header-set match, sub-header row writing)
        │
        ▼
prune empty body slots + collapse empty paragraph runs
        │
        ▼
fill_template_header (XXXX → IT.PRO.URE.387.0005, TITULO → ...)
        │
        ▼
SectionMappingReport
```
`map_sections_async` is the same flow with the `llm` similarity tier wired in as a final fallback when string + embeddings still under-cover the target.
## Modules

| Module | Responsibility |
|---|---|
| `engine.section_mapper.parser` | Heading detection from .docx + plain text. `parse_docx` (template), `parse_docx_source` (source, with auto-numbering resolution). |
| `engine.section_mapper.numbering` | `NumberingResolver` reads `word/numbering.xml`, walks `<w:numPr>` paragraphs, returns the rendered marker. |
| `engine.section_mapper.similarity` | 3-tier matcher: string (zero deps) → embeddings (optional) → llm (when a provider is supplied). |
| `engine.section_mapper.renderer` | Multi-line content insertion preserving formatting. Sub-heading detection + bold + spacing. |
| `engine.section_mapper.table_filler` | Header-set table fill with optional subheaders for templates with repeated primary headers. |
| `engine.section_mapper.auto_tables` | Walks template + source; synthesizes `TableSpec` for canonical empty tables (Histórico Rev/Data/Alteração, Atividades / Responsabilidade). |
| `engine.section_mapper.header_filler` | Extracts metadata from the source header + revision-history table; substitutes `XXXX` / `Rev. 00` / `Elaborado:` / `Aprovado:` / `Data:` / `(TITULO)` placeholders in the template header. |
| `engine.section_mapper.orchestrator` | `map_sections` and `map_sections_async` glue + `SectionMappingReport`. |
## Parser — heading detection

A heading is detected when a paragraph either:

- Has a Word `Heading <N>` style.
- Matches the numbered-heading pattern (`1. OBJETIVO`, `3.2. Etapas...`).
- Matches the all-caps unnumbered pattern (`OBJETIVO`, `NORMAS E DOCUMENTOS DE REFERÊNCIA`).

Hardening guards (each documented with a regression test):

- 2+ separators (`FAFEN-SE/PR/AM`) — rejected.
- Single word of ≤4 letters (`PE`, `NA`, `CFM`) — rejected.
- All-caps sentences longer than 60 chars — rejected.
- Lines containing `:` (label syntax, `EMPRESA: ACME`) — rejected.
- Lines ending in a digit (form field `PROTOCOLO 12345`) — rejected.
- Parenthesized labels (`(TITULO)`) — rejected.
- Single-token version labels (`REV.02`, `VERSAO_1.0`) — rejected.
PDFs commonly emit each heading twice — once in the table of contents, once in the body. The orchestrator deduplicates by richest content per heading name, dropping TOC lines.
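The pattern plus guards can be sketched as a single predicate. This is a simplified illustration of the rules listed above; `looks_like_heading` is an illustrative name, not the parser's actual API, and the real module applies more nuance per guard:

```python
import re

NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")   # "1. OBJETIVO", "3.2. Etapas"


def looks_like_heading(line: str) -> bool:
    line = line.strip()
    if NUMBERED.match(line):
        return True
    # all-caps unnumbered candidate
    if line != line.upper() or not any(c.isalpha() for c in line):
        return False
    # hardening guards: each one rejects a known false positive
    if len(line) > 60:                                  # all-caps sentence
        return False
    if ":" in line:                                     # EMPRESA: ACME
        return False
    if line[-1].isdigit():                              # PROTOCOLO 12345, REV.02
        return False
    if line.startswith("(") and line.endswith(")"):     # (TITULO)
        return False
    if len(line.split()) == 1 and len(line) <= 4:       # PE, NA, CFM
        return False
    if len(re.findall(r"[/-]", line)) >= 2:             # FAFEN-SE/PR/AM
        return False
    return True
```

Ordering matters only in that the numbered pattern short-circuits before the guards; the guards themselves are independent rejections.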
## Numbering resolver

When the source is a .docx, plain-text extraction loses Word's auto-numbering: `<w:numPr>` references `word/numbering.xml` and the marker is rendered at display time, never written into `<w:t>`. The resolver fixes that:
```python
from pathlib import Path

import docx  # python-docx supplies the Document the paragraphs come from

from engine.section_mapper.numbering import extract_num_pr, load_resolver_from_docx

resolver = load_resolver_from_docx(Path("dados.docx"))
doc = docx.Document("dados.docx")
for p in doc.paragraphs:
    np = extract_num_pr(p._p.xml)
    if np:
        marker = resolver.marker_for(*np)  # "1.", "5.2.", "a.", "•", ...
```
State is per-`numId`; advancing one level resets every deeper level. Faithful to `numFmt` (`decimal`, `lowerLetter`, `upperLetter`, `lowerRoman`, `upperRoman`).
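The reset rule reduces to a tiny counter state machine. The sketch below is illustrative only (the real `NumberingResolver` also honors `numFmt` and per-level start values) and renders decimal markers:

```python
class CounterState:
    """Per-numId counters: advancing level n resets every deeper level."""

    def __init__(self) -> None:
        self.counts: dict[int, int] = {}

    def advance(self, ilvl: int) -> str:
        self.counts[ilvl] = self.counts.get(ilvl, 0) + 1
        # "5.2." followed by a new top-level heading must forget the ".2"
        for deeper in [lvl for lvl in self.counts if lvl > ilvl]:
            del self.counts[deeper]
        path = [self.counts.get(lvl, 1) for lvl in range(ilvl + 1)]
        return ".".join(str(n) for n in path) + "."
```

Two `advance(0)` calls yield `1.` then `2.`; an `advance(1)` in between yields `1.1.`, and a later `advance(1)` after the new top-level heading restarts at `2.1.`.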
### Bullet-as-letters heuristic (default on)

`bullet_as_letters=True` (the default) renders bullets at `ilvl=0` as Excel-style letters (`a.`, `b.`, ..., `z.`, `aa.`). Industrial documents use Wingdings/Symbol bullets internally but expect lettered output. Set `bullet_as_letters=False` for strictly faithful rendering (`•` for every bullet level).

`reset_bullet_counters()` is called by the parser whenever a structural decimal heading advances, so each sub-section restarts its lettering at `a.` instead of continuing across section boundaries.
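The Excel-style letter sequence is the usual bijective base-26 conversion; a minimal sketch (`bullet_letter` is an illustrative name):

```python
def bullet_letter(n: int) -> str:
    """1 -> 'a.', 26 -> 'z.', 27 -> 'aa.' (bijective base-26)."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("a") + rem) + label
    return label + "."
```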
## Similarity matcher

Three tiers, ordered by cost:

| Tier | Deps | Speed | Use when |
|---|---|---|---|
| string | none | µs | Source and target use the same vocabulary; the synonym table covers wording variants |
| embeddings | `pip install "template-engine-ia[embeddings]"` (sentence-transformers, ~80 MB) | ms | Wording diverges across templates (cross-vendor docs) |
| llm | provider supplied | s + $ | Long-tail mappings the heuristics still miss |

Default mode is `"auto"`: string runs first; embeddings (when installed) take over when target coverage is below 60%; the async path adds `llm` as a final tier when a provider is supplied and embeddings still under-cover.
The synonym table covers the common Brazilian-Portuguese industrial taxonomy:

| Canonical | Variants |
|---|---|
| OBJETIVO | FINALIDADE, PROPOSITO, FINALIDADES |
| APLICACAO | ESCOPO, AMBITO, ABRANGENCIA, ALCANCE |
| SISTEMATICA | DESCRICAO, PROCEDIMENTO, METODOLOGIA, DETALHAMENTO, EXECUCAO, PROCESSO |
| RESPONSABILIDADE | RESPONSABILIDADES, ATRIBUICOES, REGISTROS, RESPONSABILIDADES E AUTORIDADES |
| HISTORICO | HISTORICO DE REVISOES, CONTROLE DE REVISOES, REVISOES, HISTORICO DE REVISAO |
| DEFINICOES | TERMOS E DEFINICOES, GLOSSARIO, DEFINICOES SIGLAS |
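In spirit, string-tier matching first strips accents and numbering, then consults the table. A minimal sketch with a truncated synonym map (`canonical` and the map contents here are illustrative, not the module's real data):

```python
import unicodedata

SYNONYMS = {
    "FINALIDADE": "OBJETIVO",
    "ESCOPO": "APLICACAO",
    "DESCRICAO": "SISTEMATICA",
    "GLOSSARIO": "DEFINICOES",
}


def canonical(heading: str) -> str:
    # Strip accents (NFKD + drop combining marks), upper-case,
    # trim numbering/punctuation, then look up the synonym table.
    text = unicodedata.normalize("NFKD", heading)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.upper().strip(" .0123456789")
    return SYNONYMS.get(text, text)
```

Accent stripping is what lets `DESCRIÇÃO` in a source heading hit the accent-free `DESCRICAO` key.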
## Renderer

Inserts source content under the matched template heading:

- Find the heading paragraph in the template.
- Locate the first empty body paragraph below it (the anchor).
- Drop `<w:jc>` from the anchor's `pPr` so a multi-line block does not render as justified columns.
- Set the anchor's text to line 1 of the content.
- For each remaining line, clone the anchor's `<w:p>`, clear the inner `<w:t>`, set line N, and insert via `addnext` so paragraph order is preserved.
### Line-kind decoration (Phase 2)

Each inserted line is classified by its prefix and decorated:

| Prefix | Kind | Decoration |
|---|---|---|
| `^\d+\.\d+\.?\s` (e.g. `6.1. Foo`) | sub-heading | bold + black + before=240/after=120 twips |
| `^\d+\.\d+\.\d+\.?\s` (e.g. `6.2.1.`) | sub-sub-heading | bold + black + before=180/after=80 |
| `^Nota\s*\d*[:.]\s` | nota | italic |
| anything else | body | unchanged |
Decoration is applied via direct formatting only — no `<w:pStyle>` reference — because Word's default `Título 2`/`Título 3` styles render blue, which is wrong for industrial-procedure documents that expect black bold sub-headings.
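The classification itself reduces to ordered regex checks, with the sub-sub pattern tried before the sub pattern (the shorter pattern would also match a three-level prefix). A sketch using the table's regexes:

```python
import re

# Order matters: the more specific prefix must win.
LINE_KINDS = [
    ("sub-sub-heading", re.compile(r"^\d+\.\d+\.\d+\.?\s")),
    ("sub-heading", re.compile(r"^\d+\.\d+\.?\s")),
    ("nota", re.compile(r"^Nota\s*\d*[:.]\s")),
]


def classify_line(line: str) -> str:
    for kind, pattern in LINE_KINDS:
        if pattern.match(line):
            return kind
    return "body"
```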
## Empty-paragraph cleanup
After insertion, two passes prevent the visual gaps the template's blank slots would otherwise leave:
- Prune unused body slots: walk siblings of every filled anchor; delete empty paragraphs up to the next heading.
- Collapse empty runs: walk the document body once; collapse any run of 2+ consecutive empty paragraphs to a single empty. Paragraphs inside table cells are left alone (cell layout depends on paragraph count).
## Section-aware post-transforms (Phase 2)

After parsing the source, two section-name-driven content transforms run:

- Sections named `NORMAS` / `REGISTROS` / `ANEXOS` / `DOCUMENTOS DE REFERÊNCIA`: every line without a marker gets a leading `"• "` (reference-list auto-bullet).
- Sections named `DEFINIÇÕES`: a leading `"term: "` (up to 3 short tokens) is converted to `"term – "` (en-dash).
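Both transforms are line-level rewrites. A sketch under the assumption that "marker" means a bullet, decimal, or letter prefix (function names and the marker regex are illustrative):

```python
import re

MARKER = re.compile(r"^(•|\d+\.|[a-z]{1,2}\.)\s")      # "• ", "1. ", "a. "
TERM = re.compile(r"^((?:\w+\s){0,2}\w+):\s")          # up to 3 short tokens


def auto_bullet(line: str) -> str:
    """Reference sections: prefix unmarked lines with a bullet."""
    return line if MARKER.match(line) else "• " + line


def definicoes_dash(line: str) -> str:
    """DEFINIÇÕES sections: 'term: meaning' -> 'term – meaning'."""
    return TERM.sub(r"\1 – ", line, count=1)
```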
## Tables

`fill_tables(template, output, specs)` matches each `TableSpec` to a template table by header set (order-insensitive). Each spec's rows populate empty rows; extra rows are appended.

`TableSpec` extras:

- `subheaders: list[str] | None` — when the template's primary header row repeats values (`["Atividades", "Responsabilidade", "Responsabilidade"]`), supplying `["", "Gerente Setorial", "Supervisores"]` writes those into row 1 and uses them for column mapping.
## Auto-tables

`detect_default_specs_with_source(template, source)` synthesizes specs without manual configuration:

- **Histórico de Revisões** (`Rev. | Data | Alteração`): extracts the source's revision-history table (matching any of `VERSÃO` / `DATA` / `AUTOR` / `ALTERAÇÕES` columns), renumbers from `00`, and appends a `"Migração para o novo modelo padrão"` row dated today.
- **Atribuições e Responsabilidades** (`Atividades | Responsabilidade | Responsabilidade`): extracts source paragraphs under `Compete à gerência` / `Compete aos supervisores` (or wording variants); each child paragraph becomes a row tagged `X` in the correct column. Bucket boundaries are detected via `<w:numPr>` `ilvl` so the extractor doesn't spill into the next top-level section.

When an auto-table fills the data for a target section (Responsabilidade / Histórico), the orchestrator drops the prose body for that section so the same information doesn't appear twice.
## Header filler

`extract_source_metadata(source_path)` reads the source .docx and gathers:

| Field | Source |
|---|---|
| `document_code` | source `word/header*.xml`, dotted-decimal code reassembled across run fragmentation (`IT.PRO.` + `U` + `RE` + `.387.0005`) |
| `title` | source header, longest all-caps multi-word run that is not the company name or document code |
| `version` | source header, `Ver.: NN` / `Rev. NN` |
| `author` | source body's revision-history table, `AUTOR` / `REVISOR` column, first non-empty data row |
| `approver` | source header, `Aprovador (es): <name>` (cut at the next page indicator / date) |
| `source_date` | source body's revision-history table, `DATA` column, first non-empty data row |
`fill_template_header(output_path, metadata)` walks every `word/header*.xml` inside the output docx zip and substitutes:

| Placeholder | Replacement |
|---|---|
| `XXXX` | `metadata.document_code` |
| `Rev. 00` | `Rev. <version>` |
| `Elaborado:` | `Elaborado: <author>` |
| `Aprovado:` | `Aprovado: <approver>` |
| `Data:` | `Data: <today_iso>` |
| `TITULO` | `metadata.title` |
When source metadata for a placeholder is missing, the placeholder is left in place so a downstream reviewer can spot the gap.
### Document-code reassembly

Source headers fragment a code across many `<w:t>` runs (`IT.PRO.` + `U` + `RE` + `.387.0005`) AND glue a company tag in without a word boundary (`...TRABALHOIT.PRO.URE.387.0005...`).

The extractor builds two flavors of the flat header text — glued (no spacing between runs, so dotted codes stay intact) and spaced (a single space between runs, so titles like `PARTIDA DA ÁREA DE SÍNTESE` followed by `Ver.:` don't merge into `SÍNTESEVer`). The prefix `[A-Z]{2,3}\.[A-Z]{2,5}\.` is located in the spaced flavor; a state-machine walk over the glued flavor then consumes the full code, stopping at the first invalid letter↔digit transition (`...0005PARTIDA` ends the code at `0005`).
## Quick example

```python
from pathlib import Path

from engine.section_mapper import map_sections

report = map_sections(
    template_path=Path("template.docx"),
    source_path=Path("source.docx"),
    output_path=Path("output.docx"),
    # similarity_mode="auto" + auto_tables=True are the defaults
)
print(f"mapped {report.mapped_count} sections")
print(f"tables filled: {report.tables_filled}")
print(f"unmapped source: {report.unmapped_source_headings}")
print(f"unfilled target: {report.unfilled_target_headings}")
print(f"orphan placeholders: {report.orphan_paragraphs}")
```
`SectionMappingReport.to_dict()` returns a JSON-serializable summary suitable for audit logs.
## Operating modes

| Mode | When | Cost (Gemini Flash 2.5) |
|---|---|---|
| `rules` (default in `map_sections`) | PT-BR / Engeman style; bit-for-bit reproducibility | $0.0000 |
| `llm` (`map_sections_async(mode="llm", llm=...)`) | any vendor / language; needs a provider | ~$0.001 |
| `hybrid` (`mode="hybrid", llm=...`) | rules first, LLM tops up gaps | ~$0.001 when gaps exist |
## LLM mode end-to-end

```python
import asyncio
from pathlib import Path

from engine.llm.openai_provider import OpenAIProvider
from engine.section_mapper import map_sections_async


async def main() -> None:
    provider = OpenAIProvider(api_key="sk-...", model="gpt-4o", timeout=300.0)
    report = await map_sections_async(
        template_path=Path("template.docx"),
        source_path=Path("source.docx"),
        output_path=Path("output.docx"),
        mode="llm",
        llm=provider,
    )
    print(f"sections in plan: {len(report.matches)}")
    print(f"tables filled: {report.tables_filled}")


asyncio.run(main())
```
The LLM call returns a `MappingPlan` covering every detected placeholder (header + body), every template heading, and every empty table. Failure paths fall back to an empty plan so callers can chain a rules-mode retry.
## Cross-vendor validation

Five fixture pairs and two real-world public templates exercise the LLM mapper:

| Pair | Domain | Language | Notable shape |
|---|---|---|---|
| A — Engeman (`dados.docx`) | industrial procedure | PT-BR | `XXXX` / `(TITULO)` / `Elaborado:` / `Atividades \| Responsabilidade \| Responsabilidade` |
| B — English corporate | corporate procedure | EN | `{{DOC_CODE}}` / `[Title]` / `Author:` / `Activity \| Owner` |
| C — ABNT academic | thesis | PT-BR Title-case | `<<TITULO_DO_TRABALHO>>` / `§§§§§` / nested `1.2.1` / `__/__/____` |
| D — Bilingual gov form | government form | PT-BR / EN | `[______]` / `< nome >` / `___.___.___-__` masks |
| E — Legal contract | contract | PT-BR | parties block (multi-placeholder), numbered clauses 1-6 |
| UNIFAP POP (real-world) | university procedure | PT-BR Title-case | Descrição / Objetivos / `XXXXXXXX` / contact table |
| Corentocantins POP (real-world) | nursing-council POP | PT-BR | mega-table 20×8 with merged cells |
Results against gpt-4o (`mode="llm"`):
| Pair | Sections | Header subs | Tables | Cell fills | Orphans |
|---|---|---|---|---|---|
| A | 7/7 | 6 | 2 | n/a | 0 |
| B | 7/7 | 5 | 2 | n/a | 0 |
| C | 6/9 | 4 | 2 | n/a | 0 |
| D | 5/5 | 8 | 1 | n/a | 1 |
| E | 7/7 | 7 | 1 | n/a | 0 |
| UNIFAP | 14 plan keys | 12 | 1 | covered | 0 |
| Corentocantins | 4 sections | 5 | 0 | partial (title + procedure rows) | 0 |
Regenerate via:

```shell
python scripts/build_vendor_b_fixtures.py
python scripts/build_adversarial_fixtures.py
python scripts/build_real_world_source.py
python scripts/run_adversarial_llm.py
python scripts/run_real_world_llm.py
```
## Multimodal vision

The LLM call attaches PNG renders of the template (up to 3 pages) so the model can SEE merged cells, table geometry, and embedded logos. Pipeline:

```
template.docx ──→ docx2pdf (Word COM / Pages) ──→ template.pdf
                                                       │
                                                       ▼
                                           PyMuPDF (fitz) per page
                                                       │
                                                       ▼
                                           base64 PNG data URLs
                                                       │
                                                       ▼
                                           OpenAI vision (gpt-4o)
                                           multipart user message
```
`engine.section_mapper.template_renderer.render_pages(docx_path, max_pages=3)` returns `list[PageImage]`. Both `docx2pdf` and `pymupdf` are optional — when missing, the orchestrator falls back to text-only mode (logged at info level). Install both (`pip install docx2pdf pymupdf`) for full multimodal support.
## Cell-level fills

Mega-table layouts (Corentocantins-style POPs) carry the entire document as one big table: heading cells and body-slot cells live in the same table grid, and the rules-engine renderer doesn't see inside cells.

The LLM-driven mapper adds:

- `TemplateCell(table_index, row, col, text, is_fillable)` — every cell of every body table is profiled with a fillability heuristic (empty / imperative-instruction text / `XX` mask / parenthesised hint / label-no-value / known template defaults like `Fulano de Tal`).
- `MappingPlan.cell_fills: list[CellFill]` — the LLM addresses each fillable cell by `(table_index, row, col)`.
- `auto_renderer._apply_cell_fills` — writes via `cell.text` while preserving the first paragraph's run formatting, and mirrors the fill across every sibling cell in the same row that shared the original text (merged-column groups).
- A deduplicated `FILLABLE CELLS YOU MUST ADDRESS` checklist is appended to the prompt, grouping merged columns into one logical entry per row, so the LLM no longer thinks it already filled them.
## Plan validation + retry

After the initial LLM call, `_detect_plan_gaps` reports:

- placeholders the LLM left empty;
- template headings empty in the plan even though the source mentions a matching keyword;
- empty template tables not addressed in `table_data`.

When gaps exist, a focused retry prompt lists exactly what's missing and asks the LLM to fill ONLY those slots. `_merge_plans` overlays the retry result without erasing existing values. `max_retries=1` by default.
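The overlay rule is "retry fills only empty slots". Sketched over plain dicts (the real `MappingPlan` is a structured object, so this is illustrative):

```python
def merge_plans(base: dict[str, str], retry: dict[str, str]) -> dict[str, str]:
    """Overlay retry values without erasing existing non-empty ones."""
    merged = dict(base)
    for key, value in retry.items():
        if value and not merged.get(key):  # fill only empty/missing slots
            merged[key] = value
    return merged
```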
## Plan cache

`engine.section_mapper.plan_cache` persists every successful `MappingPlan` to `${XDG_CACHE_HOME:-~/.cache}/template-engine/plans/`, keyed by `sha256(template_bytes) + sha256(source_bytes) + PROMPT_VERSION`. The same template + source pair therefore costs 0 LLM calls on repeat runs. Override the cache directory via `TEMPLATE_ENGINE_CACHE_DIR=/path`.
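The key is a pure function of the two input files plus the prompt version, so any byte change in either file (or a prompt bump) misses the cache. A sketch of the key derivation (the exact layout inside `plan_cache` may differ):

```python
import hashlib
from pathlib import Path

PROMPT_VERSION = "v1"  # assumption: bumped whenever the prompt text changes


def plan_cache_key(template: Path, source: Path) -> str:
    h = hashlib.sha256()
    h.update(hashlib.sha256(template.read_bytes()).digest())
    h.update(hashlib.sha256(source.read_bytes()).digest())
    h.update(PROMPT_VERSION.encode())
    return h.hexdigest()
```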
Real-run benchmark (Vendor E, gpt-4o):
| Run | Wall time | LLM calls |
|---|---|---|
| First | ~20 s | 1 (call) + 0-1 (retry) |
| Second (cache hit) | 4.6 s | 0 |
The CLI flag `--no-cache` skips the cache for one-off runs.
## Polymorphic source input

`profile_source` accepts:

- `Path` / `str` — a file path on disk
- `bytes` / `bytearray` — raw docx bytes (written to a temp file)
- `BytesIO` / any `io.IOBase` — read & buffered
- URL strings (`http://` / `https://`) — downloaded via `urllib`
- an existing `SourceStructure` — passed through (idempotent)

The same applies to source paths inside `map_sections_async`.
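The dispatch can be sketched as a coercion-to-path helper (names are illustrative; the real `profile_source` keeps richer provenance and returns a `SourceStructure`):

```python
import io
import tempfile
import urllib.request
from pathlib import Path


def _to_temp(data: bytes) -> Path:
    tmp = tempfile.NamedTemporaryFile(suffix=".docx", delete=False)
    tmp.write(data)
    tmp.close()
    return Path(tmp.name)


def coerce_source(source) -> Path:
    """Normalize the polymorphic inputs down to a file path on disk."""
    if isinstance(source, Path):
        return source
    if isinstance(source, str):
        if source.startswith(("http://", "https://")):
            with urllib.request.urlopen(source) as resp:  # URL input
                return _to_temp(resp.read())
        return Path(source)
    if isinstance(source, (bytes, bytearray)):
        return _to_temp(bytes(source))
    if isinstance(source, io.IOBase):  # BytesIO and friends
        return _to_temp(source.read())
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```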
## CLI command

```shell
template-engine map-sections \
  --template ./template.docx \
  --source ./source.docx \
  --output ./output.docx \
  --provider openai --api-key "$OPENAI_API_KEY" --model gpt-4o
```

It auto-picks `mode="llm"` when a provider is supplied, `"rules"` otherwise. `--no-cache` disables the plan cache. `--json <path>` emits the `SectionMappingReport`.
## Smart-default mode

`map_sections_async(..., mode=None)` (the default) auto-picks:

- provider supplied → `mode="llm"`
- no provider → `mode="rules"`

Callers don't need to remember mode flags.
## Limits

See `REAL-WORLD-LIMITS.md` for the full list. Honest call-outs:
### rules engine (rules mode)

- Scanned PDFs are not OCR'd. Use a `.docx` source whenever possible.
- Multi-column PDFs interleave columns at extraction time; convert to single-column first.
- Source tables (other than the canonical Histórico / Responsabilidade) come through as flattened text.
- The synonym table is Brazilian-Portuguese specific. Install the `[embeddings]` extra for cross-language matching, or supply an LLM provider for the long tail.
- Sub-section hierarchy (`3.2.1.`) is preserved as a text prefix, not as nested heading anchors.
### LLM-driven mapper (LLM mode)

- Determinism lost — gpt-4o varies slightly across runs. The plan cache mitigates this for repeated pairs but not for first runs.
- Cost — ~$0.05/doc with gpt-4o, ~$0.001/doc with Gemini Flash 2.5. The cache makes follow-up runs free.
- Multimodal is optional — when `docx2pdf` (Word COM) or `pymupdf` is missing, the orchestrator silently falls back to text-only mode. Install both for best mega-table coverage.
- Token-window cap — template JSON is capped at 30 000 chars, source JSON at 60 000 chars. Very large templates may be truncated.
- Mega-table body slots with imperative hints still partially resist replacement (Corentocantins rows 2-7). The model preserves cells whose current text combines a numbered heading with a parenthesised hint. Closing this is a prompt-tightening target.
### Universal

- Real-world template variance is endless. Five vendors are covered so far (synthetic + real); every new vendor is a new failure-mode discovery exercise, and bugs reproduce per-template, not in the abstract.
- No CI integration test for `mode="llm"` — the pipeline calls a paid API. Production use requires a smoke run against your own corpus.