From corpus to cognition, reproducibly
A comparable multilingual corpus, a frozen annotation grid, a validated AI pipeline, and an optional reader study — every choice justified in writing, every automatic measure validated against human judgement.
Corpus design
| Component | Specification |
|---|---|
| Comparable corpus | 100–300 articles per language, same events, publication within a defined window (e.g. ±72 h), full texts accessible, balanced genres (hard news, analysis, wire copy). |
| Parallel sub-corpus | Source/translation or source/rewrite pairs (e.g. wire dispatch → localised article), explicitly flagged; target ≥ 30 pairs per direction. |
| Gold annotation subset | 20–30 articles per language fully hand-annotated; the benchmark against which every AI output is validated (H5). |
Four code families
LIN-1 evaluative lexis · LIN-2 modal intensity · LIN-3 certainty/uncertainty markers · LIN-4 logical connectors · LIN-5 complex syntax · LIN-6 nominalisations · LIN-7 metaphor and imagery
DIS-1 event framing (closed label set) · DIS-2 information hierarchy · DIS-3 source selection · DIS-4 narrative angle · DIS-5 opposition/polarisation · DIS-6 agentivity
TRA-1 omission · TRA-2 addition · TRA-3 explicitation · TRA-4 attenuation · TRA-5 amplification · TRA-6 informational reorganisation · TRA-7 register shift — coded on aligned pairs only
PRD-1 lexical surprisal (LM-computed) · PRD-2 cloze probability (piloted) · PRD-3 prediction-support devices · PRD-4 anticipatory shifts that raise target-text predictability
Seven stages, human adjudication throughout
Document pairing
Cluster articles by event across languages (multilingual sentence embeddings + date filters); verify pairs manually.
Sentence alignment
LaBSE/LASER embeddings with alignment heuristics on the parallel sub-corpus; human verification of all retained pairs.
Divergence detection
Flag aligned pairs with low semantic similarity; classify candidate shifts for human adjudication.
Feature extraction
Automatic counts feeding the grid: connectors, passives, nominalisations, modality lexicons; readability and density indices.
Frame comparison
Classification of DIS-1 labels at scale, trained and validated on the hand-annotated gold subset.
Semi-automatic annotation
Model pre-annotates; the researcher corrects; corrections are logged to measure model reliability (H5).
Predictability profiling
Token/sentence surprisal per language with monolingual language models; aligned with shift annotations to test whether translation changes local predictability (H6).
Cognitive evaluation
A between-subjects (or mixed) design: each reader sees a text either in its original language or in a translated/adapted version of comparable length. Conditions cross language × status (original vs. translated), with counterbalanced topics and screened participants.
| Measure | Instrument |
|---|---|
| Reading time | Per text; screen-logged or timed. |
| Factual comprehension | Multiple-choice and short answers keyed to a fixed list of idea units per event. |
| Recall | Immediate free recall; optional delayed recall (48 h). |
| Prediction / cloze | Cloze task on key idea units; reading time, effort, and recall analysed as a function of the version's surprisal profile (H6). |
| Subjective cognitive load | Single-item Paas scale or NASA-TLX subset. |
| Perceived clarity & credibility | Likert ratings with brief justification. |