Skip to content

Database schema Translation lifecycle

domenico edited this page May 21, 2026 · 1 revision

Matecat Database Schema — Translation Lifecycle & Field Relations

Audience: Data analysts / data science engineers working with Matecat translation data. Scope: projects, jobs, segments, segment_translations, segment_translation_versions, segment_translation_events, engines. Focus: match_type, suggestion, suggestions_array, suggestion_match, suggestion_position, suggestion_source — lifecycle analysis, semantic meaning, and inter-field relationships.


1. Entity-Relationship Overview

projects (1) ──────< jobs (N)                          via jobs.id_project
projects ─────────── engines (TM)                       via projects.id_engine_tm
projects ─────────── engines (MT)                       via projects.id_engine_mt
jobs ─────────────── engines (TMS)                      via jobs.id_tms
jobs ─────────────── engines (MT)                       via jobs.id_mt_engine
jobs (1) ──────────< segment_translations (N)           via segment_translations.id_job
segments (1) ──────< segment_translations (N)           via segment_translations.id_segment
segment_translations ─< segment_translation_versions    via (id_segment, id_job)
segment_translations ─< segment_translation_events      via (id_segment, id_job)

Cardinality Notes

Relationship Explanation
projects → jobs One project has N jobs. A project can have one job per language pair.
jobs → segment_translations Each job owns a contiguous range of segments (job_first_segment..job_last_segment). One segment_translations row per (segment, job) pair.
jobs composite PK (id, password) — the same id can appear with multiple password values. This is the chunk model: a job split into parallel chunks for distributed translation, each chunk with its own password and segment sub-range.
segment_translations composite PK (id_segment, id_job) — one row per segment per job.
engines → 4 FKs The engines table is referenced by four foreign keys: projects.id_engine_tm, projects.id_engine_mt, jobs.id_tms, jobs.id_mt_engine. At the project level, TM and MT engines are selected; at the job level, they can be overridden per language pair.
versions & events Append-only audit trails. segment_translation_versions stores a snapshot of the translation BEFORE each edit. segment_translation_events logs who made the change and from which UI source page.

Uml diagram


2. Segment Translation Lifecycle — State Machine

A segment_translations row goes through these phases from creation to final state:

NEW ──(TM analysis)──> [match_type assigned, suggestions populated]
  │
  ├──(translator opens segment)──> suggestions_array written, first suggestion selected
  │
  ├──(translator edits & saves)──> DRAFT or TRANSLATED, version_number++,
  │                                 version snapshot saved, event logged
  │
  ├──(reviewer approves)──> APPROVED
  │
  └──(reviewer rejects)──> REJECTED

Final states: APPROVED, APPROVED2 (second-pass approval)

Status Values (segment_translations.status)

Status Meaning
NEW Segment has never been opened by a translator. Initial placeholder.
DRAFT Translator has started editing but hasn't marked complete.
TRANSLATED Translator has marked segment as complete.
APPROVED Reviewer has approved the translation (first pass).
APPROVED2 Reviewer has approved the translation (second pass — optional QA workflow).
REJECTED Reviewer has rejected the translation; it returns to translator.
FIXED Translator has addressed a rejection.

tm_analysis_status — Analysis Progress

Independent of the translation status, this tracks whether TM analysis (the automatic matching process) has run:

Value Meaning
UNDONE TM analysis has not yet run for this segment.
DONE TM analysis completed; match_type and suggestions populated.
SKIPPED TM analysis skipped (e.g., segment already translated, or project-level skip).

3. The match_type Field — Lifecycle & Semantics

match_type answers: "What kind of match was the best available TM/MT result for this segment?"

It is set in two phases. The final value always comes from Phase 2 (TMAnalysisWorker), which overwrites the Phase 1 (FastAnalysis) value — unless tm_analysis_status = 'SKIPPED', which protects the FastAnalysis value.

3.1 Phase 1 — FastAnalysis (Batch)

When: Immediately after project creation, the FastAnalysis daemon sends all segments in bulk to MyMemory's analyze endpoint.

Source: MyMemory::fastAnalysis()analyze.mymemory.translated.net/api/v1

What happens:

  1. MyMemory returns a type for each segment: 100%, 95%-99%, 85%-94%, 75%-84%, 50%-74%, NO_MATCH
  2. FastAnalysis translates these to internal match_type values:
MyMemory API type Internal match_type (DB) eq_word_count discount
100% 100% or 100%_PUBLIC equivalentWordMapping[100%]
95%-99% INTERNAL equivalentWordMapping[INTERNAL]
85%-94% INTERNAL equivalentWordMapping[INTERNAL]
75%-84% INTERNAL equivalentWordMapping[INTERNAL]
50%-74% NO_MATCH (demoted — not payable) 100% (full word count)
NO_MATCH NO_MATCH 100% (full word count)
  1. The results are INSERTed into segment_translations with ON DUPLICATE KEY UPDATE, but the UPDATE is guarded:
match_type = IF(tm_analysis_status = 'SKIPPED', match_type, VALUES(match_type))

So if TM analysis already ran (DONE), FastAnalysis does NOT overwrite.

Key insight: The range bands (75%-84%, 85%-94%, 95%-99%) are all mapped to INTERNAL during FastAnalysis. This is because FastAnalysis does a batch lookup and cannot distinguish exact match details. The fine-grained match type is resolved in Phase 2.

3.2 Phase 2 — TMAnalysisWorker (Per-Segment)

When: After FastAnalysis, each segment is enqueued into ActiveMQ and processed individually.

Source: TMAnalysisWorkerEngineService::getTMMatches()MyMemory::get() (per-segment TM lookup).

What happens:

  1. The worker queries the TM engine (MyMemory) for matches, and optionally the MT engine for machine translation.
  2. MatchSorter::sortMatches() sorts all results by match percentage descending (ICE beats non-ICE, MT beats TM at same score).
  3. The best match is selected.
  4. getNewMatchTypeAndEquivalentWordDiscount() resolves the final match_type by comparing the FastAnalysis result with the TM analysis result:

TM Match Type Resolution (Non-MT Matches)

TM Match % Condition Final match_type
100% ICE flag set (In-Context Exact match) ICE
100% Public TM (memory_key empty) 100%_PUBLIC (or TM_100_PUBLIC_MT_QE if MTQE workflow)
100% Private TM (memory_key present) 100% (or TM_100_MT_QE if MTQE workflow)
95-99% 95%-99%
85-94% 85%-94%
75-84% 75%-84%
50-74% 50%-74%
<50% NO_MATCH
Repetition Same segment appears elsewhere in project REPETITIONS

RATIO rule: The new match type is adopted ONLY if it's strictly better (lower payable rate per word) than the FastAnalysis result. The worker compares the equivalentWordMapping discount values.

MT Match Type Resolution

When the best match comes from machine translation (no TM match or MT is better):

MT Quality Score MTQE Disabled MTQE Enabled
≥ 0.9 ICE_MT ICE_MT
≥ 0.8 MT TOP_QUALITY_MT
≥ 0.5 MT HIGHER_QUALITY_MT
< 0.5 MT STANDARD_QUALITY_MT

3.3 Complete match_type Value Catalog

Values actually written to the database:

DB Value Meaning Payable Rate Key
NEW Never analyzed, never seen by translator N/A
NO_MATCH No TM match found (or match <50%) NO_MATCH
MT Machine translation (generic, no quality tiers) MT
TOP_QUALITY_MT MT with quality score ≥ 0.8 TOP_QUALITY_MT
HIGHER_QUALITY_MT MT with quality score ≥ 0.5 HIGHER_QUALITY_MT
STANDARD_QUALITY_MT MT with quality score < 0.5 STANDARD_QUALITY_MT
ICE_MT MT match with ICE-like quality (score ≥ 0.9) ICE_MT
50%-74% TM fuzzy match 50-74% 50%-74%
75%-84% TM fuzzy match 75-84% 75%-84%
85%-94% TM fuzzy match 85-94% 85%-94%
95%-99% TM fuzzy match 95-99% 95%-99%
100% TM exact match from private TM 100%
100%_PUBLIC TM exact match from public MyMemory 100%
ICE In-Context Exact match (100% + identical surrounding context) ICE
INTERNAL Internal matches from FastAnalysis (75-99% bands, not yet resolved by TMAnalysisWorker). Transient — replaced by TMAnalysisWorker with precise band. INTERNAL
REPETITIONS Segment is a repetition of another segment in the same project REPETITIONS
NUMBERS_ONLY Segment contains only numbers (no translatable text) N/A

3.4 match_type and eq_word_count Relationship

eq_word_count (equivalent word count) is the billable word count after applying the payable rate discount. The discount is derived from match_type:

eq_word_count = raw_word_count × equivalentWordMapping[match_type] / 100

Where equivalentWordMapping is the job's payable rates (default from PayableRates::$DEFAULT_PAYABLE_RATES):

{
  "NO_MATCH": 100,
  "50%-74%": 100,
  "75%-99%": 60,
  "100%": 30,
  "REPETITIONS": 30,
  "INTERNAL": 60,
  "MT": 85
}

Capped at raw_word_count (never exceeds it).

standard_word_count = eq_word_count in Phase 1. In Phase 2, if the match is MT, standard_word_count may differ because MT word count uses different industry-standard rules.


4. The Suggestion Fields — Inter-Field Relationships

These fields represent the best available automatic translation for the segment, determined during TM analysis and optionally updated when the translator opens the segment.

4.1 Field Definitions

Column Type Meaning
suggestions_array TEXT (JSON) Complete list of all TM/MT match candidates returned by the engines. Sorted by quality (best first). Stored as json_encode() of the matches array.
suggestion TEXT Translation text of the currently selected suggestion (the one shown to the translator). Typically the first/best match.
suggestion_match INT Match score (percentage, e.g., 85 for 85%) of the currently selected suggestion.
suggestion_source VARCHAR Origin of the selected suggestion: "TM" or "MT".
suggestion_position INT Zero-based index into suggestions_array identifying which match was selected. 0 = first/best. Can change if the translator cycles through suggestions.
match_type VARCHAR The fuzzy band of the best match (or NEW/NO_MATCH/MT). Independent of which suggestion was selected.

4.2 Invariants and Relationships

IF suggestion IS NOT NULL AND suggestion_position IS NOT NULL:
    suggestions_array[suggestion_position]['translation'] == suggestion
    suggestions_array[suggestion_position]['match'] == suggestion_match
    suggestion_source ∈ {"TM", "MT"}

4.3 Lifecycle — When Each Field Is Set

Phase 1: FastAnalysis (INSERT)

suggestions_array = NULL           ← not populated
suggestion         = NULL          ← not populated
suggestion_match   = NULL          ← not populated
suggestion_source  = NULL          ← not populated
suggestion_position = NULL         ← not populated
match_type         = <from FastAnalysis mapping, see §3.1>

FastAnalysis only sets match_type, eq_word_count, and standard_word_count. It does NOT populate suggestions — the full suggestion list comes from per-segment TM lookup.

Phase 2: TMAnalysisWorker (UPDATE)

Line references are from TMAnalysisWorker.php:

// Line 186: All TM+MT matches are JSON-encoded
$suggestionJson = json_encode($matches);

// Line 191-193: The post-processed best match becomes the suggestion
$tmData['translation']  = $postProcessed['suggestion'];  // raw translation
$tmData['suggestion']   = $postProcessed['suggestion'];  // same value
$tmData['suggestions_array'] = $suggestionJson;           // ALL matches

// Line 194: Match type resolved (see §3.2)
$tmData['match_type'] = strtoupper($matchType);

// Lines 202-208: Source classification
$tmData['suggestion_source'] = $bestMatch['created_by'] ?? null;
if (str_contains($suggestion_source, 'MT')) → 'MT'
else                                     → 'TM'

// Line 211: Match score
$tmData['suggestion_match'] = $bestMatch['match'] ?? 0;

Note: suggestion_position is NOT set during TM analysis. It is set later when the translator first opens the segment (via GetContributionWorker) or when they interact with the suggestions dropdown.

Phase 3: Translator Opens Segment (GET Contribution)

When a translator opens a segment for the first time, the frontend calls POST /api/app/get-contribution, which triggers GetContributionWorker. This worker:

  1. Queries TM engine (MyMemory::get()) for matches with get_mt=1
  2. Queries optional adaptive MT engine
  3. Sorts results via MatchSorter
  4. Writes the full match list to suggestions_array
  5. Sets suggestion, suggestion_match, suggestion_source, suggestion_position to the first (best) match
  6. Sends results to the browser via WebSocket

Key detail: If the segment is NEW (never opened), this is the first time suggestions_array is populated. For non-NEW segments, this refreshes the suggestions (MT results may have improved since analysis).

Phase 4: Translator Saves Translation (SET Translation)

When the translator saves, SetTranslationController writes the new translation to segment_translations.translation. The suggestion, suggestion_match, suggestion_source, and suggestion_position fields are preserved — they still reflect the original automatic suggestion that the translation was based on.

Phase 5: Translator Changes Suggested Match

The translator can cycle through suggestions. When they select a different suggestion:

  • suggestion_position changes to the index of the newly selected match
  • suggestion updates to that match's translation
  • suggestion_match updates to that match's score
  • suggestion_source updates to "TM" or "MT"

4.4 suggestions_array JSON Structure

Each element in the array is a match candidate from the TM/MT engine. A typical element looks like:

{
  "id": "0",
  "segment": "Source segment text",
  "translation": "La migliore traduzione disponibile",
  "raw_translation": "La migliore traduzione disponibile",
  "target_note": "",
  "raw_segment": "The best available translation",
  "match": 85,
  "created_by": "MT!",
  "last_update_date": "2025-01-15 10:30:00",
  "last_updated_by": "matecat",
  "ICE": false,
  "tm_properties": [],
  "memory_key": "",
  "quality": "85",
  "score": 0.85
}

Key fields in each match:

Field Type Meaning
segment string Source text
translation string Target translation (may include formatting tags)
raw_translation string Translation without post-processing
match int/string Match percentage (e.g., "85" or 85)
created_by string Origin identifier — "MT!" for machine translation, or a user/engine name for TM
ICE bool In-Context Exact match flag
memory_key string TM key name (empty for public MyMemory)
score float MT quality estimation score (0-1 scale)
tm_properties array Additional TM metadata

4.5 suggestion_position vs match_type — Critical Distinction

match_type suggestion_position
What it represents The fuzzy band of the overall best match (for billing) Which specific item in suggestions_array the translator chose
Set by TMAnalysisWorker GetContributionWorker / frontend
Semantics "This segment was classified as a 85%-94% TM match" "The translator chose suggestion #2 (a MT translation), overriding the default suggestion #0"
Example divergence match_type = "85%-94%" but suggestion_position = 1 and suggestion_source = "MT" — the TM analysis found a good TM match, but the translator preferred the MT suggestion stored at index 1

Analytical implication: To understand what the translator actually used as their starting point, look at suggestion_position + suggestion_source + suggestion_match, NOT match_type. The match_type reflects the system's best TM match classification; the suggestion fields reflect what the translator saw and chose.


5. Audit Trail: Versions & Events

5.1 segment_translation_versions

Purpose: Snapshot of the translation BEFORE each edit.

Column Meaning
id_segment, id_job FK to segment_translations
translation The translation text BEFORE the edit
version_number Monotonic counter — segment_translations.version_number BEFORE increment
creation_date When the snapshot was taken
propagated_from If this edit was auto-propagated, the segment it was copied from
time_to_edit Milliseconds the translator spent since the last save
raw_diff Text diff between old and new translation
old_status Status before the edit
new_status Status after the edit

Lifecycle: Created by TranslationVersionsHandler::saveVersionAndIncrement() BEFORE each write to segment_translations, then version_number is incremented.

Query pattern: To get the full edit history of a segment:

SELECT * FROM segment_translation_versions
WHERE id_segment = ? AND id_job = ?
ORDER BY version_number ASC;

5.2 segment_translation_events

Purpose: Event log — who changed what, when, and from which UI.

Column Meaning
id_job, id_segment FK to segment_translations
uid User ID who made the change
version_number Version number AFTER the change
source_page Which UI page: 1 = translate, 2 = revise, 3 = translate (RTL), 4 = revise (RTL)
status Status applied in this event
create_date When the event occurred
final_revision 1 if this was a final review event
time_to_edit Milliseconds spent editing

Lifecycle: Created by TranslationEventsHandler::storeTranslationEvent() AFTER each write to segment_translations.

Complication: segment_translation_events has a composite partition key of (id, id_job)id_job is part of the PK.

Query pattern: To get who approved a segment:

SELECT * FROM segment_translation_events
WHERE id_segment = ? AND id_job = ?
  AND status = 'APPROVED'
ORDER BY create_date DESC
LIMIT 1;

6. Engines — The Matching Backend

The engines table is the registry of all translation engines (TM + MT):

Column Example (MyMemory TM) Example (Lara MT)
id 1 5
type TM MT
base_url https://api.mymemory.translated.net https://api.laratranslate.com
class_load MyMemory Lara
extra_parameters {} {"Lara-AccessKeyId":"...","MMT-License":"..."}

Four references in the schema:

  • projects.id_engine_tm → TM engine for the project
  • projects.id_engine_mt → MT engine for the project
  • jobs.id_tms → TMS engine for the job (can override project-level)
  • jobs.id_mt_engine → MT engine for the job (can override project-level)

A job with only_private_tm = 1 disables public MyMemory lookups. A job with id_tms = 0 or id_mt_engine = 0 disables TM or MT respectively.


7. Common Analytical Queries

7.1 Match Type Distribution Across a Project

SELECT
    j.id AS job_id,
    j.source,
    j.target,
    st.match_type,
    COUNT(*) AS segment_count,
    SUM(st.eq_word_count) AS total_eq_words,
    SUM(st.standard_word_count) AS total_standard_words
FROM segment_translations st
JOIN jobs j ON st.id_job = j.id AND j.password = (SELECT password FROM jobs WHERE id_project = ? AND id = j.id LIMIT 1)
WHERE j.id_project = ? AND j.completed = 0
GROUP BY j.id, st.match_type
ORDER BY j.id, st.match_type;

7.2 Translator Suggestion Choice Analysis

-- What percentage of segments did translators use a different suggestion than the default?
SELECT
    j.source,
    j.target,
    COUNT(*) AS total_segments,
    SUM(CASE WHEN st.suggestion_position > 0 THEN 1 ELSE 0 END) AS changed_suggestion,
    SUM(CASE WHEN st.suggestion_position > 0 THEN 1 ELSE 0 END) / COUNT(*) * 100 AS pct_changed
FROM segment_translations st
JOIN jobs j ON st.id_job = j.id
WHERE j.id_project = ?
  AND st.status IN ('TRANSLATED', 'APPROVED', 'APPROVED2')
  AND st.suggestion_position IS NOT NULL
GROUP BY j.source, j.target;

7.3 Translation Edit Distance vs Match Type

-- How much do translators edit TM suggestions by match type?
SELECT
    st.match_type,
    COUNT(*) AS segments,
    AVG(st.edit_distance) AS avg_edit_distance,
    AVG(st.time_to_edit) AS avg_time_ms
FROM segment_translations st
WHERE st.status IN ('TRANSLATED', 'APPROVED')
  AND st.match_type IS NOT NULL
  AND st.match_type != 'NEW'
GROUP BY st.match_type
ORDER BY st.match_type;

7.4 Full Edit History for a Segment

-- Reconstruct the edit timeline of a specific segment
SELECT
    v.version_number,
    v.translation AS before_translation,
    ste.status AS event_status,
    ste.uid AS changed_by,
    ste.source_page,
    ste.create_date,
    ste.time_to_edit
FROM segment_translation_versions v
LEFT JOIN segment_translation_events ste
    ON v.id_segment = ste.id_segment
    AND v.id_job = ste.id_job
    AND v.version_number = ste.version_number - 1
WHERE v.id_segment = ? AND v.id_job = ?
ORDER BY v.version_number ASC;

7.5 MT Usage Analysis by Language Pair

-- How often is MT used as the starting point vs TM?
SELECT
    j.source,
    j.target,
    st.suggestion_source,
    COUNT(*) AS segments,
    SUM(st.eq_word_count) AS total_eq_words
FROM segment_translations st
JOIN jobs j ON st.id_job = j.id
WHERE st.status IN ('TRANSLATED', 'APPROVED')
  AND st.suggestion_source IS NOT NULL
GROUP BY j.source, j.target, st.suggestion_source
ORDER BY j.source, j.target;

7.6 Project Incompleteness Check

-- Segments not yet translated in a project
SELECT
    st.id_job,
    st.id_segment,
    st.status,
    st.match_type,
    s.segment AS source_text,
    s.raw_word_count
FROM segment_translations st
JOIN segments s ON st.id_segment = s.id
WHERE st.id_job IN (SELECT id FROM jobs WHERE id_project = ?)
  AND st.status NOT IN ('TRANSLATED', 'APPROVED', 'APPROVED2')
ORDER BY st.id_job, st.id_segment;

8. Key Design Decisions for Data Analysis

  1. match_type is for billing, not for quality measurement. It reflects the best automatic match classification. The actual translation quality depends on what the translator did afterward — measured by edit_distance and time_to_edit.

  2. suggestion_position can diverge from match_type. A segment classified as 85%-94% TM match may have been translated starting from an MT suggestion at suggestion_position = 1. Always check suggestion_source when analyzing translator behavior.

  3. segment_translations.translation may be empty for NEW segments with TM matches. The translation field holds the translator's saved text. Before the translator saves, the automatic suggestion is in suggestion and suggestions_array, NOT in translation.

  4. FastAnalysis results are intermediate and may be overwritten. The INTERNAL match type is transient — it means TMAnalysisWorker hasn't processed this segment yet. For completed projects, INTERNAL should not appear.

  5. eq_word_count vs standard_word_count. eq_word_count is the equivalent (billable) word count after applying payable rate discounts. standard_word_count accounts for industry-standard word counting differences (primarily for MT). For non-MT matches, they are identical.

  6. Job chunks mean the same id can appear multiple times. Always include password in queries against jobs. Use the composite key (id_job, id_segment) to uniquely identify a segment translation. Use (id, password) to uniquely identify a job.

  7. The engines table has no direct FK constraints. All four engine references (id_engine_tm, id_engine_mt, id_tms, id_mt_engine) point to engines.id but are not enforced as foreign keys in the schema. Join carefully.

  8. Segment range scoping. Jobs own segments via job_first_segment..job_last_segment, not via foreign keys. When joining, use range conditions or rely on the segment_translations.id_job relationship which already enforces the correct scope.


Appendix A: SQL Column Reference

segment_translations

Column Type Null Default Description
id_segment bigint(20) NO PK (composite), FK to segments
id_job bigint(20) NO PK (composite), FK to jobs
segment_hash varchar(45) NO MD5 hash of segment text
autopropagated_from bigint(20) YES NULL Source segment if translation was propagated
status varchar(45) YES 'NEW' Translation status
translation text YES NULL Translator's saved translation text
translation_date datetime YES NULL Last save timestamp
time_to_edit int(11) NO 0 Milliseconds spent editing
match_type varchar(45) YES 'NEW' Fuzzy band classification
context_hash blob YES NULL Hash of surrounding segment context
eq_word_count double(20,2) YES NULL Billable word count (after discount)
standard_word_count double(20,2) YES NULL Industry standard word count
suggestions_array text YES NULL JSON array of all match candidates
suggestion text YES NULL Translation of selected suggestion
suggestion_match int(11) YES NULL Match % of selected suggestion
suggestion_source varchar(45) YES NULL "TM" or "MT"
suggestion_position int(11) YES NULL Index in suggestions_array
mt_qe float(19,14) NO 0.0 MT quality estimation score
tm_analysis_status varchar(50) YES 'UNDONE' Analysis progress
locked tinyint(4) YES 0 Locked for editing
warning tinyint(4) NO 0 Translation warning flag
serialized_errors_list varchar(512) YES NULL Serialized error details
version_number int(11) YES 0 Edit version counter
edit_distance int(11) YES NULL Edit distance from suggestion

Clone this wiki locally