-
Notifications
You must be signed in to change notification settings - Fork 277
Database schema Translation lifecycle
Audience: Data analysts / data science engineers working with Matecat translation data. Scope:
projects,jobs,segments,segment_translations,segment_translation_versions,segment_translation_events,engines. Focus:match_type,suggestion,suggestions_array,suggestion_match,suggestion_position,suggestion_source— lifecycle analysis, semantic meaning, and inter-field relationships.
projects (1) ──────< jobs (N) via jobs.id_project
projects ─────────── engines (TM) via projects.id_engine_tm
projects ─────────── engines (MT) via projects.id_engine_mt
jobs ─────────────── engines (TMS) via jobs.id_tms
jobs ─────────────── engines (MT) via jobs.id_mt_engine
jobs (1) ──────────< segment_translations (N) via segment_translations.id_job
segments (1) ──────< segment_translations (N) via segment_translations.id_segment
segment_translations ─< segment_translation_versions via (id_segment, id_job)
segment_translations ─< segment_translation_events via (id_segment, id_job)
| Relationship | Explanation |
|---|---|
| projects → jobs | One project has N jobs. A project can have one job per language pair. |
| jobs → segment_translations | Each job owns a contiguous range of segments (job_first_segment..job_last_segment). One segment_translations row per (segment, job) pair. |
| jobs composite PK |
(id, password) — the same id can appear with multiple password values. This is the chunk model: a job split into parallel chunks for distributed translation, each chunk with its own password and segment sub-range. |
| segment_translations composite PK |
(id_segment, id_job) — one row per segment per job. |
| engines → 4 FKs | The engines table is referenced by four foreign keys: projects.id_engine_tm, projects.id_engine_mt, jobs.id_tms, jobs.id_mt_engine. At the project level, TM and MT engines are selected; at the job level, they can be overridden per language pair. |
| versions & events | Append-only audit trails. segment_translation_versions stores a snapshot of the translation BEFORE each edit. segment_translation_events logs who made the change and from which UI source page. |
A segment_translations row goes through these phases from creation to final state:
NEW ──(TM analysis)──> [match_type assigned, suggestions populated]
│
├──(translator opens segment)──> suggestions_array written, first suggestion selected
│
├──(translator edits & saves)──> DRAFT or TRANSLATED, version_number++,
│ version snapshot saved, event logged
│
├──(reviewer approves)──> APPROVED
│
└──(reviewer rejects)──> REJECTED
Final states: APPROVED, APPROVED2 (second-pass approval)
| Status | Meaning |
|---|---|
NEW |
Segment has never been opened by a translator. Initial placeholder. |
DRAFT |
Translator has started editing but hasn't marked complete. |
TRANSLATED |
Translator has marked segment as complete. |
APPROVED |
Reviewer has approved the translation (first pass). |
APPROVED2 |
Reviewer has approved the translation (second pass — optional QA workflow). |
REJECTED |
Reviewer has rejected the translation; it returns to translator. |
FIXED |
Translator has addressed a rejection. |
Independent of the translation status, this tracks whether TM analysis (the automatic matching process) has run:
| Value | Meaning |
|---|---|
UNDONE |
TM analysis has not yet run for this segment. |
DONE |
TM analysis completed; match_type and suggestions populated. |
SKIPPED |
TM analysis skipped (e.g., segment already translated, or project-level skip). |
match_type answers: "What kind of match was the best available TM/MT result for this segment?"
It is set in two phases. The final value always comes from Phase 2 (TMAnalysisWorker), which overwrites the Phase 1 (FastAnalysis) value — unless tm_analysis_status = 'SKIPPED', which protects the FastAnalysis value.
When: Immediately after project creation, the FastAnalysis daemon sends all segments in bulk to MyMemory's analyze endpoint.
Source: MyMemory::fastAnalysis() → analyze.mymemory.translated.net/api/v1
What happens:
- MyMemory returns a
typefor each segment:100%,95%-99%,85%-94%,75%-84%,50%-74%,NO_MATCH - FastAnalysis translates these to internal
match_typevalues:
MyMemory API type
|
Internal match_type (DB) |
eq_word_count discount |
|---|---|---|
100% |
100% or 100%_PUBLIC
|
equivalentWordMapping[100%] |
95%-99% |
INTERNAL |
equivalentWordMapping[INTERNAL] |
85%-94% |
INTERNAL |
equivalentWordMapping[INTERNAL] |
75%-84% |
INTERNAL |
equivalentWordMapping[INTERNAL] |
50%-74% |
NO_MATCH (demoted — not payable) |
100% (full word count) |
NO_MATCH |
NO_MATCH |
100% (full word count) |
- The results are INSERTed into
segment_translationswithON DUPLICATE KEY UPDATE, but the UPDATE is guarded:
match_type = IF(tm_analysis_status = 'SKIPPED', match_type, VALUES(match_type))So if TM analysis already ran (DONE), FastAnalysis does NOT overwrite.
Key insight: The range bands (75%-84%, 85%-94%, 95%-99%) are all mapped to INTERNAL during FastAnalysis. This is because FastAnalysis does a batch lookup and cannot distinguish exact match details. The fine-grained match type is resolved in Phase 2.
When: After FastAnalysis, each segment is enqueued into ActiveMQ and processed individually.
Source: TMAnalysisWorker → EngineService::getTMMatches() → MyMemory::get() (per-segment TM lookup).
What happens:
- The worker queries the TM engine (MyMemory) for matches, and optionally the MT engine for machine translation.
-
MatchSorter::sortMatches()sorts all results by match percentage descending (ICE beats non-ICE, MT beats TM at same score). - The best match is selected.
-
getNewMatchTypeAndEquivalentWordDiscount()resolves the finalmatch_typeby comparing the FastAnalysis result with the TM analysis result:
| TM Match % | Condition | Final match_type
|
|---|---|---|
| 100% |
ICE flag set (In-Context Exact match) |
ICE |
| 100% | Public TM (memory_key empty) |
100%_PUBLIC (or TM_100_PUBLIC_MT_QE if MTQE workflow) |
| 100% | Private TM (memory_key present) |
100% (or TM_100_MT_QE if MTQE workflow) |
| 95-99% | — | 95%-99% |
| 85-94% | — | 85%-94% |
| 75-84% | — | 75%-84% |
| 50-74% | — | 50%-74% |
| <50% | — | NO_MATCH |
| Repetition | Same segment appears elsewhere in project | REPETITIONS |
RATIO rule: The new match type is adopted ONLY if it's strictly better (lower payable rate per word) than the FastAnalysis result. The worker compares the equivalentWordMapping discount values.
When the best match comes from machine translation (no TM match or MT is better):
| MT Quality Score | MTQE Disabled | MTQE Enabled |
|---|---|---|
| ≥ 0.9 | ICE_MT |
ICE_MT |
| ≥ 0.8 | MT |
TOP_QUALITY_MT |
| ≥ 0.5 | MT |
HIGHER_QUALITY_MT |
| < 0.5 | MT |
STANDARD_QUALITY_MT |
Values actually written to the database:
| DB Value | Meaning | Payable Rate Key |
|---|---|---|
NEW |
Never analyzed, never seen by translator | N/A |
NO_MATCH |
No TM match found (or match <50%) | NO_MATCH |
MT |
Machine translation (generic, no quality tiers) | MT |
TOP_QUALITY_MT |
MT with quality score ≥ 0.8 | TOP_QUALITY_MT |
HIGHER_QUALITY_MT |
MT with quality score ≥ 0.5 | HIGHER_QUALITY_MT |
STANDARD_QUALITY_MT |
MT with quality score < 0.5 | STANDARD_QUALITY_MT |
ICE_MT |
MT match with ICE-like quality (score ≥ 0.9) | ICE_MT |
50%-74% |
TM fuzzy match 50-74% | 50%-74% |
75%-84% |
TM fuzzy match 75-84% | 75%-84% |
85%-94% |
TM fuzzy match 85-94% | 85%-94% |
95%-99% |
TM fuzzy match 95-99% | 95%-99% |
100% |
TM exact match from private TM | 100% |
100%_PUBLIC |
TM exact match from public MyMemory | 100% |
ICE |
In-Context Exact match (100% + identical surrounding context) | ICE |
INTERNAL |
Internal matches from FastAnalysis (75-99% bands, not yet resolved by TMAnalysisWorker). Transient — replaced by TMAnalysisWorker with precise band. | INTERNAL |
REPETITIONS |
Segment is a repetition of another segment in the same project | REPETITIONS |
NUMBERS_ONLY |
Segment contains only numbers (no translatable text) | N/A |
eq_word_count (equivalent word count) is the billable word count after applying the payable rate discount. The discount is derived from match_type:
eq_word_count = raw_word_count × equivalentWordMapping[match_type] / 100
Where equivalentWordMapping is the job's payable rates (default from PayableRates::$DEFAULT_PAYABLE_RATES):
{
"NO_MATCH": 100,
"50%-74%": 100,
"75%-99%": 60,
"100%": 30,
"REPETITIONS": 30,
"INTERNAL": 60,
"MT": 85
}Capped at raw_word_count (never exceeds it).
standard_word_count = eq_word_count in Phase 1. In Phase 2, if the match is MT, standard_word_count may differ because MT word count uses different industry-standard rules.
These fields represent the best available automatic translation for the segment, determined during TM analysis and optionally updated when the translator opens the segment.
| Column | Type | Meaning |
|---|---|---|
suggestions_array |
TEXT (JSON) |
Complete list of all TM/MT match candidates returned by the engines. Sorted by quality (best first). Stored as json_encode() of the matches array. |
suggestion |
TEXT | Translation text of the currently selected suggestion (the one shown to the translator). Typically the first/best match. |
suggestion_match |
INT |
Match score (percentage, e.g., 85 for 85%) of the currently selected suggestion. |
suggestion_source |
VARCHAR | Origin of the selected suggestion: "TM" or "MT". |
suggestion_position |
INT |
Zero-based index into suggestions_array identifying which match was selected. 0 = first/best. Can change if the translator cycles through suggestions. |
match_type |
VARCHAR | The fuzzy band of the best match (or NEW/NO_MATCH/MT). Independent of which suggestion was selected. |
IF suggestion IS NOT NULL AND suggestion_position IS NOT NULL:
suggestions_array[suggestion_position]['translation'] == suggestion
suggestions_array[suggestion_position]['match'] == suggestion_match
suggestion_source ∈ {"TM", "MT"}
suggestions_array = NULL ← not populated
suggestion = NULL ← not populated
suggestion_match = NULL ← not populated
suggestion_source = NULL ← not populated
suggestion_position = NULL ← not populated
match_type = <from FastAnalysis mapping, see §3.1>
FastAnalysis only sets match_type, eq_word_count, and standard_word_count. It does NOT populate suggestions — the full suggestion list comes from per-segment TM lookup.
Line references are from TMAnalysisWorker.php:
// Line 186: All TM+MT matches are JSON-encoded
$suggestionJson = json_encode($matches);
// Line 191-193: The post-processed best match becomes the suggestion
$tmData['translation'] = $postProcessed['suggestion']; // raw translation
$tmData['suggestion'] = $postProcessed['suggestion']; // same value
$tmData['suggestions_array'] = $suggestionJson; // ALL matches
// Line 194: Match type resolved (see §3.2)
$tmData['match_type'] = strtoupper($matchType);
// Lines 202-208: Source classification
$tmData['suggestion_source'] = $bestMatch['created_by'] ?? null;
if (str_contains($suggestion_source, 'MT')) → 'MT'
else → 'TM'
// Line 211: Match score
$tmData['suggestion_match'] = $bestMatch['match'] ?? 0;Note: suggestion_position is NOT set during TM analysis. It is set later when the translator first opens the segment (via GetContributionWorker) or when they interact with the suggestions dropdown.
When a translator opens a segment for the first time, the frontend calls POST /api/app/get-contribution, which triggers GetContributionWorker. This worker:
- Queries TM engine (
MyMemory::get()) for matches withget_mt=1 - Queries optional adaptive MT engine
- Sorts results via
MatchSorter - Writes the full match list to
suggestions_array - Sets
suggestion,suggestion_match,suggestion_source,suggestion_positionto the first (best) match - Sends results to the browser via WebSocket
Key detail: If the segment is NEW (never opened), this is the first time suggestions_array is populated. For non-NEW segments, this refreshes the suggestions (MT results may have improved since analysis).
When the translator saves, SetTranslationController writes the new translation to segment_translations.translation. The suggestion, suggestion_match, suggestion_source, and suggestion_position fields are preserved — they still reflect the original automatic suggestion that the translation was based on.
The translator can cycle through suggestions. When they select a different suggestion:
-
suggestion_positionchanges to the index of the newly selected match -
suggestionupdates to that match's translation -
suggestion_matchupdates to that match's score -
suggestion_sourceupdates to"TM"or"MT"
Each element in the array is a match candidate from the TM/MT engine. A typical element looks like:
{
"id": "0",
"segment": "Source segment text",
"translation": "La migliore traduzione disponibile",
"raw_translation": "La migliore traduzione disponibile",
"target_note": "",
"raw_segment": "The best available translation",
"match": 85,
"created_by": "MT!",
"last_update_date": "2025-01-15 10:30:00",
"last_updated_by": "matecat",
"ICE": false,
"tm_properties": [],
"memory_key": "",
"quality": "85",
"score": 0.85
}Key fields in each match:
| Field | Type | Meaning |
|---|---|---|
segment |
string | Source text |
translation |
string | Target translation (may include formatting tags) |
raw_translation |
string | Translation without post-processing |
match |
int/string | Match percentage (e.g., "85" or 85) |
created_by |
string | Origin identifier — "MT!" for machine translation, or a user/engine name for TM |
ICE |
bool | In-Context Exact match flag |
memory_key |
string | TM key name (empty for public MyMemory) |
score |
float | MT quality estimation score (0-1 scale) |
tm_properties |
array | Additional TM metadata |
match_type |
suggestion_position |
|
|---|---|---|
| What it represents | The fuzzy band of the overall best match (for billing) | Which specific item in suggestions_array the translator chose |
| Set by | TMAnalysisWorker | GetContributionWorker / frontend |
| Semantics | "This segment was classified as a 85%-94% TM match" | "The translator chose suggestion #2 (a MT translation), overriding the default suggestion #0" |
| Example divergence |
match_type = "85%-94%" but suggestion_position = 1 and suggestion_source = "MT" — the TM analysis found a good TM match, but the translator preferred the MT suggestion stored at index 1 |
Analytical implication: To understand what the translator actually used as their starting point, look at suggestion_position + suggestion_source + suggestion_match, NOT match_type. The match_type reflects the system's best TM match classification; the suggestion fields reflect what the translator saw and chose.
Purpose: Snapshot of the translation BEFORE each edit.
| Column | Meaning |
|---|---|
id_segment, id_job
|
FK to segment_translations
|
translation |
The translation text BEFORE the edit |
version_number |
Monotonic counter — segment_translations.version_number BEFORE increment |
creation_date |
When the snapshot was taken |
propagated_from |
If this edit was auto-propagated, the segment it was copied from |
time_to_edit |
Milliseconds the translator spent since the last save |
raw_diff |
Text diff between old and new translation |
old_status |
Status before the edit |
new_status |
Status after the edit |
Lifecycle: Created by TranslationVersionsHandler::saveVersionAndIncrement() BEFORE each write to segment_translations, then version_number is incremented.
Query pattern: To get the full edit history of a segment:
SELECT * FROM segment_translation_versions
WHERE id_segment = ? AND id_job = ?
ORDER BY version_number ASC;Purpose: Event log — who changed what, when, and from which UI.
| Column | Meaning |
|---|---|
id_job, id_segment
|
FK to segment_translations
|
uid |
User ID who made the change |
version_number |
Version number AFTER the change |
source_page |
Which UI page: 1 = translate, 2 = revise, 3 = translate (RTL), 4 = revise (RTL) |
status |
Status applied in this event |
create_date |
When the event occurred |
final_revision |
1 if this was a final review event |
time_to_edit |
Milliseconds spent editing |
Lifecycle: Created by TranslationEventsHandler::storeTranslationEvent() AFTER each write to segment_translations.
Complication: segment_translation_events has a composite partition key of (id, id_job) — id_job is part of the PK.
Query pattern: To get who approved a segment:
SELECT * FROM segment_translation_events
WHERE id_segment = ? AND id_job = ?
AND status = 'APPROVED'
ORDER BY create_date DESC
LIMIT 1;The engines table is the registry of all translation engines (TM + MT):
| Column | Example (MyMemory TM) | Example (Lara MT) |
|---|---|---|
id |
1 | 5 |
type |
TM |
MT |
base_url |
https://api.mymemory.translated.net |
https://api.laratranslate.com |
class_load |
MyMemory |
Lara |
extra_parameters |
{} |
{"Lara-AccessKeyId":"...","MMT-License":"..."} |
Four references in the schema:
-
projects.id_engine_tm→ TM engine for the project -
projects.id_engine_mt→ MT engine for the project -
jobs.id_tms→ TMS engine for the job (can override project-level) -
jobs.id_mt_engine→ MT engine for the job (can override project-level)
A job with only_private_tm = 1 disables public MyMemory lookups. A job with id_tms = 0 or id_mt_engine = 0 disables TM or MT respectively.
SELECT
j.id AS job_id,
j.source,
j.target,
st.match_type,
COUNT(*) AS segment_count,
SUM(st.eq_word_count) AS total_eq_words,
SUM(st.standard_word_count) AS total_standard_words
FROM segment_translations st
JOIN jobs j ON st.id_job = j.id AND j.password = (SELECT password FROM jobs WHERE id_project = ? AND id = j.id LIMIT 1)
WHERE j.id_project = ? AND j.completed = 0
GROUP BY j.id, st.match_type
ORDER BY j.id, st.match_type;-- What percentage of segments did translators use a different suggestion than the default?
SELECT
j.source,
j.target,
COUNT(*) AS total_segments,
SUM(CASE WHEN st.suggestion_position > 0 THEN 1 ELSE 0 END) AS changed_suggestion,
SUM(CASE WHEN st.suggestion_position > 0 THEN 1 ELSE 0 END) / COUNT(*) * 100 AS pct_changed
FROM segment_translations st
JOIN jobs j ON st.id_job = j.id
WHERE j.id_project = ?
AND st.status IN ('TRANSLATED', 'APPROVED', 'APPROVED2')
AND st.suggestion_position IS NOT NULL
GROUP BY j.source, j.target;-- How much do translators edit TM suggestions by match type?
SELECT
st.match_type,
COUNT(*) AS segments,
AVG(st.edit_distance) AS avg_edit_distance,
AVG(st.time_to_edit) AS avg_time_ms
FROM segment_translations st
WHERE st.status IN ('TRANSLATED', 'APPROVED')
AND st.match_type IS NOT NULL
AND st.match_type != 'NEW'
GROUP BY st.match_type
ORDER BY st.match_type;-- Reconstruct the edit timeline of a specific segment
SELECT
v.version_number,
v.translation AS before_translation,
ste.status AS event_status,
ste.uid AS changed_by,
ste.source_page,
ste.create_date,
ste.time_to_edit
FROM segment_translation_versions v
LEFT JOIN segment_translation_events ste
ON v.id_segment = ste.id_segment
AND v.id_job = ste.id_job
AND v.version_number = ste.version_number - 1
WHERE v.id_segment = ? AND v.id_job = ?
ORDER BY v.version_number ASC;-- How often is MT used as the starting point vs TM?
SELECT
j.source,
j.target,
st.suggestion_source,
COUNT(*) AS segments,
SUM(st.eq_word_count) AS total_eq_words
FROM segment_translations st
JOIN jobs j ON st.id_job = j.id
WHERE st.status IN ('TRANSLATED', 'APPROVED')
AND st.suggestion_source IS NOT NULL
GROUP BY j.source, j.target, st.suggestion_source
ORDER BY j.source, j.target;-- Segments not yet translated in a project
SELECT
st.id_job,
st.id_segment,
st.status,
st.match_type,
s.segment AS source_text,
s.raw_word_count
FROM segment_translations st
JOIN segments s ON st.id_segment = s.id
WHERE st.id_job IN (SELECT id FROM jobs WHERE id_project = ?)
AND st.status NOT IN ('TRANSLATED', 'APPROVED', 'APPROVED2')
ORDER BY st.id_job, st.id_segment;-
match_typeis for billing, not for quality measurement. It reflects the best automatic match classification. The actual translation quality depends on what the translator did afterward — measured byedit_distanceandtime_to_edit. -
suggestion_positioncan diverge frommatch_type. A segment classified as85%-94%TM match may have been translated starting from an MT suggestion atsuggestion_position = 1. Always checksuggestion_sourcewhen analyzing translator behavior. -
segment_translations.translationmay be empty forNEWsegments with TM matches. Thetranslationfield holds the translator's saved text. Before the translator saves, the automatic suggestion is insuggestionandsuggestions_array, NOT intranslation. -
FastAnalysis results are intermediate and may be overwritten. The
INTERNALmatch type is transient — it means TMAnalysisWorker hasn't processed this segment yet. For completed projects,INTERNALshould not appear. -
eq_word_countvsstandard_word_count.eq_word_countis the equivalent (billable) word count after applying payable rate discounts.standard_word_countaccounts for industry-standard word counting differences (primarily for MT). For non-MT matches, they are identical. -
Job chunks mean the same
idcan appear multiple times. Always includepasswordin queries againstjobs. Use the composite key(id_job, id_segment)to uniquely identify a segment translation. Use(id, password)to uniquely identify a job. -
The
enginestable has no direct FK constraints. All four engine references (id_engine_tm,id_engine_mt,id_tms,id_mt_engine) point toengines.idbut are not enforced as foreign keys in the schema. Join carefully. -
Segment range scoping. Jobs own segments via
job_first_segment..job_last_segment, not via foreign keys. When joining, use range conditions or rely on thesegment_translations.id_jobrelationship which already enforces the correct scope.
| Column | Type | Null | Default | Description |
|---|---|---|---|---|
id_segment |
bigint(20) | NO | — | PK (composite), FK to segments |
id_job |
bigint(20) | NO | — | PK (composite), FK to jobs |
segment_hash |
varchar(45) | NO | — | MD5 hash of segment text |
autopropagated_from |
bigint(20) | YES | NULL | Source segment if translation was propagated |
status |
varchar(45) | YES | 'NEW' | Translation status |
translation |
text | YES | NULL | Translator's saved translation text |
translation_date |
datetime | YES | NULL | Last save timestamp |
time_to_edit |
int(11) | NO | 0 | Milliseconds spent editing |
match_type |
varchar(45) | YES | 'NEW' | Fuzzy band classification |
context_hash |
blob | YES | NULL | Hash of surrounding segment context |
eq_word_count |
double(20,2) | YES | NULL | Billable word count (after discount) |
standard_word_count |
double(20,2) | YES | NULL | Industry standard word count |
suggestions_array |
text | YES | NULL | JSON array of all match candidates |
suggestion |
text | YES | NULL | Translation of selected suggestion |
suggestion_match |
int(11) | YES | NULL | Match % of selected suggestion |
suggestion_source |
varchar(45) | YES | NULL | "TM" or "MT" |
suggestion_position |
int(11) | YES | NULL | Index in suggestions_array |
mt_qe |
float(19,14) | NO | 0.0 | MT quality estimation score |
tm_analysis_status |
varchar(50) | YES | 'UNDONE' | Analysis progress |
locked |
tinyint(4) | YES | 0 | Locked for editing |
warning |
tinyint(4) | NO | 0 | Translation warning flag |
serialized_errors_list |
varchar(512) | YES | NULL | Serialized error details |
version_number |
int(11) | YES | 0 | Edit version counter |
edit_distance |
int(11) | YES | NULL | Edit distance from suggestion |