processors controller docs added to the repo. by predragmacura · Pull Request #209 · genestack/user-docs

predragmacura · 2026-06-25T21:46:20Z

Hey guys, the current user-docs structure is expanded with the Processors Controller docs. You can find all the new documents in the transformations section at the end of the advanced guide.

Please go through the documents carefully and verify that what's stated matches the actual codebase and current product behavior.

Please note that this is just a fraction of the documentation restructure project I'm currently working on. As part of that project, each document represents one of the four Diátaxis types (tutorial, how-to, reference, explanation).

MariaBorodaenko · 2026-06-26T12:33:23Z

@@ -0,0 +1,38 @@
+# About the Processors Controller
+
+A transformation takes an attached input file and turns its contents into ODM objects. Some transformations produce indexable metadata: for example, ODM cannot index a CSV file directly as a source of metadata, but the `metadata-basic` transformation converts it into TSV-based metadata objects that ODM can index. Others turn raw data into structured ODM objects: for example, the `hdf5-cells` transformation converts single-cell HDF5 (H5AD/H5) files into ODM Cell Groups and Expression Groups. Either way, a transformation bridges the gap between the file you have and the ODM objects you need.


Since the title is "About the Processors Controller", it makes sense to start with a sentence about the controller, not about transformations. Otherwise users may be confused, because they don't yet know the relationship between the two.

Also, I would suggest highlighting why it is needed and what problem it solves: there is no need to wrangle the files before ingesting them into ODM.

One last thing: HDF5/H5 files do not necessarily contain raw data. H5 files are usually raw at first, but are later processed and enriched with clusters, UMAP and/or PCA values, cell types, and other cell-metadata values. The expression values can also be normalised and the features properly annotated. Sometimes an HDF5 file contains both the original raw data and the annotated/processed data.

We recommend that users transform and index only the processed single-cell data from the HDF5 file, which is the most meaningful for further exploration and analysis.

Thus I would suggest stating here that we can transform both metadata (example with samples) and data (example with hdf5-cells).

MariaBorodaenko · 2026-06-26T13:00:37Z

@@ -0,0 +1,38 @@
+# About the Processors Controller
+
+A transformation takes an attached input file and turns its contents into ODM objects. Some transformations produce indexable metadata: for example, ODM cannot index a CSV file directly as a source of metadata, but the `metadata-basic` transformation converts it into TSV-based metadata objects that ODM can index. Others turn raw data into structured ODM objects: for example, the `hdf5-cells` transformation converts single-cell HDF5 (H5AD/H5) files into ODM Cell Groups and Expression Groups. Either way, a transformation bridges the gap between the file you have and the ODM objects you need.


Suggested change

A transformation takes an attached input file and turns its contents into ODM objects. Some transformations produce indexable metadata: for example, ODM cannot index a CSV file directly as a source of metadata, but the `metadata-basic` transformation converts it into TSV-based metadata objects that ODM can index. Others turn raw data into structured ODM objects: for example, the `hdf5-cells` transformation converts single-cell HDF5 (H5AD/H5) files into ODM Cell Groups and Expression Groups. Either way, a transformation bridges the gap between the file you have and the ODM objects you need.

The Processors Controller is the ODM API for running file transformations: it lets you discover what transformation images are available, configure how they process your data, and execute and monitor transformation jobs. Its purpose is to remove preprocessing as a barrier to ingestion - you attach the file you have and ODM transforms it into queryable objects. Transformations cover both metadata and data: `metadata-basic` converts a CSV into indexable sample objects, while `hdf5-cells` extracts cell-type annotations, dimensional-reduction results, and processed expression values from an H5AD or H5 file into ODM Cell Groups and Expression Groups.

MariaBorodaenko · 2026-06-26T13:08:26Z

+
+These three map onto a simple idea: an image is *what processing to do*, a configuration is *how to tune it*, and a job is *doing it once, against specific files*. The same image and configuration can drive many independent jobs.
+
+## Transformation configurations


I would suggest swapping the configurations and images sections. Starting with images makes more sense, since the transformation itself is defined in the image. Configurations are an optional, additional handle for "tuning" the transformation to your case, and only within the scope of what is already written in the image's code.

We could also state clearly that users cannot change the transformation image itself, but can adjust it within the supported scope via configurations.

MariaBorodaenko · 2026-06-26T13:10:35Z

+
+## Transformation images
+
+Transformation images are versioned container images (self-contained, ready-to-run packages of the processing software) that run the processing logic. Available image versions can be queried through the API. Each image handles a specific input/output format pair: for example, `metadata-basic` converts CSV files to TSV-based metadata objects (samples, libraries, preparations, cell metadata, expression, or variants); `hdf5-cells` converts H5AD or H5 single-cell files into ODM Cell Groups, Expression Groups, and associated metadata. When starting a job you can specify either `latest` or a specific release tag.


As far as I know, only transforming CSV files to samples is supported at the moment. Let's remove the other options.

MariaBorodaenko · 2026-06-26T13:18:02Z

+
+A job is not where your results are stored. As it runs, the transformation writes its output into ODM as ordinary objects, so once the job finishes those results are part of your ODM data like anything else.
+
+When a job finishes it stops running, and the resources that were processing it are released: nothing keeps running in the background. What stays behind is the job's record: its final status and full logs. These are kept indefinitely, with no expiry, and are never deleted. You retrieve them through the same API endpoints whether the job is still running or finished long ago, so from your side nothing about fetching a job's status or logs changes once it is done.


This is true for the transformation job itself, that is, the container created for it. But it is not true for ODM itself: the job is considered done once the import data job is finished, and internal indexing in ODM starts right after that. For some large single-cell data files, indexing can take hours and put a significant load on ODM resources. I would suggest rephrasing this part, since it creates the impression that a user can execute multiple jobs one after another, which can be problematic in practice.

MariaBorodaenko · 2026-06-26T13:27:50Z

+
+A job is not where your results are stored. As it runs, the transformation writes its output into ODM as ordinary objects, so once the job finishes those results are part of your ODM data like anything else.
+
+When a job finishes it stops running, and the resources that were processing it are released: nothing keeps running in the background. What stays behind is the job's record: its final status and full logs. These are kept indefinitely, with no expiry, and are never deleted. You retrieve them through the same API endpoints whether the job is still running or finished long ago, so from your side nothing about fetching a job's status or logs changes once it is done.


Suggested change

When a job finishes it stops running, and the resources that were processing it are released: nothing keeps running in the background. What stays behind is the job's record: its final status and full logs. These are kept indefinitely, with no expiry, and are never deleted. You retrieve them through the same API endpoints whether the job is still running or finished long ago, so from your side nothing about fetching a job's status or logs changes once it is done.

When a job finishes, the transformation is complete and its container is released. Be aware that ODM continues internal indexing after the job completes; the imported data may not be immediately available via the API until indexing is done. The job record with its final status and logs is retained in line with ODM's standard log retention policy and is retrievable through the same API endpoints, whether the job is still running or long finished.

MariaBorodaenko · 2026-06-26T13:29:21Z

+Transformations can be run in dry-run mode by setting `dry_run: true`. A dry run validates the configuration and input without writing any objects to ODM, which makes it the safest way to iterate on a configuration before committing results. For the steps to run a job in dry-run mode, see [How to run a transformation](how-to-run-a-transformation.md).
+
+!!! warning "Editorial TODO: resolve before publishing"
+    Confirm which transformation images actually implement dry-run (validate-without-write). The flag is accepted for all jobs, but honoring it is image-specific. If not universal, scope this wording (e.g. to single-cell).


Dry run is currently implemented for hdf5-cells only.
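
Since the flag is accepted for all jobs but honored only by `hdf5-cells`, a script that submits jobs may want a client-side guard so that a "dry run" against another image does not silently write data. A minimal sketch, assuming the supported set is exactly what is stated above; the function name and the hard-coded set are illustrative and not part of the API:

```python
# Images that honor dry_run, per the review comment above.
# This set is an assumption and must be updated if more images add support.
DRY_RUN_IMAGES = frozenset({"hdf5-cells"})


def check_dry_run(image_name: str, dry_run: bool) -> None:
    """Raise if a dry run is requested for an image that would ignore the flag."""
    if dry_run and image_name not in DRY_RUN_IMAGES:
        raise ValueError(
            f"dry_run is not implemented for image '{image_name}'; "
            "the job would write objects to ODM"
        )
```

Calling `check_dry_run` before building the request keeps the decision in one place, which makes it easy to relax once other images implement dry-run.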

MariaBorodaenko · 2026-06-26T13:32:41Z

+
+## Transformation logs
+
+Each job produces a log recording processing steps, warnings, errors, the source file name and accession, and the accessions of any ODM objects it created. Logs are retained permanently and are always retrievable through the API: the logs endpoint returns the live log while the job runs and the archived log once it has finished, transparently. A finished job's log is never unavailable. Separately, the log is also uploaded into ODM as an attachment on the owning study, so it sits alongside the job's other generated files; this attachment is an additional copy and is not what keeps the log available.


I would suggest avoiding overly strict wording like "permanently", "always", "never unavailable".

Regarding the copy loaded as attachment, we hope to remove this in the next week under the ticket https://genestack.atlassian.net/browse/BIA-151, I will keep you updated.

MariaBorodaenko · 2026-06-26T13:58:33Z

+
+- An API token. See [Authentication and tokens](../getting-a-genestack-api-token.md).
+- Curator group membership.
+- Every study that contains an attachment listed in the job's `input_accessions` must be shared with you. Requests that reference an attachment you cannot access are rejected with a generic `Item not found or insufficient permission` message.


Please note that this is to be changed after implementing https://genestack.atlassian.net/browse/ODM-13244 (case 1.4)

MariaBorodaenko · 2026-06-26T14:07:24Z

+GET /api/v1/transformations/images
+```
+
+Note the `name` and `version` of the image you want to use. Use `"latest"` for the most recent version, or a specific release tag (for example, `"0.0.7"`) for reproducibility in production pipelines. See [Available images reference](available-images-reference.md) for the full catalogue and per-image guidance.


Suggested change

Note the `name` and `version` of the image you want to use. Use `"latest"` for the most recent version, or a specific release tag (for example, `"0.0.7"`) for reproducibility in production pipelines. See [Available images reference](available-images-reference.md) for the full catalogue and per-image guidance.

Note the `name` and `version` of the image you want to use. The `version` field is optional - if omitted or set to `"latest"`, the most recent version is used automatically. Specify an explicit version tag (for example, `"0.0.7"`) for reproducibility in production pipelines. See [Available images reference](available-images-reference.md) for the full catalogue and per-image guidance.
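
The optional `version` behaviour could be illustrated with a small helper. This is only a sketch: the `name` and `version` keys come from the text above, but the helper name is hypothetical and where this object sits in the job request body is not shown here, so treat that as an assumption.

```python
from typing import Optional


def image_reference(name: str, version: Optional[str] = None) -> dict:
    """Build an image reference with an optional version (hypothetical helper)."""
    ref = {"name": name}
    # Omitting the version and passing "latest" are described as equivalent,
    # so only an explicit release tag is written into the reference.
    if version is not None and version != "latest":
        ref["version"] = version
    return ref
```

With this, `image_reference("hdf5-cells")` and `image_reference("hdf5-cells", "latest")` produce the same reference, while `image_reference("hdf5-cells", "0.0.7")` pins the job to that tag.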

MariaBorodaenko · 2026-06-26T14:15:30Z

+- An API token. See [Authentication and tokens](../getting-a-genestack-api-token.md).
+- Curator group membership.
+- Every study that contains an attachment listed in the job's `input_accessions` must be shared with you. Requests that reference an attachment you cannot access are rejected with a generic `Item not found or insufficient permission` message.
+- The source attachment already uploaded to a study in ODM (you need its accession). See [Import attached files](../import-data-in-odm.md).


Suggested change

- The source attachment already uploaded to a study in ODM (you need its accession). See [Import attached files](../import-data-in-odm.md).

- The source attachment already uploaded to a study in ODM (you need its accession). See [Import attached files](../import-data-in-odm.md#attach-a-file).

MariaBorodaenko · 2026-06-26T14:19:00Z

+By default the job runs against the latest version of the configuration. To pin a specific version (for example, to reproduce an earlier job), add a `version` to the `configuration_reference` object:
+
+```json
+"configuration_reference": {
+  "id": <config_id>,
+  "version": <version>
+}
+```
+
+> **[Subject to change (ODM-13233)]** The exact name and shape of the `configuration_reference` field are not yet finalized. Verify against the released API before relying on it.


I would suggest to move the part starting with "By default the job runs against the latest version of the configuration..." to the previous section: ## Step 2: Create or identify a configuration
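
Wherever this part ends up, a short client-side example may help readers see the default-versus-pinned behaviour. A sketch, assuming the `configuration_reference` shape from the JSON above (which is flagged as subject to change under ODM-13233) and assuming integer `id` and `version` values; the helper name is hypothetical:

```python
from typing import Optional


def configuration_reference(config_id: int, version: Optional[int] = None) -> dict:
    """Build the configuration_reference object for a job request.

    The field names mirror the snippet in the docs; they are not finalized.
    """
    ref = {"id": config_id}
    # Leaving out "version" means the job runs against the latest version.
    if version is not None:
        ref["version"] = version
    return ref
```

Passing only the `id` follows the default (latest version); passing a `version` as well reproduces an earlier job.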

MariaBorodaenko · 2026-06-26T14:48:43Z

+`volume_size` is a Kubernetes resource quantity string (for example `"30Gi"` for 30 GiB, or `"512Mi"`), not a plain number: the request is rejected if the value is not a valid quantity or is zero. As a guideline: for H5AD input files, allocate at least 1.4× the original file size; for 10x H5 input files, at least 4× the original file size; for CSV files, a small value such as `"30Gi"` is sufficient. If you omit `volume_size`, the image's default is used (falling back to `"30Gi"`).
+
+The body also accepts an optional `memory_size` quantity string (for example `"512Mi"`). Increase it if a job ends in `FAILED` with `status.reason: OOMKilled`.


Suggested change

`volume_size` is a Kubernetes resource quantity string (for example `"30Gi"` for 30 GiB, or `"512Mi"`), not a plain number: the request is rejected if the value is not a valid quantity or is zero. As a guideline: for H5AD input files, allocate at least 1.4× the original file size; for 10x H5 input files, at least 4× the original file size; for CSV files, a small value such as `"30Gi"` is sufficient. If you omit `volume_size`, the image's default is used (falling back to `"30Gi"`).

The body also accepts an optional `memory_size` quantity string (for example `"512Mi"`). Increase it if a job ends in `FAILED` with `status.reason: OOMKilled`.

Two optional parameters control resource allocation for the job: `volume_size` and `memory_size`.

The first, `volume_size`, sets the disk space allocated for processing. It must be a Kubernetes resource quantity string, for example, `"4Gi"` for 4 GiB or `"512Mi"` for 512 MiB. The request is rejected if the value is not a valid quantity or is zero. As a guideline: for H5AD input files, allocate at least 1.4× the original file size; for 10x H5 input files, at least 4×; for CSV files, a small value is typically sufficient.

`memory_size` sets the RAM allocated for processing, using the same quantity format, for example, `"512Mi"`. Increase it if a job ends in `FAILED` with `status.reason: OOMKilled` (out-of-memory termination).

For default values for both parameters, see [Available images reference](/transformations/available-images-reference.md).
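
The sizing guideline can also be expressed as a small calculation. The sketch below is not part of the API: the format labels `"h5ad"` and `"10x_h5"` and the function name are hypothetical, and rounding up to whole GiB is a design choice made here so the result never falls below the guideline and is never zero.

```python
# Multipliers from the guideline, stored as integer ratios to avoid
# floating-point rounding: 1.4x for H5AD, 4x for 10x H5.
SIZE_FACTORS = {"h5ad": (14, 10), "10x_h5": (4, 1)}

GIB = 1024 ** 3  # bytes in one GiB


def suggest_volume_size(file_size_bytes: int, input_format: str) -> str:
    """Return a Kubernetes quantity string (for example "6Gi") for an input file."""
    if file_size_bytes <= 0:
        raise ValueError("file size must be positive")
    numerator, denominator = SIZE_FACTORS[input_format]
    # Ceiling division: round the required space up to the next whole GiB.
    gib = -(-file_size_bytes * numerator // (denominator * GIB))
    return f"{gib}Gi"
```

For a 4 GiB H5AD file this yields `"6Gi"` (4 × 1.4 = 5.6, rounded up), and for a 2 GiB 10x H5 file it yields `"8Gi"`.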

MariaBorodaenko · 2026-06-26T15:12:50Z

+
+- Configuration validation messages.
+- The file structure report: which metadata keys are present in your input file.
+- Linking validation results: whether cell batch values resolve to existing ODM objects.


Suggested change

- Linking validation results: whether cell batch values resolve to existing ODM objects.

- Linking validation results: whether the transformation output can be linked to existing ODM objects.

MariaBorodaenko · 2026-06-26T15:46:54Z

+
+## Step 6: Submit the full run
+
+Once the dry run completes without issues, resubmit with `dry_run` set to `false`:


Suggested change

Once the dry run completes without issues, resubmit with `dry_run` set to `false`:

Once the dry run completes without issues, resubmit the job. You can either set `dry_run` to `false` or omit it entirely — it defaults to `false`:
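
For readers who script the resubmission, deriving the full-run body from the dry-run body keeps the two requests otherwise identical. A minimal sketch, assuming the request body is a plain JSON object held as a Python dict; the helper name is hypothetical and the accession in the usage note is a placeholder:

```python
def full_run_body(dry_run_body: dict) -> dict:
    """Return a copy of the dry-run request body with dry_run removed.

    Omitting dry_run is equivalent to sending false, since false is the default.
    """
    body = dict(dry_run_body)  # shallow copy; the original body is left untouched
    body.pop("dry_run", None)
    return body
```

For example, `full_run_body({"input_accessions": ["<accession>"], "dry_run": True})` returns the same body without the `dry_run` key.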

MariaBorodaenko · 2026-06-26T15:48:50Z

+}
+```
+
+Monitor and review logs the same way as Steps 4–5. When the job completes, the logs contain the ODM accessions of all objects that were created or updated. Logs are uploaded as an attachment to the same study.


Suggested change

Monitor and review logs the same way as Steps 4–5. When the job completes, the logs contain the ODM accessions of all objects that were created or updated. Logs are uploaded as an attachment to the same study.

Monitor and review logs the same way as Steps 4–5. When the job completes, the logs contain the ODM accessions of all objects that were created or updated.

This again should be changed under https://genestack.atlassian.net/browse/BIA-151

MariaBorodaenko · 2026-06-26T15:50:46Z

+
+## Use-case guides
+
+- For single-cell HDF5 ingestion, see [single-cell/single-cell-getting-started.md](single-cell/single-cell-getting-started.md).


Suggested change

- For single-cell HDF5 ingestion, see [single-cell/single-cell-getting-started.md](single-cell/single-cell-getting-started.md).

- For single-cell HDF5 ingestion, see [Single-cell data in ODM: Getting started](single-cell/single-cell-getting-started.md).

MariaBorodaenko · 2026-06-26T15:51:31Z

+## Use-case guides
+
+- For single-cell HDF5 ingestion, see [single-cell/single-cell-getting-started.md](single-cell/single-cell-getting-started.md).
+- For CSV-to-TSV conversion, see [csv-to-tsv/how-to-transform-csv-to-tsv.md](csv-to-tsv/how-to-transform-csv-to-tsv.md).


Suggested change

- For CSV-to-TSV conversion, see [csv-to-tsv/how-to-transform-csv-to-tsv.md](csv-to-tsv/how-to-transform-csv-to-tsv.md).

- For CSV-to-TSV conversion, see [CSV to Sample Group](csv-to-tsv/how-to-transform-csv-to-tsv.md).

MariaBorodaenko · 2026-06-26T16:17:47Z

@@ -0,0 +1,104 @@
+# How to manage transformation configurations
+
+This guide shows you how to develop a transformation configuration and iterate on it until it produces the results you want. A configuration is a reusable, versioned JSON document that tells an image how to process your input. The workflow below takes you from a first draft, through dry-run testing, to a configuration you can run for real and reuse across jobs.


Suggested change

This guide shows you how to develop a transformation configuration and iterate on it until it produces the results you want. A configuration is a reusable, versioned JSON document that tells an image how to process your input. The workflow below takes you from a first draft, through dry-run testing, to a configuration you can run for real and reuse across jobs.

This guide shows you how to develop a transformation configuration and iterate on it until it produces the results you want. A configuration is a reusable, versioned JSON document that tells an image how to process your input. The workflow below takes you from a first draft, through dry-run testing, to a validated configuration you can run against your data and reuse across jobs.

MariaBorodaenko · 2026-06-26T16:28:26Z

+
+## The iteration loop
+
+Developing a configuration is a loop. You create a first draft, submit it as a dry-run job, review the logs, and update the configuration to fix whatever the dry run surfaced, repeating until the dry run is clean. Only then do you submit a full run.


Suggested change

Developing a configuration is a loop. You create a first draft, submit it as a dry-run job, review the logs, and update the configuration to fix whatever the dry run surfaced, repeating until the dry run is clean. Only then do you submit a full run.

Developing a configuration is a loop. You create a first draft, submit it as a dry-run job, review the logs, and update the configuration based on the results, repeating until the dry run completes without issues and produces the output you expect. Only then do you submit a full run. Once the configuration is working, you can reuse it for any input file with the same structure.

MariaBorodaenko · 2026-06-26T16:35:34Z

+
+The request body requires `data`: the image-specific processing specification. `name` and `description` are optional but recommended, since the list and get responses surface them so you can identify the configuration later.
+
+For the `metadata-basic` image (CSV to Sample group), a minimal request looks like this:


Do we have configurations for metadata-basic at all? Can they affect anything in the transformation image flow?

MariaBorodaenko · 2026-06-26T16:36:42Z

+}
+```
+
+Keep that `id`: you use it to retrieve, update, and submit jobs against the configuration. For the single-cell HDF5 `hdf5-cells` image, the `data` field follows a different schema. See the [Configuration Reference](single-cell/configuration-reference.md).


Suggested change

Keep that `id`: you use it to retrieve, update, and submit jobs against the configuration. For the single-cell HDF5 `hdf5-cells` image, the `data` field follows a different schema. See the [Configuration Reference](single-cell/configuration-reference.md).

Keep that `id`: you use it to retrieve, update, and reference the configuration in job submissions. For the single-cell HDF5 `hdf5-cells` image, the `data` field follows a different schema. See the [Configuration Reference](single-cell/configuration-reference.md).

MariaBorodaenko · 2026-06-26T16:40:01Z

+
+## Submit a dry run and review the logs
+
+Submit the configuration as a dry-run job, then read the job logs to see how it behaved against your real input without writing any data. The job-submission and log-retrieval endpoints are covered in [How to run a transformation](how-to-run-a-transformation.md).


Suggested change

Submit the configuration as a dry-run job, then read the job logs to see how it behaved against your real input without writing any data. The job-submission and log-retrieval endpoints are covered in [How to run a transformation](how-to-run-a-transformation.md).

Submit a dry-run job referencing this configuration against your input file, then review the logs to verify it behaves as expected, without writing any data to ODM. The job-submission and log-retrieval endpoints are covered in [How to run a transformation](how-to-run-a-transformation.md).

MariaBorodaenko · 2026-06-26T16:47:26Z

+PUT /api/v1/transformations/configurations/{id}
+```
+
+The request body follows the same structure as the `POST` endpoint. Updating does not overwrite the configuration: the current state is archived as a previous version and the active version is incremented. The same `id` is reused across all iterations, and every earlier version stays retrievable, so you can audit or re-run a job with the exact parameters used in the past.


Suggested change

The request body follows the same structure as the `POST` endpoint. Updating does not overwrite the configuration: the current state is archived as a previous version and the active version is incremented. The same `id` is reused across all iterations, and every earlier version stays retrievable, so you can audit or re-run a job with the exact parameters used in the past.

The request body follows the same structure as the `POST` endpoint. Updating does not overwrite the configuration: the current state is saved as a previous version and the active version is incremented. The same `id` is reused across all iterations, and any version can be referenced in a job - by default the latest is used.

MariaBorodaenko · 2026-06-26T16:48:25Z

+
+The request body follows the same structure as the `POST` endpoint. Updating does not overwrite the configuration: the current state is archived as a previous version and the active version is incremented. The same `id` is reused across all iterations, and every earlier version stays retrievable, so you can audit or re-run a job with the exact parameters used in the past.
+
+Resubmit the dry-run job against the same configuration and review the logs again. Repeat until the dry run completes without errors or warnings that require action, then submit the full run.


Suggested change

Resubmit the dry-run job against the same configuration and review the logs again. Repeat until the dry run completes without errors or warnings that require action, then submit the full run.

Resubmit the dry-run job with the updated configuration and review the logs again. Repeat until the dry run completes without issues and produces the output you expect, then submit the full run.

MariaBorodaenko · 2026-06-26T16:50:28Z

+
+## Review your configurations
+
+At any point you can inspect what you have. To list your configurations:


Suggested change

At any point you can inspect what you have. To list your configurations:

At any point you can inspect all available configurations. To list them:

MariaBorodaenko · 2026-06-26T16:55:35Z

+GET /api/v1/transformations/configurations
+```
+
+The response is a paginated envelope: the configurations are in the `items` array, and `limit`/`offset` query parameters page through the results (default 100 per page). The list returns the latest version of each configuration, including its full `data`, so you can review the current state of each one without a second request. See [Pagination](api-reference.md#pagination).


This part will be updated after https://genestack.atlassian.net/browse/ODM-13238 is tested.

A new parameter "include_archived" will be added, false by default. Only active configs will be retrieved by default, archived configs can be included intentionally by setting "include_archived" to "true".
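
Independently of that change, the `limit`/`offset` paging described in the docs could be shown with a short loop. A sketch under stated assumptions: `fetch_page` is a hypothetical callable that performs the actual `GET /api/v1/transformations/configurations` request and returns the parsed envelope, and the loop stops on the first short page because the envelope's other fields are not described here. Once `include_archived` is released, it could be added inside `fetch_page`.

```python
from typing import Callable, Iterator


def iter_configurations(
    fetch_page: Callable[[int, int], dict], limit: int = 100
) -> Iterator[dict]:
    """Yield every configuration by walking the paginated envelope.

    fetch_page(limit, offset) must return a dict with an "items" list.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page.get("items", [])
        yield from items
        # A page shorter than the limit means there is nothing left to fetch.
        if len(items) < limit:
            return
        offset += limit
```

Keeping the HTTP call behind `fetch_page` means the paging logic can be exercised with a stub before pointing it at a real ODM instance.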

MariaBorodaenko · 2026-06-26T16:59:31Z

+
+## Reuse a working configuration
+
+Configurations are reusable. Once a configuration is working correctly, you can apply it to multiple input files in subsequent jobs without recreating it.


Suggested change

Configurations are reusable. Once a configuration is working correctly, you can apply it to multiple input files in subsequent jobs without recreating it.

Once you have validated a configuration through dry-run testing, it becomes the foundation of your ingestion pipeline: the same configuration can be applied to any number of input files that share the same structure or come from the same source, without any further setup. This makes it straightforward to automate ingestion, for example, to process a batch of files or integrate transformation jobs into a recurring pipeline.

MariaBorodaenko · 2026-06-26T17:09:30Z

@@ -0,0 +1,70 @@
+# Available transformation images reference
+
+Each transformation image handles a specific input/output format pair. Use `GET /api/v1/transformations/images` to retrieve the current list of available images and their versions at runtime.


Suggested change

Each transformation image handles a specific input/output format pair. Use `GET /api/v1/transformations/images` to retrieve the current list of available images and their versions at runtime.

Each transformation image defines what input formats it accepts and what ODM objects it produces. Use `GET /api/v1/transformations/images` to retrieve the current list of available images and their versions.

MariaBorodaenko · 2026-06-26T17:11:12Z

+}
+```
+
+For the full how-to, see [csv-to-tsv/how-to-transform-csv-to-tsv.md](csv-to-tsv/how-to-transform-csv-to-tsv.md).


Suggested change

For the full how-to, see [csv-to-tsv/how-to-transform-csv-to-tsv.md](csv-to-tsv/how-to-transform-csv-to-tsv.md).

For the full how-to, see [CSV to Sample Group](csv-to-tsv/how-to-transform-csv-to-tsv.md).

MariaBorodaenko · 2026-06-26T17:13:10Z

+
+**Available versions:** `latest`
+
+**Use case:** Converts a CSV file attached to a study into an ODM Sample metadata group. The configuration `data` field specifies the source format and the destination entity type.


Let's review if configuration can affect anything for the metadata-basic image.

MariaBorodaenko · 2026-06-26T17:15:38Z

+
+**Default volume:** `5Gi`
+
+**Available versions:** `latest`


I would suggest to remove the field, it does not provide any details or helpful information.

MariaBorodaenko · 2026-06-26T17:15:46Z

+
+**Default volume:** `5Gi`
+
+**Available versions:** `latest`


I would suggest to remove the field, it does not provide any details or helpful information.

MariaBorodaenko · 2026-06-26T17:19:15Z

+## Version conventions
+
+- `latest` is an alias for the most recent stable version of an image. Use it for exploration and development.
+- Specific tags (for example, `0.0.7`) pin the job to a particular image version. Use them in production pipelines where reproducibility matters.


Suggested change

- Specific tags (for example, `0.0.7`) pin the job to a particular image version. Use them in production pipelines where reproducibility matters.

- Specific tags (for example, `0.0.7`) pin the job to a particular image version. Use them when you need to reproduce a previous result - the exact image and version used in any job are recorded in its logs.

MariaBorodaenko · 2026-06-26T17:20:26Z

+## Known limitations
+
+Only one transformation process can be run per attachment.


This limitation was lifted by a recent update to the ODM multipart endpoints, so this section can be removed: https://genestack.atlassian.net/browse/ODM-13307

MariaBorodaenko · 2026-06-26T17:22:41Z

+
+**Input formats:** H5AD (AnnData), 10x Genomics H5 (converted internally to H5AD before processing), Legacy 10x Genomics H5 v<3 (single-genome only; multi-genome legacy files are not supported).
+
+**Output formats:** ODM Cell Group, Expression Group, and attachments, with optional Sample, Library, and Preparation groups.


Suggested change

**Output formats:** ODM Cell Group, Expression Group, and attachments, with optional Sample, Library, and Preparation groups.

**Output formats:** ODM Cell Group, Expression Group, with optional Sample, Library, and Preparation groups.

processors controller docs added to the repo.

24481e1

predragmacura requested review from MariaBorodaenko, MikhailAf, eeliane and genestack-okunitsyn June 25, 2026 21:46

predragmacura requested review from a team as code owners June 25, 2026 21:46

MariaBorodaenko reviewed Jun 26, 2026


MariaBorodaenko requested changes Jun 26, 2026

View reviewed changes

		@@ -0,0 +1,38 @@
		# About the Processors Controller

		A transformation takes an attached input file and turns its contents into ODM objects. Some transformations produce indexable metadata: for example, ODM cannot index a CSV file directly as a source of metadata, but the `metadata-basic` transformation converts it into TSV-based metadata objects that ODM can index. Others turn raw data into structured ODM objects: for example, the `hdf5-cells` transformation converts single-cell HDF5 (H5AD/H5) files into ODM Cell Groups and Expression Groups. Either way, a transformation bridges the gap between the file you have and the ODM objects you need.


		These three map onto a simple idea: an image is what processing to do, a configuration is how to tune it, and a job is doing it once, against specific files. The same image and configuration can drive many independent jobs.

		## Transformation configurations


		## Transformation images

		Transformation images are versioned container images (self-contained, ready-to-run packages of the processing software) that run the processing logic. Available image versions can be queried through the API. Each image handles a specific input/output format pair: for example, `metadata-basic` converts CSV files to TSV-based metadata objects (samples, libraries, preparations, cell metadata, expression, or variants); `hdf5-cells` converts H5AD or H5 single-cell files into ODM Cell Groups, Expression Groups, and associated metadata. When starting a job you can specify either `latest` or a specific release tag.


		A job is not where your results are stored. As it runs, the transformation writes its output into ODM as ordinary objects, so once the job finishes those results are part of your ODM data like anything else.

		When a job finishes it stops running, and the resources that were processing it are released: nothing keeps running in the background. What stays behind is the job's record: its final status and full logs. These are kept indefinitely, with no expiry, and are never deleted. You retrieve them through the same API endpoints whether the job is still running or finished long ago, so from your side nothing about fetching a job's status or logs changes once it is done.


		## Transformation logs

		Each job produces a log recording processing steps, warnings, errors, the source file name and accession, and the accessions of any ODM objects it created. Logs are retained permanently and are always retrievable through the API: the logs endpoint returns the live log while the job runs and the archived log once it has finished, transparently. A finished job's log is never unavailable. Separately, the log is also uploaded into ODM as an attachment on the owning study, so it sits alongside the job's other generated files; this attachment is an additional copy and is not what keeps the log available.

	Note the `name` and `version` of the image you want to use. Use `"latest"` for the most recent version, or a specific release tag (for example, `"0.0.7"`) for reproducibility in production pipelines. See [Available images reference](available-images-reference.md) for the full catalogue and per-image guidance.
	Note the `name` and `version` of the image you want to use. The `version` field is optional: if it is omitted or set to `"latest"`, the most recent version is used automatically. Specify an explicit version tag (for example, `"0.0.7"`) for reproducibility in production pipelines. See [Available images reference](available-images-reference.md) for the full catalogue and per-image guidance.

	- The source attachment already uploaded to a study in ODM (you need its accession). See [Import attached files](../import-data-in-odm.md).
	- The source attachment already uploaded to a study in ODM (you need its accession). See [Import attached files](../import-data-in-odm.md#attach-a-file).

		`volume_size` is a Kubernetes resource quantity string (for example `"30Gi"` for 30 GiB, or `"512Mi"`), not a plain number: the request is rejected if the value is not a valid quantity or is zero. As a guideline: for H5AD input files, allocate at least 1.4× the original file size; for 10x H5 input files, at least 4× the original file size; for CSV files, a small value such as `"30Gi"` is sufficient. If you omit `volume_size`, the image's default is used (falling back to `"30Gi"`).

		The body also accepts an optional `memory_size` quantity string (for example `"512Mi"`). Increase it if a job ends in `FAILED` with `status.reason: OOMKilled`.

-`volume_size` is a Kubernetes resource quantity string (for example `"30Gi"` for 30 GiB, or `"512Mi"`), not a plain number: the request is rejected if the value is not a valid quantity or is zero. As a guideline: for H5AD input files, allocate at least 1.4× the original file size; for 10x H5 input files, at least 4× the original file size; for CSV files, a small value such as `"30Gi"` is sufficient. If you omit `volume_size`, the image's default is used (falling back to `"30Gi"`).
-The body also accepts an optional `memory_size` quantity string (for example `"512Mi"`). Increase it if a job ends in `FAILED` with `status.reason: OOMKilled`.
+Two optional parameters control resource allocation for the job: `volume_size` and `memory_size`.
+The first, `volume_size`, sets the disk space allocated for processing. It must be a Kubernetes resource quantity string, for example, `"4Gi"` for 4 GiB or `"512Mi"` for 512 MiB. The request is rejected if the value is not a valid quantity or is zero. As a guideline: for H5AD input files, allocate at least 1.4× the original file size; for 10x H5 input files, at least 4×; for CSV files, a small value is typically sufficient.
+`memory_size` sets the RAM allocated for processing, using the same quantity format, for example, `"512Mi"`. Increase it if a job ends in `FAILED` with `status.reason: OOMKilled` (out-of-memory termination).
+For the default values of both parameters, see [Available images reference](/transformations/available-images-reference.md).
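
The sizing guideline above can be turned into a small helper. This is a minimal sketch, not part of the product: the function name and the `"h5ad"` / `"10x_h5"` labels are hypothetical, and only the 1.4× and 4× ratios come from the text. It rounds up to whole GiB and never returns zero, since a zero quantity is rejected.

```python
import math
from fractions import Fraction

GIB = 2**30

# Ratios from the guideline above. The keys are hypothetical labels, not API values.
MULTIPLIERS = {"h5ad": Fraction(14, 10), "10x_h5": Fraction(4)}


def suggest_volume_size(file_bytes: int, kind: str) -> str:
    """Return a Kubernetes quantity string in whole GiB, rounded up."""
    if file_bytes <= 0:
        raise ValueError("file_bytes must be positive")
    needed = file_bytes * MULTIPLIERS[kind]  # exact arithmetic, no float rounding
    gib = max(1, math.ceil(needed / GIB))    # at least 1Gi, never zero
    return f"{gib}Gi"


print(suggest_volume_size(4 * GIB, "h5ad"))  # 4 GiB H5AD input
```

Treat the result as a lower bound; the guideline says "at least", so rounding up further is safe.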

	- Linking validation results: whether cell batch values resolve to existing ODM objects.
	- Linking validation results: whether the transformation output can be linked to existing ODM objects.


		## Step 6: Submit the full run

		Once the dry run completes without issues, resubmit with `dry_run` set to `false`:

	Monitor and review logs the same way as Steps 4–5. When the job completes, the logs contain the ODM accessions of all objects that were created or updated. Logs are uploaded as an attachment to the same study.
	Monitor and review logs the same way as Steps 4–5. When the job completes, the logs contain the ODM accessions of all objects that were created or updated.
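
As a rough illustration of Step 6, the full-run body can be derived from the dry-run body by flipping `dry_run` only. The helper name is hypothetical, and the example body shows just the fields named in this guide (`dry_run`, `volume_size`, `memory_size`); a real request carries additional fields that are not reproduced here.

```python
import copy


def to_full_run(dry_run_body: dict) -> dict:
    """Return a copy of a dry-run request body with dry_run set to False."""
    if dry_run_body.get("dry_run") is not True:
        raise ValueError("expected a request body with dry_run set to true")
    body = copy.deepcopy(dry_run_body)  # leave the caller's dict untouched
    body["dry_run"] = False
    return body


dry = {"dry_run": True, "volume_size": "30Gi", "memory_size": "512Mi"}
full = to_full_run(dry)
print(full["dry_run"])
```

Copying rather than mutating keeps the dry-run body available if you need to go back and test again.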


		## Use-case guides

		- For single-cell HDF5 ingestion, see [single-cell/single-cell-getting-started.md](single-cell/single-cell-getting-started.md).

	- For CSV-to-TSV conversion, see [csv-to-tsv/how-to-transform-csv-to-tsv.md](csv-to-tsv/how-to-transform-csv-to-tsv.md).
	- For CSV-to-TSV conversion, see [CSV to Sample Group](csv-to-tsv/how-to-transform-csv-to-tsv.md).

		@@ -0,0 +1,104 @@
		# How to manage transformation configurations

		This guide shows you how to develop a transformation configuration and iterate on it until it produces the results you want. A configuration is a reusable, versioned JSON document that tells an image how to process your input. The workflow below takes you from a first draft, through dry-run testing, to a configuration you can run for real and reuse across jobs.


		## The iteration loop

		Developing a configuration is a loop. You create a first draft, submit it as a dry-run job, review the logs, and update the configuration to fix whatever the dry run surfaced, repeating until the dry run is clean. Only then do you submit a full run.
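
		The loop described above can be sketched in Python. This is only an outline under assumptions: `dry_run` and `revise` are hypothetical callables that you would implement yourself (the first submits a dry-run job and returns the issues found in its logs, the second returns an updated configuration), and `max_rounds` is an arbitrary safety limit.

```python
from typing import Callable, Dict, List


def iterate_until_clean(
    config: Dict,
    dry_run: Callable[[Dict], List[str]],
    revise: Callable[[Dict, List[str]], Dict],
    max_rounds: int = 5,
) -> Dict:
    """Repeat dry run -> review -> update until the dry run reports no issues."""
    for _ in range(max_rounds):
        issues = dry_run(config)          # hypothetical: submit and read the logs
        if not issues:
            return config                 # clean: ready for a full run
        config = revise(config, issues)   # hypothetical: fix what the logs surfaced
    raise RuntimeError(f"dry run still reports issues after {max_rounds} rounds")
```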


		The request body requires `data`: the image-specific processing specification. `name` and `description` are optional but recommended, since the list and get responses surface them so you can identify the configuration later.

		For the `metadata-basic` image (CSV to Sample group), a minimal request looks like this:

	Keep that `id`: you use it to retrieve, update, and submit jobs against the configuration. For the single-cell HDF5 `hdf5-cells` image, the `data` field follows a different schema. See the [Configuration Reference](single-cell/configuration-reference.md).
	Keep that `id`: you use it to retrieve, update, and reference the configuration in job submissions. For the single-cell HDF5 `hdf5-cells` image, the `data` field follows a different schema. See the [Configuration Reference](single-cell/configuration-reference.md).


		## Submit a dry run and review the logs

		Submit the configuration as a dry-run job, then read the job logs to see how it behaved against your real input without writing any data. The job-submission and log-retrieval endpoints are covered in [How to run a transformation](how-to-run-a-transformation.md).

	Submit the configuration as a dry-run job, then read the job logs to see how it behaved against your real input without writing any data. The job-submission and log-retrieval endpoints are covered in [How to run a transformation](how-to-run-a-transformation.md).
	Submit a dry-run job referencing this configuration against your input file, then review the logs to verify it behaves as expected, without writing any data to ODM. The job-submission and log-retrieval endpoints are covered in [How to run a transformation](how-to-run-a-transformation.md).

	The request body follows the same structure as the `POST` endpoint. Updating does not overwrite the configuration: the current state is archived as a previous version and the active version is incremented. The same `id` is reused across all iterations, and every earlier version stays retrievable, so you can audit or re-run a job with the exact parameters used in the past.
	The request body follows the same structure as the `POST` endpoint. Updating does not overwrite the configuration: the current state is saved as a previous version and the active version is incremented. The same `id` is reused across all iterations, and any version can be referenced in a job; by default, the latest is used.
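
The versioning behaviour described here can be pictured with a small in-memory model. This is an illustration only, not how the service is implemented: the class, its field names, and the assumption that versions start at 1 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VersionedConfig:
    id: str
    data: Dict
    version: int = 1  # assumed starting value
    previous: Dict[int, Dict] = field(default_factory=dict)

    def update(self, new_data: Dict) -> None:
        """Save the current state as a previous version, then bump the active one."""
        self.previous[self.version] = self.data
        self.version += 1
        self.data = new_data

    def get(self, version: Optional[int] = None) -> Dict:
        """Return the requested version; the latest one when none is given."""
        if version is None or version == self.version:
            return self.data
        return self.previous[version]


cfg = VersionedConfig(id="cfg-1", data={"draft": 1})
cfg.update({"draft": 2})
print(cfg.id, cfg.version)
```

The `id` never changes across updates, which is what lets a job reference either the latest state or an earlier version of the same configuration.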


		The request body follows the same structure as the `POST` endpoint. Updating does not overwrite the configuration: the current state is archived as a previous version and the active version is incremented. The same `id` is reused across all iterations, and every earlier version stays retrievable, so you can audit or re-run a job with the exact parameters used in the past.

		Resubmit the dry-run job against the same configuration and review the logs again. Repeat until the dry run completes without errors or warnings that require action, then submit the full run.

	Resubmit the dry-run job against the same configuration and review the logs again. Repeat until the dry run completes without errors or warnings that require action, then submit the full run.
	Resubmit the dry-run job with the updated configuration and review the logs again. Repeat until the dry run completes without issues and produces the output you expect, then submit the full run.


		## Review your configurations

		At any point you can inspect what you have. To list your configurations:

	At any point you can inspect what you have. To list your configurations:
	At any point you can inspect all available configurations. To list them:


		## Reuse a working configuration

		Configurations are reusable. Once a configuration is working correctly, you can apply it to multiple input files in subsequent jobs without recreating it.

		@@ -0,0 +1,70 @@
		# Available transformation images reference

		Each transformation image handles a specific input/output format pair. Use `GET /api/v1/transformations/images` to retrieve the current list of available images and their versions at runtime.

	Each transformation image handles a specific input/output format pair. Use `GET /api/v1/transformations/images` to retrieve the current list of available images and their versions at runtime.
	Each transformation image defines what input formats it accepts and what ODM objects it produces. Use `GET /api/v1/transformations/images` to retrieve the current list of available images and their versions.
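
	A minimal sketch of preparing that call from Python with the standard library. Only the endpoint path comes from the text above; the base URL, any authentication headers, and the response shape (a list of entries with `name` and `version` keys) are assumptions to check against your deployment. The request is built but not sent.

```python
from typing import Dict, List, Optional
from urllib.request import Request

IMAGES_PATH = "/api/v1/transformations/images"


def build_images_request(base_url: str, headers: Optional[Dict[str, str]] = None) -> Request:
    """Build (but do not send) the GET request that lists available images."""
    return Request(base_url.rstrip("/") + IMAGES_PATH, headers=headers or {}, method="GET")


def versions_for(images: List[Dict], name: str) -> List[str]:
    """Collect versions of one image from an assumed [{"name": ..., "version": ...}] list."""
    return sorted(img["version"] for img in images if img["name"] == name)


req = build_images_request("https://odm.example.com/")  # hypothetical host
print(req.get_method(), req.full_url)
```

To send the request, pass `req` to `urllib.request.urlopen` and decode the JSON body before calling `versions_for`.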

	For the full how-to, see [csv-to-tsv/how-to-transform-csv-to-tsv.md](csv-to-tsv/how-to-transform-csv-to-tsv.md).
	For the full how-to, see [CSV to Sample Group](csv-to-tsv/how-to-transform-csv-to-tsv.md).


		Available versions: `latest`

		Use case: Converts a CSV file attached to a study into an ODM Sample metadata group. The configuration `data` field specifies the source format and the destination entity type.