Multiple doc types by kpsherva · Pull Request #732 · CERNDocumentServer/cds-rdm

kpsherva · 2026-03-11T09:54:11Z

closes Inspire Harvester: Explore and implement a solution to map multiple document types as resource type #567

kpsherva · 2026-03-20T16:22:37Z

-            del new_version_entry["pids"]
-
-        # Preserve existing programmes in new versions
-        existing_programmes = (


Check whether the custom fields are preserved between versions.

TahaKhan998 · 2026-05-13T14:24:17Z

-            # error_message = f"Unexpected error while processing entry: {str(e)}."
+            import traceback
+            traceback.print_exc()
+        except Exception as e:


I think the generic except Exception case is a bit risky here. For WriterError and ValidationError we build an error_message and add it to stream_entry.errors, so the failed entry is tracked properly. Here we only print the traceback and then continue. So if _route() fails because of some unexpected bug, nothing gets added to stream_entry.errors, op_type can stay unset, and the function still returns the entry. That could leave the entry in a half-failed state without the failure being tracked clearly. Should we add an explicit error there too?
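A minimal sketch of what recording the failure could look like. StreamEntry, process_entry, and the route callable are hypothetical stand-ins for illustration, not the PR's actual classes; the only point is that the generic branch appends to errors in the same way the specific branches do.

```python
import traceback
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StreamEntry:
    """Hypothetical stand-in for the harvester's stream entry."""

    entry: dict
    errors: List[str] = field(default_factory=list)
    op_type: Optional[str] = None


def process_entry(
    stream_entry: StreamEntry, route: Callable[[StreamEntry], str]
) -> StreamEntry:
    """Run ``route`` and make sure any failure is tracked on the entry."""
    try:
        stream_entry.op_type = route(stream_entry)
    except Exception as e:  # last-resort handler for unexpected bugs
        # Keep the traceback for debugging, but also record the failure so
        # the entry is not returned in a silently half-failed state.
        traceback.print_exc()
        stream_entry.errors.append(
            f"Unexpected error while processing entry: {e}."
        )
    return stream_entry
```

With this shape, a caller can decide what to do with the entry by checking errors instead of guessing from an unset op_type.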

TahaKhan998 · 2026-05-13T14:44:30Z

+        for doc_type in doc_types:
+            self.logger.debug(f"Mapping {doc_type} to version.")
+            resource_type = INSPIRE_DOCUMENT_TYPE_MAPPING[doc_type]
+            self.logger.info(f"Mapped {doc_type} to {resource_type}.")
+            if resource_type is not self.main_res_type:
+                version_ctx = MetadataSerializationContext(
+                    resource_type=resource_type,
+                    inspire_id=self.inspire_id,
+                    cds_rdm_id=self.cds_id,
+                )
+                mappers = self.policy.build_for(resource_type)
+                assert_unique_ids(mappers)
+                patches = [
+                    m.apply(self.inspire_record, version_ctx, self.logger)
+                    for m in mappers
+                ]
+


I might be misunderstanding this, but is there a risk of generating duplicate versions for the same target resource type here? split() loops over every raw INSPIRE document_type, and if two different doc types map to the same CDS resource_type, it looks like we would append two versions with the same target type. Then later the writer would process that same resource type more than once. I’m also wondering if the unused _PREPRINT_DOC_TYPES constant was meant to group some of these doc types together instead.
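If duplicates are indeed possible, one option is to collapse the doc types to their target resource types before building versions. The sketch below is an assumption about how that could be done; unique_target_types is a hypothetical helper and the mapping values in the usage example are made up.

```python
from typing import Dict, Iterable, List


def unique_target_types(
    doc_types: Iterable[str],
    mapping: Dict[str, str],
    main_res_type: str,
) -> List[str]:
    """Return each non-main target resource type once, in first-seen order."""
    seen = {main_res_type}
    targets = []
    for doc_type in doc_types:
        resource_type = mapping[doc_type]
        if resource_type in seen:
            # Already covered by the main record or by an earlier doc type.
            continue
        seen.add(resource_type)
        targets.append(resource_type)
    return targets


# Illustrative mapping only: two doc types share one target type.
example_mapping = {"article": "type-a", "report": "type-b", "note": "type-b"}
example_targets = unique_target_types(
    ["article", "report", "note"], example_mapping, main_res_type="type-a"
)
```

The loop in split() could then iterate over the returned list, so each target resource type yields at most one version.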

TahaKhan998 · 2026-05-13T14:57:03Z

+            source = abstract.get("source", "").lower()
+            if source and source not in ["arxiv", "cds"]:
+                return abstract["value"]
+            return abstracts[0]["value"]


I think this may return too early. Because the fallback return abstracts[0]["value"] is inside the loop, we only ever inspect the first abstract. So if a preferred non-arXiv/non-CDS abstract appears later in the list, we would never reach it. Was the fallback meant to be outside the loop?
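For reference, a sketch of the loop with the fallback moved outside. pick_abstract is a hypothetical name, and returning None for an empty list is an assumption, since the excerpt does not show how that case is handled.

```python
from typing import List, Optional


def pick_abstract(abstracts: List[dict]) -> Optional[str]:
    """Prefer the first abstract whose source is neither arXiv nor CDS."""
    for abstract in abstracts:
        # ``or ""`` also covers an explicit ``"source": None``.
        source = (abstract.get("source") or "").lower()
        if source and source not in ("arxiv", "cds"):
            return abstract["value"]
    # The fallback runs only after every abstract has been inspected.
    return abstracts[0]["value"] if abstracts else None
```

With the fallback at this level, a preferred abstract in second or third position is still found.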

TahaKhan998 · 2026-05-13T15:04:23Z

+            source = abstract.get("source", "").lower()
+            if source and source in ["arxiv", "cds"]:
+                return abstract["value"]
+            return abstracts[0]["value"]


I think this may have the same early return issue as article.py.

sakshamarora1 · 2026-06-24T08:09:41Z

+                f"Imported files to {draft.id} from previous version: {record.id}"
+            )
+
+        new_files = incoming_record["files"].get("entries", {})


Was it not .files? And it should always exist, right? Otherwise we may want to use .get() so that it does not error out.
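If the key can be absent, a defensive lookup could look like the sketch below. incoming_file_entries is a hypothetical helper, and treating a missing or None "files" value as "no files" is an assumption.

```python
def incoming_file_entries(incoming_record: dict) -> dict:
    """Return the file entries, or an empty dict when none are present."""
    files = incoming_record.get("files") or {}
    return files.get("entries") or {}
```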

sakshamarora1 · 2026-06-24T08:44:55Z

+            source = abstract.get("source", "").lower()
+            if source and source not in ["arxiv", "cds"]:
+                return abstract["value"]
+            return abstracts[0]["value"]


Good catch! CC @kpsherva

sakshamarora1 · 2026-06-24T09:01:22Z

+                filename = file["filename"]
+                source = file.get("source")
+                url = file["url"]
+                if "pdf" not in filename:


I think it would be better to use .endswith(".pdf") so that names like "Fulltext_pdf_paper", which contain "pdf" but do not have a PDF extension, are not accepted by mistake.
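A small sketch of the difference. is_pdf is a hypothetical helper, and making the check case-insensitive is an assumption about what the harvester wants.

```python
def is_pdf(filename: str) -> bool:
    """Return True only when the name ends with a .pdf extension."""
    return filename.lower().endswith(".pdf")


# The substring test accepts this name, the suffix test does not.
substring_result = "pdf" in "Fulltext_pdf_paper"
suffix_result = is_pdf("Fulltext_pdf_paper")
```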

sakshamarora1 · 2026-06-24T09:02:07Z


+    def filter(self, doi):
+        """Filter doi based on given criteria."""
+        return True


Is it added for a future use case?

zubeydecivelek · 2026-06-25T16:27:08Z

+from cds_rdm.inspire_harvester.utils import assert_unique_ids, deep_merge_all
+
+# Doc types that belong to the arXiv/preprint stream
+_PREPRINT_DOC_TYPES = frozenset({"report", "note", "activity report"})


Is this used somewhere? Is it needed?

zubeydecivelek · 2026-06-25T16:34:47Z

+        else:
+            self._create_record(stream_entry)
+            return "create"


Is it possible to have a new entry (first time harvest) with multiple resource types?

zubeydecivelek · 2026-06-25T16:37:16Z

+            del new_version_entry["pids"]["oai"]
+            if new_version_entry["pids"]["doi"]["provider"] != "external":
+                del new_version_entry["pids"]["doi"]


Is it possible to have an entry without oai or doi?
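If such entries can occur, the deletion could be written so that missing keys are tolerated. This is only a sketch; strip_managed_pids is a hypothetical name and the dictionary layout is inferred from the three lines of the excerpt.

```python
def strip_managed_pids(entry: dict) -> dict:
    """Remove the OAI PID and any non-external DOI, tolerating missing keys."""
    pids = entry.get("pids", {})
    pids.pop("oai", None)  # no KeyError when there is no OAI PID
    doi = pids.get("doi")
    if doi is not None and doi.get("provider") != "external":
        del pids["doi"]
    return entry
```

The entry is modified in place and returned for convenience; an external DOI is left untouched, as in the excerpt.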

zubeydecivelek · 2026-06-26T07:29:57Z

+    def compute_diff(self, existing_files, new_files) -> FileDiff:
+        """Return the set difference between existing and new file checksums."""
+        existing_checksums = [value["checksum"] for value in existing_files.values()]
+        new_checksums = [value["checksum"] for value in new_files.values()]


Since arXiv files don't have a checksum here and we generate the checksum when the file is uploaded, if we receive the same arXiv file it will have None in new_checksums, if I understand correctly. Could this cause the same file to be treated as changed?
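One possible way to handle this, sketched under assumptions: when an incoming file has no checksum, fall back to comparing by file key instead of treating None as a new checksum. The FileDiff shape below is hypothetical (only to_add is visible in the PR excerpts), and the key fallback is a design choice, not something the PR confirms.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class FileDiff:
    """Hypothetical shape; only ``to_add`` appears in the PR excerpts."""

    to_add: Set[str]
    to_remove: Set[str]


def compute_diff(
    existing_files: Dict[str, dict], new_files: Dict[str, dict]
) -> FileDiff:
    """Diff by checksum, falling back to the file key when it is missing."""
    existing_checksums = {v.get("checksum") for v in existing_files.values()} - {None}
    new_checksums = {v.get("checksum") for v in new_files.values()} - {None}

    to_add = set()
    for key, value in new_files.items():
        checksum = value.get("checksum")
        if checksum is None:
            # No checksum yet: treat the file as new only if the key is unknown.
            if key not in existing_files:
                to_add.add(key)
        elif checksum not in existing_checksums:
            to_add.add(key)

    to_remove = set()
    for key, value in existing_files.items():
        if key in new_files and new_files[key].get("checksum") is None:
            continue  # same key, incoming checksum unknown: keep the file
        if value.get("checksum") not in new_checksums:
            to_remove.add(key)
    return FileDiff(to_add=to_add, to_remove=to_remove)
```

The trade-off is that a file whose content changed under the same key goes undetected while its incoming checksum is missing, so this only makes sense if keys are stable for such files.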

zubeydecivelek · 2026-06-26T08:11:48Z

+                    inspire_url = file.get("source_url")
+                    file_content = self.fetch(inspire_url, logger)
+                    self._upload_file(draft, file, file_content, logger)
+            logger.info(f"{len(new_files)} files successfully created.")


Suggested change

-            logger.info(f"{len(new_files)} files successfully created.")
+            logger.info(f"{len(diff.to_add)} files successfully created.")

zubeydecivelek · 2026-06-26T08:51:39Z

+        ror_ids = []
+        if affiliations_identifiers:
+            ror_ids = [
+                normalize_ror(ai["value"])
+                for ai in affiliations_identifiers
+                if ai.get("schema") == "ROR"
+            ]
+
+        for i, affiliation in enumerate(affiliations):
+            if i < len(ror_ids):
+                mapped_affiliations.append({"id": ror_ids[i]})
+            else:
+                value = affiliation.get("value")
+                if value:
+                    mapped_affiliations.append({"name": value.rstrip(".").strip()})


Is it possible for affiliations and ror_ids to have different lengths? If so, could we end up assigning the wrong ROR to an affiliation because the mapping relies on the index?
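To make the concern concrete, the sketch below reproduces the index-based pairing from the excerpt (ROR normalisation is left out, and map_affiliations is a hypothetical wrapper name). The input is made up: it assumes a case where only the second affiliation has a ROR identifier, which may or may not occur in real INSPIRE data.

```python
def map_affiliations(affiliations, affiliations_identifiers):
    """Pair RORs with affiliations by position, as in the excerpt."""
    ror_ids = [
        ai["value"]
        for ai in affiliations_identifiers or []
        if ai.get("schema") == "ROR"
    ]
    mapped = []
    for i, affiliation in enumerate(affiliations):
        if i < len(ror_ids):
            mapped.append({"id": ror_ids[i]})
        else:
            value = affiliation.get("value")
            if value:
                mapped.append({"name": value.rstrip(".").strip()})
    return mapped


# Made-up input: two affiliations, one ROR that belongs to the second.
affiliations = [{"value": "Institute A"}, {"value": "Institute B"}]
identifiers = [{"schema": "ROR", "value": "ror-of-B"}]
result = map_affiliations(affiliations, identifiers)
# result == [{"id": "ror-of-B"}, {"name": "Institute B"}]
```

In this scenario Institute A is replaced by Institute B's ROR and Institute B is listed a second time by name, which is the mismatch the question is about.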

zubeydecivelek · 2026-06-26T08:57:48Z

+            return False
+        if (not material or material == "preprint") and source == "arxiv":
+            return True
+        if source == "CDS":


Suggested change

-        if source == "CDS":
+        if source.lower() == "cds":

Isn't it safer like this?
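One caveat with calling .lower() directly: if source can be None (another excerpt in this PR reads it with file.get("source")), the call raises AttributeError. A sketch of a comparison that covers both cases, with is_cds_source as a hypothetical helper name:

```python
def is_cds_source(source) -> bool:
    """Case-insensitive match on "cds" that also tolerates a missing source."""
    return isinstance(source, str) and source.lower() == "cds"
```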

kpsherva moved this to In progress in Sprint Q2 2026 ☀️ on Mar 11, 2026

kpsherva added this to Sprint Q2 2026 ☀️ on Mar 11, 2026

kpsherva self-assigned this on Mar 11, 2026

kpsherva removed this from Sprint Q2 2026 ☀️ on Mar 11, 2026

kpsherva force-pushed the multiple-doc-types branch from 241a5e4 to ad92c3a on March 12, 2026 08:12

kpsherva force-pushed the multiple-doc-types branch from ad92c3a to afa86dc on March 20, 2026 16:07

kpsherva commented Mar 20, 2026

kpsherva marked this pull request as ready for review March 23, 2026 12:35

kpsherva force-pushed the multiple-doc-types branch from 3062e73 to 85fa1ee on April 27, 2026 14:48

TahaKhan998 reviewed May 13, 2026

kpsherva added 12 commits June 18, 2026 14:16

change(harvester): improve writer interface

0264013

* change (harvester): skip update if metadata is equal # Conflicts: # site/cds_rdm/inspire_harvester/writer.py

chore(resolver): return always the same type

186851a

refactor: exception handling # Conflicts: # site/cds_rdm/legacy/redirector.py # Conflicts: # site/cds_rdm/legacy/redirector.py

feat(harvester): introduce notion of versions in the writer

6894c7d

* refactor(harvester): add specialised classes to handle drafts, files, matching records

feat(harvester): add specialized mappers per res type

a48eb5d

tests: add multiple resource type case

0d1b370

chore(harvester): fix tests

49f4fa5

wip: adjust file field update

56637e9

change(harvester): when to update files policy

b813559

* change(harvester): usage of resource types

change(harvester): splitting metadata assignment by resource type

fa55b09

* change(harvester): add license mapping

chore(harvester): formatting

a2ba8ac

feat(harvester): add ROR assignment in mapping

916d066

fix(tests): resource type assert

0d84f9d

kpsherva force-pushed the multiple-doc-types branch from 6e1a803 to 0d84f9d on June 18, 2026 12:24

sakshamarora1 reviewed Jun 24, 2026

kpsherva requested review from palkerecsenyi, zubeydecivelek and zzacharo June 25, 2026 14:56

zubeydecivelek reviewed Jun 26, 2026

4 participants
