rtransparency automatically identifies and extracts indicators of research
transparency from the full text of biomedical articles, in both PubMed Central
(PMC) JATS XML and plain-text (PDF-derived) form. Every prediction comes with the
exact statement that triggered it, so results are auditable rather than a black
box. Detection is rule-based (curated regular expressions over the relevant
article sections), self-contained (no GitHub-only or AGPL dependencies), and
ships with reproducible accuracy benchmarks.

| Indicator | Detects | XML function | Text function |
|---|---|---|---|
| Conflicts of interest | A COI disclosure is present (including "no competing interests") | rt_coi_pmc | rt_coi |
| Funding | A statement that funding was received | rt_fund_pmc | rt_fund |
| Protocol registration | A trial/protocol registration identifier or statement (NCT, ISRCTN, PROSPERO, OSF, CHiCTR, DRKS, ANZCTR, IRCT, UMIN, ...) | rt_register_pmc | rt_register |
| Novelty | The article claims its own work is novel or first | rt_novelty_pmc | rt_novelty |
| Replication | A replication or external/independent validation was performed | rt_replication_pmc | rt_replication |
| Data sharing | The authors' own data are made available (repository, accession, or in-article) | rt_data_code_pmc | rt_data_code |
| Code sharing | The authors' own analysis code is shared | rt_data_code_pmc | rt_data_code |
| AI disclosure | A statement discloses generative-AI use in manuscript preparation (2023+) | rt_ai_pmc | rt_ai |
| Open-access license | The article is openly licensed, and which license (CC-BY, CC-BY-NC-ND, CC0, ...) | rt_oa_pmc | rt_oa |
| Reporting guideline | The authors followed a reporting guideline, and which (CONSORT, PRISMA, STROBE, ARRIVE, ...) | rt_reporting_pmc | rt_reporting |

Conflicts of interest and AI disclosure are disclosure-based: a statement on the topic counts whether the disclosure is positive or negative. Conflict-of-interest and funding statements are detected not only in English but also in Spanish, Portuguese, French, German and Italian.
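To make the disclosure-based, rule-based idea concrete, here is a minimal base-R sketch. The pattern, the `detect_coi()` helper, and the sentence splitter are illustrative assumptions, not the package's curated expressions; only the output column names `is_coi_pred` and `coi_text` come from the package. It shows how a negative disclosure still counts as a detection, and how the triggering statement is kept alongside the prediction.

```r
# Hypothetical, deliberately tiny pattern; the package's curated
# expressions are far more extensive and section-aware.
coi_pattern <- "(conflicts? of interests?|competing interests?)"

# Hypothetical helper: split text into rough sentences, flag those that
# match the pattern, and return the prediction with the matching text.
detect_coi <- function(text) {
  sentences <- trimws(regmatches(text, gregexpr("[^.!?]+[.!?]*", text))[[1]])
  hits <- grepl(coi_pattern, sentences, ignore.case = TRUE, perl = TRUE)
  data.frame(
    is_coi_pred = any(hits),
    coi_text = paste(sentences[hits], collapse = " "),
    stringsAsFactors = FALSE
  )
}

article <- "We enrolled 120 patients. The authors declare no competing interests."
res_demo <- detect_coi(article)
res_demo$is_coi_pred  # TRUE: a negative disclosure is still a disclosure
res_demo$coi_text     # the sentence that triggered the prediction
```

Returning the matched sentence next to the logical flag is what makes each prediction auditable by a human reader.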
# From CRAN (when available)
install.packages("rtransparency")
# Development version from GitHub
# install.packages("remotes")
remotes::install_github("choxos/rtransparency", build_vignettes = TRUE)

No GitHub-only or AGPL dependencies are required; data and code detection is
native (it no longer wraps oddpub). rt_read_pdf() (PDF to text) additionally
needs the poppler pdftotext utility on your system. The optional furrr and
future packages enable parallel corpus processing; ggplot2 enables plotting.
library(rtransparency)
xml <- system.file("extdata", "PMID32171256-PMC7071725.xml", package = "rtransparency")
res <- rt_all_pmc(xml, remove_ns = TRUE)
# The predictions, one column per indicator:
res[, c("is_coi_pred", "is_fund_pred", "is_register_pred", "is_novelty_pred",
"is_replication_pred", "is_open_data", "is_open_code", "is_ai_pred",
"is_open_access", "is_reporting_pred")]
# Each prediction is paired with the text/value that triggered it, e.g.:
res$coi_text
res$open_data_statements
res$oa_license # e.g. "CC-BY-4.0"
res$reporting_guideline # e.g. "PRISMA"

rt_all_pmc() returns one row with the ten predictions, the extracted statement
for each, article identifiers and metadata, the year, and is_success.
is_ai_pred is NA for articles published before 2023.
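The NA for pre-2023 articles matters downstream, because NA is neither TRUE nor FALSE. The sketch below uses a hypothetical `gate_by_year()` helper (not a package function) to mimic that year gate. Treating a missing year as NA is an assumption made here for illustration.

```r
# Hypothetical helper: keep a raw detection only for articles published in
# min_year or later; earlier (or, by assumption, undated) articles get NA.
gate_by_year <- function(raw_pred, year, min_year = 2023L) {
  ifelse(is.na(year) | year < min_year, NA, raw_pred)
}

gated <- gate_by_year(c(TRUE, FALSE, TRUE), c(2021L, 2023L, 2024L))
gated                      # NA FALSE TRUE

# Prevalence among eligible articles only: drop the NA rows explicitly.
mean(gated, na.rm = TRUE)  # 0.5
```

Dropping NA rows rather than counting them as FALSE keeps pre-2023 articles out of the denominator.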
Each indicator can be run on its own, for a PMC XML file or a plain-text file:
rt_coi_pmc(xml, remove_ns = TRUE) # conflicts of interest
rt_fund_pmc(xml, remove_ns = TRUE) # funding
rt_register_pmc(xml, remove_ns = TRUE) # protocol registration
rt_novelty_pmc(xml, remove_ns = TRUE) # novelty claims
rt_replication_pmc(xml, remove_ns = TRUE) # replication / external validation
rt_data_code_pmc(xml, remove_ns = TRUE) # data AND code sharing (+ extracted links)
rt_ai_pmc(xml, remove_ns = TRUE) # generative-AI-use disclosure (2023+)
rt_oa_pmc(xml, remove_ns = TRUE) # open-access status + license
rt_reporting_pmc(xml, remove_ns = TRUE) # reporting-guideline use + which one
rt_meta_pmc(xml, remove_ns = TRUE) # article metadata

rt_all_pmc_dir() runs all ten indicators over an entire directory (or a
vector of paths). It is built for large corpora:
res <- rt_all_pmc_dir(
"path/to/xml", # a directory, or a character vector of file paths
remove_ns = TRUE,
output = "results.csv", # resumable: re-running skips files already recorded
parallel = TRUE, # via furrr + an active future::plan()
progress = TRUE
)

- Resumable: with output, results are written to a CSV in chunks; a re-run skips files already recorded and appends only the new ones.
- Failure-isolated: a malformed file yields an is_success = FALSE row instead of aborting the run.
- Parallel: set future::plan("multisession") and parallel = TRUE.
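The resumable and failure-isolated behaviour described above can be sketched in base R. This is not the package's implementation: `process_one()` and `run_resumable()` are hypothetical names, the worker only counts lines instead of running detectors, files are keyed by base name (an assumption), and rows are appended one at a time rather than in chunks.

```r
# Hypothetical per-file worker standing in for a real detector call.
process_one <- function(path) {
  text <- readLines(path, warn = FALSE)
  data.frame(file = basename(path), n_lines = length(text), is_success = TRUE)
}

# Hypothetical driver: skip files already recorded in `output`, and turn a
# failing file into an is_success = FALSE row instead of stopping the loop.
run_resumable <- function(paths, output, worker = process_one) {
  done <- if (file.exists(output)) read.csv(output)$file else character(0)
  todo <- paths[!basename(paths) %in% done]
  for (p in todo) {
    row <- tryCatch(
      suppressWarnings(worker(p)),
      error = function(e) {
        data.frame(file = basename(p), n_lines = NA_integer_, is_success = FALSE)
      }
    )
    first <- !file.exists(output)  # write the header only once
    write.table(row, output, sep = ",", row.names = FALSE,
                col.names = first, append = !first)
  }
  read.csv(output)
}

corpus_dir <- tempfile("corpus")
dir.create(corpus_dir)
good <- file.path(corpus_dir, "a.txt")
writeLines(c("line 1", "line 2"), good)
absent <- file.path(corpus_dir, "absent.txt")  # never created, so the worker fails
out <- file.path(corpus_dir, "results.csv")

first_run <- run_resumable(c(good, absent), out)   # one success row, one failure row
second_run <- run_resumable(c(good, absent), out)  # both already recorded: nothing re-run
```

In this sketch a failed file is recorded and therefore skipped on the next run; whether the package retries failures is not stated here.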
The same detectors run on plain-text (PDF-derived) articles. rt_read_pdf()
returns the extracted text as a character string; write it to a .txt file,
then point the text detectors (which share the PMC detection logic) at that file:
article_txt <- rt_read_pdf("article.pdf") # needs poppler's pdftotext; returns text
writeLines(article_txt, "article.txt") # the detectors take a file path
rt_all("article.txt") # COI, funding, registration, novelty, replication
rt_coi("article.txt") # or one indicator at a time
rt_ai("article.txt") # generative-AI-use disclosure

rt_ai() is the plain-text counterpart of rt_ai_pmc(). Because a text file
carries no reliable publication date, it applies no 2023 year gate (it
returns TRUE/FALSE, never NA) and cannot confine the scan to back-matter
sections, so restrict its use to 2023-or-later articles and expect a slightly
higher false-positive rate on papers that use AI as a research method.
Once you have one row per article, summarize the corpus:
data(rt_demo) # a small simulated example shipped with the package
rt_summary(rt_demo) # per-indicator prevalence with a Wilson confidence
# interval and a sensitivity/specificity-corrected
# (Rogan-Gladen) prevalence
rt_summary(rt_demo, by = "year") # subgroup summaries
rt_score(rt_demo) # add a per-article count of openness practices met
rt_plot(rt_demo) # prevalence bar chart
rt_plot(rt_demo, type = "trend", year = "year") # prevalence over time

The accuracy correction uses the bundled rt_accuracy table (detector
sensitivity and specificity for eight indicators; open-access licensing and
AI-use disclosure are reported uncorrected). Supply your own estimates:
rt_accuracy # the bundled estimates
my_acc <- data.frame(variable = "is_open_data", sensitivity = 0.84, specificity = 0.97)
rt_summary(rt_demo, accuracy = my_acc) # correct with your own values

The data- and code-availability links the detector extracts (open_data_links,
open_code_links) can be passed to FAIR-assessment tooling such as
rfair to score the findability and
accessibility of the shared resources.
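As a rough illustration of what link extraction from an availability statement involves, the base-R sketch below pulls URLs out of a sentence. The statement, the URLs, and the pattern are all made up for the example; the package's own extraction rules are not shown here and are likely more thorough (accession numbers, DOIs, and so on).

```r
# Hypothetical availability statement with made-up URLs.
statement <- paste(
  "Data are available at https://osf.io/abc12 and",
  "analysis code at https://github.com/user/repo."
)

# Assumed pattern: http(s) URLs up to the next whitespace or closing bracket.
url_pattern <- "https?://[^\\s)>\\]]+"
links <- regmatches(statement, gregexpr(url_pattern, statement, perl = TRUE))[[1]]

# Strip sentence punctuation that was swallowed at the end of a URL.
links <- sub("[.,;]+$", "", links)
links
```

A character vector like this is the kind of input a downstream FAIR-assessment step could iterate over.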
The following indicators are benchmarked against the human-labeled XML benchmark
of Serghiou et al. (2021). The benchmark is reproducible under
data-raw/benchmark/, with results in inst/benchmark/:
| Indicator | Sensitivity | Specificity |
|---|---|---|
| Conflicts of interest | 94.0% | 100% |
| Funding | 100% | 95.7% |
| Protocol registration | 99.2% | 96.9% |
| Data sharing | 76.5% | 99.0% |
| Code sharing | 88.1% | 99.5% |
Registration and code in the Serghiou benchmark table above are labeled
independently of the detector; COI, funding and data labels in the 1000-article 2023 sample were
reconciled against detector-extracted statements (detector-adjudicated), so their
agreement is not a fully independent estimate. Data sharing is deliberately
precision-favoring: its 76.5% sensitivity trades recall for 99.0% specificity
(the original oddpub algorithm scores about 84%/97% on this set).
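To see what these sensitivity and specificity figures do to a prevalence estimate, here is a base-R sketch of the two standard formulas named earlier: the Wilson score interval and the Rogan-Gladen correction. The function names and the corpus counts (200 of 1000 articles flagged) are hypothetical; the sensitivity and specificity are the data-sharing values from the table above. How rt_summary() combines the two internally is not shown here.

```r
# Wilson score interval for a binomial proportion x / n.
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  denom <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - half, upper = centre + half)
}

# Rogan-Gladen estimator: corrects an apparent (detector-reported) prevalence
# for imperfect sensitivity and specificity, clamped to [0, 1].
rogan_gladen <- function(apparent, sensitivity, specificity) {
  corrected <- (apparent + specificity - 1) / (sensitivity + specificity - 1)
  min(max(corrected, 0), 1)
}

# Hypothetical corpus: the detector flags data sharing in 200 of 1000 articles.
apparent <- 200 / 1000
ci <- wilson_ci(200, 1000)
corrected <- rogan_gladen(apparent, sensitivity = 0.765, specificity = 0.990)
corrected  # about 0.252: low sensitivity means the raw 20% undercounts
```

Because the data-sharing detector trades recall for precision, the corrected prevalence is higher than the apparent one.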
The newer indicators are validated against maintainer-built, hand-labeled
benchmarks in inst/benchmark/:
| Indicator | Sensitivity | Specificity | Basis |
|---|---|---|---|
| Novelty | 83.8% | 95.2% | hand-labeled novelty/replication gold set |
| Replication | 92.8% | 98.5% | replication-enriched sample (111 positives); correction is approximate |
| AI-use disclosure | not accuracy-corrected | — | experimental; only 9 positives in the 2023 sample |
| Open-access license | 100% | not estimable | structured <license> extraction; license-type exact match 99.8%; specificity rests on 1 negative in the OA subset, so it is reported uncorrected |
| Reporting guideline | 93.8% | 99.0% | 1000-article 2023 sample hand-labeled (65 positives) |
Replication's correction mixes designs (sensitivity from the enriched sample,
specificity from the representative 2023 sample), so it is less clean than the
single-design corrections above. AI-use disclosure is reported uncorrected and is
excluded from rt_accuracy until a larger labeled post-2022 sample exists. Two
further benchmarks live in inst/benchmark/: a five-language sample for
multilingual COI and funding, and a TXT-parity benchmark comparing the text
and XML detectors.
See vignette("rtransparency") for the methodology and vignette("scope-and-limitations")
for what each indicator does and does not capture.
- vignette("rtransparency"): introduction and methodology
- vignette("transparency-summary"): corpus prevalence, scoring and plotting
- vignette("ai-disclosure"): the AI-use disclosure indicator in depth
- vignette("scope-and-limitations"): indicator semantics, limitations, output schema
- Package website: https://choxos.github.io/rtransparency/

This package is an enhanced, renamed fork of the original rtransparent tool by
Stylianos (Stelios) Serghiou, maintained by Ahmad Sofi-Mahmudi
(ORCID 0000-0001-6829-0823, GitHub @choxos). It adds four indicators (novelty,
replication, AI disclosure, and a natively re-implemented data/code detector),
multilingual COI and funding detection, plain-text parity, and corpus-scale
batch processing. Serghiou is credited as an author.
The foundational paper: Serghiou et al., Assessment of transparency indicators
across the biomedical literature: How open is open? PLOS Biology, 2021,
doi:10.1371/journal.pbio.3001107.
Run citation("rtransparency") for both references.
Parts of this package were developed with the assistance of generative AI (Anthropic's Claude, via Claude Code), including code, tests, documentation, and benchmark tooling. All AI-assisted output was reviewed, run, and validated by the maintainer, who is responsible for the final content. This mirrors the kind of disclosure the package itself is built to detect.
Please file bugs or questions as issues at https://github.com/choxos/rtransparency/issues with a minimal reproducible example.