Extract data points from Health Technology Assessment documents using generative AI

Introduction

Health Technology Assessment (HTA) documents published by HTA organisations assess the clinical effectiveness and cost-effectiveness of new drugs, and provide guidance and recommendations for use by policymakers, healthcare providers, and insurance companies. This project aims to extract data points from HTA documents published by HTA bodies in the EU. The goal is to create an Open Science database that can be used by other researchers and decision makers.

Prerequisites

Open an Anthropic AI account and generate an Anthropic AI API key.
Install the requirements by running the following in the terminal: pip install -r requirements.txt

Usage

Create input data file: Put the HTA documents in the data directory.
Set Anthropic AI API key: Set the Anthropic AI API key as an environment variable.
- On Linux, type in the terminal:
export ANTHROPIC_API_KEY=<your Anthropic AI API key>
- On Windows, type in the command prompt (the variable becomes available in newly opened command prompt windows, not in the current one):
setx ANTHROPIC_API_KEY <your Anthropic AI API key>
Create configuration files: In the config directory, specify the user-defined variables in the config.yaml file, the instructions part of the prompt in prompt_template.yaml, and the output JSON schema in schema.json.
Run program: Run the program from the terminal:

python run.py
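
Before running the program, it can help to confirm that the setup steps above were completed. The following is a minimal sketch of such a check, not part of the repository: the helper name preflight is hypothetical, and the file and directory names are the ones mentioned in the steps above, assumed to be relative to the directory from which run.py is started.

```python
import os
from pathlib import Path

# Paths named in the usage steps; assumed to be relative to the project root.
REQUIRED_PATHS = (
    "data",
    "config/config.yaml",
    "config/prompt_template.yaml",
    "config/schema.json",
)


def preflight(env=os.environ, root=Path(".")):
    """Hypothetical helper: return a list of problems; empty means ready to run."""
    problems = []
    # The API key must be present and non-empty in the environment.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    # Each input directory or configuration file must exist.
    for rel in REQUIRED_PATHS:
        if not (Path(root) / rel).exists():
            problems.append(f"missing: {rel}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

If the script prints nothing, the API key and the expected files were found.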

Attributes to extract

We want to extract the following attributes from each HTA document:

Data point	Explanation
HTA ID	Name of HTA organisation performing the assessment
Treatment type	Type of treatment (medicine, device, therapy)
Assessment type	Is this the first assessment, a reassessment, or an indication broadening?
Internal identifier	Code or label identifying the document
INN	International non-proprietary name of assessed drug
Brand name	Brand name of assessed drug
Assessment date	When was the assessment finalised?
Indication	Medical condition for which the drug is assessed
Final recommendation	What is the final recommendation for this drug-indication combination?
Comparator	Drug with which the performance of the assessed drug is compared
Relative effectiveness assessment outcome	Outcome of the relative effectiveness assessment for this drug-indication combination
Cost-effectiveness assessment outcome	Outcome of the cost-effectiveness assessment for this drug-indication combination
Managed entry agreements	Was any OECD-defined managed entry agreement proposed?
Clinical restrictions	Were any clinical restrictions stated in the recommendation?

Output attributes format

Because the data to be extracted are complex (a single document can cover multiple drugs, indications, and so on), the output needs a nested structure. We use the JSON schema in schema.json, which can also be represented as the following tree structure:

schema {}
├── hta_id
├── treatment_type
├── assessment_type
├── assessment_date
├── internal_identifier
└── indications
    ├── indication_name
    └── technologies
        ├── inn
        ├── brand_name
        ├── comparators
        ├── outcome_rea
        ├── outcome_cea
        ├── final_recommendation
        ├── managed_entry_agreement
        └── clinical_restrictions
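
To make the nesting concrete, here is a minimal sketch of a structural check written against the tree above. It is an illustration only and does not replace validation against schema.json. The function name missing_keys is hypothetical, and the sketch assumes, as in the example output later in this README, that indications and technologies are lists.

```python
# Keys copied from the tree structure above.
TOP_LEVEL = (
    "hta_id", "treatment_type", "assessment_type",
    "assessment_date", "internal_identifier", "indications",
)
INDICATION = ("indication_name", "technologies")
TECHNOLOGY = (
    "inn", "brand_name", "comparators", "outcome_rea", "outcome_cea",
    "final_recommendation", "managed_entry_agreement", "clinical_restrictions",
)


def missing_keys(record):
    """Hypothetical helper: return paths of tree keys absent from one record."""
    missing = [key for key in TOP_LEVEL if key not in record]
    for i, indication in enumerate(record.get("indications", [])):
        prefix = f"indications[{i}]"
        missing += [f"{prefix}.{key}" for key in INDICATION if key not in indication]
        for j, tech in enumerate(indication.get("technologies", [])):
            tech_prefix = f"{prefix}.technologies[{j}]"
            missing += [f"{tech_prefix}.{key}" for key in TECHNOLOGY if key not in tech]
    return missing
```

An empty list means that every key in the tree is present. The check looks only at key presence, not at value types or allowed values.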

Input

The input is a set of HTA documents.

To evaluate the extraction, also provide a corresponding ground truth file in JSON format that follows the schema above.

Output

The output is a list of JSON objects, one per HTA document, each containing the extracted attributes of interest. For example, using the generative AI model claude-3-opus-20240229 from Anthropic AI, this is the JSON object for the document Adefovir dipivoxil and peginterferon alfa-2a for the treatment of chronic hepatitis B, an HTA document published by the United Kingdom's National Institute for Health and Care Excellence (NICE):

 {
   "hta_id": "NICE (UK)",
   "treatment_type": "medicine",
   "assessment_type": "initial assessment",
   "assessment_date": "2006-02-22",
   "internal_identifier": "TA96",
   "indications": [
     {
       "indication_name": "Chronic hepatitis B (HBeAg-positive or HBeAg-negative) in adults with compensated liver disease and evidence of viral replication, increased ALT and histologically verified liver inflammation and/or fibrosis",
       "technologies": [
         {
           "inn": "peginterferon alfa-2a",
           "brand_name": "Pegasys",
           "comparators": "interferon alfa-2a",
           "outcome_rea": "equal",
           "outcome_cea": "positive",
           "final_recommendation": "positive",
           "managed_entry_agreement": null,
           "clinical_restrictions": null
         }
       ]
     },
     {
       "indication_name": "Chronic hepatitis B (HBeAg-positive or HBeAg-negative) in adults with compensated liver disease and evidence of active viral replication, persistently elevated serum ALT levels and histological evidence of active liver inflammation and fibrosis, or decompensated liver disease",
       "technologies": [
         {
           "inn": "adefovir dipivoxil",
           "brand_name": "Hepsera",
           "comparators": "lamivudine, best supportive care",
           "outcome_rea": "positive",
           "outcome_cea": "positive",
           "final_recommendation": "positive",
           "managed_entry_agreement": null,
           "clinical_restrictions": "Adefovir dipivoxil is recommended as an option for the treatment of chronic hepatitis B for patients in whom prolonged oral antiviral treatment is required, only after the use of an interferon unless this is contraindicated. The decision to use adefovir dipivoxil (alone or in combination with lamivudine) should take into account various factors including HBeAg status, stage of disease process (for example the presence of compensated or decompensated cirrhosis) and the presence of, or likelihood of the emergence of, virus resistance."
         }
       ]
     }
   ],
   "filename": "ta96.pdf"
 }
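
For analysis in a spreadsheet or data frame, the nested output can be flattened into one row per drug-indication combination. The sketch below shows one way to do this. The function name flatten is hypothetical, and the sample data is an abridged copy of the example above with the indication names shortened.

```python
import json


def flatten(documents):
    """Yield one flat dict per (document, indication, technology) combination."""
    for doc in documents:
        # Document-level attributes are repeated on every row.
        shared = {key: value for key, value in doc.items() if key != "indications"}
        for indication in doc.get("indications", []):
            for tech in indication.get("technologies", []):
                yield {
                    **shared,
                    "indication_name": indication.get("indication_name"),
                    **tech,
                }


# Abridged copy of the example above; JSON null becomes None in Python.
example = json.loads("""
[{
  "hta_id": "NICE (UK)",
  "internal_identifier": "TA96",
  "indications": [
    {"indication_name": "Chronic hepatitis B (first indication, abridged)",
     "technologies": [{"inn": "peginterferon alfa-2a",
                       "final_recommendation": "positive",
                       "managed_entry_agreement": null}]},
    {"indication_name": "Chronic hepatitis B (second indication, abridged)",
     "technologies": [{"inn": "adefovir dipivoxil",
                       "final_recommendation": "positive",
                       "managed_entry_agreement": null}]}
  ],
  "filename": "ta96.pdf"
}]
""")

rows = list(flatten(example))
for row in rows:
    print(row["internal_identifier"], "|", row["inn"], "|", row["final_recommendation"])
```

Each row repeats the document-level attributes, so rows from different documents can be combined into a single table.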

If evaluation is performed, the program produces a file containing the performance metrics (precision, recall, accuracy, F1 score) per attribute, together with the overall performance metrics. It also produces a file containing detailed comparisons between the extracted attributes and the ground truth.
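
This README does not state how the metrics are defined, so the following sketch shows only one possible convention and is not necessarily what the program implements. It assumes exact-match scoring of a single attribute: an extracted value that equals the ground truth is a true positive, a non-null extracted value that differs is a false positive, and a non-null ground truth value that was not matched is a false negative. Under this convention a wrong non-null value counts as both a false positive and a false negative. The function name attribute_metrics is hypothetical.

```python
def attribute_metrics(pairs):
    """Score (extracted, ground_truth) pairs for one attribute by exact match."""
    pairs = list(pairs)
    tp = sum(1 for got, want in pairs if got is not None and got == want)
    fp = sum(1 for got, want in pairs if got is not None and got != want)
    fn = sum(1 for got, want in pairs if want is not None and got != want)
    # A pair is correct when the values are equal, including both being null.
    correct = sum(1 for got, want in pairs if got == want)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Illustrative values only, not results from the project.
sample = [
    ("positive", "positive"),  # correct value
    ("equal", "positive"),     # wrong value
    (None, "negative"),        # value missed
    (None, None),              # correctly left empty
]
print(attribute_metrics(sample))
```

Free-text attributes such as clinical restrictions would rarely match exactly, so a real evaluation would likely need a more lenient comparison for those fields.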

License

This project is licensed under the terms of the MIT License.

About the project

Date: September 2023 -

Researchers:

Jan-Willem Versteeg (j.versteeg@uu.nl)
Lourens Bloem (l.t.bloem@uu.nl)
Marie L. de Bruin (m.l.debruin@uu.nl)

Research Engineers:

Modhurita Mitra (m.mitra@uu.nl)
Maarten Schermer (m.d.schermer@uu.nl)
Shiva Nadi Najafabadi (s.nadinajafabadi@uu.nl)
