UtrechtUniversity/hta-genai

Extract data points from Health Technology Assessment documents using generative AI

Introduction

Health Technology Assessment (HTA) documents published by HTA organisations assess the clinical effectiveness and cost-effectiveness of new drugs, and provide guidance and recommendations for use by policymakers, healthcare providers, and insurance companies. This project aims to extract data points from HTA documents published by HTA bodies in the EU. The goal is to create an Open Science database that can be used by other researchers and decision makers.

Prerequisites

Usage

  • Prepare input data: Put the HTA documents in the data directory.

  • Set Anthropic AI API key: Set the Anthropic AI API key as an environment variable.

    • On Linux or macOS, type in the terminal:

    export ANTHROPIC_API_KEY=<your Anthropic AI API key>

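    • On Windows, type in the command prompt (setx stores the variable for future sessions, so open a new command prompt before running the program):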

    setx ANTHROPIC_API_KEY <your Anthropic AI API key>

  • Create configuration files: In the config directory, specify the user-defined variables in the config.yaml file, the instructions part of the prompt in prompt_template.yaml, and the output JSON schema in schema.json.

  • Run program: Run the program from the terminal:

    python run.py
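Before a run, it can help to confirm that the API key is visible to Python. The following is a minimal sketch, not part of the project: the helper name get_api_key is hypothetical, and the only detail taken from the steps above is the variable name ANTHROPIC_API_KEY.

```python
import os


def get_api_key(env=None, name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to test. This helper is illustrative only.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before running run.py")
    return key
```

Calling get_api_key() with no arguments reads the real environment and fails early with a readable message if the key is missing.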

Attributes to extract

We want to extract the following attributes from each HTA document:

| Data point | Explanation |
| --- | --- |
| HTA ID | Name of HTA organisation performing the assessment |
| Treatment type | Type of treatment (medicine, device, therapy) |
| Assessment type | Is this the first assessment, a reassessment, or an indication broadening? |
| Internal identifier | Code or label identifying the document |
| INN | International non-proprietary name of assessed drug |
| Brand name | Brand name of assessed drug |
| Assessment date | When was the assessment finalised? |
| Indication | Medical condition for which the drug is assessed |
| Final recommendation | What is the final recommendation for this drug-indication combination? |
| Comparator | Drug with which the performance of the assessed drug is compared |
| Relative effectiveness assessment outcome | Outcome of the relative effectiveness assessment for this drug-indication combination |
| Cost-effectiveness assessment outcome | Outcome of the cost-effectiveness assessment for this drug-indication combination |
| Managed entry agreements | Was any OECD-defined managed entry agreement proposed? |
| Clinical restrictions | Were any clinical restrictions stated in the recommendation? |

Output attributes format

Because the data to be extracted are complex (a single document may cover multiple drugs, health indications, etc.), a nested structure is needed to represent the output. We use the JSON schema defined in schema.json, which can also be represented as the following tree structure:

schema {}
├── hta_id
├── treatment_type
├── assessment_type
├── assessment_date
├── internal_identifier
└── indications
    ├── indication_name
    └── technologies
        ├── inn
        ├── brand_name
        ├── comparators
        ├── outcome_rea
        ├── outcome_cea
        ├── final_recommendation
        ├── managed_entry_agreement
        └── clinical_restrictions
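As a rough illustration of the nesting, the sketch below walks one output object and reports any keys from the tree that are absent. It is not the project's validation code: the function name missing_keys and the two key lists are assumptions copied from the tree above, and a full check would validate against schema.json instead.

```python
# Keys copied from the tree structure; an illustration, not the schema itself.
TOP_LEVEL_KEYS = [
    "hta_id", "treatment_type", "assessment_type",
    "assessment_date", "internal_identifier", "indications",
]
TECHNOLOGY_KEYS = [
    "inn", "brand_name", "comparators", "outcome_rea", "outcome_cea",
    "final_recommendation", "managed_entry_agreement", "clinical_restrictions",
]


def missing_keys(doc):
    """Return dotted paths of expected keys that are absent from `doc`."""
    missing = [k for k in TOP_LEVEL_KEYS if k not in doc]
    for i, indication in enumerate(doc.get("indications") or []):
        prefix = f"indications[{i}]"
        for key in ("indication_name", "technologies"):
            if key not in indication:
                missing.append(f"{prefix}.{key}")
        for j, tech in enumerate(indication.get("technologies") or []):
            missing.extend(
                f"{prefix}.technologies[{j}].{k}"
                for k in TECHNOLOGY_KEYS
                if k not in tech
            )
    return missing
```

An empty list means every key in the tree is present; the values themselves (for example, allowed categories) are not checked here.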

Input

The input is a set of HTA documents.

To evaluate the extraction, also provide a corresponding ground truth in JSON format that follows the schema above.

Output

The output is a list of JSON objects, one per HTA document, each containing the extracted attributes of interest.

Example: using the generative AI model claude-3-opus-20240229 from Anthropic AI, this is the JSON object for the document Adefovir dipivoxil and peginterferon alfa-2a for the treatment of chronic hepatitis B, an HTA document published by the United Kingdom's National Institute for Health and Care Excellence (NICE):

 {
   "hta_id": "NICE (UK)",
   "treatment_type": "medicine",
   "assessment_type": "initial assessment",
   "assessment_date": "2006-02-22",
   "internal_identifier": "TA96",
   "indications": [
     {
       "indication_name": "Chronic hepatitis B (HBeAg-positive or HBeAg-negative) in adults with compensated liver disease and evidence of viral replication, increased ALT and histologically verified liver inflammation and/or fibrosis",
       "technologies": [
         {
           "inn": "peginterferon alfa-2a",
           "brand_name": "Pegasys",
           "comparators": "interferon alfa-2a",
           "outcome_rea": "equal",
           "outcome_cea": "positive",
           "final_recommendation": "positive",
           "managed_entry_agreement": null,
           "clinical_restrictions": null
         }
       ]
     },
     {
       "indication_name": "Chronic hepatitis B (HBeAg-positive or HBeAg-negative) in adults with compensated liver disease and evidence of active viral replication, persistently elevated serum ALT levels and histological evidence of active liver inflammation and fibrosis, or decompensated liver disease",
       "technologies": [
         {
           "inn": "adefovir dipivoxil",
           "brand_name": "Hepsera",
           "comparators": "lamivudine, best supportive care",
           "outcome_rea": "positive",
           "outcome_cea": "positive",
           "final_recommendation": "positive",
           "managed_entry_agreement": null,
           "clinical_restrictions": "Adefovir dipivoxil is recommended as an option for the treatment of chronic hepatitis B for patients in whom prolonged oral antiviral treatment is required, only after the use of an interferon unless this is contraindicated. The decision to use adefovir dipivoxil (alone or in combination with lamivudine) should take into account various factors including HBeAg status, stage of disease process (for example the presence of compensated or decompensated cirrhosis) and the presence of, or likelihood of the emergence of, virus resistance."
         }
       ]
     }
   ],
   "filename": "ta96.pdf"
 }
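For analysis in a spreadsheet or data frame, a nested object like the one above can be flattened into one row per indication-technology pair. The sketch below is one possible approach under that assumption; the function name flatten is hypothetical and not part of the project.

```python
def flatten(doc):
    """Turn one nested output object into a list of flat row dicts.

    Each row repeats the document-level fields and adds the indication
    name plus the fields of one technology.
    """
    document_fields = {k: v for k, v in doc.items() if k != "indications"}
    rows = []
    for indication in doc.get("indications") or []:
        for tech in indication.get("technologies") or []:
            rows.append({
                **document_fields,
                "indication_name": indication.get("indication_name"),
                **tech,
            })
    return rows
```

Applied to the example object above, this would give two rows, one for peginterferon alfa-2a and one for adefovir dipivoxil, each carrying the shared fields such as hta_id and filename.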

If evaluation is performed, then a file containing the performance metrics (precision, recall, accuracy, F1 score) per attribute, as well as the overall performance metrics, is produced. A file containing detailed comparisons between the extracted attributes and the ground truth is also produced.
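The README does not state how the metrics are counted, so the sketch below shows only one plausible convention for a single attribute, using exact matching. The function name attribute_metrics and the counting rules are assumptions: a correct non-null value is a true positive, a wrong or spurious value is a false positive, a value missed by the model is a false negative, and a value absent on both sides is a true negative.

```python
def attribute_metrics(pairs):
    """Compute metrics for one attribute.

    `pairs` is a list of (extracted, truth) values; None means absent.
    The counting convention is an assumption, not the project's method.
    """
    tp = fp = fn = tn = 0
    for extracted, truth in pairs:
        if extracted is None and truth is None:
            tn += 1
        elif extracted is None:
            fn += 1  # ground truth has a value, the model returned nothing
        elif extracted == truth:
            tp += 1
        else:
            fp += 1  # wrong value, or a value where none was expected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```

Exact string equality is a strict criterion; free-text attributes such as clinical restrictions would likely need a softer comparison.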

License

This project is licensed under the terms of the MIT License.

About the project

Date: September 2023 -

Researchers:

Research Engineers:
