iehok/DualCSE

DualCSE

This is the repository for the paper "One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations".

🌲 Directory Structure

The directory structure should look like this:

DualCSE
├── figures/            # Paper figures
│   └── *.png
├── scripts/            # Shell scripts
│   └── *.sh
├── src/                # Our main implementation
│   └── *.py
├── prompt/            
│   └── prompt.txt
├── datasets/
│   ├── inli/           # Cloned INLI repo
│   │   ├── INLI Data
│   │   ├── Resources
│   │   └── ...
│   └── impscore/       # Downloaded from ImpScore repo
│       └── all_data.csv
└── ...

📦 Setup

1. Clone the INLI Dataset

Before running the code, clone the INLI dataset into datasets/inli under the project root:

git clone https://github.com/google-deepmind/inli.git datasets/inli

2. Download Wang's Dataset (ImpScore)

Download all_data.csv from the ImpScore repo and place it in datasets/impscore/.

3. Preprocess

To standardize the dataset format, run the preprocessing step on the downloaded data:

python -m src.prepare_data
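
The preprocessing step depends on the data from steps 1 and 2 being in place, so a quick check beforehand may save a failed run. The following is a minimal sketch, not part of the repository: the helper name missing_dataset_paths is hypothetical, and the paths are simply the ones shown in the directory structure above.

```python
from pathlib import Path

# Paths taken from the directory structure above; adjust if your layout differs.
EXPECTED_PATHS = [
    Path("datasets/inli"),
    Path("datasets/impscore/all_data.csv"),
]


def missing_dataset_paths(root="."):
    """Return the expected dataset paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths()
    if missing:
        for path in missing:
            print(f"missing: {path}")
    else:
        print("all expected dataset paths found")
```

Run it from the project root. It only checks that the paths exist, not that their contents are complete.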

🏋️ Model Training & Testing

To train the model, please execute the following command:

bash scripts/train.sh

After training, test the model with the following commands, replacing RUN_NAME with the name of your training run:

python -m src.test_rte --run_name RUN_NAME
python -m src.test_implicitness_scoring --run_name RUN_NAME

🔐 API Keys

To run experiments with external LLM APIs, create a .env file in the root directory to store your API keys:

vi .env

Add the following keys:

OPENAI_API_KEY="<your OpenAI API key>"
DEEPSEEK_API_KEY="<your DeepSeek API key>"
GEMINI_API_KEY="<your Gemini API key>"
CLAUDE_API_KEY="<your Claude API key>"
MISTRAL_API_KEY="<your Mistral API key>"
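
To see which of these keys are actually filled in before launching an experiment, something like the sketch below may help. It is an illustration only: read_env_file and configured_keys are hypothetical helpers, and the parser is a simplified stdlib stand-in that handles plain KEY="value" lines. It is not necessarily how the repository loads the file.

```python
from pathlib import Path

# Key names as listed above.
KEY_NAMES = [
    "OPENAI_API_KEY",
    "DEEPSEEK_API_KEY",
    "GEMINI_API_KEY",
    "CLAUDE_API_KEY",
    "MISTRAL_API_KEY",
]


def read_env_file(path=".env"):
    """Parse simple KEY="value" lines; a simplified stand-in for a real .env loader."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_keys(path=".env"):
    """Return the expected key names that have a non-empty value in the file."""
    values = read_env_file(path)
    return [name for name in KEY_NAMES if values.get(name)]


if __name__ == "__main__":
    print("configured:", ", ".join(configured_keys()) or "none")
```

The script prints key names only, never the values, so its output is safe to paste into an issue or a log.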

🚀 Running LLM Baselines & Evaluation

To run the LLM baseline and evaluate results, use the following commands:

python llm_baseline.py --model_name gpt-4o --n_shot 0 # zero-shot
python llm_baseline.py --model_name gpt-4o --n_shot 8 # eight-shot

To compute accuracy from the .xlsx results file:

python test_accuracy_llm.py --model_name gpt-4o --n_shot 0
python test_accuracy_llm.py --model_name gpt-4o --n_shot 8

You can replace the --model_name value with any supported model whose API key is configured in your .env.
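
When comparing several models, the baseline and accuracy commands above can be generated rather than typed by hand. The sketch below is one possible wrapper, not something shipped with the repository: build_commands and run_all are hypothetical names, and the script names and flags are copied from the commands shown above.

```python
import subprocess


def build_commands(model_name, n_shots=(0, 8)):
    """Build the baseline and accuracy commands for each n_shot setting."""
    commands = []
    for n_shot in n_shots:
        # Run the baseline first, then score its output.
        for script in ("llm_baseline.py", "test_accuracy_llm.py"):
            commands.append(
                ["python", script, "--model_name", model_name, "--n_shot", str(n_shot)]
            )
    return commands


def run_all(model_name, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for command in build_commands(model_name):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(command, check=True)
```

Calling run_all("gpt-4o") only prints the four commands. Pass dry_run=False to execute them, which sends requests to the configured API.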


📚 Dataset Attribution

This project uses the INLI dataset released by Google DeepMind under the following license:

The dataset is intended for research and evaluation purposes only.

The prompt.txt file included in this repository is a modified version of the original prompt published in the INLI dataset paper by Google DeepMind, and is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

All other code in this repository is licensed under the MIT License.
