GitHub - iehok/DualCSE

DualCSE

This is the repository for the paper "One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations".

🌲 Directory Structure

The directory structure should look like this:

DualCSE
├── figures/            # Paper figures
│   └── *.png
├── scripts/            # Shell scripts
│   └── *.sh
├── src/                # Our main implementation
│   └── *.py
├── prompt/            
│   └── prompt.txt
├── datasets/
│   ├── inli/           # Cloned INLI repo
│   │   ├── INLI Data
│   │   ├── Resources
│   │   └── ...
│   └── impscore/       # Downloaded from ImpScore repo
│       └── all_data.csv
└── ...

📦 Setup

1. Clone the INLI Dataset

Before running the code, please clone the INLI dataset into the project root:

git clone https://github.com/google-deepmind/inli.git datasets/inli

2. Download Wang's Dataset

Please download all_data.csv from the ImpScore repo and place it in datasets/impscore.
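
Before preprocessing, it can help to confirm that both datasets landed where the directory tree above expects them. The following is a minimal sketch, not part of the repository: the helper name missing_dataset_paths is hypothetical, and it only checks the two paths named in this README.

```python
from pathlib import Path

# Paths taken from the directory tree in this README.
# The helper below is a hypothetical convenience, not repository code.
REQUIRED_PATHS = [
    Path("datasets/inli"),                   # cloned INLI repo
    Path("datasets/impscore/all_data.csv"),  # file from the ImpScore repo
]


def missing_dataset_paths(root="."):
    """Return the required dataset paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths()
    if missing:
        for p in missing:
            print(f"missing: {p}")
    else:
        print("all dataset paths found")
```

Run it from the project root; an empty result means the layout matches the tree shown in the Directory Structure section.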

3. Preprocess

To standardize the dataset format, please perform preprocessing on the downloaded data:

python -m src.prepare_data

🏋️ Model Training & Testing

To train the model, please execute the following command:

bash scripts/train.sh

After training, you can test the model with the following commands:

python -m src.test_rte --run_name RUN_NAME
python -m src.test_implicitness_scoring --run_name RUN_NAME

🔐 API Keys

To run experiments with external LLM APIs, create a .env file in the root directory to store your API keys:

vi .env

Add the following keys:

OPENAI_API_KEY="<your OpenAI API key>"
DEEPSEEK_API_KEY="<your DeepSeek API key>"
GEMINI_API_KEY="<your Gemini API key>"
CLAUDE_API_KEY="<your Claude API key>"
MISTRAL_API_KEY="<your Mistral API key>"
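
For reference, here is a minimal sketch of how KEY="value" lines like the ones above can be read into the process environment using only the Python standard library. It is an illustration under assumptions: the function name load_env_file is hypothetical, and the repository's own scripts may load the file differently (for example through a dotenv package).

```python
import os


def load_env_file(path=".env"):
    """Parse KEY="value" lines from `path` and copy them into os.environ.

    Hypothetical helper for illustration; not taken from the repository.
    """
    loaded = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and anything without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            value = value.strip()
            # Drop one pair of matching surrounding quotes, if present.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1]
            loaded[key.strip()] = value
    os.environ.update(loaded)
    return loaded
```

After calling load_env_file(), a key such as OPENAI_API_KEY is available through os.environ. Keep the .env file out of version control so the keys are not committed.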

🚀 Running LLM Baselines & Evaluation

To run the LLM baseline and evaluate results, use the following commands:

python llm_baseline.py --model_name gpt-4o --n_shot 0 # zero-shot
python llm_baseline.py --model_name gpt-4o --n_shot 8 # eight-shot

To test accuracy based on the .xlsx file:

python test_accuracy_llm.py --model_name gpt-4o --n_shot 0
python test_accuracy_llm.py --model_name gpt-4o --n_shot 8

You can replace --model_name with any supported LLM API name configured in your .env.
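
To run both shot settings for one model without retyping the commands, the invocations above can be generated in a loop. This is a sketch only: the function name baseline_commands is hypothetical, the script names and flags are the ones shown in this README, and it assumes it is run from the directory that contains llm_baseline.py and test_accuracy_llm.py.

```python
import sys


def baseline_commands(model_name, n_shots=(0, 8)):
    """Build argv lists for the baseline run and the accuracy check per shot count.

    Hypothetical helper; script names and flags are copied from this README.
    """
    commands = []
    for n in n_shots:
        flags = ["--model_name", model_name, "--n_shot", str(n)]
        commands.append([sys.executable, "llm_baseline.py", *flags])
        commands.append([sys.executable, "test_accuracy_llm.py", *flags])
    return commands


if __name__ == "__main__":
    for cmd in baseline_commands("gpt-4o"):
        # Print the command without the interpreter path for readability.
        print(" ".join(cmd[1:]))
        # To actually execute it: subprocess.run(cmd, check=True)
```

The sketch only prints the commands; executing them through subprocess.run calls the external LLM API and therefore needs the matching key in .env.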

📚 Dataset Attribution

This project uses the INLI dataset released by Google DeepMind under the following license:

Source: google-deepmind/inli
License: CC BY-SA 4.0

The dataset is intended for research and evaluation purposes only.

The prompt.txt file included in this repository is a modified version of the original prompt published in the INLI dataset paper by Google DeepMind, and is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

All other code in this repository is licensed under the MIT License.
