LLM Information Retrieval

This repository chains a vector database with an LLM, using LangChain and open-source models, to build an information retrieval system over domain-specific data. Instead of only pointing to matching documents the way a search engine does, it returns a direct, concise answer together with the source document that was used to generate it. I'm using this repository to document my experiments with generative LLMs as new methods and tricks are released in open source.

Requirements

The required packages are installed at the beginning of the notebook
The notebook was run on an Azure VM of node type Standard_NC64as_T4_v3

Overview

Vector Index Setup

Collect all documents of your corpus into a single folder in PDF format
The index is built by reading each document page by page; from this point on, each page is referred to as a document
Embeddings for the vector index are generated by a text embedding model; several sentence-transformers models are available to choose from
FAISS is used to build an index over these vectors, and the index can be made as complex as necessary to trade retrieval speed against retrieval accuracy
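
The steps above can be sketched without any dependencies. This is only a toy stand-in, not the notebook's code: the `ToyPageIndex` class, the word-count `embed` method, and the sample page texts are all made up for illustration. In the real setup a sentence-transformers model would produce the embeddings and FAISS would hold the index; the brute-force cosine search here plays the role of an exact (flat) index over normalized vectors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyPageIndex:
    """Brute-force cosine index over pages (hypothetical stand-in for embeddings + FAISS)."""

    def __init__(self, pages):
        # pages: list of (doc_id, text); each entry is one page of a PDF.
        self.doc_ids = [doc_id for doc_id, _ in pages]
        vocab = sorted({tok for _, text in pages for tok in tokenize(text)})
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.vectors = [self.embed(text) for _, text in pages]

    def embed(self, text):
        """Toy embedding: L2-normalized word counts over the corpus vocabulary."""
        vec = [0.0] * len(self.vocab)
        for tok, count in Counter(tokenize(text)).items():
            if tok in self.vocab:
                vec[self.vocab[tok]] = float(count)
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec

    def search(self, query, k=3):
        """Return the k pages with the highest cosine similarity to the query."""
        q = self.embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc_id)
            for vec, doc_id in zip(self.vectors, self.doc_ids)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]


# Made-up pages; the identifiers follow a "<file>#page=<n>" convention chosen for this sketch.
pages = [
    ("manual.pdf#page=1", "Install the pump and connect the power cable."),
    ("manual.pdf#page=2", "Replace the filter every six months."),
    ("faq.pdf#page=1", "The warranty covers the pump for two years."),
]
index = ToyPageIndex(pages)
top_hits = index.search("How often should I replace the filter?", k=2)
```

Keeping the page identifier next to each vector is what later lets the answer cite its source. Swapping the exact search for an approximate FAISS index is where the speed versus accuracy trade-off comes in.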

Generative LLM Setup

Choose a model from this leaderboard
Each model has its own hardware requirements for loading, and depends on the packages that were used to train it
Most leading models on Hugging Face give guidance on both, and it's best to follow that guidance before trying customizations
Tuning generation parameters such as temperature, top_p, and top_k helps control the quality of the generation
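
To make the effect of these parameters concrete, here is a simplified sketch of how temperature, top_k, and top_p reshape a next-token distribution before sampling. It is not the implementation of any particular library; `filter_distribution`, `sample_token`, and the four-token logits are hypothetical, and the order of operations (temperature, then top_k, then top_p) is an assumption of this sketch.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(weight for _, weight in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, weight in ranked:
        kept.append((tok, weight))
        cumulative += weight / total
        if cumulative >= top_p:
            break
    norm = sum(weight for _, weight in kept)
    return {tok: weight / norm for tok, weight in kept}


def sample_token(distribution, rng=random):
    """Draw one token according to the filtered probabilities."""
    tokens = list(distribution)
    return rng.choices(tokens, weights=[distribution[t] for t in tokens], k=1)[0]


# Made-up logits for four candidate tokens.
logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
base = filter_distribution(logits)
sharp = filter_distribution(logits, temperature=0.5)
two_best = filter_distribution(logits, top_k=2)
nucleus = filter_distribution(logits, top_p=0.5)
```

For a retrieval task, a lower temperature and a tighter top_p generally keep the answer closer to the retrieved text, at the cost of less varied wording.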

Steps of running a query

The question is first run against the vector index to retrieve the top-matching documents
The number of top-k hits that can be used is limited by the context length supported by the generative LLM and by the chunking that determines the length of each document
A prompt template explains the task to the model and includes a few examples that set expectations for the generated tokens
The model is also prompted to return the document identifier as a source reference, and the template gives explicit instructions on how this should be formatted
A limited set of top-k hits is sent to the model in the prompt template, together with the question, to generate the answer
Because the model was prompted to follow a specific format when citing the reference, checking whether the answer uses that format helps discard one class of cases where the generative LLM definitely hallucinated
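
The query steps above could look roughly like the sketch below. The template wording, the `SOURCE: [<doc_id>]` citation format, the whitespace token counter, and the names `build_prompt` and `check_citation` are all assumptions made for illustration; the notebook's actual template and format may differ. Checking that the cited identifier was actually among the documents sent in the prompt is an extra safeguard added here, beyond the format check described above.

```python
import re

# Hypothetical template: task description, one example, then documents and question.
PROMPT_TEMPLATE = """Answer the question using only the documents below.
End the answer with the identifier of the document you used, formatted exactly as: SOURCE: [<doc_id>]

Example:
Question: How long is the warranty?
Answer: Two years. SOURCE: [faq.pdf#page=1]

Documents:
{documents}

Question: {question}
Answer:"""


def render(question, docs):
    """Fill the template with the selected (doc_id, text) pairs."""
    documents = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
    return PROMPT_TEMPLATE.format(documents=documents, question=question)


def build_prompt(question, hits, max_prompt_tokens,
                 count_tokens=lambda text: len(text.split())):
    """Add hits in ranked order until the prompt would exceed the token budget.

    count_tokens defaults to a crude whitespace count; a real setup would
    use the tokenizer of the chosen model instead.
    """
    selected = []
    for hit in hits:
        if count_tokens(render(question, selected + [hit])) > max_prompt_tokens:
            break
        selected.append(hit)
    return render(question, selected), [doc_id for doc_id, _ in selected]


CITATION = re.compile(r"SOURCE: \[([^\]]+)\]\s*$")


def check_citation(answer, allowed_doc_ids):
    """Return the cited doc id if the answer follows the format, else None."""
    match = CITATION.search(answer.strip())
    if match and match.group(1) in allowed_doc_ids:
        return match.group(1)
    return None


# Made-up hits, as they might come back from the vector index.
hits = [
    ("manual.pdf#page=2", "Replace the filter every six months."),
    ("manual.pdf#page=1", "Install the pump and connect the power cable."),
]
prompt, used_ids = build_prompt("How often should I replace the filter?", hits,
                                max_prompt_tokens=2000)
cited = check_citation("Every six months. SOURCE: [manual.pdf#page=2]", used_ids)
```

An answer for which `check_citation` returns `None` can be discarded or retried, since it either ignored the requested format or cited a document that was never shown to the model.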
