AI-Case-Studies

This repository explores the Transformer architecture component by component, covering what each part does, what it requires, and where it is applied.

Repo Setup

Set up the environment by running the following command:
```
sh scripts/setup_env.sh
```
To add a new dependency package to the repo:
```
uv add <package-name>
```

Note: The apache-beam library is utilized as a large-scale data processing tool in this repository. However, due to dependency conflicts with other packages, it has been intentionally excluded from the requirements-dev.txt file. It is strongly recommended to install this package separately.

Repository Structure

project
│   README.md
│   requirements.txt
│___scripts (contains setup and other script files)
│___core
│   │____tokenizer (holds all tokenizer training files and artifacts)
│   │____activation (holds all activation implementations and test cases)
│   │____configuratioin (holds all model configurations)
│   │____layers (holds all custom Transformer and other model layers)
│   │____utils (holds all helper utilities)
│   │____training (holds all training-related modules)
│___study (contains experimentation studies and other resources)

Transformers

This repository implements the Transformer model's decoder architecture in detail, with components for tokenization, training, and customization. It is meant as a hands-on resource for experimenting with Transformer models.

Studies

Tokenizer

This repository includes resources for training and customizing Byte Pair Encoding (BPE) tokenizers, which turn raw text into the subword tokens that Transformer models take as input.

The following Jupyter notebooks explore various aspects of tokenization:

AI-Uncomplicated/study/Tokenizer.ipynb: Overview of and insights into tokenization techniques.
AI-Uncomplicated/study/tokenizer_training_toy.ipynb: Hands-on guide to training a tokenizer on toy datasets.

Training and Customization

For advanced users interested in training and modifying BPE tokenizers:

Training a BPE Tokenizer: Use AI-Uncomplicated/tokenizer/bpe/trainer.py to train a tokenizer from scratch.
Post-Processing (Token Addition & Removal): The AI-Uncomplicated/tokenizer/bpe/notebooks/post_process_trained_tokenizer.ipynb notebook provides tools for adding or removing specific tokens from an existing tokenizer.
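
For readers new to BPE, the core training loop can be sketched in a few lines of plain Python. This is a toy illustration only and is not the code in trainer.py; the function names (`train_bpe`, `merge_pair`, `encode_word`) are hypothetical and chosen for this example, and the sketch assumes a whitespace-split corpus with character-level starting symbols.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules (hypothetical helper, toy scale)."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        # Break frequency ties by the pair itself so training is deterministic.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)


merges = train_bpe("ab ab ab abc", num_merges=5)
tokens = encode_word("abcab", merges)
```

A real tokenizer also handles byte-level fallbacks, special tokens, and pre-tokenization rules, which this sketch leaves out.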

Position Encoding

In Transformer models, position encoding is crucial for providing a sense of word order in sequences since these models lack inherent positional awareness. This section explores both absolute and relative position encoding techniques.

What is Positional Encoding? To grasp the fundamentals of positional encoding and sinusoidal encoding, explore the notebook AI-Uncomplicated/study/sinisouidal_encoding.ipynb, which explains in detail how positional information is encoded mathematically in Transformers.
Relative Positional Encoding: Relative positional encoding lets Transformers attend based on the offsets between tokens rather than their absolute positions. Several approaches have been proposed to implement this. Use the notebook AI-Uncomplicated/study/rope_positional_encoding.ipynb to delve into the Rotary Positional Embedding (RoPE) method, understand its motivation, and see how it works in practice.
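
The two schemes above can be compared in a small, dependency-free sketch. It is not taken from the notebooks: the function names are hypothetical, and the RoPE variant shown assumes the interleaved-pair layout (dimensions 0 and 1 rotate together, then 2 and 3, and so on), whereas some implementations split the vector into halves instead.

```python
import math


def sinusoidal_encoding(position, d_model, base=10000.0):
    """Absolute sinusoidal encoding for one position: sin on even dims, cos on odd."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding[:d_model]


def rope_rotate(vector, position, base=10000.0):
    """Rotate each consecutive pair of dims by an angle proportional to position."""
    d = len(vector)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]

# Same offset (2) at different absolute positions gives the same attention score.
score_near = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_far = dot(rope_rotate(q, 7), rope_rotate(k, 5))
```

The sinusoidal encoding is added to token embeddings once at the input, while RoPE is applied to the query and key vectors, which is why the query-key dot product ends up depending only on the relative offset.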

References

Attention Is All You Need

Neural Machine Translation with a Transformer and Keras

Meta Lingua repo [training scripts and a modified data loader from this repo have been used]
