subword

kkaryl / AI6127-Deep_NLP

Source code for the assignments of NTU's MSAI course AI6127, Deep Neural Networks for Natural Language Processing (2019 Sem 2).

nlp ner language-model subword msai

Updated Dec 11, 2020
Jupyter Notebook

burcgokden / BERT-Subword-Tokenizer-Wrapper

A framework for generating a subword vocabulary from a TensorFlow dataset and building custom BERT tokenizer models.

machine-learning deep-learning tensorflow machine-translation vocabulary-builder bert subword wordpiece berttokenizer tensorflow-text

Updated Jul 6, 2021
Python

jluo41 / NLPText

corpus subword textpreprocessing field-grains granularity

Updated Jan 8, 2023
Jupyter Notebook

Catmono / bpe-tokenizer-ts

🧠 Build and explore a minimal Byte Pair Encoding tokenizer in TypeScript, training and encoding text using raw UTF-8 bytes without external libraries.

nlp open-source education encoding machine-learning typescript ai compiler decoding developer-tools text-processing language-models utf8 subword bun bpe byte-pair-encoding llm

Updated Jun 9, 2026
TypeScript
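
As a rough illustration of what byte-level Byte Pair Encoding involves, here is a minimal Python sketch. It is not taken from the repository above (which is written in TypeScript); the function names `train_bpe`, `merge_pair`, `encode`, and `decode` are hypothetical, and the tie-breaking and stopping rules are assumptions made for this example.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the raw UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # initial tokens are byte values 0..255
    merges = {}  # (left_id, right_id) -> new token id, in learning order
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # assumption: stop when no pair repeats
            break
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Apply the learned merges, in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back to bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Because the base vocabulary is the 256 possible byte values, any input string can be encoded without an unknown-token fallback, and `decode(encode(text, merges), merges)` returns the original text.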

xaviviro / la-morfologia-no-surt-de-franc

How tokenization fractures Catalan morphology, and whether a morpheme-aware segmentation recovers the geometry. Tested on 3 Indo-European languages (Catalan, Spanish, English): Catalan is fragmented ~1.7× more than English; forcing the split at morpheme boundaries recovers compositionality (robust to the carrier and replicated in Spanish).

nlp morphology catalan gemma tokenization subword salamandra llm qwen low-resource-language representation-geometry open-weight-models

Updated May 28, 2026
Python

Ishan-Kotian / Tokenizer_NLP

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.

cat nlp count tensorflow tokenizer natural-language character sentence keras-classification-models subword nerual-network imdb-dataset deep-learning-architectures rnn-keras smaller-units tokenizer-nlp

Updated Jun 30, 2021
Jupyter Notebook
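
The three granularities named in the description above (word, character, and character n-gram subwords) can be sketched in a few lines of Python. This is an illustrative example only, not code from the repository; the helper names are hypothetical, and whitespace splitting stands in for a real word tokenizer.

```python
def word_tokens(text):
    """Word-level tokens: split on whitespace (a simplifying assumption)."""
    return text.split()


def char_tokens(text):
    """Character-level tokens: one token per character."""
    return list(text)


def char_ngrams(word, n):
    """Subword tokens as overlapping character n-grams of a single word."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(word_tokens("smaller units"))  # ['smaller', 'units']
print(char_tokens("cat"))            # ['c', 'a', 't']
print(char_ngrams("token", 3))       # ['tok', 'oke', 'ken']
```

Word tokens keep the vocabulary large but the sequences short, character tokens do the opposite, and subword units sit between the two.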

Scitator / subword-nmt

Subword Neural Machine Translation

deep-learning seq2seq neural-machine-translation language-model subword

Updated Jun 20, 2017
Python

DHRUVCHARNE / bpe-tokenizer-ts

From-scratch Byte Pair Encoding (BPE) tokenizer in TypeScript using Bun

Updated Feb 11, 2026
TypeScript

TiMauzi / dawg

The concept of DAWGs is based on: Blumer, A. et al. (1985). The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40, 31–55.

nlp tree parsing tree-structure theoretical-computer-science dawg subword subword-segmentation subwords

Updated Sep 13, 2022
Java

subword

Here are 19 public repositories matching this topic...

chrisgrieser / nvim-various-textobjs

scarletcho / KoLM

zouharvi / tokenization-scorer

lallubharteja / KWS-Scripts

cooelf / subMrc

andreasgrv / johnny

cooelf / subword_seg

wang-h / FMDL

explanare / char-iit

scarletcho / subword-mikolov

kkaryl / AI6127-Deep_NLP

burcgokden / BERT-Subword-Tokenizer-Wrapper

jluo41 / NLPText

Catmono / bpe-tokenizer-ts

xaviviro / la-morfologia-no-surt-de-franc

Ishan-Kotian / Tokenizer_NLP

Scitator / subword-nmt

DHRUVCHARNE / bpe-tokenizer-ts

TiMauzi / dawg