# subword

Here are 19 public repositories matching this topic...

How tokenization fractures Catalan morphology, and whether morpheme-aware segmentation recovers the geometry. Tested on 3 Indo-European languages (Catalan, Spanish, English): Catalan fragments ~1.7× more than English; forcing the cut at morpheme boundaries recovers compositionality (robust to the carrier and replicated in Spanish).

  • Updated May 28, 2026
  • Python
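
The description above does not say how fragmentation is measured. One common proxy is fertility, the mean number of tokens a tokenizer produces per word, and a ratio of two fertilities gives a figure of the "~1.7×" kind. The sketch below is an illustration under that assumption, not the repository's method: `fixed_chunks`, `fertility`, and `fragmentation_ratio` are hypothetical names, and `fixed_chunks` is a toy stand-in for a real subword tokenizer.

```python
from typing import Callable, Iterable, List


def fixed_chunks(word: str, size: int = 3) -> List[str]:
    """Toy stand-in tokenizer (assumption): cut a word into pieces of at most `size` characters."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def fertility(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Mean number of tokens produced per word."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    return sum(len(tokenize(w)) for w in words) / len(words)


def fragmentation_ratio(
    words_a: Iterable[str],
    words_b: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """How many times more tokens per word sample A needs than sample B."""
    return fertility(words_a, tokenize) / fertility(words_b, tokenize)
```

In practice `tokenize` would wrap the tokenizer under study, and the two word samples would come from comparable corpora in each language.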

Tokenization splits a piece of text into smaller units called tokens, which can be words, characters, or subwords. Tokenization therefore falls broadly into three types: word, character, and subword (character n-gram) tokenization.

  • Updated Jun 30, 2021
  • Jupyter Notebook
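
The three types named in the description above can be sketched in a few lines. This is a minimal illustration, not code from the listed notebook: the function names are hypothetical, word tokenization is assumed to be plain whitespace splitting, and subwords are taken to be overlapping character n-grams.

```python
from typing import List


def word_tokens(text: str) -> List[str]:
    """Word tokenization: split on whitespace (assumption: no punctuation handling)."""
    return text.split()


def char_tokens(text: str) -> List[str]:
    """Character tokenization: one token per non-whitespace character."""
    return [c for c in text if not c.isspace()]


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Subword tokenization as overlapping character n-grams of a single word."""
    if len(word) <= n:
        return [word]  # word too short to slide a window over
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

For example, `char_ngrams("token", 3)` yields `["tok", "oke", "ken"]`, while `word_tokens("subword units")` yields `["subword", "units"]`.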
