A comparative study of TF-IDF, BERT, and RoBERTa representations for unsupervised clustering of Turkish literary prose by author, developed for the Information Retrieval course at Bilkent University (MSc Computer Engineering).
Five Turkish authors, multiple books each, no labels at training time — can a clustering model recover the author of an unseen passage from its text alone, and how does the choice of representation change the answer?
The pipeline:
- Ingest & preprocess raw Turkish prose (sentence splitting, NLTK tokenization, lowercasing, punctuation/diacritic handling).
- Chunk each work into fixed-length passages of 32, 64, or 128 tokens. Chunk size is the central knob in this study — short chunks carry less style signal but yield more samples; long chunks are the opposite.
- Vectorize each chunk three ways: classical TF-IDF, contextual BERT sentence embeddings, and contextual RoBERTa sentence embeddings.
- Cluster with K-Means (k = number of authors) on each representation × chunk-size combination.
- Evaluate with cluster purity, Rand index, adjusted Rand index, and silhouette score, then visualize via PCA projection.
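The chunking and purity steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper names `chunk_tokens` and `purity` are hypothetical, and dropping the trailing partial chunk is an assumption about how fixed-length passages are formed. Purity is written out by hand because scikit-learn does not ship a purity metric.

```python
from collections import Counter


def chunk_tokens(tokens, size):
    """Split a token list into consecutive, non-overlapping chunks of `size` tokens.

    Assumption: a trailing remainder shorter than `size` is dropped, so every
    chunk has exactly the same length.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def purity(true_labels, cluster_ids):
    """Fraction of samples that carry the majority label of their cluster."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    if len(true_labels) == 0:
        raise ValueError("at least one sample is required")
    # Count how often each true label occurs inside each cluster.
    counts_per_cluster = {}
    for label, cluster in zip(true_labels, cluster_ids):
        counts_per_cluster.setdefault(cluster, Counter())[label] += 1
    # Sum the size of the majority label over all clusters.
    majority_total = sum(c.most_common(1)[0][1] for c in counts_per_cluster.values())
    return majority_total / len(true_labels)


# Toy usage with made-up tokens and labels (not project data).
tokens = "bir iki üç dört beş altı yedi".split()
print(chunk_tokens(tokens, 3))  # two chunks of 3 tokens; "yedi" is dropped
print(purity(["A", "A", "B", "B", "B"], [0, 0, 0, 1, 1]))  # 0.8
```

In this toy example, cluster 0 holds two "A" passages and one "B" passage, and cluster 1 holds two "B" passages, so 4 of 5 samples match their cluster's majority author.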
The headline question: do contextual embeddings actually help when the goal is style attribution rather than topic? The figures (tf_Idf.png, bert_32/64/128.png, roberta_32/64/128.png) compare the cluster structure each representation recovers.
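A projection of this kind could be produced roughly as follows for the TF-IDF case. This is a hedged sketch using scikit-learn, not the notebook's actual code: the toy `chunks` and `authors` values are invented placeholders, and the parameter choices (`n_init=10`, `random_state=0`, default `TfidfVectorizer` settings) are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical stand-ins for text chunks and their true authors (not project data).
chunks = [
    "deniz mavi deniz dalga", "dalga deniz kıyı mavi", "mavi kıyı dalga deniz",
    "dağ kar dağ orman", "orman kar dağ zirve", "zirve dağ orman kar",
]
authors = [0, 0, 0, 1, 1, 1]

# TF-IDF features, converted to a dense array so every later step accepts them.
features = TfidfVectorizer().fit_transform(chunks).toarray()

# K-Means with k equal to the number of distinct authors.
kmeans = KMeans(n_clusters=len(set(authors)), n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# External score (uses author labels) and internal score (uses geometry only).
ari = adjusted_rand_score(authors, cluster_ids)
sil = silhouette_score(features, cluster_ids)

# Project to two dimensions for a scatter plot colored by cluster id.
coords = PCA(n_components=2).fit_transform(features)

print(f"ARI={ari:.3f} silhouette={sil:.3f} coords shape={coords.shape}")
```

The `coords` array can be passed to a Matplotlib scatter plot with `cluster_ids` as the color argument. For the BERT and RoBERTa variants, the same clustering and projection steps would apply to the embedding matrix in place of `features`.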
The five authors in the corpus:
- Abdülhak Şinasi Hisar
- Ahmet Hamdi Tanpınar
- Ali Teoman
- Halid Ziya Uşaklıgil
- Refik Halid Karay
| File | What it is |
|---|---|
| IR_Project.ipynb | End-to-end notebook: ingest, chunk, vectorize (TF-IDF / BERT / RoBERTa), cluster, evaluate |
| tf_Idf.png | Cluster visualization with TF-IDF features |
| bert_32.png, bert_64.png, bert_128.png | BERT embeddings, by chunk size |
| roberta_32.png, roberta_64.png, roberta_128.png | RoBERTa embeddings, by chunk size |
Note on the corpus: the raw Turkish-language PDFs used as input are not bundled in this repository because the works are under copyright. The notebook expects them under ./<Author Name>/*.pdf. If you reproduce the experiment, supply your own corpus or substitute a public-domain Turkish-prose dataset.
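To check that a substitute corpus matches the expected layout, something like the following could be used. It is a sketch under assumptions: the helper name `find_corpus` is hypothetical, it only lists files and does not extract text, and it treats every subdirectory containing at least one PDF as an author.

```python
from pathlib import Path


def find_corpus(root="."):
    """Map each author directory name under `root` to its sorted list of PDF paths.

    Assumption: directories with no *.pdf files (for example .git) are ignored.
    """
    corpus = {}
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():
            continue
        pdfs = sorted(author_dir.glob("*.pdf"))
        if pdfs:
            corpus[author_dir.name] = pdfs
    return corpus


if __name__ == "__main__":
    # Print how many PDFs were found per author directory.
    for author, pdfs in find_corpus(".").items():
        print(f"{author}: {len(pdfs)} PDF file(s)")
```

Run from the repository root, this should list one entry per author directory once the PDFs are in place.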
pip install nltk scikit-learn transformers torch numpy pandas matplotlib pypdf
jupyter notebook IR_Project.ipynb

You will also need to download NLTK's Turkish tokenizer data (punkt) on first run.
Python · scikit-learn (TfidfVectorizer, KMeans, PCA, silhouette_score, adjusted_rand_score) · Hugging Face Transformers (bert-base-multilingual, xlm-roberta) · NLTK · Matplotlib
CS 533 — Information Retrieval Systems, Bilkent University.