Authorship Clustering of Turkish Literary Texts

A comparative study of TF-IDF, BERT, and roBERTa representations for unsupervised clustering of Turkish literary prose by author, developed for the Information Retrieval course at Bilkent University (MSc Computer Engineering).

What this project does

Five Turkish authors, multiple books each, no labels at training time — can a clustering model recover the author of an unseen passage from its text alone, and how does the choice of representation change the answer?

The pipeline:

Ingest & preprocess raw Turkish prose (sentence splitting, NLTK tokenization, lowercasing, punctuation/diacritic handling).
Chunk each work into fixed-length passages of 32, 64, or 128 tokens. Chunk size is the central knob in this study — short chunks carry less style signal but yield more samples; long chunks are the opposite.
Vectorize each chunk three ways: classical TF-IDF, contextual BERT sentence embeddings, contextual roBERTa sentence embeddings.
Cluster with K-Means (k = number of authors) on each representation × chunk-size combination.
Evaluate with cluster purity, Rand index, adjusted Rand index, and silhouette score, then visualize via PCA projection.
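The chunking step is where the 32/64/128 trade-off enters. Below is a minimal sketch of fixed-length chunking, assuming non-overlapping windows and a dropped trailing remainder (neither detail is stated in the pipeline description). `chunk_tokens` is a hypothetical helper, and the synthetic token list stands in for NLTK-tokenized prose.

```python
def chunk_tokens(tokens, size):
    """Split a token list into consecutive, non-overlapping chunks of exactly `size` tokens.

    Assumption: a trailing remainder shorter than `size` is discarded so that
    every chunk has the same length.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    usable = len(tokens) - len(tokens) % size  # length covered by full chunks
    return [tokens[i:i + size] for i in range(0, usable, size)]


# Synthetic stand-in for one tokenized work (300 placeholder tokens).
tokens = [f"w{i}" for i in range(300)]

for size in (32, 64, 128):
    chunks = chunk_tokens(tokens, size)
    print(f"chunk size {size:>3}: {len(chunks)} chunks")
```

The same text yields fewer samples as the window grows, which is the sample-count side of the chunk-size trade-off.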

The headline question: do contextual embeddings actually help when the goal is to attribute style rather than to separate topics? The figures (tf_Idf.png, bert_32/64/128.png, roberta_32/64/128.png) compare the cluster structure each representation recovers.
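Comparing representations comes down to scoring each clustering against the known authors. The following is a pure-Python sketch of purity, Rand index, and adjusted Rand index under their standard definitions, for illustration only. The function names and toy labels are made up, and silhouette score is left out because it needs the feature vectors rather than labels alone.

```python
from collections import Counter
from itertools import combinations
from math import comb


def purity(true_labels, cluster_ids):
    """Fraction of items that belong to the majority author of their cluster."""
    per_cluster = {}
    for author, cluster in zip(true_labels, cluster_ids):
        per_cluster.setdefault(cluster, Counter())[author] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(true_labels)


def rand_index(true_labels, cluster_ids):
    """Fraction of item pairs on which the clustering and the true labels agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agreements = sum(
        (true_labels[i] == true_labels[j]) == (cluster_ids[i] == cluster_ids[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def adjusted_rand_index(true_labels, cluster_ids):
    """Rand index corrected for chance, computed from the contingency table."""
    n = len(true_labels)
    cells = sum(comb(v, 2) for v in Counter(zip(true_labels, cluster_ids)).values())
    rows = sum(comb(v, 2) for v in Counter(true_labels).values())
    cols = sum(comb(v, 2) for v in Counter(cluster_ids).values())
    expected = rows * cols / comb(n, 2)
    maximum = (rows + cols) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (cells - expected) / (maximum - expected)


# Toy example: two authors, one passage of author "A" placed in the wrong cluster.
authors = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 1, 1]
print(round(purity(authors, clusters), 3))
print(round(rand_index(authors, clusters), 3))
print(round(adjusted_rand_index(authors, clusters), 3))
```

Purity rewards clusters dominated by one author but can be inflated by many small clusters, which is one reason to report the pair-counting indices alongside it.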

Authors in the corpus

Abdülhak Şinasi Hisar
Ahmet Hamdi Tanpınar
Ali Teoman
Halid Ziya Uşaklıgil
Refik Halid Karay

Files

| File | What it is |
| --- | --- |
| `IR_Project.ipynb` | End-to-end notebook: ingest, chunk, vectorize (TF-IDF / BERT / roBERTa), cluster, evaluate |
| `tf_Idf.png` | Cluster visualization with TF-IDF features |
| `bert_32.png`, `bert_64.png`, `bert_128.png` | BERT embeddings, by chunk size |
| `roberta_32.png`, `roberta_64.png`, `roberta_128.png` | roBERTa embeddings, by chunk size |

Note on the corpus: the raw Turkish-language PDFs used as input are not bundled in this repository because they are works under copyright. The notebook expects them under ./<Author Name>/*.pdf. If you reproduce the experiment, supply your own corpus or substitute a public-domain Turkish-prose dataset.
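Before running the notebook, it can help to check that the corpus is laid out as expected (one folder per author, each holding that author's PDFs). This is a small sketch under that assumption. `list_corpus` is a hypothetical helper, not a function from the notebook, and it only lists files; it does not extract any text.

```python
from pathlib import Path


def list_corpus(root="."):
    """Map each author folder under `root` to the sorted list of PDF paths inside it.

    Folders without any PDF files are skipped.
    """
    corpus = {}
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():
            continue
        pdfs = sorted(author_dir.glob("*.pdf"))
        if pdfs:
            corpus[author_dir.name] = pdfs
    return corpus


# Report how many PDFs were found for each author folder in the current directory.
for author, pdfs in list_corpus(".").items():
    print(f"{author}: {len(pdfs)} PDF file(s)")
```

An empty result means no author folders with PDFs were found under the chosen root.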

Running it

pip install nltk scikit-learn transformers torch numpy pandas matplotlib pypdf
jupyter notebook IR_Project.ipynb

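You will also need to download NLTK's Punkt tokenizer data (`punkt`, which includes the Turkish sentence model) on first run. Recent NLTK releases may ask for the `punkt_tab` resource instead.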
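The one-time download can be scripted. This is a minimal sketch, assuming network access and that requesting both `punkt` and the newer `punkt_tab` resource is acceptable (which one is needed depends on the installed NLTK version). `download_tokenizer_data` is a hypothetical helper name.

```python
def download_tokenizer_data(resources=("punkt", "punkt_tab")):
    """Fetch NLTK tokenizer resources and return {resource_name: success_flag}."""
    import nltk  # imported lazily so the helper can be defined before NLTK is needed

    # nltk.download returns True on success and False otherwise.
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling `download_tokenizer_data()` once is enough. Afterwards, `nltk.sent_tokenize(text, language="turkish")` selects the Turkish Punkt model for sentence splitting.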

Stack

Python · scikit-learn (TfidfVectorizer, KMeans, PCA, silhouette_score, adjusted_rand_score) · Hugging Face Transformers (bert-base-multilingual, xlm-roberta) · NLTK · Matplotlib
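The scikit-learn pieces listed above fit together roughly as follows. This is a sketch of the TF-IDF branch only, with default vectorizer settings and a fixed random seed chosen for illustration; it is not the notebook's actual code. The toy chunks are invented strings with two disjoint vocabularies, not passages from the corpus.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Invented toy chunks; the real inputs would be fixed-length chunks of Turkish prose.
chunks = [
    "deniz yalı boğaz akşam",
    "boğaz akşam deniz sandal",
    "yalı sandal deniz boğaz",
    "saat zaman rüya ayar",
    "zaman ayar saat masa",
    "rüya masa saat zaman",
]
authors = ["A", "A", "A", "B", "B", "B"]  # used only for evaluation, never for fitting

# Vectorize: one sparse TF-IDF row per chunk.
features = TfidfVectorizer().fit_transform(chunks)

# Cluster with k equal to the number of authors.
kmeans = KMeans(n_clusters=len(set(authors)), n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# Evaluate: ARI compares against the true authors, silhouette uses geometry only.
ari = adjusted_rand_score(authors, cluster_ids)
silhouette = silhouette_score(features, cluster_ids)

# Project to 2-D for plotting; PCA is given a dense array here.
coords = PCA(n_components=2).fit_transform(features.toarray())

print(f"ARI={ari:.3f}  silhouette={silhouette:.3f}  projection shape={coords.shape}")
```

For the BERT and roBERTa branches, only the `features` matrix would change (one embedding vector per chunk); the clustering, scoring, and projection steps stay the same.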

Course

CS 533 — Information Retrieval Systems, Bilkent University.
