NoorMuhammad1/turkish-author-clustering

Authorship Clustering of Turkish Literary Texts

A comparative study of TF-IDF, BERT, and RoBERTa representations for unsupervised clustering of Turkish literary prose by author, developed for the Information Retrieval course at Bilkent University (MSc Computer Engineering).

What this project does

Five Turkish authors, multiple books each, no labels at training time — can a clustering model recover the author of an unseen passage from its text alone, and how does the choice of representation change the answer?

The pipeline:

  1. Ingest & preprocess raw Turkish prose (sentence splitting, NLTK tokenization, lowercasing, punctuation/diacritic handling).
  2. Chunk each work into fixed-length passages of 32, 64, or 128 tokens. Chunk size is the central knob in this study — short chunks carry less style signal but yield more samples; long chunks are the opposite.
  3. Vectorize each chunk three ways: classical TF-IDF, contextual BERT sentence embeddings, and contextual RoBERTa sentence embeddings.
  4. Cluster with K-Means (k = number of authors) on each representation × chunk-size combination.
  5. Evaluate with cluster purity, Rand index, adjusted Rand index, and silhouette score, then visualize via PCA projection.
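
Step 2 can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the function name chunk_tokens, the drop_last flag, and the choice to discard a short trailing chunk are all assumptions, and a plain whitespace split stands in for the NLTK tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, drop_last: bool = True) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping chunks of chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    # Assumption: a short trailing chunk is discarded so every passage has equal length.
    if drop_last and chunks and len(chunks[-1]) < chunk_size:
        chunks.pop()
    return chunks


# Placeholder text of 150 tokens; whitespace split stands in for NLTK tokenization.
tokens = ("bir iki üç " * 50).split()
for size in (32, 64, 128):
    print(size, len(chunk_tokens(tokens, size)))
# prints:
# 32 4
# 64 2
# 128 1
```

The output shows the trade-off described in step 2: the same 150 tokens yield four short passages or a single long one.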

The headline question: do contextual embeddings actually help when the goal is style attribution rather than topic? The figures (tf_Idf.png, bert_32/64/128.png, roberta_32/64/128.png) compare the cluster structure each representation recovers.
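
To make two of the step 5 metrics concrete, here is a small pure-Python sketch of cluster purity and the Rand index. The function names and the toy labels are illustrative assumptions; the notebook may compute these differently, and scikit-learn's adjusted_rand_score covers the chance-corrected variant.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Sequence


def purity(true_labels: Sequence[Hashable], cluster_ids: Sequence[Hashable]) -> float:
    """Fraction of samples that belong to the majority author of their cluster."""
    by_cluster: Dict[Hashable, Counter] = {}
    for author, cluster in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cluster, Counter())[author] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def rand_index(true_labels: Sequence[Hashable], cluster_ids: Sequence[Hashable]) -> float:
    """Fraction of sample pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (cluster_ids[i] == cluster_ids[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy example: six chunks, two hypothetical authors, one chunk misassigned.
authors = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 1, 1]
print(round(purity(authors, clusters), 3))      # 0.833 (5 of 6 chunks match their cluster's majority)
print(round(rand_index(authors, clusters), 3))  # 0.667 (10 of 15 pairs agree)
```

The pairwise loop is quadratic in the number of chunks, so it is meant for checking intuition on small samples rather than for scoring the full corpus.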

Authors in the corpus

  • Abdülhak Şinasi Hisar
  • Ahmet Hamdi Tanpınar
  • Ali Teoman
  • Halid Ziya Uşaklıgil
  • Refik Halid Karay

Files

| File | What it is |
| --- | --- |
| IR_Project.ipynb | End-to-end notebook: ingest, chunk, vectorize (TF-IDF / BERT / RoBERTa), cluster, evaluate |
| tf_Idf.png | Cluster visualization with TF-IDF features |
| bert_32.png, bert_64.png, bert_128.png | Cluster visualization with BERT embeddings, by chunk size |
| roberta_32.png, roberta_64.png, roberta_128.png | Cluster visualization with RoBERTa embeddings, by chunk size |

Note on the corpus: the raw Turkish-language PDFs used as input are not bundled in this repository because they are works under copyright. The notebook expects them under ./<Author Name>/*.pdf. If you reproduce the experiment, supply your own corpus or substitute a public-domain Turkish-prose dataset.
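
If you supply your own corpus, a quick check that it matches the expected ./<Author Name>/*.pdf layout might look like the sketch below. The helper name find_corpus is hypothetical and is not part of the notebook; it only lists files and does not parse them.

```python
from pathlib import Path
from typing import Dict, List


def find_corpus(root: str = ".") -> Dict[str, List[Path]]:
    """Map each author directory under root to its sorted list of PDF files."""
    corpus: Dict[str, List[Path]] = {}
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():
            continue
        pdfs = sorted(author_dir.glob("*.pdf"))
        if pdfs:  # skip directories that hold no PDFs, such as .git
            corpus[author_dir.name] = pdfs
    return corpus


# Print one line per author directory found in the current working directory.
for author, files in find_corpus(".").items():
    print(f"{author}: {len(files)} PDF(s)")
```

The directory name doubles as the author label, which is only used for evaluation since the clustering itself is unsupervised.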

Running it

pip install nltk scikit-learn transformers torch numpy pandas matplotlib pypdf
jupyter notebook IR_Project.ipynb

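You will also need to download NLTK's punkt tokenizer data, which includes a Turkish sentence model, on first run (for example, python -m nltk.downloader punkt). Recent NLTK releases may ask for the punkt_tab resource instead.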

Stack

Python · scikit-learn (TfidfVectorizer, KMeans, PCA, silhouette_score, adjusted_rand_score) · Hugging Face Transformers (bert-base-multilingual, xlm-roberta) · NLTK · Matplotlib
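
As a rough sketch of how the scikit-learn pieces listed above fit together for the TF-IDF branch, the snippet below vectorizes, clusters, scores, and projects a handful of toy sentences. The sentences, the labels, and every parameter value (n_clusters=2, n_init=10, random_state=0) are assumptions chosen for illustration, not settings taken from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy stand-in chunks; the real inputs are the 32/64/128-token passages.
chunks = [
    "deniz kıyısında eski bir yalı vardı",
    "yalı deniz kıyısında sessizce duruyordu",
    "saat kulesi şehrin ortasında yükseliyordu",
    "şehrin ortasında saat kulesi çalıyordu",
]
authors = [0, 0, 1, 1]  # hypothetical ground-truth labels, used only for scoring

# Sparse TF-IDF matrix: one row per chunk, one column per vocabulary term.
X = TfidfVectorizer().fit_transform(chunks)

# k is set to the number of authors (2 in this toy example, 5 in the study).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("ARI:", adjusted_rand_score(authors, labels))
print("silhouette:", silhouette_score(X, labels))

# PCA needs a dense array; fine for a toy example, costly for a large vocabulary.
coords = PCA(n_components=2).fit_transform(X.toarray())
print(coords.shape)
```

For the BERT and RoBERTa branches, the same clustering and scoring calls would apply to the embedding matrix in place of X.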

Course

CS 533 — Information Retrieval Systems, Bilkent University.
