Valdecy/pyAutoSummarizer

pyAutoSummarizer

pyAutoSummarizer — An Extractive and Abstractive Summarization Library Powered with Artificial Intelligence.

Citation

PEREIRA, V., DE LIMA PORTO, R.C., FIGUEIRA, L.A.A., FERREIRA, R.A.C.A. (2026). Unveiling pyAutoSummarizer: An Extractive and Abstractive Summarization Library Powered with Artificial Intelligence. In: DA HORA, H., PORTER, A.L., CHIAVETTA, D., ZHANG, Y. (eds) Technology Mining. Springer, Cham. https://doi.org/10.1007/978-3-032-10849-4_2

Introduction

pyAutoSummarizer is a Python library for text summarization, covering both extractive and abstractive approaches, and providing a comprehensive suite of evaluation metrics — from classic n-gram overlap to modern semantic and faithfulness measures.

Summarization Methods

Extractive — identifies and returns the most important sentences from the original text:

| Method | Description |
| --- | --- |
| TextRank | Graph-based ranking using sentence embeddings and cosine similarity |
| LexRank | Graph-based ranking using TF-IDF cosine similarity |
| LSA | Latent Semantic Analysis via SVD on embeddings or TF-IDF matrix |
| KL-Sum | Selects sentences that minimise KL-divergence from the full document distribution |
| BART | facebook/bart-large-cnn abstractive model (deep learning) |
| T5 | t5-base abstractive model (deep learning) |
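
To make the graph-based idea behind LexRank concrete, here is a minimal standard-library sketch: sentences become TF-IDF vectors, pairwise cosine similarities form a graph, and a damped power iteration ranks the sentences. It is an illustration under simplifying assumptions, not the library's implementation; the function names, the smoothed IDF formula, the damping value, and the iteration count are all choices made for this example.

```python
import math
import re
from collections import Counter

def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (a dict) per sentence."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by damped power iteration over the similarity graph."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    # Row-normalise; a sentence with no similar neighbour gets a uniform row.
    trans = []
    for row in sim:
        total = sum(row)
        trans.append([v / total for v in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * trans[j][i] for j in range(n))
                  for i in range(n)]
    return scores

sentences = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "Stock prices fell sharply today.",
]
scores = lexrank_scores(sentences)
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print([sentences[i] for i in ranked[:2]])
```

The two sentences about the cat reinforce each other and rank above the unrelated one. Swapping the TF-IDF vectors for sentence embeddings would give a TextRank-style variant of the same procedure.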

Abstractive — generates new text that captures the meaning of the source:

| Method | Description |
| --- | --- |
| PEGASUS | google/pegasus-xsum model fine-tuned for abstractive summarization |
| chatGPT | OpenAI gpt-4o-mini (or any chat model) via the OpenAI API |

Text Pre-processing

The library provides a flexible pre-processing pipeline:

  • Lowercasing, accent removal, special character removal, number removal
  • Custom word removal
  • Stopword removal across 27 languages: Arabic, Bengali, Bulgarian, Chinese, Czech, English, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Italian, Japanese, Korean, Marathi, Persian, Polish, Portuguese-br, Romanian, Russian, Slovak, Spanish, Swedish, Thai, and Ukrainian
  • Sentence segmentation by punctuation, word count, or character count
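
The steps listed above can be sketched with the standard library alone. This is only an illustration of what such a pipeline might do, not the library's code: `split_sentences`, `clean_sentence`, and the `stopwords` argument (a plain set of words rather than language codes) are hypothetical names for this example, and the flag names simply echo the Quick Start call.

```python
import re
import unicodedata

def split_sentences(text):
    """Segment on sentence-ending punctuation followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]

def clean_sentence(sentence, stopwords=frozenset(), lowercase=True,
                   rmv_accents=True, rmv_special_chars=True, rmv_numbers=True):
    """Apply the optional cleaning steps to a single sentence."""
    if lowercase:
        sentence = sentence.lower()
    if rmv_accents:
        # Decompose accented characters, then drop the combining marks.
        sentence = unicodedata.normalize("NFKD", sentence)
        sentence = "".join(ch for ch in sentence if not unicodedata.combining(ch))
    if rmv_numbers:
        sentence = re.sub(r"\d+", " ", sentence)
    if rmv_special_chars:
        # Keep only word characters and whitespace.
        sentence = re.sub(r"[^\w\s]", " ", sentence)
    return " ".join(w for w in sentence.split() if w.lower() not in stopwords)

text = "The café opened in 2019! It's very popular."
stopwords = {"the", "in", "it", "s", "very"}  # tiny illustrative stopword set
cleaned = [clean_sentence(s, stopwords) for s in split_sentences(text)]
print(cleaned)  # ['cafe opened', 'popular']
```

Segmenting before cleaning matters in this sketch: once special characters are removed, the punctuation needed to find sentence boundaries is gone.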

Evaluation Metrics

Classic Metrics (reference-based, lexical)

| Metric | Method | Returns |
| --- | --- | --- |
| ROUGE-N | `rouge_N(generated, reference, n=1)` | F1, Precision, Recall |
| ROUGE-L | `rouge_L(generated, reference)` | F1, Precision, Recall |
| ROUGE-S | `rouge_S(generated, reference, skip_distance=4)` | F1, Precision, Recall |
| BLEU | `bleu(generated, reference, n=4)` | Score |
| METEOR | `meteor(generated, reference)` | Score |
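
For readers who want to see what the ROUGE numbers measure, the sketch below computes ROUGE-N from clipped n-gram overlap and ROUGE-L from the longest common subsequence, returning values in the same F1, Precision, Recall order as the table. It assumes whitespace tokenisation and lowercasing, which may differ from the library's own tokenisation; `rouge_n`, `rouge_l`, and the helpers are names made up for this example.

```python
from collections import Counter

def _f1_precision_recall(overlap, n_generated, n_reference):
    """Turn an overlap count into (F1, precision, recall)."""
    precision = overlap / n_generated if n_generated else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return f1, precision, recall

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(generated, reference, n=1):
    gen = ngrams(generated.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((gen & ref).values())  # clipped n-gram matches
    return _f1_precision_recall(overlap, sum(gen.values()), sum(ref.values()))

def rouge_l(generated, reference):
    gen = generated.lower().split()
    ref = reference.lower().split()
    # Dynamic-programming length of the longest common subsequence.
    prev = [0] * (len(ref) + 1)
    for tok in gen:
        curr = [0]
        for j, ref_tok in enumerate(ref, start=1):
            curr.append(prev[j - 1] + 1 if tok == ref_tok
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return _f1_precision_recall(prev[-1], len(gen), len(ref))

generated = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(generated, reference, n=1))
print(rouge_n(generated, reference, n=2))
print(rouge_l(generated, reference))
```

In this pair, five of six unigrams match but only three of five bigrams do, so the ROUGE-2 score is noticeably lower than ROUGE-1.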

Semantic Metric (reference-based)

| Metric | Method | Returns | Notes |
| --- | --- | --- | --- |
| BERTScore | `bert_score(generated, reference, model_type='roberta-large')` | F1, Precision, Recall | Requires `pip install bert-score`. Captures paraphrasing that ROUGE misses by comparing contextualised token embeddings. |
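
The matching step behind BERTScore can be shown without downloading a model. The sketch below does greedy cosine matching between token vectors: precision averages each generated token's best match in the reference, and recall does the reverse. The two-dimensional vectors are hand-made toy values standing in for contextual embeddings, and `greedy_match_score` is a name invented for this example; the real metric uses a transformer's embeddings and optional IDF weighting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_match_score(gen_vecs, ref_vecs):
    """Return (F1, precision, recall) from greedy cosine matching."""
    if not gen_vecs or not ref_vecs:
        return 0.0, 0.0, 0.0
    precision = sum(max(cosine(g, r) for r in ref_vecs) for g in gen_vecs) / len(gen_vecs)
    recall = sum(max(cosine(r, g) for g in gen_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return f1, precision, recall

# Toy embeddings (assumed values): synonyms point in nearly the same direction.
toy = {
    "car": (1.0, 0.0),
    "automobile": (0.96, 0.28),
    "fast": (0.0, 1.0),
    "quick": (0.28, 0.96),
}
generated = ["quick", "automobile"]
reference = ["fast", "car"]
f1, p, r = greedy_match_score([toy[t] for t in generated], [toy[t] for t in reference])
print(round(f1, 2), round(p, 2), round(r, 2))
```

The two token lists share no words, so any n-gram metric would score them at zero, while the embedding-based match stays high because each word has a near-synonym on the other side.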

Faithfulness / Factual Consistency Metrics (source-based, no reference needed)

These metrics check whether the summary is factually consistent with the source document, detecting hallucinations that lexical metrics cannot see.

| Metric | Method | Returns | Notes |
| --- | --- | --- | --- |
| SummaC | `summa_c(generated, nli_model='cross-encoder/nli-deberta-v3-small')` | Score ∈ [0, 1] | Self-contained NLI-based faithfulness scorer using HuggingFace transformers. No extra install needed. |
| AlignScore | `align_score(generated, model='AlignScore-base')` | Score ∈ [0, 1] | Requires `pip install pyAutoSummarizer[faithfulness]` and `python -m spacy download en_core_web_sm`. Based on Zha et al., ACL 2023. |
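
A rough picture of how a sentence-level, NLI-based consistency score can be aggregated: each summary sentence takes its best support from any source sentence, and the per-sentence values are averaged. This is a sketch under stated assumptions rather than the library's scorer. `zero_shot_consistency` is a hypothetical name, and `token_overlap_entailment` is a deliberately crude stand-in for an NLI model's entailment probability, used only so the example runs without any model download.

```python
import re

def split_sentences(text):
    """Segment on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def token_overlap_entailment(premise, hypothesis):
    """Toy stand-in for an NLI model: share of hypothesis tokens found in the premise."""
    premise_tokens = set(re.findall(r"\w+", premise.lower()))
    hypothesis_tokens = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hypothesis_tokens) / len(hypothesis_tokens)

def zero_shot_consistency(source, summary, entail_prob=token_overlap_entailment):
    """Average, over summary sentences, of the best support found in the source."""
    source_sents = split_sentences(source)
    summary_sents = split_sentences(summary)
    if not source_sents or not summary_sents:
        return 0.0
    per_sentence = [max(entail_prob(p, h) for p in source_sents) for h in summary_sents]
    return sum(per_sentence) / len(per_sentence)

source = "The bridge opened in 1932. It spans the harbour."
print(zero_shot_consistency(source, "The bridge opened in 1932."))
print(zero_shot_consistency(source, "The tunnel closed in 1950."))
```

Replacing the `entail_prob` argument with a function that calls a real NLI model keeps the aggregation unchanged, which is the point of passing it in as a parameter.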

LLM-as-Judge Metric

| Metric | Method | Returns | Notes |
| --- | --- | --- | --- |
| G-Eval | `g_eval(generated, api_key, model='gpt-4o-mini', dimensions=['coherence','consistency','fluency','relevance'])` | dict {dimension: int 1–5} | Uses an OpenAI chat model to score the summary across four quality dimensions. Based on Liu et al., 2023. Requires an OpenAI API key. |

Installation

Core install (extractive/abstractive methods + lexical/BERTScore metrics)

pip install pyAutoSummarizer

With faithfulness metrics (AlignScore)

pip install "pyAutoSummarizer[faithfulness]"
python -m spacy download en_core_web_sm

Requirements: Python ≥ 3.9

Quick Start

from pyAutoSummarizer.base import psr

text = """
Your long text goes here. It can be multiple paragraphs.
The library will pre-process it, split it into sentences,
and summarize it using any of the available methods.
"""

# Initialise — pre-processes the text
s = psr.summarization(text, stop_words=['en'], lowercase=True,
                      rmv_accents=True, rmv_special_chars=True, rmv_numbers=True)

# --- Extractive summarization ---
rank    = s.summ_text_rank()          # TextRank
summary = s.show_summary(rank, n=3)   # top-3 sentences
print(summary)

# --- Abstractive summarization ---
summary = s.summ_abst_chatgpt(api_key='YOUR_KEY', model='gpt-4o-mini')

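# --- Evaluation (classic) ---
# Reference-based metrics need a reference summary; replace this placeholder with your own
reference = "A human-written reference summary of the text above."
f1, p, r = s.rouge_N(summary, reference, n=1)
bleu_s   = s.bleu(summary, reference)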

# --- Evaluation (semantic) ---
f1, p, r = s.bert_score(summary, reference)

# --- Evaluation (faithfulness — no reference needed) ---
faith_sc = s.summa_c(summary)    # SummaC (built-in NLI)
align_sc = s.align_score(summary) # AlignScore (requires [faithfulness] extra)

# --- Evaluation (LLM-as-judge) ---
scores   = s.g_eval(summary, api_key='YOUR_KEY')
# {'coherence': 4, 'consistency': 5, 'fluency': 5, 'relevance': 4}

Colab Demos

Extractive Summarization

Abstractive Summarization

Related Projects

  • pyBibX — A Bibliometric and Scientometric Python Library Powered with Artificial Intelligence Tools
