brendaamareco/thesis

Entropy-enhanced ranking pipeline for mobile app store reviews

Research implementation for prioritizing mobile app store reviews using a weighted ranking function, Shannon Entropy, NDCG evaluation, and algorithmic bias analysis.

This repository contains the experimental pipeline developed for my thesis, "Optimizando parametros en procesamiento de comentarios de usuarios de aplicaciones moviles", and the related paper "Shannon Entropy is better Feature than Category and Sentiment in User Feedback Processing".

Overview

Mobile app stores contain large volumes of user reviews that can help developers identify bugs, feature requests, and relevant user concerns. However, these reviews are usually noisy, unstructured, and hard to prioritize manually.

This pipeline ranks app reviews according to their relevance for developers. It compares a standard weighted-function ranking based on traditional features with an entropy-enhanced ranking where Shannon Entropy replaces review length as a ranking feature.
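
As an illustration of the entropy feature, here is a minimal sketch that computes Shannon Entropy over the character distribution of a review. Working at the character level (rather than over words or tokens) is an assumption made for this example, and `shannon_entropy` is a hypothetical name, not necessarily the function used in this repository.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy, in bits per character, of `text`.

    Assumption: symbols are single characters. The pipeline may
    tokenize differently.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed symbol probabilities.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    for review in ["aaaa", "abab", "App crashes when I open the camera"]:
        print(f"{shannon_entropy(review):.3f}  {review!r}")
```

A review made of one repeated character scores 0 bits, while a review that spreads its characters more evenly scores higher, which is the sense in which entropy can stand in for review length as a signal of information content.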

What this pipeline does

  • Prepares app review datasets for ranking experiments
  • Adds Shannon Entropy as a feature extracted from review text
  • Generates weighted ranking functions using exhaustive search
  • Evaluates ranking quality with NDCG
  • Compares standard features against entropy-enhanced features
  • Detects country-based algorithmic bias using AIF360
  • Applies bias mitigation with Reweighing
  • Generates experiment outputs and statistics
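
The bias-related steps in the list above can be sketched without any library. The pipeline itself uses AIF360. The code below is only a plain-Python illustration of the reweighing idea (Kamiran and Calders), where each instance gets the weight P(group) * P(label) / P(group, label). The function names, the example countries, and the meaning of the label (1 = review placed among the top-ranked ones) are assumptions made for this example.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Instance weights that make group and label statistically independent."""
    n = len(groups)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def favorable_rate(groups, labels, group, weights=None, favorable=1):
    """Weighted share of favorable labels inside one group."""
    if weights is None:
        weights = [1.0] * len(labels)
    total = sum(w for g, w in zip(groups, weights) if g == group)
    fav = sum(
        w for g, y, w in zip(groups, labels, weights)
        if g == group and y == favorable
    )
    return fav / total


if __name__ == "__main__":
    # Hypothetical data: country of each review and whether it was top-ranked.
    countries = ["US", "US", "US", "US", "IN", "IN", "IN", "IN"]
    top_ranked = [1, 1, 1, 0, 1, 0, 0, 0]

    weights = reweighing_weights(countries, top_ranked)
    for country in ("US", "IN"):
        before = favorable_rate(countries, top_ranked, country)
        after = favorable_rate(countries, top_ranked, country, weights)
        print(country, round(before, 3), round(after, 3))
```

In this toy data the favorable rates differ between the two countries before reweighing and coincide once the weights are applied, which is the effect the mitigation step aims for.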

Research Context

The pipeline evaluates whether Shannon Entropy can improve user feedback prioritization in requirements engineering.

The experiments compare two feature sets:

Standard ranking:
Category + Sentiment + Score + Review Length

Entropy-enhanced ranking:
Category + Sentiment + Score + Shannon Entropy

The best entropy-enhanced configuration reported in the paper achieved a higher NDCG than the standard ranking, suggesting that entropy can capture useful information density in reviews while reducing dependency on heavier feature extraction steps.
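
To make the comparison concrete, the following sketch ranks reviews with a weighted function over the entropy-enhanced feature set. It assumes a linear weighted sum over features already scaled to [0, 1]. That form, the feature encodings, the weights, and the names `weighted_score` and `rank_reviews` are illustrative assumptions, not the exact function or values used in the paper.

```python
def weighted_score(features, weights):
    """Linear combination of feature values (assumed form of the ranking function)."""
    return sum(weights[name] * features[name] for name in weights)


def rank_reviews(reviews, weights):
    """Return reviews sorted from most to least relevant under `weights`."""
    return sorted(
        reviews,
        key=lambda review: weighted_score(review["features"], weights),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical reviews with features normalized to [0, 1].
    reviews = [
        {"id": "r1", "features": {"category": 1.0, "sentiment": 0.2, "score": 0.2, "entropy": 0.9}},
        {"id": "r2", "features": {"category": 0.0, "sentiment": 0.8, "score": 1.0, "entropy": 0.3}},
        {"id": "r3", "features": {"category": 1.0, "sentiment": 0.5, "score": 0.6, "entropy": 0.5}},
    ]
    # Hypothetical weights. Swapping "entropy" for "length" gives the standard set.
    weights = {"category": 0.3, "sentiment": 0.1, "score": 0.1, "entropy": 0.5}

    for review in rank_reviews(reviews, weights):
        print(review["id"], round(weighted_score(review["features"], weights), 2))
```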

Pipeline Stages

1. Preprocessing
2. Feature Extraction
3. Ranking
4. Quality Testing
5. Bias Testing
6. Statistics

Experiments

The pipeline can run four experiment modes:

1 - Weighted-function ranking with standard features
2 - Weighted-function ranking replacing Review Length with Entropy
3 - Entropy-enhanced ranking with bias evaluation
4 - Entropy-enhanced ranking with bias mitigation

Supported decimal precision values:

1.0, 0.1, 0.01, 0.001

Note: a finer precision (a smaller value, such as 0.001) significantly increases the number of weight combinations that the exhaustive search has to evaluate.
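
To give a feel for that growth, here is a sketch that enumerates weight vectors on a grid. It assumes four features whose weights are multiples of the precision and sum to 1. That constraint and the names `weight_grid` and `grid_size` are assumptions made for this example, and the actual generator in this repository may enumerate a different space.

```python
from itertools import product
from math import comb


def weight_grid(precision: float, n_features: int = 4):
    """Yield weight tuples on a grid of step `precision` that sum to 1."""
    steps = round(1 / precision)
    # Work in integer steps to avoid accumulating floating-point error.
    for head in product(range(steps + 1), repeat=n_features - 1):
        remainder = steps - sum(head)
        if remainder >= 0:
            yield tuple(k / steps for k in head + (remainder,))


def grid_size(precision: float, n_features: int = 4) -> int:
    """Number of grid points, counted in closed form (stars and bars)."""
    steps = round(1 / precision)
    return comb(steps + n_features - 1, n_features - 1)


if __name__ == "__main__":
    for precision in (1.0, 0.1, 0.01, 0.001):
        print(precision, grid_size(precision))
```

Under these assumptions the grid holds 4, 286, 176,851 and 167,668,501 weight vectors for the four supported precisions, so each extra decimal multiplies the work by roughly three orders of magnitude.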

Dataset

The experiments use Apple App Store reviews from eight countries:

Australia
Canada
Hong Kong
India
Singapore
South Africa
United Kingdom
United States

The annotated subset contains manually ranked reviews used as ground truth for NDCG evaluation.
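
For reference, the sketch below shows how NDCG can score a produced ranking against graded ground truth. It uses the linear-gain formulation (relevance divided by log2 of the position plus one). Whether the pipeline uses this or the exponential-gain variant, and how the manual ranks are turned into relevance grades, are assumptions left open here.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(ranked_relevances):
    """DCG of the given order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance grades, in the order the ranker returned the reviews.
    print(ndcg([3, 2, 1, 0]))  # ideal order
    print(ndcg([0, 1, 2, 3]))  # reversed order scores lower
```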

Requirements

This implementation was tested with:

Debian 11
Python 3.9.7
R
RStudio

Python dependencies are listed in:

requirements.txt

Install them with:

pip install -r requirements.txt

The statistics stage uses R scripts, so R/RStudio must be available in the environment.

To avoid indentation errors when editing scripts, configure your text editor with:

1 tab = 4 spaces

Running the Pipeline

From the pipeline directory:

cd pipeline
bash cli.sh

The script asks for:

experiment number
decimal precision

Experiment outputs are saved under:

pipeline/0-Data/3_experimentes_results/

Repository Structure

pipeline/
  0-Data/              datasets, intermediate data, experiment results
  1-Preprocessing/     data preparation scripts
  2-FeatureExtraction/ entropy extraction
  4-Ranking/           weighted ranking function and weight generation
  5-QualityTesting/    NDCG evaluation
  6-BiasTesting/       bias detection and mitigation
  7-Statistics/        R scripts and plots

Paper

Andres Rojas Paredes, Brenda Mareco. "Shannon Entropy is better Feature than Category and Sentiment in User Feedback Processing".
arXiv:2409.12012

Read the paper on arXiv
