Merge pull request #175 from RichardLitt/feat/small-edis

shakedzy · web-flow · commit c43641a69bc0 · 2025-12-16T11:07:38.000+02:00
Small edits for readability
diff --git a/paper.md b/paper.md
@@ -20,26 +20,26 @@ bibliography: paper.bib
 
 # Summary  
 
-Exploratory data analysis (EDA) frequently requires quantifying and visualizing associations between variables. In datasets with mixed variable types (continuous and categorical), analysts must manually choose suitable metrics (Pearson's R, correlation ratio, Cramér’s V, Theil’s U, etc.), compute them, then assemble the results and plot, often with custom logic. The **`dython`** package automates much of this workflow: it inspects variable types, computes suitable association measures, returns a clean tabular result. 
+Exploratory data analysis (EDA) frequently requires quantifying and visualizing associations between variables. In datasets with mixed variable types (continuous and categorical), analysts must manually choose suitable metrics (Pearson's R, correlation ratio, Cramér’s V, Theil’s U, etc.), compute them, then assemble the results and plot, often with custom logic. The **`dython`** package automates much of this workflow: it inspects variable types, computes suitable association measures, and returns a clean tabular result. 
 
 As `dython` was designed to be used for research, it puts an emphasis on the visual plots generated by its core methods, providing
-highly-readable and customizable visualizations of the output, treating those as a core component rather than a by-product.
+highly-readable and customizable visualizations of the output, treating these as a core component rather than a byproduct.
 
-In short, **`dython`** lowers the friction for inter-variable association analysis in mixed-type datasets and improves reproducibility of EDA workflows.
+In short, **`dython`** lowers the friction for inter-variable association analysis in mixed-type datasets and improves the reproducibility of EDA workflows.
 
 **First version release date:** February 2018.
 
 # Statement of Need  
 
 While there are many statistical and visualization libraries in Python (e.g. `pandas` [@pandas], `scipy` [@scipy], `scikit-learn` [@scikit-learn], `seaborn` [@seaborn]), they treat continuous data, categorical data and the overall visualization separately. Users often resort to custom glue code to:  
 
-1. determine which columns are categorical vs numeric,  
-2. choose an appropriate association statistic (e.g. Pearson for numeric–numeric, correlation ratio for numeric–categorical, Cramér’s V or Theil’s U for categorical–categorical),  
+1. determine which columns are categorical vs. numeric,  
+2. choose an appropriate association statistic (e.g. Pearson's R for numeric–numeric, correlation ratio for numeric–categorical, Cramér’s V or Theil’s U for categorical–categorical),  
 3. compute those pairwise,  
 4. assemble a matrix or graph,  
 5. annotate, visualize, and interpret the results.
 
-This fragmentation results in boilerplate, inconsistency, or mistake-risk, especially in exploratory settings or pipelines.  
+This fragmentation results in boilerplate, inconsistency, or risk of mistakes, especially in exploratory settings or pipelines.  
 
 **`dython`** addresses this gap by providing a unified, high-level API that:  
 
@@ -49,7 +49,7 @@ This fragmentation results in boilerplate, inconsistency, or mistake-risk, espec
  
 - **returns structured and annotated output**
   
-- **offers visualization** (heatmaps, annotation) integrated
+- **offers visualization** (heatmaps, annotation) integrations
    
 - **offers model evaluation tools** (ROC, AUC, thresholding) for classification tasks
 
@@ -64,14 +64,14 @@ Below is a summary of existing methods of `dython`, per module.
 |--------|-------------|
 | associations | Computes associations between mixed-type features. |
 | cluster_correlations | Applies clustering to reorder a correlation matrix. |
-| compute_associations | Deprecated; replaced by associations(compute_only). |
+| compute_associations | Deprecated; replaced by `associations(compute_only=True)`. |
 | conditional_entropy | Computes conditional entropy of X given Y. |
-| correlation_ratio | Computes correlation between categorical and numeric vars. |
-| cramers_v | Computes Cramer’s V between categorical variables. |
+| correlation_ratio | Computes correlation between categorical and numeric variables. |
+| cramers_v | Computes Cramér’s V between categorical variables. |
 | identify_nominal_columns | Detects nominal (categorical) columns. |
 | identify_numeric_columns | Detects numeric columns. |
 | numerical_encoding | Encodes a mixed dataset into numeric format. |
-| replot_last_associations | Re-plots the last association heatmap. |
+| replot_last_associations | Replots the last association heatmap. |
 | theils_u | Computes Theil’s U (uncertainty coefficient). |
 
 ## `model_utils`
@@ -120,7 +120,7 @@ Below is a summary of existing methods of `dython`, per module.
 ### Model evaluation
 
 * `dython.model_utils.metric_graph(y_true, y_pred, metric='roc', **kwargs)`
-    This utility helps visualize classification performance. For a given true-label array y_true and predicted scores y_pred, it can plot ROC curves, compute AUC for each class (in multiclass settings), and show threshold recommendations.
+    This utility helps visualize classification performance. For a given true-label array `y_true` and predicted scores `y_pred`, it can plot ROC curves, compute AUC for each class (in multiclass settings), and show threshold recommendations.
 
     Example:
 
@@ -146,7 +146,7 @@ Several libraries provide components somewhat overlapping `dython`’s functiona
 
 * `scikit-learn` [@scikit-learn] — mutual information, label encoding, classification metrics, but lacks seamless cross-type association matrices
 
-* `pingouin` [@pingouin] — a statistical package including correlation, effect sizes, but does not integrate categorical–categorical measures like Theil’s U or automatic visualization
+* `pingouin` [@pingouin] — a statistical package including correlation and effect sizes, but does not integrate categorical–categorical measures like Theil’s U or automatic visualization
 
 # Installation