Commit c43641a

Merge pull request #175 from RichardLitt/feat/small-edis
Small edits for readability
2 parents 111f32f + cf688b4 commit c43641a

1 file changed

Lines changed: 13 additions & 13 deletions

paper.md

@@ -20,26 +20,26 @@ bibliography: paper.bib

 # Summary

-Exploratory data analysis (EDA) frequently requires quantifying and visualizing associations between variables. In datasets with mixed variable types (continuous and categorical), analysts must manually choose suitable metrics (Pearson's R, correlation ratio, Cramér’s V, Theil’s U, etc.), compute them, then assemble the results and plot, often with custom logic. The **`dython`** package automates much of this workflow: it inspects variable types, computes suitable association measures, returns a clean tabular result.
+Exploratory data analysis (EDA) frequently requires quantifying and visualizing associations between variables. In datasets with mixed variable types (continuous and categorical), analysts must manually choose suitable metrics (Pearson's R, correlation ratio, Cramér’s V, Theil’s U, etc.), compute them, then assemble the results and plot, often with custom logic. The **`dython`** package automates much of this workflow: it inspects variable types, computes suitable association measures, and returns a clean tabular result.

 As `dython` was designed to be used for research, it puts an emphasis on the visual plots generated by its core methods, providing
-highly-readable and customizable visualizations of the output, treating those as a core component rather than a by-product.
+highly-readable and customizable visualizations of the output, treating these as a core component rather than a byproduct.

-In short, **`dython`** lowers the friction for inter-variable association analysis in mixed-type datasets and improves reproducibility of EDA workflows.
+In short, **`dython`** lowers the friction for inter-variable association analysis in mixed-type datasets and improves the reproducibility of EDA workflows.

 **First version release date:** February 2018.

 # Statement of Need

 While there are many statistical and visualization libraries in Python (e.g. `pandas` [@pandas], `scipy` [@scipy], `scikit-learn` [@scikit-learn], `seaborn` [@seaborn]), they treat continuous data, categorical data and the overall visualization separately. Users often resort to custom glue code to:

-1. determine which columns are categorical vs numeric,
-2. choose an appropriate association statistic (e.g. Pearson for numeric–numeric, correlation ratio for numeric–categorical, Cramér’s V or Theil’s U for categorical–categorical),
+1. determine which columns are categorical vs. numeric,
+2. choose an appropriate association statistic (e.g. Pearson's R for numeric–numeric, correlation ratio for numeric–categorical, Cramér’s V or Theil’s U for categorical–categorical),
 3. compute those pairwise,
 4. assemble a matrix or graph,
 5. annotate, visualize, and interpret the results.

-This fragmentation results in boilerplate, inconsistency, or mistake-risk, especially in exploratory settings or pipelines.
+This fragmentation results in boilerplate, inconsistency, or risk of mistakes, especially in exploratory settings or pipelines.

 **`dython`** addresses this gap by providing a unified, high-level API that:
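
For context on the "custom glue code" described in this hunk, steps 1 to 4 of the list can be sketched with the standard library alone. This is only an illustration of the manual workflow, not how `dython` implements it: every function name below (`is_numeric`, `association`, `association_matrix`, and the three statistic helpers) is hypothetical, the Cramér's V shown has no bias correction, and step 5 (annotation and plotting) is left out.

```python
import math
from collections import Counter, defaultdict
from statistics import mean


def is_numeric(column):
    """Step 1: treat a column as numeric when every value is an int or float."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in column)


def pearson_r(x, y):
    """Pearson's R for two numeric columns (constant columns are not handled)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_ratio(categories, values):
    """Correlation ratio: sqrt(between-group sum of squares / total sum of squares)."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    total = sum((v - grand) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0


def cramers_v(x, y):
    """Cramér's V from the chi-squared statistic, without bias correction."""
    n = len(x)
    joint, cx, cy = Counter(zip(x, y)), Counter(x), Counter(y)
    chi2 = sum(
        (joint.get((a, b), 0) - cx[a] * cy[b] / n) ** 2 / (cx[a] * cy[b] / n)
        for a in cx
        for b in cy
    )
    k = min(len(cx), len(cy)) - 1
    return math.sqrt(chi2 / (n * k)) if k else 0.0


def association(a, b):
    """Step 2: choose a statistic from the types of the two columns."""
    num_a, num_b = is_numeric(a), is_numeric(b)
    if num_a and num_b:
        return pearson_r(a, b)
    if num_a:
        return correlation_ratio(b, a)
    if num_b:
        return correlation_ratio(a, b)
    return cramers_v(a, b)


def association_matrix(table):
    """Steps 3 and 4: compute every pair and assemble a nested-dict matrix."""
    names = list(table)
    return {r: {c: association(table[r], table[c]) for c in names} for r in names}


# Made-up toy data: two numeric and two categorical columns.
data = {
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
    "group": ["a", "a", "b", "b"],
    "label": ["x", "x", "y", "y"],
}
matrix = association_matrix(data)
```

Even this toy version needs a type check, three statistics, and a dispatch function before any plotting happens, which is the boilerplate the paragraph refers to.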

@@ -49,7 +49,7 @@ This fragmentation results in boilerplate, inconsistency, or mistake-risk, espec

 - **returns structured and annotated output**

-- **offers visualization** (heatmaps, annotation) integrated
+- **offers visualization** (heatmaps, annotation) integrations

 - **offers model evaluation tools** (ROC, AUC, thresholding) for classification tasks

@@ -64,14 +64,14 @@ Below is a summary of existing methods of `dython`, per module.
 |--------|-------------|
 | associations | Computes associations between mixed-type features. |
 | cluster_correlations | Applies clustering to reorder a correlation matrix. |
-| compute_associations | Deprecated; replaced by associations(compute_only). |
+| compute_associations | Deprecated; replaced by `associations(compute_only=True)`. |
 | conditional_entropy | Computes conditional entropy of X given Y. |
-| correlation_ratio | Computes correlation between categorical and numeric vars. |
-| cramers_v | Computes Cramer’s V between categorical variables. |
+| correlation_ratio | Computes correlation between categorical and numeric variables. |
+| cramers_v | Computes Cramér’s V between categorical variables. |
 | identify_nominal_columns | Detects nominal (categorical) columns. |
 | identify_numeric_columns | Detects numeric columns. |
 | numerical_encoding | Encodes a mixed dataset into numeric format. |
-| replot_last_associations | Re-plots the last association heatmap. |
+| replot_last_associations | Replots the last association heatmap. |
 | theils_u | Computes Theil’s U (uncertainty coefficient). |

 ## `model_utils`
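
The `conditional_entropy` and `theils_u` rows in the table of this hunk are closely related: Theil's U, the uncertainty coefficient, is defined as U(X|Y) = (H(X) - H(X|Y)) / H(X). The sketch below works through that textbook definition with the standard library. It is an assumption-labeled illustration, not `dython`'s code; the function names merely mirror the table rows, and the signatures and the handling of the zero-entropy case are choices made for this example.

```python
import math
from collections import Counter


def entropy(x):
    """Shannon entropy H(X) in nats."""
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in Counter(x).values())


def conditional_entropy(x, y):
    """H(X|Y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))."""
    n = len(x)
    joint = Counter(zip(x, y))
    y_counts = Counter(y)
    # p(x, y) / p(y) simplifies to count(x, y) / count(y).
    return -sum((c / n) * math.log(c / y_counts[b]) for (_, b), c in joint.items())


def theils_u(x, y):
    """U(X|Y) = (H(X) - H(X|Y)) / H(X); taken as 1.0 here when H(X) is zero."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    return (h_x - conditional_entropy(x, y)) / h_x


# Made-up toy data: y determines x completely, but x only halves the options for y.
x = ["a", "a", "b", "b"]
y = ["p", "q", "r", "s"]
u_x_given_y = theils_u(x, y)
u_y_given_x = theils_u(y, x)
```

On this toy data U(x|y) is 1 and U(y|x) is 0.5, up to floating-point rounding. Unlike Cramér's V, the measure is asymmetric, so the argument order matters.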
@@ -120,7 +120,7 @@ Below is a summary of existing methods of `dython`, per module.
 ### Model evaluation

 * `dython.model_utils.metric_graph(y_true, y_pred, metric='roc', **kwargs)`
-This utility helps visualize classification performance. For a given true-label array y_true and predicted scores y_pred, it can plot ROC curves, compute AUC for each class (in multiclass settings), and show threshold recommendations.
+This utility helps visualize classification performance. For a given true-label array `y_true` and predicted scores `y_pred`, it can plot ROC curves, compute AUC for each class (in multiclass settings), and show threshold recommendations.

 Example:

@@ -146,7 +146,7 @@ Several libraries provide components somewhat overlapping `dython`’s functiona

 * `scikit-learn` [@scikit-learn] — mutual information, label encoding, classification metrics, but lacks seamless cross-type association matrices

-* `pingouin` [@pingouin] — a statistical package including correlation, effect sizes, but does not integrate categorical–categorical measures like Theil’s U or automatic visualization
+* `pingouin` [@pingouin] — a statistical package including correlation and effect sizes, but does not integrate categorical–categorical measures like Theil’s U or automatic visualization

 # Installation
