paper.md: 13 additions & 13 deletions
@@ -20,26 +20,26 @@ bibliography: paper.bib
# Summary
-Exploratory data analysis (EDA) frequently requires quantifying and visualizing associations between variables. In datasets with mixed variable types (continuous and categorical), analysts must manually choose suitable metrics (Pearson's R, correlation ratio, Cramér’s V, Theil’s U, etc.), compute them, then assemble the results and plot, often with custom logic. The **`dython`** package automates much of this workflow: it inspects variable types, computes suitable association measures, returns a clean tabular result.
+Exploratory data analysis (EDA) frequently requires quantifying and visualizing associations between variables. In datasets with mixed variable types (continuous and categorical), analysts must manually choose suitable metrics (Pearson's R, correlation ratio, Cramér’s V, Theil’s U, etc.), compute them, then assemble the results and plot, often with custom logic. The **`dython`** package automates much of this workflow: it inspects variable types, computes suitable association measures, and returns a clean tabular result.
As `dython` was designed to be used for research, it puts an emphasis on the visual plots generated by its core methods, providing
-highly-readable and customizable visualizations of the output, treating those as a core component rather than a by-product.
+highly-readable and customizable visualizations of the output, treating these as a core component rather than a byproduct.
-In short, **`dython`** lowers the friction for inter-variable association analysis in mixed-type datasets and improves reproducibility of EDA workflows.
+In short, **`dython`** lowers the friction for inter-variable association analysis in mixed-type datasets and improves the reproducibility of EDA workflows.
**First version release date:** February 2018.
# Statement of Need
While there are many statistical and visualization libraries in Python (e.g. `pandas`[@pandas], `scipy`[@scipy], `scikit-learn`[@scikit-learn], `seaborn`[@seaborn]), they treat continuous data, categorical data and the overall visualization separately. Users often resort to custom glue code to:
-1. determine which columns are categorical vs numeric,
-2. choose an appropriate association statistic (e.g. Pearson for numeric–numeric, correlation ratio for numeric–categorical, Cramér’s V or Theil’s U for categorical–categorical),
+1. determine which columns are categorical vs. numeric,
+2. choose an appropriate association statistic (e.g. Pearson's R for numeric–numeric, correlation ratio for numeric–categorical, Cramér’s V or Theil’s U for categorical–categorical),
3. compute those pairwise,
4. assemble a matrix or graph,
5. annotate, visualize, and interpret the results.
-This fragmentation results in boilerplate, inconsistency, or mistake-risk, especially in exploratory settings or pipelines.
+This fragmentation results in boilerplate, inconsistency, or risk of mistakes, especially in exploratory settings or pipelines.
**`dython`** addresses this gap by providing a unified, high-level API that:
@@ -49,7 +49,7 @@ This fragmentation results in boilerplate, inconsistency, or mistake-risk, espec
-This utility helps visualize classification performance. For a given true-label array y_true and predicted scores y_pred, it can plot ROC curves, compute AUC for each class (in multiclass settings), and show threshold recommendations.
+This utility helps visualize classification performance. For a given true-label array `y_true` and predicted scores `y_pred`, it can plot ROC curves, compute AUC for each class (in multiclass settings), and show threshold recommendations.
Example:
@@ -146,7 +146,7 @@ Several libraries provide components somewhat overlapping `dython`’s functiona
*`scikit-learn` [@scikit-learn] — mutual information, label encoding, classification metrics, but lacks seamless cross-type association matrices
-*`pingouin` [@pingouin] — a statistical package including correlation, effect sizes, but does not integrate categorical–categorical measures like Theil’s U or automatic visualization
+*`pingouin` [@pingouin] — a statistical package including correlation and effect sizes, but does not integrate categorical–categorical measures like Theil’s U or automatic visualization
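For readers of this diff, the manual workflow listed under "Statement of Need" (detect column types, pick a statistic per pair, compute pairwise, assemble a matrix) can be sketched roughly as below. This is a minimal standard-library illustration of the kind of glue code the paper describes, not `dython`'s implementation. The helper names (`is_numeric`, `association`, `association_matrix`) and the toy data are hypothetical. Cramér's V is computed without bias correction, and Theil's U is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def is_numeric(column):
    """Step 1: treat a column as numeric if every value is an int or float."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in column)


def pearson_r(x, y):
    """Pearson's R for two numeric columns (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_ratio(categories, values):
    """Correlation ratio (eta): sqrt(between-group SS / total SS)."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    mean = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    total = sum((v - mean) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0


def cramers_v(x, y):
    """Cramér's V from the chi-squared statistic (no bias correction)."""
    n = len(x)
    joint, cx, cy = Counter(zip(x, y)), Counter(x), Counter(y)
    chi2 = 0.0
    for a in cx:
        for b in cy:
            expected = cx[a] * cy[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(cx), len(cy)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def association(col_a, col_b):
    """Step 2: choose the statistic from the pair's variable types."""
    num_a, num_b = is_numeric(col_a), is_numeric(col_b)
    if num_a and num_b:
        return pearson_r(col_a, col_b)
    if num_a:
        return correlation_ratio(col_b, col_a)
    if num_b:
        return correlation_ratio(col_a, col_b)
    return cramers_v(col_a, col_b)


def association_matrix(table):
    """Steps 3-4: compute every pair and assemble a nested-dict matrix."""
    names = list(table)
    return {a: {b: association(table[a], table[b]) for b in names} for a in names}


# Hypothetical toy data: two numeric and two categorical columns.
data = {
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
    "group": ["a", "a", "b", "b"],
    "label": ["x", "x", "y", "y"],
}
matrix = association_matrix(data)
```

Even this small sketch needs three statistics and a type-based dispatch before any plotting happens, which is the boilerplate the paper argues a unified API removes.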
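The ROC utility described in the hunk at line 123 takes `y_true` and `y_pred`. As a rough illustration of the underlying computation only, the sketch below builds ROC points and a trapezoidal AUC for the binary case using the standard library. It is not `dython`'s code: the function names `roc_points` and `auc` are hypothetical, plotting and threshold recommendations are left out, and a multiclass version would repeat this one-vs-rest per class.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists from sweeping a threshold over the scores.

    Assumes binary labels (0/1) with at least one positive and one negative.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    # Visit samples from the highest score to the lowest.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Hypothetical toy inputs, named after the arguments mentioned in the paper.
y_true = [0, 0, 1, 1]
y_pred = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_pred)
area = auc(fpr, tpr)
```

Grouping tied scores before emitting a point keeps the curve independent of the order in which tied samples happen to be sorted.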