Switch scatter regression to OLS with statistics by solresol · Pull Request #15 · language-in-genetics/language-analysis

solresol · 2025-11-04T05:16:51Z

Summary

replace the Theil-Sen fit on the journal and year scatter charts with an ordinary least squares regression
compute and display the regression formula, R², and p-value beneath each scatter plot
add numerical helpers for the t-distribution so p-values can be derived directly in the dashboard

Testing

python -m compileall extractor/generate_dashboard.py

https://chatgpt.com/codex/tasks/task_e_69095661587483258cee0fada0a71fcd

claude · 2025-11-04T05:18:20Z

Code Review: Switch scatter regression to OLS with statistics

Thank you for this PR! Overall, this is a solid enhancement that replaces Theil-Sen regression with OLS and adds valuable statistical context to the dashboard. Below is my detailed feedback:

Positive Aspects

✅ Good statistical implementation: The OLS implementation (extractor/generate_dashboard.py:990-1065) correctly calculates slope, intercept, R², and p-value using standard formulas.

✅ Comprehensive p-value calculation: The implementation of the Student's t-distribution CDF with supporting functions (logGamma, betacf, regularizedIncompleteBeta) is mathematically sound and follows established numerical methods.

✅ Robust error handling: Good defensive programming with checks for edge cases (n < 2, sxx == 0, non-finite values, etc.).

✅ User-friendly display: The regression statistics are well-formatted and presented clearly below each scatter plot.

Issues & Concerns

1. Missing Justification for Regression Method Change ⚠️

The PR switches from Theil-Sen (robust to outliers) to OLS (sensitive to outliers) without explaining why. For research data that may contain outliers or non-normal distributions, this could be problematic:

Recommendation: Add a comment in the code or PR description explaining why OLS is preferred over Theil-Sen for this specific dataset.
Alternative: Consider providing both methods or adding outlier detection.

2. No Test Coverage 🔴

The PR adds ~240 lines of complex mathematical code with no automated tests. This is concerning for several reasons:

The numerical functions (logGamma, betacf, etc.) are complex and error-prone
Edge cases (e.g., df=0, n=1, all points on a line) need verification
p-value calculations can have subtle bugs that aren't immediately obvious

Recommendations:

# Example test cases that should be added:
# 1. Test OLS with known datasets (e.g., y = 2x + 3)
# 2. Test edge cases (n=0, n=1, n=2, vertical/horizontal lines)
# 3. Test p-value calculation against scipy.stats.linregress
# 4. Test formatting functions with extreme values

3. Code Organization 📋

All JavaScript code is embedded in a Python string (extractor/generate_dashboard.py:891-1250+). While this works, it makes the code harder to maintain:

Suggestion: Consider extracting the statistical functions into a separate JS file that gets included in the generated HTML. This would:
- Make the code more testable
- Improve code organization
- Enable JS linting/testing tools

4. Potential Numerical Stability Issues ⚠️

Line 961: Computing Math.exp(a * Math.log(x) + b * Math.log(1 - x) - lnBeta) could overflow/underflow for extreme values of a, b, or x.
Line 1041: varianceEstimate / sxx could be near-zero, making standardError susceptible to floating-point precision issues.

Recommendation: Add overflow/underflow checks or clamp values to prevent NaN/Infinity propagation.

5. Hard-coded Constants 📌

Line 891-900: Magic numbers in the logGamma coefficients have no source attribution.
Line 917-919: MAX_ITER=200, EPS=1e-12, FPMIN=1e-30 have no explanation.

Recommendation: Add comments citing the source (e.g., "Lanczos approximation coefficients from Numerical Recipes") and explaining the choice of tolerances.

Minor Issues

6. Inconsistent Formatting

Line 1096: The sign formatting could produce " - -" for very negative intercepts (if interceptText includes a sign). Use Math.abs(intercept) consistently.
Line 1082: Magic number 1e-4 should be a named constant.

7. Performance

Line 1013-1033: Two separate loops iterate over validPoints. These could be combined into a single loop for efficiency.

// Current: two passes
for (const point of validPoints) { /* calculate sxx, sxy */ }
for (const point of validPoints) { /* calculate sse, sst */ }

// Better: single pass (requires two-pass algorithm for numerical stability, but residuals can use cached predictions)

8. Documentation

The PR description mentions "Testing: python -m compileall extractor/generate_dashboard.py", but this only checks for syntax errors, not correctness. The testing section should clarify this only validates syntax, not behavior.

Security Concerns

✅ No XSS vulnerabilities detected (values are properly escaped in template literals)
✅ No SQL injection risks (this is frontend-only code)
✅ No external dependencies introduced

Recommendations Summary

High Priority:

Add unit tests for the statistical functions
Document why OLS was chosen over Theil-Sen
Add source citations for numerical algorithms

Medium Priority:
4. Extract JavaScript into a separate file for better maintainability
5. Add overflow/underflow guards in numerical calculations
6. Combine loops in OLS calculation for efficiency

Low Priority:
7. Use named constants instead of magic numbers
8. Improve code comments explaining the math

Suggested Next Steps

Add a test suite (even basic tests are better than none)
Update the PR description with the rationale for switching regression methods
Consider adding a brief comment in the code explaining the statistical approach

Despite these concerns, the core implementation is solid and the feature adds value to the dashboard. With tests and documentation improvements, this would be production-ready.

Let me know if you'd like help implementing any of these suggestions!

Review generated with Claude Code

Switch scatter regression to OLS with statistics

cb4dfaf

solresol added the codex label Nov 4, 2025 — with ChatGPT Codex Connector

solresol merged commit 09b9c6e into main Nov 4, 2025
1 check passed

solresol deleted the codex/switch-to-ordinary-least-squares-regression-o6oxc8 branch November 4, 2025 05:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch scatter regression to OLS with statistics#15

Switch scatter regression to OLS with statistics#15
solresol merged 1 commit into
mainfrom
codex/switch-to-ordinary-least-squares-regression-o6oxc8

solresol commented Nov 4, 2025

Uh oh!

claude Bot commented Nov 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

solresol commented Nov 4, 2025

Summary

Testing

Uh oh!

claude Bot commented Nov 4, 2025

Code Review: Switch scatter regression to OLS with statistics

Positive Aspects

Issues & Concerns

1. Missing Justification for Regression Method Change ⚠️

2. No Test Coverage 🔴

3. Code Organization 📋

4. Potential Numerical Stability Issues ⚠️

5. Hard-coded Constants 📌

Minor Issues

6. Inconsistent Formatting

7. Performance

8. Documentation

Security Concerns

Recommendations Summary

Suggested Next Steps

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant