Skip to content

Switch scatter regression to OLS with statistics#15

Merged
solresol merged 1 commit into
mainfrom
codex/switch-to-ordinary-least-squares-regression-o6oxc8
Nov 4, 2025
Merged

Switch scatter regression to OLS with statistics#15
solresol merged 1 commit into
mainfrom
codex/switch-to-ordinary-least-squares-regression-o6oxc8

Conversation

@solresol

@solresol solresol commented Nov 4, 2025

Copy link
Copy Markdown
Collaborator

Summary

  • replace the Theil-Sen fit on the journal and year scatter charts with an ordinary least squares regression
  • compute and display the regression formula, R², and p-value beneath each scatter plot
  • add numerical helpers for the t-distribution so p-values can be derived directly in the dashboard

Testing

  • python -m compileall extractor/generate_dashboard.py

https://chatgpt.com/codex/tasks/task_e_69095661587483258cee0fada0a71fcd

@claude

claude Bot commented Nov 4, 2025

Copy link
Copy Markdown

Code Review: Switch scatter regression to OLS with statistics

Thank you for this PR! Overall, this is a solid enhancement that replaces Theil-Sen regression with OLS and adds valuable statistical context to the dashboard. Below is my detailed feedback:


Positive Aspects

Good statistical implementation: The OLS implementation (extractor/generate_dashboard.py:990-1065) correctly calculates slope, intercept, R², and p-value using standard formulas.

Comprehensive p-value calculation: The implementation of the Student's t-distribution CDF with supporting functions (logGamma, betacf, regularizedIncompleteBeta) is mathematically sound and follows established numerical methods.

Robust error handling: Good defensive programming with checks for edge cases (n < 2, sxx == 0, non-finite values, etc.).

User-friendly display: The regression statistics are well-formatted and presented clearly below each scatter plot.


Issues & Concerns

1. Missing Justification for Regression Method Change ⚠️

The PR switches from Theil-Sen (robust to outliers) to OLS (sensitive to outliers) without explaining why. For research data that may contain outliers or non-normal distributions, this could be problematic:

  • Recommendation: Add a comment in the code or PR description explaining why OLS is preferred over Theil-Sen for this specific dataset.
  • Alternative: Consider providing both methods or adding outlier detection.

2. No Test Coverage 🔴

The PR adds ~240 lines of complex mathematical code with no automated tests. This is concerning for several reasons:

  • The numerical functions (logGamma, betacf, etc.) are complex and error-prone
  • Edge cases (e.g., df=0, n=1, all points on a line) need verification
  • p-value calculations can have subtle bugs that aren't immediately obvious

Recommendations:

# Example test cases that should be added:
# 1. Test OLS with known datasets (e.g., y = 2x + 3)
# 2. Test edge cases (n=0, n=1, n=2, vertical/horizontal lines)
# 3. Test p-value calculation against scipy.stats.linregress
# 4. Test formatting functions with extreme values

3. Code Organization 📋

All JavaScript code is embedded in a Python string (extractor/generate_dashboard.py:891-1250+). While this works, it makes the code harder to maintain:

  • Suggestion: Consider extracting the statistical functions into a separate JS file that gets included in the generated HTML. This would:
    • Make the code more testable
    • Improve code organization
    • Enable JS linting/testing tools

4. Potential Numerical Stability Issues ⚠️

  • Line 961: Computing Math.exp(a * Math.log(x) + b * Math.log(1 - x) - lnBeta) could overflow/underflow for extreme values of a, b, or x.
  • Line 1041: varianceEstimate / sxx could be near-zero, making standardError susceptible to floating-point precision issues.

Recommendation: Add overflow/underflow checks or clamp values to prevent NaN/Infinity propagation.

5. Hard-coded Constants 📌

  • Line 891-900: Magic numbers in the logGamma coefficients have no source attribution.
  • Line 917-919: MAX_ITER=200, EPS=1e-12, FPMIN=1e-30 have no explanation.

Recommendation: Add comments citing the source (e.g., "Lanczos approximation coefficients from Numerical Recipes") and explaining the choice of tolerances.


Minor Issues

6. Inconsistent Formatting

  • Line 1096: The sign formatting could produce " - -" for very negative intercepts (if interceptText includes a sign). Use Math.abs(intercept) consistently.
  • Line 1082: Magic number 1e-4 should be a named constant.

7. Performance

  • Line 1013-1033: Two separate loops iterate over validPoints. These could be combined into a single loop for efficiency.
// Current: two passes
for (const point of validPoints) { /* calculate sxx, sxy */ }
for (const point of validPoints) { /* calculate sse, sst */ }

// Better: single pass (requires two-pass algorithm for numerical stability, but residuals can use cached predictions)

8. Documentation

The PR description mentions "Testing: python -m compileall extractor/generate_dashboard.py", but this only checks for syntax errors, not correctness. The testing section should clarify this only validates syntax, not behavior.


Security Concerns

✅ No XSS vulnerabilities detected (values are properly escaped in template literals)
✅ No SQL injection risks (this is frontend-only code)
✅ No external dependencies introduced


Recommendations Summary

High Priority:

  1. Add unit tests for the statistical functions
  2. Document why OLS was chosen over Theil-Sen
  3. Add source citations for numerical algorithms

Medium Priority:
4. Extract JavaScript into a separate file for better maintainability
5. Add overflow/underflow guards in numerical calculations
6. Combine loops in OLS calculation for efficiency

Low Priority:
7. Use named constants instead of magic numbers
8. Improve code comments explaining the math


Suggested Next Steps

  1. Add a test suite (even basic tests are better than none)
  2. Update the PR description with the rationale for switching regression methods
  3. Consider adding a brief comment in the code explaining the statistical approach

Despite these concerns, the core implementation is solid and the feature adds value to the dashboard. With tests and documentation improvements, this would be production-ready.

Let me know if you'd like help implementing any of these suggestions!


Review generated with Claude Code

@solresol solresol merged commit 09b9c6e into main Nov 4, 2025
1 check passed
@solresol solresol deleted the codex/switch-to-ordinary-least-squares-regression-o6oxc8 branch November 4, 2025 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant