ASA Connect


Can We Trust The Statistical Libraries In Programming Languages?

  • 1.  Can We Trust The Statistical Libraries In Programming Languages?

    Posted 08-05-2025 15:18

    Note for the Statistics Community: Erroneous False Positive Rates of Brown-Forsythe Trimmed Mean Test in Python But Not in R

    This note presents findings from a simulation study evaluating the false positive rates (Type I error rates) of the Brown-Forsythe test for equality of variances under normality, comparing implementations in Python's scipy.stats and R's car package.

    Simulation Description: 10,000 pairs of samples (n1 = n2 = 20) were generated from normal distributions with equal variances (variance = 4). For each pair, the Brown-Forsythe statistic was calculated twice: once centered at the median and once centered at the 10% trimmed mean. For each centering method and software library, the proportion of tests with p-values less than or equal to 0.05 was then calculated.
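
For readers who want an independent reference point, the statistic described above can be sketched directly from its textbook form: take absolute deviations of every observation from its group center, then run a one-way ANOVA F test on those deviations. This is a minimal sketch, not the code of either library; the function names `trimmed_mean` and `brown_forsythe` are hypothetical, and the trimming rule (drop floor(proportion × n) points from each tail) is an assumption.

```python
import numpy as np
from scipy import stats


def trimmed_mean(x, proportion):
    """Mean after dropping floor(proportion * n) points from each tail (assumed rule)."""
    x = np.sort(np.asarray(x, dtype=float))
    cut = int(proportion * x.size)
    return x[cut:x.size - cut].mean()


def brown_forsythe(*samples, center="median", proportion=0.1):
    """Return (W, p) for the Levene-type test on absolute deviations."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    if center == "median":
        centers = [np.median(s) for s in samples]
    elif center == "trimmed":
        centers = [trimmed_mean(s, proportion) for s in samples]
    else:
        raise ValueError("center must be 'median' or 'trimmed'")

    # Absolute deviations use every observation; only the center is trimmed.
    z = [np.abs(s - c) for s, c in zip(samples, centers)]
    n_i = np.array([zi.size for zi in z])
    k, n_total = len(z), n_i.sum()
    z_bar_i = np.array([zi.mean() for zi in z])
    z_bar = np.concatenate(z).mean()

    # One-way ANOVA F statistic computed on the deviations.
    between = np.sum(n_i * (z_bar_i - z_bar) ** 2)
    within = sum(np.sum((zi - zb) ** 2) for zi, zb in zip(z, z_bar_i))
    w = (n_total - k) / (k - 1) * between / within
    p = stats.f.sf(w, k - 1, n_total - k)
    return w, p
```

Calling `brown_forsythe(sample1, sample2, center="trimmed")` on the same data passed to a library routine gives one way to see whether the two agree on the statistic itself, before any simulation is run.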

    Software Versions:

    • Python: version 3.11.13 (Jun 4 2025); SciPy version 1.16.0
    • R: version 4.5.1 (2025-06-13); car package version 3.1.3

    Function Calls:

    • Python (using scipy.stats.levene):
      • Median-centered: scipy.stats.levene(sample1, sample2, center='median')
      • 10% trimmed mean centered: scipy.stats.levene(sample1, sample2, center='trimmed', proportiontocut=0.1)
    • R (using car::leveneTest):
      • Median-centered: car::leveneTest(c(sample1, sample2), factor(rep(1:2, each = 20)), center = "median")
      • 10% trimmed mean centered: car::leveneTest(c(sample1, sample2), factor(rep(1:2, each = 20)), center = mean, trim.alpha = 0.1)
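
The Python side of the simulation can be reproduced along the following lines. This is a sketch under stated assumptions rather than the original script: the seed, the random number generator, and the common mean of 0 are not given in the post, and `simulate_fpr` is a hypothetical helper name. Only the two `scipy.stats.levene` calls are taken from the list above.

```python
import numpy as np
from scipy import stats


def simulate_fpr(n_reps=10_000, n=20, sd=2.0, alpha=0.05, seed=12345):
    """Estimate rejection rates of scipy.stats.levene when variances are equal.

    sd=2.0 corresponds to variance 4; seed and mean 0 are assumptions.
    """
    rng = np.random.default_rng(seed)
    rejections = {"median": 0, "trimmed": 0}
    for _ in range(n_reps):
        sample1 = rng.normal(loc=0.0, scale=sd, size=n)
        sample2 = rng.normal(loc=0.0, scale=sd, size=n)
        p_median = stats.levene(sample1, sample2, center="median").pvalue
        p_trimmed = stats.levene(
            sample1, sample2, center="trimmed", proportiontocut=0.1
        ).pvalue
        # Count a false positive whenever p <= alpha under the true null.
        rejections["median"] += int(p_median <= alpha)
        rejections["trimmed"] += int(p_trimmed <= alpha)
    return {name: count / n_reps for name, count in rejections.items()}


if __name__ == "__main__":
    for name, rate in simulate_fpr().items():
        print(f"{name:8s} estimated FPR = {rate:.4f}")
```

The exact rates will depend on the seed and on the installed SciPy version, so they should not be expected to match the reported figures digit for digit.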

    Comparative False Positive Rates (FPRs):

    Based on the simulation, the approximate false positive rates were:

    Centering Method   Language   Test Description      FPR (nominal 0.05)
    Trimmed Mean       Python     Brown-Forsythe W₁₀    0.1160
    Trimmed Mean       R          Brown-Forsythe W₁₀    0.0535
    Median             Python     Brown-Forsythe W₅₀    0.0410
    Median             R          Brown-Forsythe W₅₀    0.0396

    Importance:

    This comparison matters to practitioners who rely on these statistical software libraries. Under these conditions, both implementations of the median-centered Brown-Forsythe test show similar FPRs (0.0410 in Python, 0.0396 in R), close to but slightly below the nominal alpha level of 0.05. For the trimmed-mean-centered test, however, the two implementations differ markedly: R's FPR (0.0535) is close to nominal, while Python's (0.1160) is more than twice the nominal level. Test performance can therefore vary with the software and the specific test variant used, even when ostensibly performing the same statistical procedure. Further investigation into the implementations is warranted to correct this discrepancy.

    The more important question is what this implies for large, widely used software libraries. In this age of AI, it is easy to pose a question and receive Python code that calls such libraries. Does the statistical community have a role in quality control?
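
To judge whether the reported rates differ from nominal by more than simulation noise, one can compare each against the Monte Carlo standard error of a proportion estimated from 10,000 replications. The sketch below assumes independent replications and a true rejection rate of 0.05; the helper name `mc_standard_error` is hypothetical, and the four rates are copied from the table in this post.

```python
import math

N_REPS = 10_000
NOMINAL = 0.05

# False positive rates as reported in the post.
reported = {
    ("Trimmed Mean", "Python"): 0.1160,
    ("Trimmed Mean", "R"): 0.0535,
    ("Median", "Python"): 0.0410,
    ("Median", "R"): 0.0396,
}


def mc_standard_error(p, n_reps):
    """Binomial standard error of an estimated proportion."""
    return math.sqrt(p * (1.0 - p) / n_reps)


se = mc_standard_error(NOMINAL, N_REPS)  # roughly 0.0022

# Distance of each reported rate from nominal, in standard-error units.
z_scores = {key: (fpr - NOMINAL) / se for key, fpr in reported.items()}

for (method, language), z in z_scores.items():
    print(f"{method:13s} {language:7s} z = {z:+.1f}")
```

Under these assumptions the Python trimmed-mean rate lies about 30 standard errors above nominal, far outside what sampling variation in the simulation would explain, while the R trimmed-mean rate is within about two standard errors.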



    ------------------------------
    Alan B. Forsythe
    Forsythe and Bear LLC
    ------------------------------


  • 2.  RE: Can We Trust The Statistical Libraries In Programming Languages?

    Posted 08-06-2025 23:56
    Hi, 
     
    If you believe you have found a bug in a software package, it's best practice to file a bug report with your reproducible example. 
     
    https://projects.scipy.org/bug-report.html
     
    Hope this helps, 
    Rachel


    ------------------------------
    Rachel Hunter-Merrill
    ------------------------------