ASA Connect

Can We Trust The Statistical Libraries In Programming Languages?

1. Can We Trust The Statistical Libraries In Programming Languages?

Alan Forsythe

Posted 08-05-2025 15:18

Note for the Statistics Community: Erroneous False Positive Rates of Brown-Forsythe Trimmed Mean Test in Python But Not in R

This note presents findings from a simulation study evaluating the false positive rates (Type I error rates) of the Brown-Forsythe test for equality of variances under normality, comparing implementations in Python's scipy.stats and R's car package.

Simulation Description: 10,000 pairs of samples (n1=n2=20) were generated from normal distributions with equal variances (variance=4). For each pair, the Brown-Forsythe statistic was calculated twice: once centered at the median and once centered at the 10% trimmed mean. For each centering method and software library, the false positive rate was then estimated as the proportion of tests with a p-value less than or equal to 0.05.
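For readers who want to try the Python side themselves, the simulation described above could be sketched roughly as follows. This is only an illustrative sketch, not the code used to produce the figures in this note: the function name simulate_fpr, the zero mean, and the random seed are assumptions, since the description above does not specify them.

```python
import numpy as np
from scipy import stats


def simulate_fpr(n_reps=10_000, n=20, variance=4.0, alpha=0.05, seed=0):
    """Estimate false positive rates of scipy.stats.levene under equal variances.

    The mean (0.0) and the seed are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    sd = np.sqrt(variance)
    rejections = {"median": 0, "trimmed": 0}
    for _ in range(n_reps):
        # Both samples come from the same normal distribution, so H0 is true.
        sample1 = rng.normal(loc=0.0, scale=sd, size=n)
        sample2 = rng.normal(loc=0.0, scale=sd, size=n)
        _, p_median = stats.levene(sample1, sample2, center="median")
        _, p_trimmed = stats.levene(
            sample1, sample2, center="trimmed", proportiontocut=0.1
        )
        # Count a rejection when the p-value is at or below alpha.
        rejections["median"] += int(p_median <= alpha)
        rejections["trimmed"] += int(p_trimmed <= alpha)
    # Convert rejection counts to proportions.
    return {center: count / n_reps for center, count in rejections.items()}


if __name__ == "__main__":
    print(simulate_fpr())
```

Exact values will vary with the seed and the SciPy version, so the printed proportions should not be expected to match the reported figures digit for digit.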

Software Versions:

Python: version 3.11.13 (Jun 4 2025); SciPy version: 1.16.0
R: version 4.5.1 (2025-06-13); car package version: 3.1.3

Function Calls:

Python (using scipy.stats.levene):

Median-centered: scipy.stats.levene(sample1, sample2, center='median')
10% trimmed mean centered: scipy.stats.levene(sample1, sample2, center='trimmed', proportiontocut=0.1)

R (using car::leveneTest):

Median-centered: car::leveneTest(c(sample1, sample2), factor(rep(1:2, each = 20)), center = "median")
10% trimmed mean centered: car::leveneTest(c(sample1, sample2), factor(rep(1:2, each = 20)), center = mean, trim.alpha = 0.1)

Comparative False Positive Rates (FPRs):

Based on the simulation, the approximate false positive rates were:

Centering Method   Language   Test Description      FPR (nominal 0.05)
Trimmed Mean       Python     Brown-Forsythe W₁₀    0.1160
Trimmed Mean       R          Brown-Forsythe W₁₀    0.0535
Median             Python     Brown-Forsythe W₅₀    0.0410
Median             R          Brown-Forsythe W₅₀    0.0396
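One way to judge whether these rates differ from the nominal level by more than simulation noise is to compare each with the Monte Carlo standard error of a proportion, sqrt(p(1-p)/N). The sketch below does that arithmetic for the four reported rates, assuming a true rate of p = 0.05 and N = 10,000 independent replications; the helper name mc_standard_error and the dictionary labels are illustrative only.

```python
import math


def mc_standard_error(p=0.05, n_reps=10_000):
    """Standard error of an estimated proportion from n_reps independent tests."""
    return math.sqrt(p * (1.0 - p) / n_reps)


# Reported false positive rates, copied from the table above.
reported = {
    "Trimmed Mean / Python": 0.1160,
    "Trimmed Mean / R": 0.0535,
    "Median / Python": 0.0410,
    "Median / R": 0.0396,
}

nominal = 0.05
se = mc_standard_error(nominal, 10_000)

# Distance of each reported rate from nominal, in standard-error units.
z_scores = {label: (fpr - nominal) / se for label, fpr in reported.items()}

print(f"Monte Carlo SE at p = 0.05: {se:.5f}")
for label, z in z_scores.items():
    print(f"{label}: {z:+.1f} SE from nominal")
```

Under these assumptions the standard error is about 0.0022, so the Python trimmed-mean rate of 0.1160 sits roughly 30 standard errors above 0.05, while the R trimmed-mean rate is within about 2 standard errors of it.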

Importance:

This comparison matters to practitioners who rely on these statistical software libraries. The median-centered Brown-Forsythe test gives similar FPRs in both implementations (0.0410 in Python, 0.0396 in R), each close to, and slightly below, the nominal alpha level of 0.05 under these conditions. For the trimmed-mean-centered test, however, the two implementations differ markedly: 0.1160 in Python versus 0.0535 in R. Test performance can therefore vary with the software and the specific test variant used, even when the statistical procedure is ostensibly the same. Further investigation of the implementations is warranted to identify and correct the source of this discrepancy.

The more important question concerns the implications for large, widely used software libraries. In this age of AI, it is easy to pose a question and receive Python code that calls these libraries. Does the statistical community have a role in quality control?

------------------------------
Alan B. Forsythe
Forsythe and Bear LLC
------------------------------

2. RE: Can We Trust The Statistical Libraries In Programming Languages?

Rachel Hunter-Merrill

Posted 08-06-2025 23:56

Hi,

If you believe you have found a bug in a software package, it's best practice to file a bug report with your reproducible example.

https://projects.scipy.org/bug-report.html

Hope this helps,

Rachel

------------------------------
Rachel Hunter-Merrill
------------------------------
