Between stats
State
Type | No. of groups | Test | Effect size | Function used | Implemented |
---|---|---|---|---|---|
Parametric | 2 | Student/Welch | Cohen's d / Hedges' g | Test: scipy.stats.ttest_ind | ❌ |
Non-parametric | 2 | Mann-Whitney U | r (rank-biserial correlation) | Test: scipy.stats.mannwhitneyu | ❌ |
Robust | 2 | Yuen | Algina-Keselman-Penfield | Test: scipy.stats.ttest_ind | ❌ |
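The effect sizes listed above are not scipy functions. For reference, here is a minimal sketch of the usual formulas for Cohen's d and Hedges' g for two independent groups (standard textbook definitions, not fleur's implementation):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def hedges_g(a, b):
    """Hedges' g: Cohen's d rescaled by a small-sample bias correction."""
    n1, n2 = len(a), len(b)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d(a, b) * correction
```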
Reference
fleur.betweenstats.BetweenStats
Statistical comparison and plotting class for between-group analysis.
This class provides functionality to visualize and statistically compare numerical data across two or more categorical groups. It supports t-tests for two groups and one-way ANOVA for three or more groups. Visualization options include violin plots, box plots, and swarm plots.
Attributes:
Name | Type | Description |
---|---|---|
statistic | float | The computed test statistic (t or F). |
pvalue | float | The p-value of the statistical test. |
main_stat | str | The formatted test statistic string for display. |
expression | str | Full LaTeX-style annotation string. |
is_ANOVA | bool | True if test is ANOVA, False if t-test. |
is_paired | bool | Whether a paired test was used. |
dof | int | Degrees of freedom for t-tests. |
dof_between | int | Between-group degrees of freedom (for ANOVA). |
dof_within | int | Within-group degrees of freedom (for ANOVA). |
n_cat | int | Number of unique categories in the group column. |
n_obs | int | Total number of observations. |
ax | Axes | The matplotlib axes used for plotting. |
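A minimal sketch of reading these attributes back, assuming they are populated once plot() has been called (plot() is described below as the step that fits the class to the data):

```python
from fleur import BetweenStats
from fleur import datasets

df = datasets.load_iris()
bs = BetweenStats(df["sepal_length"], df["species"])
bs.plot()  # assumed to fit the statistics (see plot() below)
print(bs.statistic, bs.pvalue)  # test statistic (t or F) and p-value
print(bs.n_cat, bs.n_obs)       # number of groups and total observations
```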
__init__(x, y, data=None, paired=False, **kwargs)
Initialize a BetweenStats() instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Union[str, SeriesT, Iterable] | Colname of data, or a Series/iterable of values. | required |
y | Union[str, SeriesT, Iterable] | Colname of data, or a Series/iterable of values. | required |
data | Optional[Frame] | An optional dataframe. | None |
paired | bool | If True, perform paired t-test (only for 2 groups). | False |
kwargs | | Additional arguments passed to the scipy test function. | {} |
plot(*, orientation='vertical', colors=None, show_stats=True, violin=True, box=True, scatter=True, violin_kws=None, box_kws=None, scatter_kws=None, ax=None)
Fit the BetweenStats class to the data and render a statistical
comparison plot. It detects how many groups you have and applies the
appropriate test for that number of groups. All arguments must be passed as keyword arguments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
orientation | str | 'vertical' or 'horizontal' orientation of plots. | 'vertical' |
colors | Optional[list] | List of colors for each group. | None |
show_stats | bool | If True, display statistics on the plot. | True |
violin | bool | Whether to include violin plot. | True |
box | bool | Whether to include box plot. | True |
scatter | bool | Whether to include scatter plot of raw data. | True |
violin_kws | Union[dict, None] | Keyword args for violinplot customization. | None |
box_kws | Union[dict, None] | Keyword args for boxplot customization. | None |
scatter_kws | Union[dict, None] | Keyword args for scatter plot customization. | None |
ax | Axes | Existing Axes to plot on. If None, uses current Axes. | None |
Returns:
Type | Description |
---|---|
Figure | A matplotlib Figure. |
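Since plot() returns a matplotlib Figure, it can be handled like any other Figure, for example to save it to disk (a sketch assuming standard matplotlib behavior):

```python
from fleur import BetweenStats
from fleur import datasets

df = datasets.load_iris()
fig = BetweenStats(df["sepal_length"], df["species"]).plot()
fig.savefig("between_stats.png", dpi=300, bbox_inches="tight")  # standard Figure method
```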
summary()
Print a text summary of the statistical test performed.
Displays the type of test conducted (t-test or ANOVA), number of groups, and the formatted test statistic with p-value and sample size.
Examples
- Minimalist example

```python
# mkdocs: render
from fleur import BetweenStats
from fleur import datasets

df = datasets.load_iris()
BetweenStats(df["sepal_length"], df["species"]).plot()
```

- Change colors

```python
# mkdocs: render
from fleur import BetweenStats
from fleur import datasets

df = datasets.load_iris()
BetweenStats(df["sepal_length"], df["species"]).plot(
    colors=["#005f73", "#ee9b00", "#9b2226"]
)
```

- Change orientation

```python
# mkdocs: render
from fleur import BetweenStats
from fleur import datasets

df = datasets.load_iris()
BetweenStats(df["sepal_length"], df["species"]).plot(
    orientation="horizontal"
)
```

- Remove elements

```python
# mkdocs: render
from fleur import BetweenStats
from fleur import datasets

df = datasets.load_iris()
BetweenStats(df["sepal_length"], df["species"]).plot(
    box=False,
    scatter=False,
)
```

- Hide statistics

```python
# mkdocs: render
from fleur import BetweenStats
from fleur import datasets

df = datasets.load_iris()
BetweenStats(df["sepal_length"], df["species"]).plot(show_stats=False)
```

- Print summary statistics

```python
# mkdocs: render
from fleur import BetweenStats
from fleur import datasets

df = datasets.load_iris()
BetweenStats(df["sepal_length"], df["species"]).summary()
```

```
Between stats comparison
Test: One-way ANOVA with 3 groups
F(2, 147) = 119.26, p = 0.0000, n_obs = 150
```
Statistical details
When trying to compare groups, you should first answer the following questions:
- Number of groups: the two cases are when there are 2 groups and when there are 3 or more groups.
- Independence of samples: do the groups we're comparing contain the same individuals or different ones?
- Paired groups: comparing the same people before and after giving them a drug
- Independent groups: comparing a placebo and a treatment group
- Data distribution:
- Normal distribution: we use parametric tests (they rely on a known statistical distribution)
- Equality of variance: in parametric tests, we need to know if the variance in each group is the same or not
- Non-normal distribution: we use non-parametric tests (they don't assume any particular distribution)
- Sample size: too small a sample (n < 30) can be an issue because we lack statistical power
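Put together, these questions form a small decision rule. The function below is purely illustrative (its name and signature are made up for this sketch and are not part of fleur); it only returns the name of the recommended test:

```python
def recommend_test(n_groups, paired, normal, equal_var=True):
    """Toy mapping from the questions above to a test name (illustrative only)."""
    if n_groups == 2:
        if paired:
            return "paired t-test" if normal else "Wilcoxon signed-rank test"
        if normal:
            return "Student's t-test" if equal_var else "Welch's t-test"
        return "Mann-Whitney U test"
    # 3 or more groups
    if paired:
        return "repeated measures ANOVA" if normal else "Friedman test"
    if normal:
        return "one-way ANOVA" if equal_var else "Welch's ANOVA"
    return "Kruskal-Wallis test"
```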
Comparing 2 groups
Independent samples
There are 2 cases here, depending on whether we assume the data distribution is normal or not. Often, not assuming normality is more realistic, but it also reduces the power of the test (the probability of detecting a given effect if that effect actually exists).
Here we assume the data distribution is normal.
- Equal variance: if the groups have equal variances: independent t-test.
- Unequal variance: if the groups have unequal variances: Welch's t-test.
In both cases, we use the scipy.stats.ttest_ind() function.
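For example, the equal_var argument of scipy.stats.ttest_ind() switches between the two variants (a minimal sketch with made-up data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.5, size=35)

student = stats.ttest_ind(group_a, group_b, equal_var=True)   # Student's t-test (equal variances)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)    # Welch's t-test (unequal variances)
print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
```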
Here we don't assume anything about the distribution and we need to use the Mann-Whitney U test.
Note that the Mann-Whitney U test compares distributions, not means. This makes sense: if we don't assume normality (e.g. with skewed distributions), comparing means is not the best way to compare the groups, which is what we ultimately want to do.
In this case, we use the scipy.stats.mannwhitneyu() function.
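A minimal sketch, again with made-up (skewed) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.exponential(scale=2.0, size=40)  # skewed, non-normal data
group_b = rng.exponential(scale=3.0, size=35)

res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(res.statistic, res.pvalue)
```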
Dependent (paired) samples
Here we assume the data distribution is normal and we need to use a paired t-test.
Here we don't assume anything about the distribution and we need to use the Wilcoxon signed-rank test.
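Both paired tests are available in scipy as scipy.stats.ttest_rel() and scipy.stats.wilcoxon(). A minimal sketch, assuming before and after are measurements of the same subjects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=120, scale=10, size=30)
after = before - rng.normal(loc=5, scale=8, size=30)  # same subjects measured twice

paired_t = stats.ttest_rel(before, after)   # assumes normality of the differences
wilcoxon = stats.wilcoxon(before, after)    # no normality assumption
print(paired_t.pvalue, wilcoxon.pvalue)
```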
Comparing 3 or more groups
Independent samples
Again, there are parametric and non-parametric approaches depending on the assumption of normality. When normality is assumed, these tests compare group means; otherwise, they compare distributions more generally.
- Equal variance: if the groups have equal variances and normal distributions, use one-way ANOVA.
- Unequal variance: if the groups have unequal variances, use Welch’s ANOVA.
If normality is not assumed, use the Kruskal-Wallis test, which compares the overall distributions across groups.
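One-way ANOVA and Kruskal-Wallis are available as scipy.stats.f_oneway() and scipy.stats.kruskal(); Welch's ANOVA is not in scipy and requires a third-party package such as pingouin. A minimal sketch with three made-up groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(5.0, 1.0, size=30)
g2 = rng.normal(5.5, 1.0, size=30)
g3 = rng.normal(6.0, 1.0, size=30)

anova = stats.f_oneway(g1, g2, g3)    # parametric: assumes normality and equal variances
kruskal = stats.kruskal(g1, g2, g3)   # non-parametric: compares distributions
print(anova.statistic, anova.pvalue)
print(kruskal.statistic, kruskal.pvalue)
```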
Dependent (repeated measures) samples
Assuming normality, use repeated measures ANOVA to compare means across related groups.
If normality is not assumed, use the Friedman test, which compares distributions across related groups without assuming normality.
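The Friedman test is available as scipy.stats.friedmanchisquare(); repeated measures ANOVA itself is not in scipy (statsmodels' AnovaRM is one common option). A minimal sketch, assuming each array holds the same subjects measured under three conditions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cond1 = rng.normal(10, 2, size=25)
cond2 = cond1 + rng.normal(1, 1, size=25)  # same subjects, condition 2
cond3 = cond1 + rng.normal(2, 1, size=25)  # same subjects, condition 3

res = stats.friedmanchisquare(cond1, cond2, cond3)  # non-parametric, related groups
print(res.statistic, res.pvalue)
```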