Hypothesis Testing: P-value Interpretation

Statistical significance implies that the observed difference is likely not due to random variability. I wonder: if a test results in a p-value of 7%, can we say that there’s a 7% chance that the difference occurred due to chance, instead of just saying that we fail to reject the null hypothesis given the 5% threshold?

Thank you very much.

Hello, @giangnh,

I am careful when interpreting hypothesis testing results, always bearing in mind that they do not tell us the reason for the difference. Therefore, if I were you, I would not say, to quote you, “there’s a 7% chance that the difference occurred due to chance”, because we do not know what causes the difference or whether it is due to chance or not.

Cheers,
Raymond

3 Likes

Let me add my 2 cents, and I may be wrong in thinking about this:

The statement “the difference occurred due to chance” is ill-defined (isn’t it always the case that a difference occurs due to chance?)

What we know is, under the assumptions of

  • having a proper measure of the notion of “surprising”
  • the model inducing a probability distribution on the multiverse that matches the probability distribution of model predictions (because the set of events is really the multiverse)

then we have that p(\text{be in a universe in which obs is surprising} \mid \text{model is accurate}) = 7\%

probability of
“being in a universe in which we get observations as surprising or more surprising than the one obtained (as compared to the model)”
given that
“model is accurate”
= 7%

And what is perhaps being asked is whether the above is the same as

probability of
“model is accurate” (i.e. “deviations are due to chance”)
given that
“being in a universe in which we get observations as surprising or more surprising than the one obtained (as compared to the model)”

i.e. whether

p(\text{model is accurate} \mid \text{be in a universe in which obs is surprising}) = 7\%

That is certainly not the case.
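To see numerically why the two conditional probabilities differ, here is a toy Bayes-rule sketch; the prior on the model and the behaviour under the alternative are invented, since nothing in the thread specifies them:

```python
# Toy illustration that P(surprising | model accurate) != P(model accurate | surprising).
# All numbers other than the 7% are invented for the sake of the example.
p_model_accurate = 0.5                # assumed prior belief that the model is accurate
p_surprising_given_accurate = 0.07    # the p-value from the example
p_surprising_given_inaccurate = 0.60  # assumed; depends entirely on the true alternative

p_surprising = (p_surprising_given_accurate * p_model_accurate
                + p_surprising_given_inaccurate * (1 - p_model_accurate))

# Bayes' rule: the "reversed" conditional depends on the prior and the alternative.
p_accurate_given_surprising = p_surprising_given_accurate * p_model_accurate / p_surprising
print(f"P(model accurate | surprising) ≈ {p_accurate_given_surprising:.3f}")  # ≈ 0.104, not 0.07
```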

P.S.

For those with a deep interest in the philosophy of statistical testing:

Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction
Deborah G. Mayo and Aris Spanos

Deborah G. Mayo wrote a whole book about this; here is a review:

3 Likes

Thank you for your help with my question :folded_hands::folded_hands::folded_hands: @rmwkwok @dtonhofer

I want to clarify what I mean by “due to chance”. Let’s consider an A/B test with a random user split. Let’s say we observed a 5% higher conversion rate in the treatment group, yielding a p-value of 0.07.

My interpretation of this p-value is that if there were no actual difference between the control and treatment groups (the null hypothesis), there would be a 7% probability of observing a difference of 5% or greater solely due to the random assignment of users. Is that correct?
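To make that interpretation concrete, here is a minimal simulation sketch. The per-group sample size and baseline conversion rate below are made up (the post does not give them), chosen so that the simulated p-value lands near 7%:

```python
# Simulation sketch of "7% probability of a difference of 5pp or more under the null".
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 550      # hypothetical number of users per group
baseline_rate = 0.30   # hypothetical conversion rate under the null
observed_lift = 0.05   # the 5-percentage-point difference from the example
n_sims = 200_000

# Simulate many A/B tests in a world where the null hypothesis is true:
# both groups convert at the same underlying rate.
conv_a = rng.binomial(n_per_group, baseline_rate, size=n_sims) / n_per_group
conv_b = rng.binomial(n_per_group, baseline_rate, size=n_sims) / n_per_group

# Fraction of null-world experiments showing a difference at least as large
# as the observed one (two-sided); this is what the p-value estimates.
p_estimate = np.mean(np.abs(conv_a - conv_b) >= observed_lift)
print(f"Estimated p-value under the null: {p_estimate:.3f}")
```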

The seemingly arbitrary nature of p-value thresholds like 0.1, 0.05, and 0.01 is haunting me. While I grasp the general idea of stricter thresholds for stronger evidence and more lenient ones for lower-stakes scenarios, the specific cutoffs feel somewhat subjective. The proximity of our observed p-value (0.07) to these common thresholds highlights this. If my interpretation of the p-value is correct, I believe presenting this 7% probability of a chance occurrence directly to stakeholders would empower them to apply their own desired level of confidence in the results.

I agree with the part quoted above, but I would, again, refrain from discussing “due to” here because I don’t know what caused the difference. This is why I didn’t quote the “due to” part.

Perhaps you were stating a condition or an assumption of your test? However, when I read “due to”, it seems to be discussing the cause of the difference, so I would refrain from that.

Below are some of my doubts about this idea.

First, this does not reduce the degree of subjectivity, because stakeholders are going to make their own subjective assessments of the test results.

Second, the alpha value is supposed to be set before the test begins. However, given the intent to pass the p-value to the stakeholders after the test, is that practice going to change? If so, how does setting it afterwards make for a more justified alpha value than setting it before?

Third, how are non-technical stakeholders supposed to reason with these p-values, for example, when they see a range of different p-values in their day-to-day meetings?

2 Likes

Correct me if I’m wrong, but I think this reasoning is the wrong way around.

In the case of treatment/non-treatment:

  • You start with the assumption of having similar control and treatment groups (based on random sampling from a population, although in real life this is generally not possible, especially in patient-care settings, which makes life interesting)
  • You write down the hypothesis that the treatment will change nothing (the “null” hypothesis)
  • You submit the treatment group to treatment and the control group to no treatment, making sure that no ancillary information is given to the group members as to which group they are in (placebos/nocebos and double-blind testing)
  • Later you collect information on the status of the members of both groups and check whether the hypothesis should be maintained, given whether each member’s status has improved or worsened.

In the case of A/B testing:

  • You start with the assumption of having similar groups for exposure to A and B
  • You write down the hypothesis that exposure to A and exposure to B have similar effects (the “null” hypothesis)
  • You submit the A group to A and the B group to B
  • Later you collect information on the status of the members of both groups and check whether the hypothesis should be maintained, depending on some observed behaviour (e.g. conversion rate)

It’s not that you perform A/B testing on the population to try to separate them into two distinct groups. That would be just “measuring a feature”. For example, separation into groups by blood type is done by checking people’s antigens.

As for the cutoffs, they are subjective. It all comes down to a bet: “How much money am I going to lose if I erroneously maintain a wrong null hypothesis?” versus “How much money am I going to lose if I erroneously reject a correct null hypothesis?” You may want to be much surer than you currently are before making the decision, and invest more money and time in collecting additional data. (Amazingly, this is not at all what happened for the mRNA COVID vaccine trials, where the testing was apparently extremely sloppy before the product went into the vials, and the control group was erased by also vaccinating it. Weird, that.)

This happens in physics, for example. Massive amounts of measurement of a physical phenomenon may reveal that the measurements differ from what would be expected under the null hypothesis of “nothing unexpected” by “3 sigma” (this corresponds to a probability of p ≈ 0.0027 of getting a result at least that extreme from random variation alone; I had to look that up). This is generally enough to start discussions in the corridor and, if there is no competition with other groups, make a small-group presentation and then maybe a breathless announcement in the press. But they want more assurance and thus continue to collect data until “5 sigma” (corresponding to p ≈ 0.0000003), at which point the announcement of a proper discovery is made.
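For reference, these sigma-to-p-value conversions can be checked with a couple of lines of scipy; the snippet below is only an illustration (conventions differ on one- vs two-sided tails, which is why both are printed):

```python
# Convert "sigma" thresholds into tail probabilities of the standard normal
# distribution (illustrative; not part of the original post).
from scipy.stats import norm

for sigma in (3, 5):
    one_sided = norm.sf(sigma)       # P(Z >= sigma)
    two_sided = 2 * norm.sf(sigma)   # P(|Z| >= sigma)
    print(f"{sigma} sigma: one-sided p ≈ {one_sided:.1e}, two-sided p ≈ {two_sided:.1e}")
```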

1 Like

Chance implies what we can’t possibly know, while what the p-value tells us is what we don’t know given the data we have. What we don’t know could be due to chance, but it could also be a lack of data, which we don’t yet know.

In a sense, chance can also be interpreted as what we don’t know yet. That would be a more philosophical than scientific discussion, although they are the same thing at the extreme.

1 Like

According to my understanding of statistics, a hypothesis is not rejected based on the p-value alone, but only if the p-value is equal to or less than alpha (a threshold set before conducting the hypothesis test).

The usual alpha value is 0.05, or 5%, but this can vary based on the sample data one is working with.

Sample size: if one is working with a larger sample size, the p-value might be smaller even if the effect size is small.

Usually the p-value is computed from a test statistic such as the F-statistic.

The F-statistic is calculated based on the variance between groups compared to the variance within groups, and the p-value is the probability of observing such an F-statistic (or a more extreme one) if the null hypothesis (no difference between group means) is true.

A larger F-statistic would suggest that the differences between group means are greater than the variability within each group, indicating a stronger likelihood that the observed differences are not due to random chance alone.

The p-value is calculated by comparing the calculated F-statistic to the F-distribution, considering the degrees of freedom for the numerator (between-group variance) and the denominator (within-group variance).

A small p-value (typically less than 0.05) suggests that the observed results are unlikely to have occurred if the null hypothesis were true, leading to the rejection of the null hypothesis and the acceptance of the alternative hypothesis (that there is a significant difference between group means).
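As a rough illustration of the F-statistic/p-value relationship described above, here is a minimal one-way ANOVA sketch on made-up data (all values are hypothetical):

```python
# Minimal one-way ANOVA sketch: F compares between-group to within-group variance,
# and the p-value is the tail area of the F-distribution beyond that F.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.30, scale=0.05, size=40)   # hypothetical metric, group A
group_b = rng.normal(loc=0.33, scale=0.05, size=40)   # hypothetical metric, group B

f_stat, p_value = stats.f_oneway(group_a, group_b)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```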

The p-value depends on the sample size, so for your value of 7%, check how it was computed and then relate it to your data distribution. You should be able to find the reason.

Because the p-value is still a probability and does not by itself signify the real effect, you should relate your data sample to the appropriate t-test (based on the kind of data you are working with), the F-statistic, and the degrees of freedom to carry out your hypothesis test.

So check your sample means and sample variances between the two groups and also within each group. Then compute the standard error of the mean, which tells you how precise your sample mean is; this helps you calculate the t-statistic. Next, calculate the degrees of freedom, which determine the critical value for the t-statistic and are based on the sample sizes of the groups being compared. This comparison helps you determine how significant your t-statistic is, which in turn gives the p-value and lets you know whether your p-value of 7% holds significance for the sample size you are working with.
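Here is a minimal sketch of that recipe using Welch’s t-test on made-up samples; every number is hypothetical and only meant to show the mechanics (means, variances, standard error, t-statistic, degrees of freedom, p-value):

```python
# Welch's t-test built up from the quantities mentioned above (illustrative data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(loc=100.0, scale=15.0, size=60)     # hypothetical metric
treatment = rng.normal(loc=106.0, scale=15.0, size=60)   # hypothetical metric

m1, m2 = control.mean(), treatment.mean()
v1, v2 = control.var(ddof=1), treatment.var(ddof=1)
n1, n2 = len(control), len(treatment)

se_diff = np.sqrt(v1 / n1 + v2 / n2)              # standard error of the mean difference
t_stat = (m2 - m1) / se_diff                      # t-statistic
# Welch-Satterthwaite degrees of freedom
dof = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
p_value = 2 * stats.t.sf(abs(t_stat), dof)        # two-sided p-value

print(f"t = {t_stat:.2f}, dof = {dof:.1f}, p = {p_value:.4f}")
# Cross-check against scipy's built-in Welch t-test:
print(stats.ttest_ind(treatment, control, equal_var=False))
```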

2 Likes

Randomly splitting users into two groups is intended to control for all existing factors so that the treatment becomes the only differentiating factor. Therefore, I think we can establish a causal relationship, and thus the use of ‘due to’ is reasonable: the difference is due to the treatment or to random variation. :sweat_smile::sweat_smile::sweat_smile:

My goal isn’t to eliminate subjective judgment, but to provide transparency on the level of uncertainty so stakeholders can assess risk effectively. Instead of a simple ‘The test failed,’ we can say, for example:

  • ‘Assuming no real effect, there’s a 7% chance of observing a difference as large as, or larger than, 5%.’
  • Or, for easier understanding: ‘We are 93% confident that the observed 5% difference is not due to random chance.’ :thinking::thinking::thinking:

What do you think? @rmwkwok @dtonhofer @powerpig @Deepti_Prasad

Thank you all for your help and patience. I’m new to statistics and I’m not sure I understand you correctly, which is why I keep asking “silly” questions :folded_hands::folded_hands::folded_hands:

My response was based only on your main query, but your subsequent reply made me read @rmwkwok’s response, which actually states what I too stated about the alpha value (the designated threshold) against which your p-value is checked: if the p-value is equal to or smaller than this alpha, you reject the null hypothesis.
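In code form, that decision rule with the numbers discussed in this thread looks like this (purely illustrative):

```python
# Decision rule: reject the null hypothesis only if p <= alpha (alpha fixed before the test).
alpha = 0.05     # the threshold from the thread
p_value = 0.07   # the observed p-value from the thread

decision = "reject the null hypothesis" if p_value <= alpha else "fail to reject the null hypothesis"
print(decision)  # -> fail to reject the null hypothesis
```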

You also didn’t mention the steps you took to get a p-value of 7%.

As I stated before, a hypothesis test is not simply rejected or accepted based on the p-value alone, and that is what my earlier response explained in detail, covering the checks to do after you split users randomly into groups, as you mention.

So determine your t-statistic, degrees of freedom, standard deviation, and standard error of the mean, which would then help you determine statistically whether a p-value of 7% is correct and, if correct, whether it holds significance given your sample size.

The problem in responding here is that we don’t know what kind of data you are working with, for example whether it calls for ANOVA, which analyses variance between two or more groups for a single dependent variable, or MANOVA, which analyses multiple dependent variables between different groups.

I don’t know if you know SAS programming, but it has great statistical procedures that will give you a statistical analysis of your data, provided the data is supplied correctly without any errors or missing values.

1 Like

Let’s assume our tests are always perfect (which is a bold assumption).

You said “the difference is due to the treatment or random variation”, but can we be sure which one - treatment or random - is the cause? Do we know the cause?

My 2nd and 3rd doubts still hold… I think we need to think not just from our end, about what we can offer, but also about how stakeholders can effectively use what we offer. Are your stakeholders well trained to make the assessment? How can they convert “7%” and “5%” (which are numbers in your specialty) into risk factors/figures relevant to their specialties?

2 Likes

@giangnh, there are a couple of things in your last response that are worth discussing; however, I think it is better to first clear up the points covered in my last response and see what you think about them.

I guess you might be looking for a way to communicate with your stakeholders, but for my role here, my primary concern is the science.

2 Likes

but for my role here, my primary concern is the science.

Now I’m having flashbacks to the GlaDOS song :laughing: which is possibly about survivorship bias.

1 Like

I think I see my error now. I was mixing up hypothesis testing and A/B testing. Hypothesis testing is about finding evidence of a relationship or association. In contrast, A/B testing is a controlled experiment designed to prove that one thing directly causes another. This is why I keep using ‘due to’ when discussing the impact observed in A/B tests.

I agree that defining thresholds before conducting tests is a good practice to avoid confirmation bias. I just recalled that a failed test means there is not enough evidence; a failed test doesn’t indicate that there is no effect.

Thank you indeed for your help.
Wish you a nice weekend.
Gigi Ng.

3 Likes

As a complement to the notion of “causality”:

and of course by Judea Pearl:

Hello, Gigi, @giangnh

Sometimes I hear people say we can only either “reject or fail to reject a hypothesis”, and that “a hypothesis test does not prove either the null or the alternative hypothesis is true”. I think the latter is quite consistent with the fact that a hypothesis test does not tell us the cause. Your mentioning of both “the treatment or random variation” when discussing the difference actually pointed out that all possible causes remain possible.

The development of a causal model is pretty interesting. One may first assume such a model and then measure the causal effect with an experiment (that involves hypotheses derived from such a model), while sometimes observations in the data may give us ideas about how to develop a model. Either way, I think a causal model is always assumed by us and gets verified/supported by our experiments (which involve hypotheses derived from it).

Cheers,
Raymond

2 Likes