Wk 1 Birthday problem 4 might have wrong algebra at the minimum (the logic might be flawed as well)

Hi everyone,

I don’t understand how we ended up with (1-1/365)^(n^2) as a final solution.

The question itself is ambiguous:

  1. Is it 2 rooms each with n students? Or is it 2 rooms with a total of n students?
  2. If it is the latter, then how is n divided up in the 2 rooms?

Assuming we have n students in each room, then the probability of a student in Rm1 NOT matching his b-day with another student in Rm 1 is given by : Q = (1-1/365)^(n); (same for rm 2)

Now, if we cross-match Rm1 with Rm2, the solution given in the lab implies population n is double (which is true). But doubling should not be n^2, it should be 2n. Let’s suppose, Room 1 with say 3 kids, and Room 2 with 3 kids: this gives 6 kids (2n), not 9 kids (n^2).

Also, in the problem, it stated that: 𝑄1=(1−1/365)^n; this is not exactly correct.

Because: Q1 = ( 1-1/365)^0 = 1 where n = 0

And if Q (with 2 rooms) = Q(room1) x Q (Room2), assuming event independence, we won’t end up with Q(rm1 & 2) = (1-1/365)^(n^2)

I would appreciate an explanation.

1 Like

Hi Kevin,

I agree with you that the question itself is ambiguous (Is it 2 rooms each with n students? Or is it 2 rooms with a total of n students?). Furthermore, the logic in the solution seems incorrect as well.

I explained why the solution involves n^2 in this post: Seemingly incorrect analytical solution to Problem 4 of Birthday Lab - #5 by Jason7

Anyway, it would be great if the course provider could provide clarification and update the materials.

1 Like

Pang,

Thank you for sharing your insights. I agree with your stated logic.

Have you been able to get any feedback on an alternative final solution to the problem from other students or class instructors?

I am holding off on working on this problem until someone who has a credible answer steps in.

Kevin

1 Like

@AeryB, can you look into this?

Hi @TMosh,
Forgive me for arriving this late!


So, hey @Kevin_Shey!

[P: problem, C: classroom, P_x: probability associated with x^{th} problem]

At first, I will say, P4 assumes 2 different rooms; each containing n students.
(And so, I am skipping your 2nd question. Though, I should agree that it needed that gone silent “each”!)

More clearly,

C1 C2
Students n n

Now, let’s address your 2nd doubt head-on,

  • The n^2 in power doesn’t have anything to do with the total population.
  • We treat one of the classrooms (say, C1) as the default set of values defined in P1.
  • The other classroom (C2) may be assumed as the generator for the predefined values in P1. Whereas, the values generated (here, the birthdays of the students) are expected to be independent of each other.
  • Now, i^{th} value (b-day) in C2, Q_i is equivalent to P_1(S^{c}) = (1 - \frac{1}{365})^n.
  • Then, using the independence of all the n values (b-days) in C2,
    Q = Q_1 Q_2...Q_n = (1 - \frac{1}{365})^n(1 - \frac{1}{365})^n...n times = (1 - \frac{1}{365})^{n+n+...(n\_times)} = (1 - \frac{1}{365})^{n^2}
  • Finally, we obtain, P_4 = 1 - Q

Tell me if your doubt remains!
This applies to you as well, @pang.luo!


Best wishes…^^

Thank you @AeryB for explaining the solution. It’s actually the same as my explanation in the other post: Seemingly incorrect analytical solution to Problem 4 of Birthday Lab, although in that post the roles of C1 and C2 are opposite to yours.

However, I’m sorry to say the solution is incorrect. Apologies :smile:

To verify my claim, assume each classroom consists of 2 people and there are only 5 days in a year instead of 365. You can use simple math to figure out the true answer is 365/625. However, the solution will produce 369/625. My post shares a simple python script about that.

The root cause is that Q = Q1 * Q2 * … Qn is incorrect. The events denoted by Q1, Q2, … Qn are not independent.

I feel that a correct analytical solution to this problem would have a complex form unfortunately.

1 Like

Now, I see your pov.

However, there is a fatal issue. You are using addition rule instead of multiplication rule. Subtracting the result of addition rule from 1, however, won’t obtain P_4.

We won’t assume it actually. Our independency signifies only the fact that birthdays of 2 students in C1 (or C2, it really doesn’t matter) don’t somehow affect each other.

Where it may not be independent situation - say, there are people who are twins (or even more).

If you don’t mind, let me mention @Jason7.


Best!

Thank you for following up @AeryB . Let’s forget all the reasoning I gave in the two posts earlier :smile: I think a fast way to verify that the given solution to P4 is incorrect is write a simple Python script and compare the true answer to the solution. This is the screenshot of a simple Python script assuming there are 2 persons for each classroom and there are 5 days in a year. You can see clearly the solution 369/625 [1 - (1 - 1/5)^4] is different from the true answer (365/625). Let me also attach the script and the detail of the output. It’s very simple: a and b are in classroom 1 and c and d are in classroom 2.

Script:

Output:
image

detail.csv (6.1 KB)
birthday.py (533 Bytes)

I guess you are still using same logic as your previous post for your counting operations.
Now, what if there are multiple matches in your conditional statement?
Say, if a = c and a = d at the same time, why your algo won’t increase counter by 2?

Tell me how you think about it.


Best…^^

Hi AeryB,

The attachment lists all 625 combinations for 2 people in each classroom and 5 days a year, and 365 of them are matches. One row denotes one combination and the match field is 1 if the combination is a match, otherwise 0. So the probability would be 365/625. If you use the given solution, you will reach 1- (1 - 1/5) ^ (2^2) = 369/625.

Now returning to your statement “say, if a = c and a = d at the same time”, there will be multiple combinations that align with it:
a=0, b=0, c=0, d=0
a=0, b=1, c=0, d=0
a=0, b=2, c=0, d=0
a=0, b=3, c=0, d=0
a=0, b=4, c=0, d=0
a=1, b=0, c=1, d=1
a=1, b=1, c=1, d=1
etc
where a and b are for classroom 1; c and d are for classroom 2.
Each row will be treated as one match.

Let me put this question in another way, if you open the attached detail.csv, do you think the way of counting is wrong? If it’s wrong, which rows are wrong? If detail.csv is correct, then the solution is incorrect. The root cause is that Q = Q1 * Q2 * … Qn is incorrect.

In fact, I just talked to one of my friends who majors in maths. He also points out the solution is incorrect.

detail.csv (6.1 KB)

This is becoming really interesting.

Well, first thing, it is not that I am supporting the analytical approach and against what your algorithm is trying to do. :sweat_smile:

However, as I said in my previous post, what is your reasoning behind not increasing the counter (matches in your code) likewise when your algo satisfies more than one boolean propositions. Why you’d increase it by only 1 in such cases?
This is what I am using against your algorithm. But, for the analytical case, I, from my current pov, have nothing to use against it. That’s why I can’t deny it. If you have any valid proof/logic, I must reconsider.

Finally, I am not saying that the analytic one or your algorithm is correct. I am just suggesting that I can’t deny the former, whereas, for the latter I have reasons (that I already pointed out) to deny it. (Here, I am using similar approach as hypothesis testing that comes later in the course…:sweat_smile:).


Best wishes…^^

Also, I really appreciate your effort. I am getting what you’re trying to say using your code.
Finally, I overlook neither your code, nor the database you generated.

But, you must use proper reasoning to counter what I told against it.


Thanks and best wishes…^^

Thanks for all the effort Pang and AeryB.

I have modified Pang’s code a little. I am counting the additional matches beyond 1 for each trial. So if A=C and A=D and B=C and B=D, this will be counted as 4 matches.

I have enclosed the modified code
b-day match.py (1.3 KB)
and the output.

Output Example: A=4, B=4, C=4, D=4, Match=4, Match_tries = 625
birthday match output.csv (37.2 KB)

true answer: 500 / 625 = 0.8
solution: 0.5903999999999999

The true answer probability: 0.8 vs analytic solution: 0.59

Obviously, 0.8 is the brute force match probability. So if the logic is correct, the probability is correct.

Since the adjustment of match count has been adjusted, would both of you agree with this new result?

@AeryB

Probably my language skills are pretty bad, but I’ll try to make everything clear.

Well, first thing, it is not that I am supporting the analytical approach and against what your algorithm is trying to do. :sweat_smile:

My Python script is not an algorithm or anything fancy. It just uses brute force to list all the combinations for 4 people’s birthdays in our universe where a year consists of only 5 days.

However, as I said in my previous post, what is your reasoning behind not increasing the counter (matches in your code) likewise when your algo satisfies more than one boolean propositions. Why you’d increase it by only 1 in such cases?

Let me talk through what this code does to answer your question. In the screenshot below, a denotes the birthday of person A in classroom 1, b for person B in classroom 1; c for C in classroom 2; d for D in 2. Line 5 - 8 consists of 4 nested loops, which generates all the 625 combinations for our setting (5 days in a year, n=5). What does it mean? It means that the probability that person A’s birthday is day 0, B day 0, C day 0 and D day 0 is 1/625; the probability that person A’ birthday is day 0, B day 0, C day 0 and D day 1 is also 1/625; the probability that person A’ birthday is day 0, B day 0, C day 0 and D day 2 is also 1/625; and so on.

Now we come to line 9. This is the first iteration of the code and a = b = c = d = 0. This denotes one of the 625 combinations. Does it satisfy the condition that the birthday of a person from classroom 1 is the same as another one from classroom 2? Yes, so we assign 1 to the match variable to express that one of all the combinations meets the condition. You asked why I’d increase match by only 1 when a=b=c=d=0? Well, this is just one combination. If it meets the condition, match is 1, otherwise is 0. So very naturally match can only be 1 or 0 for one combination. Maybe I made you confused by naming this variable match? Maybe I should have named it is_condition_satisfied?

This is what I am using against your algorithm. But, for the analytical case, I, from my current pov, have nothing to use against it . That’s why I can’t deny it. If you have any valid proof/logic, I must reconsider.

OK. I’ll show why the analytic solution is incorrect with another example. It seems that my initial setting of 5 days a year is not simple enough to show why the solution is incorrect. Let’s make it very, very simple so that we can do the calculation by hand.

Let’s say in our even stranger universe a year just consists of 2 days. First, how many combinations do we have? 16. So there will be 16 rows below. Each row denotes one possibility out of 16. For example, “0,0,0,0,yes” means that the probability that the birthday of all the 4 persons is day 0 in our strange universe is 1/16 and it satisfies the condition that
the birthday of a person from classroom 1 is the same as another one from classroom 2.

There are 14 yes out of 16 rows, so the probability will be 14/16. What does the analytical solution produce? It would be 1 - (1- 1/2) ^ (2^2) = 15/16. 15/16 != 14/16, which shows that the solution is incorrect.

A birthday,B birthday,C birthday,D birthday,is_condition_satisfied
0,0,0,0,yes
0,0,0,1,yes
0,0,1,0,yes
0,0,1,1,no
0,1,0,0,yes
0,1,0,1,yes
0,1,1,0,yes
0,1,1,1,yes
1,0,0,0,yes
1,0,0,1,yes
1,0,1,0,yes
1,0,1,1,yes
1,1,0,0,no
1,1,0,1,yes
1,1,1,0,yes
1,1,1,1,yes

@Kevin_Shey

Thank you for using my little toy script to play around. :grinning:

Sorry but I don’t think these lines make sense and I explained the reason in my latest reply to AeryB. Simply put, match can only be 0 or 1 because you’re dealing with one combination right now. Sorry but match may not be a good name. Maybe I should have named it is_condition_satisfied.

Above I showed that the analytical solution is incorrect using a setting where there are only 2 days in a year and there are two students in each classroom. In a nutshell, we can easily figure out that the probability that there is at least one match between students in one classroom and students in the other is 14/16, using any brute force method, while we will get a probability of 15/16 with the analytical solution.

I will explain why the solution is incorrect below, but unfortunately I’ve found that it’s hard to find a correct one. I hope some smart peer will figure it out and share it.

Denote by s the number of students in each classroom, n the number of days in a year (365 days on earth, 687 on Mars, and 2 in my assumed universe), nomatch_k a macro meaning student k in classroom 1 has no match with anyone in classroom 2, Q_k the probability of nomatch_k, i.e. Q_k = P(nomatch_k). Please note that I use s to denote the number of students, which is different from the lab. This is because I already used n to denote the number of days in a year in my code.

The solution in the lab states Q = Q_1 * Q_2 * ... * Q_s, i.e. Q = P(nomatch_1) * P(nomatch_2) * ... * P(nomatch_s), which is incorrect. The correct formula should be

Q = P(nomatch_1) * P(nomatch_2 | nomatch_1) * ... * \\ \hspace{1cm} P(nomatch_s | nomatch_1\ and\ nomatch_2\ ... and\ nomatch_{s-1})

The solution assumes
P(nomatch_2) = P(nomatch_2|nomatch_1) and P(nomatch_3) = P(nomatch_3|nomatch_1\ and\ nomatch_2) and so on. This is incorrect.

Given nomatch_1, the birthday combinations for students in classroom 2 have already been restricted. This is why P(nomatch_2) = P(nomatch_2|nomatch_1) is incorrect.

We can verify the above statement using the setting where s=2 and n=2 (2 students in each classroom and 2 days in a year). Given nomatch_1, there are only two combinations of birthdays for classroom 2, i.e. both students were born on day 0, or both were born on day 1. It is easy to see that P(nomatch_2|nomatch_1) = 1/2 and P(nomatch_1) = P(nomatch_2) = 1/4.
So Q = P(nomatch_1) * P(nomatch_2 | nomatch_1) = 1/4 * 1/2 = 1/8.
With the complement rule we will have 1 - Q = 7/8 = 14/16, which is the correct answer.
The analytical solution will produce Q = P(nomatch_1) * P(nomatch_2) = 1/4 * 1/4 = 1/16 and 1- Q = 15/16, which is incorrect.

There is a minor thing I need to point out. There is a missing step in the reasoning process of the analytical solution. Basically Problem 1 asks about the probability that there is not a match between a predefined day and a classroom of students, while for Problem 4 what we really need as a starting step is the probability that there is a not match between a student in classroom 1 and another classroom of students. They are different because the birthday of that student in the latter scenario is not fixed while in the former it’s predefined, but interestingly the two probabilities turn out to be equal to each other. It is easy to figure that out.

Update

@Kevin_Shey @TMosh @AeryB @Jason7

The analytical solution is indeed incorrect. Finally I found out the correct solution at wikipedia (Birthday problem - Wikipedia):
The basic problem considers all trials to be of one “type”. The birthday problem has been generalized to consider an arbitrary number of types.[18] In the simplest extension there are two types of people, say m men and n women, and the problem becomes characterizing the probability of a shared birthday between at least one man and one woman. (Shared birthdays between two men or two women do not count.) The probability of no shared birthdays here is

When d=2, m=2, n=2 (denoting there are 2 days in a year, and there are 2 students in each classroom), p_0=1/8, which is the same as the one I produced by brute force or by the Bayes rule.
The analytical solution will produce 1/16, which is incorrect.

I suggest that the lab be updated.

Another update:
My friend and I implemented the formula on wikipedia and ran some comparison between the the value it generates with the analytical solution. I found out that when d (the number of days in a year) is large enough, the two values are close. So the analytical solution can be seen as an approximation of the wikipedia solution when d is large enough. However, I reckon it is still worth updating the lab explaining that Q = Q_1 * Q_2 * ... * Q_s is not mathematically correct and it is just an approximation when d is large enough.
I attached the code so that anyone interested can play with it.
stirling.py (1.2 KB)

1 Like