Direct link to notebook:
In the code snippet below, the goal is to estimate a confidence interval for the mean using only the holiday days in the dataset:
# Select the data only for holidays
daily_rides_holidays = daily_rides[daily_rides["date"] > "2022-12-17"]
# Compute sample mean and standard deviation for holidays
mean_rides_per_day_holidays = daily_rides_holidays['daily_rides'].mean()
std_rides_per_day_holidays = daily_rides_holidays['daily_rides'].std()
print(f'Mean number of rides per day: {mean_rides_per_day_holidays:.2f} +/- {std_rides_per_day_holidays:.2f}')
# Get the confidence interval for the population mean for the holidays.
critical_value_holidays = scipy.stats.t.ppf(1 - (1 - confidence)/2, df=len(daily_rides_holidays)-1)
total_days_holidays = daily_rides_holidays['date'].count()
confidence_interval_holidays = critical_value * std_rides_per_day_holidays / np.sqrt(total_days_holidays)
print(f"With a {100 * confidence}% confidence you can say that your error will be no more than {confidence_interval_holidays} rides per day.")
I noticed that the formula for confidence_interval_holidays is using the previous code cell’s critical_value, shown below:
critical_value = scipy.stats.t.ppf(1 - (1 - confidence)/2, df=len(daily_rides)-1)
While the formula itself is correct, critical_value was computed with df=len(daily_rides)-1, so it seems to reflect the whole year’s sample size from the previous calculation rather than the holiday-specific one.
In my head, calculating the confidence interval for holidays should be:
confidence_interval_holidays = critical_value_holidays * std_rides_per_day_holidays / np.sqrt(total_days_holidays)
Where you use critical_value_holidays instead of critical_value.
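To illustrate the difference, here is a minimal sketch that computes the margin of error both ways. It does not use the notebook's data: the daily_rides DataFrame is filled with synthetic values, and confidence = 0.95 is an assumption on my part. With 14 holiday days the t critical value should come out at roughly 2.16 (df=13), against roughly 1.97 for the full year (df=364), so using critical_value makes the holiday interval look narrower than it should be.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Synthetic stand-in for the notebook's daily_rides DataFrame (hypothetical values).
rng = np.random.default_rng(0)
dates = pd.date_range("2022-01-01", "2022-12-31", freq="D")
daily_rides = pd.DataFrame({
    "date": dates,
    "daily_rides": rng.normal(loc=1000, scale=200, size=len(dates)),
})

confidence = 0.95  # assumed; the notebook defines this in an earlier cell

# Same holiday filter as in the snippet above (Dec 18 to Dec 31).
daily_rides_holidays = daily_rides[daily_rides["date"] > pd.Timestamp("2022-12-17")]
total_days_holidays = len(daily_rides_holidays)
std_rides_per_day_holidays = daily_rides_holidays["daily_rides"].std()

# Critical value based on the full year's degrees of freedom.
critical_value = scipy.stats.t.ppf(1 - (1 - confidence) / 2, df=len(daily_rides) - 1)
# Critical value based on the holiday sample's degrees of freedom.
critical_value_holidays = scipy.stats.t.ppf(1 - (1 - confidence) / 2, df=total_days_holidays - 1)

# Margin of error computed with each critical value.
margin_full_year_cv = critical_value * std_rides_per_day_holidays / np.sqrt(total_days_holidays)
margin_holiday_cv = critical_value_holidays * std_rides_per_day_holidays / np.sqrt(total_days_holidays)

print(f"critical_value (full year):  {critical_value:.4f}")
print(f"critical_value_holidays:     {critical_value_holidays:.4f}")
print(f"margin using critical_value:          {margin_full_year_cv:.2f}")
print(f"margin using critical_value_holidays: {margin_holiday_cv:.2f}")
```

The standard deviation and sample size are identical in both margins, so the only thing that changes is the critical value, and the holiday-specific one is always the larger of the two here because the sample is smaller.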
Is my thinking correct or did I miss something? Thanks in advance for the clarifications!