Scatter plot of Hartford, CT daily high and low temperatures from 1960 to 2026, featuring a blue 'recent range' ribbon and highlighted 'inconvenient' days exceeding 90°F and 10°F.

Weather data from NOAA: This chart is similar to a previous post titled, Temperature Data Visualization in Python, except that this chart includes significantly more historic data. Instead of looking for new daily records, this chart looks at annual highs and lows as defined by set limits. The physics of the climate are one thing; the way we bracket and categorize data to tell a story is another. This post isn’t about meteorology–it’s about how one can apply chosen limits and categorical labels to a data set in support of a personally preferred narrative. This post satirically explores setting those limits and uses p-hacking to construct and support a narrative.

Executive summary for the impatient: there are exactly 42 days every year that meet the ‘too hot, too cold’ criteria, 21 in each direction. The climate is balanced.

Before getting into the details of this analysis, I want to appreciate and applaud the structured data supplied by NOAA. It’s so easy to download and explore: select export to CSV and you’ll find dates in the Date column and temperatures in the temperature columns.
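For reference, here is a minimal sketch of loading such an export with pandas. The station IDs and values below are invented stand-ins; DATE, TMAX, and TMIN are the usual column names in an NOAA Climate Data Online CSV export, but check your own file.

```python
import io

import pandas as pd

# Stand-in for an NOAA Climate Data Online CSV export; the station IDs
# and temperatures are invented, but the STATION/DATE/TMAX/TMIN layout
# matches what the export typically contains.
csv_text = """STATION,DATE,TMAX,TMIN
USW00099999,1960-01-01,35,18
USW00088888,1960-01-01,33,16
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])
print(df.shape)          # (2, 4)
print(df["DATE"].dtype)  # datetime64[ns]
```

With a real download, the only change is passing the file path instead of the `StringIO` buffer.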

In this exercise, I downloaded historic weather data collected from 2 weather stations in Hartford, CT, located at 2 airports roughly 18 miles apart. The data set starts off with roughly 48 thousand rows, but it needs to be cleaned.

Some cleaning is straightforward, such as days where one of the stations wasn’t able to record a high or low temperature. In other cases, a temperature is listed, but it’s questionable. This data set had a few freezing days in April, but upon inspection, there was a large difference in temperature readings between the two airports, revealing that the low reading was an error.

I spot checked several dates for which the temperature readings at the two stations differed by more than 20 deg F, and in the final chart and analysis, I dropped roughly 300 dates with spreads greater than 20 deg F. In addition, roughly 300 dates had at least one missing value, so the final data set after cleaning had roughly 47.4K rows.

This data cleaning is fairly trivial, and I compared the results with and without dropping rows that had unrealistic values. The chart and analysis presented here include the cleaning, because even in a chart with 96K data points, outliers are obvious, such as sub-zero temperatures in late April. Since these points can also be researched via internet search, I feel justified dropping them, but I’m highlighting this decision simply to show it is a decision.

In this case, since the data is structured and redundant, since I have data from 2 stations, and since the data represent numerical sensor readings, it is easy to justify and explain how the data was cleaned. This is not the case for all data sets. Consider messier data, such as internet reviews and crime statistics. When crime stats are presented, somebody determined what constituted a crime and what triggered that event’s inclusion in the data set. Historically, even the definition of a crime varies year to year. I bet that is why, when states and cities are really serious about crime going down, they look at the murder rate–something that is harder to define away.

Back to the NOAA data, let’s start p-hacking. For many years, I recall people claiming, ‘it’s so hot this year’ or ‘it’s such a mild winter this year.’ I would occasionally reply with the claim that here in New England, for as long as I can remember, there are about 2 weeks’ worth of really, really cold temperatures every winter and roughly 2 weeks of uncomfortably hot temperatures every summer.

I decided to make this my null hypothesis, download the historical temperature data, and run a t-test. That made sense because it’s easy to do. Technically, I felt pretty justified assuming my 2 weeks could mean anything from 10 to 16 days (2 business weeks, or 2 calendar weeks plus a weekend). But what constitutes a really cold or hot day? I’ll tell you: 7.5 deg F cold and 91.5 deg F hot. How do I know? Because at those temperatures, I failed to reject the null hypothesis.
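The post doesn’t show the test itself; one plausible sketch is a one-sample t-test of the yearly day counts against a claimed mean of 14, via `scipy.stats.ttest_1samp`. The counts below are synthetic, constructed so their mean is exactly 14; the real analysis would tally them from the cleaned NOAA data.

```python
import numpy as np
from scipy import stats

# Invented per-year counts of days at or below a candidate cold
# threshold; 60 "years" constructed so the mean is exactly 14.
counts = np.tile([12, 13, 14, 15, 16], 12)

# H0: the average number of very cold days per year is 14 (two weeks).
result = stats.ttest_1samp(counts, popmean=14)
print(result.pvalue > 0.05)  # True -- fail to reject H0
```

Sliding the temperature threshold changes the counts, which changes the p-value; that knob is exactly where the 7.5 and 91.5 came from.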

Those temperatures made sense but lacked a simple explanation. They weren’t memorable. They don’t make a good story, so I decided to adjust my 2-week claim. I’m sure everyone can agree that single-digit temperatures are cold, and 10 (as in less than or equal to 10 deg F) can fly under the radar of the single-digit claim. For too hot, I’m sure all agree that temps in the 90s fit that definition.

What’s happening is that the temperatures themselves are unquestionably objective. They are numerical facts that lend credibility to the story. However, the story itself is dependent on a subjective category that I constructed—what is ‘too hot’ and ‘too cold.’ The implications of this setup can be seen in my post titled, Categorized as Overweight, where mass and height serve as objective measurements to calculate BMI, but then those BMI values are categorized into arbitrary labels. This can be compared to another article exploring a Binary Classification Machine Learning Exercise, where body measurements are used to predict if a subject is male or female. In that case, the prediction is made by an algorithm, and it can be argued that neither the measurements nor the classifications are subjective.

Having set the categories of too hot and too cold, I slowly added one day at a time. I felt the symmetry aspect of the story was key, so note that there are two null hypothesis statements, one for the number of cold days and one for hot. I found that at 21 days, both nulls failed to be rejected. At first glance, I saw this as 3 weeks, but science fiction got the better of me, and I recognized 42. Grab your towels.
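That day-count sweep can be made explicit in code. The counts here are synthetic, constructed with a mean of exactly 21 so the toy result lands on the post’s number; the loop is my sketch of the procedure, not the author’s script.

```python
import numpy as np
from scipy import stats

# Synthetic per-year counts of days <= 10 deg F and days >= 90 deg F,
# constructed so each has a mean of exactly 21 across 60 "years".
cold_counts = np.tile([19, 20, 21, 22, 23], 12)
hot_counts = np.tile([18, 20, 21, 22, 24], 12)

# Sweep the claimed number of days and keep every claim for which BOTH
# null hypotheses survive (p > 0.05) -- the p-hacking step, made explicit.
survivors = [
    claim
    for claim in range(10, 31)
    if stats.ttest_1samp(cold_counts, popmean=claim).pvalue > 0.05
    and stats.ttest_1samp(hot_counts, popmean=claim).pvalue > 0.05
]
print(survivors)  # [21]
```

With real counts, several neighboring claims would typically survive at once, and picking the most quotable one is the narrative choice the post is poking fun at.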

Glenn DiCostanzo
March 2026

Please take a look around the website. Reach out here or send a note on LinkedIn if you would like to discuss a potential project.