Metrics like Probability of Profit (POP) and Probability of Max Profit (PMP) serve as vital tools for strategy formulation and risk management. However, a pertinent question often arises amongst traders: Does the POP/PMP maintain its predictive power throughout the trade's lifespan?
This issue is colloquially referred to as the "moment in time" problem. That is, are probabilities valid only at the moment they’re observed, becoming invalid as soon as time passes? Or do probabilities remain valid no matter when they’re calculated, assuming we have a robust probability model?
Leveraging a dataset of over 1 million trades, this article aims to assess the temporal stability of probability calculations empirically. We will focus particularly on options with 30 days to expiration (DTE) and employ Brier scores as a statistical measure to evaluate prediction accuracy. The objective is to ascertain whether the calculated probability at trade inception remains a reliable indicator across various time intervals.
The implications of this investigation are significant. If POP/PMP maintains its accuracy over time, that not only validates the metric at trade entry but also supports a data-backed framework for optimizing trade exit strategies in future studies.
Accuracy of Option Alpha Probabilities
First, let’s examine whether OA probabilities of max profit are accurate at all. Let’s look at the “tradable” probability range of 50-80% PMP at trade entry. This range was chosen because it represents the bulk of Trade Ideas output and the region that traders are most likely to select from when making a trading decision.
In Figure 1, we see three distinct probabilities of max profit zones:
- 50-60%
- 60-70%
- 70-80%
In each zone, for each day to expiration, we found the actual win percentage of all trade samples. The rationale is that if there are enough samples, and they are spread somewhat evenly throughout the zone, then the average win rate should fall right in the middle of the zone. We can then analyze how far the real results fall above or below the dashed midline.
This chart represents approximately 1.3M trades. Approximately 296k trades fall in the 50-60% zone, 282k in the 60-70% zone, and 210k in the 70-80% zone. So the trades are distributed fairly evenly across zones, although the sample density is lower in the 70-80% zone.
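The zone comparison described above can be sketched in a few lines. This is only an illustration, not Option Alpha's actual pipeline: the record layout (DTE, PMP at entry, a 0/1 flag for finishing at max profit), the function name `zone_win_rates`, and the sample trades are all hypothetical.

```python
from collections import defaultdict

# Zones as (low, high) PMP bounds; the dashed midline is the zone's midpoint.
ZONES = [(0.50, 0.60), (0.60, 0.70), (0.70, 0.80)]

def zone_win_rates(trades):
    """Return {(zone, dte): (win_rate, midline, n)} for hypothetical trade records.

    Each trade is (dte, pmp_at_entry, hit_max_profit), with hit_max_profit 0 or 1.
    """
    wins = defaultdict(int)
    counts = defaultdict(int)
    for dte, pmp, hit in trades:
        for low, high in ZONES:
            if low <= pmp < high:
                counts[((low, high), dte)] += 1
                wins[((low, high), dte)] += hit
                break
    result = {}
    for key, n in counts.items():
        (low, high), _ = key
        result[key] = (wins[key] / n, (low + high) / 2, n)
    return result

# Made-up sample: four trades in the 60-70% zone at 30 DTE, three of them winners.
sample = [(30, 0.62, 1), (30, 0.66, 1), (30, 0.69, 0), (30, 0.64, 1)]
for (zone, dte), (rate, mid, n) in zone_win_rates(sample).items():
    print(f"zone={zone} dte={dte} win_rate={rate:.2f} midline={mid:.2f} n={n}")
```

A win rate above the midline would suggest the probabilities in that zone are understated, and a win rate below it would suggest they are overstated.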
It is interesting to note that short-dated probabilities seem to be significantly less accurate than medium- to long-term probabilities, relative to the time frame presented here.
To the naked eye, there does appear to be some overstatement of probability by a few percentage points in several areas. However, we can’t forget that this snapshot is only representative of a very short time frame, and what we think we see in raw charts can be deceptive. We need a more robust way to measure if these are “good” probabilities or not – math to the rescue!
Receiver Operating Characteristic
To be certain, we need something better than “naked eye” analysis to tell us whether or not PMP is a usable model for probability. The Receiver Operating Characteristic Area Under Curve (ROC AUC) is a performance metric used to evaluate the quality of a binary classification model. It quantifies the model's ability to discriminate between positive and negative classes, with a value of 1 indicating perfect discrimination and a value of 0.5 representing a model no better than random guessing.
In our case, we’re distinguishing between the probability model’s ability to identify trades that will end up in the maximum profit zone and those that won’t, classified as 1 and 0, respectively.
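For intuition, ROC AUC can be read as the chance that a randomly chosen winning trade was given a higher probability than a randomly chosen losing trade, with ties counting as half. The sketch below computes it that way on made-up data. The function name and sample values are illustrative assumptions, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """ROC AUC as the chance a random winner outscores a random loser (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative outcome")
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

# Made-up outcomes (1 = finished at max profit) and PMP values at entry.
outcomes = [1, 1, 0, 1, 0, 0]
pmp = [0.80, 0.70, 0.65, 0.60, 0.55, 0.75]
print(f"{roc_auc(outcomes, pmp):.3f}")
```

In this sample, 6 of the 9 winner/loser pairs are ranked correctly, so the score is 6/9. A model that assigned every trade the same probability would score 0.5.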
In Figure 2, we see the ROC AUC of both of Option Alpha’s probability metrics, which use Black-Scholes with a 30-day Historical Volatility (HV) input, compared against a Probability of Max Profit (options expiring out-of-the-money) derived from the Greek value delta captured from the market. Delta can be used as a proxy for the probability of expiring in-the-money, so its complement can be used as a proxy for expiring out-of-the-money.
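To make the two inputs concrete, here is a minimal sketch of both. The article does not spell out Option Alpha's exact formula, so the first function is an assumption: it uses the textbook Black-Scholes term N(d2), the risk-neutral probability that the underlying finishes above a strike, with volatility supplied by the caller (for example, a 30-day HV). The calendar-day year of 365, the zero default rate, and the function names are also assumptions.

```python
from math import log, sqrt
from statistics import NormalDist

def prob_above_strike(spot, strike, vol, days, rate=0.0):
    """Black-Scholes N(d2): risk-neutral chance the underlying expires above the strike.

    vol is annualized (e.g. 0.20 for 20%); days is calendar days to expiration.
    """
    t = days / 365.0  # assumption: calendar-day year
    d2 = (log(spot / strike) + (rate - 0.5 * vol ** 2) * t) / (vol * sqrt(t))
    return NormalDist().cdf(d2)

def pmp_from_delta(delta):
    """Delta proxy: 1 - |delta| approximates the chance a short option expires OTM."""
    return 1.0 - abs(delta)

# Hypothetical short put at a 95 strike with the underlying at 100, 20% HV, 30 DTE.
print(f"model: {prob_above_strike(100.0, 95.0, 0.20, 30):.3f}")
# Hypothetical market delta of -0.30 on the same short put.
print(f"delta proxy: {pmp_from_delta(-0.30):.2f}")
```

For a short call, the analogous model value would be one minus `prob_above_strike` at the call's strike.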
An ROC AUC score of 0.7 is good, sitting well above the 0.5 of random guessing. It means our model has real predictive power and that, after 1 million trades, it discriminates winners from losers at least as well as delta.
ROC AUC is a valuable metric for assessing the discriminative power of the model, but it does not provide information about the calibration or overall accuracy of the probability estimates. To do that, we will need something called a Brier Score.
What is a Brier Score?
The Brier score is a metric for evaluating the accuracy of probabilistic forecasts and originated in the field of meteorology. Introduced by Glenn W. Brier in 1950, the score quantifies the difference between predicted probabilities and actual outcomes in a set of events. It's based on the squared difference between the actual outcome and the predicted probability for each observation.
The score ranges from 0 to 1, with lower values indicating more accurate forecasts. Specifically, a Brier score is calculated as the mean squared difference between the predicted probabilities and the actual outcomes (for binary events, coded as 0 or 1). It is a proper scoring rule, meaning it rewards honest, calibrated probability forecasts.
Brier scoring is mostly used for binary events, such as "rain" or "no rain." For example, if two models correctly predict sunny weather, one with a probability of 0.51 and the other with 0.93, the second forecast earns the better (lower) score: (1 - 0.93)^2 = 0.0049 versus (1 - 0.51)^2 = 0.2401.
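Written out, the score over N forecasts is BS = (1/N) * sum((p_i - o_i)^2), where p_i is the predicted probability and o_i is the outcome (1 if the event happened, 0 if not). A minimal sketch, with a function name chosen here for illustration:

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# The sunny-day example: "sunny" happened, so the outcome is 1 for both forecasts.
print(f"{brier_score([0.51], [1]):.4f}")  # 0.2401
print(f"{brier_score([0.93], [1]):.4f}")  # 0.0049
```

Confidence cuts both ways: had the day turned out rainy, the 0.93 forecast would have scored far worse than the 0.51 forecast.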
Moment in Time Methodology
We want to know if the probability calculations made on trade entry are accurate. But more than that, we want to know if the recalculation of those probabilities at any point between entry and expiration also holds.
First, we will calculate the Brier score based on all trade entry opportunities at 30 days to expiration or less (approximately 905k trades). We select 30 DTE somewhat arbitrarily. There are sufficient trades in the database at that entry time frame that will produce enough samples over the trade lifetime. Plus, it is a nice round number. Realistically, the days to expiration chosen do not matter for this study.
Next, we will select only the Trade Ideas opportunities with exactly 30 days to expiration, approximately 30k trades. For each of these trades, we captured quotes at most once every 15 minutes, when available, and produced updated probability metrics based on the market conditions at that moment in time (spot price, volatility, days to expiration, etc.).
From those quotes, we calculated 11.2 million “moment in time” probabilities and determined each trade’s profitability at expiration. That’s approximately 373 quotes per trade over a 30 DTE lifetime. We are going to use the Brier score of those 30k trades at entry and compare it to the Brier score of those 11M+ post-entry quotes.
To put it simply, we are comparing the probability accuracy of new trades made at 15 DTE, for example, to quoted probabilities of trades made at 30 DTE with 15 DTE remaining, new trades at 14 DTE to trades with 14 DTE remaining, etc. If the probabilities hold, we will see a relationship between the Brier scores of both the trades at entry and the millions of historical quotes.
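The comparison can be sketched as two Brier scores per DTE: one from trades entered at that DTE, and one from quotes on 30 DTE trades taken when that many days remained. The record layout (DTE, probability, 0/1 outcome at expiration), the function name, and the sample values below are hypothetical.

```python
from collections import defaultdict

def brier_by_dte(records):
    """Group (dte, probability, outcome) records by DTE and return {dte: Brier score}."""
    squared_error = defaultdict(float)
    counts = defaultdict(int)
    for dte, prob, outcome in records:
        squared_error[dte] += (prob - outcome) ** 2
        counts[dte] += 1
    return {dte: squared_error[dte] / counts[dte] for dte in counts}

# Made-up data: new trades entered at 15 and 14 DTE ...
entries = [(15, 0.70, 1), (15, 0.60, 0), (14, 0.65, 1)]
# ... and quotes on 30 DTE trades taken with 15 and 14 days remaining.
quotes = [(15, 0.80, 1), (15, 0.40, 0), (14, 0.75, 1), (14, 0.30, 0)]

entry_scores = brier_by_dte(entries)
quote_scores = brier_by_dte(quotes)
for dte in sorted(entry_scores.keys() & quote_scores.keys(), reverse=True):
    print(f"{dte} DTE: entry={entry_scores[dte]:.4f} quoted={quote_scores[dte]:.4f}")
```

If the probabilities hold over the trade lifetime, the two series should track each other across DTE values.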
We’re once again using our typical data set: 3:55 PM ET Trade Ideas captured since mid-July 2023, 3 strategy types (short put spread, short call spread, and iron condors), 155 underlying symbols, and most days to expiration fall within 1 - 45 DTE. We skipped trades that crossed earnings or ex-dividend dates and assumed cash settlement to record partial profits and losses at expiration.
Data Analysis & Results
Let’s zoom in on 30 days to expiration to understand how well Trade Ideas is modeling PMP at that time frame.
Calibration of Trade Ideas PMP, 30 DTE
Similar to our other studies on EV and Alpha, we want to see how closely PMP is being predicted in a “perfect world.” So let’s compare the predicted probability of max profit to the percentage of trades that ended in max profit territory at expiration. We can do this by grouping trades into bins or buckets at 10% intervals. So, for example, if we average all of the PMP values between 50-60%, we’d expect the mean PMP to be around 55%. We’d also expect those trades to end up in max profit around 55% of the time.
In Figure 3, we can see that 30 DTE is fairly well calibrated but has some trouble in the 80%+ range. We can also see this starting to happen in the 70-80% range of our Zones chart above (Figure 1). The slight dip below the perfect calibration line means PMP is being overstated in that range, but not by enough to cause serious concern.
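The binning step can be sketched as follows. The function name, the tuple layout of each row, and the sample data are illustrative assumptions. Each row reports the mean predicted PMP in a bin next to the share of trades in that bin that actually finished at max profit.

```python
def calibration_table(probabilities, outcomes, n_bins=10):
    """Return [(bin_low, bin_high, mean_predicted, observed_rate, count)] for non-empty bins."""
    sums = [0.0] * n_bins
    wins = [0] * n_bins
    counts = [0] * n_bins
    for p, o in zip(probabilities, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        sums[idx] += p
        wins[idx] += o
        counts[idx] += 1
    rows = []
    for i in range(n_bins):
        if counts[i]:
            rows.append((i / n_bins, (i + 1) / n_bins,
                         sums[i] / counts[i], wins[i] / counts[i], counts[i]))
    return rows

# Made-up PMP values and outcomes (1 = finished at max profit).
pmp = [0.52, 0.55, 0.58, 0.57, 0.72, 0.78]
hit = [1, 0, 1, 0, 1, 1]
for low, high, predicted, observed, n in calibration_table(pmp, hit):
    print(f"{low:.1f}-{high:.1f}: predicted={predicted:.3f} observed={observed:.3f} n={n}")
```

On a perfectly calibrated model with enough samples, the predicted and observed columns would match. An observed rate below the predicted mean indicates overstated probabilities in that bin.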
Comparing Brier Scores: Entry vs. Trade Lifetime
In Figure 4, we plot the Brier scores of all trades entered with 30 DTE or less and compare them to the Brier scores from probabilities quoted approximately every 15 minutes for all trades opened at exactly 30 DTE. The impact and importance of this simple and unassuming line graph cannot be overstated.
Based on the dataset provided and the results in Figure 4, the following observations can be made:
1. Stable. The Brier scores for both at-entry probabilities and quoted probabilities exhibit a mostly stable temporal trend over the time-to-expiration (measured in days). There aren't significant spikes or drops, indicating that the predictive quality remains fairly consistent across different time frames.
2. Well-calibrated. The at-entry scores are generally higher (that is, worse) than the quoted scores across most days to expiration. This suggests that the quoted probabilities are generally better calibrated, meaning they provide more accurate probability estimates than the at-entry probabilities. This could also be due to scale, i.e., the law of large numbers. There are 350x more data points compared to the at-entry data.
3. Consistent. Both at-entry and quoted scores appear to be fairly consistent over time, which suggests that the conditions being modeled are stable over the range of days examined. However, there are some fluctuations, most notably at around day 5 for at-entry, where it drops to 0.1693. But we also know from our other research (and Figure 1) that there are very few data points available in specific DTE ranges simply because of how the trading calendar works with weekends.
4. Reliable. Both Brier lines are greater than zero but less than 0.25, the score earned by always forecasting 50%. This indicates that the forecasts are more informative than a coin flip. Yet, they are not perfect (a Brier score of 0 would imply a perfect forecast).
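The 0.25 reference point in the last observation is the Brier score of a forecaster who always says 50%: the squared error is (0.5 - 1)^2 = (0.5 - 0)^2 = 0.25 whatever the outcome. A small check on made-up outcomes, with hypothetical names and values throughout:

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

outcomes = [1, 0, 1, 1, 0, 1, 1, 0]  # made-up results, 1 = max profit

# Always forecasting 50% scores 0.25 regardless of what happens.
coin_flip = brier_score([0.5] * len(outcomes), outcomes)
print(coin_flip)  # 0.25

# A forecast that leans the right way on each trade scores lower (better).
informed = brier_score([0.7, 0.3, 0.7, 0.7, 0.3, 0.7, 0.7, 0.3], outcomes)
print(f"{informed:.2f}")  # 0.09
```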
The difficult thing for most traders to understand is that just because the probability of profit has changed, even significantly, doesn’t mean the original probability no longer holds. A trade opened N days ago with a 60% probability of max profit, for example, if made thousands of times, will still expire in profit around 60% of the time regardless of what happens in the interim.
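A toy simulation can illustrate this point. It is not a model of real option prices. It assumes a driftless Gaussian random walk, a probability model that is exactly correct, and a "win" defined as the walk ending above a fixed threshold. All names and parameters are hypothetical. Every simulated trade is entered at a 60% probability and re-quoted halfway through.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate(n_trades=20000, days=30, check_day=15, entry_prob=0.60, seed=7):
    """Toy model: a driftless Gaussian walk 'wins' if it ends above a fixed threshold."""
    rng = random.Random(seed)
    norm = NormalDist()
    remaining = days - check_day
    # Pick the threshold so that P(final > threshold) equals entry_prob at entry.
    threshold = sqrt(days) * norm.inv_cdf(1.0 - entry_prob)
    wins = 0
    high_count = 0
    high_wins = 0
    for _ in range(n_trades):
        # A sum of independent daily Gaussian steps is itself Gaussian,
        # so each leg of the walk is drawn in one call.
        mid = rng.gauss(0.0, sqrt(check_day))
        final = mid + rng.gauss(0.0, sqrt(remaining))
        won = 1 if final > threshold else 0
        wins += won
        # Probability re-quoted at check_day from the walk's current position.
        quoted = 1.0 - norm.cdf((threshold - mid) / sqrt(remaining))
        if quoted >= 0.80:
            high_count += 1
            high_wins += won
    return wins / n_trades, high_wins / high_count, high_count

overall, high_rate, high_count = simulate()
print(f"all trades entered at 60%: win rate {overall:.3f}")
print(f"trades re-quoted at 80%+ on day {15}: win rate {high_rate:.3f} (n={high_count})")
```

Under these assumptions, the full set of trades should still win close to 60% of the time, while the subset re-quoted at 80% or higher should win at least 80% of the time. Both probabilities are valid at once: one describes all trades from entry, the other describes the trades that reached a given state.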
Conclusion
This study empirically validates the temporal stability of Probabilities of Profit in option trading. Both at-entry and real-time probabilities exhibit consistent Brier scores, mitigating the "moment in time" concern. The data suggests that real-time probabilities are generally better calibrated. These findings endorse the use of POP/PMP not just at trade inception but also for ongoing strategy optimization.
The results have two key implications. First, they provide empirical validation for relying on POP/PMP metrics for both entry and ongoing risk management in options trading. Second, the established temporal stability lays the groundwork for future studies aimed at optimizing trade exit strategies based on these probability models.
It stands to reason that if probabilities remain accurate throughout the trade lifetime, then EV remains accurate as well. Areas of future research include verifying that this finding holds for both EV & Alpha and developing actionable guidelines for applying it to trade exits.