
Automated Backtester Research Plan (Part 9)

With digressions on position sizing for spreads and deceptive butterfly trading plans complete, I will now resume with the automated backtester research plan.

We can study [iron, perhaps, for better execution] butterfly trades entered daily from 10-90 days to expiration (DTE). We can center the trade 0% (ATM) to 5% OTM (bullish or bearish) by increments of 1% [perhaps using caution to stick to the most liquid (10- or 25-point) strikes especially when open interest is low*]. We can vary wing width from 1-5% of the underlying price by increments of 1%. We can vary contract size to keep notional risk as consistent as possible (given granularity constraints of the most liquid strikes).

An alternative approach to wing selection would be to buy longs at particular delta values (e.g. 2-4 potential delta values for each such as 16-delta put and 25-delta call). This could be especially useful to backtest asymmetrical structures, which are a combination of symmetrical butterflies and vertical spreads (as mentioned in the second-to-last paragraph here).
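
To get a sense of the scale of this parameter sweep, here is a minimal sketch of how the backtester might enumerate the permutations; backtest_butterfly() is a hypothetical hook into the automated backtester, and the 25-point rounding follows the liquid-strike caution above:

    # Minimal sketch of the butterfly parameter sweep described above.
    # backtest_butterfly() is a hypothetical hook into the automated backtester.
    from itertools import product

    def nearest_liquid_strike(price, increment=25):
        """Round to the nearest liquid (e.g. 25-point) strike."""
        return round(price / increment) * increment

    entry_dte  = range(10, 91)                   # 10-90 DTE, entered daily
    center_otm = [i / 100 for i in range(0, 6)]  # 0% (ATM) to 5% OTM by 1%
    wing_width = [i / 100 for i in range(1, 6)]  # 1-5% of underlying by 1%

    for dte, center, wing in product(entry_dte, center_otm, wing_width):
        pass  # backtest_butterfly(dte, center, wing, nearest_liquid_strike)

    print(len(entry_dte) * len(center_otm) * len(wing_width))  # 2,430 combinations

Even before varying exits, the grid runs to thousands of combinations per direction, which is why sample size and runtime both matter here.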

With trades held to expiration, I’d like to track and plot maximum adverse (favorable) excursion for the winners (losers) along with final PnL and total number of trades to determine whether a logical stop-loss (profit target) may exist. We can also analyze differences between holding to expiration, managing winners at 5-25% profit by increments of 5%, or exiting at 1-3x profit target by increments of 0.25x. We can also study exiting at 7-28 (proportionally less on the upper end for short-term trades) DTE by increments of seven.
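
As a sketch of the excursion tracking (assuming pnl_path holds one trade's daily mark-to-market PnL; values hypothetical):

    # Maximum adverse (favorable) excursion from one trade's daily PnL path.
    def excursions(pnl_path):
        mae = min(0.0, min(pnl_path))  # most negative open PnL seen
        mfe = max(0.0, max(pnl_path))  # most positive open PnL seen
        return mae, mfe

    # Example: a winner that first dipped well below its final PnL.
    print(excursions([-120.0, -450.0, -80.0, 310.0, 275.0]))  # (-450.0, 310.0)

Plotting MAE for winners against final PnL shows how much heat profitable trades typically take, which is exactly the evidence needed to judge whether a logical stop-loss exists.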

As an alternative not previously mentioned, we can use DIT as an exit criterion. This could be 20-40 days by increments of five. Longer-dated trades have greater profit (and loss) potential than shorter-dated trades given a fixed DIT, though. To keep things proportional, we could instead backtest exiting at 20-80% of the original DTE by increments of 15%.

Trade statistics to track include winning percentage, average (avg.) win, avg. loss, largest loss, largest win, profit factor, avg. trade (avg. PnL), PnL per day, standard deviation of winning trades, standard deviation of losing trades, avg. days in trade (DIT), avg. DIT for winning trades, and avg. DIT for losing trades. Reg T margin should be calculated and will remain constant throughout the trade. Initial PMR should be calculated along with the maximum value of the subsequent/initial PMR ratio.

We can later consider relatively simple adjustment criteria. I may spend some time later brainstorming some ideas on this, but I am most interested at this point in seeing raw statistics for the butterfly trades.

I will continue next time.

* This would be a liquidity filter coded into the backtester. A separate study to see how open interest for different strikes varies across a range of DTE might also be useful.

Butterfly Skepticism (Part 2)

The presentations I see on butterfly strategies often get my skeptical juices flowing.

I will not accept edge that occurs at one particular DTE and not another because I think this is one of the great fallacies in all of option trading (see here and here). I find many butterfly strategies to be guilty of this.

From across the mountaintop, the plethora of butterfly trading plans looks like an attempt to place Band-Aids, in all combinations, over any potential risk including high/low underlying price, volatility, or days to expiration. I think any advanced trader would recognize this as the Holy Grail or a flat position. The former is a mirage that does not exist. The latter poses no risk with no opportunity for reward. Any statistically minded trader might recognize this as curve-fitting: trying different things until you find one that works perfectly. Despite that perfect match to past data, curve-fit systems are unlikely to work in the future. Also, by [statistical] chance alone, one in every 20 attempts is likely to produce a success [alpha = 0.05].

Studying butterflies was one of my early reasons for wanting an automated backtester. Of all the butterfly trading plans I have seen pitched, many were accompanied by no backtesting at all (optionScam.com). Others have limited backtesting, which I have often found vulnerable to criticism (e.g. transaction fees not included, insufficient sample sizes, limited subset of market environments studied, survivorship bias, no OOS validation). I want to be able to take a defined trading plan and backtest it comprehensively.

For different reasons, many butterfly trading plans cannot be backtested. Some are complex enough to fall prey to the curse of dimensionality (discussed here and here). Some trading plans emphasize multiple positions across the whole portfolio, placed over long periods of time, to properly hedge each other. Chewing up such a large time interval for a single instance allows only a small number of total instances, which cannot be divided into sufficient sample sizes for the different categories of market environments. Some trading plans are specific enough to be curve-fit (worthless). Other trading plans incorporate discretion. For all practical purposes, discretion can never be backtested. Any discretionary plan thereby becomes a story, which falls into the domain of sales, rapport building, advertising, and marketing (optionScam.com?). I alluded to this in the second-to-last paragraph here.

I have yet to trade butterflies with any consistency because I do not even know if they are better than naked puts, which I consider the most basic of option trades. At the very least, the automated backtester research plan should be able to address this.

Butterfly Skepticism (Part 1)

Before I continue the research plan, I want to express some skepticism I have toward butterfly trades based on what I have seen in recent years.

Butterflies have been a “secret sauce” of the options trading community for some time. They come in all different shapes and sizes with diverse trading plans incorporating varying degrees of discretion. I have seen many traders and vendors describe more or less elaborate adjustment plans that look, at a cursory glance, very impressive in their presentation.

Most of these adjustment plans are simply position overlays added later in the trade. Adding a management criterion, giving the trade a fancy name, and selling it as a new idea for $$$ seems alarmingly deceptive to me, especially in the absence of valid backtesting to support it.

Most of the complexity amounts to adding subsequent positions within specified time constraints. For example, adding a put credit spread (PCS) later in the trade to raise the upper expiration line may seem appealing. Consider these:


If they happen at all, either is going to be triggered much later than butterfly initiation since decay occurs over time. I strongly believe there is nothing magical about any particular DTE for trade inception or adjustment. Failing to explore any other time a trade could be placed or an adjustment made is short-sighted, as I have written about here, here, and here.

If I would consider doing the adjustment later, then I should backtest the adjustment earlier as well. With regard to the above example, I already plan to study shorter- and longer-dated PCS as part of the automated backtester research plan. Whether the trading plan works as a whole is simply a question of whether the sum of its parts (e.g. symmetrical butterfly plus shorter-dated PCS) is profitable.

I will continue next time.

Constant Position Sizing of Spreads Revisited (Part 2)

I’m doing a Part 2 because early this morning, I had another flash of confusion about the meaning of “homogeneous backtest.”

The confusion originated from my current trading approach. Despite my backtesting, I still trade with a fixed credit. If I used a fixed delta then 2x-5x initial credit (stop loss) would be larger at higher underlying prices. Gross drawdown as a percentage of the initial account value would consequently be higher. This means drawdown percentage could not be compared on an apples-to-apples basis across the entire backtesting interval.

Read the “with regard to backtesting” paragraph under the graph shown here. Constant position size (e.g. number of contracts or notional value?), apples-to-apples comparison of PnL changes (e.g. gross or percentage of initial/current account value?) throughout, and evaluating any drawdown (e.g. gross or as a percentage of initial/current account value?) as if it happened from Day 1 are all nebulous and potentially contradictory references (as described).

In this post, I argue:

     > Sticking with the conservative theme, I should also calculate
     > DD as a percentage of initial equity because this will give a
     > larger DD value and a smaller position size. For a backtest
     > from 2001-2015, 2008 was horrific but as a percentage of
     > total equity it might not look so bad if the system had
     > doubled initial equity up to that point.

If I trade fixed credit then I am less likely to incur drawdown altogether at higher underlying price, which makes for a heterogeneous backtest when looking at the entire sample of daily trades. If I trade fixed delta then see the last sentence of (above) paragraph #2.

I focused the discussion on position size in this 2016 post where I stressed constant number of contracts. Recent discussion has focused neither on fixed contracts nor on fixed credit.

“Things” seem to “get screwed up” (intentionally nebulous) if I attempt to normalize to allow for an apples-to-apples comparison of any drawdown as if it occurred from Day 1.

If I allow spread width [if backtesting a spread] to vary with underlying price and I sell a fixed delta—as discussed in Part 1—then a better solution may be to calculate gross drawdowns as a percentage of the highwater account value to date. I will leave this to simmer until my next blogging session for review.
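
A minimal sketch of that alternative, measuring each drawdown against the high-water account value to date rather than against initial equity (equity values hypothetical):

    # Gross drawdown as a percentage of the high-water mark versus initial equity.
    def drawdowns(equity_curve):
        initial = equity_curve[0]
        high = initial
        worst_vs_high = worst_vs_initial = 0.0
        for value in equity_curve:
            high = max(high, value)
            worst_vs_high = max(worst_vs_high, (high - value) / high)
            worst_vs_initial = max(worst_vs_initial, (high - value) / initial)
        return worst_vs_high, worst_vs_initial

    # Account doubles, then gives back 60 from a peak of 200:
    print(drawdowns([100.0, 150.0, 200.0, 140.0]))  # (0.3, 0.6)

The same 60-point giveback registers as 30% against the high-water mark but 60% against initial equity, which is the 2008-style distortion described in the quote above.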

I was going to end with one further point but I think this post has been sufficiently thick to leave it here. I will conclude with Part 3 of this blogging detour next year!

Constant Position Sizing of Spreads Revisited (Part 1)

In Part 7, I said constant position size is easy to do with vertical spreads by maintaining a fixed spread width. I now question whether a fixed spread width is sufficient to achieve my goal of a homogeneous backtest throughout.

I enter this deliberation with reason to believe it will be a real mess. I have addressed this point before without a successful resolution. This post provides additional background.

The most recent episode of my thinking on the matter began with the next part of the research plan on butterflies. I want to backtest ATM structures and perhaps one strike OTM/ITM, two strikes OTM/ITM, etc. Rather than number of strikes, which would not be consistent by percentage of underlying price, a better approach may be to specify % OTM/ITM.

I then started thinking about my previous backtesting, along with reports of backtests from others, suggesting spread width to be inversely proportional to ROI (%). It makes sense to think the wider the spread, the more moderate the losses: a 30-point (for example) spread is more likely to go fully ITM than a 50-point spread because the underlying has to move an additional 20 points in the latter case. This raises the question of whether an optimal spread width exists because while wider spreads will incur fewer max losses, they also carry proportionally higher margin requirements.

Also realize that a 30-point spread at a low underlying value is relatively wide compared to a 30-point spread at a high underlying price. I mentioned graphing this spread-width-to-underlying-price (SWUP) percentage in Part 7. We could look to maintain a constant SWUP percentage if granularity is sufficient; with the 10- and 25-point strikes most liquid, having to round to the nearest liquid strike could force SWUP percentage to vary significantly (especially at lower underlying prices).
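
A small sketch of the granularity problem (the 2% target and the strike increment are illustrative assumptions):

    # Target a constant spread-width-to-underlying-price (SWUP) percentage,
    # rounding to the nearest liquid strike increment.
    def spread_width(underlying, target_swup=0.02, increment=25):
        raw = underlying * target_swup
        width = max(increment, round(raw / increment) * increment)
        return width, width / underlying  # realized width and realized SWUP

    for price in (800, 1500, 2800):
        print(price, spread_width(price))

At 800 the nearest feasible width is 25 points (3.1% SWUP), while at 2800 it is 50 points (1.8% SWUP): the lower the underlying, the more the rounding forces realized SWUP away from target.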

All of this is to suggest that spread width should be left to fluctuate with underlying price, which contradicts what I said about fixed spread width and constant capital. We can attempt to normalize total capital by varying the number of contracts as discussed earlier with regard to naked puts. From the archives, similar considerations about normalizing capital and granularity were discussed here and here.

Aside from notional value, I think the other essential factor to hold constant for a homogeneous backtest is moneyness. As mentioned above, spreads should probably not be sold X strikes OTM/ITM. We should look to sell spreads at fixed delta values (e.g. “short strike nearest to Y delta”) since delta takes into account days to expiration, implied volatility, and underlying price.

An interesting empirical question is how well “long strike nearest to Z delta” does to maintain a constant SWUP percentage.

Automated Backtester Research Plan (Part 8)

After studying put credit spreads (PCS) as daily trades, the next step is to study them as non-overlapping trades.

As discussed earlier, I would like to tabulate several statistics for the serial approach. These include number (and temporal distribution) of trades, winning percentage, compound annualized growth rate (CAGR), maximum drawdown, average days in trade, PnL per day, risk-adjusted return (RAR), and profit factor (PF). Equity curves will represent just one potential sequence of trades and some consideration could be given to Monte Carlo simulation. We can plot equity curves for different account allocations such as 10% to 70% initial account value by increments of 5% or 10% for a $50M account. A 30% allocation (for example) would then be $15M per trade. By holding spread width constant, drawdowns throughout the backtesting interval may be considered normalized.
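
As a sketch of the Monte Carlo idea (allocation sized off initial account value per the text; the trade returns are hypothetical):

    # Resample historical trade returns (with replacement) to generate
    # alternative equity curves beyond the single historical sequence.
    import random

    def monte_carlo(trade_returns, allocation=0.30, start=50_000_000, n_paths=1000):
        per_trade_capital = start * allocation  # e.g. 30% of initial value = $15M
        finals = []
        for _ in range(n_paths):
            equity = start
            for r in random.choices(trade_returns, k=len(trade_returns)):
                equity += per_trade_capital * r  # r = return on allocated capital
            finals.append(equity)
        return finals

    results = monte_carlo([0.04, 0.03, -0.08, 0.05, 0.02, -0.12, 0.04])
    print(min(results), max(results))  # spread of terminal equity across paths

Each path is one possible reshuffling of the same trades, so the dispersion of terminal equity (and of path-wise maximum drawdown, easily added) puts error bars around the single historical equity curve.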

As an example of the serial approach, I would like to backtest “The Bull” with the following guidelines:


I will not detail a research plan for call credit spreads. If we see encouraging results from looking at naked calls then this can be done as described for PCS.

I also am not interested in backtesting rolling adjustments for spreads due to potential execution difficulty.

Thus far, the automated backtester research plan has two major components: study of daily trades to maximize sample size and study of non-overlapping trades. I alluded to a third possibility when discussing filters and the concentration criterion: multiple open trades with no more than one opened per day.

This is suggestive of traditional backtesting I have seen over the years where trades are opened at a specified DTE. For trades lasting longer than 28 (or 35 every three months) days, overlapping positions will result. As discussed here, I am not a proponent of this approach. Nevertheless, for completeness I think it would be interesting to do this analysis from 30-64 DTE and compare results between groups, which I hypothesize would be similar. To avoid future leaks, position sizing should be done assuming two overlapping trades at all times. ROI should also be calculated based on double the capital.

Another aspect of traditional backtesting I have eschewed in this trading plan is the use of standard deviation (SD) units. I have discussed backtesting many trades from (for example) 0.10-0.40 delta by units of 0.10. More commonly used are 1 SD (0.16 delta corresponding to 68%), 1.5 SD (0.07 delta corresponding to 86.6%), and 2 SD (0.025 delta corresponding to 95%). Although not necessary, we could run additional backtests based on these unit measures for completeness.
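
A quick sketch reproducing those SD-to-delta conversions from the standard normal CDF:

    # Standard-deviation units -> approximate short-strike delta.
    from math import erf, sqrt

    def norm_cdf(x):
        return 0.5 * (1 + erf(x / sqrt(2)))

    for sd in (1.0, 1.5, 2.0):
        delta = 1 - norm_cdf(sd)        # one-tailed probability beyond the strike
        inside = 2 * norm_cdf(sd) - 1   # two-tailed probability within the band
        print(f"{sd} SD: ~{delta:.3f} delta, {inside:.1%} inside")
        # 1.0 SD: ~0.159 delta, 68.3% inside
        # 1.5 SD: ~0.067 delta, 86.6% inside
        # 2.0 SD: ~0.023 delta, 95.4% inside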

Automated Backtester Research Plan (Part 7)

Once done with straddles and strangles, put credit spreads (PCS) are next in the automated backtester research plan.

The methodology is much the same for PCS as for naked puts, which I detailed here and here.

We can first study PCS placed every trading day to maximize sample size. Trades can be entered between 30-64 days to expiration (DTE). The short leg can be placed at the first strike under -0.10 to -0.50 delta by increments of -0.10. We can hold to expiration, manage winners at 25% (ATM options only?) or 50%, or exit at 7-21 DTE by increments of seven. We can also exit at 20-80% of the original DTE by increments of 15%. We can manage losers at 2x, 3x, 4x, and 5x initial credit. I’d like to track and plot maximum adverse (favorable) excursion (no management) for the winners (losers) along with final PnL and total number of trades. I want to monitor winning percentage, average win, average loss, largest loss, profit factor, average trade (average PnL), PnL per day, standard deviation of winning trades, standard deviation of losing trades, average days in trade (DIT), average DIT for winning trades, and average DIT for losing trades.

As always, I think maintenance of a constant position size is important. This is easier to do with vertical spreads because the width of the spread—to be held constant for each backtest—defines the risk. We can vary the width between 10-50 points by increments of 10 or 25-100 points by increments of 25 depending on underlying.

My gut says that we do not want long legs acting as unreactive units (standard options) at lower (higher) prices of the underlying. Rather than an apples-to-apples backtest throughout, this would actually be two different backtests with the long leg serving only as margin control at lower underlying prices and as an actual hedge otherwise. Unreactive units may result when the spread width is too large as a percentage of the underlying price: this percentage should be graphed over time. An alternative way of analyzing this is hedge ratio, which can also be graphed over time. Hedge ratio equals decay rate (theta divided by mark) of the short option divided by decay rate of the long. A hedge ratio less than 0.80 is suggestive of long option decay that is too rapid for the short. This may leave the short option unprotected.
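
A minimal sketch of the hedge ratio calculation (option values hypothetical):

    # Hedge ratio = decay rate (theta / mark) of the short option divided by
    # decay rate of the long option; below 0.80 the long decays too fast.
    def hedge_ratio(short_theta, short_mark, long_theta, long_mark):
        short_decay = abs(short_theta) / short_mark
        long_decay = abs(long_theta) / long_mark
        return short_decay / long_decay

    hr = hedge_ratio(short_theta=-0.95, short_mark=14.0,
                     long_theta=-0.55, long_mark=6.0)
    print(f"{hr:.2f}")  # 0.74 -> long may be leaving the short unprotected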

The importance of this last paragraph is subject to debate. I alluded to the subject earlier where I cursorily addressed the feasibility of naked call backtesting altogether.

Shorter-dated trades, which have not been discussed thus far in the research plan, may also be studied. I would be interested in studying trades placed at 4-30 DTE with all the permutations given above. We can also use weekly options [when available] subject to a liquidity filter. This filter can check for a minimum open interest requirement or a bid-ask spread below a specified percentage of the mark, as sketched below.
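
A sketch of such a filter (the threshold values are placeholders, not recommendations):

    # Liquidity filter: minimum open interest and maximum bid-ask spread
    # as a percentage of the mark.
    def passes_liquidity(open_interest, bid, ask, min_oi=100, max_spread_pct=0.10):
        mark = (bid + ask) / 2
        if mark <= 0:
            return False
        return open_interest >= min_oi and (ask - bid) / mark <= max_spread_pct

    print(passes_liquidity(open_interest=250, bid=1.00, ask=1.10))  # True
    print(passes_liquidity(open_interest=15, bid=0.40, ask=0.80))   # False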

General filters can also be studied as discussed in Part 2 (linked in paragraph #2 above).

I will continue next time.

Automated Backtester Research Plan (Part 6)

In the last three posts, I detailed portfolio margin (PM) considerations with the automated backtester. After backtesting naked puts and naked calls separately, the next thing I want to do is add a naked call overlay to the naked puts.

This is not the previously-mentioned ATM call adjustment but rather study of 0.10- to 0.40-delta strangles (by increments of 0.10). Strangles can be left to expiration, managed at 50%, or closed for loss at 2-5x initial credit. I want to track total number of trades, winning percentage, average PnL, average loss, largest loss, standard deviation of returns, days in trade, PnL per day, PF, RAR, and maximum adverse excursion. Strangles should be normalized for notional risk and, with implementation of PM logic, notional risk can be replaced by theoretical loss from walking the chain up 15% (or up 12% plus a 10% vega adjustment). With this done, return on capital can then be calculated as return on PM (perhaps calculated as return on the largest PM ever seen in the trade since PMR varies from one day to the next). The maximum subsequent/initial PMR ratio should be tracked. We can also study the effect of time stops.

If deemed useful then maximum favorable excursion (MFE) can also be studied for unmanaged trades. This could be studied and plotted in histogram format before looking at ranges of management levels (not mentioned in previous paragraph). With MFE and MAE, some thought may need to be given about whether to analyze in terms of dollars or percentages. If notional risk is somehow kept constant, though, then either may be acceptable.

Incorporating naked calls with filters can also be studied. Naked calls may or may not be part of the overall position at any given time. I am interested to study MAE by looking at underlying price change over different future periods given specified filter criteria. Any stable edges we identify could be higher-probability entries for a naked call overlay. I approach this with some skepticism since it does imply market timing. As discussed in Part 3, this type of analysis lends itself more to spreadsheet research than to the automated backtester, which would run simulated trades. We would primarily be studying histograms and running statistical analysis on distributions.

Backtesting of undefined risk strategies will conclude with naked straddles. Like strangles, straddles can be left to expiration, managed at 10-50% by increments of 5%, or closed for loss at 2-5x initial credit. I would want to monitor total number of trades, winning percentage, average PnL, average loss, largest loss, standard deviation of returns, days in trade, PnL per day, PF, RAR, and MAE (MFE?). The same comments given above for strangles regarding PM logic, return on PM, and PM ratios also apply here. We can also study the effect of time stops (managing early from 7-21 DTE by increments of seven).

As discussed with naked puts and calls, I would like to study rolling as a trade management tool. We can reposition strangles in the same (subject to a minimum DTE, perhaps) or following month back to the original delta values when a short strike gets tested or when down 2x-5x initial credit. We can do the same for straddles when an expiration breakeven (easily calculated) is tested as well as rolling just the profitable side to ATM.

Aside from studying straddles and strangles as daily trades, serial [non-overlapping] backtesting can be done in order to generate equity curves and study relevant system metrics as discussed previously with regard to naked puts and naked calls.

Portfolio Margin Considerations with the Automated Backtester (Part 3)

Today I continue discussion of portfolio margin (PM) [requirement (PMR)] and the automated backtester.

Please recall that I have described two research approaches. The first analyzes trades opened daily to collect statistics on the largest sample size possible. The second approach studies serial backtesting of non-overlapping trades to generate an account equity curve and to study things like maximum drawdown and risk-adjusted return. The latter lends itself to one sequence of trades out of an infinite number of potential permutations, which is suggestive of a Monte Carlo simulation.

I can definitely see a use for PMR calculations in the daily trades category. For each trade, the automated backtester could approximate PMR at trade inception and for each [subsequent] day in trade. To get a sense of how much capital would be required to maintain the position, we would want to track the maximum value of the subsequent/initial PMR ratio. The amount of capital needed to implement a trading strategy is at least [possibly much more if done conservatively, as discussed here and here] the initial PMR multiplied by the maximum subsequent/initial PMR ratio observed across all backtested trades. In addition to this single number, I would be interested in seeing the ratio distribution of all trades plotted as a histogram and perhaps also as a graph with date on the x-axis, ratio on the left y-axis (scatter plot), and underlying price on the right y-axis (line graph).
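
A sketch of that ratio tracking (PMR values hypothetical):

    # Maximum subsequent/initial PMR ratio per trade, aggregated across trades.
    def max_pmr_ratio(daily_pmr):
        initial = daily_pmr[0]  # PMR at trade inception
        return max(pmr / initial for pmr in daily_pmr)

    trades = [[10_000, 12_500, 18_000], [9_500, 9_000, 11_400]]
    ratios = [max_pmr_ratio(t) for t in trades]
    print(max(ratios))  # 1.8: worst-case capital multiple seen in the backtest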

PMR calculations might have a place in the serial trades category as well. Plotting equity curves of different allocation percentages is different from whether those portfolios could be maintained depending on max PMR relative to account value. If PMR exceeds account value, then at least some positions would have to be closed. Since it’s impossible to know which positions this would involve or even whether the broker would do it automatically (at random), I might assume a worst-case scenario where the account would be completely liquidated. On the graph, the equity curve would go horizontal at this point. With a consequence this drastic, I think PMR monitoring is worth doing.

In addition to PM, some brokerages have a concentration requirement. One brokerage, for example, looks at the account PnL with the underlying down 25%. The projected loss must be less than 3x the net liquidation value of the account. Violation of this criterion will result in a “concentration call,” which is akin to a margin call. An account can be in the former but not the latter if it holds DOTM positions that would (not) significantly change in value with the underlying down 25% (12%). Closing these options (typically for $0.30 or less) will often resolve the concentration call.

Building concentration logic might be useful for backtesting with filters. A large enough account could actually be traded by opening daily positions. Otherwise, implementation of filters could result in multiple open positions (albeit less than one new position per day). Stressing the whole portfolio by walking the chain up 25% would be useful because a strategy that looks good in backtesting but violates the concentration criterion is not viable. Put another way, I cannot project a 20% annual return on capital when the capital actually needed to maintain a strategy is quadruple (quite possible with PM) that projected. In this case, 5% annualized would be a more accurate estimate.
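
A sketch of the concentration check itself (the projected loss with the underlying down 25% would come from walking the chain as described in Part 2):

    # Concentration criterion: projected loss with the underlying down 25%
    # must be less than 3x the account's net liquidation value.
    def concentration_ok(loss_down_25, net_liq, multiple=3.0):
        return abs(loss_down_25) < multiple * net_liq

    print(concentration_ok(loss_down_25=-260_000, net_liq=100_000))  # True
    print(concentration_ok(loss_down_25=-340_000, net_liq=100_000))  # False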

Portfolio Margin Considerations with the Automated Backtester (Part 2)

Last time I started to explain portfolio margin (PM) and why a model is needed to calculate it.

I previously thought the automated backtester incapable of calculating/tracking PM requirement (PMR) without modeling equations [differential, such as Black-Scholes] and dedicated code, but this is not entirely correct. The database will have historical price data for all options. The automated backtester can simulate position value at different underlying price changes by “walking the chain.” In order to know the price of a particular option if the underlying were to instantaneously move X% higher (lower), I can look to the strike price that is X% lower (higher) than the current market price. Some rounding will be involved because +/- 3%, 6%, 9%, and 12% will not fall squarely on the 10- or 25-point increment strikes that may be available in the data set (the most liquid strikes and therefore the most reliable prices).
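
A minimal sketch of walking the chain (the chain dictionary and prices are hypothetical):

    # Approximate an option's price after an instantaneous X% underlying move
    # by reading the current quote at the strike X% away in the opposite
    # direction, rounded to a liquid strike increment.
    def walk_chain(chain, strike, move_pct, increment=25):
        """chain maps strike -> current option price for one expiration."""
        shifted = strike * (1 - move_pct)  # underlying up X% ~ strike X% lower
        proxy = round(shifted / increment) * increment
        return chain.get(proxy)

    chain = {2650: 8.20, 2675: 10.10, 2700: 12.60}
    print(walk_chain(chain, strike=2750, move_pct=0.03))  # 10.10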

The automated backtester would not be able to perfectly calculate PMR. In order to be perfect, the backtester would need to model the risk graph continuously on today’s date, which would require implementation of differential calculus. Rounding of the sort that I described above is not entirely precise. Also in order to be perfect, we would have to match the PM algorithm used by the brokerage(s). These are kept proprietary.

Another reason the automated backtester would not be able to perfectly calculate PMR is because walking the chain does not take into account implied volatility (IV) changes. [Some] brokerages also stress portfolios with increased IV changes to the downside when calculating PMR.

We can approximate the additional IV stress a couple different ways. First, instead of stressing up and down X% we could stress more to the downside. Second, we could use vega values from the data set in addition to walking the chain. Vega is the change in option price per 1% change in IV. If we want to simulate a 10% IV increase, then, we could add vega * 10 to short positions. This would probably not be exact because vega does not remain constant as IV changes. Vomma, a second-order greek that will not be included in the data set, is the change in vega per 1% increase in IV. The change in option price is actually the sum of a series of unequal terms defined by vega and vomma (along with third-order greeks and beyond, to be absolutely precise).
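
A sketch of the two approximations (greek values hypothetical; vomma would not be in the data set, so the backtester would be limited to the first-order term):

    # First-order (vega only) versus second-order (vega + vomma) estimate of
    # an option's price change for a given IV move, per the series above.
    def iv_stress(price, vega, iv_change_pts, vomma=0.0):
        first_order = vega * iv_change_pts               # vega term
        second_order = 0.5 * vomma * iv_change_pts ** 2  # vomma correction
        return price + first_order + second_order

    # 10-point IV increase on a hypothetical option:
    print(iv_stress(price=12.60, vega=0.85, iv_change_pts=10))              # 21.1
    print(iv_stress(price=12.60, vega=0.85, iv_change_pts=10, vomma=0.03))  # 22.6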

Regardless of the imprecision, I think some PM estimate given by logic built into the automated backtester would be better than nothing. And my preference would always be to overestimate rather than underestimate PMR.