Walking it Forward with System Validation (Part 3)
Posted by Mark on January 31, 2013 at 04:53 | Last modified: January 28, 2013 13:25
In http://www.optionfanatic.com/2013/01/30/walking-it-forward-with-system-validation-part-2/, I introduced the concept of in-sample (IS) and out-of-sample (OOS) data. I will continue today by going into more detail about the third example.
In this example, I have divided the 15 years of historical data into 13 years of IS data and two years of OOS data. I can now test the system developed on the first 13 years of data against the last two. This is an improvement over the second example, where optimization was performed over all 15 years of historical data; the only way to test that system’s effectiveness was with real money, which is why I said it was somewhat of a gamble.
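To make the mechanics concrete, here is a minimal Python sketch of that split. The prices are synthetic and the 20-period SMA is merely a placeholder for whatever length the IS optimization actually selected, so treat it as an illustration of the idea rather than the system described in these posts.

```python
import numpy as np
import pandas as pd

def sma_reversal_returns(close, length):
    """Daily returns of the long/short SMA reversal rule: long above the SMA,
    short below it, with the signal acted on at the next bar."""
    sma = close.rolling(length).mean()
    position = np.sign(close - sma).shift(1)
    return position * close.pct_change()

# Synthetic stand-in for 15 years of daily closes.
dates = pd.bdate_range("1998-01-01", "2012-12-31")
close = pd.Series(100 * np.exp(np.random.normal(0.0003, 0.01, len(dates)).cumsum()),
                  index=dates)

is_close = close[:"2010-12-31"]      # first 13 years: in-sample
oos_close = close["2011-01-01":]     # last 2 years: out-of-sample, untouched until now

best_length = 20                     # placeholder for the length chosen on the IS data
oos = sma_reversal_returns(oos_close, best_length).dropna()
print("OOS total return: %.1f%%" % (100 * ((1 + oos).prod() - 1)))
```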
If the system does not perform well on the OOS data then I would not trade it with real money. From my perspective, this is like bringing my wife along to purchase new clothes. Neither of us is a fashion maven. However, if I pick something out that she doesn’t like, there is an increased probability that other people won’t like it either. If I pick something out that she does like, there is an increased probability that other people will like it too. In either case, one validation is better than none.
I believe it important to stress that neither one validation nor one rejection is any guarantee of future success or failure. If the system performs well on the OOS data then it is not guaranteed to profit in live trading. Similarly, if the system fails to perform well on the OOS data then it is not guaranteed to lose in live trading. The goal of system development is simply to generate enough confidence in a system to consistently trade it with real money. This premise guides my decision making.
In my next post, I will take 1-step validation to the next level.
Walking it Forward with System Validation (Part 2)
Posted by Mark on January 30, 2013 at 04:26 | Last modified: January 25, 2013 06:01
Last time, I began to build up to an example illustrating the process of Walk-Forward Analysis (WFA). I introduced two examples of system development, the second of which includes an optimization step. That step prevents the “flying blind” to which the first example is subject, and it therefore makes the first example the riskier undertaking.
If a system traded well in the past, what reason do I have to think it will trade well in the future? Robustness, in the sense of neighboring parameter values also generating profitable backtesting returns, is encouraging, but I still don’t have any future comparison to look at. If I start to trade live without such a comparison, I will eventually have the answer in the form of real performance, but this is another form of “flying blind” and seems a bit like gambling to me.
For this reason, it seems logical to divide historical data into in-sample (IS) and out-of-sample (OOS) data. IS data is used to optimize the trading system and is known as the “training set.” OOS data is used to test the findings from the IS optimization and is known as the “test set.” The key is to leave the OOS data untouched until the time of testing: do not use any OOS data to optimize variables during the IS period.
My third example will divide up the 15 years of historical data into the first 13 years, which I designate as IS, and the last two years, which I designate as OOS. This time, I optimize over the first 13 years of data to find what SMA length works best. I also check neighboring values of SMA length to make sure the optimal performance is not a fluke. I will then backtest this system on the OOS data to see how performance compares.
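A rough Python sketch of that workflow, again with synthetic prices standing in for real data and total return standing in for the subjective function: optimize the SMA length on the first 13 years only, glance at the neighboring lengths, and only then touch the last two years.

```python
import numpy as np
import pandas as pd

def total_return(close, length):
    """Total return of the long/short SMA reversal rule on one price series."""
    sma = close.rolling(length).mean()
    daily = np.sign(close - sma).shift(1) * close.pct_change()
    return (1 + daily.dropna()).prod() - 1

# Synthetic stand-in for 15 years of daily closes.
dates = pd.bdate_range("1998-01-01", "2012-12-31")
close = pd.Series(100 * np.exp(np.random.normal(0.0003, 0.01, len(dates)).cumsum()),
                  index=dates)
is_close = close[:"2010-12-31"]       # first 13 years: in-sample
oos_close = close["2011-01-01":]      # last 2 years: out-of-sample

# Optimize the SMA length on the in-sample years only.
is_results = {n: total_return(is_close, n) for n in range(5, 101, 5)}
best = max(is_results, key=is_results.get)

# Check the neighbors: a robust choice sits on a plateau, not on a spike.
for n in (best - 5, best, best + 5):
    if n in is_results:
        print(f"IS return, SMA {n}: {is_results[n]:.1%}")

# Only now is the OOS data used.
print(f"OOS return, SMA {best}: {total_return(oos_close, best):.1%}")
```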
I will continue the discussion in my next post.
Walking it Forward with System Validation (Part 1)
Posted by Mark on January 29, 2013 at 02:49 | Last modified: January 25, 2013 05:42
As discussed in http://www.optionfanatic.com/2012/10/29/drive-the-monte-carlo-to-consistent-trading-profits, Monte Carlo analysis is a way to get a broader statistical view, or a range of what to expect from my trading system. According to Howard Bandy, who has written many books on AmiBroker and System Development, the process is not complete until it passes validation in the form of Walk-Forward Analysis (WFA). WFA combats the tendency of system developers to curve-fit.
In order to illustrate this, I will present four examples.
My hypothetical system enters long (short) at the next open if closing price is above (below) a simple moving average (SMA). Trades are held until a reverse signal appears. One market will be traded and only one position will be held at a time.
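For illustration only, here is a minimal Python sketch of those rules, with synthetic prices standing in for the one market traded; the one-bar lag approximates entering at the next open.

```python
import numpy as np
import pandas as pd

# Synthetic daily closes stand in for the single market being traded.
dates = pd.bdate_range("2010-01-01", "2012-12-31")
close = pd.Series(100 * np.exp(np.random.normal(0.0003, 0.01, len(dates)).cumsum()),
                  index=dates)

length = 20
sma = close.rolling(length).mean()

# +1 = long, -1 = short; lag one bar to approximate entering at the next open.
position = np.sign(close - sma).shift(1)

# A trade occurs whenever the position reverses; count the reversals.
reversals = position.diff().fillna(0) != 0
print("reversal trades:", int(reversals.sum()))
```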
For the first example, suppose I backtest this system over the last 15 years using a 20-period SMA. The equity curve looks great and the subjective function value is high. Thinking I have found the Holy Grail, I start trading this tomorrow. This is how things might go for many people who find trading strategies in books or on web sites and backtest them.
Contrast this with a second example where I backtest over the last 15 years and optimize by varying the SMA length from 5 to 100 in five-day increments. With 20 potential SMA lengths, I am testing 20 different systems. I choose the best performing system to trade live starting tomorrow.
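In code, this optimization is nothing more than a loop over the 20 candidate lengths. The sketch below is hypothetical: synthetic prices stand in for real data and total return stands in for whatever subjective function is being maximized.

```python
import numpy as np
import pandas as pd

def total_return(close, length):
    """Total return of the long/short SMA reversal rule over the full history."""
    sma = close.rolling(length).mean()
    daily = np.sign(close - sma).shift(1) * close.pct_change()
    return (1 + daily.dropna()).prod() - 1

# Synthetic stand-in for 15 years of daily closes.
dates = pd.bdate_range("1998-01-01", "2012-12-31")
close = pd.Series(100 * np.exp(np.random.normal(0.0003, 0.01, len(dates)).cumsum()),
                  index=dates)

# 20 candidate systems: SMA lengths 5, 10, ..., 100, optimized over ALL the data.
results = {n: total_return(close, n) for n in range(5, 101, 5)}
for n in sorted(results):
    print(f"SMA {n:3d}: {results[n]:7.1%}")
print("best length:", max(results, key=results.get))
```

Printing all 20 results also sets up the point that follows: the winning length can be judged against its neighbors rather than in isolation.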
While this example is probably the epitome of curve fitting, I would consider it better than the first because I can see how the system performs with neighboring SMA lengths. As discussed in http://www.optionfanatic.com/2012/09/28/trading-system-1-spy-vix-part-1, through the optimization process I have more data that enables me to determine whether my impressive performance is a spike peak on the graph (fluke) or part of a high plateau (more robust). In the first example, I am flying completely blind.
I will continue with more examples in the next post.
Lingering Quandaries about System Development (Part 6)
Posted by Mark on January 28, 2013 at 05:02 | Last modified: January 24, 2013 06:21
My last post ended with a thud. Today, I’m going to reframe the discussion in terms of robustness.
My entire reason for studying System Development is to identify valid systems: systems that will make money in live trading. As was the case in http://www.optionfanatic.com/2012/10/18/laziness-dissected, I have now discovered another hidden personal bias. In addition to wanting a system that is valid, I have wanted a system that is robust. I have believed a system cannot be valid without being robust.
Investopedia defines “robust” as:
> a characteristic describing a model’s, test’s or system’s ability to effectively perform
> while its variables or assumptions are altered. A robust concept can operate without
> failure under a variety of conditions… For statistics, a test is claimed as robust if it
> still provides insight to a problem despite having its assumptions altered or violated…
> In general, being robust means a system can handle variability and remain effective.
I illustrated robustness in http://www.optionfanatic.com/2012/09/28/trading-system-1-spy-vix-part-1 with the graphs showing a high plateau region vs. a spike high.
If one of the variables to be altered is the ticker itself, then we have another definition of robustness commonly used by traders. Many traders think a system is only valid if it is profitable across different markets. Now I see this is personal bias, at best. Whether a system needs to be effective on multiple markets is the unanswered question that concluded http://www.optionfanatic.com/2013/01/25/lingering-quandaries-about-system-development-part-5/.
Ultimately, System Development is about giving me the confidence required to consistently trade a system in real time, thereby giving me the opportunity to profit. If I believe a system must work across markets then I will demand this in the development process. If I believe a system may work on one market then successful development on just one market is all I need to see.
Lingering Quandaries about System Development (Part 5)
Posted by Mark on January 25, 2013 at 04:15 | Last modified: January 24, 2013 05:01
I am using this series of blog posts to articulate details of System Development that just do not get along with my cognitive framework. I identified the first one, difficulty interpreting RAR/MDD (Risk-Adjusted Return / Maximum Drawdown), in http://www.optionfanatic.com/2013/01/18/lingering-quandries-about-system-development-part-1/. The second conflict was uncovered in http://www.optionfanatic.com/2013/01/24/lingering-quandaries-about-system-development-part-4/.
To review, I have defined a trading system that works well with one ticker but does not trade frequently enough. I have expanded from one ticker to a 5-ETF basket that also backtests well. Before I trade this live, however, I must screen for selection bias to ensure the positive results may be attributed to the trading system rather than a lucky choice of ETF basket.
Suppose I therefore backtest the system on all 2.54 billion possible 5-ETF baskets and find the results from my original basket to be within 1 SD of the average for all baskets. Great! There is no evidence of selection bias and I may therefore proceed with trading the system live.
On the other hand, suppose I backtest the system on all 2.54 billion baskets and find the results from my original basket to be more than 2 SD better than the average of all 5-ETF baskets. The probability of this occurring by chance is under 5% (roughly 2% for a normal distribution), so perhaps I should not trade the system live: the outperformance of my original basket may have been the result of a lucky selection of five ETFs, a fluke if you will.
But wait… who says that with particular rules, some markets can’t trade better than others? Each market will, to some degree, reflect the cumulative personality of its largest institutional traders, and it certainly seems possible that if many of them follow certain criteria, those criteria may carry some edge. By this line of reasoning, I should not only be encouraged to see the original basket perform significantly better than the average basket; I should demand it!
Yes, I should demand exactly what, according to the former viewpoint, I absolutely did not want to see and would not trade live.
So which is it?
I don’t know, I don’t think I can know, and I don’t think there is any statistical method by which I can possibly know. The problem is too multivariate and complex.
Take that, Debbie Downer!
Lingering Quandaries about System Development (Part 4)
Posted by Mark on January 24, 2013 at 05:57 | Last modified: January 23, 2013 04:13
In this series, I’m trying to bring my confusion about System Development principles to the surface by discussing each point in turn. In http://www.optionfanatic.com/2013/01/23/lingering-quandaries-about-system-development-part-3/, I discussed a case of backtesting multiple tickers. If a given ETF basket backtests well, I still need to rule out selection bias before feeling confident about trading the basket live.
Selection bias would be present if the specified ETF basket performs well when other baskets perform poorly. To test this, I could randomly create other ETF baskets and backtest those. If I identify 200 liquid ETFs, for example, and I want to trade five of them then the number of possible combinations is 2.54 billion. I could backtest all combinations and calculate the mean and standard deviation (SD) performance statistics. Ideally, the original basket I backtested should be within 1 SD of the mean performance for all combinations. If the original backtest is more than 2 SD better than the mean performance for all combinations then perhaps I decide this is an outlier and should not be trusted.
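Enumerating all 2.54 billion baskets would rarely be practical, so the sketch below samples random baskets instead and asks where the original basket falls relative to that sampled distribution. Everything here is hypothetical; in particular, backtest_basket is a placeholder for a real backtest that returns a performance score.

```python
import math
import random

universe = [f"ETF{i:03d}" for i in range(200)]        # 200 liquid ETFs (placeholder names)
print("possible 5-ETF baskets:", math.comb(200, 5))    # about 2.54 billion

def backtest_basket(basket):
    """Hypothetical placeholder: return a performance score (e.g., RAR/MDD) for a basket."""
    rng = random.Random(hash(tuple(sorted(basket))))    # deterministic toy score per basket
    return rng.gauss(1.0, 0.25)

# Sample random baskets rather than enumerating every combination.
samples = [backtest_basket(random.sample(universe, 5)) for _ in range(10_000)]
mean = sum(samples) / len(samples)
sd = (sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)) ** 0.5

original = backtest_basket(["BND", "DBC", "VEU", "VNQ", "VTI"])
z = (original - mean) / sd
print(f"original basket is {z:+.2f} SD from the sampled mean")
```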
While this seems statistically sound, it does imply that no ETF basket should be significantly better than the rest.
Say what?
Many people believe different tickers have unique trading personalities! Institutions, which dominate market action, are the most likely explanation why. If many institutions trading a given ticker follow 50/200-SMA crossovers, then I am likely to see an edge when using a 50/200-SMA crossover trading rule. To the extent that different tickers are traded by different institutions, different tickers may be more or less influenced by the different technical [or other] criteria applied by those institutions. Doesn’t it therefore seem plausible that a given trading system might work well for one ticker (or ETF basket) and not others? In theory, such a system would continue to be effective until the institutions trading that ticker significantly change or until the traders responsible for those institutions’ trades significantly change (e.g., fund managers are replaced).
Lingering Quandaries about System Development (Part 3)
Posted by Mark on January 23, 2013 at 03:45 | Last modified: January 23, 2013 03:59
This series details the ongoing conceptual conflicts that impede my education about System Development. In http://www.optionfanatic.com/2013/01/22/lingering-quandaries-about-system-development-part-2/, I discussed potential ways to avoid apples-to-oranges comparisons using the subjective function RAR/MDD. Today I continue with a discussion of the challenges faced by systems that trade multiple tickers.
One reason I might want to implement multiple tickers is to generate more trades and more opportunity for profit. Suppose I develop a trading system for SPY with a profit factor (PF) of 2.20 that generated 70 trades over the last 14 years. On average, that is only five trades per year. While the high PF suggests I will likely get good bang for the buck, it’s always dangerous to risk too much at once lest this be the trade that takes me to maximum drawdown (MDD). With my bet size limited, the only way to generate acceptable profit potential might be more trades per year.
To proceed with a plan for trading multiple tickers, one thing I could do is expand to a basket of ETFs. In Jeff Swanson’s post “The Ivy Portfolio” (http://systemtradersuccess.com/the-ivy-portfolio/), he mentions use of these five ETFs:
BND – Vanguard Total Bond Market (4-5 year)
DBC – PowerShares DB Commodity Index
VEU – Vanguard FTSE All-World ex-US
VNQ – Vanguard MSCI U.S. REIT
VTI – Vanguard MSCI Total U.S. Stock Market
If SPY alone generated five trades per year, then expanding to a 5-ETF basket like this would dramatically increase that number.
Suppose I backtest my system on this ETF basket and get solid results. Suppose, too, that I have performed walk-forward analysis (not yet discussed in my blog) and observed solid results. Should this give me the confidence required to trade this system live?
The answer to this question, and more, when we return!
Lingering Quandaries about System Development (Part 2)
Posted by Mark on January 22, 2013 at 06:33 | Last modified: January 23, 2013 01:49
This series details the areas of nebulous understanding that prevent my advancement along the System Development learning curve. In http://www.optionfanatic.com/2013/01/18/lingering-quandries-about-system-development-part-1/, I began by describing confusion over the chosen subjective function RAR/MDD.
As mentioned, one possibility is to select a different subjective function, and for me, Profit Factor (PF) leads the pack of contenders. PF is the number of dollars made per $1 lost. It does not vary with number of trades or position size, two factors that affect exposure and therefore RAR/MDD. With PF, I can study one ticker or multiple tickers and generate comparable statistics regardless of initial equity and position sizing.
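A minimal sketch of the calculation from a list of hypothetical per-trade profit/loss figures; scaling every trade by the same factor leaves PF unchanged, which is why it is comparable across account sizes and position sizes.

```python
def profit_factor(trade_pnl):
    """Profit factor: gross profit divided by gross loss (dollars made per $1 lost)."""
    gross_profit = sum(p for p in trade_pnl if p > 0)
    gross_loss = -sum(p for p in trade_pnl if p < 0)
    return gross_profit / gross_loss if gross_loss else float("inf")

trades = [450, -200, 125, -75, 300, -150]        # hypothetical per-trade P/L in dollars
print(profit_factor(trades))                     # about 2.06
print(profit_factor([10 * p for p in trades]))   # identical: PF ignores position size
```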
One downside to PF is that it does not account for the distribution of profits and losses. A single outsized winner could dominate the calculation and generate a large PF even though the equity curve tells a different story. This argues for including standard deviation (SD) in the subjective function.
Another downside to PF is that it disregards drawdown (DD). For example, AAPL had a 50% return from 2008 to 2011, but there was also a 60% drawdown in the middle. While that drawdown spanned several years, something similar could occur over any time frame and would be more than I could tolerate despite a decent overall PF. Perhaps some metric like PF / (MDD * SD) would be useful.
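One way to read that formula, sketched below with hypothetical trade results: PF computed from per-trade profit/loss, MDD taken from the resulting equity curve, and SD taken as the standard deviation of the per-trade results. Other readings are certainly possible; this is only an illustration.

```python
import statistics

def profit_factor(trade_pnl):
    gross_profit = sum(p for p in trade_pnl if p > 0)
    gross_loss = -sum(p for p in trade_pnl if p < 0)
    return gross_profit / gross_loss if gross_loss else float("inf")

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction of the peak."""
    peak, mdd = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        mdd = max(mdd, (peak - value) / peak)
    return mdd

trades = [450, -200, 125, -75, 300, -150]    # hypothetical per-trade P/L in dollars
equity = [10_000]                            # hypothetical starting equity
for p in trades:
    equity.append(equity[-1] + p)

pf = profit_factor(trades)
mdd = max_drawdown(equity)
sd = statistics.stdev(trades)
print(f"PF={pf:.2f}  MDD={mdd:.2%}  SD={sd:.0f}  PF/(MDD*SD)={pf / (mdd * sd):.3f}")
```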
Another possibility is to stick with RAR/MDD and backtest just one ticker at a time. The cumulative effect of trading multiple tickers would likely be a lower RAR/MDD than any single ticker shows alone, but that is unimportant because the comparisons would at least be consistent. Besides, whether through different tickers or different systems, I want multiple trades with the potential of concentrated profit (large RAR/MDD) on each one.
I believe the bigger issues with multiple tickers have to do with backtesting validity, portfolio heat, and what exactly it takes for a finding to qualify as robust.
To be continued…
Lingering Quandaries about System Development (Part 1)
Posted by Mark on January 18, 2013 at 15:45 | Last modified: January 23, 2013 01:42
While this series is strictly personal, I do believe that anyone seeking to master System Development will need to arrive at their own resolutions to these issues. On my journey, I remain “in consolidation” due to conflicting subdivisions of System Development. More than anything else, this series will help me keep track of where I’ve been and where I am so that I can better understand where I still have to go.
My first unresolved issue is identification of an acceptable subjective function. I addressed this in a six-part series beginning with http://www.optionfanatic.com/2012/10/04/the-subjective-function-part-1. In the end, I settled on Risk-Adjusted Return / Maximum Drawdown (RAR/MDD). I like systems that are surgical in their efficiency: systems that trade infrequently but generate large profits when triggered. RAR rewards systems for decreased exposure. I also think MDD should be factored into system evaluation. A drawdown (DD) that grows too large to bear threatens to stop me from trading a system; it will keep me awake at night because it gets me thinking about Ruin.
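For illustration, here is a sketch of the calculation assuming the AmiBroker-style definition in which RAR is the annualized return divided by exposure, so that a system in the market only rarely scores higher for the same return. The numbers are hypothetical and this is not AmiBroker's exact computation.

```python
def rar_over_mdd(total_return, years, exposure, max_drawdown):
    """Sketch of the subjective function: annualized return scaled up for low
    exposure, divided by maximum drawdown (all inputs as decimal fractions)."""
    car = (1 + total_return) ** (1 / years) - 1    # compound annual return
    rar = car / exposure                           # reward infrequent exposure
    return rar / max_drawdown

# Hypothetical system: +80% over 5 years, in the market 25% of the time, 15% MDD.
print(round(rar_over_mdd(0.80, 5, 0.25, 0.15), 2))   # about 3.3
```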
My initial foray into System Development involved backtesting one ticker. Through that exercise, I began to develop a sense for how RAR/MDD varies. The problem arose when I expanded the backtest to include other tickers and encountered a drastic reduction in RAR/MDD. In retrospect, this makes sense: trading multiple tickers increases exposure proportionally, which decreases RAR and, with it, the subjective function. Comparing this to values seen previously was like comparing apples to oranges, and I could not relate the two.
One way to circumvent this problem is to backtest one ticker at a time. I would then see RAR/MDD in an apples-to-apples fashion across tickers. Another alternative is to choose a different subjective function altogether.
I will explore these possibilities in my next post.