One of the things that I'd like to do with this blog is help create more 'educated consumers' of the vast amount of information that's available on the thoroughbred industry. An awful lot of that information is presented to bolster arguments that may or may not be supported by fact. Sometimes the information is intended to mislead, but more often the authors are simply unaware of some of the statistical errors that are easy to make when forming theories.
One of the easiest mistakes to make is 'survivorship bias'. This is a well-known phenomenon in the financial industry, where it can distort the apparent past performance of indexes or funds. An example would be a new 'index' created to track the prices of a number of the largest companies in a specific industry. To show how the index would have performed historically, its price is calculated going back many years. Invariably, these calculations show unrealistically strong performance. The problem is that by starting from the current largest firms, any firms that went out of business, or simply did poorly enough to shrink substantially, get left out of the index. So the index by definition includes only the firms that have done best in the past. That gives no indication of how it might do in the future, or of how it would have done if the largest firms from some point in the past had been used to create it.
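To make the effect concrete, here's a small simulation sketch. Everything in it is made up for illustration: the number of firms, the volatility, and the index size are assumptions, not real market data. Firms are given purely random returns with no edge at all, yet an "index" built from today's largest survivors still backtests as a big winner.

```python
import random

random.seed(42)

# Hypothetical illustration: 1000 firms over 10 years, each year's return
# drawn at random with zero expected edge. Some grow, some shrink or fail.
n_firms, n_years = 1000, 10
histories = []
for _ in range(n_firms):
    value = 100.0
    path = [value]
    for _ in range(n_years):
        value *= 1 + random.gauss(0.0, 0.25)  # random yearly return
        value = max(value, 0.0)               # a firm can go to zero
        path.append(value)
    histories.append(path)

# The "index": built *today* from the 50 largest survivors, then backtested.
survivors = sorted(histories, key=lambda p: p[-1], reverse=True)[:50]
index_growth = sum(p[-1] for p in survivors) / sum(p[0] for p in survivors)

# The honest benchmark: every firm that existed at the start.
market_growth = sum(p[-1] for p in histories) / sum(p[0] for p in histories)

print(f"backtested 'index' growth: {index_growth:.2f}x")
print(f"all-firm growth:           {market_growth:.2f}x")
```

Even though no firm has any real advantage, selecting the winners after the fact makes the backtested index look dramatically better than the market it was drawn from. That's the whole trick of survivorship bias in one step: the selection happens at the end, but the performance is reported from the beginning.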
So what does this have to do with the thoroughbred industry, and how might it affect research on racing or pedigree? Here's an example, where I almost designed a study with the same flaw. I've mentioned before that I want to study what factors might predict future 'breakout performance' for claiming horses. I have access to a database of several thousand races of past performance data, and was thinking I could use it for the study. I'd look at former claimers who went on to allowance or stakes wins, identify some patterns in their histories, and then use the past performance data to test how all horses that exhibited the same patterns fared. The problem is that by using past performance data to test the patterns, I'd automatically exclude every horse that showed the same pattern but never made it back to the races, and I'd underweight the horses that weren't good enough to run often afterward. The bias this would introduce would have made my findings almost useless. It's subtle problems like this in much of the existing research that lead me to believe there's a need for better research in the industry.
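The same distortion can be sketched in a few lines. The numbers here are entirely invented for illustration: the outcome mix and the return-to-racing probabilities are assumptions, not measurements from any real database. The point is only the mechanism: horses that decline rarely show up in later past-performance data, so a pattern evaluated on that data looks far more predictive than it really is.

```python
import random

random.seed(7)

# Hypothetical illustration: 1000 claimers all show the same "pattern".
# A third improve afterward, a third stay flat, a third decline -- but
# declining horses rarely make it back to the races at all.
outcomes = ["improved"] * 333 + ["flat"] * 333 + ["declined"] * 334

def returned_to_races(outcome):
    # Assumed return-to-racing probabilities, for the illustration only.
    p = {"improved": 0.95, "flat": 0.70, "declined": 0.20}[outcome]
    return random.random() < p

# The past-performance database only ever sees the horses that came back.
visible = [o for o in outcomes if returned_to_races(o)]

true_rate = outcomes.count("improved") / len(outcomes)
biased_rate = visible.count("improved") / len(visible)

print(f"true improvement rate:              {true_rate:.0%}")
print(f"rate seen in past-performance data: {biased_rate:.0%}")
```

In this toy setup the pattern "works" for about a third of horses, but the filtered data suggests it works for roughly half, because the failures quietly drop out of the sample before anyone counts them.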