Wednesday, May 21, 2008

Predicting the Past

I’ve mentioned before that I think most research done on the thoroughbred industry doesn’t use correct research methods and statistical analysis. One of the really common errors I’ve noticed is studies that don’t actually evaluate the predictive value of their findings. Theories are tailored to produce the best possible fit with what’s happened in the past, but it is just assumed that they will predict the future equally well.

There are at least three things we can do to avoid drawing false conclusions of this sort:

1. Apply common sense. While the basic premise of this blog and my Thoroughmetrics research business is that relying on common sense isn’t enough, that doesn’t mean we should do without it entirely. For an extreme example, no matter what the data tell us, we wouldn’t put any faith in a theory that says “dams whose names begin with the letter S outperform all others”. While that may accurately describe what has happened in the past, there’s no logical reason to expect it to help us predict the future.
2. Look for ‘smooth’ patterns in the results. For example, before the Florida Derby, there was a lot of talk about how Big Brown’s outside post position would make it tougher for him to win the race. Horses in the outside positions in routes run at Gulfstream have very low winning percentages. Some people discounted this with the argument that ‘horses in the 4 slot also have a low winning percentage’. That’s nonsense. If 1, 2, 3, 5, 6, and 7 have solid winning percentages, while 4, 8, 9, 10, 11, and 12 have low winning percentages (with 11 and 12 having no winners prior to Big Brown), it’s a safe bet that post position 4’s low winning percentage is a fluke, since its neighbors 2, 3, 5, and 6 do just fine. The outer post positions, on the other hand, impose a real handicap on a horse’s chance of winning: none of them has yielded good results.
3. Test multiple, independent samples to confirm the results. For example, if you’re looking at a variable that you think will best identify the top sires, don’t just lump all the data together. Try looking at it for individual years, and see how consistent the results are from one year to the next. You may find that rankings you thought had real predictive value simply describe what has already happened, rather than predict what will happen. This is particularly true with variables that can be thrown off by a small number of aberrations. Because of Big Brown, Boundary is likely to be one of this year’s leading sires based on total earnings of offspring. I hope nobody will now consider Boundary a top sire (forgetting for the moment that he’s been pensioned already).
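
To make the third point concrete, here is a minimal sketch in Python of one way to check year-to-year consistency: rank the sires by the chosen variable in each year separately, then compute the Spearman rank correlation between the two rankings. A value near 1 suggests the variable ranks sires much the same way in both years; a value near 0 suggests it mostly describes what already happened. The sire names and numbers are made up purely for illustration, and the function names (`ranks`, `spearman`, `year_to_year_consistency`) are my own, not from any particular library.

```python
def ranks(values):
    """Return 1-based ranks (1 = largest value), averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def year_to_year_consistency(year_a, year_b):
    """Compare how two years rank the sires that appear in both."""
    common = sorted(set(year_a) & set(year_b))
    return spearman([year_a[s] for s in common], [year_b[s] for s in common])


# Hypothetical data: total progeny earnings (in $ millions) per sire.
# "Sire C" jumps to the top in the second year on the back of one big horse.
first_year = {"Sire A": 5.1, "Sire B": 3.2, "Sire C": 2.0, "Sire D": 1.1}
second_year = {"Sire A": 1.4, "Sire B": 3.0, "Sire C": 6.5, "Sire D": 1.0}

rho = year_to_year_consistency(first_year, second_year)
print(f"Year-to-year rank correlation: {rho:.2f}")
```

With real data you would want many more sires and more than two years, and it would be worth running the same check on a variable that is less sensitive to a single standout runner to see whether it holds up better from one year to the next.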
