Reflections on a racing form: Research study, part II: Debut victories categorized by style

Before returning to debut winners and future success, I can use my database to show how often horses of each style win. The number of winners in the led, pressed, middle, and behind categories:

Table 1

Debut winners who led at first call: 251 (.302)
Debut winners who pressed at first call: 226 (.272)
Debut winners in the middle at first call: 267 (.322)
Debut winners behind at first call: 86 (.104)

Early leaders, pressers, and mid-pack first-time starters won in roughly equal numbers, while back-of-the-pack winners were rare. However, the raw totals do not really give us the data we want. For the data to be of practical significance, we need to adjust for the fact that the three bottom categories can comprise more than one horse in a race (see the definitions of the categories in Part I).

I didn't track field size in my sample. However, from another project, I have a dataset of 12,097 races from July 2007 to July 2008. There I tracked the number of betting interests, which is very close to field size (the two probably differ by less than .1 on average). From that sample, I narrowed to non-state-bred maiden special weights for 2-year-olds or 3-year-olds only, between 4 and 7 furlongs on the dirt, with five or more betting interests, and not originally carded for the turf. That left 238 races. By printing the frequency table for betting interests in those races, and multiplying the percentage frequencies by the number of led/pressed/middle/behind horses in each case, I was able to estimate the average number of horses in the led, pressed, middle, and behind categories in a race in my study. (The led category has one horse per race. The pressed category has one horse if the field has fewer than nine horses, and two if the field has nine or more. Behind works the same way with one more horse: two with fields of eight or fewer, and three with fields of nine or more. The pressed and behind averages, then, require only the percentage of races with nine or more horses. It's 42.9%, so pressed has an average of 1.429 places in a race, and behind an average of 2.429. The middle category is the only one whose average takes an involved calculation.)
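For anyone who wants to repeat the calculation, here is a minimal Python sketch of the weighting step. The led, pressed, and behind rules are the ones described above. The middle count is taken as whatever remains of the field, which is my inference from the averages summing to the average field size, not a definition quoted from Part I. The frequency table in the example is hypothetical, since the actual distribution of the 238 races is not reproduced here, so its output will not match the figures in the text.

```python
def slots(field_size):
    """Number of horses in each style category for one race."""
    led = 1
    pressed = 1 if field_size < 9 else 2
    behind = 2 if field_size < 9 else 3
    # Assumption: "middle" is everyone not counted in the other three.
    middle = field_size - led - pressed - behind
    return {"led": led, "pressed": pressed, "middle": middle, "behind": behind}


def average_slots(freq):
    """Weight each field size's slot counts by its share of races.

    freq maps field size -> proportion of races (proportions sum to 1).
    """
    totals = {"led": 0.0, "pressed": 0.0, "middle": 0.0, "behind": 0.0}
    for field_size, share in freq.items():
        for style, count in slots(field_size).items():
            totals[style] += share * count
    return totals


# HYPOTHETICAL frequency table, for illustration only (not the real data).
example_freq = {5: 0.05, 6: 0.10, 7: 0.20, 8: 0.25,
                9: 0.20, 10: 0.12, 11: 0.05, 12: 0.03}

example_avg = average_slots(example_freq)
for style, value in example_avg.items():
    print(f"{style}: {value:.3f}")
```

With the real frequency table in place of `example_freq`, the same function should return the averages reported in the text, provided the middle rule matches the Part I definition.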

These numbers were:

Table 2

Led: 1
Pressed: 1.429
Middle: 3.566
Behind: 2.429

The numbers sum to 8.424 (the average field size in the dataset). What we want next is the percentage of places that each category covers (i.e., Table 2 converted into proportions).

These are:

Table 3

Led: .119
Pressed: .170
Middle: .423
Behind: .288

Divide each proportion in Table 1 by the corresponding proportion in Table 3 and we get how likely each position is to win relative to average, with 1.0 representing an average chance.

Table 4

Led: 2.538
Pressed: 1.600
Middle: 0.761
Behind: 0.361

Divide "Led" in Table 4 by the other categories in Table 4 and we discover that a first-time starter on the lead is 1.59 times as likely to win as a first-time starter who presses the pace, 3.34 times as likely to win as a first-time starter who lays mid-pack, and a whopping 7.03 times as likely to win as a first-time starter who comes from the back of the pack. The true rate of victories descends as the horses analyzed get farther back in the pack.
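The chain from Table 1 to these ratios can be checked with a short Python sketch. The inputs are the counts in Table 1 and the averages in Table 2; the variable names are mine. Because the script carries full precision instead of the rounded entries in Tables 3 and 4, its results can differ slightly from the printed figures in the second or third decimal place.

```python
# Table 1: debut winners by position at first call.
winners = {"led": 251, "pressed": 226, "middle": 267, "behind": 86}

# Table 2: average number of horses per race in each category.
avg_slots = {"led": 1.0, "pressed": 1.429, "middle": 3.566, "behind": 2.429}

total_winners = sum(winners.values())
total_slots = sum(avg_slots.values())

# Share of winners (Table 1 proportions) and share of places (Table 3).
win_share = {s: winners[s] / total_winners for s in winners}
slot_share = {s: avg_slots[s] / total_slots for s in avg_slots}

# Table 4: likelihood of winning relative to average (1.0 = average).
index = {s: win_share[s] / slot_share[s] for s in winners}

# How many times as likely a first-call leader is to win as each other style.
led_ratio = {s: index["led"] / index[s] for s in winners if s != "led"}

for s in winners:
    print(f"{s}: win share {win_share[s]:.3f}, "
          f"slot share {slot_share[s]:.3f}, index {index[s]:.3f}")
for s, r in led_ratio.items():
    print(f"led vs {s}: {r:.2f}")
```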

This is an expected conclusion, duplicating research done by I don't know how many people how long ago. The most effective styles in the debut itself were not something I was overtly studying, but since I have the data, I thought I would pass them along. Now I'm very curious to know, however, how these rates compare to those of non-first-time starters in similar races, and to the rates in races with older horses. Does the winning percentage drop less steeply by descending category in those races? I would like trainers to know whether they are sacrificing a potential win if they do not send their first-time starter to or near the front, and are truly giving their horse a race by not doing so. This is something they should know, rather than just blindly sending their horse to the post, thinking the way he or she is ridden doesn't matter, and hoping for the best. Without comparable data for other kinds of races, however, it's unclear how strong a message trainers should take from the rates I found.

Astute readers might pick up on one hole in the analysis. I assume that first-time starters fill the led/pressed/middle/behind places in the same proportions as horses overall. This isn't necessarily true. By their numeric presence, first-time starters may have a 28.8% chance of occupying one of the "behind" slots in a race. But if they are less sharp than horses who have run, they may in fact occupy the "behind" slots more frequently. If that is true, then the figure of 36.1% of the average win rate for behind horses is an overestimate. They are even less likely to win than that.
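To see how much this could matter, here is a small Python sketch that recomputes the "behind" figure under different assumed shares of slots. Only the first share (.288, from Table 3) and the win share (.104, from Table 1) come from my data. The larger shares are hypothetical values chosen purely for illustration.

```python
def win_index(win_share, slot_share):
    """Likelihood of winning relative to average (1.0 = average)."""
    return win_share / slot_share


BEHIND_WIN_SHARE = 0.104      # Table 1
BASELINE_SLOT_SHARE = 0.288   # Table 3

# Only the first value is from the study; the rest are HYPOTHETICAL.
assumed_slot_shares = [BASELINE_SLOT_SHARE, 0.32, 0.36, 0.40]

sensitivity = {share: win_index(BEHIND_WIN_SHARE, share)
               for share in assumed_slot_shares}

for share, idx in sensitivity.items():
    print(f"behind slot share {share:.3f} -> index {idx:.3f}")
```

The larger the share of "behind" slots that first-time starters actually fill, the lower the figure falls.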

The breakdown by meet

I compared the number of led/pressed/middle/behind winners at each of the meets. While the evidence for real differences and patterns is strong, the meets cannot just be compared at face value. The percentages of starters that fit into each category are not exactly equivalent to Table 3 but differ somewhat depending on the particular distribution of field sizes at the meet. The smaller the average field, the higher the rate of "led" wins will be. Larger fields necessarily aid the aggregate win rates of the other styles, although the exact relationship between field size and expected win rate is unique for every style, and complex. Taking "pressed," for instance:

Table 5. Expected winning percentage for the "pressed" style by field size

Field size   Possible winners from style   Expected win share
5            1                             .20
6            1                             .17
7            1                             .14
8            1                             .13
9            2                             .22
10           2                             .20
11           2                             .18
12           2                             .17

Large fields are generally good for the pressed winning percentage, but very large fields are not as good as merely large ones, and five-horse fields are good as well. Given how nuanced the relationship between starters and style categorization percentage is, speculating on differences in biases from one meet to another is well-nigh impossible without hard data. My dataset of similar races might have shed some light, but the number of races per meet would have been frightfully small, since that dataset contained fewer than a third as many races as this one, many of which were at meets not included in this study. With just the raw data of number of winners by style, field size cannot be ruled out as a force behind different debut-winner style distributions at different meets. Some of the quirks in the relationship between starters and expected winning percentage may diminish differences between meets, in any event. Again, this does not hold for "led", where there is a simple and inverse relationship between starters and winning percentage.
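The pattern in Table 5, and the contrast with "led", can be reproduced with a few lines of Python. This sketch assumes, as the table does, that every starter has an equal chance, so the expected share is simply the number of slots in the style divided by field size. The function names are mine.

```python
def pressed_slots(field_size):
    """One presser in fields under nine, two in fields of nine or more."""
    return 1 if field_size < 9 else 2


def expected_share(slot_count, field_size):
    """Expected win share if every starter had an equal chance."""
    return slot_count / field_size


# (field size, pressed slots, expected pressed share, expected led share)
rows = [(n, pressed_slots(n),
         expected_share(pressed_slots(n), n),
         expected_share(1, n))
        for n in range(5, 13)]

for n, k, pressed, led in rows:
    print(f"{n:>2}  slots={k}  pressed={pressed:.3f}  led={led:.3f}")
```

The pressed share jumps when the second slot appears at nine starters and then declines again, while the led share only ever falls as the field grows.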

The style percentage of debut winners by meet, with meets presented chronologically, and percentages reflecting the led, pressed, middle, and behind styles from left to right:

Table 6

Belmont Spring/Summer 47.9, 21.9, 23.3, 6.8
Churchill Spring/Summer 34.1, 34.1, 27.3, 4.5
Hollywood Spring/Summer 35.4, 33.3, 27.1, 4.2
Del Mar 31.5, 28.8, 27.4, 12.3
Saratoga 37.2, 24.8, 28.9, 9.1
Belmont Fall 27.1, 22.0, 35.6, 15.3
Keeneland Fall 21.7, 32.6, 37.0, 8.7
Santa Anita Fall 21.9, 18.8, 37.5, 21.9
Aqueduct Fall 23.4, 25.5, 38.3, 12.8
Churchill Fall 23.6, 30.9, 36.4, 9.1
Hollywood Fall 16.7, 27.8, 36.1, 19.4
Gulfstream 30.4, 27.2, 33.6, 8.8
Santa Anita Winter 21.1, 29.6, 38.0, 11.3

I would attribute the very high "led" rate at Belmont Spring/Summer to a few factors. I believe Belmont's Spring/Summer maiden special weights typically drew very few horses. In that respect, the high percentage of wire-to-wire winners (well, first-call-leader winners; included are horses who lost the lead at one of the intervening calls) is artifactual.

Belmont's standard 2-year-old maiden special weight distances at that time of year are 5 and 5.5 furlongs, and those distances probably favor front runners more than 7f races, say. The emphasis of this study was on future success, and arguments can be made for either 5f or 7f maidens being a superior proving ground for top horses, or for neither distance being more telling in a first win. So I was comfortable including some range in distances. But it’s fairly intuitive that the shorter the maiden, the more important prominent early placement is in that race that day, making distance a variable that needs to be eliminated or controlled for in simple comparisons of wins by style. So you can count distance as one of those variables that I now see would have been useful, but that I left out of the initial data collection in the interests of time.

The relationship between time of year and effectiveness of style is difficult to gauge in general. In the spring/summer, Churchill's and Hollywood's front runners trail Belmont's dramatically, yet their percentage of debut wins is surpassed at only one of the other ten meets (Saratoga). The baby races at Churchill Spring/Summer and Hollywood Spring/Summer are usually short, too. Are the spring/summer 2-year-old maidens kind to front runners because early speed is at a premium with very young horses, or because early speed is at a premium in very short races? The bright side of distance and time of year being as strongly related as they are is that there are relatively few races at under 6f late in the year, when one wants separate estimates of winning based on early position.

A factor that might separate Belmont in the spring and summer from Churchill and Hollywood is that the Belmont 5f races feature a very short run to the first turn, and not being on the lead often means going wide and giving away lengths.

The spring/summer races are also distinct for their unkindness to "behind" runners. The low percentages in that category for Belmont, Churchill, and Hollywood don't seem to belong with the other meets.

There were nine such winners at the spring/summer meets, and I checked to see if they were guaranteed stars, but an expected two of nine went on to be graded stakes winners, and four earned less than $100,000. (One year before the study, however, a future Kentucky Derby winner, Grindstone, came from last place to take his debut at Belmont in June by 5 lengths.)

It might be coincidental, and the sample sizes are the study's smallest, but I saw some symmetry in the high "behind" winning percentages at Santa Anita Fall and Hollywood Fall. I can't think of a good artifactual explanation, that is, a characteristic of the qualifying races at these meets that would favor debuting closers more than at other meets. So the success of the debuting closers may really indicate that good horses of that kind tend to debut at those meets.

It certainly appears that as the crop ages, speedy debut types have less and less of an advantage. Actually, at risk of analyzing the data too finely, I would submit that the front-runner advantage is particularly pronounced in the spring and summer meets, somewhat reduced at the boutique summer meets, and then stable at an again lower level through Gulfstream and Santa Anita Winter. The breakdown appears this way.

Table 7. Proportion of debut winners who led at first call

Spring/Summers 40.6%
Del Mar and Saratoga 35.1%
After Del Mar and Saratoga 24.6%

Given the generally large fields and the carding of races 6f and longer, the 37.2% front-runner proportion at Saratoga might be the most notable of all in that category. We can perhaps attribute some of it to Saratoga front runners being potential superstars in the making, but if you see Part III in my series, not very much. I'm such an opponent of the track-bias perspective that, for all the times I've been there, I can't tell you whether Saratoga is generally thought to be a speed-favoring track or not, but a speed-favoring track could certainly be a part of the high "led" proportion.

Monday, June 25, 2012
