With the US growing season in full swing, a range of private companies and government organizations are using various data and modelling techniques to predict the outcome of the harvest.
At TellusLabs, our crop yield models are made from a variety of “ingredients” - primarily reflectance data from satellites and weather information from observatories. We use a hybrid of remote sensing expertise and machine learning to determine how to incorporate each ingredient.
The USDA yield figures take a set of ingredients into account too - albeit differently: their estimates are predominantly based on in-season surveys of farmers themselves. Surveys can be a powerful source of data, though they are particularly susceptible to sample bias.
The smallest unit of analysis the USDA uses for its yield estimates and forecasts is the US county. Within each state, there have always been a few counties that fall below the USDA's threshold for individual reporting. Some counties do not provide enough data, likely because of decreased participation or because the USDA is unable to send representatives to assess crop health and potential yield. These counties are combined into an “Other Counties” group.
Over the past fifteen years, the percentage of counties placed in this group has grown from approximately 1% in 2003-2006 to 10% in 2014-2017.
This steady reduction in the number of individually reported counties would not be a problem if the counties moved into the group were producing average yields (i.e. yields in line with the reported counties). However, the counties that end up in the “Other Counties” category report, on average, lower yields than the rest, which creates a strong bias. This suggests that national yield models that only take into account the individually identified counties would be biased upwards.
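To make the direction of that bias concrete, here is a minimal sketch in Python. The acreages and yields are invented purely for illustration (they are not USDA or TellusLabs figures), and `weighted_yield` is a hypothetical helper that computes an area-weighted average.

```python
# Illustrative only: the acreages and yields below are made-up numbers,
# not USDA or TellusLabs data.

def weighted_yield(groups):
    """Area-weighted average yield (bu/ac) over (harvested_acres, yield) pairs."""
    total_acres = sum(acres for acres, _ in groups)
    total_bushels = sum(acres * yld for acres, yld in groups)
    return total_bushels / total_acres

reported = [(900_000, 180.0)]        # individually reported counties (hypothetical)
other_counties = [(100_000, 140.0)]  # "Other Counties" group (hypothetical)

reported_only = weighted_yield(reported)
all_counties = weighted_yield(reported + other_counties)
upward_bias = reported_only - all_counties

print(f"reported counties only: {reported_only:.1f} bu/ac")
print(f"all counties:           {all_counties:.1f} bu/ac")
print(f"upward bias:            {upward_bias:.1f} bu/ac")
```

With these invented inputs, leaving out the lower-yielding group raises the estimate, and the gap widens as the group's share of acreage grows or its yields fall further behind.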
Our modeling team searches for such biases in the reporting and corrects for them in our models. In this case, we adjust our national yield estimates based on the expected direction and magnitude of the bias.
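A minimal sketch of what such a correction could look like, assuming the bias is estimated as the average historical gap between the yield of the individually reported counties and the yield across all counties. The function names and the historical values are hypothetical, and a production model may handle this quite differently.

```python
from statistics import mean

def estimate_bias(history):
    """Mean gap (bu/ac) between reported-county yield and all-county yield."""
    return mean(reported - national for reported, national in history)

def correct_forecast(raw_forecast, bias):
    """Shift a raw national forecast by the expected bias (positive bias = too high)."""
    return raw_forecast - bias

# (reported-county yield, all-county yield) for past seasons; hypothetical values
history = [(172.0, 169.0), (175.5, 171.5), (178.0, 173.0)]

bias = estimate_bias(history)
corrected = correct_forecast(170.0, bias)  # 170.0 is a hypothetical raw forecast

print(f"estimated bias:     {bias:.1f} bu/ac")
print(f"corrected forecast: {corrected:.1f} bu/ac")
```

A plain average is the simplest possible choice here. Because the share of counties in the group has been growing, a trend-aware estimate might be a better fit, but that is beyond this sketch.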
Throughout the season (and in the wake of the NASS August Report), we've had plenty of questions regarding some of the key drivers of our below-consensus view on corn. We don’t think the “Other Counties” sample bias alone could explain such a wide difference between our current view and market consensus: in the current season, our correction for corn adjusts our raw forecasts down by 3.9 bu/ac. However, we do suspect that it is one of the elements in play.
We think that the identification and remediation of data biases are pretty fascinating - to us, they're crucial parts of model construction.
If you’re interested in discussing our models further and trialling the Kernel product, please contact us.