In their daily investment practice, quant investors can use machine learning models to predict one-month-forward stock returns from specific stock characteristics. Investors generally believe that the more data used to train a machine learning model, the better it performs. However, this is not always the case. During the Inquire Europe autumn seminar, Clint Howard presented his research ‘Less is more? Biases and overfitting in cross-sectional machine learning return predictions’, which addresses this question.
For those who were unable to attend his presentation, a summary follows:
“Machine learning is having its moment in the spotlight in finance. We’re seeing many successful applications of machine learning models in areas such as predicting stock and bond returns and forecasting volatility and beta. There is a pervasive view that more data is always better: the more data you use to train your machine learning model, the better it is going to perform in the future. The rationale is that with more data you overfit less and can therefore expect better performance out of sample. However, there is more nuance to it than this simple rule.
The financial literature is still in the early stages of investigating the use of machine learning models to forecast cross-sectional stock returns. There is currently no uniform modeling framework that allows findings to be compared across studies, and no consensus on whether more data makes for a better machine learning model.
Normally, one would use a machine learning model to predict the entire cross-section of stock returns, use those predictions to form portfolios, and then analyze the performance of those portfolios. What I did differently was split the data into three groups: small caps, mid caps, and large caps. I then trained an individual model for each group, combined the predictions, and ran the portfolio analysis. When you look at the statistics, these size-specific models generally do better. This goes against the general consensus: I used less data in each size-specific model, yet achieved significant improvements in the overall performance of the strategy. Training machine learning models specific to market-capitalization groups resulted in enhanced stock-level return predictions and better performance of long-short portfolios.
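The size-group approach described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the author's actual pipeline: the column names, the tercile-based size cutoffs, and the simple linear model (standing in for the machine learning models used in the research) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic cross-section: market cap, two stock characteristics,
# and a noisy one-month forward return (illustrative data only).
rng = np.random.default_rng(0)
n = 900
df = pd.DataFrame({
    "mktcap": rng.lognormal(10, 2, n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["fwd_ret"] = 0.02 * df["x1"] - 0.01 * df["x2"] + rng.normal(0, 0.05, n)

# Split the cross-section into three size groups (terciles here;
# the actual group definitions are an assumption).
df["size_group"] = pd.qcut(df["mktcap"], 3, labels=["small", "mid", "large"])

# Train one model per size group and combine the predictions.
# A least-squares linear model stands in for the ML models used in practice.
features = ["x1", "x2"]
preds = pd.Series(index=df.index, dtype=float)
for group, sub in df.groupby("size_group", observed=True):
    X = np.column_stack([np.ones(len(sub)), sub[features].to_numpy()])
    beta, *_ = np.linalg.lstsq(X, sub["fwd_ret"].to_numpy(), rcond=None)
    preds.loc[sub.index] = X @ beta

df["pred"] = preds  # combined predictions feed the usual portfolio analysis
```

The combined `pred` column can then be sorted into long-short portfolios exactly as in the standard single-model workflow; only the training step differs.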
These improvements are significant when compared to previous studies that used the broader cross-section of stock returns, challenging the common belief that more training data always leads to superior model performance. The observed improvement can be attributed to the absence of regularization of the target stock returns. When models are trained on the entire cross-section using unprocessed excess returns, they tend to overfit, particularly to smaller stocks, leading to subpar performance in value-weighted long-short portfolios. However, by applying a suitable regularization, such as subtracting the cross-sectional median return within size groups, similar performance enhancements can be achieved without the additional computational overhead of training extra models. These findings underscore the importance of careful and well-informed use of machine learning models in asset pricing”.
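The regularization mentioned above, subtracting the cross-sectional median return within each size group, is a one-line transformation of the training target. The sketch below illustrates it on synthetic data; the column names and tercile-based size groups are assumptions, not the paper's exact setup.

```python
import numpy as np
import pandas as pd

# Synthetic cross-section of excess returns (illustrative data only).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "mktcap": rng.lognormal(10, 2, 600),
    "excess_ret": rng.normal(0.01, 0.08, 600),
})
df["size_group"] = pd.qcut(df["mktcap"], 3, labels=["small", "mid", "large"])

# Subtract the cross-sectional median return within each size group,
# so the model is trained on relative (within-group) performance
# rather than raw excess returns.
group_median = df.groupby("size_group", observed=True)["excess_ret"].transform("median")
df["target"] = df["excess_ret"] - group_median
```

By construction, the median of `target` within each size group is zero, so no single group's return level can dominate the fit, which is the overfitting channel the research identifies.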
Inquire Europe members can access the research and the presentation slides via: https://www.inquire-europe.org/event/autumn-seminar-2023/