Beating the Stock Market with Machine Learning: Better Algorithms or Better Data? (DSI-SRP)
This DSI-SRP fellowship funded Qinlian Yang to work with Dr. Jesse Blocher in the Owen Graduate School of Management during the summer of 2021. Qinlian is a senior majoring in Mathematics with minors in Computer Science and Business Administration.
The project funded by this fellowship aims to understand how to predict the stock market more accurately using machine learning, whether through better algorithms or better data. Forecasting financial data is challenging because of unprecedented changes in economic trends and the complex relationships between predictors and expected returns. Some people believe that adding more features to the dataset monotonically increases the accuracy of the model, while others assert that exploring more advanced approaches significantly improves the reliability of prediction results. The project focuses on how to allocate scarce resources when forecasting financial data. Data is needed, but so are good models and theories that explain it. But what if we only have limited resources? Should we invest more in advancing algorithms or in obtaining new, informative data? First, this project uses machine learning and deep learning methods to predict stock returns and measures the performance of each algorithm. Second, it explores the interrelationship between data and algorithms by testing how different algorithm choices and datasets affect the end results.
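The comparison described above, where every algorithm is evaluated on every dataset, can be sketched as a small grid experiment. The following is a minimal illustration only, not the project's code: the synthetic data, the two stand-in algorithms (a linear model and k-nearest neighbors), and all names such as `make_data` and `run_grid` are assumptions made for this example. The synthetic target is deliberately nonlinear in the base feature, so the output demonstrates the experimental design rather than any finding about real stock returns.

```python
import random


def make_data(n, seed=0):
    """Synthetic stand-in for stock data (an assumption, not the project's data).

    y depends nonlinearly on the base feature x1; x2 plays the role of a weak
    'extra' feature such as a news-analytics signal.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1, 1)
        x2 = rng.uniform(-1, 1)
        y = x1 * x1 + 0.05 * x2 + rng.gauss(0, 0.05)
        rows.append(((x1, x2), y))
    return rows


def fit_linear(X, y, lr=0.1, epochs=300):
    """Least-squares linear model fitted by batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, xi)) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        b -= lr * gb / n
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
    return lambda x: b + sum(wj * xj for wj, xj in zip(w, x))


def fit_knn(X, y, k=5):
    """k-nearest-neighbors regression: average the targets of the k closest points."""
    def predict(x):
        order = sorted(
            range(len(X)),
            key=lambda i: sum((a - c) ** 2 for a, c in zip(X[i], x)),
        )
        return sum(y[i] for i in order[:k]) / k
    return predict


def mse(model, X, y):
    """Mean squared error of a fitted model on (X, y)."""
    return sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def run_grid(n_train=200, n_test=100, seed=0):
    """Evaluate every (algorithm, feature set) pair on the same train/test split."""
    data = make_data(n_train + n_test, seed)
    train, test = data[:n_train], data[n_train:]
    feature_sets = {"base": (0,), "base+extra": (0, 1)}
    algorithms = {"linear": fit_linear, "knn": fit_knn}
    results = {}
    for alg_name, fit in algorithms.items():
        for fs_name, cols in feature_sets.items():
            X_tr = [tuple(x[c] for c in cols) for x, _ in train]
            y_tr = [t for _, t in train]
            X_te = [tuple(x[c] for c in cols) for x, _ in test]
            y_te = [t for _, t in test]
            results[(alg_name, fs_name)] = mse(fit(X_tr, y_tr), X_te, y_te)
    return results


if __name__ == "__main__":
    for (alg, features), err in sorted(run_grid().items()):
        print(f"{alg:>6} | {features:<10} | out-of-sample MSE = {err:.4f}")
```

Reading the resulting table along one axis shows the effect of changing the algorithm while holding the data fixed; reading it along the other shows the effect of adding features while holding the algorithm fixed. A real study would replace the synthetic data with actual return data and add metrics such as portfolio return.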
We find that choosing a better algorithm matters more than gathering extra data, both for improving prediction accuracy and for achieving a higher portfolio return. As the result summary table shows, tree-based (decision tree, random forest), boosting-based (Gradient Boost, AdaBoost), bagging-based, and deep learning (CNN, RNN_LSTM) methods all perform significantly better than the rest of the algorithms, regardless of which dataset is used. It is also noteworthy that prediction results are almost identical across these groups. Among the algorithms that do not perform well, adding extra news analytics data does not make a noticeable difference. For several of them, the extra data actually lowers performance. Even where the difference is positive, the improvement is very small, meaning that adding more data features yielded little to no gain. Overall, choosing the right algorithm leads to significantly better performance in all cases.
In addition to receiving support through a DSI-SRP fellowship, this project was supported and facilitated by the DSI Data Science Team through their regular summer workshops and demo sessions.