Debunking r/WallStreetBets with Machine Learning


In 2021, the investment world was turned upside down by a small community of day traders known to the world as r/WallStreetBets. They took on the investment sharks by collectively shooting their shot against stocks like GameStop and eventually AMC. The event was big enough to turn politicians’ heads, influence Reddit’s first Super Bowl ad, and inspire a vision for the future of the small investor. It was a spectacle we had never seen before, and one that leaves us wondering whether it will happen again. One major takeaway is that the number of eyes watching the subreddit and the stock exchange has grown at least tenfold.

This got me thinking about the influence of social media on the stock market. Is there any correlation between the stock market and activity on r/WallStreetBets?

As per usual, we’re going to investigate this using machine learning.

To do this, we need data for both Reddit and the stock market, and from there we need a common attribute to link the two together. Here are the two tools I came across that helped with getting started:

Beneath is a serverless real-time data platform. They have datasets covering trending Reddit posts, the Ethereum blockchain, and most importantly: r/WallStreetBets analytics. For this experiment, I’ll be using a dataset of stock symbol mentions in posts.

Alpha Vantage is a developer-friendly API that provides financial market data. It can be used for historic information on a variety of publicly-traded stock prices. See the example below for Tesla (TSLA) stock.

[Figure: sample Alpha Vantage daily price data for TSLA]

Connecting the Dots

As I mentioned earlier, Beneath has data on stock mentions on the r/WallStreetBets subreddit. This data set specifically covers top-level posts, rather than comments on a post. Other data points include the number of stock symbol mentions found in the title, the time the post was made, and the length of the post body, among other features. Speaking of which, here are the top 50 stocks mentioned in r/WallStreetBets posts:

[Figure: the top 50 stock symbols by number of mentions in r/WallStreetBets posts]

Furthermore, since we have the times at which posts were made, we can break down posts by month. Here are the monthly breakdowns for the top two symbols, GME and AMC:

[Figures: monthly mention counts for GME and AMC]

Given that we have data available as a time series, we can intuitively map it to historic stock performance. We can look at a stock’s month-over-month growth and assess whether it trends with, say, stock mentions or post length.

Though before we continue, I have a confession and a slight adjustment to make: I am not a stock analyst (nor should my blogs be taken as financial advice!), and I hadn’t immediately noticed that Alpha Vantage provides unadjusted stock data at its free tier. This means the API reports raw historical prices that are never corrected for later stock splits or dividends, so they don’t line up with today’s price series. For example, if this code sequence is run:

from alpha_vantage.timeseries import TimeSeries

# key is an Alpha Vantage API key
ts = TimeSeries(key=key, output_format='pandas')
# Full daily price history for NVIDIA, returned as a pandas DataFrame
nvda, _ = ts.get_daily(symbol='NVDA', outputsize='full')
# Inspect one day's prices
nvda.loc['2021-01-14']
[Figure: the returned daily price data for NVDA on 2021-01-14]

What comes back shows NVIDIA trading between $527.22 and $543.99 that day, whereas at the time of writing the stock’s actual all-time high is closer to $325.85; the raw January figures predate NVIDIA’s 4-for-1 split in July 2021. Long story short, unadjusted data introduces discrepancies over time that make it less accurate for the machine learning algorithm.
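For reference, the alpha_vantage library also exposes an adjusted endpoint that corrects for splits and dividends, though Alpha Vantage may gate it behind a paid tier. A quick sketch reusing the TimeSeries object from above (the column name follows the API’s field naming):

# Adjusted daily prices, corrected for splits and dividends
nvda_adj, _ = ts.get_daily_adjusted(symbol='NVDA', outputsize='full')
nvda_adj.loc['2021-01-14', '5. adjusted close']

Since I stayed on the free tier, the unadjusted numbers are what we have to work with.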

I could have approached this with regression, but because the raw numbers don’t line up cleanly, we’ll instead shift this to a binary classification problem, where the classes will be whether a stock increased or decreased in the past month. To accomplish this, we manipulate the numbers as follows (a code sketch follows the list):

  1. Get the daily average price for each day the market was open:
    (high + low) / 2.
  2. Get the monthly average price by averaging daily averages over a month.
  3. Compare the monthly average of one month to the monthly average of the next month. Is there an increase or decrease?
  4. An increase converts the data point to a +1, and a decrease to a -1. Now we have our class labels! We’ll cross-reference stock symbol and date to append the labels to the Reddit data.
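Here is a minimal sketch of those four steps in pandas, assuming the daily prices come from the Alpha Vantage call shown earlier (so the high and low columns carry the library’s default names):

import pandas as pd

def monthly_labels(daily: pd.DataFrame) -> pd.Series:
    # Step 1: daily average price for each trading day
    daily_avg = (daily['2. high'] + daily['3. low']) / 2
    # Step 2: average the daily averages within each calendar month
    monthly_avg = daily_avg.sort_index().resample('M').mean()
    # Steps 3 and 4: compare consecutive months; +1 for an increase, -1 for a decrease
    return monthly_avg.diff().dropna().apply(lambda d: 1 if d > 0 else -1)

labels = monthly_labels(nvda)  # e.g. the NVDA DataFrame from earlier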

Our cleaned dataset pairs these labels with features drawn from the Reddit posts.

Here’s a list of all the features I have available:

  • num_mentions_title: How many times stock symbols are mentioned in the post title.
  • num_mentions_body: How many stock symbols are mentioned in the post body.
  • num_unique_symbols_title: The number of unique symbols in the title. For example, having both “GOOGL” and “AMZN” in the post title would result in this value equal to 2.
  • num_unique_symbols_body: The number of unique symbols in the post body.
  • length_title: Character length of the title.
  • length_body: Character length of the body.
  • num_unique_posts: Number of posts that mention the stock symbol at least once. This is acquired by counting the corresponding rows in the given data set.

With this prepared, let’s talk for a moment about the machine learning application.

Since we’re addressing a classification problem, we’ll use the handy Gradient Boosted Tree. This decision is motivated by the robustness of the algorithm. See this excerpt from [1] that I often reference when practicing machine learning:

Similarly to other tree-based models, the algorithm works well without scaling and on a mixture of binary and continuous features. As with other tree-based models, it also often does not work well on high-dimensional sparse data.

If you’re new to Gradient Boosted Trees, they are an ensemble of decision trees. Decision trees are learned classifiers (or regressors) that make a prediction based on thresholds learned from training data. For example, a stock price can go up or down, and my model might make that prediction depending on whether there were more or fewer than 300 mentions of a stock on the subreddit last month. A random forest combines many such trees, each trained independently, and averages their predictions. The “Gradient Boosted” element means the trees are instead trained sequentially, with each new tree learning from the errors of the ones before it.

Here are the parameters I’ll modify to create different models. See the scikit-learn documentation for more details:

  • learning_rate: this modifies the contribution of each tree
  • n_estimators: the number of boosting stages, i.e. the number of trees in the ensemble
  • max_depth: the maximum depth of a tree
  • min_samples_split: the minimum number of samples required to split a node (i.e. for a branch-off to occur from that node)
  • min_samples_leaf: the minimum number of samples required to be at a leaf node
  • max_features: the number of features to consider when looking for the best split

With that, let’s zoom out for a moment and consider the next steps.

Metrics and Parameter Tuning

We’ll take the data shown above, attempt to fit a model to it, observe the performance, and tune parameters to get the best performance we can. The performance metric I’m using is the area under the ROC curve (AUC-ROC). Here’s a fantastic post by Sarang Narkhede explaining the metric in detail, but if you’re short on time: it measures how well the model separates the ground truth classes, based on the true positives, false positives, true negatives, and false negatives of its predictions. The AUC score sits between 0 and 1; generally, the higher the score, the better the prediction performance.

To start off, I will train a model using just learning_rate=0.0001, n_estimators=10000, and max_depth=4. Once trained, I will use that model to predict on both the training data set and the testing data set.
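Here’s a minimal sketch of that first fit, assuming the prepared features and labels have already been split into X_train, X_test, y_train, and y_test (those names are my own):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

model = GradientBoostingClassifier(learning_rate=0.0001, n_estimators=10000, max_depth=4)
model.fit(X_train, y_train)

# AUC on both splits, scored on the predicted probability of the +1 class
print('Train AUC:', roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))
print('Test AUC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))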

Train AUC: 0.513355592654424
Test AUC: 0.4996242753882487

Pretty low scores! Scores around 0.5 suggest that the model is about as effective as guessing whether a stock will go up or down. To have any utility, a respectable AUC should be no lower than, say, 0.65, and ideally above 0.7. But we aren’t done; let’s conduct a sweep over all the parameters to see if there is a sweet spot that improves the AUC. We’ll try to find the best value for each parameter, then train a final model and examine its performance.

Below you will find an interval defined for each parameter, and the AUC performance from tuning only that parameter, while the remaining parameters are set to their defaults as outlined in the scikit-learn documentation.
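Here’s a sketch of how such a sweep might look, reusing the same assumed splits; the value grids below are illustrative, not the exact intervals I used:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# One parameter is varied at a time; the rest stay at their defaults
grids = {
    'learning_rate': np.logspace(-4, 0, 9),
    'n_estimators': [10, 50, 100, 200, 500, 1000],
    'max_depth': [1, 2, 3, 4, 6, 8],
    'min_samples_split': np.linspace(0.05, 0.5, 10),
    'min_samples_leaf': np.linspace(0.05, 0.5, 10),
    'max_features': [1, 2, 3, 4, 5, 6, 7],
}

for name, values in grids.items():
    for value in values:
        model = GradientBoostingClassifier(random_state=0, **{name: value})
        model.fit(X_train, y_train)
        train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f'{name}={value}: train AUC {train_auc:.3f}, test AUC {test_auc:.3f}')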

[Figures: train and test AUC from sweeping each of the six parameters individually]

We can gather a few things here. First, by observing the blue lines for training, we see that we can increase AUC scores and therefore achieve some sort of “fit” to the training data. This is good news for starters, but our experiment’s hypothesis also depends on test data performance, which is very underwhelming at a glance.

You’ll also see the upward trends for learning rate, tree depth, and n_estimators, showing that our model capacity increases with these parameters. Simply put, this means that we can reduce our training error using these parameters. Looking closely, we can also see a slight upward trend in testing AUC for these three values. Though no model reaches even a 0.6 AUC score, we’ll give the data the benefit of the doubt and now train a model using all the parameters.

Here’s what I’ve set each parameter to:

  • learning_rate=0.5
  • n_estimators=200
  • max_depth=3
  • min_samples_split=0.1
  • min_samples_leaf=0.2
  • max_features=6
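Put together, the final model looks like this (same assumed splits as before):

from sklearn.ensemble import GradientBoostingClassifier

final_model = GradientBoostingClassifier(
    learning_rate=0.5,
    n_estimators=200,
    max_depth=3,
    min_samples_split=0.1,
    min_samples_leaf=0.2,
    max_features=6,
)
final_model.fit(X_train, y_train)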

And the results:

Train AUC: 0.6432344797435151
Test AUC: 0.5291709844559586

Unfortunately, we aren’t seeing any meaningful jump in scores here. I mentioned earlier that we can achieve some kind of fit to the training data, but alas, this fails to translate to testing data. In context, this means that we can model some of the historical events of r/WallStreetBets, but we cannot use that model to predict how the subreddit will affect the future of the stock market (nor the cryptocurrency exchange, for that matter).

My last trick here was to select the training data more carefully. GameStop and AMC stock mentions on the subreddit dwarf those of any other stock, making them huge outliers that can skew how the model interprets mentions of every other symbol, so I removed them.
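Assuming the cleaned data sits in a pandas DataFrame with a symbol column (the DataFrame and column names here are my own), the filter is a one-liner:

# Drop the two outlier tickers before re-splitting and retraining
filtered = df[~df['symbol'].isin(['GME', 'AMC'])]

Even with those two symbols gone, there is still no significant increase in AUC: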

Train AUC: 0.647909558084076
Test AUC: 0.5176788542312194

Does this come as a surprise?

The stock market is influenced by a variety of forces: regulations, politics, company performance, current events, and occasionally an onslaught of YOLO investors in the form of a subreddit. Nobody has been able to model the entire stock market perfectly (or at least, if they have, they did not tell anybody), and the internet is no exception.

That said, hopefully you aren’t reading this blog seeking validation for chucking your life savings into r/WallStreetBets. If you are, I hear they have a subcommunity to console those who experience losses.

In this blog, we took a closer look at how r/WallStreetBets intersects with the stock market. We examined activity on the subreddit and analyzed whether there was any significant correlation between mentions of stocks and monthly increases or decreases in stock price. Using machine learning, we failed to find any meaningful signs of Reddit’s influence on the market.

That is not exclusively the fault of Reddit, however. One shortcoming of this experiment is the limited data. Having the upvote-to-downvote ratio of posts would have been a good indicator of how well a post was received, and consequently its potential to influence. Additionally, our data set pulls information from top-level posts only; it does not draw symbol mentions from comments, nor does it assess the community’s sentiment for or against a post. Finally, while gradient boosted trees are a robust model, perhaps an alternative algorithm or architecture could have better luck.

To wrap up, the stock market is complex, but it is finite. Various researchers, scientists, and analysts have spent years modeling trends in the market in hopes of perfecting investment strategy. You won’t find that here, though, nor on r/WallStreetBets, so I advise you to be smart about your investments by sticking to facts and listening to experts.

If you’d like to view the source code for this experiment, you are welcome to check out my GitHub repo. It includes some things I did not talk about here, such as data preprocessing and additional parameter tuning methods.

[1] A. C. Müller and S. Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists. O’Reilly Media, Inc., 2016.
