Predicting stock or foreign exchange prices with ML models

Hi all,

I’ve been developing ML models that use historical data to predict the future direction (up/down) of these prices. Below is a description of what I did to get 60% accuracy in my up/down predictions and some of the issues I ran into. Any advice to help me improve the accuracy would be greatly appreciated!

Data
I took historical candlesticks (Open, High, Low, Close, Volume) for the EUR/USD pair from the Oanda foreign currency exchange. I then added technical indicators via the TA-Lib library, grouping them into the feature groups defined in TA-Lib’s official documentation.

Models
I tested both classification (predicting only whether the closing price moves up or down) and regression (predicting the next price) by iterating over these models:

  • AdaBoostClassifier
  • AdaBoostRegressor
  • CatBoostClassifier
  • CatBoostRegressor
  • ElasticNetCV
  • GradientBoostingClassifier
  • GradientBoostingRegressor
  • LassoCV
  • LGBMClassifier
  • LGBMRegressor
  • LinearRegression
  • LogisticRegression
  • RandomForestClassifier
  • RandomForestRegressor
  • RidgeClassifierCV
  • RidgeCV
  • XGBClassifier

Testing Methodology
I split the testing into 3 phases.

  • Phase I - Initial Test (>10K variations) - I first tested over 10K variations of the models with different feature sets and hyperparameters on the first 2 weeks of 2023 to find the most promising models.
  • Phase II - Cross-Validation (500 variations) - I then picked the 500 highest-performing model variations from Phase I and tested them on data from mid-January 2023 until October 1, 2023.
  • Phase III - Final Test (10 variations) - I then picked the top 10 variations from Phase II and tested them on data from October 1, 2023, until the end of December 2023.
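
As a rough illustration of the three phases above, the sketch below splits time-ordered rows into consecutive, non-overlapping date ranges so that no later data leaks into an earlier phase. The exact cut-off days, the `PHASES` table, and the `split_by_phase` helper are assumptions made for the example (the description only says "mid Jan"), and the rows are placeholders rather than real candles.

```python
from datetime import date, timedelta

# Phase boundaries follow the description above; the exact cut-off days
# are assumptions, since the post only gives approximate dates.
PHASES = [
    ("phase_1", date(2023, 1, 1), date(2023, 1, 15)),
    ("phase_2", date(2023, 1, 15), date(2023, 10, 1)),
    ("phase_3", date(2023, 10, 1), date(2024, 1, 1)),
]

def split_by_phase(rows):
    """Group (day, value) rows into phases using half-open [start, end) ranges."""
    buckets = {name: [] for name, _, _ in PHASES}
    for day, value in rows:
        for name, start, end in PHASES:
            if start <= day < end:
                buckets[name].append((day, value))
                break
    return buckets

# Hypothetical placeholder rows: one per calendar day of 2023.
rows = [(date(2023, 1, 1) + timedelta(days=i), i) for i in range(365)]
buckets = split_by_phase(rows)

for name, items in buckets.items():
    print(name, len(items), "rows")
```

Because the ranges are half-open and ordered, every row lands in exactly one phase, and every Phase I row precedes every Phase II row. Note how few rows the first phase holds compared with the others.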

Issues
I’ve been able to get over 60% accuracy on my test data in Phase I. But when I test the models on the cross-validation set, only some of them still return around 60% accuracy, and it’s hard to predict a priori which model configurations will do well.

Ideally, I’d like to see the best-performing models in Phase I do well in subsequent phases. But I’m not seeing that in my results. Rather, I’m seeing some models that didn’t do as well in Phase I rise to the top in Phases II and III.

A few observations:

  • 60% accuracy is barely better than flipping a coin. So your models are not doing very much.

  • If this were an easy problem to solve, everyone would be getting rich from stock market investments.

  • You seem to be trying to solve a very difficult problem by simply trying a zillion different models. Maybe the issue is with the data set.

  • Perhaps the data you have on historical performance does not help with predicting future performance.
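
The concern about trying a very large number of models on a short window can be illustrated with a small simulation. In this sketch every "model" is a pure coin flip, yet the best of 10,000 still looks skilled on the selection window and then behaves like a coin flip again on new data. The sample sizes (200 selection predictions, 2,000 hold-out predictions) and the `simulate` helper are assumptions for illustration, not figures from the original experiment.

```python
import random

def simulate(n_models=10_000, n_select=200, n_holdout=2_000, seed=0):
    """Pick the best of many coin-flip 'models', then re-test it on fresh data."""
    rng = random.Random(seed)
    best_in_sample = 0.0
    for _ in range(n_models):
        # Each model guesses up/down at random, so every guess is right
        # with probability 0.5; count the 1-bits as correct guesses.
        correct = bin(rng.getrandbits(n_select)).count("1")
        best_in_sample = max(best_in_sample, correct / n_select)
    # The selected model has no real skill, so on new data it is
    # simply another sequence of coin flips.
    holdout_correct = bin(rng.getrandbits(n_holdout)).count("1")
    return best_in_sample, holdout_correct / n_holdout

best, holdout = simulate()
print(f"best in-sample accuracy:  {best:.3f}")
print(f"same model out of sample: {holdout:.3f}")
```

If the models that top the first phase do not stay on top later, this kind of selection effect is one possible explanation worth ruling out.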

Thanks @TMosh ! How would you approach this problem if you were in my shoes? I’m a relative newbie to using ML models. For example, I’m also interested in using DNN and LSTM but am not sure what the best approach to figuring out a good network architecture is.

I’m also tagging @elirod since he replied to my other post.

@TMosh and @elirod, here are some of the hypotheses I tested:

  • TA-Lib Feature Groups - I hypothesized that these provide valuable additional information. So I tested each model with different combinations of feature groups, adding all the features from each selected group to the data set.
  • Larger Price Movements - I added a filter which only considered returns >= a multiple of a trailing standard deviation of historical returns to be good. Using this method, several models returned 100% accurate predictions but only yielded about 3 predictions across all of 2023. So, this could have been due to random chance.
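
A minimal sketch of the larger-price-movement filter, plus a quick check of how likely a perfect score is by luck alone. The window length, the multiple `k`, the symmetric treatment of up and down moves, and both helper names are assumptions for illustration; the description above does not specify them.

```python
import statistics

def large_move_labels(returns, window=20, k=2.0):
    """Label a return +1 or -1 when its size is at least k trailing standard
    deviations, else 0. Only the `window` returns before it are used, so the
    threshold never looks at the value being labelled."""
    labels = []
    for i, r in enumerate(returns):
        if i < window:
            labels.append(0)  # not enough history yet
            continue
        sigma = statistics.stdev(returns[i - window:i])
        if sigma > 0 and abs(r) >= k * sigma:
            labels.append(1 if r > 0 else -1)
        else:
            labels.append(0)
    return labels

def chance_all_correct(n_predictions, p_correct=0.5):
    """Probability that a no-skill guesser gets every prediction right."""
    return p_correct ** n_predictions

# Hypothetical returns: 20 small moves, one large move, one small move.
returns = [0.001, -0.001] * 10 + [0.01, -0.0005]
labels = large_move_labels(returns)
print(labels[-2:])
print(chance_all_correct(3))
```

With only three predictions, a coin-flip guesser is right every time with probability 0.5³ = 0.125, so a perfect score on such a small sample is weak evidence of skill.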

Are there any other hypotheses you’d recommend I explore? Thanks!

Frankly, I would pick a project that is more likely to have a solution. Predicting the future, when it is dominated by random events, is very difficult.

Thanks @TMosh. What types of projects would you recommend? Also, just to close out this topic, how do you recommend I look for a good neural network architecture in terms of hidden layers and activation functions for predicting the direction of future movement? Thanks!

The sizes of the hidden layers (and the number of hidden layers) are determined by experimentation. You want enough complexity to get good-enough results, but not so much that training takes excessively long or consumes too much memory.

I don’t have any recommendations about projects, because I don’t know what your purpose or goals are.

Thanks @TMosh. My main purpose is to learn how to use ML models to predict future events. For background, I have a CS degree from MIT and used to be a product manager at Facebook. My team made heavy use of ML to predict which notifications users would click or which apps users would install after seeing ads.

I’m wondering if I should do some competitions on Kaggle. What do you think about that?

Also, I noticed questions in these forums about exercises such as “Loss Function Labeling Question? W3 Lab06” (Machine Learning Specialization / Supervised ML: Regression and Classification - DeepLearning.AI). What courses are those for?

This worked for social media because it’s not very much influenced by random events - it’s a relatively closed system.

And it doesn’t really work extremely well for social media either; witness the long history of people getting recommendations that make absolutely no sense.

Regarding what courses the questions come from - usually it’s in the thread header right below the title.

Kaggle is an interesting place: part tutorial host, but mostly a host for competitions. The tutorials are largely used as a training ground to recruit people to participate in their commercial challenges.

Kaggle’s motivations are a different topic entirely. But it’s not a bad place to gain experience.

Another option is to download some free datasets from various online repositories, and do your own experiments.

I think your model itself is hopeless. The stock market is not a self-propelled, isolated system. It is affected by many external events.

I would use an LSTM on financial articles to generate a sentiment score for a stock. It will not be amazing, and it still won’t predict price with good accuracy, but it will give some insight. If possible, I would then compare the model’s sentiment with human sentiment (your own), and compare both with general price movements. One will correlate more than the other (I imagine both will be poor). I do not think it will provide any financial gain for you, but you may find it fun or interesting. Ultimately, it is a very difficult task: it usually requires extensive training (e.g. a master’s degree in financial engineering), and firms pay quantitative researchers a lot of money to work on it. I would suggest enrolling in Udacity’s AI for Trading course, as it will provide further insight, but it is quite costly.

Hi David, perhaps what I’m about to say you have already done, but I did not find mention of it in your question.

  1. When preparing data, it is advisable to use logarithmic prices instead of raw prices. This reduces distortions from skewness and puts moves at different price levels on a comparable scale. Note that log prices themselves are still non-stationary; it is their differences (log returns) that are much closer to stationary.
  2. Note that after logarithmic transformation, not all functions from the TA-Lib library will work correctly, as it is not designed for such data.
  3. Your objective should not be predicting the price but achieving the right risk-to-reward ratio in a trade. It should be at least 1 to 3. With such a ratio, a 60% success rate is sufficient for good earnings :)
  4. You need more data for training and testing your model.
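
The arithmetic behind point 3 can be sketched as follows. With a 1-to-3 risk-to-reward ratio and a 60% win rate, the expected profit per trade is 0.6 × 3 − 0.4 × 1 = 1.4 times the amount risked. The helper names are hypothetical, and the sketch assumes every win pays the full reward, every loss costs exactly the risked amount, and there are no spreads, commissions, or slippage.

```python
def expected_r(win_rate, reward_to_risk):
    """Expected profit per trade, in units of the amount risked (R).

    Assumes each win pays `reward_to_risk` and each loss costs 1;
    trading costs are ignored.
    """
    return win_rate * reward_to_risk - (1.0 - win_rate)

def break_even_win_rate(reward_to_risk):
    """Win rate at which the expected profit per trade is zero."""
    return 1.0 / (1.0 + reward_to_risk)

print(f"expected R at 60% wins, 1:3 -> {expected_r(0.60, 3.0):.2f}")
print(f"break-even win rate at 1:3 -> {break_even_win_rate(3.0):.2f}")
```

The win rate here is the fraction of trades that reach the profit target before the stop. That is not the same quantity as the accuracy of predicting the direction of the next candle, so the two should be measured separately.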

Best of luck in reaching your goals.

Actually, my team made a killing with the ML models. I was the product manager who led the team responsible for Mobile App Install ads, which are ads for other mobile apps within Facebook. We were able to predict not only people who were likely to install the apps but also those who would spend a lot of money within 48 hours of viewing an ad for a specific app. This led to an increase in ad spending of over $2 billion/year. It was insane how well that model performed!

Thanks @Romanus. What specifically did you mean by logarithmic prices? I’ve used logarithmic returns before but not on prices themselves.

One thing I learned venturing into the ML world is that ‘Bad features beat good models’. If the underlying features do not capture the patterns in the data, you will get bad results. To verify this, check the ‘insurance’ dataset: it’s a simple regression problem but a spot-on illustration of feature importance.

For projects I do recommend starting simple and building your way up, focusing on feature engineering and EDA.

Hi David (@David_Park),

You pose a good question. I have done some work and read material about this area. Here is a suggestion: predicting price is not a good idea; instead, focus on predicting some event or behavior that may affect the price. I hope this can help you.

I mean taking the logarithm of the prices directly and then working on the resulting logarithmic scale: log_close = np.log(close)

In analysis, you can employ two approaches: log returns and a logarithmic scale of prices. The choice depends on your preference and convenience. Each approach has its own limitations.

Most technical analysis indicators are designed to work with positive asset prices rather than negative values or log returns. Therefore, if a sequence of positive and negative numbers (such as log returns) is provided as input to an indicator, there may be some issues. Some indicators may produce incorrect results or may not make sense when working with negative values.

Examples of indicators that may encounter problems:

  • Moving Average (MA): This indicator does not always behave sensibly with negative values or oscillations around zero.
  • Relative Strength Index (RSI): RSI is calculated from positive and negative price changes, so its interpretation may be challenging when the input is already log returns.
  • Moving Average Convergence Divergence (MACD): This indicator is constructed from the difference between two moving averages and may give misleading results with negative values.
  • Stochastic Oscillator: Similarly, the Stochastic Oscillator may behave unexpectedly when using log returns.

To correctly use indicators with log returns or other data with negative values, modifications to formulas or the search for alternative indicators designed to work with such data may be necessary.

By the way, you can always obtain log returns between any two points in your price series from the logarithmic scale; you just need to subtract the values from each other. This follows from the property of logarithms: log_b(x/y) = log_b(x) - log_b(y)
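
A small sketch of that property, using only the standard library (with NumPy, `np.log` applied to an array of closes gives the same log prices). The closing prices are hypothetical values chosen for the example.

```python
import math

closes = [1.0700, 1.0725, 1.0690, 1.0712]  # hypothetical EUR/USD closes

# Logarithmic scale of prices.
log_closes = [math.log(c) for c in closes]

# One-step log returns are differences of neighbouring log prices.
log_returns = [b - a for a, b in zip(log_closes, log_closes[1:])]

# The log return between any two points is a single subtraction...
total = log_closes[-1] - log_closes[0]

# ...and it matches both log(last / first) and the sum of one-step returns.
print(math.isclose(total, math.log(closes[-1] / closes[0])))
print(math.isclose(total, sum(log_returns)))
```

Keeping the log-price series around is therefore enough: log returns over any horizon can be recovered from it on demand.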

Yeah, I took the course but I had a big discount (40%) as I was a student at the time. I also took the monthly subscription and completed it in 2-3 months so I paid a lot less overall. Personally, the course was amazing; it gave me great insight into quantitative research and machine learning within finance. In terms of free resources, academic papers may be useful as this is what some quantitative researchers actually use to generate alpha for investment. I would suggest “Machine Learning for Trading” by Gordon Ritter (SSRN).

Thanks @Romanus. I’ll check that out.

Thanks @Philip_Ilono. I just signed up yesterday and started taking the course. What other courses, if any, have you taken on Udacity / Coursera?
