Dear all,
After taking week1 of Structuring ML Projects, I am trying to apply what I have learned to my area of work. I work with time series data in industrial environments (forecasting temperatures, recommending machine setpoints…).
I always struggle with the train/dev split, because I have never found a better way than splitting over time (no shuffling). If I shuffle, data from the future leaks into the training set, and the model overfits the dev set.
But then, most of the time, when the train error is good enough, the dev and especially the test set are very bad, probably because the data distribution is different.
So, I end up not knowing what to do. I have to split over time because I can’t leak data from the future into the training set, but I know the test and dev sets will have different distributions (because machine behavior and raw material are always evolving), and I can’t get more data on these evolving behaviors because the data is generated in real time up to the present time. How could I proceed?
Lastly, in an industrial environment, it is very difficult for me to estimate the Bayes error or human error. What is the Bayes error when predicting an oven temperature that depends on 20 variables? This makes it impossible for me to decide whether I should try to reduce bias or variance.
Time series are a different kettle of fish, because the sequence is vital - it isn’t just a pile of data where you can randomize the order without impact.
I am still trying to understand the situation. What is the time frame of the distribution shift? Does it change, say, every week? Every month? Then, how much data can you get in that time frame? And were you training the model only on data from that time frame?
Is your data just one time series? If more than one, is there any time series whose behavior does not shift that quickly? I am imagining: what if you have, say, 5 time series and only 2 of them have a short time frame?
Did you do cross-validation without time leakage?
I think “20 variables” is not really relevant to the question. Human error is just how well a human can do on the objective itself (temperature prediction here). Humans are certainly not good at that, so I would ask: what is the best that I have seen? If my competitor reaches an error of 1%, then I would take that as my human error.
Hello Raymond,
Thank you for your answer!
Every case and machine is different, but to narrow it down to one case:
I have data from the last 12 months, and I saved the last 1.5 months as a test set. For the training/dev set, I use 5-fold cross-validation without shuffling.
The data has a 1-minute frequency, and I generate a prediction every 5 minutes.
To give you the full picture, I have to recommend every 5 minutes 3 setpoints to operate this machine with the goal of optimizing 2 parameters (maximize yield, minimize temperature).
My design consists of: filtering the data for those cases in which the yield is high and the temperature is low.
With the remaining “good behavior” data, train a k-means clustering on the 3 setpoints only, to get a finite number of setpoint combinations. The centroids will represent fixed combinations of these 3 setpoints.
Then, I trained an XGBoost (normally, at this point, I end up having 3k-15k cases) that, given the exogenous variables (~20 to 40 variables), tells me which is the best of the setpoints combinations to apply.
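In case a concrete sketch helps, here is a minimal version of this pipeline with synthetic data (all shapes and hyperparameters are made up, and sklearn's GradientBoostingClassifier stands in for XGBoost):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: 3 setpoint channels and 20 exogenous variables
# for 600 "good behavior" rows (all names and shapes are illustrative).
setpoints = rng.normal(size=(600, 3))
exogenous = rng.normal(size=(600, 20))

# Step 1: cluster the 3 setpoints into a finite set of combinations.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(setpoints)
labels = kmeans.labels_              # centroid index used by each row
centroids = kmeans.cluster_centers_  # the fixed setpoint combinations

# Step 2: train a classifier (XGBoost in the original design) that maps
# exogenous variables to the best centroid index.
clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(exogenous, labels)

# At recommendation time: predict a centroid, then read off its setpoints.
recommended = centroids[clf.predict(exogenous[:1])[0]]
print(recommended.shape)  # (3,) -> one value per setpoint
```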
On this classifier, I get good results on the train set, quite worse results on the validation set (averaged over the folds), and quite catastrophic results on the final test set.
I haven’t checked how the variables change their behavior 1 by 1, but I see that the distribution of the setpoint clusters can change a lot.
In the first 10.5 months of the year, the proportion between the 3 clusters can be 50%-40%-10%, while in the last 1.5 months, which I keep for testing, it can be 30%-30%-40%.
Often, my benchmark model gives better results in the test set than my “best” model after hours of hyperparameters search.
In response to your questions, I have 20 to 40 time series variables, 1-minute synchronous, 3 of which are setpoints to be recommended, and 2 of them are the variables to be optimized.
I think I don’t leak time when splitting the train/dev/test sets. For the test set, I keep aside the latest 1.5 months. For the train/dev split, I use 5-fold cross-validation with shuffle=False, so it keeps 1/5 of the train set as the dev set, respecting the natural flow of time.
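As a side note, one way to make this split airtight is sklearn's TimeSeriesSplit, where every dev fold comes strictly after its training fold (with a plain KFold and shuffle=False, the dev fold still rotates, so the earlier folds validate on data that precedes part of their training data). A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Illustrative series of 100 rows ordered in time.
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, dev_idx in tscv.split(X):
    # Sanity check: all training timestamps precede all dev timestamps.
    assert train_idx.max() < dev_idx.min()
    print(len(train_idx), len(dev_idx))
```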
Regarding the Bayes error question, it is quite clear now. The mentioned 20 variables were just to emphasize that this is not a case involving 3 or 4 variables that a human could still consider to make an educated guess; many more variables are involved. Probably even an expert human would do pretty badly.
BTW, if I may ask a bonus question, any idea of other designs I could try to recommend N setpoints to optimize M variables, having X exogenous variables?
Sorry again this time for the super long post!
Regards!
Sorry for getting back late. I have been (quite happily actually) occupied these few days for both promises and some of my own adventures, so I just came back here now and read your post. Thanks for the detailed explanation, but I need to find a proper time to fully focus, think through it and form a picture of what’s happening, so let me respond tomorrow morning my time (GMT+8) and we can decide where this discussion should go next.
Btw, your work looks interesting, too! (I am still in the mood of enjoying my own adventure)
If I were in your situation, investigating a prediction system, I would ask myself a lot of questions, so I am going to write some of them down. I don’t expect you to answer my questions (unless of course you choose to, in which case we can discuss further), but I hope these questions can inspire.
First, I drew my understanding below, and this is what I based my questions on:
My biggest concern is where the “bad data” should go in my understanding, because its use was not mentioned.
I can see reasons behind this strategy, but as a trade-off, it also limits your choices, because a fixed set of combinations is a source of bias. Therefore, I would first wonder: if I randomly sampled 1000 good data points and 1000 bad data points, then around, let’s say, centroid number 1, what is the good-to-bad ratio?
If the ratio was like 50-50, then was this really a good centroid?
My understanding is that, a good result means that the model correctly predicted the closest centroid, and it does not mean that the temperature turns out lowest or lower than an acceptable line, am I right?
I said the variables were not important because, for example, when we talked about human error for image recognition, we didn’t ask how many variables there were, right? We didn’t ask how it was achieved. Our focus was simply: how well can a human tell that the image is a cat? I think the same applies here. The human error for your task could simply be the best in your industry. I do not need to care about how that best is achieved, including how many variables anyone uses. It’s just like this: if the best company could achieve a good temperature and yield 90% of the time, then 100% - 90% = 10% would be my human error, regardless of how they achieved it. Then I would need to translate this 10% into something comparable with my model’s prediction capability, such as: my model needs to achieve a 5% error in order to achieve a 90% good temperature and yield. 10% is the human error for the task, and 5% is the “translated human error” for my model.
(edit: on second thought, one may argue that, yes, we did ask how many variables for image recognition, because they are just the image’s pixels. We are asking: if we present the training images to a human, what is the human error? I can’t argue against this, but I think the spirit of human error is just “what is the general level” or even “where is the ceiling”, and I will consider the choice of the set of variables as one of my hyperparameters, so it is not part of the question.)
@albert_c, I am still not sure if my understanding was correct, because I think “bad data” should be used and I am not sure if this should be a classification problem. I infer your design as a classification problem because I am thinking that you were using those selected centroids as labels for training XGBoost.
These concerns are really about design, so I think your following question is most relevant.
You might consider the reinforcement learning framework. In this framework, a sample of data is the so-called (s, a, r, s’)-tuple. s (and s') means state, and it can be your X+M variables; a stands for action, which will be associated with your N setpoints; r is the reward, which should be estimated from your M variables.
The neural network for this framework is usually called a Q-Network, and its job is to predict the Q-value for each of the possible actions, so you can pick the action with the highest Q-value. The Q-value is a weighted sum of rewards, and during training, you can assign a higher reward to actions that lower the temperature and increase the yield. An action can be something like “increase setpoint number 1 by one step, hold setpoint number 2, and decrease setpoint number 3”, so you are no longer limited to any fixed set of setpoint values, because we are considering relative changes instead of absolute values.
This approach expects you to have access to a working system (or a simulation of the system) on which you can continuously take actions and observe the states, and it’s very likely that, for the purpose of model training, you will want to act and observe more frequently than the actual recommendation period.
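To make the (s, a, r, s') idea concrete, here is a toy tabular Q-learning update (a real system would replace the table with a Q-Network; every number below is illustrative):

```python
import numpy as np

# Toy sketch of the (s, a, r, s') idea with a tabular Q function.
# States and actions are made-up discretizations: 4 machine states,
# 3 actions (e.g. raise / hold / lower a setpoint by one step).
n_states, n_actions = 4, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9  # learning rate, discount factor

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Example transition: in state 0, action 2 lowered the temperature
# and raised the yield, so it earns a positive reward.
q_update(s=0, a=2, r=1.0, s_next=1)

best_action = int(Q[0].argmax())  # pick the action with the highest Q-value
print(best_action)  # 2, since only Q[0, 2] has been updated so far
```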
The reinforcement learning (RL) framework looks to me like the best fit for your task because it is a control problem. While I have given a very short intro in my last response, I recommend you take whatever you can from Wikipedia (where you will see a feedback loop very familiar to any control system engineer), chatbots, and perhaps some online articles. DL.AI (the organization that offers the DLS you are taking) also offers the Machine Learning Specialization (MLS), and the 3rd week of the 3rd course of the MLS is exactly about reinforcement learning, so you might as well consider spending some time on that week. You won’t need the previous weeks.
While I think RL is the best fit, another possibility is to train a model that takes those X+M variables as inputs and predicts the likelihood of better outcomes for the M variables among a fixed set of actions (which are some relative changes to the current N setpoints). You may have N prediction heads, and each head can have, say, n relative changes, so you will have n × N predictions instead of n^N predictions in total.
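A tiny numeric illustration of the n × N head layout (random numbers stand in for the model's head scores; all names are made up):

```python
import numpy as np

# N = 3 setpoints, each head choosing among n = 5 relative changes,
# giving 3 x 5 = 15 outputs instead of 5**3 = 125 joint classes.
N_setpoints, n_deltas = 3, 5
deltas = np.array([-2, -1, 0, 1, 2])  # candidate relative changes

rng = np.random.default_rng(0)
head_scores = rng.normal(size=(N_setpoints, n_deltas))  # one row per head

# Each head independently picks its best delta.
chosen = deltas[head_scores.argmax(axis=1)]
print(chosen.shape)  # (3,) -> one recommended delta per setpoint
```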
Thank you very much for your insightful answers and for your time!
I apologize for the delay in my response. I needed the due amount of time to process all your comments.
Starting with the “human error” discussion, indeed, that’s about the ceiling of how good the solution could be. In this case, it’s really something that remains very difficult for me to assess.
I don’t think there are many companies doing this, and since normally the machines have owners and the projects have private sponsors, the results are always confidential and under an NDA.
In addition, every machine is different. Even machines of the same manufacturer, depending on the material they produce, their mechanical condition, raw materials… it is always hard to translate metrics from one case to another. It is annoying from an orthodox data science point of view not being able to establish this ceiling, but after all, the results have to be always “good enough” according to the sponsor’s criteria. I assume I will never have this ceiling available.
Regarding the juicy stuff, I totally agree Reinforcement Learning would be the go-to procedure in this case. The reasons I didn’t apply it were my inexperience with these algorithms, the small amount of data I might have, and the fact that in industry it is always better to have an algorithm that plays it safe, provides stable, consistent recommendations, and doesn’t compromise the machine.
And if I’m not wrong, the model learns as it makes real recommendations, once it is deployed, and I wasn’t sure the manufacturer would allow operating the machine crazily in the first iterations. Overall, given my inexperience in the subject, I didn’t have the confidence to apply it and went for a more conservative solution, despite its limitations.
I should totally visit the course you mentioned!!
Regarding your alternative solution, I think it is an approach I also had in mind. In fact, I kind of follow it when predicting the best cluster: I tune the inputs M in the direction I want them to go by adding a small delta to them, like asking the model to tell me the best cluster not for the current situation, but for the improved scenario I want the machine to move to.
Please correct me if I am wrong. What you propose is training a model with M (yield+Temp) + X as inputs to predict N (Setpoints). f(M+X)=N
In this case, I would do something similar to what I am doing. Tune M (yield = yield+delta, temp = temp-delta), but instead of predicting clusters as classes, I will predict directly the 3 setpoints as in a regression problem, right?
Or do you propose doing f(N+X)=M? Applying modifications to N, defining the actions (Setpoint1+delta, Setpoint1-delta, Setpoint1+2*delta… for all the setpoints), and keeping the action that causes the best effect on the predicted M?
I was also considering fitting a model to predict yield and temperature, including the Setpoints as variables, and with that model, with an optimization technique, try to “solve the equation” for the Setpoints. Something like
f(Setpoints, exogenous) = yield, temp.
And then yield = yield+delta, temp = temp-delta, and solve for the Setpoints.
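In code, what I have in mind looks roughly like this toy sketch (the forward model's coefficients are made up; in practice f would be the fitted regressor, and a proper optimizer could replace the brute-force grid):

```python
import numpy as np

# Toy forward model standing in for a trained f(setpoints, exogenous):
# the coefficients are invented; in practice f is the fitted regressor.
def f(setpoints, exogenous):
    yield_pred = setpoints @ np.array([0.5, -0.2, 0.3]) + exogenous.mean()
    temp_pred = setpoints @ np.array([0.1, 0.4, -0.1]) + 20.0
    return yield_pred, temp_pred

exogenous = np.full(20, 0.5)  # current machine situation (illustrative)

# "Solve for the setpoints" by brute-force search over a safe grid;
# a smarter optimizer (e.g. scipy.optimize) could replace these loops.
grid = np.linspace(-1.0, 1.0, 5)
best_score, best_sp = -np.inf, None
for s1 in grid:
    for s2 in grid:
        for s3 in grid:
            sp = np.array([s1, s2, s3])
            y, t = f(sp, exogenous)
            score = y - 0.1 * t  # maximize yield, penalize temperature
            if score > best_score:
                best_score, best_sp = score, sp

print(best_sp)  # the grid point with the best yield/temperature trade-off
```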
Would this make sense to you??
Regarding the “bad data”. These data correspond to the cases in which the machine was not well-operated. You can imagine 2 machine operators, the good and the bad one. I tried to learn from the good operator and avoid learning from the bad one.
I remove the “bad data” from all the datasets.
In defining the centroids, I use the “good data” because I want to learn the setpoint combinations that the good operator tends to use.
In the Train set, I use only the good data because I want my model to learn how to match the machine situation (exogenous vars) with a combination of SPs. Same reason for Dev and Test sets.
Then, in real-time, the model has to make predictions with any data coming in, and ideally turn badly managed situations into good ones, by matching that situation with the “best” cluster as the good operator would have done.
I haven’t checked the good-to-bad ratio for the centroids. I know that just recommending a certain cluster doesn’t guarantee good results (high yield, temperature under control). Linking it to the machine situation should increase the chances. But I am sure that in many cases, using a certain Centroid combination leads to bad results. To give flexibility to the solution, once a cluster is predicted, small corrections around the cluster are made following “hard” rules based on the experience of the plant engineers.
I still have problems with how to make the Train/Dev/Test splits without leaking data…
The discussion is so rich that we now have several open fronts! I really miss that; most of the time I have to work “alone” as a DS.
I don’t mean to be intrusive, but since you mentioned you were having adventures… I am very curious to know what you are working on, if it can be shared.
I take the chance to thank you again for your time and your thoughts.
I don’t mean to argue against any part of your response; it’s just natural for me to try to address a challenge when I read one. Maybe my opinions below are immature, impractical or even nonsense, because I don’t know what you know, so please just take my response below as some brainstorming for ideas that are aimed at creating value.
For human error, I think that, even though we can’t establish the real ceiling, it is still good to have a number for how other companies do. Even though the immediate results are under NDA, what about some downstream numbers? I would record them and see how well my machines need to work to get to those downstream results. The max of these numbers would be “the recorded ceiling”, whereas the average or median could be “my baseline target”.
Are they useful, considering they are perhaps not even representative enough? I think they are, for 3 reasons. (1) We need a human error to know when to stop (although we can always try to go beyond it). (2) Just like in any engineering project, we need a target to communicate with the management and others in the company. (3) At least we can try to beat competitors that work with my clients: “hello client, we can get you a 20% boost in that (downstream) production rate, sounds good?”
For RL, there is an area called constrained RL. In fact, after my last reply, I tried to dig up some readings about that. These constraints are about safety. We are not the first to worry about safety, and it means that we can stand on the shoulders of giants. To dream bigger, we need to acquire more skills, and with more skills, we go farther and dream even bigger - very positive feedback loop and sometimes people (including me) like this kind of loop.
That’s an interesting idea and I will respond to your following explanation below. However, my original idea was “f(current M, current X, current N) = delta N” and “new N = N + delta N”
We can test two choices: 1. delta N as a regression problem; 2. delta N from a fixed possible set, which is a classification problem. I was thinking about (2) for my original idea.
The first one might be more aggressive because it can predict a delta that is too large, but we can also apply a safety measure to cap the actual delta we apply to our valuable machines (just like one of the many constrained RL approaches).
The second one might require multiple rounds of recursive predictions if the fixed set of deltas is too conservative.
If I were you, I would test both approaches at the same time, and see which one or a hybrid of the two makes the most sense, given all other engineering, business and project constraints I had.
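For the first choice, the safety cap can be as simple as clipping the predicted deltas to per-setpoint limits (all numbers below are illustrative):

```python
import numpy as np

# Safety cap for the regression variant: whatever delta the model
# predicts, never move a setpoint by more than a per-setpoint limit.
max_step = np.array([0.5, 1.0, 0.2])           # per-setpoint safety limits
predicted_delta = np.array([2.0, -0.3, -0.9])  # raw model output (made up)

applied_delta = np.clip(predicted_delta, -max_step, max_step)
print(applied_delta)  # [ 0.5 -0.3 -0.2]
```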
I had thought about this, too. The thing is, I suppose the “setpoints” in the input of your function f include not just the current setpoints (as used by my “original idea” above), but also the setpoints to use (otherwise, where would they be?). In this case, when you look for the best {yield, temp}, you need to try multiple possible values for the “setpoints to use”. “Multiple tries” is the problem with this approach, because we don’t know how many tries are enough to get the best {yield, temp}.
Your question reminds me of something a few years ago. After I had asked a colleague to get me more data, he asked me the following question: “shouldn’t we just learn from the good cases?”
I think I didn’t answer him well at that time, but I will take this chance to give it another try.
Machine learning is discriminative (there is also a generative approach, but since we are not using it here, I will focus on the very nature of the discriminative approach), which means it takes both good and bad samples in order to set the good predictions apart from the bad ones.
Excluding the bad ones at training time won’t stop the model from predicting them, so the risk of excluding the bad ones is that the model does not know that the bad ones are bad.
I think you could feel this statement more with either my original idea or your idea, because both are predicting relative changes to N which means the final N can be any value - good or bad.
However, if we think deeper, this statement is still true even if we have a fixed set of clusters of “good absolute values of setpoints”. Why? Because this cluster may be good for one input X, but this very same cluster may also be the closest cluster for the input X of a bad data point that we dropped. This is why I asked about the good-to-bad ratio: how many bad data points consider this good cluster their cluster?
We need our model to know how to tell the good from the bad. We need bad data.
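The good-to-bad ratio check could look roughly like this (synthetic data stands in for the good and bad rows; a centroid whose ratio is near 50-50 would be suspect):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Illustrative data: setpoints from "good" rows (used to fit the
# centroids) plus setpoints from the "bad" rows that were dropped.
good_sp = rng.normal(0.0, 1.0, size=(300, 3))
bad_sp = rng.normal(0.3, 1.0, size=(300, 3))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(good_sp)

# Assign both populations to their nearest centroid and compare counts.
good_counts = np.bincount(kmeans.predict(good_sp), minlength=3)
bad_counts = np.bincount(kmeans.predict(bad_sp), minlength=3)
ratio = good_counts / (good_counts + bad_counts)
print(ratio)  # fraction of "good" rows among all rows near each centroid
```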
No matter whether we go with clusters, your idea, my original idea, or even constrained RL, we always need this final safety layer. We always need the experience of the plant engineers.
I know how it feels. I had never heard of the term “data science” until I was around 30, and I picked much of it up all by myself. There were no friends to talk with about data science, because none of them are in this career. In fact, this career was not even known when we were students; at least I had never seen anything like “data scientist” on any job board in my city.
Another is a small project whose idea I got while coding this learning project. I am making a class based on Python’s “dataclasses”, but it is going to be self-documenting. That means you only need to write down the name, type and description of your program’s parameters once, and your program’s entry point (Python’s “argparse”) and the docstring for the class that holds all those parameters will be generated automatically. To change, add or remove anything about a parameter, or the parameter itself, you only need to do it in one place. This class can also be passed among different functions, so that, in practice, you don’t need to type-hint or document the parameters in any of those functions. All in all, I want less repetitive work around parameters, which will mean fewer errors and more efficiency.
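A stripped-down sketch of the idea (the real project also generates the class docstring, which is omitted here; all names are made up):

```python
import argparse
from dataclasses import dataclass, field, fields

@dataclass
class Params:
    """Hypothetical parameter holder; each field carries its own
    description in metadata, so everything is declared exactly once."""
    lr: float = field(default=0.01, metadata={"help": "learning rate"})
    epochs: int = field(default=10, metadata={"help": "number of epochs"})

def build_parser(cls):
    # Read name, type, default and description off the dataclass fields
    # to generate the argparse entry point automatically.
    parser = argparse.ArgumentParser(description=cls.__name__)
    for f in fields(cls):
        parser.add_argument(f"--{f.name}", type=f.type,
                            default=f.default,
                            help=f.metadata.get("help", ""))
    return parser

# Parse an example command line, then rebuild the typed parameter object.
params = Params(**vars(build_parser(Params).parse_args(["--epochs", "5"])))
print(params)  # Params(lr=0.01, epochs=5)
```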
You are welcome, Albert!
I would just like to share one more view of mine, because we have talked about being conservative and RL.
I am a physics graduate. Before data science, I had been an engineer (mostly self-trained) for years writing software and building systems (gas, optical, radon). For systems, my work was primarily mechanical and about the sensors and, only sometimes, electrical, but I think I also share the view about making sure the system works under safety and up to the standard, because I also manage the operation of those projects.
What I want to say is that, it’s not a bad thing to have both conservative and a bit of progressive mindset at the same time. In fact, it’s better to have both balanced than just any one of them, because then we create and we innovate.
Btw, I just wanted to reiterate this message. Sometimes a brainstorming session can somehow become an argument about what is correct and what is not. I am not saying we are in such a situation, but since we don’t know each other, it may be better for me to express my intention more.
My deepest apologies for the delayed response. It was never my intention to leave your message unanswered for so long, it’s been crazy lately…
I appreciate a lot all your time and especially the insight you have provided on all the topics discussed: train/test splits, the selection of good/bad cases, your ideas on different designs (predicting setpoint deltas), and bringing the case to RL. It will all be super useful for me.
I will totally take this direction in my next iteration!
I won’t ask to go deep on this discussion again after all this time!
I will keep thinking about how to choose the cases that go into my model as training cases.
In the case of predicting the setpoint deltas: should I include deltas that cause both good and bad consequences, because the model will also have to make predictions in bad cases? Or should I filter and train only on those deltas that turned a bad situation good or kept a good situation good? After all, this is what I want my model to do: turn the bad situations into better ones, and not break things when they are already working well… It still seems to me that the right option is the second, because otherwise my model will learn to predict deltas (good or bad), not to recommend good deltas…
Your field of work seems very interesting and somehow very related to mine.
I can only wish you the best in your current and future projects!
It is just wonderful to hear from you! I always think that we participate here in our free time, so it is no delay at all but it was just the right time. Isn’t it a miracle that we, from two corners of the world, are able to exchange ideas here?
Now, it is very clear. Since you use the model when it is in a bad situation, then the model should be able to differentiate, given a bad situation, between a good and a bad delta. This means that we need “bad → bad” and “bad → good” data points in our datasets. Make sense?
What’s being left out are “good → good” and “good → bad”, and perhaps it would be a good idea to compare the model performance with and without them.
Thank you for your kind answer again!
It is indeed a sort of miracle to be talking with a (not-so-much-anymore) stranger about something that is deep inside my brain and that I can’t talk about with anyone around me.
I use my model in all situations (when things are good and when things are bad), but I definitely want to always predict deltas that make it good.
“good → good” and “bad → good” is what I aim for.
This is why I don’t see how adding “bad → bad” or “good → bad” deltas in the training will help.
From my point of view, if I introduce in the train set deltas that made it “bad”, the model will learn how to predict deltas (and blindly copy the operator behavior), but not how to end in a good situation.
I am stuck here; maybe I haven’t understood your point, or maybe we see it the same way after all?
Oh, I see… I think our difference is the modeling approach. I have always wanted to leverage the bad deltas (though I may change my mind if it turns out there are not many bad deltas in your existing data pool, which I don’t know).
Your question first. Yes, yes, if we want to predict good deltas given any situation, we want “good → good” and “bad → good”. Period.
I focused on “??? → bad” because I was thinking all along about how to leverage the “??? → bad” data, and maybe this is the part I had never made clear enough. First, the following should be what your data looks like:
There should be more details about how to use and evaluate these models, but I hope my previous messages would make more sense given an example of the approach.