Hello Albert @albert_c,
Good stuff takes time
I don't mean to argue against your response, but it's just natural for me to try to address a challenge when I read one. Maybe my opinions below are immature, impractical or even nonsense, because I don't know what you know, so please just take my response below as some brainstorming aimed at creating value.
For human error, I think that, even though we can't establish the real ceiling, it is still good to have a number for how well other companies do. Even if the immediate results are under NDA, what about some downstream numbers? I would record them and see how well my machines need to work to reach those downstream results. The max of these numbers would be "the recorded ceiling", whereas the average or median could be "my baseline target".
Are they useful, considering they are perhaps not even representative enough? I think they are, for three reasons. (1) We need a human-error figure to know when to stop (although we can always try to go beyond it). (2) Just like in any engineering project, we need a target to communicate with management and others in the company. (3) At least we can try to beat the competitors that work with my clients: "hello client, we can get you a 20% boost in that (downstream) production rate, sounds good?"
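Turning the recorded numbers into the two targets is a one-liner each. A minimal sketch, with purely hypothetical production-rate numbers standing in for the recorded downstream results:

```python
from statistics import median

# Hypothetical downstream production rates recorded from other
# companies' public results (illustrative values only).
downstream_rates = [0.72, 0.81, 0.78, 0.90, 0.75]

recorded_ceiling = max(downstream_rates)    # the best anyone has shown
baseline_target = median(downstream_rates)  # a realistic first target

print(recorded_ceiling, baseline_target)
```

The median is less sensitive than the mean to one unusually good (or bad) competitor, which is why it may be the safer choice for a baseline.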
For RL, there is an area called constrained RL. In fact, after my last reply, I tried to dig up some readings about it. Those constraints are about safety. We are not the first to worry about safety, and that means we can stand on the shoulders of giants. To dream bigger, we need to acquire more skills, and with more skills, we go farther and dream even bigger: a very positive feedback loop, and sometimes people (including me) like this kind of loop.
That's an interesting idea and I will respond to your explanation below. However, my original idea was "f(current M, current X, current N) = delta N" and "new N = N + delta N".
We can test two choices: (1) treat delta N as a regression problem; (2) pick delta N from a fixed set of possible values, which makes it a classification problem. I was thinking about (2) for my original idea.
The first one might be more aggressive, because it can predict a delta that is too large; but we can also apply a safety measure to cap the actual delta we apply to our valuable machines (just like one of the many constrained RL approaches).
The second one might require multiple rounds of recursive predictions if the fixed set of deltas is too conservative.
If I were you, I would test both approaches at the same time and see which one, or a hybrid of the two, makes the most sense, given all the other engineering, business and project constraints I had.
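The two framings above can be sketched side by side. This is only an illustration: the cap value, the delta set and the function names are all hypothetical placeholders, not anything from a real plant.

```python
# Two ways to frame "f(current M, current X, current N) = delta N".
# All names and numbers below are hypothetical.

# (1) Regression: f can predict any real-valued delta, so we cap it
#     before applying it to the machine (a simple safety layer).
MAX_STEP = 2.0  # largest change the plant engineers allow per round

def apply_regression_delta(current_n: float, predicted_delta: float) -> float:
    capped = max(-MAX_STEP, min(MAX_STEP, predicted_delta))
    return current_n + capped

# (2) Classification: f picks a delta from a fixed, pre-approved set,
#     so no cap is needed, but if the set is too conservative,
#     reaching a far-away target takes several recursive rounds.
DELTA_SET = [-1.0, -0.5, 0.0, 0.5, 1.0]

def apply_classification_delta(current_n: float, class_index: int) -> float:
    return current_n + DELTA_SET[class_index]
```

For example, a regression prediction of +5.0 from N = 10.0 would be capped to land at 12.0, while the classification version can move at most +1.0 per round.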
I had thought about this, too. The thing is, I suppose the "setpoints" in the input of your function f include not just the current setpoints (as used by my "original idea" above) but also the setpoints to use (otherwise, where would they be?). In that case, when you look for the best {yield, temp}, you need to try multiple possible values for the "setpoints to use". Those multiple tries are the problem with this approach, because we don't know how many tries are enough to find the best {yield, temp}.
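Here is a tiny sketch of why the multiple tries matter, with a toy stand-in for the trained model (every function and number below is hypothetical):

```python
# Hypothetical sketch: if f takes (current setpoints, X, candidate
# setpoints) and predicts {yield, temp}, then finding the best
# candidate means enumerating many candidates and scoring each one.

def predict(current_n, x, candidate_n):
    # Stand-in for a trained model: a toy quadratic whose best
    # yield happens to sit at candidate_n == 5.0.
    predicted_yield = -(candidate_n - 5.0) ** 2
    predicted_temp = 300.0 + candidate_n
    return predicted_yield, predicted_temp

def best_setpoint(current_n, x, candidates):
    # The "multiple tries" problem: we never know in advance how
    # fine this candidate grid must be to contain the true optimum.
    return max(candidates, key=lambda n: predict(current_n, x, n)[0])

candidates = [n / 2 for n in range(0, 21)]  # 0.0, 0.5, ..., 10.0
print(best_setpoint(8.0, None, candidates))
```

With a coarser grid that skips 5.0, the search would settle on a worse setpoint without any warning, which is exactly the "how many tries are enough" worry.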
Your question reminds me of something from a few years ago. After I had asked a colleague to get me more data, he asked me the following question: "shouldn't we just learn from the good cases?"
I think I didn't answer him well at that time, but I will take this chance to give it another try.
The machine learning we are doing here is discriminative (there is also the generative approach, but since we are not using it here, I will focus on the very nature of the discriminative approach), which means it needs both good and bad samples in order to set the good predictions apart from the bad ones.
Excluding the bad samples at training time won't stop the model from predicting bad values; the risk of excluding them is that the model never learns that the bad ones are bad.
I think you can feel this more with either my original idea or your idea, because both predict relative changes to N, which means the final N can land on any value, good or bad.
However, if we think deeper, this statement is still true even if we have a fixed set of clusters of "good absolute values of setpoints". Why? Because a cluster that is good for this input X may also be the bad cluster for the input X of a bad sample that we dropped. This is why I asked about the good-to-bad ratio: how many bad samples would consider this good cluster their cluster?
We need our model to know how to tell the good from the bad. We need bad data.
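A toy sketch of this point, using made-up setpoint values and a deliberately simple 1-nearest-neighbour stand-in for the model:

```python
# Hypothetical numbers: setpoints labelled by plant outcome.
good = [4.8, 5.0, 5.2]
bad = [1.0, 9.0]

def nearest_label(n, samples):
    # samples: list of (value, label); return the label of the
    # closest recorded value -- a minimal discriminative "model".
    return min(samples, key=lambda s: abs(s[0] - n))[1]

good_only = [(g, "good") for g in good]
both = good_only + [(b, "bad") for b in bad]

# Trained only on good data, the model calls everything good,
# because "good" is the only label it has ever seen:
print(nearest_label(9.0, good_only))
# With the bad samples included, it can flag the bad setpoint:
print(nearest_label(9.0, both))
```

The same query (9.0) is confidently labelled "good" by the good-only model and "bad" by the model that saw both classes, which is the whole argument for keeping the bad data.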
Whether we go with clusters, your idea, my original idea, or even constrained RL, we always need this final safety layer. We always need the experience of the plant engineers.
I know how it feels
I had never heard of the term "data science" until I was around 30, and I picked much of it up by myself. There were not even any friends to talk with about data science, because none of them are in this career. In fact, this career was not even known when we were students; at least, I had never seen anything like "data scientist" on any job board in my city.
Sure! Actually, one of them is happening in this place. See this one for a learning project for RL.
Another is actually a small project whose idea I got while coding that learning project. I am making a class based on Python's "dataclasses", but it is going to be self-documenting. That means you only need to write down the name, type and description of your program's parameters once, and both your program's entry point (Python's "argparse") and the docstring for the class that holds all those parameters will be generated automatically. So, to change, add or remove anything about a parameter (or the parameter itself), you only need to do it in one place. The class itself can also be passed among different functions, so that, in practice, you don't need to type-hint or docstring the parameters in any of those functions. All in all, I want less repetitive work around parameters, which will mean fewer errors and more efficiency.
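To make the idea concrete, here is a minimal sketch of how such a class could work, using only the standard library. The class name, field names and helper functions are all my own hypothetical placeholders, not the actual project:

```python
import argparse
from dataclasses import dataclass, field, fields

# Declare name, type, default and description ONCE, in metadata...
@dataclass
class Params:
    learning_rate: float = field(
        default=0.01, metadata={"help": "step size for the optimiser"})
    n_epochs: int = field(
        default=10, metadata={"help": "number of training passes"})

# ...then derive the argparse entry point from the declarations:
def build_parser(cls) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description=cls.__name__)
    for f in fields(cls):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default,
                            help=f.metadata.get("help", ""))
    return parser

# ...and derive the docstring from the same declarations:
def build_docstring(cls) -> str:
    lines = [f"{f.name} ({f.type.__name__}): {f.metadata.get('help', '')}"
             for f in fields(cls)]
    return "Parameters:\n    " + "\n    ".join(lines)

args = build_parser(Params).parse_args(["--n_epochs", "3"])
params = Params(**vars(args))  # one object to pass between functions
print(params.n_epochs)
print(build_docstring(Params))
```

A real version would need a bit more care (fields without defaults, booleans, string annotations), but the single-source-of-truth idea is already visible.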
You are welcome, Albert!
I would just like to share one more view of mine, because we have talked about being conservative and about RL.
I am a physics graduate. Before data science, I had been an engineer (mostly self-trained) for years, writing software and building systems (gas, optical, radon). For the systems, my work was primarily mechanical and about the sensors, and only sometimes electrical, but I think I share your view about making sure a system works safely and up to standard, because I also managed the operation of those projects.
What I want to say is that it's not a bad thing to have both a conservative and a somewhat progressive mindset at the same time. In fact, it's better to have the two in balance than to have just one of them, because that is how we create and innovate.
Cheers,
Raymond