Why is semi-supervised labeling, such as the graph-based approach, reliable enough to build training data? And if there is a way to prove its reliability, why not apply it directly for prediction instead of using it to train another supervised model?

I don't follow your questions. Could you please rephrase the topic? Don't forget to refer to the lectures / reading items via links.

Here is the link to the lecture that covers the semi-supervised labeling method:

Thanks for the link, but I still don't understand your questions. Please help me by rephrasing your text.

Let's say X_0 is the human-labeled data. We train the semi-supervised learning model with X_0, then produce the X_1 data with the semi-supervised model, and finally train the final model on the (X_0 + X_1) data.
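To make that pipeline concrete, here is a minimal toy sketch (my own construction, not from the lecture): a simple nearest-neighbor pseudo-labeler stands in for the semi-supervised step that produces X_1, and a centroid classifier stands in for the final model trained on X_0 + X_1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated blobs; X_0 is a handful of human-labeled points.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)
labeled_idx = np.array([0, 1, 50, 51])            # the X_0 subset

# Step 1 (pseudo-labeling): a 1-NN stand-in for the semi-supervised
# labeler assigns each point the label of its nearest X_0 example.
d = np.linalg.norm(X[:, None] - X[labeled_idx][None, :], axis=2)
y_pseudo = y_true[labeled_idx][d.argmin(axis=1)]  # labels for X_0 + X_1

# Step 2 (final model): fit a tiny centroid classifier on all pseudo-labels.
centroids = np.array([X[y_pseudo == c].mean(axis=0) for c in (0, 1)])
pred = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)
accuracy = (pred == y_true).mean()
```

In practice the two steps would be something like scikit-learn's `LabelPropagation` followed by whatever final model the problem calls for; the structure (label, then retrain on the enlarged set) is the same.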

If I had to guess, I would say the goal with semi-supervised learning is to sort of intentionally overfit to the data instead of building a reliable model to predict.

Additionally, we might want a different model as the final one. Let's say we use K-means clustering to label the data, but then we want to use an SVM to predict with the data due to the constraints of the problem.

It is a good question to think about though.

Hey @Otto_Vintola

Can you explain more about this?

If I had to guess, I would say the goal with semi-supervised learning is to sort of intentionally overfit to the data instead of building a reliable model to predict.

I'm not sure your SVM case is the intended motivation for introducing this technique, since it wasn't explained in the lecture, but a similar idea is indeed presented in the teacher/student architecture in Course 3, which is applied to overcome constraints of the problem.

@balaji.ambresh

Here is the transcript context in the lecture:

Using semi-supervised labeling is advantageous for really two main reasons. Combining labeled and unlabeled data can improve the accuracy of machine learning models. Getting unlabeled data is often very inexpensive, since it doesn't require people to assign labels. Often unlabeled data is easily available in large quantities. Label propagation is an algorithm that assigns labels to previously unlabeled examples. This makes it a semi-supervised algorithm where a subset of the data points have labels. The algorithm propagates the labels to data points without labels. It does that based on the similarity or community structure of the labeled data points and the unlabeled data points. This similarity or structure is used to assign labels to the unlabeled data. For example, there's graph-based, and in this figure you can see …

My question is: if the labeled data generated by semi-supervised learning is reliable enough to train a model, why not just apply the labeling method itself as our prediction model? Say we have a labeling algorithm label(X) = y that generates the training dataset, and that dataset is then used to train a model f(X) = y. If label(X) is the ground truth, why is f(X) still required? I don't see how the two reasons given in the lecture answer my question.

No, label(X) is not the ground truth. The labelling algorithm is able to generate data that is good enough so that we can find another function from the generated data to the ground truth.

The goal of a labeling method like label propagation in semi-supervised learning is to assign labels to unlabeled data. This process doesn't involve explicitly learning the mapping from input features to the label; instead it uses a heuristic, such as taking the label of the nearest neighbor of the unlabeled point (as in KNN).

There's no harm in using a KNN approach as your final model as long as it meets your needs.

From the video transcript:

Label propagation itself is considered transductive learning, meaning that we're mapping from the examples themselves without learning a function for the mapping.

There's no harm in using a KNN approach as your final model as long as it meets your needs.

That's true, there is no harm in applying KNN or any other heuristic. But given that we treat it only as a labeling method, what is the benefit of training a model on the labeled data compared to just applying the heuristic? There must be some motive, right? Why not use the heuristic, since it is already available to give a prediction on a new instance, just as in the labeling process? After all, training a model introduces extra cost.

No, label(X) is not the ground truth. The labelling algorithm is able to generate data that is good enough so that we can find another function from the generated data to the ground truth.

Correct, the labels are not ground truth in the physical world (maybe I should have put the term in quotes), but how can we get a model/function that is closer to the ground truth than the "not ground truth" labeled data by training on that data?

Well, there is no guarantee of it, as is the case for most ML tasks, but we can always try. Finding a model with 100% accuracy is not realistic, but if we generate pseudo-labels derived from the data, then we have enough data to train a more suitable model to map to real predictions.

The issue comes from not having enough data to train a suitable model for the task. The distinction lies in the nature of the model used to generate labeled data versus the model intended for making predictions or generalizations on new, unseen data.

The semi-supervised model might not capture the full complexity or intricacies present in the underlying data distribution. It might rely on assumptions or specific rules that might not generalize well to the entire dataset or new, unseen data.

Consider the following regression problem:

- Black points have labels.
- We want to label the green data point.

If we were to use a KNN regressor with the average of 2 neighbors to predict the green point, we don't consider the unlabeled red points. We'll choose the 2 closest black points and compute the label.

With label propagation, the label estimate will be better, since we'd compute labels for the red points first and then label the green point using the red points.
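A 1-D toy version of this figure (my own numbers, not from the post) shows the gap between the two estimates: with only the two labeled "black" endpoints, 2-NN averages them; after propagating labels along the "red" chain, the same 2-NN query uses the nearby red points instead.

```python
import numpy as np

# Chain of 1-D points: two labeled endpoints (the "black" points) and
# four unlabeled points in between (the "red" points).
xs = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
labels = np.array([0.0, np.nan, np.nan, np.nan, np.nan, 10.0])
labeled = ~np.isnan(labels)

# KNN (k=2) using only the labeled points: the "green" query at x=3.0
# must average the two distant black endpoints, ignoring the red points.
green = 3.0
dist = np.abs(xs[labeled] - green)
knn_only = labels[labeled][np.argsort(dist)[:2]].mean()

# Label propagation on the chain graph: each unlabeled point repeatedly
# takes the mean of its two neighbors until the values stabilize.
y = np.where(labeled, labels, 5.0)        # initialize unlabeled points
for _ in range(200):
    for i in range(1, len(xs) - 1):
        if not labeled[i]:
            y[i] = (y[i - 1] + y[i + 1]) / 2

# KNN (k=2) after propagation: the red points now carry labels, so the
# green point's nearest neighbors are x=2.0 and x=4.0.
dist_all = np.abs(xs - green)
knn_propagated = y[np.argsort(dist_all)[:2]].mean()
```

On this chain the propagated labels converge to a linear interpolation between the endpoints, so the post-propagation estimate sits much closer to the query point's local neighborhood than the labeled-only estimate.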

The semi-supervised model might not capture the full complexity or intricacies present in the underlying data distribution. It might rely on assumptions or specific rules that might not generalize well to the entire dataset or new, unseen data.

It's possible that with a more sophisticated model (an ANN, maybe) we can find some complicated structure that heuristic labeling has no way to perceive.

Let me reconsider this. The most critical part would be: labeling is the process of generating the information, and training is the process of consuming it. Is it possible that information the generator doesn't know can be captured by the consumer?

Anyway, if a specific model is required and we don't have enough data, then semi-supervised learning addresses the problem.

@balaji.ambresh

I think this is an example of the value of the labeling method. With the labels added by label propagation, the model becomes more confident in its prediction (I guess the role of the KNN regressor here is the model to be trained).

Please rephrase your reply to me. I find it hard to understand the text.

I meant that your example showed the value of applying labeling: it helps to build a more accurate model with more labeled data. But it still didn't answer the original question, i.e., the reason we need to train a model with the labeled data.

Yeah, the generator does not necessarily capture all of the information from the data; nor does any model completely, we can only approximate. But typically a labelling algorithm does not produce the best accuracy compared to an ANN, though this can also vary between problems.

Additionally, the labelling algorithm is not error-free, so each label created with it introduces some error into the dataset as a whole. So the more data that is created with a labelling algorithm, the less truthful the whole dataset is.

Consequently, there is a limit to how much data should be labelled before the dataset is skewed in the direction the algorithm is pushing it. For instance, imagine the true distribution resembles a circle, but the labelling algorithm creates labels that do not resemble a circle at all. That would distort the dataset and, if we train on it, produce inaccurate predictions.

So, label some data; there is no blueprint for how much, you have to figure it out with analytics, domain knowledge and data visualisation. Then use a separate model to capture all of the information from the dataset, which will most likely produce a model that is more accurate than the labelling algorithm.

But if the labelling algorithm is better for that specific problem, then there is no issue with using that either; as @balaji.ambresh said, if it fits your problem, then it's OK to use it.

…

So, label some data; there is no blueprint for how much, you have to figure it out with analytics, domain knowledge and data visualisation …

I think it makes sense to apply the labelled data cautiously, given the limitations of the semi-supervised labelling algorithm you introduced. If I understand correctly, this would be an iterative process, just like tuning hyperparameters for an ML model.

Hey @balaji.ambresh

Can we say this is the general use case of semi-supervised labeling in industry, and that it is proven good practice for applying it efficiently?

Adding a few mentors to share their industry experience:

@Isaak_Kamau

@Th_o_Vy_Le_Nguy_n

@reinoudbosch

@arosacastillo

Yeah, you could say that, but ML and software development are largely iterative as a whole: make something → evaluate it → if you are happy with the result, keep it; otherwise go back to the first step.

Sorry, reading this thread now…

Here's a general overview of label propagation in a graph-based setting:

- **Graph Representation**: The first step involves constructing a graph from the available data. Nodes in the graph represent data points, and edges between nodes indicate relationships or similarities between them. This graph can be constructed based on various criteria such as similarity, distance, or some other measure.
- **Initial Labeling**: Initially, a subset of nodes in the graph is labeled with known classes. These are the instances for which you have ground truth labels. The rest of the nodes are unlabeled.
- **Label Propagation**: The algorithm then iteratively updates the labels of unlabeled nodes based on the labels of their neighboring nodes. The basic idea is that nodes with similar attributes or connections should have similar labels. This process is often repeated until the labels stabilize or a predefined number of iterations is reached.
- **Propagation Rule**: How labels are propagated from labeled to unlabeled nodes depends on the specific label propagation algorithm. Commonly used approaches include a weighted average of the labels of neighboring nodes or a diffusion process.
- **Stopping Criteria**: Label propagation algorithms typically include a stopping criterion to determine when the process should halt. This could be a fixed number of iterations, a threshold for label changes, or other convergence criteria.
- **Final Predictions**: Once the label propagation process is complete, the final labels assigned to the unlabeled nodes can be used as predictions for those instances.
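The steps above can be sketched in plain NumPy (a toy construction of my own; real implementations such as scikit-learn's `LabelPropagation` differ in details like the kernel, normalization, and clamping scheme):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two Gaussian blobs; only one point per blob starts labeled.
X = np.vstack([rng.normal(-2, 0.6, (30, 2)), rng.normal(2, 0.6, (30, 2))])
y = np.full(60, -1)               # -1 = unlabeled
y[0], y[30] = 0, 1                # initial labeling

# 1. Graph representation: connect each point to its k nearest neighbors.
k = 7
d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
W = np.zeros((60, 60))
for i in range(60):
    for j in np.argsort(d[i])[1:k + 1]:   # skip self at index 0
        W[i, j] = W[j, i] = 1.0

# 2./3. Propagation: soft label matrix, labeled rows clamped each sweep.
F = np.zeros((60, 2))
F[y >= 0] = np.eye(2)[y[y >= 0]]
for _ in range(100):              # fixed iteration budget as a stopping rule
    F_new = W @ F / W.sum(axis=1, keepdims=True)   # 4. neighbor average
    F_new[y >= 0] = np.eye(2)[y[y >= 0]]           # clamp known labels
    if np.abs(F_new - F).max() < 1e-6:             # 5. convergence check
        break
    F = F_new

# 6. Final predictions for every node.
pred = F.argmax(axis=1)
```

Each sweep replaces every unlabeled node's soft label with the average of its neighbors' labels while the known labels stay clamped, so label mass diffuses outward from the labeled nodes through each blob until the assignment stabilizes.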

Label propagation in a graph-based semi-supervised setting has several advantages:

- It can effectively utilize information from both labeled and unlabeled data.
- It leverages the inherent structure and relationships in the data, which is especially useful in scenarios where similar instances are likely to have similar labels.

However, it also has challenges, such as sensitivity to the choice of the graph structure, the potential for noise propagation, and the need for careful tuning of parameters. The specific details of label propagation algorithms may vary, and there are different approaches and variations in the literature.