Statistical Data Bias - What if Data "appears" biased as no option

In the credit card fraud detection example, it was mentioned that the data could be biased because most transactions are not fraudulent.
My question is: what if a bank has no fraudulent transactions at all to model on?
Considering what they observe in the industry, they want to be proactive and deploy a fraud detection model on an Amazon SageMaker endpoint.
Can the model be trained so that it learns the features of the non-fraudulent transactions it was built from, and flags any transaction that is anomalous or does not match those features as possible fraud, for closer scrutiny by the department concerned?
For want of a better term, let's call it “negative” or “anomalous” feature matching. Is this possible?

The other example was the reviews and determining the sentiment of the data.
What if, in real terms, the company's reviews are overwhelmingly positive (or overwhelmingly negative)?
In other words, the data appears skewed, statistically speaking.
But in reality it is not biased: that is simply how the reviews are.
How do we go about resolving this to make the data “unbiased” for model creation? And do we have to?

Hi @mykle,

Your problem can be formulated as a classification problem in which the set of classes C is not fixed, since test data may come from new categories, and you would like to detect those new categories. This is called out-of-distribution detection. To learn more, you can refer to “open set recognition” in Chapter 16 of Probabilistic Machine Learning: An Introduction by Kevin Murphy.
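To make the idea concrete, here is a minimal sketch of flagging transactions that do not look like the legitimate ones seen in training. It is not the method from the book, and it is not a SageMaker API: it is a plain per-feature z-score profile, chosen only because it fits in a few lines. The feature names (amount, hour of day), the synthetic data, and the 99.5th-percentile threshold are all assumptions made for illustration.

```python
import random
import statistics

def fit_normal_profile(rows):
    """Learn per-feature mean and standard deviation from legitimate rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 if a feature has zero spread, to avoid dividing by zero
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def anomaly_score(row, profile):
    """Largest absolute z-score across features: distance from 'normal'."""
    means, stds = profile
    return max(abs(x - m) / s for x, m, s in zip(row, means, stds))

random.seed(0)
# Hypothetical features: (amount, hour of day). Every row is non-fraudulent.
legit = [(random.gauss(50, 15), random.gauss(14, 3)) for _ in range(1000)]
profile = fit_normal_profile(legit)

# Pick the threshold from the training scores themselves, so that roughly
# 0.5% of known-good transactions would be sent for review (an assumption).
scores = sorted(anomaly_score(r, profile) for r in legit)
threshold = scores[int(0.995 * len(scores))]

def flag_for_review(row):
    """True if the transaction looks unlike the legitimate training data."""
    return anomaly_score(row, profile) > threshold

typical = (55.0, 13.0)    # close to the training distribution
unusual = (4000.0, 3.0)   # very large amount at an odd hour

print(flag_for_review(typical), flag_for_review(unusual))
```

A flag here only means "unlike what we have seen", not "fraud", which matches the question's idea of sending the transaction on for more scrutiny. Since the question mentions SageMaker: it ships a built-in Random Cut Forest algorithm for unsupervised anomaly detection, which may be a more realistic starting point than this toy profile.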

My understanding of the second question is that the data is skewed in numerical terms, but at a semantic/syntactic level the classes are balanced. I do not know the answer to this question. One example is the initial Visual Question Answering dataset, which had a bug or bias; the issues, and how they were overcome, are discussed in this paper. Statistics can also be misleading; watch these two videos: video1 and video2. That’s why, whenever a new model is deployment-ready, it goes through A/B testing to check whether it is ready for production.
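As a small illustration of how a statistic can mislead on skewed data, the sketch below scores a "model" that ignores the review text and always predicts the majority sentiment. The 95/5 split and the label names are made up for the example, and balanced accuracy (the mean of per-class recall) is just one possible alternative metric, not something prescribed in this thread.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical review labels: 95 positive, 5 negative
y_true = ["pos"] * 95 + ["neg"] * 5

# A baseline that always predicts the most common class
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)           # high, despite learning nothing
bal = balanced_accuracy(y_true, y_pred)  # recall is 1.0 for pos, 0.0 for neg

print(f"accuracy={acc:.2f} balanced_accuracy={bal:.2f}")
```

The skew may well be real, as the question says, but a per-class metric like this at least shows whether the model has learned anything about the rare class before it goes to A/B testing.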

Happy learning,

Best Regards,
A. Sriharsha

Thanks a lot, that answered my query.