How do we decide the score threshold for categories in the Moderation API?
Tl;dr
- flagged: Set to true if the model classifies the content as violating OpenAI's usage policies, false otherwise.
- categories: Contains a dictionary of per-category binary usage-policy violation flags. For each category, the value is true if the model flags the corresponding category as violated, false otherwise.
- category_scores: Contains a dictionary of per-category raw scores output by the model, denoting the model's confidence that the input violates OpenAI's policy for the category. The value is between 0 and 1, where higher values denote higher confidence. The scores should not be interpreted as probabilities.
My takeaway is that the "threshold" is binary: content either is flagged for one or more categories or it isn't.
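If the built-in cutoff is too strict or too lenient for a given use case, the raw category_scores can be compared against a custom cutoff instead of relying on the binary flags. Here is a minimal sketch, assuming the openai Python SDK (v1+), the omni-moderation-latest model, and an illustrative threshold of 0.5 (not an official recommendation):

```python
# Minimal sketch: read the binary flags and, separately, apply a custom
# cutoff to category_scores. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",  # model name assumed; check current docs
    input="Some text to classify.",
)
result = response.results[0]

# Built-in binary decision: there is no threshold to tune here.
print("flagged:", result.flagged)

# Per-category binary flags, as returned by the API.
for category, violated in result.categories.model_dump().items():
    if violated:
        print("violated category:", category)

# Custom decision: ignore `categories` and apply your own cutoff to the
# raw confidence scores.
CUSTOM_THRESHOLD = 0.5  # illustrative value only
for category, score in result.category_scores.model_dump().items():
    if score >= CUSTOM_THRESHOLD:
        print(f"custom flag: {category} (score={score:.3f})")
```

In other words, flagged and categories reflect OpenAI's internal cutoffs, while category_scores lets you make the thresholding decision yourself if you need a stricter or looser policy.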