When Your “Labels” Aren’t Really Labels: Dealing with Entity-Based NLP Datasets

I’m working on an NLP news classification task, but my dataset is structured in an unusual way.

Each row (one article) has one or more “topics”, but these topics are actually named entities, not true categories. For example:

Row 1 topics: [“Doctors”, “NHS”, “British Medical Association (BMA)”] → clearly belongs to a broader domain like Health

Row 2 topic: [“Glasgow”] → a location

Row 3 topics: [“Sutton in Ashfield”, “Annesley”, “M1 motorway”] → places/infrastructure

Row 4 topics: [“Elon Musk”, “Tesla”] → could belong to Business or Technology

So the problem is:

  1. My labels are inconsistent and too granular (entities instead of domains).

  2. Each row has multiple labels, but they don’t map directly to meaningful categories.

  3. There is no predefined mapping from entities → domains.

  4. Some entities are ambiguous (e.g., Elon Musk could be Business or Technology).

What I’m trying to do:

Convert these entity-level labels into higher-level domains (such as Health, Business, Technology, and Geography), then train a multi-label classifier on those domains.
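To make the target format concrete, here is a minimal sketch of the conversion I have in mind, using only the four example rows above. The `DOMAINS` list, the `ENTITY_TO_DOMAINS` table, and both helper functions are hypothetical and hand-written for illustration. I don't actually have such a mapping, and filing “M1 motorway” under Geography is just my guess.

```python
from typing import Dict, List, Set

# Hypothetical domain list; the real set of domains is still undecided.
DOMAINS: List[str] = ["Business", "Geography", "Health", "Technology"]

# Hypothetical, hand-written lookup table (this is the part that does not exist yet).
ENTITY_TO_DOMAINS: Dict[str, Set[str]] = {
    "Doctors": {"Health"},
    "NHS": {"Health"},
    "British Medical Association (BMA)": {"Health"},
    "Glasgow": {"Geography"},
    "Sutton in Ashfield": {"Geography"},
    "Annesley": {"Geography"},
    "M1 motorway": {"Geography"},  # assumption: infrastructure filed under Geography
    # Ambiguous entities simply carry more than one domain.
    "Elon Musk": {"Business", "Technology"},
    "Tesla": {"Business", "Technology"},
}


def topics_to_domains(topics: List[str]) -> Set[str]:
    """Union of the domains of every entity in a row; unmapped entities are skipped."""
    domains: Set[str] = set()
    for entity in topics:
        domains |= ENTITY_TO_DOMAINS.get(entity, set())
    return domains


def to_multi_hot(domains: Set[str]) -> List[int]:
    """Encode a set of domains as a 0/1 vector in the order given by DOMAINS."""
    return [1 if d in domains else 0 for d in DOMAINS]


rows = [
    ["Doctors", "NHS", "British Medical Association (BMA)"],
    ["Glasgow"],
    ["Sutton in Ashfield", "Annesley", "M1 motorway"],
    ["Elon Musk", "Tesla"],
]

targets = [to_multi_hot(topics_to_domains(r)) for r in rows]
for topics, target in zip(rows, targets):
    print(topics, "->", target)
```

Each row ends up with one 0/1 vector over the domains, which is the shape a multi-label classifier would train on. In this sketch a row whose entities are all unmapped gets an all-zero vector, and I'm not sure whether such rows should be kept or dropped.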

My main questions:

  1. What is the best way to map entities → domains at scale?

  2. Should this mapping be manual, rule-based, or embedding-based?

  3. How should I handle ambiguous entities that can belong to multiple domains?

  4. Is this still a classification problem, or should I rethink it entirely?
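For questions 1 and 2, one thing I can measure before choosing an approach is how far a manual mapping of only the most frequent entities would get me. The sketch below is a rough way to check that. `head_coverage` is a hypothetical helper of mine, and the last two rows are made up purely so that the toy counts are not all equal.

```python
from collections import Counter
from typing import List


def head_coverage(rows: List[List[str]], top_n: int) -> float:
    """Fraction of rows containing at least one of the top_n most frequent entities."""
    if not rows or top_n <= 0:
        return 0.0
    # Count each entity at most once per row.
    counts = Counter(entity for row in rows for entity in set(row))
    head = {entity for entity, _ in counts.most_common(top_n)}
    covered = sum(1 for row in rows if head.intersection(row))
    return covered / len(rows)


rows = [
    ["Doctors", "NHS", "British Medical Association (BMA)"],
    ["Glasgow"],
    ["Sutton in Ashfield", "Annesley", "M1 motorway"],
    ["Elon Musk", "Tesla"],
    ["NHS"],             # made-up row, for illustration only
    ["Glasgow", "NHS"],  # made-up row, for illustration only
]

for n in (1, 2):
    print(f"top {n} entities cover {head_coverage(rows, n):.0%} of rows")
```

If a few hundred entities cover most rows on the real data, a manual or rule-based mapping for the head looks feasible, and the open question becomes what to do with the long tail.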

Any guidance on restructuring this dataset or designing a proper pipeline would help.