Question about mergeIntervals in Transformer Network Application: Named-Entity Recognition

I found the method mergeIntervals without any code comments (in fact, many custom methods that do not come from Keras or Hugging Face have few), and it is not easy to understand. So I resorted to printing out what it did to those “intervals”. I modified the method to return two things: the transformed entities and the original input.

I assume that any change to the entities will show up as a change in the length of the list (I may be wrong), so I printed out the cases where the length differs. The first one seemed understandable: it merged two intervals where one is entirely contained in the other and both have the same entity kind.

The second one is more perplexing. It effectively did this:

Before:
(476, 486, 'Companies worked at'),
(466, 476, 'Designation'),

After:
(466, 486, 'Companies worked at'),

It took two adjacent intervals of different kinds and created one interval covering the combined span, labeled with one of the kinds (‘Companies worked at’), while ‘Designation’ disappeared.

After the first example I formed a hypothesis about what the merge could be doing, but I cannot explain the second example.

I hope more code comments and explanations can be added, or, if this is described in the original dataset, a reference to it.

I reviewed a few more entity examples and now recognize what is happening. I also noticed that the two “Before” intervals in my comment above share the point 476, so they overlap rather than being merely adjacent.

a) If two intervals intersect, they are merged into one.
b) If two intervals intersect and have different entity kinds, the last interval’s kind seems to be picked.
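
To make rules (a) and (b) concrete, here is a minimal sketch of the behavior as I understand it. It is my own reconstruction from the printed examples, not the notebook's code. The name merge_intervals_observed is made up, and I am assuming that "last" means the interval that starts later once the intervals are sorted by start position.

```python
from typing import List, Tuple

Interval = Tuple[int, int, str]  # (start, end, entity kind)


def merge_intervals_observed(intervals: List[Interval]) -> List[Interval]:
    """Hypothetical reconstruction of rules (a) and (b); not the notebook's code."""
    merged: List[Interval] = []
    # Assumption: intervals are processed in order of their start position.
    for start, end, kind in sorted(intervals, key=lambda iv: iv[0]):
        if merged and start <= merged[-1][1]:
            # Rule (a): touching or overlapping intervals collapse into one.
            # Rule (b): the later interval's kind replaces the earlier one.
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), kind)
        else:
            merged.append((start, end, kind))
    return merged


print(merge_intervals_observed([
    (476, 486, "Companies worked at"),
    (466, 476, "Designation"),
]))
```

Note the `<=` in the overlap test: two intervals that only share an endpoint (476 here) already count as intersecting, which is why ‘Designation’ gets swallowed.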

I believe this piece of code may be buggy, and I am not sure whether its behavior also depends on the Python version, given how dictionaries are represented and iterated.

E.g. I have seen this:

Before:
(714, 719, 'Location'),
(707, 712, 'Location'),
(677, 719, 'College Name'),

After:
(677, 719, 'Location')

This looks wrong to me. [677, 719] is most likely ‘College Name’, since a college’s name sometimes contains part of a city or town name. One may have to look at the actual content to be sure. A better heuristic is to use the entity kind of the largest interval.
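
Here is a rough sketch of that "largest interval wins" heuristic, under the same assumptions as before (tuples of start, end, kind; intervals that share an endpoint count as overlapping). The function name merge_keep_largest_kind is hypothetical, and this is only one way to do it, not a drop-in replacement I have tested against the full dataset.

```python
def merge_keep_largest_kind(intervals):
    """Merge overlapping intervals; the merged kind comes from the largest member."""
    # Each entry: [start, end, kind, size of the interval that supplied the kind]
    merged = []
    for start, end, kind in sorted(intervals, key=lambda iv: iv[0]):
        size = end - start
        if merged and start <= merged[-1][1]:
            group = merged[-1]
            group[1] = max(group[1], end)  # extend the merged span
            if size > group[3]:
                # A strictly larger interval takes over the label.
                group[2], group[3] = kind, size
        else:
            merged.append([start, end, kind, size])
    return [(s, e, k) for s, e, k, _ in merged]


print(merge_keep_largest_kind([
    (714, 719, "Location"),
    (707, 712, "Location"),
    (677, 719, "College Name"),
]))
# [(677, 719, 'College Name')]
```

When two overlapping intervals have the same size, this sketch keeps the kind of the one that starts earlier, so the ‘Designation’ / ‘Companies worked at’ case would come out as ‘Designation’. Whether that is the right call probably still requires looking at the text.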

In my opinion, there may be a better merging algorithm (and a clearer one that doesn’t involve nested if-else).

With this context, I now see that this is all meant to deal with noisy or ambiguous labels. I really wish the notebook said a word or two about this data cleansing.