
Labelling Schemes for Sequence Tagging Tasks



The reason this post exists is that it would have been exceedingly useful to me when I was experimenting with sequence labelling. The choice of labelling scheme seems inconsequential, and it's probably something that gets overlooked in favour of more important details about the data, implementation and model architecture, but it's interesting how much of a difference these small things make. There are many ways to label your data, and they affect output precision and recall in different ways. The observations below are the result of a series of extensive experiments on a structured prediction/sequence tagging task -- both my own and a few others that I managed to put together.


Let's first briefly define the task at hand. What we're working with is unstructured text that we want to structure. Structuring could mean converting to graphs, trees, tables, relations or sequences; here, I'm focusing on sequences. So for each sentence (represented by its constituent words), we want to know which sequences of words mean something to us, and we want to clearly demarcate their boundaries. The easiest way to do this is to assign a tag to each word, as in NER or POS tagging.


Consider a simple example. Say we have a document: a series of sentences, and for each sentence we want to mark the terms relevant to us. Since we're computational linguists, the relevant terms are NLP terms. We want to mark everything that relates to NLP as "inside", with an "I" tag, and everything else as irrelevant or "outside", with an "O" tag.


Example: "The coolest thing I've ever learnt is Semantic Machine Translation."

IO:      O   O   O   O   O   O   O   I     I   I
BIO:     O   O   O   O   O   O   O   B-I   I   I
IOE:     O   O   O   O   O   O   O   I     I   E-I
BIOES:   O   O   O   O   O   O   O   B-I   I   E-I

(The last three tags line up with "Semantic Machine Translation".)

There are obviously a number of different ways to do this, depending on the length of the sequences (what if we have a one-word sequence? Should we have a separate "S" tag in place of the B-I and E-I?) and the number of categories we want to mark (what if we're interested in more than just NLP terms?). The sketch below shows how these variants can be generated.
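To make this concrete, here is a minimal sketch (in Python; the function name and span representation are my own, not from any particular library) that converts a span annotation into each of the schemes discussed in the rest of this post. It prefixes every tag with its category (e.g. "I-NLP") so that multiple categories can coexist, and it emits an "S" tag for single-word sequences under BIOES.

```python
def tag_sentence(tokens, spans, scheme="BIO"):
    """Return one tag per token under the given scheme (IO, BIO, IOE or BIOES).

    `spans` is a list of (start, end, category) token offsets, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, cat in spans:
        for i in range(start, end):
            first, last = (i == start), (i == end - 1)
            if scheme == "IO":
                tags[i] = f"I-{cat}"
            elif scheme == "BIO":
                tags[i] = f"B-{cat}" if first else f"I-{cat}"
            elif scheme == "IOE":
                tags[i] = f"E-{cat}" if last else f"I-{cat}"
            elif scheme == "BIOES":
                if first and last:
                    tags[i] = f"S-{cat}"   # single-word sequence
                elif first:
                    tags[i] = f"B-{cat}"
                elif last:
                    tags[i] = f"E-{cat}"
                else:
                    tags[i] = f"I-{cat}"
    return tags

tokens = "The coolest thing I've ever learnt is Semantic Machine Translation".split()
spans = [(7, 10, "NLP")]    # "Semantic Machine Translation"
for scheme in ("IO", "BIO", "IOE", "BIOES"):
    print(scheme, tag_sentence(tokens, spans, scheme))
```

Running it on the example sentence reproduces the four tag rows above, just with the category prefix added.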



We want our models to predict sequences as well as possible, and we therefore want to label the input data we feed them in the best possible way. For simple sequence tagging, let's look at the most obvious ways to tag word sequences.


1. IO.

We mark words inside a sequence with the "I" tag and all other words with an "O" tag. We're not concerned with indicating where a sequence begins or ends. This is the least complex scheme, but it obviously breaks down when two sequences of the same category sit right next to each other in the same sentence, since there's no way to tell where one ends and the next begins (the decoding sketch after this list illustrates this). If we're not concerned about separating adjacent sequences of the same category, this seems to work best.

2. BIO.

Here we additionally mark the beginning of each sequence with a "B" tag. This seems to be the most popular scheme, and it is especially useful when other encoded features correlate well with sequence beginnings.

3. IOE.

Here we instead explicitly identify only the end of each sequence, with an "E" tag.

4. BIOES.

Here we mark both the beginning and the end. The "S" tag is used when the sequence consists of a single word, in which case we don't need the "B-I" and "E-I" tags.

I believe these schemes were introduced in Andrew Borthwick's thesis, A Maximum Entropy Approach to Named Entity Recognition, but they may have been used earlier.
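To see why a beginning tag matters once we decode predictions back into spans, here is a rough decoding sketch (again with names of my own choosing, not from any particular library). Feeding it BIO tags keeps two adjacent sequences of the same category apart; feeding it plain IO-style tags merges them into one span, which is exactly the failure mode mentioned under IO above.

```python
def decode_bio(tags):
    """Turn a list of tags like 'B-NLP'/'I-NLP'/'O' into (start, end, category) spans."""
    spans, start, cat = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # "O" sentinel flushes the last span
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != cat) or prefix == "O":
            if start is not None:
                spans.append((start, i, cat))       # close the open span
                start, cat = None, None
            if prefix in ("B", "I"):                # open a new span
                start, cat = i, label
    return spans

# Two adjacent entities: BIO keeps them apart, IO-style tags merge them.
print(decode_bio(["B-NLP", "I-NLP", "B-NLP", "O"]))   # [(0, 2, 'NLP'), (2, 3, 'NLP')]
print(decode_bio(["I-NLP", "I-NLP", "I-NLP", "O"]))   # [(0, 3, 'NLP')] -- merged into one
```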



Results:


Firstly, whatever the task at hand, it's useful to look at the kind of sequences we want to extract. Are they of variable length? Are they always long word sequences, or do they often consist of a single word? Are there particular syntactic markers or begin/end words? If certain features correlate well with beginning or end tokens, we see dramatic variation in results.


In most cases, for multiple-category tagging, IO encoding is useful and has the lowest complexity. The Ratinov and Roth survey concluded that BIOES is superior to BIO or IO for a second-order logistic model.


In experiments on biomedical information extraction via sequence tagging, we see the following:

For longer sequences, and sequences whose lengths vary inconsistently, IO seems to work best. It gives both the highest precision and the highest recall, compared to BIO, IOE and BIOES.

For a category with a significant number of one-word sequences, BIOES seems to work best. It has the highest recall, though precision is still better with plain IO. The "S" tag therefore seems to be useful when we actually have sequences of only one word.
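Comparisons like these are typically made at the span level: predictions in whichever scheme are first decoded back into spans and then matched exactly against the gold spans. Here is a small sketch of that computation, assuming a decoder like the one above (the helper names are mine, and exact boundary/category match is assumed).

```python
def span_prf(gold_spans, pred_spans):
    """Span-level precision, recall and F1 with exact boundary and category match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = [(7, 10, "NLP")]
pred = [(8, 10, "NLP")]            # boundary error: missed "Semantic"
print(span_prf(gold, pred))        # (0.0, 0.0, 0.0) -- exact match gives no credit
```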



Things to note:

Complexity: first-best (Viterbi) decoding in an HMM or first-order CRF is quadratic in the number of tags and linear in the number of tokens.
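To put rough numbers on that: for C categories, IO uses C+1 tags, BIO and IOE use 2C+1, and BIOES uses 4C+1 (plus the lone "O"), so the per-token decoding cost, which scales with the square of the tag-set size K, grows quickly as the scheme gets richer. A quick back-of-the-envelope sketch:

```python
# Tag-set size per scheme for C categories, and the K^2 factor that
# first-best (Viterbi) decoding pays per token.
def tagset_size(scheme, num_categories):
    per_category = {"IO": 1, "BIO": 2, "IOE": 2, "BIOES": 4}[scheme]
    return per_category * num_categories + 1        # +1 for the "O" tag

for scheme in ("IO", "BIO", "IOE", "BIOES"):
    k = tagset_size(scheme, num_categories=5)        # e.g. 5 entity categories
    print(f"{scheme:6s} K={k:2d}  per-token cost ~ K^2 = {k * k}")
```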