title: "Chapter 4: Training a neural network model"
description: "In this chapter, you'll learn how to update spaCy's statistical models to customize them for your use case – for example, to predict a new entity type in online comments. You'll write your own training loop from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful."
prev: /chapter3
type: chapter
id: 4
While spaCy comes with a range of pre-trained models to predict linguistic annotations, you almost always want to fine-tune them with more examples. You can do this by training them with more labelled data.
What does training not help with?

- If a pre-trained model doesn't perform well on your data, training it with more examples is often a good solution.
- You can use training to teach the model new labels, entity types or other classification schemes.
- spaCy's components are supervised models for text annotations, meaning they can only learn to reproduce examples, not guess new labels from raw text.
spaCy's rule-based `Matcher` is a great way to quickly create training data for named entity models. A list of sentences is available as the variable `TEXTS`. You can print it to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as `"GADGET"`.
- Write a pattern for two tokens whose lowercase forms match `"iphone"` and `"x"`.
- Write a pattern for two tokens: one token whose lowercase form matches `"iphone"` and a digit.
- To match the lowercase form of a token, you can use the `"LOWER"` attribute. For example: `{"LOWER": "apple"}`.
- To find a digit token, you can use the `"IS_DIGIT"` flag. For example: `{"IS_DIGIT": True}`.
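Written out, the two patterns are just lists of token-attribute dicts – a sketch of one possible solution:

```python
# Each pattern is a list of dicts, one dict per token.
# Matches "iPhone X", "IPHONE x", etc. via the tokens' lowercase forms:
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Matches "iPhone 8", "iphone 11", etc.: lowercase "iphone" followed by a digit token:
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
```

With spaCy loaded, you'd register them with the matcher, e.g. `matcher.add("GADGET", None, pattern1, pattern2)` in the v2 API this chapter uses.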
Let's use the match patterns we've created in the previous exercise to bootstrap a set of training examples. A list of sentences is available as the variable `TEXTS`.
- Create a `doc` object for each text using `nlp.pipe`.
- Match on the `doc` and create a list of matched spans.
- Get `(start character, end character, label)` tuples of matched spans.
- Format each example as a tuple of the text and a dict, mapping `"entities"` to the entity tuples.
- Append the example to `TRAINING_DATA` and inspect the printed data.
- To find matches, call the `matcher` on a `doc`.
- The returned matches are `(match_id, start, end)` tuples.
- To add an example to the list of training examples, you can use `TRAINING_DATA.append()`.
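The formatting steps can be sketched without running spaCy itself. The sentences and character offsets below are hypothetical stand-ins for what the Matcher would produce:

```python
# Hypothetical texts and the spans a matcher might have found in them
TEXTS = ["How to preorder the iPhone X", "iPhone X is coming"]
spans_per_text = [
    [(20, 28, "GADGET")],  # "iPhone X" in the first sentence
    [(0, 8, "GADGET")],    # "iPhone X" in the second sentence
]

TRAINING_DATA = []
for text, entities in zip(TEXTS, spans_per_text):
    # Format: (text, {"entities": [(start_char, end_char, label), ...]})
    TRAINING_DATA.append((text, {"entities": entities}))
```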
In this exercise, you'll prepare a spaCy pipeline to train the entity recognizer to recognize `"GADGET"` entities in a text – for example, "iPhone X".
- Create a blank `"en"` model, for example using the `spacy.blank` method.
- Create a new entity recognizer using `nlp.create_pipe` and add it to the pipeline.
- Add the new label `"GADGET"` to the entity recognizer using the `add_label` method on the pipeline component.
- To create a blank entity recognizer, you can call `nlp.create_pipe` with the string `"ner"`.
- To add the component to the pipeline, use the `nlp.add_pipe` method.
- The `add_label` method is a method of the entity recognizer pipeline component, which you've stored in the variable `ner`. To add a label to it, you can call `ner.add_label` with the string name of the label, for example `ner.add_label("SOME_LABEL")`.
Let's write a simple training loop from scratch!

The pipeline you've created in the previous exercise is available as the `nlp` object. It already contains the entity recognizer with the added label `"GADGET"`.

The small set of labelled examples that you've created previously is available as `TRAINING_DATA`. To see the examples, you can print them in your script.
- Call `nlp.begin_training`, create a training loop for 10 iterations and shuffle the training data.
- Create batches of training data using `spacy.util.minibatch` and iterate over the batches.
- Convert the `(text, annotations)` tuples in the batch to lists of `texts` and `annotations`.
- For each batch, use `nlp.update` to update the model with the texts and annotations.
- To start the training and reset the weights, call the `nlp.begin_training()` method.
- To divide the training data into batches, call the `spacy.util.minibatch` function on the list of training examples.
Let's see how the model performs on unseen data! To speed things up a little, we already ran a trained model for the label `"GADGET"` over some text. Here are some of the results:
| Text | Entities |
| --- | --- |
| Apple is slowing down the iPhone 8 and iPhone X - how to stop it | (iPhone 8, iPhone X) |
| I finally understand what the iPhone X 'notch' is for | (iPhone X,) |
| Everything you need to know about the Samsung Galaxy S9 | (Samsung Galaxy,) |
| Looking to compare iPad models? Here’s how the 2018 lineup stacks up | (iPad,) |
| The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple | (iPhone 8, iPhone 8) |
| what is the cheapest ipad, especially ipad pro??? | (ipad, ipad) |
| Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics | (Samsung Galaxy,) |
Out of all the entities in the texts, how many did the model get correct? Keep in mind that incomplete entity spans count as mistakes, too! Tip: count the number of entities that the model should have predicted, then count the number it actually predicted correctly, and divide the correct predictions by the total number of correct entities.
Try counting the number of correctly predicted entities and divide it by the number of total correct entities the model should have predicted.
On our test data, the model achieved an accuracy of 70%.
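One way to do the counting in code. The per-row gold entities below are my reading of what the model should have predicted from each text in the table, so treat them as an assumption:

```python
from collections import Counter

# Entities the model *should* have predicted (gold) vs. what it predicted,
# one list per row of the results table above (my reading – an assumption):
gold = [
    ["iPhone 8", "iPhone X"],
    ["iPhone X"],
    ["Samsung Galaxy S9"],
    ["iPad"],
    ["iPhone 8", "iPhone 8 Plus"],
    ["ipad", "ipad pro"],
    ["Samsung Galaxy"],
]
pred = [
    ["iPhone 8", "iPhone X"],
    ["iPhone X"],
    ["Samsung Galaxy"],        # incomplete span – counts as a mistake
    ["iPad"],
    ["iPhone 8", "iPhone 8"],  # second span is incomplete
    ["ipad", "ipad"],          # "ipad pro" was cut short
    ["Samsung Galaxy"],
]

# Multiset intersection so duplicate spans aren't double-counted
correct = sum(sum((Counter(g) & Counter(p)).values()) for g, p in zip(gold, pred))
total = sum(len(g) for g in gold)
accuracy = correct / total  # 7 / 10 = 0.7
```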
Here's an excerpt from a training set that labels the entity type `"TOURIST_DESTINATION"` in traveler reviews.
```python
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "TOURIST_DESTINATION")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "TOURIST_DESTINATION")]},
    ),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "TOURIST_DESTINATION")]},
    ),
]
```
Why is this data and label scheme problematic?
A much better approach would be to only label `"GPE"` (geopolitical entity) or `"LOCATION"` and then use a rule-based system to determine whether the entity is a tourist destination in this context. For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki.
While it's possible that Paris, AK is also a tourist attraction, this only highlights how subjective the label scheme is and how difficult it will be to decide whether the label applies or not. As a result, this distinction will also be very difficult to learn for the entity recognizer.
Even very uncommon words or misspellings can be labelled as entities. In fact, being able to predict categories in misspelled text based on the context is one of the big advantages of statistical named entity recognition.
- Rewrite the `TRAINING_DATA` to only use the label `"GPE"` (cities, states, countries) instead of `"TOURIST_DESTINATION"`.
- Don't forget to add tuples for the `"GPE"` entities that weren't labeled in the old data.
- For the spans that are already labelled, you'll only need to change the label name from `"TOURIST_DESTINATION"` to `"GPE"`.
- One text includes a city and a state that aren't labeled yet. To add the entity spans, count the characters to find out where the entity span starts and where it ends. Then add `(start, end, label)` tuples to the entities.
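One possible rewrite, with offsets computed from the texts above. The `Paris` and `Arkansas` spans in the third example are the ones that weren't labelled before:

```python
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "GPE")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "GPE")]},
    ),
    (
        # The city and the state are now labelled as well
        "There's also a Paris in Arkansas, lol",
        {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
    ),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "GPE")]},
    ),
]
```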
Here's a small sample of a dataset created to train a new entity type `"WEBSITE"`. The original dataset contains a few thousand sentences. In this exercise, you'll be doing the labeling by hand. In real life, you probably want to automate this and use an annotation tool – for example, Brat, a popular open-source solution, or Prodigy, our own annotation tool that integrates with spaCy.
- Complete the entity offsets for the `"WEBSITE"` entities in the data. Feel free to use `len()` if you don't want to count the characters.
- The start and end offset of an entity span are the character offsets into the text. For example, if an entity starts at position 5, the start offset is `5`. Remember that the end offsets are exclusive – so `10` means up to character 10.
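Rather than counting characters by hand, you can compute the offsets with `str.find` and `len()`. The sentence here is a made-up example, not taken from the dataset:

```python
# Hypothetical sentence containing a WEBSITE entity
text = "Reddit partners with Patreon to help creators build communities"

# Find where the entity starts, and add its length to get the exclusive end
start = text.find("Patreon")
end = start + len("Patreon")

# Sanity check: the slice should give back exactly the entity text
assert text[start:end] == "Patreon"
entity = (start, end, "WEBSITE")
```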
A model was trained with the data you just labelled, plus a few thousand similar examples. After training, it's doing great on `"WEBSITE"`, but doesn't recognize `"PERSON"` anymore. Why could this be happening?
It's definitely possible for a model to learn about very different categories. For example, spaCy's pre-trained English models can recognize persons, but also organizations or percentages.
If `"PERSON"` entities occur in the training data but aren't labelled, the model will learn that they shouldn't be predicted. Similarly, if an existing entity type isn't present in the training data, the model may "forget" and stop predicting it.
While the hyperparameters can influence a model's accuracy, they're likely not the problem here.
- Update the training data to include annotations for the `"PERSON"` entities "PewDiePie" and "Alexis Ohanian".
- To add more entities, append another `(start, end, label)` tuple to the list.
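A hypothetical example of what a fully labelled entry might look like, with both entity types in one text (this sentence is not from the actual dataset):

```python
# Made-up sentence mentioning both a person and a website
text = "PewDiePie smashes YouTube record"

# One (start, end, label) tuple per entity, PERSON alongside WEBSITE
annotations = {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]}
example = (text, annotations)
```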