---
title: 'Chapter 2: Large-scale data analysis with spaCy'
description: "In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis."
prev: /chapter1
next: /chapter3
type: chapter
id: 2
---

Part 1

  • Look up the string "cat" in nlp.vocab.strings to get the hash.
  • Look up the hash to get back the string.
  • You can use the string store in nlp.vocab.strings like a regular Python dictionary. For example, nlp.vocab.strings["unicorn"] will return the hash, and looking up the hash again will return the string "unicorn".
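A minimal sketch of both directions of the lookup, assuming a blank English pipeline and a text that contains "cat" (so the string is in the store):

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I have a cat")

# Look up the string "cat" to get the hash
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the hash to get back the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)
```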

Part 2

  • Look up the string label "PERSON" in nlp.vocab.strings to get the hash.
  • Look up the hash to get back the string.
  • You can use the string store in nlp.vocab.strings like a regular Python dictionary. For example, nlp.vocab.strings["unicorn"] will return the hash, and looking up the hash again will return the string "unicorn".
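The same round trip works for label strings. A sketch, assuming the processed text itself contains "PERSON" so the string store has seen it:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("David Bowie is a PERSON")

# Look up the string label "PERSON" to get the hash
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the hash to get back the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)
```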

Why does this code throw an error?

```python
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings["Bowie"]
print(bowie_id)

# Look up the ID for "Bowie" in the vocab
print(nlp_de.vocab.strings[bowie_id])
```

Hashes can't be reversed. To prevent this problem, add the word to the new vocab by processing a text or looking up the string, or use the same vocab to resolve the hash back to a string.
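A sketch of both workarounds, here using a processed text to add "Bowie" to each store:

```python
from spacy.lang.en import English
from spacy.lang.de import German

nlp = English()
nlp_de = German()

# Process a text so "Bowie" is added to the English string store
nlp("Bowie")
bowie_id = nlp.vocab.strings["Bowie"]

# Resolving the hash with the vocab that has seen the string works
print(nlp.vocab.strings[bowie_id])

# Making the German vocab see the string first also works
nlp_de("Bowie")
print(nlp_de.vocab.strings[bowie_id])
```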

Any string can be converted to a hash.

The variable name nlp is only a convention. If the code used the variable name nlp instead of nlp_de, it'd overwrite the existing nlp object, including the vocab.

Let's create some Doc objects from scratch!

Part 1

  • Import the Doc from spacy.tokens.
  • Create a Doc from the words and spaces. Don't forget to pass in the vocab!

The Doc class takes three arguments: the shared vocabulary (usually nlp.vocab), a list of words, and a list of spaces (boolean values indicating whether each word is followed by a space or not).
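For example, a sketch for the text "spaCy is cool!" (the exact words and spaces here are assumptions for illustration):

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces, passing in the shared vocab
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
```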

Part 2

  • Import the Doc from spacy.tokens.
  • Create a Doc from the words and spaces. Don't forget to pass in the vocab!

Look at each word in the desired text output and check if it's followed by a space. If so, the spaces value should be True. If not, it should be False.

Part 3

  • Import the Doc from spacy.tokens.
  • Complete the words and spaces to match the desired text and create a doc.

Pay attention to the individual tokens. To see how spaCy usually tokenizes that string, you can try it and print the tokens for nlp("Oh, really?!").
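For example, a quick way to inspect the tokenization:

```python
from spacy.lang.en import English

nlp = English()

# Print how spaCy splits the string into tokens
print([token.text for token in nlp("Oh, really?!")])
# ['Oh', ',', 'really', '?', '!']
```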

In this exercise, you'll create the Doc and Span objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

  • Import the Doc and Span classes from spacy.tokens.
  • Use the Doc class directly to create a doc from the words and spaces.
  • Create a Span for "David Bowie" from the doc and assign it the label "PERSON".
  • Overwrite the doc.ents with a list of one entity, the "David Bowie" span.
  • The Doc is initialized with three arguments: the shared vocab (e.g. nlp.vocab), a list of words, and a list of boolean values indicating whether each word should be followed by a space.
  • The Span class takes four arguments: the reference doc, the start token index, the end token index and an optional label.
  • The doc.ents property is writable, so you can assign it any iterable consisting of Span objects.
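A sketch of the full sequence, using "I like David Bowie" as an assumed example text:

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span for "David Bowie" (tokens 2 and 3) with the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")

# Overwrite the doc's entities with a list of one span
doc.ents = [span]
print([(ent.text, ent.label_) for ent in doc.ents])
```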

The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)
```

Part 1

Why is the code bad?

It shouldn't be necessary to convert strings back to Token objects. Instead, try to avoid converting tokens to strings if you still need to access their attributes and relationships.

Always convert the results to strings as late as possible, and try to use native token attributes to keep things consistent.

The .pos_ attribute returns the coarse-grained part-of-speech tag and "PROPN" is the correct tag to check for proper nouns.

Part 2

  • Rewrite the code to use the native token attributes instead of lists of token_texts and pos_tags.
  • Loop over each token in the doc and check the token.pos_ attribute.
  • Use doc[token.i + 1] to check for the next token and its .pos_ attribute.
  • If a proper noun before a verb is found, print its token.text.
  • Remove the token_texts and pos_tags – we don't need to compile lists of strings upfront!
  • Instead of iterating over the pos_tags, loop over each token in the doc and check the token.pos_ attribute.
  • To check if the next token is a verb, take a look at doc[token.i + 1].pos_.
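One possible rewrite, sketched below (a bounds check is added so a proper noun at the end of the doc doesn't cause an IndexError):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)
```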

In this exercise, you'll use a larger English model, which includes around 20,000 word vectors. The model is already pre-installed.

  • Load the medium "en_core_web_md" model with word vectors.
  • Print the vector for "bananas" using the token.vector attribute.
  • To load a statistical model, call spacy.load with its string name.
  • To access a token in a doc, you can index into it. For example, doc[4].
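A sketch, using the assumed example sentence "Two bananas in pyjamas":

```python
import spacy

# Load the medium model with word vectors
nlp = spacy.load("en_core_web_md")

doc = nlp("Two bananas in pyjamas")

# Access the token "bananas" by index and print its vector
bananas_vector = doc[1].vector
print(bananas_vector)
```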

In this exercise, you'll be using spaCy's similarity methods to compare Doc, Token and Span objects and get similarity scores.

Part 1

  • Use the doc.similarity method to compare doc1 to doc2 and print the result.

The doc.similarity method takes one argument: the other object to compare the current object to.

Part 2

  • Use the token.similarity method to compare token1 to token2 and print the result.
  • The token.similarity method takes one argument: the other object to compare the current object to.

Part 3

  • Create spans for "great restaurant"/"really nice bar".
  • Use span.similarity to compare them and print the result.
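A condensed sketch of all three comparisons (the example texts are assumptions; word vectors from en_core_web_md are needed for meaningful scores):

```python
import spacy

nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")
print(doc1.similarity(doc2))

# Compare two tokens
doc = nlp("TV and books")
print(doc[0].similarity(doc[2]))

# Compare two spans
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")
span1 = doc[3:5]    # "great restaurant"
span2 = doc[12:15]  # "really nice bar"
print(span1.similarity(span2))
```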

Why does this pattern not match the tokens "Silicon Valley" in the doc?

pattern = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]
doc = nlp("Can Silicon Valley workers rein in big tech from within?")

The "LOWER" attribute in the pattern describes tokens whose lowercase form matches a given value. So {"LOWER": "valley"} will match tokens like "Valley", "VALLEY", "valley" etc.

The tokenizer already takes care of splitting off whitespace and each dictionary in the pattern describes one token.

By default, all tokens described by a pattern will be matched exactly once. Operators are only needed to change this behavior – for example, to match zero or more times.
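Since the whitespace never becomes its own token, a pattern with one dict per token matches. A sketch, assuming spaCy v3's Matcher.add signature:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# One dict per token: "Silicon" and "Valley" are two tokens,
# with no whitespace token in between
pattern = [{"LOWER": "silicon"}, {"LOWER": "valley"}]
matcher.add("SILICON_VALLEY", [pattern])

doc = nlp("Can Silicon Valley workers rein in big tech from within?")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```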

Both patterns in this exercise contain mistakes and won't match as expected. Can you fix them? If you get stuck, try printing the tokens in the doc to see how the text will be split and adjust the pattern so that each dictionary represents one token.

  • Edit pattern1 so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.
  • Edit pattern2 so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.
  • Try processing the strings that should be matched with the nlp object – for example [token.text for token in nlp("ad-free viewing")].
  • Inspect the tokens and make sure each dictionary in the pattern correctly describes one token.
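For example, a sketch of the tokenization check: "ad-free" is split into three tokens, so the pattern needs three dicts to cover it:

```python
from spacy.lang.en import English

nlp = English()

# "ad-free" is split on the hyphen
print([token.text for token in nlp("ad-free viewing")])
# ['ad', '-', 'free', 'viewing']
```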

Sometimes it's more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let's use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.

  • Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
  • Add the phrase patterns and call the matcher on the doc.

The shared vocab is available as nlp.vocab.
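A sketch of the setup, using a tiny stand-in for the COUNTRIES list and spaCy v3's PhraseMatcher.add signature:

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()

# Stand-in for the full COUNTRIES list from the exercise
COUNTRIES = ["Czech Republic", "Slovakia"]

# Initialize the PhraseMatcher with the shared vocab
matcher = PhraseMatcher(nlp.vocab)

# Add the phrase patterns
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))

# Call the matcher on the doc
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([doc[start:end].text for match_id, start, end in matcher(doc)])
```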

In the previous exercise, you wrote a script using spaCy's PhraseMatcher to find country names in text. Let's use that country matcher on a longer text, analyze the syntax and update the document's entities with the matched countries.

  • Iterate over the matches and create a Span with the label "GPE" (geopolitical entity).
  • Overwrite the entities in doc.ents and add the matched span.
  • Get the matched span's root head token.
  • Print the text of the head token and the span.
  • Remember that the text is available as the variable text.
  • The span's root token is available as span.root. A token's head is available via the token.head attribute.
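A condensed sketch of the whole pipeline. The model, example text and two-country list are assumptions; the NER component is disabled here so doc.ents starts out empty while the parser still provides head tokens:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Keep the parser for .head, but disable NER so doc.ents starts empty
nlp = spacy.load("en_core_web_sm", disable=["ner"])

COUNTRIES = ["Namibia", "South Africa"]  # stand-in for the full list
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))

text = "Namibia gained independence from South Africa in 1990."
doc = nlp(text)

for match_id, start, end in matcher(doc):
    # Create a Span with the label "GPE" (geopolitical entity)
    span = Span(doc, start, end, label="GPE")

    # Overwrite doc.ents, keeping existing entities and adding the span
    doc.ents = list(doc.ents) + [span]

    # Print the text of the span's root head token and the span
    print(span.root.head.text, "-->", span.text)

print([(ent.text, ent.label_) for ent in doc.ents])
```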