---
title: 'Chapter 2: Large-scale data analysis with spaCy'
description: "In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis."
prev: /chapter1
next: /chapter3
type: chapter
id: 2
---
- Look up the string "cat" in
nlp.vocab.strings
to get the hash. - Look up the hash to get back the string.
- You can use the string store in
nlp.vocab.strings
like a regular Python dictionary. For example,nlp.vocab.strings["unicorn"]
will return the hash, and looking up the hash again will return the string"unicorn"
.
- Look up the string label "PERSON" in `nlp.vocab.strings` to get the hash.
- Look up the hash to get back the string.

You can use the string store in `nlp.vocab.strings` like a regular Python dictionary. For example, `nlp.vocab.strings["unicorn"]` will return the hash, and looking up the hash again will return the string `"unicorn"`.
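For reference, here's a minimal sketch of both lookup directions, assuming a blank English pipeline (the exercise's own setup may differ):

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I have a cat")

# Look up the string "cat" to get the hash
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the hash to get back the string. This only works because "cat"
# was added to the string store when the text was processed.
print(nlp.vocab.strings[cat_hash])
```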
Why does this code throw an error?

```python
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings["Bowie"]
print(bowie_id)

# Look up the ID for "Bowie" in the vocab
print(nlp_de.vocab.strings[bowie_id])
```
Hashes can't be reversed. To prevent this problem, add the word to the new vocab by processing a text or looking up the string, or use the same vocab to resolve the hash back to a string.
Any string can be converted to a hash.
The variable name `nlp` is only a convention. If the code used the variable name `nlp` instead of `nlp_de`, it'd overwrite the existing `nlp` object, including the vocab.
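As a sketch of one possible fix, assuming the same setup as the snippet above, you can make the hash resolvable by adding the string to the German vocab first:

```python
# Add "Bowie" to the German string store so the hash can be resolved again
nlp_de.vocab.strings.add("Bowie")
print(nlp_de.vocab.strings[bowie_id])  # "Bowie"
```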
Let's create some `Doc` objects from scratch!
- Import the `Doc` from `spacy.tokens`.
- Create a `Doc` from the `words` and `spaces`. Don't forget to pass in the vocab!
The `Doc` class takes 3 arguments: the shared vocabulary, usually `nlp.vocab`, a list of `words` and a list of `spaces`, boolean values indicating whether the word is followed by a space or not.
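For example, here's a minimal sketch that builds the text "Hello world!" from scratch:

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# "Hello" is followed by a space, "world" and "!" are not
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a Doc from the words and spaces, passing in the shared vocab
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # "Hello world!"
```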
- Import the `Doc` from `spacy.tokens`.
- Create a `Doc` from the `words` and `spaces`. Don't forget to pass in the vocab!
Look at each word in the desired text output and check if it's followed by a space. If so, the spaces value should be `True`. If not, it should be `False`.
- Import the `Doc` from `spacy.tokens`.
- Complete the `words` and `spaces` to match the desired text and create a `doc`.
Pay attention to the individual tokens. To see how spaCy usually tokenizes that string, you can try it and print the tokens for `nlp("Oh, really?!")`.
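For instance, a quick check like this (with a blank English pipeline) shows how the string is split:

```python
from spacy.lang.en import English

nlp = English()

# Print how spaCy tokenizes the target string
print([token.text for token in nlp("Oh, really?!")])
# Likely output: ['Oh', ',', 'really', '?', '!']
```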
In this exercise, you'll create the `Doc` and `Span` objects manually, and update the named entities – just like spaCy does behind the scenes. A shared `nlp` object has already been created.
- Import the `Doc` and `Span` classes from `spacy.tokens`.
- Use the `Doc` class directly to create a `doc` from the words and spaces.
- Create a `Span` for "David Bowie" from the `doc` and assign it the label `"PERSON"`.
- Overwrite the `doc.ents` with a list of one entity, the "David Bowie" `span`.
- The `Doc` is initialized with three arguments: the shared vocab, e.g. `nlp.vocab`, a list of words and a list of boolean values indicating whether the word should be followed by a space.
- The `Span` class takes four arguments: the reference `doc`, the start token index, the end token index and an optional label.
- The `doc.ents` property is writable, so you can assign it any iterable consisting of `Span` objects (see the sketch below).
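Putting the three hints together, a minimal sketch of the whole flow might look like this (the example words and token indices are illustrative):

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span for "David Bowie" (tokens 2 to 4) with the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")

# Overwrite the doc.ents with a list of one entity
doc.ents = [span]
print([(ent.text, ent.label_) for ent in doc.ents])
```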
The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)
```
Why is the code bad?
It shouldn't be necessary to convert strings back to `Token` objects. Instead, try to avoid converting tokens to strings if you still need to access their attributes and relationships.
Always convert the results to strings as late as possible, and try to use native token attributes to keep things consistent.
The `.pos_` attribute returns the coarse-grained part-of-speech tag and `"PROPN"` is the correct tag to check for proper nouns.
- Rewrite the code to use the native token attributes instead of lists of `token_texts` and `pos_tags`.
- Loop over each `token` in the `doc` and check the `token.pos_` attribute.
- Use `doc[token.i + 1]` to check for the next token and its `.pos_` attribute.
- If a proper noun before a verb is found, print its `token.text`.
- Remove the `token_texts` and `pos_tags` – we don't need to compile lists of strings upfront!
- Instead of iterating over the `pos_tags`, loop over each `token` in the `doc` and check the `token.pos_` attribute.
- To check if the next token is a verb, take a look at `doc[token.i + 1].pos_` (see the sketch below).
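A rewritten version might look like the following sketch, with an added bounds check so the last token doesn't cause an `IndexError`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token exists and is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)
```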
In this exercise, you'll use a larger English model, which includes around 20,000 word vectors. The model is already pre-installed.
- Load the medium `"en_core_web_md"` model with word vectors.
- Print the vector for `"bananas"` using the `token.vector` attribute.
- To load a statistical model, call `spacy.load` with its string name.
- To access a token in a doc, you can index into it. For example, `doc[4]` (see the sketch below).
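A minimal sketch, assuming the model is installed and that "bananas" is the second token of the example sentence:

```python
import spacy

# Load the medium English model with word vectors
nlp = spacy.load("en_core_web_md")
doc = nlp("Two bananas in pyjamas")

# Access the vector via the token.vector attribute
bananas_vector = doc[1].vector
print(bananas_vector)
```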
In this exercise, you'll be using spaCy's `similarity` methods to compare `Doc`, `Token` and `Span` objects and get similarity scores.
- Use the `doc.similarity` method to compare `doc1` to `doc2` and print the result.
The `doc.similarity` method takes one argument: the other object to compare the current object to.
- Use the `token.similarity` method to compare `token1` to `token2` and print the result.
- The `token.similarity` method takes one argument: the other object to compare the current object to.
- Create spans for "great restaurant"/"really nice bar".
- Use `span.similarity` to compare them and print the result (see the sketch below).
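Here's a sketch of all three comparisons; the example sentences and span indices are illustrative, not necessarily the exercise's:

```python
import spacy

# Similarity comparisons need a model with word vectors
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")
print(doc1.similarity(doc2))

# Compare two tokens
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]
print(token1.similarity(token2))

# Compare two spans
doc3 = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")
span1 = doc3[3:5]    # "great restaurant"
span2 = doc3[12:15]  # "really nice bar"
print(span1.similarity(span2))
```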
Why does this pattern not match the tokens "Silicon Valley" in the `doc`?

```python
pattern = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]
doc = nlp("Can Silicon Valley workers rein in big tech from within?")
```
The "LOWER"
attribute in the pattern describes tokens whose lowercase form
matches a given value. So {"LOWER": "valley"}
will match tokens like "Valley",
"VALLEY", "valley" etc.
The tokenizer already takes care of splitting off whitespace and each dictionary in the pattern describes one token.
By default, all tokens described by a pattern will be matched exactly once. Operators are only needed to change this behavior – for example, to match zero or more times.
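As a sketch of a working version, with one dictionary per token and no whitespace token (assuming spaCy v3's `Matcher.add` signature):

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("Can Silicon Valley workers rein in big tech from within?")

matcher = Matcher(nlp.vocab)

# The tokenizer already splits off whitespace, so two token dicts are enough
pattern = [{"LOWER": "silicon"}, {"LOWER": "valley"}]
matcher.add("SILICON_VALLEY", [pattern])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```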
Both patterns in this exercise contain mistakes and won't match as expected. Can you fix them? If you get stuck, try printing the tokens in the `doc` to see how the text will be split and adjust the pattern so that each dictionary represents one token.
- Edit `pattern1` so that it correctly matches all case-insensitive mentions of `"Amazon"` plus a title-cased proper noun.
- Edit `pattern2` so that it correctly matches all case-insensitive mentions of `"ad-free"`, plus the following noun.
- Try processing the strings that should be matched with the `nlp` object – for example `[token.text for token in nlp("ad-free viewing")]`.
- Inspect the tokens and make sure each dictionary in the pattern correctly describes one token (see the sketch below).
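If you're still stuck after inspecting the tokens, here's one possible shape for the fixes, assuming spaCy splits "ad-free" into the tokens "ad", "-" and "free":

```python
# Match a case-insensitive "amazon" plus a title-cased proper noun
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]

# Match a case-insensitive "ad-free" plus the following noun
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]
```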
Sometimes it's more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let's use this as the basis of our information extraction script. A list of string names is available as the variable `COUNTRIES`.
- Import the `PhraseMatcher` and initialize it with the shared `vocab` as the variable `matcher`.
- Add the phrase patterns and call the matcher on the `doc`.
The shared `vocab` is available as `nlp.vocab`.
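A minimal sketch of the setup, with a hypothetical two-country stand-in for the full `COUNTRIES` list and spaCy v3's `matcher.add` signature:

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()

# Hypothetical stand-in for the exercise's COUNTRIES variable
COUNTRIES = ["Czech Republic", "Slovakia"]

# Initialize the PhraseMatcher with the shared vocab
matcher = PhraseMatcher(nlp.vocab)

# Create Doc patterns for each country name and add them to the matcher
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

doc = nlp("Czechoslovakia dissolved into the Czech Republic and Slovakia")
print([doc[start:end].text for match_id, start, end in matcher(doc)])
```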
In the previous exercise, you wrote a script using spaCy's `PhraseMatcher` to find country names in text. Let's use that country matcher on a longer text, analyze the syntax and update the document's entities with the matched countries.
- Iterate over the matches and create a `Span` with the label `"GPE"` (geopolitical entity).
- Overwrite the entities in `doc.ents` and add the matched span.
- Get the matched span's root head token.
- Print the text of the head token and the span.
- Remember that the text is available as the variable `text`.
- The span's root token is available as `span.root`. A token's head is available via the `token.head` attribute (see the sketch below).
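Putting it together, the loop might look like this sketch. The country list and text here are illustrative stand-ins for the exercise's variables, and the pretrained pipeline provides the parse that `span.root.head` relies on:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# Hypothetical stand-ins for the exercise's COUNTRIES and text variables
COUNTRIES = ["Namibia", "South Africa"]
text = "Namibia gained independence from South Africa in 1990."

matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))

doc = nlp(text)
doc.ents = []  # reset existing entities so the new spans can't overlap them

for match_id, start, end in matcher(doc):
    # Create a Span with the label "GPE" (geopolitical entity)
    span = Span(doc, start, end, label="GPE")
    # Overwrite doc.ents, adding the matched span
    doc.ents = list(doc.ents) + [span]
    # Print the text of the span root's head token and the span
    print(span.root.head.text, "-->", span.text)
```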