Skip to content

Python implementation of the Rapid Automatic Keyword Extraction algorithm using spaCy

License

Notifications You must be signed in to change notification settings

andrewrosss/rake-spacy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

58 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rake-spacy

Code Style Tests codecov

Python implementation of the RAKE (short for Rapid Automatic Keyword Extraction) algorithm using spaCy.

Installation

pip install rake-spacy

Since rake-spacy depends on spacy, and to used spacy one has to load a language model, by default, rake-spacy will try to load spacy's en_core_web_sm model, so also grab that language model as well.

python -m spacy download en_core_web_sm

While this is the model used by rake-spacy by default, you can easily provide rake spacy with any language model/pipeline of your chosing. (Just about any nlp object from the spacy docs.)

Getting Started

To quickly extract some ranked keywords

from rake_spacy import Rake

r = Rake()

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."

ranklist = r.apply(text)

print(ranklist)  # [(8.0, minimal generating sets), (8.0, minimal supporting set), (7.0, minimal set), (5.0, considered types), (5.0, system), (5.0, systems), ...]

Differnent Language Models

To specify a language model other than spacy's en_core_web_sm model, you simply instatiate the language model of your choosing, and pass it to Rake in instantiation:

from rake_spacy import Rake
import spacy

nlp = spacy.load('en_core_web_lg')  # assuming this model is installed
r = Rake(nlp=nlp)

text = 'This is a sentence.'

# Rake is now using the provided nlp object to prase the document
ranklist = r.apply(text)

Code Structure and Customization

At it's core, the RAKE algorithm is conceptually simple:

  1. Extract candidate spans (sequences of contiguous words/tokens) from a piece of text.
  2. Create the co-occurance graph based on the spans extracted from the previous step (represented as an adjacency matrix, where two words/tokens are "adjacent", i.e. share an edge in the graph, if they are found in the same span).
  3. Compute a score for each word/token based from the co-occurance graph.
  4. Compute a score for each span by aggregating word scores (determined in the previous step)
  5. Order the spans based on the aggregate span-scores.

There is a fairly close mapping between the highlevel steps outlined above, the package structure of rake-spacy, and the available keyword arguments of the Rake object. To understand the structure of the code we'll start with the first step.

Phrasers

To apply RAKE to a piece of text the first thing we need is a way to extract candidate spans from the text. Throughout the remainder of the documentation these spans will be referred to as "phrases".

By default, rake-spacy extracts "contiguous non-stop-words" as the candidate phrases. What are "contiguous non-stop-words"? These are just chunks of text (words/tokens that are side-by-side) which contain no stop words. (More on how stop words/tokens are determined below.)

This process of extracting candidate phrases for RAKE to consider (i.e. from which to build the co-occurance graph) can be customized by providing a user-specified callable as the phraser_class parameter when instantiating Rake:

def my_phraser_func(doc: spacy.language.Language) -> List[spacy.tokens.Span]:
    ...  # slice and dice the document as desired
    return my_list_of_spans

r = Rake(phraser_class=my_phraser_func)

Essentially, the object passed as the value to the phraser_class argument must be a callable which accepts a spacy.tokens.Doc object and returns a list of spacy.token.Span objects. Those span objects will be the phrases used to construct the co-occurance graph.

Phraser classes included in this package can be found in the rake_spacy.phrasers module.

Aside #1

Why call the paramter phraser_class? This is because the "batteries-included" phraser callables (found in the rake_spacy.phrasers module) inherit from rake_spacy.phrasers.BasePhraser. The choice to use classes is to allow for "parameterizable callables" which still only take one argument when called. Specifically, when Rake calls the phraser_class callable the only parameter that is provided is the spacy doc. By using a class to define this callable additional parameters can be specified in the class's __init__ method, and used in the __call__ method.

Of course, there are other ways to achieve this same "parameterizability", one could define the phraser parameters in the global scope and then reference those variables in the function body.

GLOBAL_NAME = 'World'

def f(greeting):
    print(f'{greeting}, {GLOBAL_NAME}!')

Or if global variables aren't your thing you could use functools.partial

import functools

def full_g(greeting, name):
    print(f'{greeting}, {name}')

g = functools.partial(full_g, name='World')

As with most things in python there's typically more than one way to do something. "Callable classes" seemed like a straight forward choice, and are found throughout the code.

(Word) Scorers

Skipping over the construction of the co-occurance graph for a moment. The next thing to do is to score each of the words/tokens found in the collection of phrases using the information contained in the co-occurance graph. By default, the score rake-spacy attributes to each word/token is simply the number of times it occured in the text (these are the diagonal elements in the co-occurance graph).

To override the default behaviour we simply define a callable that takes the co-occurance graph and a token (vertex in the graph) and returns a numeric score, and then provide that callable as the value to the word_scorer_class argument:

def my_scorer_func(
    cograph: DefaultDict[str, DefaultDict[str, int]],
    token: spacy.tokens.Token
) -> float:
    ...  # compute the score for this token using the co-ocrrance graph
    return token_score

r = Rake(word_scorer_class=my_scorer_func)

Word-scorer classes included in this package can be found in the rake_spacy.scorers module.

Aside #2

It's worth mentioning here that the co-occurance graph is stored as (essentially) a defaultdict within a defaultdict. Specifically, "indexing" once into the co-occurance graph with a token, cograph[token], returns a dict-like object which maps co-occuring words/tokens to co-occurance counts. That is, cograph[token][cotoken] is an integer denoting the number of time token and cotoken appeared in the same phrase.

cograph[token] looks weird, you might be thinking. token is a spacy.tokens.Token object and it's being used as the key to this dict-looking-thing? Of course, spacy.tokens.Token objects have a __hash__ method, and so they can be used as dictionary keys. But that's not (necessarily) what is going on under the hood (although it could be - if you really wanted it to be that way). Using tokens as "keys" is facilitated (and, in fact, is customizable as well) by another set of classes we have yet to talk about, but this small detail is precisely why you can think of the co-occurance graph as being essentially like a defaultdict within a defaultdict. In a nut shell, yes, you should index into the co-occurance graph with spacy.tokens.Token objects, directly. More on this below.

Aggregators

OK, so we've split our text up into phrases, computed the co-occurance graph and scored the words/tokens, now we have to score the phrases themselves.

By default, rake-spacy will just sum up all the token scores in each phrase. But this default behaviour can be changed by providing a callable as the value to the aggregator_class keyword argument.

def my_aggregator_func(word_scores: List[float]) -> float:
    ...  # compute the phrase-score from the word-scores of its constiuent words
    return phrase_score

r = Rake(aggregator_class=my_aggregator_func)

Aggregator classes included in this package can be found in the rake_spacy.aggregators module.

This covers probably the most important parts of how one might want to customize how the RAKE algorithm is administered to a piece of text, but we did skim over a few things. We'll cover those next.

Stop-Words/Tokens

Spacy is a fantastic library and with basically no code at all we can breakdown a piece of text into fine-grained parts seemingly "for free". However, when applying the RAKE algorithm we'd probably like to ignore some of the more common or less-meaningful words/tokens. Stop-words were alluded to above, how can we customize which words/tokens are considered "stop-words" and should effectively be ignored?

By default, rake-spacy assumes that any token for which the following expression evaluates to True is a stop-token:

(token.is_stop or token.is_space or token.is_punct) and not token.like_num

To customize this behaviour we can implement (you guessed it) a callable which accepts a token and returns a boolean. (True if this token should be ignored, False if it should be kept.)

def my_stop_word_detector_func(token: spacy.tokens.Token) -> bool:
    ...  # determine if this toke is a stop-token (True) or not (False)
    return is_stop_token

r = Rake(stop_token_class=my_stop_word_detector_func)

Internally, rake-spacy will use this callable on each token (actually, pair of tokens) before updating the co-occurance graph. If the callable indicates that one of the two tokens is a stop-word/token and should be ignored, that pair is skipped and rake-spacy moves onto the next pair.

One thing to note is that a user-specified stop-word callable will likely not have a role in how phrases are extracted (because "phrasing" is also user-specifiable). As such, if you provide Rake with a stop_token_class override, for consistency, you may want to use that same stop-word callable in the phraser callable. Perhaps something like:

class MyStopWordIndicator(rake_spacy.stop_words.BaseStopTokenIndicator):
    def __call__(self, token):
        ...  # your implementation

class MyPhraser(rake_spacy.phrasers.BasePhraser):
    def __init__(self, stop_word_indicator):
        self.stop_word_indicator = stop_word_indicator

    def __call__(self, doc):
        ...  # your implementation, using self.stop_word_indicator

stop_token_class = MyStopWordIndicator()
phraser_class = MyPhraser(stop_token_class)

r = Rake(stop_token_class=stop_token_class, phraser_class=phraser_class)

This is actually what rake-spacy does in the Rake.__init__() method.

Stop-word/token classes included in this package can be found in the rake_spacy.stop_words module.

The Co-Occurance Graph

So, why is the co-occurance graph essentially like a defaultdict within a defaultdict? Actually, internally the co-occurance graph uses a subclass of collections.defaultdict. This subclass is a thin wrapper around the stdlib defaultdict, but it slides in a (user-specifiable transformation) between the objects that are fed to __getitem__ and __setitem__ and what is given to the underlying defaultdict. Think of it like some light-weight middleware.

You provide a callable which accepts an object and converts that object to a string. How this conversion is done is up to you. The Co-Occurance graph simply applies this callable to whatever is given to get/setitem and feeds the output to the underlying defaultdict.

By default this callable is just the builtin str function. To customize this behaviour, provide a callable as the value to the token_mapper_class argument:

def my_token_mapper_func(token: spacy.tokens.Token) -> bool:
    ...  # Transform the token to a string
    return token_as_string

r = Rake(token_mapper_class=my_token_mapper_func)

Why might you want to do this? Perhaps you want the tokens "Hello" and "hello" to be mapped to the same vertex in the co-occurance graph. This is not the default behaviour, but can be achieved by, say, the function:

def f(token):
    return str(token).lower()

r = Rake(token_mapper_class=f)

Or perhaps you want is and are to be mapped to the same vertex (via the lemma be). This is not the default behaviour, but can be achieved by, say, the function:

def g(token):
    return token if isinstance(str, token) else token.lemma_.lower()

r = Rake(token_mapper_class=g)

Important: The callable passed as token_mapper_class must be idempotent. Specifically, it should satisfy the property:

token_mapper(t) == token_mapper(token_mapper(t))

To circle back, this how tokens (that are used to "index into" the co-occurance graph) and the graph itself play together nicely.

Token mapper classes included in this package can be found in the rake_spacy.mappers module.

About

Python implementation of the Rapid Automatic Keyword Extraction algorithm using spaCy

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages