This post is inspired by Burton DeWilde's article Intro to Automatic Keyphrase Extraction. In his post he gives an excellent explanation of how to implement TextRank from scratch based on word co-occurrence.
However, if we take a look at the TextRank paper:
Any relation that can be defined between two lexical units is a potentially useful connection (edge) that can be added between two such vertices. We are using a co-occurrence relation, controlled by the distance between word occurrences: two vertices are connected if their corresponding lexical units co-occur within a window of maximum N words, where N can be set anywhere from 2 to 10 words. Co-occurrence links express relations between syntactic elements, and similar to the semantic links found useful for the task of word sense disambiguation (Mihalcea et al., 2004), they represent cohesion indicators for a given text.
Thus, instead of co-occurrence, we can use semantic similarity to construct the edges between words.
So how do we calculate the semantic similarity between words? Fortunately, there is an excellent Python package for this called Sematch.
Sematch is an integrated framework for the development, evaluation, and application of semantic similarity for Knowledge Graphs (KGs). It is easy to use Sematch to compute semantic similarity scores of concepts, words and entities. Sematch focuses on specific knowledge-based semantic similarity metrics that rely on structural knowledge in a taxonomy (e.g. depth, path length, least common subsumer) and statistical information content (corpus-IC and graph-IC). Knowledge-based approaches differ from their corpus-based counterparts, which rely on co-occurrence (e.g. Pointwise Mutual Information) or distributional similarity (Latent Semantic Analysis, Word2Vec, GloVe, etc.). Knowledge-based approaches are usually used for structural KGs, while corpus-based approaches are normally applied to textual corpora.
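For instance, computing word-to-word similarity with Sematch looks roughly like this (a minimal sketch following the examples in the Sematch README; it assumes Sematch and NLTK's WordNet data are installed, and 'wup' here refers to the Wu-Palmer measure):

from sematch.semantic.similarity import WordNetSimilarity

wns = WordNetSimilarity()

# knowledge-based similarity between two English words (Wu-Palmer measure)
print(wns.word_similarity('dog', 'cat', 'wup'))

# Sematch is multilingual: monolingual similarity between French words
# ('fra' is the ISO 639-3 code for French in the Open Multilingual WordNet)
print(wns.monol_word_similarity('chien', 'chat', 'fra', 'wup'))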
Here are the steps I followed to implement a semantically-based version of the TextRank algorithm.
1. Extract Candidates:
import string
from itertools import takewhile

import networkx
import nltk
from polyglot.text import Text


def extract_candidate_words(document_body):
    """
    Extract candidate keywords from the analysed text.
    :param document_body: parsed document body
    :return: list of lower-cased candidate words that pass the POS filter
    """
    # keep only unigram nouns and adjectives, as recommended in the TextRank paper
    good_tags = set(['NOUN', 'ADJ'])
    # punctuation: {'\\', '%', ...}
    punct = set(string.punctuation)
    # ['au', 'aux', 'avec', ...]
    stop_words = nltk.corpus.stopwords.words('french')
    # filter out digits
    document_body = ''.join(
        letter for letter in document_body if not letter.isdigit())
    # tag each token with its part of speech
    tagged_words = Text(document_body, hint_language_code='fr').pos_tags
    # candidates satisfying the POS filter
    candidates = []
    for word, tag in tagged_words:
        # convert to lower case
        word = word.lower()
        if tag in good_tags and word not in stop_words and not all(char in punct for char in word):
            candidates.append(word)
    # words that satisfy the given filters
    return candidates
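As a quick sanity check (the sentence and expected output are illustrative; it assumes polyglot's French POS model and NLTK's stopwords corpus have been downloaded):

candidates = extract_candidate_words(
    "Le célèbre peintre expose ses nouvelles toiles dans la galerie.")
# expect lower-cased nouns and adjectives, e.g. 'célèbre', 'peintre', 'toiles'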
2. Construct edges between semantically related words (here we use French, but since Sematch is multilingual you can use your own language). The pairwise function below relies on a helper, find_similar_fr, sketched just after.
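The listing for find_similar_fr is not included in this post, so here is a minimal sketch of what such a helper could look like, assuming Sematch's WordNetSimilarity with the Wu-Palmer measure and an illustrative similarity threshold of 0.8:

from sematch.semantic.similarity import WordNetSimilarity

wns = WordNetSimilarity()


def find_similar_fr(word, vocabulary, threshold=0.8):
    """Return the first term in vocabulary semantically close to word, or ''.

    Note: the 0.8 threshold is an illustrative choice, not a recommendation;
    tune it for your corpus.
    """
    for other in vocabulary:
        if other == word:
            continue
        # monolingual French similarity ('fra'), Wu-Palmer measure
        if wns.monol_word_similarity(word, other, 'fra', 'wup') > threshold:
            return other
    return ''

Returning an empty string when no relative is found matches the relative != '' check in pairwise.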
def pairwise(iterable):
    """
    Pair each candidate word with a semantically related term;
    the resulting tuples become the edges of the graph.
    :param iterable: list of candidate words
    :return: zip of (word, related word) tuples
    """
    # on the left of the zip: the terms that have a relative
    left = []
    # on the right: the semantically related terms
    right = []
    for item in iterable:
        relative = find_similar_fr(item, iterable)
        # only append non-empty relatives
        if relative != '':
            left.append(item)
            right.append(relative)
    return zip(left, right)
3. Now construct the graph and implement TextRank:
def score_keyphrase_textrank(text, n_keywords=0.5):
    """TextRank implementation based on semantic similarity instead of co-occurrence:
    if two words are semantically similar, an edge is added between them.

    Arguments:
        text {str} -- initial text

    Keyword Arguments:
        n_keywords {float} -- number of keywords, may be an absolute number or a ratio (default: {0.5})

    Returns:
        list -- (keyphrase, score) tuples sorted by importance weight
    """
    # words array for later merging of keywords into keyphrases
    words = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            words.append(word.lower())
    # get candidate words
    candidates = extract_candidate_words(text)
    # construct the graph
    graph = networkx.Graph()
    graph.add_nodes_from(set(candidates))
    for w1, w2 in pairwise(candidates):
        if w2:
            graph.add_edge(*sorted([w1, w2]))
    # score nodes using default PageRank
    ranks = networkx.pagerank(graph)
    # if n_keywords is a ratio, convert it to an absolute count
    if 0 < n_keywords < 1:
        n_keywords = int(round(len(candidates) * n_keywords))
    # terms with their ranks: key = word, value = score
    word_ranks = {}
    # sort by rank and keep the top n_keywords
    for word_rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True)[:n_keywords]:
        word_ranks[word_rank[0]] = word_rank[1]
    keywords = set(word_ranks.keys())
    # merge adjacent keywords into keyphrases
    keyphrases = {}
    j = 0
    for i, word in enumerate(words):
        if i < j:
            continue
        if word in keywords:
            kp_words = list(
                takewhile(lambda x: x in keywords, words[i:i + 10]))
            avg_pagerank = sum(word_ranks[w]
                               for w in kp_words) / float(len(kp_words))
            keyphrases[' '.join(kp_words)] = avg_pagerank
            # ensure merged keyphrases do not overlap
            j = i + len(kp_words)
    # sort the output by score
    return sorted(keyphrases.items(), key=lambda x: x[1], reverse=True)
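For example (an illustrative sentence; actual scores will vary with the installed models and with the similarity threshold used in find_similar_fr):

text = ("Le célèbre peintre expose ses nouvelles toiles dans la galerie. "
        "Les visiteurs admirent les toiles du peintre.")
for keyphrase, score in score_keyphrase_textrank(text, n_keywords=0.5):
    print(keyphrase, round(score, 4))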
4. Now you can use this TextRank algorithm implementation to extract keyphrases from your text:
def get_phrases_list(text, n_keywords=0.5):
    """After running TextRank, simply prepare the output in a convenient manner.

    Arguments:
        text {str} -- initial text to be analysed

    Keyword Arguments:
        n_keywords {float} -- number of keywords, may be an absolute number or a ratio (default: {0.5})

    Returns:
        list -- top n keyphrases, without scores
    """
    # the list of keyphrases without scores
    output = []
    # get the scored keyphrases
    keyphrases = score_keyphrase_textrank(text=text, n_keywords=n_keywords)
    # fill the output in
    for phrase in keyphrases:
        output.append(phrase[0])
    return output
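Continuing the illustrative example above, the convenience wrapper returns just the phrases:

print(get_phrases_list(text, n_keywords=5))
# e.g. ['célèbre peintre', 'toiles', ...] -- illustrative output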