This post is inspired by Burton DeWilde's article Intro to Automatic Keyphrase Extraction. In his post he gives an excellent explanation of how to implement TextRank from scratch based on word co-occurrence.
However, if we take a look at the TextRank paper:
Any relation that can be defined between two lexical units is a potentially useful connection (edge) that can be added between two such vertices. We are using a co-occurrence relation, controlled by the distance between word occurrences: two vertices are connected if their corresponding lexical units co-occur within a window of maximum N words, where N can be set anywhere from 2 to 10 words. Co-occurrence links express relations between syntactic elements, and similar to the semantic links found useful for the task of word sense disambiguation (Mihalcea et al., 2004), they represent cohesion indicators for a given text.
Thus, instead of co-occurrence, we can use semantic similarity to construct the edges between words.
So how do we calculate the semantic similarity between words? Fortunately, there is an excellent Python package for this called Sematch.
Sematch is an integrated framework for the development, evaluation, and application of semantic similarity for Knowledge Graphs (KGs). It is easy to use Sematch to compute semantic similarity scores of concepts, words and entities. Sematch focuses on specific knowledge-based semantic similarity metrics that rely on structural knowledge in a taxonomy (e.g. depth, path length, least common subsumer) and statistical information content (corpus-IC and graph-IC). Knowledge-based approaches differ from their corpus-based counterparts, which rely on co-occurrence (e.g. Pointwise Mutual Information) or distributional similarity (Latent Semantic Analysis, Word2Vec, GloVe, etc.). Knowledge-based approaches are usually used for structural KGs, while corpus-based approaches are normally applied to textual corpora.
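For instance, computing word-to-word similarity with Sematch looks roughly like this (a minimal sketch following the examples in the Sematch README; it assumes Sematch and NLTK's WordNet data are installed, and 'wup' here refers to the Wu-Palmer measure):

from sematch.semantic.similarity import WordNetSimilarity

wns = WordNetSimilarity()

# knowledge-based similarity between two English words (Wu-Palmer measure)
print(wns.word_similarity('dog', 'cat', 'wup'))

# Sematch is multilingual: monolingual similarity between French words
# ('fra' is the ISO 639-3 code for French in the Open Multilingual WordNet)
print(wns.monol_word_similarity('chien', 'chat', 'fra', 'wup'))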
Here are the steps I followed to implement a semantically-based version of the TextRank algorithm.
1. Extract Candidates:
import string
from itertools import takewhile

import networkx
import nltk
from polyglot.text import Text


def extract_candidate_words(document_body):
    """
    Extract candidate keywords from the analysed text.
    :param document_body: parsed document body
    :return: list of lower-cased candidate words that pass the POS filter
    """
    # keep only unigram nouns and adjectives, as recommended in the TextRank paper
    good_tags = set(['NOUN', 'ADJ'])
    # punctuation: {'\\', '%', ...}
    punct = set(string.punctuation)
    # ['au', 'aux', 'avec', ...]
    stop_words = nltk.corpus.stopwords.words('french')
    # filter out digits
    document_body = ''.join(
        letter for letter in document_body if not letter.isdigit())
    # tag each token with its part of speech
    tagged_words = Text(document_body, hint_language_code='fr').pos_tags
    # candidates satisfying the POS filter
    candidates = []
    for word, tag in tagged_words:
        # convert to lower case
        word = word.lower()
        if tag in good_tags and word not in stop_words and not all(char in punct for char in word):
            candidates.append(word)
    # words that satisfy the given filters
    return candidates
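As a quick sanity check (the sentence and expected output are illustrative; it assumes polyglot's French POS model and NLTK's stopwords corpus have been downloaded):

candidates = extract_candidate_words(
    "Le célèbre peintre expose ses nouvelles toiles dans la galerie.")
# expect lower-cased nouns and adjectives, e.g. 'célèbre', 'peintre', 'toiles'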
2. Construct edges between semantically related words (here we use French, but since Sematch is multilingual you can use your own language). The pairwise function below relies on a helper, find_similar_fr, sketched just after.
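The listing for find_similar_fr is not included in this post, so here is a minimal sketch of what such a helper could look like, assuming Sematch's WordNetSimilarity with the Wu-Palmer measure and an illustrative similarity threshold of 0.8:

from sematch.semantic.similarity import WordNetSimilarity

wns = WordNetSimilarity()


def find_similar_fr(word, vocabulary, threshold=0.8):
    """Return the first term in vocabulary semantically close to word, or ''.

    Note: the 0.8 threshold is an illustrative choice, not a recommendation;
    tune it for your corpus.
    """
    for other in vocabulary:
        if other == word:
            continue
        # monolingual French similarity ('fra'), Wu-Palmer measure
        if wns.monol_word_similarity(word, other, 'fra', 'wup') > threshold:
            return other
    return ''

Returning an empty string when no relative is found matches the relative != '' check in pairwise.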
def pairwise(iterable):
    """
    Pair each candidate word with a semantically related term;
    the resulting tuples become the edges of the graph.
    :param iterable: list of candidate words
    :return: zip of (word, related word) tuples
    """
    # on the left of the zip: the terms that have a relative
    left = []
    # on the right: the semantically related terms
    right = []
    for item in iterable:
        relative = find_similar_fr(item, iterable)
        # only append non-empty relatives
        if relative != '':
            left.append(item)
            right.append(relative)
    return zip(left, right)
3. Now construct the graph and implement TextRank:
def score_keyphrase_textrank(text, n_keywords=0.5):
    """TextRank implementation based on semantic similarity instead of co-occurrence:
    if two words are semantically similar, an edge is added between them.

    Arguments:
        text {str} -- initial text

    Keyword Arguments:
        n_keywords {float} -- number of keywords, may be an absolute number or a ratio (default: {0.5})

    Returns:
        list -- (keyphrase, score) tuples sorted by importance weight
    """
    # words array for later merging of keywords into keyphrases
    words = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            words.append(word.lower())
    # get candidate words
    candidates = extract_candidate_words(text)
    # construct the graph
    graph = networkx.Graph()
    graph.add_nodes_from(set(candidates))
    for w1, w2 in pairwise(candidates):
        if w2:
            graph.add_edge(*sorted([w1, w2]))
    # score nodes using default PageRank
    ranks = networkx.pagerank(graph)
    # if n_keywords is a ratio, convert it to an absolute count
    if 0 < n_keywords < 1:
        n_keywords = int(round(len(candidates) * n_keywords))
    # terms with their ranks: key = word, value = score
    word_ranks = {}
    # sort by rank and keep the top n_keywords
    for word_rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True)[:n_keywords]:
        word_ranks[word_rank[0]] = word_rank[1]
    keywords = set(word_ranks.keys())
    # merge adjacent keywords into keyphrases
    keyphrases = {}
    j = 0
    for i, word in enumerate(words):
        if i < j:
            continue
        if word in keywords:
            kp_words = list(
                takewhile(lambda x: x in keywords, words[i:i + 10]))
            avg_pagerank = sum(word_ranks[w]
                               for w in kp_words) / float(len(kp_words))
            keyphrases[' '.join(kp_words)] = avg_pagerank
            # ensure merged keyphrases do not overlap
            j = i + len(kp_words)
    # sort the output by score
    return sorted(keyphrases.items(), key=lambda x: x[1], reverse=True)
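For example (an illustrative sentence; actual scores will vary with the installed models and with the similarity threshold used in find_similar_fr):

text = ("Le célèbre peintre expose ses nouvelles toiles dans la galerie. "
        "Les visiteurs admirent les toiles du peintre.")
for keyphrase, score in score_keyphrase_textrank(text, n_keywords=0.5):
    print(keyphrase, round(score, 4))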
4. Now you can use this TextRank algorithm implementation to extract keyphrases from your text:
def get_phrases_list(text, n_keywords=0.5):
    """After running TextRank, simply prepare the output in a convenient manner.

    Arguments:
        text {str} -- initial text to be analysed

    Keyword Arguments:
        n_keywords {float} -- number of keywords, may be an absolute number or a ratio (default: {0.5})

    Returns:
        list -- top n keyphrases, without scores
    """
    # the list of keyphrases without scores
    output = []
    # get the scored keyphrases
    keyphrases = score_keyphrase_textrank(text=text, n_keywords=n_keywords)
    # fill the output in
    for phrase in keyphrases:
        output.append(phrase[0])
    return output
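Continuing the illustrative example above, the convenience wrapper returns just the phrases:

print(get_phrases_list(text, n_keywords=5))
# e.g. ['célèbre peintre', 'toiles', ...] -- illustrative output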