In our previous article we described how to detect intruders (textual anomalies) in topics produced by LDA, where each topic is represented as a multinomial distribution over terms. Once the redundant (intruder) term is found, we create a simple filter that collects all user comments containing this word. We then concatenate these comments into a single string and apply TextRank to extract keywords or a brief summary of the text. TextRank is a graph-based ranking model for graphs built from natural language texts [1]. One of its advantages is that it is fully unsupervised.
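The filtering and concatenation step can be sketched in a few lines of Python. This is only an illustration: the function name `collect_comments`, the sample comments and the whole-word, case-insensitive matching rule are assumptions made for the example, not part of the original framework.

```python
import re

def collect_comments(comments, intruder):
    """Join all comments that mention the intruder term into one string.

    Assumption: a comment "contains" the term if it appears as a whole
    word, ignoring case. A real pipeline might match on lemmas instead.
    """
    pattern = re.compile(r"\b" + re.escape(intruder) + r"\b", re.IGNORECASE)
    matching = [c for c in comments if pattern.search(c)]
    return " ".join(matching)

# Hypothetical user comments, for illustration only.
comments = [
    "Still waiting for the key delivery.",
    "Great apartment, very clean.",
    "The keys arrived two hours late.",
]
text = collect_comments(comments, "key")
```

Note that whole-word matching keeps the first comment but skips "keys" in the third one, so matching on lemmatized text may be a better choice in practice.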
TextRank proceeds in several steps. First, the input text is tokenized and annotated with a part-of-speech (POS) tagger. Next, a POS filter is applied to keep only the units of interest (for example, noun expressions or verb+adverb combinations). Then the graph is constructed: if two units (terms, in our case) co-occur within a manually defined window of N words, we add an edge between them. The resulting graph is undirected and unweighted. The score of each vertex is initialized to 1. Finally, a ranking algorithm is run on the graph until it converges, which usually takes 20-30 iterations at a threshold of 0.0001. The ranking algorithm applied in TextRank uses the following formula to calculate the score of a vertex:
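$$WS(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where WS(Vi) is the weighted score of vertex Vi, In(Vi) is the set of vertices that point to it (predecessors), Out(Vj) is the set of vertices that vertex Vj points to (successors), w_ji is the weight of the edge between Vj and Vi (equal to 1 for every edge of an unweighted graph), and d is a damping factor between 0 and 1, usually set to 0.85, that accounts for the probability of jumping from a given vertex to another random vertex [1]. In an undirected graph, In and Out both coincide with the set of neighbours of a vertex. The units are then sorted by score, and the top T are selected for post-processing. Sequences of adjacent keywords (as marked in the initial text) are collapsed into multi-word keywords. As a result, we have both single terms and multi-word key expressions that define the main idea of the text.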
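The steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the reference implementation: POS tagging is skipped and the set `candidates` stands in for the output of the POS filter; the function names, the sample tokens, the default `window=2` and the `max_iter=100` safety cap are all chosen for the example. Because the graph is undirected and unweighted, the term w_ji / Σ w_jk reduces to 1 / degree(Vj).

```python
def build_graph(tokens, candidates, window=2):
    """Undirected, unweighted co-occurrence graph over candidate terms."""
    graph = {}
    for i, term in enumerate(tokens):
        if term not in candidates:
            continue
        # Link the term to every candidate that follows it within the window.
        for other in tokens[i + 1 : i + window]:
            if other in candidates and other != term:
                graph.setdefault(term, set()).add(other)
                graph.setdefault(other, set()).add(term)
    return graph

def rank(graph, d=0.85, threshold=1e-4, max_iter=100):
    """Iterate the TextRank score update until the largest change is small."""
    scores = {v: 1.0 for v in graph}  # every vertex starts with score 1
    for _ in range(max_iter):
        new_scores = {
            # Neighbours play the role of In(v); len(graph[u]) is |Out(u)|.
            v: (1 - d) + d * sum(scores[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
        change = max((abs(new_scores[v] - scores[v]) for v in graph), default=0.0)
        scores = new_scores
        if change < threshold:
            break
    return scores

def extract_keywords(tokens, candidates, top_t=3, window=2):
    """Rank candidate terms, keep the top T, and merge adjacent ones."""
    scores = rank(build_graph(tokens, candidates, window))
    top = set(sorted(scores, key=scores.get, reverse=True)[:top_t])
    phrases, current = [], []
    for tok in tokens + [None]:  # the sentinel flushes the last run
        if tok in top:
            current.append(tok)
        elif current:
            phrase = " ".join(current)
            if phrase not in phrases:
                phrases.append(phrase)
            current = []
    return phrases

# Hypothetical, already tokenized text; candidates mimic a POS filter.
tokens = "waiting for key delivery took too much time".split()
candidates = {"waiting", "key", "delivery", "time"}
print(extract_keywords(tokens, candidates))  # ['key delivery']
```

With a window of 2 only directly adjacent candidates are linked, so "key" and "delivery" form the only edge and are collapsed into a single multi-word keyword. Terms that never co-occur with another candidate do not enter the graph at all.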
This makes it possible to find the reason behind a textual anomaly in the topic distribution. For example, for the anomaly mentioned above, a possible explanation is a problem with key delivery that causes a significant delay. Thus, the suggestion generated by the framework would be "waiting for key delivery" or "too much time waiting for keys".
Why do this? The logic is quite simple: LDA topics contain rich statistical information, and we assume that whenever there is an intruder, we can use it as a source of a suggestion. Generating human-like summaries is still an open issue, but one workaround is to combine this approach with an external information source, such as a knowledge graph. This may be useful for analyzing the coherence of top terms and for improving graph construction.
[1] R. Mihalcea and P. Tarau. TextRank: Bringing Order into Texts. In Proceedings of EMNLP-04 and the 2004 Conference on Empirical Methods in Natural Language Processing, July 2004.