Query Pattern Recognition

dbraga edited this page Dec 16, 2014 · 6 revisions

This module identifies important, frequently appearing query patterns using probabilistic topic modeling. Pattern recognition is usually performed on a particular dataset of interest, for example query exceptions. The patterns recognized in such a dataset can help diagnose the common causes of those exceptions.

We use the Mallet open source tool for this purpose.

Quick Start Guide:

  • Edit the application.default.properties file to specify the path of the exceptions file, the destination file for the Mallet-usable dataset, the location of the stopwords file, and the output visualization directory:
  file.exceptions = /tmp/exceptions
  file.exceptions.mallet-format = /tmp/exceptions-only
  file.stopWords = thoth-topic-modeling/src/main/resources/stopwords.txt
  directory.topicModeling.visualization = thoth-topic-modeling/viz/
  • Run the ThothExceptionsToMallet.java class to create a Mallet-usable dataset of query exceptions.

  • Make sure that the Topic Modeling parameters are appropriately set inside the application.default.properties file.

directory.topicModeling.numTopics = 10
directory.topicModeling.numIterations = 1000
directory.topicModeling.numKeywordsToOutput = 50
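
The settings listed in this guide can be sanity-checked before running the Java classes. The following is a minimal sketch, assuming the properties file uses plain `key = value` lines; it does not implement the full Java properties format (no line continuations or escapes). The helper names `parse_properties`, `find_problems`, and `check_file` are hypothetical and not part of Thoth; only the property keys come from this page.

```python
# Keys taken from this guide; everything else here is illustrative.
REQUIRED_KEYS = [
    "file.exceptions",
    "file.exceptions.mallet-format",
    "file.stopWords",
    "directory.topicModeling.visualization",
]
INTEGER_KEYS = [
    "directory.topicModeling.numTopics",
    "directory.topicModeling.numIterations",
    "directory.topicModeling.numKeywordsToOutput",
]


def parse_properties(text):
    """Parse simple `key = value` lines, skipping blanks and #/! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def find_problems(props):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for key in REQUIRED_KEYS + INTEGER_KEYS:
        if not props.get(key):
            problems.append("missing: " + key)
    for key in INTEGER_KEYS:
        value = props.get(key, "")
        if value and not (value.isdecimal() and int(value) > 0):
            problems.append("not a positive integer: " + key)
    return problems


def check_file(path):
    """Read a properties file from disk and report its problems."""
    with open(path) as handle:
        return find_problems(parse_properties(handle.read()))
```

Calling `check_file` with the path to application.default.properties returns an empty list when every key from this guide is present and the three topic modeling parameters are positive integers.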

  • Run the TopicModel.java class to generate a group of CSV files, one per topic.

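  • Serve the thoth-topic-modeling/viz/ directory over HTTP and open index.html in a browser to view the topics as tag clouds. SimpleHTTPServer serves the current working directory and accepts an optional port, not a path, as its argument, so change into the directory first. The page is then available at http://localhost:8000/ by default.
cd thoth-topic-modeling/viz/ && python -m SimpleHTTPServer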
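
SimpleHTTPServer is a Python 2 module; in Python 3 its functionality lives in http.server. The sketch below shows one way to serve the visualization directory from Python 3.7 or later without changing the working directory. The function name `make_viz_server` and the default port are assumptions for illustration, not part of Thoth.

```python
import functools
import http.server


def make_viz_server(directory, port=8000):
    """Build a local HTTP server that serves files from `directory`."""
    # The `directory` argument of SimpleHTTPRequestHandler needs Python 3.7+.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to localhost only, since this is meant for local viewing.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

To use it, create the server with `make_viz_server("thoth-topic-modeling/viz/")`, call `serve_forever()` on the result, and open http://127.0.0.1:8000/index.html in a browser; stop it with Ctrl+C.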