Members -
Nanditha Sundararajan
Poorva Sonparote
Shruti Parpattedar
Language used - Python and Java
Coded in version - Python 3.7.2
The following software packages must be installed for this code to run successfully:
Python - Download and install from "https://www.python.org/downloads/"
Download and install Lucene from
https://lucene.apache.org/
https://archive.apache.org/dist/lucene/java/4.7.2/
BeautifulSoup - Can be downloaded from "https://www.crummy.com/software/BeautifulSoup/"
It can also be installed using pip, by entering the following command in Terminal or Command Line:
pip install beautifulsoup4
Unzip the given solution folder into a local directory. All necessary files required to run this project will be extracted.
Task 1 - Four baseline runs
Implementation of TFIDF, Query Likelihood Model (JM smoothed) and BM25 using Python. The program internally
calls Indexer.py and Parser.py to parse and index the corpus.
Implementation of Lucene's default retrieval model using Java. The helper program Query_cleaning.py cleans the queries so that they can be used by Lucene.
Run the following commands -
Task1-First3Runs.py
Query_cleaning.py
Lucene-proj/src/LuceneRun.java
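As an illustration of the kind of scoring done in the baseline runs, here is a minimal sketch of BM25. It is not the code in Task1-First3Runs.py. The function name bm25_scores, the parameter values k1=1.2 and b=0.75, the "+1 inside the log" IDF variant, and the omission of the query term frequency component are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (doc_id -> list of tokens) against the query terms.

    Hypothetical helper: assumes no relevance information and ignores
    how often a term repeats in the query.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        # Length normalisation factor K for this document.
        length_norm = k1 * ((1 - b) + b * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Adding 1 inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores


# Toy corpus, invented for this example.
toy_docs = {
    "d1": ["information", "retrieval", "system"],
    "d2": ["database", "system"],
    "d3": ["information", "theory"],
}
toy_scores = bm25_scores(["information", "retrieval"], toy_docs)
ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

Documents that contain none of the query terms receive a score of 0 and end up at the bottom of the ranking.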
Task 2 - Query Enhancement
Implementation of two query enhancement techniques - query time stemming and pseudo relevance feedback with the BM25 retrieval
model. The program internally calls Indexer.py and Parser.py to parse and index the corpus.
Run the following commands -
Task2-QueryTimeStemming.py
Task2-PseudoRelevance.py
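The idea behind pseudo relevance feedback can be sketched as follows: treat the top-ranked documents of an initial run as relevant, pick their most frequent terms, and append them to the query. This is only an illustration and not the code in Task2-PseudoRelevance.py. The function name expand_query and the defaults top_docs=10 and extra_terms=5 are assumptions, and term selection by raw frequency is one of several possible choices.

```python
from collections import Counter


def expand_query(query_terms, ranked_doc_ids, docs,
                 top_docs=10, extra_terms=5, stopwords=frozenset()):
    """Return the query terms plus the most frequent terms of the top documents.

    Hypothetical helper: `docs` maps doc_id -> list of tokens and
    `ranked_doc_ids` is the result list of an initial retrieval run.
    """
    original = set(query_terms)
    counts = Counter()
    for doc_id in ranked_doc_ids[:top_docs]:
        counts.update(
            t for t in docs[doc_id]
            if t not in stopwords and t not in original
        )
    expansion = [term for term, _ in counts.most_common(extra_terms)]
    return list(query_terms) + expansion
```

The expanded query is then submitted to the retrieval model again to produce the final ranking.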
Task 3 - Stopping and Stemming Index
Implementation of a stopped corpus with no stemming and a stemmed corpus with stemmed queries, using the BM25 and TFIDF
retrieval models. The program internally calls Indexer.py and Parser.py to parse and index the corpus.
Run the following commands -
Task3-StoppedIndex.py
Task3-StemmedIndex.py
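A minimal sketch of building a stopped index is shown below. It is an illustration only and not the code in Indexer.py or Task3-StoppedIndex.py. The function names, the index layout (term -> {doc_id: term frequency}) and the assumption that tokens are already lowercased are all made up for this example.

```python
def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop list (tokens assumed lowercased)."""
    return [t for t in tokens if t not in stopwords]


def build_inverted_index(docs, stopwords=frozenset()):
    """Build term -> {doc_id: term frequency}, skipping stopwords.

    Hypothetical helper: `docs` maps doc_id -> list of tokens.
    """
    index = {}
    for doc_id, tokens in docs.items():
        for term in remove_stopwords(tokens, stopwords):
            postings = index.setdefault(term, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return index
```

The same stop list would also have to be applied to the queries, otherwise stopped query terms simply never match anything in the index.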
Implementation of snippet generation and query highlighting. The program internally calls Indexer.py and Parser.py to parse and index the corpus, and snippetGeneration.py for snippet generation.
Run the following commands -
Phase2Run.py
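One simple way to generate a snippet with query highlighting is to slide a fixed-size window over the document, keep the window that contains the most query terms, and wrap those terms in markers. The sketch below shows this approach. It is not the method used in snippetGeneration.py, and the function name make_snippet, the window size and the <b> markers are assumptions.

```python
def make_snippet(tokens, query_terms, window=10, mark=("<b>", "</b>")):
    """Return the window of `tokens` with the most query terms, highlighted.

    Hypothetical helper: ties are resolved in favour of the earliest window.
    """
    query = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    # Try every window position; a short document yields a single window.
    for start in range(max(1, len(tokens) - window + 1)):
        hits = sum(1 for t in tokens[start:start + window] if t.lower() in query)
        if hits > best_hits:
            best_start, best_hits = start, hits
    chosen = tokens[best_start:best_start + window]
    return " ".join(
        mark[0] + t + mark[1] if t.lower() in query else t
        for t in chosen
    )
```

Counting matches per window is a crude significance measure. Other approaches weight rare terms more heavily or prefer windows that start at sentence boundaries.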
Ninth run - Query Expansion using Pseudo Relevance Feedback with Stopping
Implementation of query enhancement using pseudo relevance feedback and stopping. The program internally
calls Indexer.py and Parser.py to parse and index the corpus.
Run the following command -
Phase3Run.py
Evaluation
Evaluating the various runs based on MAP, MRR, Precision, Recall, Precision @5 and @20 and Recall @5 and @20.
Reads the list of runs to be evaluated from a file named Output_files_list.txt.
Run the following command -
Evaluation.py
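The measures listed above can be sketched for a single query as shown below. MAP and MRR are then the means of average precision and reciprocal rank over all queries. This is an illustration and not the code in Evaluation.py. The function names are assumptions, and average precision is divided by the total number of relevant documents, which is one common convention.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Arithmetic mean, used to turn per-query values into MAP and MRR."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

For example, with runs stored as query_id -> ranked list and judgments as query_id -> set of relevant ids, MAP is mean(average_precision(runs[q], judgments[q]) for q in runs).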
Implementation of a search engine based on the Relevance Model using pseudo-relevance feedback and KL-Divergence for scoring. The program internally calls Indexer.py and Parser.py to parse and index the corpus.
Run the following command -
KL_Divergence.py
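The scoring step can be sketched as follows. Ranking by negative KL divergence between a relevance model R and a document model D is rank-equivalent to the sum over terms w of P(w|R) * log P(w|D), because the remaining term does not depend on the document. This is an illustration and not the code in KL_Divergence.py. The function names, the JM smoothing weight lam=0.35 and the simplified relevance model (a plain language model over pooled feedback text) are assumptions. A full relevance model would also weight each feedback document by its query likelihood.

```python
import math
from collections import Counter


def language_model(tokens):
    """Maximum likelihood unigram model: term -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def kl_score(relevance_model, doc_tokens, collection_model, lam=0.35):
    """Rank-equivalent negative KL divergence between R and a smoothed D.

    Hypothetical helper: the document model is JM smoothed with the
    collection model, with `lam` as the weight of the collection model.
    """
    doc_model = language_model(doc_tokens) if doc_tokens else {}
    score = 0.0
    for term, p_rel in relevance_model.items():
        p_doc = ((1 - lam) * doc_model.get(term, 0.0)
                 + lam * collection_model.get(term, 0.0))
        # Terms unseen in the whole collection are skipped; this cannot happen
        # when the relevance model is estimated from documents in the corpus.
        if p_doc > 0:
            score += p_rel * math.log(p_doc)
    return score
```

Scores are negative, and a score closer to 0 means the document model is closer to the relevance model.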
All the outputs are stored in the Outputs folder. All the evaluation results, along with the compiled evaluations and the MAP-MRR summary, are stored in the Evaluation folder.