CS6200-Final_Project

Search Engine Implementation

Members -
Nanditha Sundararajan
Poorva Sonparote
Shruti Parpattedar

Languages used - Python and Java
Python version - 3.7.2

Setup

The following software must be installed for this code to run:

Python 3.7
Download and install from "https://www.python.org/downloads/"

Lucene 4.7.2
Download and install Lucene from
https://lucene.apache.org/
https://archive.apache.org/dist/lucene/java/4.7.2/

BeautifulSoup package
Can be downloaded from "https://www.crummy.com/software/BeautifulSoup/"
Can be installed using pip by entering the following command in Terminal or Command Line:

	 pip install beautifulsoup4

Compile and Run

Unzip the given solution folder into a local directory. All files required to run this project will be extracted.

Phase 1 -

Task 1 - Four baseline runs
Implementation of TFIDF, Query Likelihood Model (JM smoothed) and BM25 using Python. The program internally calls Indexer.py and Parser.py to parse and index the corpus.
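
As an illustration of the JM smoothed Query Likelihood Model, the following is a minimal sketch and not the project's actual code. The function name, the argument layout and the smoothing weight `lam=0.35` are assumptions made for this example. Each query term's probability is a mix of its document probability and its collection probability, and the document score is the sum of the log probabilities.

```python
import math
from collections import Counter


def jm_query_likelihood(query_terms, doc_terms, collection_counts,
                        collection_len, lam=0.35):
    """Log query likelihood of one document with Jelinek-Mercer smoothing.

    lam is the weight given to the collection model (assumed value).
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Maximum likelihood estimate from the document itself.
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        # Background estimate from the whole collection.
        p_coll = collection_counts.get(term, 0) / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        # Terms that never occur in the collection are skipped here
        # (an assumption; they would otherwise give log(0)).
        if p > 0:
            score += math.log(p)
    return score
```

Because of the collection term, a document that does not contain a query term still gets a finite score, only a lower one than a document that does.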

Implementation of Lucene's default retrieval model using Java. The helper program Query_cleaning.py cleans the queries so that they can be used by Lucene.

Run the following commands -
Task1-First3Runs.py
Query_cleaning.py
Lucene-proj/src/LuceneRun.java
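
BM25 is one of the baseline models listed above. The sketch below shows the common form of the BM25 score when no relevance information is available. It is an illustration under stated assumptions, not the code in Task1-First3Runs.py: the function name, the argument layout and the parameter values `k1=1.2`, `b=0.75` and `k2=100` are all assumed for this example.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75, k2=100.0):
    """BM25 score of one document for one query (no relevance information).

    doc_freq maps a term to the number of documents that contain it.
    """
    term_counts = Counter(doc_terms)
    query_counts = Counter(query_terms)
    # Document length normalisation factor.
    K = k1 * ((1 - b) + b * len(doc_terms) / avg_doc_len)
    score = 0.0
    for term, qf in query_counts.items():
        f = term_counts.get(term, 0)   # term frequency in the document
        n = doc_freq.get(term, 0)      # number of documents with the term
        if f == 0 or n == 0:
            continue
        # This IDF form becomes negative for terms in more than half the documents.
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        score += idf * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))
    return score
```

Ranking a query then amounts to computing this score for every document that contains at least one query term and sorting in descending order.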

Task 2 - Query Enhancement
Implementation of two query enhancement techniques - query time stemming and pseudo relevance feedback with the BM25 retrieval model. The program internally calls Indexer.py and Parser.py to parse and index the corpus.

Run the following commands -
Task2-QueryTimeStemming.py
Task2-PseudoRelevance.py
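
Pseudo relevance feedback treats the top ranked documents of a first retrieval pass as relevant and uses them to expand the query. The sketch below is one simple way to do this and is an assumption, not necessarily the method used in Task2-PseudoRelevance.py: it picks the most frequent new terms from the top documents. The function name and the defaults `top_k=10` and `num_new_terms=5` are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_doc_ids, doc_tokens,
                 top_k=10, num_new_terms=5, stopwords=frozenset()):
    """Return the query extended with frequent terms from the top_k documents.

    ranked_doc_ids: document ids from the first retrieval pass, best first.
    doc_tokens: mapping from document id to its list of tokens.
    """
    original = set(query_terms)
    counts = Counter()
    for doc_id in ranked_doc_ids[:top_k]:
        counts.update(
            t for t in doc_tokens[doc_id]
            if t not in stopwords and t not in original
        )
    new_terms = [term for term, _ in counts.most_common(num_new_terms)]
    return list(query_terms) + new_terms
```

The expanded query is then run through the same retrieval model a second time to produce the final ranking.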

Task 3 - Stopping and Stemming Index
Implementation of a stopped corpus with no stemming and a stemmed corpus with stemmed queries, using the BM25 and TFIDF retrieval models. The program internally calls Indexer.py and Parser.py to parse and index the corpus.

Run the following commands -
Task3-StoppedIndex.py
Task3-StemmedIndex.py
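
Stopping removes very common words before indexing and querying. The following is a minimal sketch that assumes a stop list with one word per line and case-insensitive matching. Both function names are hypothetical and the project's own handling may differ.

```python
def load_stopwords(lines):
    """Build a stop list from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t.lower() not in stopwords]
```

For example, `load_stopwords(open(path))` would build the set from a stop list file, where `path` is a placeholder for that file's location. The same filter has to be applied to both the corpus tokens and the query tokens so that they stay consistent.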

Phase 2 -

Implementation of snippet generation and query highlighting. The program internally calls Indexer.py and Parser.py to parse and index the corpus, and snippetGeneration.py for snippet generation.

Run the following commands -
Phase2Run.py
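
Snippet generation and query highlighting can be sketched as follows. This is one simple approach under stated assumptions, not the logic of snippetGeneration.py: it selects the window of consecutive words that contains the most query terms and wraps each matching word in markers. The function names, the window size and the `<b>` markers are all hypothetical.

```python
import re


def _normalize(word):
    # Strip punctuation and lowercase so "Sorting," matches the term "sorting".
    return re.sub(r"\W+", "", word).lower()


def make_snippet(text, query_terms, window=12, left="<b>", right="</b>"):
    """Return the best window of words with query terms highlighted."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    last_start = max(0, len(words) - window)
    for start in range(last_start + 1):
        hits = sum(1 for w in words[start:start + window]
                   if _normalize(w) in terms)
        # Strict comparison keeps the earliest window on ties.
        if hits > best_hits:
            best_start, best_hits = start, hits
    chosen = words[best_start:best_start + window]
    return " ".join(
        left + w + right if _normalize(w) in terms else w for w in chosen
    )
```

A sentence-based selection would be an equally reasonable design. The fixed window is used here only because it keeps the example short.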

Phase 3 -

Ninth run - Query Expansion using Pseudo Relevance Feedback with Stopping
Implementation of query enhancement using pseudo relevance feedback and stopping. The program internally calls Indexer.py and Parser.py to parse and index the corpus.

Run the following command -
Phase3Run.py

Evaluation
Evaluates the various runs based on MAP, MRR, Precision, Recall, Precision @5 and @20, and Recall @5 and @20.
The list of runs to be evaluated is read from a file named Output_files_list.txt.

Run the following command -
Evaluation.py
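
The per-query measures behind these metrics can be sketched as below. MAP is the mean of average precision over all queries and MRR is the mean of the reciprocal rank. The function names and the data layout (a ranked list of document ids and a set of relevant ids) are assumptions for this example, not the interface of Evaluation.py.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    Divides by the total number of relevant documents, which is one common
    convention; some implementations divide by the number retrieved instead.
    """
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Queries without any relevance judgments are usually left out of the averages, since no measure is defined for them.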

Extra Credit -

Implementation of a search engine based on the Relevance Model using pseudo-relevance feedback and KL-Divergence for scoring. The program internally calls Indexer.py and Parser.py to parse and index the corpus.

Run the following command -
KL_Divergence.py
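
The idea behind the Relevance Model with KL-Divergence scoring can be sketched in two steps. First, a relevance model P(w|R) is estimated from the top documents of an initial run, weighting each document's language model by its query likelihood. Second, documents are ranked by the sum over words of P(w|R) * log P(w|D), which orders documents the same way as negative KL divergence. The code below is an illustration under stated assumptions, not the code in KL_Divergence.py. The function names, the input layout and the `epsilon` floor for unseen words are hypothetical.

```python
import math


def relevance_model(feedback_docs):
    """Estimate P(w|R) from feedback documents.

    feedback_docs: list of (query_likelihood, {word: P(w|D)}) pairs.
    """
    weights = {}
    for query_likelihood, doc_model in feedback_docs:
        for word, p in doc_model.items():
            weights[word] = weights.get(word, 0.0) + p * query_likelihood
    total = sum(weights.values())
    # Normalise so the estimates sum to 1.
    return {w: v / total for w, v in weights.items()} if total else {}


def kl_score(rel_model, doc_model, epsilon=1e-10):
    """Rank-equivalent negative KL divergence between P(w|R) and P(w|D).

    epsilon stands in for a smoothed probability of unseen words (assumption).
    """
    return sum(
        p_r * math.log(doc_model.get(word, epsilon))
        for word, p_r in rel_model.items()
        if p_r > 0
    )
```

In practice the document models would be smoothed, for example against the collection model, so that the `epsilon` floor is rarely needed.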

All the outputs are stored in the Outputs folder, and all the evaluation results, along with the compiled evaluations and the MAP-MRR summary, are stored in the Evaluation folder.
