Information Retrieval System Evaluation and Text Classification

Dependencies

The assignment was completed using Python 2.7 on Anaconda, since that's the version in the DICE machines.

Libraries and packages used (in alphabetical order):

eval_out/ : Folder in which the IR evaluator's output is stored (*.eval)
format_check_scripts/ : Folder containing given perl scripts to check output format
systems/ : Folder containing the qrels.txt file and *.results files (evaluator's input)
Eval.py : Python module that implements the IR system evaluator module

files/ : Folder containing input files for the module - class IDs and stopword file
Improved_Classifier/ : Folder containing all files and folders used for the improved classifier (see notes!)
svm_linux/ : Folder containing the SVM classifier executable files, the model and the prediction output
tc_out/ : Folder in which the text classification's module output is stored
tweets/ : Folder containing the tweet train and test files
BOW_Extractor.py : Python module that implements the BOW extraction from the tweets train file
Feature_Converter.py : Python module that converts the tweets train and test files to the appropriate format for the classifier
Classifier_Evaluator.py : Python module that evaluates the classifier's perfomance
autorun.sh : Shell script that invokes all necessary modules and executables to complete the task

From a shell in the IR_Evaluation directory run the Eval script ("python .\Eval.py")

From a shell in the Text_Classification directory run the autorun script (".\autorun.sh")

The Improved_Classifier/ directory follows the exact same structure as the Text_Classification/ folder, but all file and folders' names have "_improved" appended to them.
The improved text classification module retrieves all webpage title text from all links within tweets, so running it may take a while depending on system.
The only difference between the baseline and the improved module is the features added to the feats.dic