The dataset used to train the system is r/Fakeddit, which contains both image and text data.
ATTENTION: In order to load the original tsv files as dataframes, use `pd.read_csv('filepath', sep='\t')`.
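For example (the file names below are placeholders for whichever Fakeddit tsv files you downloaded):

```python
import pandas as pd

# Placeholder file names; substitute the actual Fakeddit tsv files.
train_df = pd.read_csv('train.tsv', sep='\t')
test_df = pd.read_csv('test.tsv', sep='\t')
print(train_df.columns)
```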
- `preprocessing.py`:
  - `findCorruptImages()`: Because the images were downloaded over multiple sessions, whenever a session ended abruptly the image being downloaded at that moment was left corrupt. Hence the need to find and delete those images.
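    A minimal sketch of the idea, assuming Pillow is used to verify each file (the actual function in `preprocessing.py` may differ):

    ```python
    import os
    from PIL import Image

    def find_corrupt_images(image_dir):
        """Delete image files that cannot be opened/verified (e.g. from interrupted downloads)."""
        for name in os.listdir(image_dir):
            path = os.path.join(image_dir, name)
            try:
                with Image.open(path) as img:
                    img.verify()  # raises an exception if the file is truncated/corrupt
            except Exception:
                os.remove(path)
                print(f'Removed corrupt image: {path}')
    ```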
  - `dropUnusedRows()`: The dataset is huge (roughly 1 million rows, which means roughly 1 million images) and not all of it was used, so this function checks the directory of the images (train, test, etc.) and keeps only the rows of the csv files whose image ids are found in that directory.
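    A rough sketch of how this can be done, assuming the id column is named `id` and the images are stored as `<id>.jpg` (both assumptions):

    ```python
    import os
    import pandas as pd

    def drop_unused_rows(tsv_path, image_dir, out_path):
        """Keep only the rows whose image id has a matching file in image_dir."""
        df = pd.read_csv(tsv_path, sep='\t')
        downloaded_ids = {os.path.splitext(name)[0] for name in os.listdir(image_dir)}
        df = df[df['id'].isin(downloaded_ids)]
        df.to_csv(out_path, index=False)
    ```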
  - `removeDatasetBias()`: The part of the dataset that was initially downloaded had a larger number of `False` (not fake) than `True` (fake) images, so this function removes the bias by making the number of `0`s equal to the number of `1`s in the `2_way_label` column of the csv.
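    Conceptually, the balancing amounts to downsampling the majority class, e.g. (a sketch, not the repo's exact code):

    ```python
    import pandas as pd

    def remove_dataset_bias(df, label_col='2_way_label'):
        """Downsample the majority class so both values of label_col occur equally often."""
        n = df[label_col].value_counts().min()
        balanced = (df.groupby(label_col, group_keys=False)
                      .apply(lambda g: g.sample(n, random_state=42)))
        return balanced.sample(frac=1, random_state=42)  # shuffle the rows
    ```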
- `image_downloader.py`: This script downloads the images of the dataset; it was taken from the GitHub repo of the paper's authors and modified to skip images that already exist on disk, as well as to skip images when the server is not responding.
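  The skip logic described above corresponds roughly to the sketch below (hypothetical; the authors' original script is more involved):

  ```python
  import os
  import requests

  def download_image(url, image_id, out_dir, timeout=5):
      """Download one image, skipping existing files and unresponsive servers."""
      path = os.path.join(out_dir, f'{image_id}.jpg')
      if os.path.exists(path):           # already downloaded in a previous session
          return
      try:
          resp = requests.get(url, timeout=timeout)
          resp.raise_for_status()
      except requests.RequestException:  # timeout, connection error, HTTP error, ...
          return                         # skip and move on to the next image
      with open(path, 'wb') as f:
          f.write(resp.content)
  ```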
- `resnet.py`: The implementation of the ResNet network, taken from this GitHub repo, with the 18-layer version added.
- `dataset.py`: Custom class that loads the images and labels into tensors in order to train the model, based on the PyTorch documentation.
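  A minimal sketch of such a class, following the pattern from the PyTorch custom-dataset tutorial (the class, column, and file names are assumptions):

  ```python
  import os
  import pandas as pd
  from PIL import Image
  from torch.utils.data import Dataset

  class FakedditImageDataset(Dataset):
      """Returns (image, 2_way_label) pairs for each row of a preprocessed csv."""
      def __init__(self, csv_path, image_dir, transform=None):
          self.df = pd.read_csv(csv_path)
          self.image_dir = image_dir
          self.transform = transform

      def __len__(self):
          return len(self.df)

      def __getitem__(self, idx):
          row = self.df.iloc[idx]
          image = Image.open(os.path.join(self.image_dir, f"{row['id']}.jpg"))
          label = int(row['2_way_label'])
          if self.transform:
              image = self.transform(image)
          return image, label
  ```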
- `get_random_subset_of_dataset.py`: The downloaded images were still too many and training took about 30 minutes per epoch (on an NVIDIA GTX 1650 graphics card), so I had to reduce the number of images even further; this is where this script comes in.
  - After running this, `preprocessing.py` needs to be run again in order to remove the dataset bias and produce new csv files with only the necessary number of rows.
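  The subsampling itself boils down to something like the following sketch (the script may select files differently):

  ```python
  import os
  import random
  import shutil

  def copy_random_subset(src_dir, dst_dir, n_images, seed=42):
      """Copy a random sample of n_images files from src_dir into dst_dir."""
      os.makedirs(dst_dir, exist_ok=True)
      random.seed(seed)
      for name in random.sample(os.listdir(src_dir), n_images):
          shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))
  ```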
- `image_classification.py`: The training of the ResNet model for image classification happens here.
  - In the `transforms`, `Lambda` transforms are used because some images end up with fewer or more than 3 channels after being converted to tensors, and our ResNet model takes 3-channel inputs.
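    A sketch of such a pipeline, assuming torchvision transforms (the resize value is a placeholder):

    ```python
    from torchvision import transforms

    def to_three_channels(x):
        # Grayscale (1 channel) -> repeat; 2-channel (e.g. LA) -> drop alpha, then repeat; >3 -> keep first 3.
        if x.shape[0] == 1:
            return x.repeat(3, 1, 1)
        if x.shape[0] == 2:
            return x[:1].repeat(3, 1, 1)
        return x[:3]

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Lambda(to_three_channels),
    ])
    ```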
  - `CrossEntropyLoss()` is used, a common choice for classification problems, together with `SGD()` optimization and `ReduceLROnPlateau()` with `patience = 1` for the learning rate. The latter means that if the validation loss has not decreased for two consecutive epochs, the learning rate is multiplied by $10^{-1}$.
  - tqdm is used to show a progress bar while training the network.
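  Put together, the training setup roughly looks like the sketch below (`model`, `train_loader`, `val_loader`, and `num_epochs` are assumed to be defined elsewhere; hyperparameter values are illustrative):

  ```python
  import torch
  from torch import nn, optim
  from torch.optim.lr_scheduler import ReduceLROnPlateau
  from tqdm import tqdm

  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model = model.to(device)

  criterion = nn.CrossEntropyLoss()
  optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
  scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=1)

  for epoch in range(num_epochs):
      model.train()
      for images, labels in tqdm(train_loader, desc=f'Epoch {epoch}'):
          images, labels = images.to(device), labels.to(device)
          optimizer.zero_grad()
          loss = criterion(model(images), labels)
          loss.backward()
          optimizer.step()

      # The validation loss drives the learning-rate schedule.
      model.eval()
      val_loss = 0.0
      with torch.no_grad():
          for images, labels in val_loader:
              images, labels = images.to(device), labels.to(device)
              val_loss += criterion(model(images), labels).item()
      scheduler.step(val_loss / len(val_loader))
  ```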