The purpose of this project is to use reviews of major delivery companies operating in the UK to perform NLP tasks that analyze different aspects of the reviews, such as sentiment, the most common words, probability distributions across word sequences, and more.
In this project we explore the world of logistics companies and the issues they might be facing. Specifically, we focus on analyzing data about a few of the most well-known delivery companies in the UK, namely Deliveroo, UberEats, Just Eat and Stuart. To do that, we use the reviews that can be found on many different online platforms, especially platforms that specialize in collecting customer reviews and opinions for a plethora of companies and services.
The first iteration of this project uses the reviews that can be found on the well-known consumer review website TrustPilot. Even though the website already provides some API functionality, we write our own web-scraping tool to retrieve the data in the format that we want. We attempt to collect as many reviews as possible and then use them to identify interesting findings in the text. For example, we try to identify the sentiment across all reviews for a specific company, the most common words and bigrams (i.e. pairs of words that tend to appear next to each other) in the reviews, and more. Finally, we implement a Latent Dirichlet Allocation (LDA) model to try to identify the topics that these reviews correspond to. Note that the LDA model is implemented twice: once for the negative reviews and once for the positive reviews.
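To make the bigram idea concrete, here is a minimal sketch that counts adjacent word pairs using only the Python standard library. It is not the project's actual implementation: the helper names `tokenize` and `count_bigrams`, the simple regex tokenizer, and the sample reviews are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def count_bigrams(reviews):
    """Count adjacent word pairs, never pairing words across two reviews."""
    counts = Counter()
    for review in reviews:
        tokens = tokenize(review)
        # zip(tokens, tokens[1:]) yields each token with its right-hand neighbour.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up reviews for illustration only; not taken from the collected dataset.
sample_reviews = [
    "Food arrived cold and the driver was late",
    "The driver was friendly but the food arrived cold",
    "Order never arrived and customer service was useless",
]

bigram_counts = count_bigrams(sample_reviews)
top_bigrams = bigram_counts.most_common(3)
```

Counting per review, rather than over one concatenated string, avoids creating spurious pairs from the last word of one review and the first word of the next. A real pipeline would likely also remove stop words before counting.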
graph LR
A[Build a tool to connect to web sources APIs] -->|Get reviews from web| B[Clean reviews]
B --> D[Knowledge Graphs]
B --> F[Unsupervised Clustering]
B --> C(Sentiment Analysis)
B --> |Identify topic of review| E[Topic Extraction]
E --> |Train Model| I[Assign Topic to new instances]
C --> |Train Model| J[Sentiment Classifier]
I --> K[Build UI]
J --> K[Build UI]
Version 1.0: (Most recent version of the Notebook can be found here: V1.0 Notebook)
- Implemented v1.0 of the web scraper and data collection API
- Developed a standard LDA model for topic identification
- Created first version of visualizations to present the results
To collect the reviews directly from the TrustPilot website, we created a web-scraping tool that automates this process across different companies and their corresponding reviews. The tool iterates across the pages of the website and collects the reviews and any other relevant information, storing the output in CSV files. Moreover, we have packaged the tool into a Python library. Hence, if you are thinking of working on a similar project where you need to retrieve data from TrustPilot, you can install the package that you can find here. As of January 2023, the package contains the main functionality to collect many different pieces of information from the website, such as the review text, reviewer_id, date of the review, user rating, and more.
For the first iteration of the project, we have built the aforementioned package with the functionality to retrieve the following information - which will also be the features in our dataset:
- Company: Name of the Company that we are examining (e.g. Deliveroo, UberEats, JustEat, Stuart)
- Id: The unique identifier for the review
- Reviewer_Id: Unique id for a reviewer/user
- Title: Title of the review
- Review: The text of the review submitted by the reviewer
- Date: Day of review submission
- Rating: The rating of the company, as submitted by the reviewer
Column/Feature | Type | Description |
---|---|---|
Company | NVARCHAR | Name of the delivery company |
Id | NVARCHAR | Id of the review |
Reviewer_Id | NVARCHAR | Id of the reviewer |
Title | NVARCHAR | Title of the review |
Review | NVARCHAR | The review itself - free text field |
Date | DATE | Day that the review was submitted |
Rating | BIGINT | Rating (1-5) |
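As a rough illustration of how rows with this schema could be read back from one of the CSV files, the sketch below parses each row into a typed record. The `ReviewRecord` class, the `parse_row` and `load_reviews` helpers, the ISO `YYYY-MM-DD` date format, and the sample row are all assumptions for illustration; the actual files may use a different date format or column order.

```python
import csv
import datetime
import io
from dataclasses import dataclass


@dataclass
class ReviewRecord:
    """One review, mirroring the columns in the table above (hypothetical class)."""
    company: str
    review_id: str
    reviewer_id: str
    title: str
    review: str
    date: datetime.date
    rating: int


def parse_row(row):
    """Convert one CSV row (a dict of strings) into a typed ReviewRecord."""
    rating = int(row["Rating"])
    if not 1 <= rating <= 5:
        raise ValueError(f"Rating out of range: {rating}")
    return ReviewRecord(
        company=row["Company"],
        review_id=row["Id"],
        reviewer_id=row["Reviewer_Id"],
        title=row["Title"],
        review=row["Review"],
        # Assumption: dates are stored as ISO strings such as 2023-01-15.
        date=datetime.date.fromisoformat(row["Date"]),
        rating=rating,
    )


def load_reviews(csv_file):
    """Read every row of an open CSV file into a list of ReviewRecord objects."""
    return [parse_row(row) for row in csv.DictReader(csv_file)]


# A made-up row for illustration only; not taken from the collected dataset.
sample_csv = io.StringIO(
    "Company,Id,Reviewer_Id,Title,Review,Date,Rating\n"
    'Deliveroo,r-001,u-042,Late order,"Arrived cold, an hour late",2023-01-15,1\n'
)
records = load_reviews(sample_csv)
```

Validating the rating range while parsing surfaces malformed rows early, before they reach the sentiment or topic models.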
To get reviews from the TrustPilot website, we use the custom-made web-scraping tool described above. Running it takes two steps:
- Set up the appropriate configurations in config.json (example). The config needs to be populated with the following metadata:
  - source_url: Main domain URL
  - starting_page: Domain subpath to a specific reviews page
  - steps: Defines number of pages to iterate over
  - company: Company/Service of interest
- Execute the Python retriever script:

      python data_retriever.py
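The sketch below shows one way the four config fields could be validated and turned into a list of page URLs to visit. It is not the code in `data_retriever.py`: the `load_config` and `page_urls` helpers, the `?page=N` query parameter, and the placeholder values in the example config are assumptions for illustration.

```python
import json

REQUIRED_KEYS = ("source_url", "starting_page", "steps", "company")


def load_config(text):
    """Parse the JSON config text and check that all four fields are present."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    return config


def page_urls(config):
    """Build one URL per page, from page 1 up to and including `steps`."""
    base = config["source_url"].rstrip("/") + "/" + config["starting_page"].lstrip("/")
    # Assumption: pages are addressed with a ?page=N query parameter.
    return [f"{base}?page={n}" for n in range(1, int(config["steps"]) + 1)]


# Placeholder values for illustration only; not the project's real config.
example_config = """
{
  "source_url": "https://example.com",
  "starting_page": "/review/example-company",
  "steps": 3,
  "company": "ExampleCompany"
}
"""

urls = page_urls(load_config(example_config))
```

Checking for missing keys up front gives a clear error message before any request is made, rather than a failure partway through a long scraping run.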