
viNLaP: Visualizing polarized data sources

This project was conceived out of the need to visualize polarized data easily and effectively. The first version of viNLaP aims to give an overview of the MAGA corpus, crawled during GSoC 2018 by Maja Goudoz. To do so, it focuses on three aspects: the variability of each side's vocabulary using Scattertexts, speech production by geographic region using GeoSpeech, and the analysis of speech generated from a specific word using Speech Igniters.

Getting Started

Clone this repository to your local machine.

Prerequisites

This project runs on Python 3.7. Each aspect is developed in a Jupyter notebook so that everything can be explored dynamically.

Install the following dependencies:

pip install pandas
pip install seaborn
pip install nltk
pip install numpy
pip install networkx
pip install matplotlib
pip install scattertext
pip install tweepy
pip install quantiphy

unicodedata and datetime ship with the Python standard library and do not need to be installed separately.
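If the preprocessing notebooks rely on NLTK tokenizers and stopword lists (an assumption; check the notebook imports), the corresponding resources need to be downloaded once:

import nltk

# Assumption: the notebooks use NLTK's 'punkt' tokenizer and stopword lists.
nltk.download('punkt')
nltk.download('stopwords')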

Structure of the repo

In data, you will find:

This dataset is an extension of the previously collected MAGA corpus. It contains the complete JSON object of each tweet. We hope this data will help other researchers in the future.

In notebooks, you will find four Jupyter notebooks:

GeoSpeech contains two scripts. The first one will help you crawl the complete JSON of a set of tweet IDs; this was used to build tweets_maga. The second one will infer the locations of the tweets even when they are not geolocated.
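As an illustration, here is a minimal sketch of this kind of hydration step using tweepy 3.x; the credentials, output path and batching below are placeholders, not the exact code of the notebook:

import json
import tweepy

# Placeholder credentials -- replace with your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

def hydrate(tweet_ids, out_path="tweets_full.jsonl"):
    """Fetch the complete JSON object of each tweet ID, 100 IDs per request."""
    with open(out_path, "w", encoding="utf-8") as out:
        for i in range(0, len(tweet_ids), 100):
            batch = tweet_ids[i:i + 100]
            for status in api.statuses_lookup(batch, tweet_mode="extended"):
                out.write(json.dumps(status._json) + "\n")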

This notebook will give you an overview of the data we are working with. We consider it useful for any tabular dataset you may have. You will find data distributions, unique themes, frequencies, etc.
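For example, a quick overview of this kind can be produced with pandas and seaborn; the file path and column names ("side", "text") below are hypothetical and should be adapted to the actual dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/tweets_maga.csv")  # hypothetical path

print(df["side"].value_counts())                     # tweets per polarized side
print(df["text"].str.split().str.len().describe())   # tweet length distribution

sns.countplot(x="side", data=df)                     # visual check of class balance
plt.show()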

ScattertextGenerator contains the main script of the project. It has a plug-and-play function to generate scattertexts like the ones in results .
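A minimal sketch of such a function using the scattertext library; the column names and category labels are assumptions, not the project's actual settings:

import pandas as pd
import scattertext as st

def generate_scattertext(df, category, not_category,
                         category_col="side", text_col="text",
                         out_path="results/scattertext_example.html"):
    """Build a scattertext explorer comparing two sides of a polarized corpus."""
    corpus = st.CorpusFromPandas(
        df, category_col=category_col, text_col=text_col,
        nlp=st.whitespace_nlp_with_sentences
    ).build()
    html = st.produce_scattertext_explorer(
        corpus,
        category=category,
        category_name=category,
        not_category_name=not_category,
        minimum_term_frequency=5,
    )
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)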

Speech Igniters contains two scripts. The first one preprocesses the text of each tweet to obtain the tokens for the word chains. The second one uses network graph theory to build a function that returns all the words related to a specific word of the dataset.
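A minimal sketch of one way to build such word chains with networkx, assuming a simple co-occurrence graph in which tokens of the same tweet become neighbours; the actual notebook may construct the graph differently:

import itertools
import networkx as nx

def build_word_graph(token_lists):
    """Build a co-occurrence graph from lists of tokens, one list per tweet."""
    graph = nx.Graph()
    for tokens in token_lists:
        for w1, w2 in itertools.combinations(set(tokens), 2):
            graph.add_edge(w1, w2)
    return graph

def ignite(graph, seed_word, depth=2):
    """Return all words reachable from seed_word within `depth` hops."""
    if seed_word not in graph:
        return set()
    return set(nx.single_source_shortest_path_length(graph, seed_word, cutoff=depth))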

In results, you will find:

The scattertexts generated from our case study's dataset.

Authors

  • Fabricio Layedra - GSoC 2019 Student with CLiPS - On GitHub

Acknowledgments

  • CLiPS - Supervision - Website
  • ESPOL University Big Data Research Club - Supervision - Website

About

GSoC 2019 Project developed by @FabricioLayedra under the supervision of @GuyDePaw. Contact [email protected]
