Team members: Kwon, Stuhler, and Wu
This is the GitHub repository for our final project in the Computational Content Analysis class. Using transcripts of All My Children, General Hospital, and Days of Our Lives, we analyze gender roles in American soap operas.
This is the README file of our annotated scripts.
Web scraper to get all the URLs. The transcript of each episode is stored on a separate webpage. This file builds a scraper that collects the URLs of those webpages and exports them to a CSV file.
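A minimal sketch of the URL-collection step, using only the Python standard library. It is not the repository's scraper: the sample HTML, the base URL, and the names `LinkCollector`, `collect_urls`, and `write_urls` are illustrative assumptions, and the real index pages may be laid out differently.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on an index page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the index page URL.
                self.urls.append(urljoin(self.base_url, href))


def collect_urls(html, base_url):
    """Return all link targets found in one page of HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.urls


def write_urls(urls, handle):
    """Write one URL per row, with a header, to an open file-like object."""
    writer = csv.writer(handle)
    writer.writerow(["url"])
    for url in urls:
        writer.writerow([url])


# Hypothetical index page; the real site layout is an assumption here.
sample_html = (
    '<a href="/amc/2001-01-02.html">Jan 2</a>'
    '<a href="/amc/2001-01-03.html">Jan 3</a>'
)
urls = collect_urls(sample_html, "https://example.com/")
buffer = io.StringIO()
write_urls(urls, buffer)
```

In practice the HTML would come from a downloaded index page and `handle` would be a file opened with `newline=""`, as the `csv` module recommends.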
Text scraper. This file scrapes the soap opera transcripts from the webpages. It uses the url.csv file constructed in 'get_url.py' and stores the scraped texts in .txt files by year and by show.
Parse the plain texts. This file takes the raw .txt files scraped from the internet and converts them into a format ready for analysis.
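One possible shape for this parsing step is sketched below. It assumes, without confirmation from the repository, that each speaking turn begins with a speaker name followed by a colon and that lines without such a prefix continue the previous turn. The name `parse_transcript` and the sample dialogue are hypothetical.

```python
import re

# Assumed turn format: a capitalized speaker name, a colon, then the utterance.
TURN_RE = re.compile(r"^([A-Z][A-Za-z .'-]{0,30}):\s*(.*)$")


def parse_transcript(raw_text):
    """Return a list of (speaker, utterance) tuples from raw transcript text."""
    turns = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group(1).strip(), match.group(2).strip()))
        elif turns:
            # A line without a speaker prefix continues the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}".strip())
    return turns


# Invented sample dialogue, used only to exercise the parser.
sample = """Erica: I never said that.
Not to you, anyway.

Jack: You didn't have to."""
turns = parse_transcript(sample)
```

A list of (speaker, utterance) pairs is convenient because the later analyses work at the level of speaking turns and per-character statements.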
This file runs the network analysis of characters based on their speaking turns in the soap operas.
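As a rough illustration of a network built from speaking turns, the sketch below links two characters whenever one speaks directly after the other. That edge definition, the weighted-degree measure, and the names `build_turn_network` and `weighted_degree` are assumptions for illustration; the repository's analysis may define ties differently.

```python
from collections import Counter


def build_turn_network(speakers):
    """Count undirected edges between consecutive, distinct speakers."""
    edges = Counter()
    for current, following in zip(speakers, speakers[1:]):
        if current != following:
            # Sort the pair so (A, B) and (B, A) share one undirected edge.
            edges[tuple(sorted((current, following)))] += 1
    return edges


def weighted_degree(edges):
    """Sum edge weights per character as a simple centrality measure."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Hypothetical sequence of speakers in one scene.
speakers = ["Erica", "Jack", "Erica", "Tad", "Tad", "Jack"]
edges = build_turn_network(speakers)
degree = weighted_degree(edges)
```

The resulting edge counts can be loaded into a graph library as weighted edges if richer measures are needed.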
This file runs the LDA topic model and its visualization.
These files compute within-character complexity. They install and import the pre-trained Word2vec models, project each statement of each character into the vector space, aggregate the word vectors of each statement into a centroid, calculate the cosine similarity between the centroids within each character, and build a cosine similarity matrix.
This file gets character complexity for the AMC show.
This file gets character complexity for the GH show.
This file gets character complexity for the DoOL show.
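The within-character complexity steps described above (statement centroids, then pairwise cosine similarity) can be sketched in plain Python. The three-dimensional `toy_vectors` stand in for pre-trained Word2vec embeddings, and the function names are hypothetical; this is a sketch under those assumptions, not the code in these files.

```python
import math

# Toy 3-dimensional vectors standing in for pre-trained Word2vec embeddings.
toy_vectors = {
    "love": [1.0, 0.0, 0.0],
    "hate": [0.0, 1.0, 0.0],
    "money": [0.0, 0.0, 1.0],
}


def centroid(statement, vectors):
    """Average the vectors of the known words in one statement."""
    known = [vectors[w] for w in statement.lower().split() if w in vectors]
    if not known:
        return None  # No word of this statement is in the vocabulary.
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(statements, vectors):
    """Pairwise cosine similarity between one character's statement centroids."""
    centroids = [centroid(s, vectors) for s in statements]
    centroids = [c for c in centroids if c is not None]
    return [[cosine(a, b) for b in centroids] for a in centroids]


# Hypothetical statements by a single character.
statements = ["love love", "love hate", "money"]
matrix = similarity_matrix(statements, toy_vectors)
```

With real embeddings, low average off-diagonal similarity would indicate that a character's statements are spread widely across the vector space.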
These files contain the between-character similarity (analysis part 4).
This file runs the 1st method to get character similarity.
This file runs the 2nd method to get character similarity. It also calculates the correlations among the different between-character similarity measures.
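To illustrate how two between-character similarity measures could be compared, the sketch below computes a Pearson correlation over scores for the same character pairs. The README does not say which correlation coefficient is used, so Pearson is an assumption here, and the scores in `method_1` and `method_2` are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented similarity scores for the same character pairs under two methods.
method_1 = [0.10, 0.40, 0.35, 0.80]
method_2 = [0.20, 0.50, 0.30, 0.90]
r = pearson(method_1, method_2)
```

A value of `r` close to 1 would suggest the two measures rank character pairs in much the same way.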