duongngyn0510/crawl-vietnamese-newpapers

Text-Summarization-for-Vietnamese-Newpapers

1. Data Collection

Data is crawled with Scrapy from dantri, vietnamnet...

In Git Bash or a Linux terminal, run bash pipe_crawl_vnn.bash and bash pipe_crawl_dantri.bash from the src/crawl_paper folder. These commands create a new folder, src/crawl_paper/raw_data, containing the raw dataset.

Articles from dantri cover 18 categories.

Articles from vietnamnet cover 14 categories.

Each article is saved as a JSON file with 4 fields (url, title, abstract and html_content).

Note that none of the collected data has been preprocessed yet.
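
The raw files can be inspected with a few lines of Python. This is a minimal sketch, not part of the repository: it assumes each JSON file holds a single object with the four fields listed above, and it searches src/crawl_paper/raw_data recursively because the folder layout below raw_data is not described here. The names load_article and iter_articles are hypothetical helpers.

```python
import json
from pathlib import Path

# The four fields described above.
EXPECTED_FIELDS = ("url", "title", "abstract", "html_content")


def load_article(path):
    """Read one article JSON file and check that all four fields are present."""
    with open(path, encoding="utf-8") as f:
        article = json.load(f)
    missing = [key for key in EXPECTED_FIELDS if key not in article]
    if missing:
        raise ValueError(f"{path}: missing fields {missing}")
    return article


def iter_articles(raw_dir):
    """Yield every article under raw_dir (assumed layout: any depth of subfolders)."""
    for path in sorted(Path(raw_dir).rglob("*.json")):
        yield load_article(path)


if __name__ == "__main__":
    # Assumes the crawl scripts have already been run from src/crawl_paper.
    for article in iter_articles("src/crawl_paper/raw_data"):
        print(article["url"], "|", article["title"])
```

Since html_content is still raw HTML at this stage, anything read this way needs the preprocessing step before it can be used for summarization.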

2. Preprocessing
