- improve document extension list
- add a few more video extensions
- Implement relative links (thanks Sebastian Nagel)
- add filename and url metadata (thanks @marianna13)
- add filename and url metadata
- Add text and video document types
- Rename to cc2dataset
- Support audio document type
- Restart spark session for each part.
- Improve error handling and logging.
- Implement resume + speed up by reading each file from S3 all at once.
- Add try/catch around archive reading to handle broken WAT files.
- Implement multipart.
- Shuffle + use date as output path + write WAT index files + shuffle input WAT files
- deduplication
- it works