HISTORY.md

1.5.0

  • improve document extension list
  • add a few more video extensions
  • Implement relative links (thanks Sebastian Nagel)
  • add filename and url metadata (thanks @marianna13)

1.4.0

  • Add text and video document types

1.3.1

  • Rename to cc2dataset

1.3.0

  • Support audio document type
  • Restart spark session for each part.
  • Improve error handling and logging.
  • Implement resume + speed up by reading the file from S3 all at once.

1.2.0

  • Add try/catch on archive for broken WAT files.
  • Implement multipart.
  • Shuffle + use date as output path + write WAT index files + shuffle input WAT files

1.1.0

  • deduplication

1.0.0

  • it works