- Fix bug with `extract_text_from_files` helper tool
- Sort files based on unidecoded ASCII text (i.e. without diacritics, accents, umlauts, etc.)
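
As a rough illustration of what sorting on diacritic-free text means, here is a minimal sketch. It uses the standard library's `unicodedata` as a stand-in for a transliteration package, so it only strips combining marks (it will not, for example, turn `ß` into `ss`). The helper names `ascii_fold` and `sort_by_ascii` are hypothetical and not part of this project.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_by_ascii(names):
    """Sort on the folded, case-insensitive form; original strings are returned unchanged."""
    return sorted(names, key=lambda name: ascii_fold(name).casefold())


filenames = ["Zürich.pdf", "école.png", "apple.txt", "Österreich.jpg"]
print(sort_by_ascii(filenames))
```

Without the folding step, `école.png` and `Österreich.jpg` would sort after every plain ASCII name because their first characters have higher code points.
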
- More default sorting rules
- New `--hide-dirs` option
- New `--anonymize-user-dir` option
- Bump `PyPDF` to 5.0.1
- More default sorting rules
- Upgrade `PyPDF` to 4.3.1
- Handle exceptions arising when trying to extract pages for submission as `PyPDF` bugs
- Fix displaying contents of PDF text by escaping text that looks like a `rich` markup tag
- New default sorting rules
- Make `\b` word boundary in a configured `SortRule` also match against underscores (which it doesn't by default)
- New default sorting rules
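
The `\b` entry is easier to see with an example. In Python's `re`, `_` counts as a word character, so `\bbtc\b` does not match inside `my_btc_wallet.png`. One possible way to loosen this, shown here as a sketch and not necessarily how `SortRule` does it, is to swap each `\b` for a lookaround-based boundary that treats only ASCII letters and digits as word characters. `compile_rule` and the constant names are hypothetical.

```python
import re

# Only ASCII letters and digits count as "word" characters here, so "_" is a boundary.
ALNUM = "[A-Za-z0-9]"
UNDERSCORE_AWARE_BOUNDARY = (
    rf"(?:(?<!{ALNUM})(?={ALNUM})|(?<={ALNUM})(?!{ALNUM}))"
)


def compile_rule(pattern: str) -> re.Pattern:
    """Compile a rule regex with every literal \\b swapped for the looser boundary.

    Simplification: a plain string replace would also rewrite an escaped
    backslash followed by "b"; a real implementation would need to handle that.
    """
    return re.compile(pattern.replace(r"\b", UNDERSCORE_AWARE_BOUNDARY), re.IGNORECASE)


default_match = re.search(r"\bbtc\b", "my_btc_wallet.png")
loose_match = compile_rule(r"\bbtc\b").search("my_btc_wallet.png")
print(default_match is None, loose_match is not None)
```

The looser boundary still refuses to match `btc` when it is embedded in a longer run of letters, such as `subtcx.png`.
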
- Make `RuleMatch` a real class
- New default sorting rules
- New script `extract_pages_from_pdf` lets you easily rip pages out of a PDF
- Add `--page-range` argument to both `extract_pages_from_pdf` and `extract_text_from_files`
- `pypdf` exceptions will trigger the offending page to be extracted and a suggestion made to the user that they submit the page to the `pypdf` team
- Bump `pypdf` to version 3.14.0 (fixes for many bugs on edge case PDFs)
- Better handling of sort rules that fail to parse
- New crypto sort rules
- Rename `--print-when-parsed` command line option to `--print-as-parsed`
- Suppress `/JBIG2Decode` warning output when decoding PDFs
- Refactor overwrite confirmation, use stderr
- Allow `Pillow` 10.0.0
- Reduce required python version to 3.8
- Output progress notifications to STDERR when parsing text from very large PDFs
- Fix issue that caused explosive memory growth when parsing large PDFs
- `--print-when-parsed` command line option for `extract_text_from_files`
- Upgrade `pypdf` to 3.12.0 to resolve various PDF parsing failures
- PDFs: Handle various exceptions when enumerating embedded images:
  - `OSError: cannot write mode CMYK as PNG`
  - `ValueError: not enough image data`
  - `TypeError: unhashable type: 'ArrayObject'`
  - `TypeError: unhashable type: 'IndirectObject'`
- Parse text from images in PDFs (some PDFs contain no text, only images)
- Improve `extract_text_from_files` functionality
- Actually make `extract_text_from_files` executable
- Add a script to extract files
- More default sort rules
- Handle 0 byte PDF error
- More default rules
- Skip comment rows starting with `#` in rules CSVs.
- New default rules
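
For readers writing their own rules files, comment skipping can be pictured roughly as follows. This is a sketch under assumptions: the column names `folder` and `regex` and the function `read_rules` are made up for illustration and may not match the project's actual CSV layout or loader.

```python
import csv
import io


def read_rules(csv_file):
    """Yield rule rows as dicts, ignoring blank lines and lines starting with '#'."""
    data_lines = (
        line
        for line in csv_file
        if line.strip() and not line.lstrip().startswith("#")
    )
    yield from csv.DictReader(data_lines)


# Hypothetical rules file with two comment rows.
sample = io.StringIO(
    "folder,regex\n"
    "# exchanges\n"
    "Binance,binance\n"
    "# banks\n"
    "Silvergate,silvergate\n"
)
rules = list(read_rules(sample))
print(rules)
```

Filtering lines before they reach `csv.DictReader` keeps the parser itself untouched. The trade-off is that a quoted multi-line field whose continuation line begins with `#` would be dropped as well.
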
- Crop very long PDF pages when previewing in manual select window
- Fix regexes for wallet addresses
- Gracefully handle failures in file timestamp copying call
- Fix bug with manual folder selection
- Refactor `move_to_processed_dir()` and call from `sort_selector.py`.
- Add `--force` and `[gui,pdf]` to the `pipx` installation instructions
- Make the brackets print in the optional package install instructions
- Only check for `pymupdf` once
- PDF previews in manual sorting windows
- Override premature release
- Fix the install messages for missing packages
- Combobox instead of radio buttons for manual fallback directory select
- Replace fewer special characters in filenames
- New default crypto sort rules
- Make `--manual-fallback` and `--only-if-match` mutually exclusive
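
With Python's standard `argparse`, mutual exclusion of two flags is typically expressed with `add_mutually_exclusive_group()`. The sketch below assumes both options are simple boolean switches and uses a made-up program name. It shows the general technique, not this project's actual argument parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "sort_example" is a hypothetical program name used only for this sketch.
    parser = argparse.ArgumentParser(prog="sort_example")
    group = parser.add_mutually_exclusive_group()
    # Assumption: both options are boolean switches.
    group.add_argument("--manual-fallback", action="store_true",
                       help="ask the user where to put files that match no rule")
    group.add_argument("--only-if-match", action="store_true",
                       help="leave files that match no rule where they are")
    return parser


args = build_parser().parse_args(["--manual-fallback"])
print(args.manual_fallback, args.only_if_match)
```

Passing both flags together makes `argparse` print a usage error and exit, so the conflict is caught before any files are touched.
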
- New default crypto sort rules
- Add `--manual-fallback` option
- Handle truncated image file binary errors
- Display the string that matched the rule when copying
- Better handling of re-scanning already sorted files
- New crypto sorting rules
- Handle more kinds of unparseable PDFs gracefully
- New crypto sort rules
- Make `--rescan-sorted` respect `--all` flag
- Don't append empty string to images that no text was extracted from
- More crypto sort rules
- Handle unparseable PDFs with a warning instead of a crash.
- `--yes-overwrite` option should not default to true
- New crypto sort rules (MCB and CUBI now sorted to their own directories instead of 'Banks')
- Ask for confirmation before overwriting files (`--yes-overwrite` option to skip the check)
- Check if image's extracted text is already in the filename (important if rescanning)
- New crypto sort rules
- `--only-if-match` option
- New crypto sort rules
- Filenames for retweets
- New crypto sort rules
- New sort rules
- Wallet address sorting rules
- Apply extracted text to non-Tweets, non-Reddit posts.
- `--delete-originals` option
- `--rescan-sorted` option
- Avoid moving files that start in a sorted location out to the processed dir
- `--manual-sort` option
- Couple new sorting rules
- Read `CLOWN_SORT_FILENAME_REGEX` default from env.
- Fix PySimpleGui packages reqs
- Read `CLOWN_SORT_SCREENSHOTS_DIR` and `CLOWN_SORT_DESTINATION_DIR` defaults.
- Use tilde notation for home dir in default args.
- Only sort filenames if they match the regex OR have a sorting folder match
- Allow multiple rules CSV files
- Allow configuration of rules CSV files via `.clown_sort` file
- Improve Tweet matching
- Improve logging
- Rename to `clown_sort`
- Logging adjustments
- `--show-rules` argument, `--debug` argument
- More sorting rules
- Logging adjustments
- Logging adjustments.
- Copy non image files so they can be sent to multiple folders.
- Preserve all timestamps
- More command line args
- Initial release.