Browsertrix Crawler 0.11.0
New Features
- Store favicon URLs as `favIconUrl` in pages.jsonl
- Support for filtering sitemap by date (from specified date)
- Link extraction optimizations
- Behaviors now run only after the page has fully loaded and link extraction has finished; previously, autoplay/autofetch would start right away.
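As a rough illustration of the new `favIconUrl` field, the sketch below reads page records in JSON Lines form and collects the favicon URL of each page. Only the `favIconUrl` name comes from these notes. The `url` field, the helper name `iter_favicons`, and the sample records are assumptions made for the example; the real layout of pages.jsonl may differ.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_favicons(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (page_url, favicon_url) for each record that has a favIconUrl.

    Hypothetical helper: assumes one JSON object per line and a "url"
    field on page records. Only "favIconUrl" is named in the release notes.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        favicon = record.get("favIconUrl")
        if favicon:
            # Records without a favicon (or non-page header lines) are skipped.
            yield record.get("url", ""), favicon


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        '{"format": "json-pages-1.0", "id": "pages"}',
        '{"url": "https://example.com/", "favIconUrl": "https://example.com/favicon.ico"}',
        '{"url": "https://example.com/about"}',
    ]
    for page_url, favicon_url in iter_favicons(sample):
        print(page_url, favicon_url)
```

To run it against a real crawl, pass an open file handle for pages.jsonl instead of the sample list; the function only needs an iterable of lines.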
What's Changed
- link extraction optimization: for scopeType page, set depth == extraH… by @ikreymer in #364
- improve exit features: individual instance exit + exit code for interrupt by @ikreymer in #366
- feat: precommit by @Chickensoupwithrice in #363
- Capture Favicon by @Chickensoupwithrice in #362
- logging: resolve confusion with 'crawl done' not being written to log… by @ikreymer in #375
- logging fixes: avoid duplicate logging for same error by @ikreymer in #377
- Surface lastmod option for sitemap parser by @ghukill in #367
- Add example of mounting custom behaviours by @Chickensoupwithrice in #369
- various fixes regarding state restart: by @ikreymer in #370
- status: fix typo setting status to log message by @ikreymer in #379
- Add option to output stats file live, i.e. after each page crawled by @benoit74 in #374
- behavior logging tweaks, add netIdle by @ikreymer in #381
- Update tldextract cache for pywb during build by @vnznznz in #383
- Enhance file stats test to detect file modification by @benoit74 in #382
- optimize link extraction: (fixes #376) by @ikreymer in #380
New Contributors
- @Chickensoupwithrice made their first contribution in #363
- @ghukill made their first contribution in #367
- @benoit74 made their first contribution in #374
- @vnznznz made their first contribution in #383
Full Changelog: v0.10.4...v0.11.0