Releases: dadoonet/fscrawler
v2.9 🌈
What's Changed
- Bump log4j.version from 2.15.0 to 2.16.0 by @dependabot in #1325
- Split Build, IT and Unit Tests by @dadoonet in #1327
- Bump snakeyaml from 1.29 to 1.30 by @dependabot in #1326
- Bump log4j-core from 2.16.0 to 2.17.0 by @dependabot in #1329
- Bump docker-maven-plugin from 0.38.0 to 0.38.1 by @dependabot in #1332
- Bump jackson.version from 2.13.0 to 2.13.1 by @dependabot in #1331
- Bump elasticsearch-rest-high-level-client from 7.16.1 to 7.16.2 by @dependabot in #1333
- Bump maven-jar-plugin from 3.2.0 to 3.2.1 by @dependabot in #1347
- Bump build-helper-maven-plugin from 3.2.0 to 3.3.0 by @dependabot in #1346
- Bump log4j-api from 2.17.0 to 2.17.1 by @dependabot in #1339
- Switch to the new sonatype service by @dadoonet in #1348
- Bump tika.version from 2.1.0 to 2.2.0 by @dependabot in #1330
- Improve documentation for settings by @cbb-colab in #1345
- Add more default displayed fields by @dadoonet in #1298
New Contributors
- @cbb-colab made their first contribution in #1345
Full Changelog: fscrawler-2.8...fscrawler-2.9
v2.8 🌈
What's Changed
- #1356: ci(Mergify): configuration update (thanks to @dadoonet)
- #1322: Update Log4J 2.15.0 and Elasticsearch 7.16.1 (thanks to @dadoonet)
- #1276: Revert "Remove a52f2ab6-086b-4285-a7a1-78ecdc6404ba vulnerability id" (thanks to @dadoonet)
- #1275: Remove a52f2ab6-086b-4285-a7a1-78ecdc6404ba vulnerability id (thanks to @dadoonet)
- #1228: latest docker tag should be only the latest stable version (thanks to @dadoonet)
🚀 New features
- #1368: Add support for Delete Document (thanks to @dadoonet)
- #1298: Add more default displayed fields (thanks to @dadoonet)
🚨 Bug Fixes
- #1358: fs.ocr.enabled is always false (thanks to @ywjung)
- #1286: Fix starting fscrawler with Docker (thanks to @dadoonet)
- #1271: fix: not working optional libraries (e.g. jpeg2000) (thanks to @NickUfer)
💉 Updated features
- #1393: Bump guava from 31.0.1-jre to 31.1-jre (thanks to @dependabot)
- #1392: Bump docker-maven-plugin from 0.39.0 to 0.39.1 (thanks to @dependabot)
- #1376: Use our own Http Client and remove specific distributions (thanks to @dadoonet)
- #1390: Bump nexus-staging-maven-plugin from 1.6.10 to 1.6.12 (thanks to @dependabot)
- #1387: Bump maven-compiler-plugin from 3.9.0 to 3.10.0 (thanks to @dependabot)
- #1386: Bump nexus-staging-maven-plugin from 1.6.8 to 1.6.10 (thanks to @dependabot)
- #1385: Bump maven-javadoc-plugin from 3.3.1 to 3.3.2 (thanks to @dependabot)
- #1384: Bump jakarta.activation-api from 2.0.1 to 2.1.0 (thanks to @dependabot)
- #1383: Bump jersey.version from 3.0.3 to 3.0.4 (thanks to @dependabot)
- #1381: Bump slf4j-api from 1.7.35 to 1.7.36 (thanks to @dependabot)
- #1382: Bump jcl-over-slf4j from 1.7.33 to 1.7.36 (thanks to @dependabot)
- #1377: Bump websocket-client from 9.4.44.v20210927 to 9.4.45.v20220203 (thanks to @dependabot)
- #1374: Bump docker-maven-plugin from 0.38.1 to 0.39.0 (thanks to @dependabot)
- #1373: Bump ossindex-maven-plugin from 3.1.0 to 3.2.0 (thanks to @dependabot)
- #1371: Update to Elasticsearch 7.17.0 (thanks to @dadoonet)
- #1369: Bump json-path from 2.6.0 to 2.7.0 (thanks to @dependabot)
- #1365: Bump slf4j-api from 1.7.33 to 1.7.35 (thanks to @dependabot)
- #1364: Bump versions-maven-plugin from 2.8.1 to 2.9.0 (thanks to @dependabot)
- #1355: Bump jcl-over-slf4j from 1.7.32 to 1.7.33 (thanks to @dependabot)
- #1354: Bump elasticsearch-rest-high-level-client from 7.16.2 to 7.16.3 (thanks to @dependabot)
- #1353: Bump slf4j-api from 1.7.32 to 1.7.33 (thanks to @dependabot)
- #1352: Bump woodstox-core from 6.2.7 to 6.2.8 (thanks to @dependabot)
- #1350: Bump maven-jar-plugin from 3.2.1 to 3.2.2 (thanks to @dependabot)
- #1351: Bump maven-compiler-plugin from 3.8.1 to 3.9.0 (thanks to @dependabot)
- #1349: Bump jcommander from 1.81 to 1.82 (thanks to @dependabot)
- #1330: Bump tika.version from 2.1.0 to 2.2.0 (thanks to @dependabot)
- #1348: Switch to the new sonatype service (thanks to @dadoonet)
- #1339: Bump log4j-api from 2.17.0 to 2.17.1 (thanks to @dependabot)
- #1346: Bump build-helper-maven-plugin from 3.2.0 to 3.3.0 (thanks to @dependabot)
- #1347: Bump maven-jar-plugin from 3.2.0 to 3.2.1 (thanks to @dependabot)
- #1333: Bump elasticsearch-rest-high-level-client from 7.16.1 to 7.16.2 (thanks to @dependabot)
- #1331: Bump jackson.version from 2.13.0 to 2.13.1 (thanks to @dependabot)
- #1332: Bump docker-maven-plugin from 0.38.0 to 0.38.1 (thanks to @dependabot)
- #1329: Bump log4j-core from 2.16.0 to 2.17.0 (thanks to @dependabot)
- #1326: Bump snakeyaml from 1.29 to 1.30 (thanks to @dependabot)
- #1325: Bump log4j.version from 2.15.0 to 2.16.0 (thanks to @dependabot)
- #1309: Bump woodstox-core from 6.2.6 to 6.2.7 (thanks to @dependabot)
- #1314: Bump bcprov-jdk15on from 1.69 to 1.70 (thanks to @dependabot)
- #1316: Bump httpcore.version from 4.4.14 to 4.4.15 (thanks to @dependabot)
- #1321: Bump httpasyncclient from 4.1.4 to 4.1.5 (thanks to @dependabot)
- #1301: Bump docker-maven-plugin from 0.37.0 to 0.38.0 (thanks to @dependabot)
- #1317: Bump jdom2 from 2.0.6 to 2.0.6.1 (thanks to @dependabot)
- #1320: Bump log4j-core from 2.14.1 to 2.15.0 (thanks to @dependabot)
- #1290: Bump joda-time from 2.10.12 to 2.10.13 (thanks to @dependabot)
- #1288: Bump junit4-maven-plugin from 2.7.8 to 2.7.9 (thanks to @dependabot)
- #1289: Bump randomizedtesting-runner from 2.7.8 to 2.7.9 (thanks to @dependabot)
- #1285: Bump jansi from 2.3.4 to 2.4.0 (thanks to @dependabot)
- #1277: Bump jsoup from 1.14.2 to 1.14.3 (thanks to @dependabot)
- #1278: Bump joda-time from 2.10.10 to 2.10.12 (thanks to @dependabot)
- #1280: Bump guava from 30.1.1-jre to 31.0.1-jre (thanks to @dependabot)
- #1279: Bump jcl-over-slf4j from 1.7.31 to 1.7.32 (thanks to @dependabot)
- #1198: Update to Tika 2.1 (thanks to @dadoonet)
- #1268: Bump jackson.version from 2.12.5 to 2.13.0 (thanks to @dependabot)
- #1265: Bump guava from 30.1.1-jre to 31.0.1-jre (thanks to @dependabot)
- #1262: Bump MockFtpServer from 2.8.0 to 3.0.0 (thanks to @dependabot)
- #1269: Bump jsoup from 1.14.2 to 1.14.3 (thanks to @dependabot)
- #1270: Bump websocket-client from 9.4.43.v20210629 to 9.4.44.v20210927 (thanks to @dependabot)
- #1261: Bump elasticsearch-rest-high-level-client from 7.14.1 to 7.15.0 (thanks to @dependabot)
- #1260: Bump jersey.version from 3.0.2 to 3.0.3 (thanks to @dependabot)
- #1248: Bump maven-javadoc-plugin from 3.3.0 to 3.3.1 (thanks to @dependabot)
- #1243: Bump sqlite-jdbc from 3.36.0.2 to 3.36.0.3 (thanks to @dependabot)
- #1242: Bump jackson.version from 2.12.4 to 2.12.5 (thanks to @dependabot)
- #1245: Bump elasticsearch-rest-high-level-client from 7.14.0 to 7.14.1 (thanks to @dependabot)
- #1241: Bump sqlite-jdbc from 3.36.0.1 to 3.36.0.2 (thanks to @dependabot)
- #1233: Bump docker-maven-plugin from 0.36.1 to 0.37.0 (thanks to @dependabot)
📝 Documentation updates
- #1345: Improve documentation for settings (thanks to @cbb-colab)
- #1310: Update ocr.rst, the path was wrong and not working (thanks to @sahin52)
- #1256: Add section Workaround for huge temporary files (thanks to @dfbm)
🚦 Tests
- #1327: Split Build, IT and Unit Tests (thanks to @dadoonet)
- #1323: Add more traces when converting dates (thanks to @dadoonet)
Thanks to
@NickUfer, @cbb-colab, @cwperry, @dadoonet, @dependabot, @dependabot[bot], @dfbm, @mergify[bot], @sahin52 and @ywjung
FSCrawler 2.7 🌈
The FSCrawler team is pleased to announce the FSCrawler 2.7 release!
FSCrawler
FS Crawler offers a simple way to index binary files into Elasticsearch.
Usage
Download FSCrawler 2.7:
wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7/fscrawler-es7-2.7.zip
Start FS crawler with:
bin/fscrawler job_name
FS crawler will read a local settings file (by default ~/.fscrawler/{job_name}/_settings.json).
If the file does not exist, FS crawler will propose to create your first job.
$ bin/fscrawler job_name
18:28:58,174 WARN [f.p.e.c.f.FsCrawler] job [job_name] does not exist
18:28:58,177 INFO [f.p.e.c.f.FsCrawler] Do you want to create it (Y/N)?
y
18:29:05,711 INFO [f.p.e.c.f.FsCrawler] Settings have been created in [~/.fscrawler/job_name/_settings.json]. Please review and edit before relaunch
Create a directory named /tmp/es or c:\tmp\es, add some files you want to index in it and start again:
$ bin/fscrawler job_name
18:30:34,330 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:30:34,332 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
18:30:34,682 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
More details in the documentation.
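The steps above can be gathered into a small script. This is only a sketch under a few assumptions: the helper names (fetch_fscrawler, prepare_data_dir, run_job) and the variables JOB_NAME, DATA_DIR and FSCRAWLER_HOME are hypothetical, wget and unzip are assumed to be installed, and FSCRAWLER_HOME is assumed to point at the unpacked distribution directory that contains bin/fscrawler. DATA_DIR only prepares files on disk; it has to match the directory the job's settings actually crawl (/tmp/es in the log output above).

```shell
#!/bin/sh
# Sketch only: helper and variable names are illustrative, not part of FSCrawler.
set -eu

JOB_NAME="${JOB_NAME:-job_name}"
DATA_DIR="${DATA_DIR:-/tmp/es}"
FSCRAWLER_HOME="${FSCRAWLER_HOME:-.}"
ARCHIVE_URL="https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7/fscrawler-es7-2.7.zip"

# Download and unpack the 2.7 distribution (needs network access, wget and unzip).
fetch_fscrawler() {
  wget "$ARCHIVE_URL"
  unzip "$(basename "$ARCHIVE_URL")"
}

# Create the directory to crawl and drop one sample file into it.
prepare_data_dir() {
  mkdir -p "$DATA_DIR"
  printf 'hello from fscrawler\n' > "$DATA_DIR/sample.txt"
}

# Start the crawler for the job. On the first run it offers to create the
# settings file; review that file, then run the job again.
run_job() {
  "$FSCRAWLER_HOME/bin/fscrawler" "$JOB_NAME"
}
```

Nothing runs when the script is sourced. A first session would call fetch_fscrawler, then run_job once to create the settings, then prepare_data_dir, then run_job again to start crawling.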
New features
- #991: Add Workplace Search connector.
- #1203: Add FTP crawler. By helsonxiao.
- #1211: Add file.content_type field on folders.
- #1210: Add file.filename field on folders.
- #1179: Automatically create Custom Sources.
- #1037: Split console logs and actual logs and add a banner :).
- #1036: Support ssl verification configurable. By TommyLike.
- #1035: Log index errors in documents.log.
- #1031: Add an external Log4J2 configuration file.
- #907: Add path_prefix option.
- #820: Generate FSCrawler docker images. By toto1310.
- #776: Report HEAP size at startup.
- #752: Add option to ignore symlinks. By budachst.
- #715: Allow custom index name in the REST API. By kikkauz.
- #698: Add Cross-Origin Resource Sharing (CORS) headers to RestServer. By isaac-ipl.
- #692: Allow running OCR but not on PDF files.
- #673: Add support for YAML configuration.
- #663: Add Patterns table to includes and excludes. By wrathagom.
Fixed Bugs
- #1224: Fix NPE in Console when running with Docker.
- #1217: Check if date is null when formatting it to RFC3339.
- #1204: Split build and deploy phases for Docker images.
- #1201: 2.7 - Docker image broken. By agrantdeakin.
- #1194: Elasticsearch node settings should not be null by default.
- #1193: Corrupt PDF can lead to a StackOverflow.
- #1137: Ignore errors when parsing a 0 byte file.
- #1085: fscrawler.bat added a CD to move to the appropriate directory. By CircuitGuy.
- #1084: InputStream must have > 0 bytes. By yuanzhian.
- #1066: Start fscrawler instead of internal services.
- #1041: Fixed an issue that caused an error when running in a windows environment. By muraken720.
- #1006: Running fscrawler with no argument now lists existing jobs. By janhoy.
- #1005: Fix ENTRYPOINT in Dockerfile to allow variable substitution. By Maijin.
- #994: Using cloud id gives "invalid IPv6 Address". By tdaroly.
- #973: Fix SSH crawling from Windows machine.
- #899: FSCrawler can't index .doc or .docx elements. By LaaKii.
- #895: java.lang.NoSuchMethodError: parsing some Word files. By mwaltersbmc.
- #860: Bug Syntax error in fscrawler file, to init fscrawler. By CarlosRCDev.
- #847: sun.jnu.encoding=UTF-8 added in .bat and .sh both. By shahariaazam.
- #834: FS Crawler freezes when crawling a 0 byte TXT file. By dansfelix.
- #819: Fix Percentage computation.
- #760: Allow passing test parameters to Maven CLI.
- #714: fix release-drafter. By jetersen.
- #701: Change log level and display logs only if filters on content.
- #691: OCR without pdf_ocr. By Newmski.
- #686: Wait for healthy index when creating the index.
- #681: SSH dirs should be seen as dirs and not files.
- #680: trying to index remote files with ssh - files seen as folder. By sblanc0054.
- #660: Fix authentication when sending announcement email.
Main changes
- #1218: Isolate WorkplaceSearchClient and ElasticsearchClient.
- #1213: Switch back to Java 11.
- #1049: Update Dockerfile to use JDK14. By mario-89.
- #1212: Let's use JsonPath.
- #1207: Generate only 2 docker images.
- #1206: Detect when fscrawler runs in foreground and adapt logs.
- #1205: Add logs to the console when running a Docker instance.
- #1172: Move CI from Travis to GitHub actions.
- #872: Add more information to the _simulate API.
- #700: Add dependency convergence checks.
- #695: Exclude the PDFParser from the DefaultParser.
- #694: Display full names when catching parsing errors.
- #693: Move fs.pdf_ocr setting to fs.ocr.pdf_strategy.
- #675: Warn in case of Tika error.
- #1219: Update to Elasticsearch 7.14.0 and 6.8.18.
- #1180: Bump tika.version from 1.26 to 1.27.
Removed
- #978: files lost. By bluebell1990.
Have fun!
-FSCrawler team
FSCrawler 2.6
What's Changed
- Update Jackson to 2.9.8 (#657) @dadoonet
- Update to Tika 1.20 (#655) @dadoonet
- Update to Elasticsearch 6.5.3 (#649) @dadoonet
- Add a warning when using both silent and debug/trace (#647) @dadoonet
- Add documentation on how to run as a Windows service (#648) @dadoonet
- Check Elasticsearch 6 minor version (#642) @dadoonet
- Force the default number of shards to be 1 (#644) @dadoonet
- Update Guava transitive dependency to 27.0.1-jre (#645) @dadoonet
- Revisit Elasticsearch.Node and Rest settings (#638) @dadoonet
- Update to elasticsearch 6.5.1 (#637) @dadoonet
- Ignore dirs when .fscrawlerignore file is detected (#633) @dadoonet
- Update issue templates (#632) @dadoonet
- Support multiple OCR languages (#631) @dadoonet
- Update Tika to 1.19.1 (#624) @dadoonet
- Create specific elasticsearch clients (#616) @dadoonet
- Add Release Drafter to automatically generate the release notes (#611) @dadoonet
- Add a Noop Parser (#610) @dadoonet
- Dump stack when not able to close FSCrawler (#609) @dadoonet
- Make default root dir Windows compatible (#595) @dadoonet
- Update to Tika 1.19 (#603) @dadoonet
- Update ossindex-maven-plugin to 3.0.1 (#604) @dadoonet
- Update to Jackson 2.9.7 (#602) @dadoonet
- Update to Elasticsearch 6.4.1 (#594) @dadoonet
- Add LGTM code quality badges (#597) @xcorail
- Support XML reoccurring structures (#593) @dadoonet
- Add a filter by content option (#585) @dadoonet
- Exclude dirs depending on dir full name (relative to root) (#561) @dadoonet
- Ignore files bigger than X (#584) @dadoonet
- Add hocr option for Tesseract-based OCR (#583) @dadoonet
- Allow path partial matching (#582) @dadoonet
- Add support for Last Accessed date and Created date (#580) @dadoonet
- Use _doc doc type instead of doc (#581) @dadoonet
- Fix wrong detection of removed settings (#579) @dadoonet
- Add support for cloud id (#577) @dadoonet
- Update maven-compiler-plugin to 3.8.0 (#576) @dadoonet
- Add ossindex Maven plugin (#572) @dadoonet
- Close bulk processors with awaitClose instead of close (#570) @dadoonet
- Update to elasticsearch 6.3.2 (#569) @dadoonet
- Add File Permissions to generated documents (#567) @dadoonet
- Skip sonar build for external PRs (#568) @dadoonet
- Add a developer guide (#565) @dadoonet
- Add support for bulk size in bytes with unit (#563) @dadoonet
- Update to Elasticsearch 6.3.1 (#557) @dadoonet
- Revert "Use _doc doc type instead of doc" (#558) @dadoonet
- Use _doc doc type instead of doc (#554) @dadoonet
- Fix Sonar Critical issues (#551) @dadoonet
- Fix SonarQube hook (#550) @dadoonet
- Move documentation to https://readthedocs.org (#543) @dadoonet
- Allow using store_source without indexing content (#544) @dadoonet
- Update to Tika 1.18 (#542) @dadoonet
- Update to Elasticsearch 6.3.0 (#541) @dadoonet
- Add a version check in tests (#527) @dadoonet
- Raw fields should be considered as text/keyword (#526) @dadoonet
- Add tests on OSS image as well (#525) @dadoonet
- Update elasticsearch to 6.2.2 (#524) @dadoonet
- Check that pipeline actually exists when starting (#522) @dadoonet
- Allow setting Tesseract path to executable and data (#520) @dadoonet
- Reduce Time to run tests from the IDE (#518) @dadoonet
- Update to elasticsearch 6.2.1 (#517) @dadoonet
- Split IT into different classes (#514) @dadoonet
- Start elasticsearch with docker-maven-plugin when running from the CLI (#513) @dadoonet
- Autodetect if a local node is running before starting docker (#512) @dadoonet
- Start removal of core module (#508) @dadoonet
- Create fscrawler-rest module (#506) @dadoonet
- Create fscrawler-crawler-fs and fscrawler-crawler-ssh modules (#505) @dadoonet
- Clean package names (#504) @dadoonet
- Create fscrawler-tika and fscrawler-beans modules (#503) @dadoonet
- Create the fscrawler-cli module (#502) @dadoonet
- Move to Docker based integration tests (#500) @dadoonet
- Modify announcement email (#501) @dadoonet
- readme: add note that fs settings also affect rest (#492) @shadiakiki1986
- Fix ignore folders documentation (#488) @dadoonet
- Add more tests about moving files (#487) @dadoonet
- Includes and Excludes should not be case sensitive (#486) @dadoonet
- Split project into modules (#435) @dadoonet
- add setPipeline call when using REST (#475) @shadiakiki1986
- Add more info in case of bulk failures (#457) @dadoonet
- Don't rely on disk space for tests (#456) @dadoonet
- Update to Lucene 7.0.1 (#452) @dadoonet
- Update to maven-versions-plugin 2.5 (#453) @dadoonet
- Update to Log4J 2.9.1 (#451) @dadoonet
- Update to SQLite 3.20.1 (#450) @dadoonet
- Update to Jackson 2.9.2 (#449) @dadoonet
- Update to elasticsearch 6.0.0-beta2 (#434) @dadoonet
- Update dependencies (Jackson, Log4J, Jansi, SQLite, JSch, JCommander, Randomized Testing) (#430) @dadoonet
- use StringBuilder in a loop (#361) @ctamisier
- Add continue_on_error option to continue on error while crawling (#330) @kneubi
- Fix links typo (#326) @soruly
- Patch Log4J 2.8 to display messages on Windows (#323) @dadoonet
- Missing documentation for some local FS settings (#287) @shadiakiki1986
- add link to repo with dockerfile usage of fscrawler (#278) @shadiakiki1986
- documentation for loop moved to under --loop instead of under --rest (#277) @shadiakiki1986
- Use path analyzer for directory fields (#272) @dadoonet
- Prevent customised mappings from being overwritten (#231) @edjeavons
- Elasticsearch Client must use search size if set (#240) @babadofar
- Add OCR integration documentation (#224) @Jdecaudin
- Default REST elasticsearch port should be 9200 and not 9300 (#142) @FredDut
Thanks to
@FredDut, @Jdecaudin, @Quix0r, @babadofar, @barts2108, @coder-sa, @ctamisier, @dadoonet, @edjeavons, @fgaujous, @gpcmol, @it20one, @kneubi, @shadiakiki1986, @soruly, @vakopian, @xcorail, Ajitpal Singh and Julien Decaudin