All notable changes to this project will be documented in this file.
- Updated `sentence-parse` to v1.3.1 (won't crash on null inputs)
- Updated sentence splitter to use `sentence-parse`
- Updated sentence splitter to use `@stdlib/nlp-sentencize`
- Updated embedding cache to use `lru-cache`
- Added `sentenceit` function (split by sentence and return embeddings)
- Update `string-segmenter` patch version
- Update `string-segmenter` patch version
- Only print version if logging is enabled (default is false)
  - Previously this added console noise to upstream applications
- Updated Web UI to v1.3.1
- Updated README with Web UI usage examples
- Updated default values in both the library and Web UI
  - Web UI defaults can be set in `webui/public/default-form-values.js`
- Misc cleanup and optimizations
- Updated `transformers.js` from v2 to v3
- Migrated quantization option from `onnxEmbeddingModelQuantized` (boolean) to `dtype` ('fp32', 'fp16', 'q8', 'q4')
- Updated Web UI to use new `dtype` option
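
For callers still passing the old boolean, the migration could be handled with a small shim like the one below. This is a sketch only: `migrateQuantizationOption` is a hypothetical helper, and mapping `true` to `'q8'` and `false` to `'fp32'` is an assumption, not documented behavior of the library.

```javascript
// Hypothetical helper: converts a legacy options object to the dtype-based form.
// Assumption: `true` maps to 'q8' and `false` maps to 'fp32'.
function migrateQuantizationOption(options) {
  const { onnxEmbeddingModelQuantized, ...rest } = options;
  // The legacy flag is always dropped; an explicit dtype takes precedence over it.
  if (onnxEmbeddingModelQuantized === undefined || rest.dtype !== undefined) {
    return rest;
  }
  return { ...rest, dtype: onnxEmbeddingModelQuantized ? 'q8' : 'fp32' };
}

// Usage: a legacy options object gains `dtype` and loses the boolean flag.
const migrated = migrateQuantizationOption({
  onnxEmbeddingModelQuantized: true,
  maxTokenSize: 500,
});
console.log(migrated);
```
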
- Updated Web UI styles for smaller screens
- Fixed issue with Web UI embedding cache not being cleared when a new model is initialized
- Adjusted Web UI to display truncated JSON results on screen while still allowing download of the full results
- Web UI CSS adjustments for smaller screens
- Added Highlight.js to Web UI for syntax highlighting of JSON results and code samples
- Added JSON results toggle button to turn line wrapping on/off
- New Web UI tool for experimenting with semantic chunking settings
  - Interactive form interface for all chunking parameters
  - Real-time text processing and results display
  - Visual feedback for similarity thresholds
  - Model selection and configuration
  - Results download in JSON format
  - Code generation for settings
  - Example texts for testing
  - Dark mode interface
- Added `excludeChunkPrefixInResults` option to `chunkit` and `cramit` functions
  - Allows removal of chunk prefix from final results while maintaining prefix for embedding calculations
- Improved error handling and feedback in chunking functions
- Enhanced documentation with Web UI usage examples
- Added more embedding models to supported list
- Fixed issue with chunk prefix handling in embedding calculations
- Improved token length calculation reliability
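
The interaction between the chunk prefix and `excludeChunkPrefixInResults` can be pictured roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `buildChunk` and `embeddingInput` are hypothetical names, and plain string concatenation of the prefix is assumed.

```javascript
// Illustration only: `buildChunk` is a hypothetical helper, not the library's code.
// Assumption: the prefix is joined to the chunk text by simple concatenation.
function buildChunk(text, { chunkPrefix = '', excludeChunkPrefixInResults = false } = {}) {
  const prefixed = chunkPrefix ? `${chunkPrefix}${text}` : text;
  return {
    // The prefix is always kept for the text that gets embedded.
    embeddingInput: prefixed,
    // The returned text optionally omits the prefix.
    text: excludeChunkPrefixInResults ? text : prefixed,
  };
}

// Usage: the embedding still sees the prefix, the result text does not.
console.log(
  buildChunk('Cats sleep a lot.', {
    chunkPrefix: 'search_document: ',
    excludeChunkPrefixInResults: true,
  })
);
```
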
- Updated README `cramit` example script to use updated document object input format.
- Fixed `cramit` function to properly pack sentences up to `maxTokenSize`
- Improved chunk creation logic to better handle both `chunkit` and `cramit` modes
- Enhanced token size calculation efficiency
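
Packing sentences up to a token limit can be sketched as a greedy loop like the one below. This is not the `cramit` source: `packSentences` and `countTokens` are hypothetical names, and whitespace-separated words stand in for real model tokens.

```javascript
// Hypothetical token counter: whitespace-separated words stand in for model tokens.
function countTokens(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

// Greedily pack sentences into chunks without exceeding maxTokenSize.
// A single sentence longer than the limit is kept as its own chunk.
function packSentences(sentences, maxTokenSize) {
  const chunks = [];
  let current = [];
  let currentTokens = 0;
  for (const sentence of sentences) {
    const tokens = countTokens(sentence);
    // Close the current chunk when adding this sentence would overflow it.
    if (current.length > 0 && currentTokens + tokens > maxTokenSize) {
      chunks.push(current.join(' '));
      current = [];
      currentTokens = 0;
    }
    current.push(sentence);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}

// Usage: with a limit of 5 "tokens", four sentences pack into two chunks.
console.log(packSentences(['a b c', 'd e', 'f g h i', 'j'], 5));
```
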
- Improved semantic chunking accuracy with stricter similarity thresholds
- Enhanced logging in similarity calculations for better debugging
- Fixed chunk creation to better respect semantic boundaries
- Default similarity threshold increased to 0.5
- Default dynamic threshold bounds adjusted (0.4 - 0.8)
- Improved chunk rebalancing logic with similarity checks
- Updated logging for similarity scores between sentences
- Updated example scripts in README.
⚠️ BREAKING: Input format now accepts an array of document objects
- Output array of chunks extended with the following new properties:
  - `document_id`: Timestamp in milliseconds when processing started
  - `document_name`: Original document name or ""
  - `number_of_chunks`: Total number of chunks for the document
  - `chunk_number`: Current chunk number (1-based)
  - `model_name`: Name of the embedding model used
  - `is_model_quantized`: Whether the model is quantized
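
The extended output shape can be pictured with a small sketch. `annotateChunks` and its parameters are hypothetical names used only for illustration; the property names come from the list above, and the `text` field is assumed to carry the chunk content.

```javascript
// Illustration only: attaches the listed metadata fields to a list of chunk texts.
// `annotateChunks` and its parameters are hypothetical, not part of the library.
function annotateChunks(
  chunkTexts,
  { documentName = '', modelName, isQuantized, startedAt = Date.now() }
) {
  return chunkTexts.map((text, index) => ({
    document_id: startedAt, // timestamp in milliseconds when processing started
    document_name: documentName, // original name, or "" when none was given
    number_of_chunks: chunkTexts.length, // total chunks for this document
    chunk_number: index + 1, // 1-based position
    model_name: modelName,
    is_model_quantized: isQuantized,
    text, // assumed field holding the chunk content
  }));
}

// Usage: 'example-model' is a placeholder model name.
console.log(
  annotateChunks(['First chunk.', 'Second chunk.'], {
    documentName: 'notes.txt',
    modelName: 'example-model',
    isQuantized: true,
  })
);
```
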
- Added `returnEmbedding` option to `chunkit` and `cramit` functions to include embeddings in the output.
- Added `returnTokenLength` option to `chunkit` and `cramit` functions to include token length in the output.
- Added `chunkPrefix` option to prefix each chunk with a task instruction (e.g., "search_document: ", "search_query: ").
- Updated README to document new options and add RAG tips for using `chunkPrefix` with embedding models that support task prefixes.
⚠️ BREAKING: Returned array of chunks is now an array of objects with `text`, `embedding`, and `tokenLength` properties. Previous versions returned an array of strings.
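
Code written against the earlier string-array output could be adapted with a small normalizer along these lines. `chunkTexts` is a hypothetical helper name; only the `text` property from the description above is relied on.

```javascript
// Hypothetical helper: returns plain text for both the old (string) and
// new (object with a `text` property) chunk formats.
function chunkTexts(chunks) {
  return chunks.map((chunk) => (typeof chunk === 'string' ? chunk : chunk.text));
}

// Usage: both formats yield the same list of strings.
console.log(chunkTexts(['old style chunk']));
console.log(chunkTexts([{ text: 'new style chunk', embedding: [], tokenLength: 3 }]));
```
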
- Fixed sentence splitter logic in `cramit` function.
- Replaced sentence splitter with a new algorithm that is more accurate and faster.
- Broke up the library into modules for easier maintenance and updates going forward.
- Added download script to pre-download models for users who want to pre-package them with their application.
- Added model path/cache directory options.
- Updated package dependencies.
- Updated example scripts.
- Updated README.
- Added dynamic combining of final chunks based on similarity threshold.
- Improved initial chunking algorithm to reduce the number of chunks.
- Initial release with basic chunking functionality.