Releases: lancedb/lance
v0.4.5 Preview private API for merging columns
Welcome @Mause as our newest contributor! Also, a big thank you for your work on the duckdb extension framework.
In this release we added a preview of distributed column addition. This makes it possible to distribute Lance Fragments across nodes, add a new column to each Fragment, and then write a new Lance dataset version manifest with the updated schema and files.
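The flow described above can be sketched in plain Python. This is a conceptual model only, not the Lance API: the dict-based "fragment" and "manifest" shapes and the names `add_column` and `merge_column` are hypothetical stand-ins, and a thread pool stands in for separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: a "fragment" is a dict of column name -> list of values,
# and a "manifest" is a dict holding a version number, a schema, and fragments.

def add_column(fragment, name, fn):
    """Return a copy of one fragment with a new column computed row by row."""
    columns = list(fragment)
    n_rows = len(fragment[columns[0]])
    rows = ({c: fragment[c][i] for c in columns} for i in range(n_rows))
    new_fragment = dict(fragment)
    new_fragment[name] = [fn(row) for row in rows]
    return new_fragment

def merge_column(manifest, name, fn, workers=2):
    """Add the column to every fragment in parallel, then build a new version."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fragments = list(
            pool.map(lambda f: add_column(f, name, fn), manifest["fragments"])
        )
    # The previous manifest is left untouched; the new one carries the new schema.
    return {
        "version": manifest["version"] + 1,
        "schema": manifest["schema"] + [name],
        "fragments": fragments,
    }

v1 = {
    "version": 1,
    "schema": ["id", "price"],
    "fragments": [
        {"id": [1, 2], "price": [10.0, 20.0]},
        {"id": [3], "price": [30.0]},
    ],
}
v2 = merge_column(v1, "price_doubled", lambda row: row["price"] * 2)
```

Each fragment is processed independently, so the per-fragment step can run on any worker; only the final manifest write needs to see all results.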
What's Changed
- add support for aws profile by @Renkai in #807
- Upgrade Arrow to 37 by @changhiskhan in #810
- Schema intersection by @eddyxu in #814
- Add a check to make sure field names don't contain periods by @changhiskhan in #816
- fix(docs): correct link to docs.rs by @Mause in #819
- update arrow version in duckdb extension by @changhiskhan in #817
- Do not use lifetime on FileWriter by @eddyxu in #820
- Setting field ID after merging the fields. by @eddyxu in #821
- [Rust] Project schema by schema by @eddyxu in #822
- Merge batches from multiple datafiles in the same Fragment by @eddyxu in #815
- Update README.md by @jaichopra in #809
- [Python] Provide a private / distributed add column api in Python by @eddyxu in #823
New Contributors
Full Changelog: v0.4.4...v0.4.5
v0.4.4 Various bug fixes
#805 fixed an integer overflow bug in the plain decoder that resulted in high latency for Take (and consequently high latency for the vector search). We'll be adding continuous performance benchmarks soon to prevent issues like this from being released in the future.
We also fixed a gap in cosine similarity where the vector length does not line up perfectly with the SIMD stride on the platform.
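To illustrate the class of problem, here is a minimal pure-Python sketch of a strided dot product that handles the leftover elements when the vector length is not a multiple of the stride. It is not Lance's Rust implementation; the `STRIDE` value of 8 and the function names are assumptions chosen for illustration.

```python
import math

STRIDE = 8  # assumed lane count, e.g. 8 float32 values in a 256-bit register

def dot_strided(x, y, stride=STRIDE):
    """Dot product in stride-sized chunks plus a scalar loop for the remainder."""
    n = len(x)
    aligned = n - n % stride
    lanes = [0.0] * stride
    for start in range(0, aligned, stride):
        for lane in range(stride):
            lanes[lane] += x[start + lane] * y[start + lane]
    total = sum(lanes)
    # Tail elements that do not fill a whole stride must still be counted.
    for i in range(aligned, n):
        total += x[i] * y[i]
    return total

def cosine_similarity(x, y):
    """Cosine similarity built on the strided dot product."""
    return dot_strided(x, y) / math.sqrt(dot_strided(x, x) * dot_strided(y, y))

# A 10-element vector: 8 elements fill one stride, 2 fall into the tail.
tail_only = [0.0] * 9 + [1.0]
head_only = [1.0] + [0.0] * 9
```

If the tail loop were omitted, `tail_only` would appear to have zero norm and its similarity would be undefined.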
DiskANN progress is continuing. First milestone will be an in-memory version to support smaller datasets. A compressed, disk-based version will follow soon after that.
What's Changed
- Fix L2 simd benchmark by @eddyxu in #793
- bugfix for dataset overwrite method by @gsilvestrin in #794
- [Rust] Minor SIMD benchmark fix set minimal CPU target for AVX2 by @eddyxu in #795
- Persist simple diskann index by @eddyxu in #787
- Fix offset overflow in plain decoder by @eddyxu in #805
- Fix cosine similarity when missing simd alignment by @changhiskhan in #808
Full Changelog: v0.4.3...v0.4.4
v0.4.3 Bug fixes and code cleanup
What's Changed
- [Rust] L2 distance on not aligned data by @eddyxu in #779
- [Rust] Move L2 to linalg module by @eddyxu in #781
- [Rust] Build DiskANN index by @eddyxu in #763
- Refactor cosine distance into linalg module by @eddyxu in #786
- google cloud storage fixes by @gsilvestrin in #782
- Fix unaligned normalization bug on arm64 by @eddyxu in #789
- Speed up vector index tests by reducing dataset size by @changhiskhan in #790
Full Changelog: v0.4.2...v0.4.3
v0.4.2 Polars, GCS, and distributed lances
A warm welcome to @hzhang86 as Lance's newest contributor. Thanks for adding TPCH benchmarks for Lance to establish a baseline. This really helps us focus our performance optimization roadmap.
This release is packed with valuable features:
- We added direct polars scans that do not need to pull everything into memory.
- We expose FileFragments to allow distributed processing engines like Spark to access parts of a Lance dataset easily.
- Last but not least, we've added support for reading Lance data directly from GCS buckets.
What's Changed
- [Rust] FileReader read range API by @eddyxu in #752
- Support direct polars scan by @changhiskhan in #755
- [Rust] Persist graph using lance file format. by @eddyxu in #756
- Refactor PQ and OPQ training function to make it usable widely by @eddyxu in #758
- Matrix::centroids method by @eddyxu in #759
- [Python] Set minimal version of Polars for python tests by @eddyxu in #765
- [Rust] Refactor RecordBatchStream trait by @eddyxu in #766
- [Rust] Expose DataFragment as public dataset api. by @eddyxu in #769
- Revert "[Python] Set minimal version of Polars for python tests (#765)" by @gsilvestrin in #770
- add python script to compare lance performance vs parquet TPCH by @hzhang86 in #749
- Expose index metadata by @changhiskhan in #768
- Google Cloud Storage support. by @gsilvestrin in #773
- [Python] Expose DataFragment via dataset by @eddyxu in #774
- Get S3 credentials from_env by @changhiskhan in #775
- Fix duckdb build by @eddyxu in #776
- [Rust] An arrow kernel to compute hash value of the array. by @eddyxu in #777
New Contributors
Full Changelog: v0.4.1...v0.4.2
v0.4.1 Support Append in Vector Search
The vector search in Lance now supports live updates. Previously, when you added new vectors to the dataset, you were required to rebuild the index. Now, the index is "inherited" and the vector search results are the combination of ANN search on the indexed data and KNN on the newly appended data. This adds a small latency increase, and the recall should be the same or better.
This provides a smooth performance curve until you have inserted enough new data that re-indexing is warranted.
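The combination of ANN results with exact KNN over appended rows can be sketched as follows. This is an illustrative model rather than Lance's implementation: `ann_search` is a hypothetical callable standing in for the index lookup, and rows are plain `(row_id, vector)` pairs.

```python
import heapq

def l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_knn(query, rows, k):
    """Exact k nearest neighbours over (row_id, vector) pairs by brute force."""
    return heapq.nsmallest(k, ((l2(query, vec), row_id) for row_id, vec in rows))

def search(query, k, ann_search, appended_rows):
    """Merge ANN hits from the index with exact hits from unindexed rows."""
    indexed_hits = ann_search(query, k)            # list of (distance, row_id)
    fresh_hits = flat_knn(query, appended_rows, k)
    return heapq.nsmallest(k, indexed_hits + fresh_hits)

# Rows 0 and 1 were indexed; row 2 was appended after the index was built.
indexed = [(0, [0.0, 0.0]), (1, [5.0, 5.0])]
appended = [(2, [1.0, 1.0])]

# Stand-in for a real ANN index: brute force over the indexed rows.
ann = lambda q, k: flat_knn(q, indexed, k)

result = search([1.0, 1.0], 2, ann, appended)
```

Because the appended rows are scanned exactly, the cost of the second step grows with the amount of unindexed data, which is why re-indexing eventually becomes worthwhile.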
What's Changed
- Adding secret to publish task by @gsilvestrin in #742
- [Rust] make distance function to take slice instead of Float32Array by @eddyxu in #748
- Vector search should support appending new rows by @changhiskhan in #593
- windows lapack support by @gsilvestrin in #743
- Fix LanceDataset.to_batches by @changhiskhan in #751
Full Changelog: v0.4.0...v0.4.1
v0.4.0 Windows support
A warm welcome to @gsajko! Thanks for making our tutorial notebook easier to use and understand!
Note: OPQ is disabled on Windows for the vector index. This will be addressed once LAPACK support is added.
What's Changed
- small fixes by @gsajko in #725
- Windows support by @gsilvestrin in #724
New Contributors
Full Changelog: v0.3.19...v0.4.0
v0.3.19 Bug fix for filter predicates on large-utf8 type
This release also fixes publishing to crates.io.
What's Changed
- Make contract clear for KNN nodes by @eddyxu in #729
- Refactor Scan I/O plan by @eddyxu in #731
- [Rust] Use forked sqlparser to unblock rust crate release by @eddyxu in #732
- [Rust] Fix filter on large UTF8 columns by @eddyxu in #733
Full Changelog: v0.3.18...v0.3.19
v0.3.18 Bug fix release for binary offsets
Fix for incorrect offsets in string/variable-length list columns, as reported in a comment on #720.
Thanks @lucazanna for the feedback!
What's Changed
- Train OPQ and write rotation matrix to index file by @eddyxu in #713
- removing warnings by @gsilvestrin in #721
- [Bug] Fix IVF merge sort when refine factor is presented. by @eddyxu in #722
- Add input / output schema contract to Global Take by @eddyxu in #728
- Fix offsets for Binary/Lists/LargeLists by @gsilvestrin in #727
Full Changelog: v0.3.17...v0.3.18
v0.3.17 Support for nested dict columns
A warm welcome to @haoxins, a new contributor who has helped improve Lance documentation.
This release adds support for list-of-dict columns (thanks @lucazanna for reporting the bug in #715).
Also included in this release are various vector index improvements for scalability and more progress towards OPQ implementation.
What's Changed
- docs: fix the links by @haoxins in #701
- repair macos build for duckdb extension by @changhiskhan in #705
- filter evaluation with flat search by @changhiskhan in #704
- fix flaky test by @changhiskhan in #706
- [Bug] Fix transpose in MatrixView.data() by @eddyxu in #711
- Refactored variable length encoders by @gsilvestrin in #710
- add notebook for q&a bot by @changhiskhan in #707
- Allow iteratively train PQ by @eddyxu in #712
- Use relative eq and fix a compiling warning by @eddyxu in #714
- docs: fix the mod path by @haoxins in #718
- Composable vector search pipeline by @eddyxu in #716
- Fix CI failure by increasing epsilon for test_train_pq_iteratively by @eddyxu in #719
- Implement support for list of Dictionaries by @gsilvestrin in #664
New Contributors
Full Changelog: v0.3.16...v0.3.17
v0.3.16 Filter pushdown improvements
Welcome @wangfenjin to the Lance contributors. Thanks for submitting a bug fix for the Lance DuckDB extension 🔥
This release contains 2 workarounds for Arrow limitations:
- Lance datasets now accept <field> LIKE '%' and <field> IN (<values>) filters passed in as strings. Generic SQL syntax supported by DataFusion is now accepted. This is a break from standard pyarrow Dataset behavior, which only accepts Arrow compute Expressions. Those are not available in Rust and do not support introspection in Python, so developers cannot build a custom adapter on top of them.
- When concatenating Arrow dictionary arrays, the dictionary values are duplicated. There are currently no concrete plans to change this behavior in Arrow. Instead, we fix that at write time in Lance.
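As an illustration of the second workaround, the sketch below shows one way to deduplicate dictionary values while concatenating dictionary-encoded chunks. It is a conceptual model in plain Python, not the code Lance uses at write time: each chunk is represented as a hypothetical `(values, indices)` pair, and `unify_dictionaries` is a name chosen for this example.

```python
def unify_dictionaries(arrays):
    """Concatenate (values, indices) dictionary arrays into one with unique values."""
    merged_values = []
    position = {}        # value -> index in merged_values
    merged_indices = []
    for values, indices in arrays:
        # Map each chunk-local index to its slot in the merged dictionary.
        remap = []
        for value in values:
            if value not in position:
                position[value] = len(merged_values)
                merged_values.append(value)
            remap.append(position[value])
        merged_indices.extend(remap[i] for i in indices)
    return merged_values, merged_indices

# Two chunks whose dictionaries overlap on "dog".
chunk_a = (["cat", "dog"], [0, 1, 0])
chunk_b = (["dog", "bird"], [0, 1, 1])
values, indices = unify_dictionaries([chunk_a, chunk_b])
```

A naive concatenation would keep "dog" twice in the dictionary; remapping the indices keeps a single copy of each value.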
What's Changed
- Changed encoders to handle multiple Arrays by @gsilvestrin in #681
- Train kmeans iteratively by @eddyxu in #688
- Changed writers to handle multiple Arrays by @gsilvestrin in #691
- Streaming PQ by @eddyxu in #689
- [Bug] PQ training generates empty centroids by @eddyxu in #693
- Allow append mode even if dataset doesn't already exist by @ananis25 in #690
- Support "LIKE" and "IN" in filters by @eddyxu in #696
- fix typo by @wangfenjin in #697
- Improve indexing performance by @eddyxu in #699
- Compute PQ distortion. by @eddyxu in #695
- Bugfix for BinaryEncoder positions by @gsilvestrin in #698
New Contributors
- @wangfenjin made their first contribution in #697
Full Changelog: v0.3.15...v0.3.16