feat: implement StringColumn using StringViewArray #16610

Merged on Nov 8, 2024 (89 commits)
Commits
0c1473b
feat: implement StringColumn using StringViewArray
andylokandy Oct 15, 2024
2e2e5f6
fix
andylokandy Oct 15, 2024
af524c0
convert binaryview between arrow1 and arrow2
andylokandy Oct 22, 2024
01ffce9
Merge branch 'main' of https://github.com/datafuselabs/databend into …
andylokandy Oct 22, 2024
8028a37
fix
andylokandy Oct 22, 2024
feab44e
fix
andylokandy Oct 22, 2024
803ace4
fix
andylokandy Oct 22, 2024
60eb67c
fix
andylokandy Oct 23, 2024
e99be9e
Merge branch 'main' into dev1
andylokandy Oct 23, 2024
a0d159a
fix
andylokandy Oct 25, 2024
e15e1e5
Merge branch 'main' of https://github.com/datafuselabs/databend into …
andylokandy Oct 25, 2024
56cf9d8
Merge branch 'main' of https://github.com/datafuselabs/databend into …
andylokandy Oct 28, 2024
ac64bdc
fix some issue
andylokandy Oct 28, 2024
e6c5933
fix view slice bug
sundy-li Oct 29, 2024
0e85757
fix view slice bug
sundy-li Oct 29, 2024
81fba8a
Merge branch 'main' of https://github.com/datafuselabs/databend into …
andylokandy Oct 29, 2024
ba35eb8
fix
andylokandy Oct 29, 2024
9598aa0
support native read write
sundy-li Oct 29, 2024
8ccd6d5
fix
andylokandy Oct 29, 2024
6d63f7e
Merge branch 'dev1' of https://github.com/andylokandy/databend into dev1
andylokandy Oct 29, 2024
f533de5
fix
andylokandy Oct 29, 2024
eab81d4
fix tests
sundy-li Oct 30, 2024
88db184
add with_data_type
sundy-li Oct 30, 2024
8416f80
add with_data_type
sundy-li Oct 30, 2024
89c03d7
fix gen_random_uuid commit row
sundy-li Oct 30, 2024
f478c79
move record batch to block
sundy-li Oct 30, 2024
bb605b9
Merge branch 'main' into dev1
sundy-li Oct 30, 2024
d712fd4
remove unused dep
andylokandy Oct 30, 2024
b813d71
fix lint
andylokandy Oct 30, 2024
1d8b4da
fix commit row
sundy-li Oct 30, 2024
60ab196
fix commit row
sundy-li Oct 30, 2024
af79030
fix size
sundy-li Oct 30, 2024
9eda2e3
fix size
sundy-li Oct 30, 2024
b9c1773
Merge branch 'main' into dev1
sundy-li Oct 30, 2024
e116066
add NewBinaryColumnBuilder and NewStringColumnBulder
andylokandy Oct 30, 2024
714db05
fix incorrect serialize_size
sundy-li Nov 1, 2024
7a781da
fix incorrect serialize_size
sundy-li Nov 1, 2024
9276cea
lint
sundy-li Nov 1, 2024
39ec7d0
lint
sundy-li Nov 1, 2024
37f57bc
fix tests
sundy-li Nov 1, 2024
c1cdb6d
use binary state
sundy-li Nov 1, 2024
0c5ea41
Merge branch 'main' into dev1
sundy-li Nov 1, 2024
5e94781
use binary state
sundy-li Nov 1, 2024
6d8ecd3
update tests
sundy-li Nov 1, 2024
3a6396f
update tests
sundy-li Nov 1, 2024
aa194d6
update tests
sundy-li Nov 1, 2024
6ba6c7c
fix native view encoding
sundy-li Nov 2, 2024
43977ea
fix
andylokandy Nov 2, 2024
887df5e
[ci skip] updata kernel concat for view types
sundy-li Nov 2, 2024
b14f232
[ci skip]Merge branch 'main' into dev1
sundy-li Nov 2, 2024
0eabef5
[ci skip]Merge branch 'main' into dev1
sundy-li Nov 2, 2024
3456b4b
[ci skip] improve kernels for view types
sundy-li Nov 3, 2024
b9e22d8
[ci skip] only string type use string view type
sundy-li Nov 4, 2024
d8e5345
[ci skip] only string type use string view type
sundy-li Nov 4, 2024
89788d9
fix tests
sundy-li Nov 4, 2024
c2e1103
[ci skip] fix tests
sundy-li Nov 4, 2024
dcdf8b4
[ci skip] fix
sundy-li Nov 4, 2024
129e950
fix
andylokandy Nov 4, 2024
1f6c9ae
use NewStringColumnBuilder
andylokandy Nov 4, 2024
d381611
rename NewString -> String
sundy-li Nov 5, 2024
8757696
Merge branch 'main' into dev1
sundy-li Nov 5, 2024
4cb0277
fmt
sundy-li Nov 5, 2024
5f0dfe1
[ci skip] update tests
sundy-li Nov 5, 2024
37e549c
optimize take
sundy-li Nov 5, 2024
0411ea3
Merge branch 'main' into dev1
sundy-li Nov 5, 2024
24da802
add bench
sundy-li Nov 5, 2024
ad2366f
Merge branch 'dev1' of github.com:andylokandy/databend into dev1
sundy-li Nov 5, 2024
bcf29d2
fix tests
sundy-li Nov 5, 2024
4ec78de
[ci skip]Merge branch 'main' into dev1
sundy-li Nov 5, 2024
66b207d
update
sundy-li Nov 6, 2024
eda8e0f
improve compare
andylokandy Nov 6, 2024
00087ad
implement compare using string view prefix
andylokandy Nov 6, 2024
7de18c9
fix
andylokandy Nov 6, 2024
3303105
fix
sundy-li Nov 6, 2024
c2e3cb1
Merge branch 'main' into dev1
sundy-li Nov 6, 2024
e59e280
fix
sundy-li Nov 6, 2024
aad572a
fix-length
sundy-li Nov 6, 2024
6d040f8
disable spill
sundy-li Nov 6, 2024
c30b3c1
[ci skip] add put_and_commit
sundy-li Nov 6, 2024
5184580
[ci skip] update
sundy-li Nov 6, 2024
95e0b09
update test
sundy-li Nov 7, 2024
e8dc899
lint
sundy-li Nov 7, 2024
013a1c6
[ci skip] add maybe gc
sundy-li Nov 7, 2024
5ca5e83
fix endiness
andylokandy Nov 7, 2024
8b11de2
fix endiness
andylokandy Nov 7, 2024
d584953
fix
andylokandy Nov 7, 2024
3a73b39
update string compare
sundy-li Nov 7, 2024
7ae547c
Merge branch 'main' into dev1
sundy-li Nov 7, 2024
a9def92
update
sundy-li Nov 7, 2024
33 changes: 33 additions & 0 deletions src/common/arrow/src/arrow/array/binview/from.rs
@@ -12,13 +12,46 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.

+use arrow_data::ArrayData;
+use arrow_data::ArrayDataBuilder;
+use arrow_schema::DataType;
+
+use crate::arrow::array::Arrow2Arrow;
 use crate::arrow::array::BinaryViewArrayGeneric;
 use crate::arrow::array::MutableBinaryViewArray;
 use crate::arrow::array::ViewType;
+use crate::arrow::bitmap::Bitmap;

 impl<T: ViewType + ?Sized, P: AsRef<T>> FromIterator<Option<P>> for BinaryViewArrayGeneric<T> {
     #[inline]
     fn from_iter<I: IntoIterator<Item = Option<P>>>(iter: I) -> Self {
         MutableBinaryViewArray::<T>::from_iter(iter).into()
     }
 }
+
+impl<T: ViewType + ?Sized> Arrow2Arrow for BinaryViewArrayGeneric<T> {
+    fn to_data(&self) -> ArrayData {
+        let builder = ArrayDataBuilder::new(DataType::BinaryView)
+            .len(self.len())
+            .add_buffer(self.views.clone().into())
+            .add_buffers(self.buffers.iter().map(|x| x.clone().into()).collect())
+            .nulls(self.validity.clone().map(Into::into));
+        unsafe { builder.build_unchecked() }
+    }
+
+    fn from_data(data: &ArrayData) -> Self {
+        let views = crate::arrow::buffer::Buffer::from(data.buffers()[0].clone());
+        let buffers = data.buffers()[1..]
+            .iter()
+            .map(|x| crate::arrow::buffer::Buffer::from(x.clone()))
+            .collect();
+        let validity = data.nulls().map(|x| Bitmap::from_null_buffer(x.clone()));
+        Self::try_new(
+            crate::arrow::datatypes::DataType::BinaryView,
+            views,
+            buffers,
+            validity,
+        )
+        .unwrap()
+    }
+}
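For orientation: the first buffer that to_data passes along holds one 16-byte view per row, and the remaining buffers hold the bytes of values too long to inline. The sketch below follows the view layout described in the Arrow columnar format (a 4-byte little-endian length, then either up to 12 inlined bytes, or a 4-byte prefix followed by a buffer index and an offset). The helpers make_view, read_view and read_u32 are hypothetical names used only for illustration; they are not part of this PR or of databend_common_arrow.

```rust
/// Values of at most this many bytes are stored inline in the view itself.
const MAX_INLINE: usize = 12;

/// Read a little-endian u32 starting at byte `at`.
fn read_u32(bytes: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Encode one value as a 16-byte view (hypothetical helper).
/// `buffer_idx` and `offset` say where the caller stored a long value;
/// they are ignored for values that fit inline.
fn make_view(value: &[u8], buffer_idx: u32, offset: u32) -> [u8; 16] {
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= MAX_INLINE {
        // Short value: bytes 4..16 carry the data, zero padded.
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long value: 4-byte prefix, then buffer index, then offset.
        view[4..8].copy_from_slice(&value[..4]);
        view[8..12].copy_from_slice(&buffer_idx.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Resolve a view back to its bytes (hypothetical helper).
fn read_view<'a>(view: &'a [u8; 16], buffers: &'a [Vec<u8>]) -> &'a [u8] {
    let len = read_u32(view, 0) as usize;
    if len <= MAX_INLINE {
        &view[4..4 + len]
    } else {
        let buffer_idx = read_u32(view, 8) as usize;
        let offset = read_u32(view, 12) as usize;
        &buffers[buffer_idx][offset..offset + len]
    }
}
```

Because every view has a fixed size, slicing or reordering rows only touches the view buffer; the data buffers can be shared unchanged, which is why from_data above can take data.buffers()[1..] as they are.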
4 changes: 2 additions & 2 deletions src/common/arrow/src/arrow/array/binview/mutable.rs
@@ -41,9 +41,9 @@ pub struct MutableBinaryViewArray<T: ViewType + ?Sized> {
     pub(super) validity: Option<MutableBitmap>,
     pub(super) phantom: std::marker::PhantomData<T>,
     /// Total bytes length if we would concatenate them all.
-    pub(super) total_bytes_len: usize,
+    pub total_bytes_len: usize,
     /// Total bytes in the buffer (excluding remaining capacity)
-    pub(super) total_buffer_len: usize,
+    pub total_buffer_len: usize,
 }

 impl<T: ViewType + ?Sized> Clone for MutableBinaryViewArray<T> {
5 changes: 3 additions & 2 deletions src/common/arrow/src/arrow/array/mod.rs
@@ -496,7 +496,8 @@ pub fn to_data(array: &dyn Array) -> arrow_data::ArrayData {
             })
         }
         Map => to_data_dyn!(array, MapArray),
-        BinaryView | Utf8View => unimplemented!(),
+        BinaryView => to_data_dyn!(array, BinaryViewArray),
+        Utf8View => unimplemented!(),
     }
 }

@@ -527,7 +528,7 @@ pub fn from_data(data: &arrow_data::ArrayData) -> Box<dyn Array> {
             })
         }
         Map => Box::new(MapArray::from_data(data)),
-        BinaryView | Utf8View => unimplemented!(),
+        BinaryView | Utf8View => Box::new(BinaryViewArray::from_data(data)),
     }
 }

3 changes: 2 additions & 1 deletion src/common/arrow/src/arrow/datatypes/mod.rs
@@ -237,7 +237,8 @@ impl From<DataType> for arrow_schema::DataType {
             DataType::Decimal(precision, scale) => Self::Decimal128(precision as _, scale as _),
             DataType::Decimal256(precision, scale) => Self::Decimal256(precision as _, scale as _),
             DataType::Extension(_, d, _) => (*d).into(),
-            DataType::BinaryView | DataType::Utf8View => {
+            DataType::BinaryView => Self::BinaryView,
+            DataType::Utf8View => {
                 panic!("view datatypes are not supported by arrow-rs")
             }
         }
290 changes: 123 additions & 167 deletions src/query/expression/src/converts/arrow2/from.rs

Large diffs are not rendered by default.

56 changes: 11 additions & 45 deletions src/query/expression/src/converts/arrow2/to.rs
@@ -92,8 +92,8 @@ fn table_type_to_arrow_type(ty: &TableDataType) -> ArrowDataType {
             None,
         ),
         TableDataType::Boolean => ArrowDataType::Boolean,
-        TableDataType::Binary => ArrowDataType::LargeBinary,
-        TableDataType::String => ArrowDataType::LargeUtf8,
+        TableDataType::Binary => ArrowDataType::BinaryView,
+        TableDataType::String => ArrowDataType::Utf8View,
         TableDataType::Number(ty) => with_number_type!(|TYPE| match ty {
             NumberDataType::TYPE => ArrowDataType::TYPE,
         }),
@@ -135,7 +135,7 @@ fn table_type_to_arrow_type(ty: &TableDataType) -> ArrowDataType {
         }
         TableDataType::Bitmap => ArrowDataType::Extension(
             ARROW_EXT_TYPE_BITMAP.to_string(),
-            Box::new(ArrowDataType::LargeBinary),
+            Box::new(ArrowDataType::BinaryView),
             None,
         ),
         TableDataType::Tuple {
@@ -157,17 +157,17 @@ fn table_type_to_arrow_type(ty: &TableDataType) -> ArrowDataType {
         }
         TableDataType::Variant => ArrowDataType::Extension(
             ARROW_EXT_TYPE_VARIANT.to_string(),
-            Box::new(ArrowDataType::LargeBinary),
+            Box::new(ArrowDataType::BinaryView),
             None,
         ),
         TableDataType::Geometry => ArrowDataType::Extension(
             ARROW_EXT_TYPE_GEOMETRY.to_string(),
-            Box::new(ArrowDataType::LargeBinary),
+            Box::new(ArrowDataType::BinaryView),
             None,
         ),
         TableDataType::Geography => ArrowDataType::Extension(
             ARROW_EXT_TYPE_GEOGRAPHY.to_string(),
-            Box::new(ArrowDataType::LargeBinary),
+            Box::new(ArrowDataType::BinaryView),
             None,
         ),
     }
@@ -304,32 +304,10 @@ impl Column {
                 )
                 .unwrap(),
             ),
-            Column::Binary(col) => {
-                let offsets: Buffer<i64> =
-                    col.offsets().iter().map(|offset| *offset as i64).collect();
-                Box::new(
-                    databend_common_arrow::arrow::array::BinaryArray::<i64>::try_new(
-                        arrow_type,
-                        unsafe { OffsetsBuffer::new_unchecked(offsets) },
-                        col.data().clone(),
-                        None,
-                    )
-                    .unwrap(),
-                )
-            }
-            Column::String(col) => {
-                let offsets: Buffer<i64> =
-                    col.offsets().iter().map(|offset| *offset as i64).collect();
-                Box::new(
-                    databend_common_arrow::arrow::array::Utf8Array::<i64>::try_new(
-                        arrow_type,
-                        unsafe { OffsetsBuffer::new_unchecked(offsets) },
-                        col.data().clone(),
-                        None,
-                    )
-                    .unwrap(),
-                )
-            }
+            Column::Binary(col) => Box::new(col.clone().into_inner()),
+            Column::String(col) => unsafe {
+                Box::new(col.clone().into_inner().to_utf8view_unchecked())
+            },
             Column::Timestamp(col) => Box::new(
                 databend_common_arrow::arrow::array::PrimitiveArray::<i64>::try_new(
                     arrow_type,
@@ -401,19 +379,7 @@ impl Column {
             Column::Bitmap(col)
             | Column::Variant(col)
             | Column::Geometry(col)
-            | Column::Geography(GeographyColumn(col)) => {
-                let offsets: Buffer<i64> =
-                    col.offsets().iter().map(|offset| *offset as i64).collect();
-                Box::new(
-                    databend_common_arrow::arrow::array::BinaryArray::<i64>::try_new(
-                        arrow_type,
-                        unsafe { OffsetsBuffer::new_unchecked(offsets) },
-                        col.data().clone(),
-                        None,
-                    )
-                    .unwrap(),
-                )
-            }
+            | Column::Geography(GeographyColumn(col)) => Box::new(col.clone().into_inner()),
         }
     }
 }
2 changes: 1 addition & 1 deletion src/query/expression/src/filter/filter_executor.rs
@@ -125,7 +125,7 @@ impl FilterExecutor {
         if self.keep_order && self.has_or {
             self.true_selection[0..result_count].sort();
         }
-        data_block.take(&self.true_selection[0..result_count], &mut None)
+        data_block.take(&self.true_selection[0..result_count])
     }
 }

@@ -239,48 +239,19 @@ impl<'a> Selector<'a> {
                     Some(validity) => {
                         // search the whole string buffer
                         if let LikePattern::SurroundByPercent(searcher) = like_pattern {
-                            let needle = searcher.needle();
-                            let needle_byte_len = needle.len();
-                            let data = column.data().as_slice();
-                            let offsets = column.offsets().as_slice();
-                            let mut idx = 0;
-                            let mut pos = (*offsets.first().unwrap()) as usize;
-                            let end = (*offsets.last().unwrap()) as usize;
-
-                            while pos < end && idx < count {
-                                if let Some(p) = searcher.search(&data[pos..end]) {
-                                    while offsets[idx + 1] as usize <= pos + p {
-                                        let ret = NOT && validity.get_bit_unchecked(idx);
-                                        update_index(
-                                            ret,
-                                            idx as u32,
-                                            true_selection,
-                                            false_selection,
-                                        );
-                                        idx += 1;
-                                    }
-
-                                    // check if the substring is in bound
-                                    let ret =
-                                        pos + p + needle_byte_len <= offsets[idx + 1] as usize;
-
-                                    let ret = if NOT {
-                                        validity.get_bit_unchecked(idx) && !ret
-                                    } else {
-                                        validity.get_bit_unchecked(idx) && ret
-                                    };
-                                    update_index(ret, idx as u32, true_selection, false_selection);
-
-                                    pos = offsets[idx + 1] as usize;
-                                    idx += 1;
+                            for idx in 0u32..count as u32 {
+                                let ret = if NOT {
+                                    validity.get_bit_unchecked(idx as usize)
+                                        && searcher
+                                            .search(column.index_unchecked_bytes(idx as usize))
+                                            .is_none()
                                 } else {
-                                    break;
-                                }
-                            }
-                            while idx < count {
-                                let ret = NOT && validity.get_bit_unchecked(idx);
-                                update_index(ret, idx as u32, true_selection, false_selection);
-                                idx += 1;
+                                    validity.get_bit_unchecked(idx as usize)
+                                        && searcher
+                                            .search(column.index_unchecked_bytes(idx as usize))
+                                            .is_some()
+                                };
+                                update_index(ret, idx, true_selection, false_selection);
                             }
                         } else {
                             for idx in 0u32..count as u32 {
@@ -300,40 +271,17 @@ impl<'a> Selector<'a> {
                     None => {
                         // search the whole string buffer
                         if let LikePattern::SurroundByPercent(searcher) = like_pattern {
-                            let needle = searcher.needle();
-                            let needle_byte_len = needle.len();
-                            let data = column.data().as_slice();
-                            let offsets = column.offsets().as_slice();
-                            let mut idx = 0;
-                            let mut pos = (*offsets.first().unwrap()) as usize;
-                            let end = (*offsets.last().unwrap()) as usize;
-
-                            while pos < end && idx < count {
-                                if let Some(p) = searcher.search(&data[pos..end]) {
-                                    while offsets[idx + 1] as usize <= pos + p {
-                                        update_index(
-                                            NOT,
-                                            idx as u32,
-                                            true_selection,
-                                            false_selection,
-                                        );
-                                        idx += 1;
-                                    }
-                                    // check if the substring is in bound
-                                    let ret =
-                                        pos + p + needle_byte_len <= offsets[idx + 1] as usize;
-                                    let ret = if NOT { !ret } else { ret };
-                                    update_index(ret, idx as u32, true_selection, false_selection);
-
-                                    pos = offsets[idx + 1] as usize;
-                                    idx += 1;
+                            for idx in 0u32..count as u32 {
+                                let ret = if NOT {
+                                    searcher
+                                        .search(column.index_unchecked_bytes(idx as usize))
+                                        .is_none()
                                 } else {
-                                    break;
-                                }
-                            }
-                            while idx < count {
-                                update_index(NOT, idx as u32, true_selection, false_selection);
-                                idx += 1;
+                                    searcher
+                                        .search(column.index_unchecked_bytes(idx as usize))
+                                        .is_some()
+                                };
+                                update_index(ret, idx, true_selection, false_selection);
                             }
                         } else {
                             for idx in 0u32..count as u32 {
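The hunks above replace a single scan over the concatenated data buffer with one search per row, presumably because a view-based column no longer exposes one contiguous data buffer plus offsets. A minimal, self-contained sketch of the per-row logic follows. The names select_like and contains are hypothetical, the rows are plain byte slices rather than a real StringColumn, validity is a slice of bools rather than a bitmap, and update_index is assumed to push the row index into true_selection when the result is true and into false_selection otherwise.

```rust
/// Byte-wise substring test; stands in for the PR's `searcher.search(..)`.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    // An empty needle matches everything; checking it first also avoids
    // calling `windows(0)`, which panics.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

/// Evaluate `col LIKE '%needle%'` (or NOT LIKE when `NOT` is true) row by row.
/// Returns (true_selection, false_selection) as row indices.
fn select_like<const NOT: bool>(
    rows: &[&[u8]],
    validity: Option<&[bool]>,
    needle: &[u8],
) -> (Vec<u32>, Vec<u32>) {
    let mut true_selection = Vec::new();
    let mut false_selection = Vec::new();
    for (idx, row) in rows.iter().enumerate() {
        // A NULL row never satisfies LIKE or NOT LIKE.
        let valid = validity.map_or(true, |v| v[idx]);
        let found = contains(row, needle);
        // `found != NOT` is `found` for LIKE and `!found` for NOT LIKE.
        let ret = valid && (found != NOT);
        if ret {
            true_selection.push(idx as u32);
        } else {
            false_selection.push(idx as u32);
        }
    }
    (true_selection, false_selection)
}
```

The per-row form trades the old single-pass scan for simpler bookkeeping: there is no need to map a match position back to a row or to check that a match does not straddle two rows.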
36 changes: 5 additions & 31 deletions src/query/expression/src/kernels/concat.rs
@@ -24,11 +24,11 @@ use itertools::Itertools;

 use crate::copy_continuous_bits;
 use crate::kernels::take::BIT_MASK;
-use crate::kernels::utils::copy_advance_aligned;
-use crate::kernels::utils::set_vec_len_by_ptr;
 use crate::store_advance_aligned;
 use crate::types::array::ArrayColumnBuilder;
 use crate::types::binary::BinaryColumn;
+use crate::types::binary::BinaryColumnBuilder;
 use crate::types::decimal::Decimal;
 use crate::types::decimal::DecimalColumn;
 use crate::types::geography::GeographyColumn;
@@ -275,37 +275,11 @@ impl Column {
         cols: impl Iterator<Item = BinaryColumn> + Clone,
         num_rows: usize,
     ) -> BinaryColumn {
-        // [`BinaryColumn`] consists of [`data`] and [`offset`], we build [`data`] and [`offset`] respectively,
-        // and then call `BinaryColumn::new(data.into(), offsets.into())` to create [`BinaryColumn`].
-        let mut offsets: Vec<u64> = Vec::with_capacity(num_rows + 1);
-        let mut data_size = 0;
-
-        // Build [`offset`] and calculate `data_size` required by [`data`].
-        offsets.push(0);
-        for col in cols.clone() {
-            let mut start = col.offsets()[0];
-            for end in col.offsets()[1..].iter() {
-                data_size += end - start;
-                start = *end;
-                offsets.push(data_size);
-            }
-        }
-
-        // Build [`data`].
-        let mut data: Vec<u8> = Vec::with_capacity(data_size as usize);
-        let mut data_ptr = data.as_mut_ptr();
-
-        unsafe {
-            for col in cols {
-                let offsets = col.offsets();
-                let col_data = &(col.data().as_slice())
-                    [offsets[0] as usize..offsets[offsets.len() - 1] as usize];
-                copy_advance_aligned(col_data.as_ptr(), &mut data_ptr, col_data.len());
-            }
-            set_vec_len_by_ptr(&mut data, data_ptr);
+        let mut builder = BinaryColumnBuilder::with_capacity(num_rows, 0);
+        for col in cols {
+            builder.append_column(&col);
         }

-        BinaryColumn::new(data.into(), offsets.into())
+        builder.build()
     }

     pub fn concat_string_types(
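The concat change above drops the hand-written offset and pointer arithmetic and delegates to a builder. The sketch below illustrates that shape with deliberately simplified, hypothetical types (SketchColumn, SketchBuilder); they only mimic the with_capacity / append_column / build calls seen in the diff and are not the real BinaryColumn or BinaryColumnBuilder. The sketch still uses offsets internally, so it shows the API pattern rather than the view-based storage.

```rust
/// Simplified stand-in for a binary column: row `i` is
/// `data[offsets[i]..offsets[i + 1]]`. A sliced column may start at a
/// non-zero offset.
#[derive(Debug, Clone, PartialEq)]
struct SketchColumn {
    data: Vec<u8>,
    offsets: Vec<u64>,
}

impl SketchColumn {
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    fn value(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}

/// Hypothetical builder that hides the offset bookkeeping from callers.
struct SketchBuilder {
    data: Vec<u8>,
    offsets: Vec<u64>,
}

impl SketchBuilder {
    fn with_capacity(rows: usize, bytes: usize) -> Self {
        let mut offsets = Vec::with_capacity(rows + 1);
        offsets.push(0);
        Self { data: Vec::with_capacity(bytes), offsets }
    }

    /// Append every row of `col`, rebasing offsets as a side effect.
    fn append_column(&mut self, col: &SketchColumn) {
        for i in 0..col.len() {
            self.data.extend_from_slice(col.value(i));
            self.offsets.push(self.data.len() as u64);
        }
    }

    fn build(self) -> SketchColumn {
        SketchColumn { data: self.data, offsets: self.offsets }
    }
}

/// Same shape as the new concat body in the diff: create a builder,
/// append each input column, build the result.
fn concat(cols: impl Iterator<Item = SketchColumn>, num_rows: usize) -> SketchColumn {
    let mut builder = SketchBuilder::with_capacity(num_rows, 0);
    for col in cols {
        builder.append_column(&col);
    }
    builder.build()
}
```

Keeping the layout knowledge inside the builder means the concat kernel does not need to change when the column's storage changes, and the unsafe pointer code from the removed lines is no longer needed at this call site.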