-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Constructorsv2 #249
base: main
Are you sure you want to change the base?
Constructorsv2 #249
Conversation
ArrowArray arr{}; | ||
ArrowSchema schema{}; | ||
|
||
schema.format = sparrow::data_type_format_of<T>().data(); | ||
schema.release = release_arrow_schema; | ||
schema.n_children = 0; | ||
schema.children = nullptr; | ||
schema.dictionary = nullptr; | ||
schema.release = &release_arrow_schema; | ||
|
||
const auto range_size = std::ranges::size(values); | ||
|
||
arr.length = static_cast<std::int64_t>(range_size); | ||
arr.null_count = 0; | ||
arr.offset = 0; | ||
arr.n_buffers = 2; | ||
arr.n_children = 0; | ||
|
||
// buffers | ||
std::uint8_t** buf = new std::uint8_t*[2]; | ||
arr.buffers = const_cast<const void**>(reinterpret_cast<void**>(buf)); | ||
const auto bitmap_size = static_cast<std::size_t>((arr.length + 7) / 8); | ||
buf[0] = new std::uint8_t[bitmap_size]; | ||
auto bitmap_ptr = buf[0]; | ||
|
||
// data buffer | ||
buf[1] = new std::uint8_t[range_size * sizeof(T)]; | ||
auto data_ptr = reinterpret_cast<T*>(buf[1]); | ||
arr.release = &release_arrow_array; | ||
|
||
|
||
auto value_iter = std::ranges::begin(values); | ||
auto is_non_null_iter = std::ranges::begin(is_non_null); | ||
|
||
for(std::size_t i=0; i < range_size; ++i) | ||
{ | ||
const bool is_non_null_val = *is_non_null_iter; | ||
if(is_non_null_val) | ||
{ | ||
const auto value = *value_iter; | ||
data_ptr[i] = static_cast<T>(value); | ||
// set bit to 1 | ||
bitmap_ptr[i / 8] |= 1 << (i % 8); | ||
} | ||
else{ | ||
// set bit to 0 | ||
bitmap_ptr[i / 8] &= ~(1 << (i % 8)); | ||
} | ||
++value_iter; | ||
++is_non_null_iter; | ||
} | ||
arrow_proxy proxy{std::move(arr), std::move(schema)}; | ||
return proxy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use arrow_array_schema_factory.hpp instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its a bad idea to have these things implemented in arrow_array_schema_factory
The factories of the respective layouts should be implemented by the respective layout.
Furthermore, the APIs of the factories are not suitable for nested arrays.
This is already non-usable when the values in the dict are not variable-sized binary
{ | ||
auto storage = detail::array_access::extract_arrow_proxy(std::move(array)); | ||
p_array = array_factory(std::move(storage)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since ARRAY_TYPE is known at compile time, maybe something like:
template<class ARRAY_TYPE>
requires std::is_rvalue_reference_v<ARRAY_TYPE&&>
array::array(ARRAY_TYPE&& array)
: p_array(new array_wrapper_impl<ARRAY_TYPE>(std::move(array)))
{
}
would be enough? It avoids extracting the proxy and going through the factory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the constructors for variable_size_binary_array
and struc_array_array
, is that a leftover or is it intentional?
@@ -47,9 +55,21 @@ namespace sparrow | |||
SPARROW_API size_type size() const; | |||
SPARROW_API const_reference operator[](size_type) const; | |||
|
|||
SPARROW_API cloning_ptr<array_wrapper> && extract_array_wrapper() &&; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type should be a value type instead of an rvalue reference to prevent dangling references.
|
||
template<class ARRAY> | ||
requires(std::is_rvalue_reference_v<ARRAY&&>) | ||
static inline sparrow::arrow_proxy&& extract_arrow_proxy(ARRAY&& array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark here (notice this is just about the return type, the implementation is correct).
|
||
private: | ||
|
||
arrow_proxy && extract_arrow_proxy() &&; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark here.
@@ -491,4 +502,44 @@ namespace sparrow | |||
const auto offset = i * m_list_size; | |||
return std::make_pair(offset, offset + m_list_size); | |||
} | |||
|
|||
|
|||
inline arrow_proxy fixed_sized_list_array::create_proxy(std::uint64_t list_size, array && flat_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should allow to pass a range for the null values (similar to what we do here). Same remark for the other layouts where it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree!
} | ||
++value_iter; | ||
++is_non_null_iter; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filling the bitmap should be done in a dedicated method (notice it is alrady implemented here) to avoid code duplictation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JohanMabille the function you show does something different.
I pass a complete range of non-null / null (ie a range as long as the values filled with true and false).
The API of the function you show has a list of positions where values are missing.
This is also fine with me to prefer such API but they are 2 different things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the (non_owning)dynamic_bitset instead ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also fine with me to prefer such API but they are 2 different things
Ah indeed. I think passing a range with the indices of the missing values would be more convenient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also fine with me to prefer such API but they are 2 different things
Ah indeed. I think passing a range with the indices of the missing values would be more convenient.
My thinking was the following:
When you have data, do you rather have a list of where the data is missing, or do you have some sort of functor which takes in a single data point and then decides if its missing or not. For instance once could think of a measurements of floats, and any negative number indicates that something went wrong, then you want to exclude this data point -- in such cases I would prefer to pass an iterator to bools instead of indices)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convenient thing with indices of missing values is that you can provide a default empty argument, and omit it when you know that you don't have any missing values. And building a range of indices from the full bitmap range could be quite easy with the std::range
functions, I'm not sure the opposite would be as easy (but I might have missed some range constructions).
EDIT: we can actually also provide a default argument when passing a full range (std::ranges::views::single), so the only argument remaining for passing a range of indices is the "easy" transformation range => full range. If we can achieve the opposite trnasformation with the same "simplicity", I would say that I don't have any preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More thoughts about this: I think it could be worth providing both APIs actually. So we could keep the constructor taking two full ranges, and provide a static function form_values(value_range, range<indices>)
that would rely on the constructor and hide the transformation range<indices> => full range
to the user. These methods can be added in a dedicated PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convenient thing with indices of missing values is that you can provide a default empty argument, and omit it when you know that you don't have any missing values.
jeah that's indeed a good argument
@@ -78,6 +79,7 @@ namespace sparrow | |||
SPARROW_API static acc_length_ptr_variant_type get_acc_lengths_ptr(const array_wrapper& ar); | |||
SPARROW_API std::uint64_t get_run_length(std::uint64_t run_index) const; | |||
|
|||
arrow_proxy && extract_arrow_proxy() &&; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type should be a value instead of an rvalue reference.
I only implemented the primitive array and a single nested array (only the fixed_size_binary) because it made sense to show this to you before implementing everything. Also for the single nested array the version with a range of bools for the nullables are missing. |
} | ||
++value_iter; | ||
++is_non_null_iter; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the (non_owning)dynamic_bitset instead ?
const auto range_size = std::ranges::size(values); | ||
|
||
// buffers | ||
auto bitmap_ptr = new std::uint8_t[(range_size + 7) / 8]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using sparrow::buffer here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change it to use the buffers cls
else | ||
{ | ||
// set bit to 0 | ||
bitmap_ptr[i / 8] &= static_cast<std::uint8_t>(~(1 << (i % 8))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should set the value anyway, in case the bitmap is changed later.
So IMO, you should just do a move/copy of all values, then process the bitmap.
In the case the user want to set to null a single element in 100 000 elements, he has to create a vector of 100 000 elements or a range which return false for this index ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not work in the generic case.
An iterator might might not be allowed to be dereferenced in the case of a missing value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should set the value anyway, in case the bitmap is changed later.
not sure if I get this..but if you set a non-missing value to missing, everything is fine.
And if you set a missing value to non-missing you better change the value anyhow
In the case the user want to set to null a single element in 100 000 elements, he has to create a vector of 100 000 elements or a range which return false for this index ?
It does not need to be a vector at all, it can be any iterator. Second, there is an overload where you just pass the values (this then generates an iterator always returning true (I trust compilers do do some sort of optimization st. this will not be very bad)) and then change the bitmap later on for the one entry.
Furthermore, the price of an N element bool-vector would be 1/8 of the size of the actual values in the worst case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if you set a missing value to non-missing you better change the value anyhow
I don't know the use cases, but there is no cons and only benefits to copy "null" values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the use cases, but there is no cons and only benefits to copy "null" values.
Again, the value iterator might be undefined for the places where data is missing.
Assume you have std::vector<sparrow::nullable<float>>
and the values are a transform iterator and also the missing-vs-non missing is a transform-iterator.
In that case you will get an exception if you try to access .value()
at places where you don't have data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to use the get() function in that case
// create arrow array | ||
ArrowArray arr = make_arrow_array( | ||
static_cast<std::int64_t>(range_size), // length | ||
0, // null_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to calculate the right null_count
arrow_proxy primitive_array<T>::create_proxy(VALUE_RANGE && values, BOOL_RANGE && is_non_null) | ||
{ | ||
|
||
const auto range_size = std::ranges::size(values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values size must be equal to is_non_null size, we should add a check
No description provided.