Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constructorsv2 #249

Open
wants to merge 31 commits into
base: main
Choose a base branch
from
Open

Conversation

DerThorsten
Copy link
Collaborator

No description provided.

Comment on lines 185 to 237
ArrowArray arr{};
ArrowSchema schema{};

schema.format = sparrow::data_type_format_of<T>().data();
schema.release = release_arrow_schema;
schema.n_children = 0;
schema.children = nullptr;
schema.dictionary = nullptr;
schema.release = &release_arrow_schema;

const auto range_size = std::ranges::size(values);

arr.length = static_cast<std::int64_t>(range_size);
arr.null_count = 0;
arr.offset = 0;
arr.n_buffers = 2;
arr.n_children = 0;

// buffers
std::uint8_t** buf = new std::uint8_t*[2];
arr.buffers = const_cast<const void**>(reinterpret_cast<void**>(buf));
const auto bitmap_size = static_cast<std::size_t>((arr.length + 7) / 8);
buf[0] = new std::uint8_t[bitmap_size];
auto bitmap_ptr = buf[0];

// data buffer
buf[1] = new std::uint8_t[range_size * sizeof(T)];
auto data_ptr = reinterpret_cast<T*>(buf[1]);
arr.release = &release_arrow_array;


auto value_iter = std::ranges::begin(values);
auto is_non_null_iter = std::ranges::begin(is_non_null);

for(std::size_t i=0; i < range_size; ++i)
{
const bool is_non_null_val = *is_non_null_iter;
if(is_non_null_val)
{
const auto value = *value_iter;
data_ptr[i] = static_cast<T>(value);
// set bit to 1
bitmap_ptr[i / 8] |= 1 << (i % 8);
}
else{
// set bit to 0
bitmap_ptr[i / 8] &= ~(1 << (i % 8));
}
++value_iter;
++is_non_null_iter;
}
arrow_proxy proxy{std::move(arr), std::move(schema)};
return proxy;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use arrow_array_schema_factory.hpp instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its a bad idea to have these things implemented in arrow_array_schema_factory

The factories of the respective layouts should be implemented by the respective layout.

Furthermore, the APIs of the factories are not suitable for nested arrays.
This is already non-usable when the values in the dict are not variable-sized binary

{
auto storage = detail::array_access::extract_arrow_proxy(std::move(array));
p_array = array_factory(std::move(storage));
}
Copy link
Collaborator

@JohanMabille JohanMabille Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ARRAY_TYPE is known at compile time, maybe something like:

    template<class ARRAY_TYPE>
    requires std::is_rvalue_reference_v<ARRAY_TYPE&&>
    array::array(ARRAY_TYPE&& array)
        : p_array(new array_wrapper_impl<ARRAY_TYPE>(std::move(array)))
    {
    }

would be enough? It avoids extracting the proxy and going through the factory.

Copy link
Collaborator

@JohanMabille JohanMabille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the constructors for variable_size_binary_array and struc_array_array, is that a leftover or is it intentional?

@@ -47,9 +55,21 @@ namespace sparrow
SPARROW_API size_type size() const;
SPARROW_API const_reference operator[](size_type) const;

SPARROW_API cloning_ptr<array_wrapper> && extract_array_wrapper() &&;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type should be a value type instead of an rvalue reference to prevent dangling references.


template<class ARRAY>
requires(std::is_rvalue_reference_v<ARRAY&&>)
static inline sparrow::arrow_proxy&& extract_arrow_proxy(ARRAY&& array)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark here (notice this is just about the return type, the implementation is correct).


private:

arrow_proxy && extract_arrow_proxy() &&;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark here.

@@ -491,4 +502,44 @@ namespace sparrow
const auto offset = i * m_list_size;
return std::make_pair(offset, offset + m_list_size);
}


inline arrow_proxy fixed_sized_list_array::create_proxy(std::uint64_t list_size, array && flat_values)
Copy link
Collaborator

@JohanMabille JohanMabille Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should allow to pass a range for the null values (similar to what we do here). Same remark for the other layouts where it makes sense.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree!

}
++value_iter;
++is_non_null_iter;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filling the bitmap should be done in a dedicated method (notice it is alrady implemented here) to avoid code duplictation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JohanMabille the function you show does something different.
I pass a complete range of non-null / null (ie a range as long as the values filled with true and false).
The API of the function you show has a list of positions where values are missing.

This is also fine with me to prefer such API but they are 2 different things

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the (non_owning)dynamic_bitset instead ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also fine with me to prefer such API but they are 2 different things

Ah indeed. I think passing a range with the indices of the missing values would be more convenient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also fine with me to prefer such API but they are 2 different things

Ah indeed. I think passing a range with the indices of the missing values would be more convenient.

My thinking was the following:
When you have data, do you rather have a list of where the data is missing, or do you have some sort of functor which takes in a single data point and then decides if its missing or not. For instance once could think of a measurements of floats, and any negative number indicates that something went wrong, then you want to exclude this data point -- in such cases I would prefer to pass an iterator to bools instead of indices)

Copy link
Collaborator

@JohanMabille JohanMabille Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convenient thing with indices of missing values is that you can provide a default empty argument, and omit it when you know that you don't have any missing values. And building a range of indices from the full bitmap range could be quite easy with the std::range functions, I'm not sure the opposite would be as easy (but I might have missed some range constructions).

EDIT: we can actually also provide a default argument when passing a full range (std::ranges::views::single), so the only argument remaining for passing a range of indices is the "easy" transformation range => full range. If we can achieve the opposite trnasformation with the same "simplicity", I would say that I don't have any preference.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More thoughts about this: I think it could be worth providing both APIs actually. So we could keep the constructor taking two full ranges, and provide a static function form_values(value_range, range<indices>) that would rely on the constructor and hide the transformation range<indices> => full range to the user. These methods can be added in a dedicated PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convenient thing with indices of missing values is that you can provide a default empty argument, and omit it when you know that you don't have any missing values.

jeah that's indeed a good argument

@@ -78,6 +79,7 @@ namespace sparrow
SPARROW_API static acc_length_ptr_variant_type get_acc_lengths_ptr(const array_wrapper& ar);
SPARROW_API std::uint64_t get_run_length(std::uint64_t run_index) const;

arrow_proxy && extract_arrow_proxy() &&;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type should be a value instead of an rvalue reference.

@DerThorsten
Copy link
Collaborator Author

DerThorsten commented Oct 25, 2024

I don't see the constructors for variable_size_binary_array and struc_array_array, is that a leftover or is it intentional?

I only implemented the primitive array and a single nested array (only the fixed_size_binary) because it made sense to show this to you before implementing everything.

Also for the single nested array the version with a range of bools for the nullables are missing.
The focus here was more on how we do the nested array construction in principle

}
++value_iter;
++is_non_null_iter;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the (non_owning)dynamic_bitset instead ?

const auto range_size = std::ranges::size(values);

// buffers
auto bitmap_ptr = new std::uint8_t[(range_size + 7) / 8];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using sparrow::buffer here ?

Copy link
Collaborator Author

@DerThorsten DerThorsten Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it to use the buffers cls

else
{
// set bit to 0
bitmap_ptr[i / 8] &= static_cast<std::uint8_t>(~(1 << (i % 8)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should set the value anyway, in case the bitmap is changed later.
So IMO, you should just do a move/copy of all values, then process the bitmap.
In the case the user want to set to null a single element in 100 000 elements, he has to create a vector of 100 000 elements or a range which return false for this index ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not work in the generic case.
An iterator might might not be allowed to be dereferenced in the case of a missing value

Copy link
Collaborator Author

@DerThorsten DerThorsten Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should set the value anyway, in case the bitmap is changed later.

not sure if I get this..but if you set a non-missing value to missing, everything is fine.
And if you set a missing value to non-missing you better change the value anyhow

In the case the user want to set to null a single element in 100 000 elements, he has to create a vector of 100 000 elements or a range which return false for this index ?

It does not need to be a vector at all, it can be any iterator. Second, there is an overload where you just pass the values (this then generates an iterator always returning true (I trust compilers do do some sort of optimization st. this will not be very bad)) and then change the bitmap later on for the one entry.
Furthermore, the price of an N element bool-vector would be 1/8 of the size of the actual values in the worst case.

Copy link
Collaborator

@Alex-PLACET Alex-PLACET Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if you set a missing value to non-missing you better change the value anyhow

I don't know the use cases, but there is no cons and only benefits to copy "null" values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the use cases, but there is no cons and only benefits to copy "null" values.

Again, the value iterator might be undefined for the places where data is missing.
Assume you have std::vector<sparrow::nullable<float>> and the values are a transform iterator and also the missing-vs-non missing is a transform-iterator.

In that case you will get an exception if you try to access .value() at places where you don't have data

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to use the get() function in that case

// create arrow array
ArrowArray arr = make_arrow_array(
static_cast<std::int64_t>(range_size), // length
0, // null_count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to calculate the right null_count

arrow_proxy primitive_array<T>::create_proxy(VALUE_RANGE && values, BOOL_RANGE && is_non_null)
{

const auto range_size = std::ranges::size(values);
Copy link
Collaborator

@Alex-PLACET Alex-PLACET Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values size must be equal to is_non_null size, we should add a check

@DerThorsten DerThorsten marked this pull request as ready for review October 27, 2024 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants