Add Fields abstraction (#3955) #3965
/// ```
///
#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
pub struct Fields(Arc<[FieldRef]>);
I originally defined Fields = Vec<FieldPtr>.
Whilst simple, the lack of a newtype made for a more convoluted migration. With a newtype we can define conversions such as From<Vec<Field>>, etc., to help reduce friction.
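To make the newtype argument concrete, here is a minimal sketch of how such conversions might look. The `Field` struct below is a hypothetical stand-in, not arrow's real `Field`, and `two_fields` is an illustrative helper; only the `Fields(Arc<[FieldRef]>)` shape and the `From<Vec<FieldRef>>` impl are taken from the diff.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Field`; the real type carries more metadata.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
}

pub type FieldRef = Arc<Field>;

// Newtype over a reference-counted slice, mirroring the definition in the diff.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fields(Arc<[FieldRef]>);

// Owned fields are wrapped in `Arc` one by one, then collected into the slice.
impl From<Vec<Field>> for Fields {
    fn from(value: Vec<Field>) -> Self {
        Self(value.into_iter().map(Arc::new).collect())
    }
}

// Already shared fields only need the outer slice to be built.
impl From<Vec<FieldRef>> for Fields {
    fn from(value: Vec<FieldRef>) -> Self {
        Self(value.into())
    }
}

impl Fields {
    pub fn len(&self) -> usize {
        self.0.len()
    }
}

// Hypothetical helper: the call site stays close to the old `Vec` form.
pub fn two_fields() -> Fields {
    vec![
        Field { name: "a".to_string(), nullable: false },
        Field { name: "b".to_string(), nullable: true },
    ]
    .into()
}
```

A plain type alias could not carry these `From` impls, because both sides of the conversion would be foreign types; the newtype is what makes them possible.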
I agree this is a better formulation than a typedef and will allow for more flexibility
@@ -182,7 +183,7 @@ pub enum DataType {
    /// A single LargeList array can store up to [`i64::MAX`] elements in total
    LargeList(Box<Field>),
A quick follow-up PR would then replace Box<Field> with FieldRef
Another follow-up could be done for Union, although that would also benefit from a Vec<(Field, i8)> instead of two separate vectors. I think that also currently makes it the largest variant, which increases the needed size of all datatypes.
A slightly hacky improvement for Union could also be to move the type_id into Field and leave it unused in most places. That should basically be free since Field already has a few bits of padding left.
Yes, I plan to do the other variants in a follow-up. I think changing it to (Fields, Arc<[i8]>, UnionMode) may be sufficient and would keep things simple
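The point about the largest variant setting the size of every value can be checked with `std::mem::size_of`. The sketch below uses two hypothetical enums, `WithVecs` and `WithArcSlices`, with a stand-in `Field` and a `bool` in place of `UnionMode`; they are not arrow's actual `DataType` layout, and the exact byte counts depend on the target.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Hypothetical stand-in for a field; only its presence in the containers matters here.
pub struct Field {
    pub name: String,
}

// Union payload held as two separate vectors (pointer, capacity, length each).
pub enum WithVecs {
    Struct(Arc<[Arc<Field>]>),
    Union(Vec<Field>, Vec<i8>, bool),
}

// Union payload held as reference-counted slices (pointer and length each).
pub enum WithArcSlices {
    Struct(Arc<[Arc<Field>]>),
    Union(Arc<[Arc<Field>]>, Arc<[i8]>, bool),
}

// Returns (size with vectors, size with slices) in bytes for the current target.
pub fn enum_sizes() -> (usize, usize) {
    (size_of::<WithVecs>(), size_of::<WithArcSlices>())
}
```

Every value of an enum is as large as its largest variant, so shrinking the union payload shrinks all values of the type, including the ones that only hold the small variant.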
impl From<Vec<FieldRef>> for Fields {
    fn from(value: Vec<FieldRef>) -> Self {
        Self(value.into())
It is perhaps worth highlighting that this is implemented as a memmove; it cannot reuse the allocation
it cannot reuse the allocation
That applies if the vector's allocation is oversized. I think it will reuse the allocation if the vector is at capacity (which is rare, though).
Sadly the implementation will always move regardless, I think it is some limitation of unsized coercion
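The behaviour discussed in this thread can be observed by comparing data addresses before and after the conversion. This is a small sketch using only the standard library; the function names are hypothetical. One contributing factor is that `Arc` keeps its reference counts in the same heap block as the data, so the vector's buffer has no room for them.

```rust
use std::sync::Arc;

// Returns true if converting the vector into `Arc<[u64]>` kept the same data address.
pub fn conversion_reused_buffer(v: Vec<u64>) -> bool {
    let before = v.as_ptr();
    // The conversion allocates a new block and copies the elements into it.
    let shared: Arc<[u64]> = v.into();
    before == shared.as_ptr()
}

// Cloning the `Arc` afterwards only bumps a reference count and shares the data.
pub fn clone_shares_buffer(v: Vec<u64>) -> bool {
    let shared: Arc<[u64]> = v.into();
    let cloned = Arc::clone(&shared);
    Arc::ptr_eq(&shared, &cloned)
}
```

The copy is paid once at construction; every later clone of the resulting `Arc<[T]>` is cheap, which is the trade-off the `Fields` type relies on.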
FYI @alamb @crepererum @viirya I would appreciate your thoughts on this
I like it 👍
I also like it
    pub fn size(&self) -> usize {
        self.iter().map(|field| field.size()).sum()
    }
}
While constructing / modifying lists of fields, I think it would be great if we could also add functions like
/// Maybe something more generic to allow adding a Field and FieldRef
pub fn push(&mut self, field: Field...) {
    ...
}
Yeah, I think we can make SchemaBuilder public and add such a method to it; Fields itself is inherently immutable
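A minimal sketch of the builder idea from this exchange: a mutable accumulator with a generic `push`, which is frozen into an immutable `Fields` at the end. `FieldsBuilder`, its methods, and the simplified `Field` are hypothetical names for illustration, not the `SchemaBuilder` API in the PR.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Field`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
}

pub type FieldRef = Arc<Field>;

// Immutable once built: there is no way to push into the shared slice.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fields(Arc<[FieldRef]>);

impl Fields {
    pub fn len(&self) -> usize {
        self.0.len()
    }

    pub fn names(&self) -> Vec<&str> {
        self.0.iter().map(|f| f.name.as_str()).collect()
    }
}

// Hypothetical builder that owns a growable `Vec` until `finish` is called.
#[derive(Debug, Default)]
pub struct FieldsBuilder {
    fields: Vec<FieldRef>,
}

impl FieldsBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    // Accepts either an owned `Field` or an existing `FieldRef`.
    pub fn push(&mut self, field: impl Into<FieldRef>) {
        self.fields.push(field.into());
    }

    // Freezes the accumulated fields into the shared, immutable form.
    pub fn finish(self) -> Fields {
        Fields(self.fields.into())
    }
}
```

Taking `impl Into<FieldRef>` works for both cases because the standard library provides `From<T> for Arc<T>` alongside the reflexive `From<T> for T`.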
use crate::error::ArrowError;
use crate::field::Field;
use crate::{FieldRef, Fields};

/// A builder to facilitate building a [`Schema`] from iteratively from [`FieldRef`]
This I think is one of the most expensive operations in DataFusion planning now: apache/datafusion#5157 (comment)
So 👍
},
);

let iter = v.into_iter();
I hope to rework these once StructArray::new exists as part of #3880
@@ -302,6 +306,21 @@ impl From<(Vec<(Field, ArrayRef)>, Buffer)> for StructArray {
    }
}

impl From<RecordBatch> for StructArray {
This replaces the existing implementation in record_batch.rs with a more optimal implementation
@@ -467,19 +466,6 @@ impl From<&StructArray> for RecordBatch {
    }
}

impl From<RecordBatch> for StructArray {
Moved to struct_array.rs
        self.iter().map(|field| field.size()).sum()
    }

    /// Searches for a field by name, returning it along with its index if found
This will be an obvious place to add hash based lookup or similar
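For context, a lookup matching the doc comment above might be a simple linear scan like the sketch below. The types are hypothetical stand-ins and the signature is an assumption based on that doc comment, not a copy of the PR's code.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Field`.
pub struct Field {
    pub name: String,
}

pub type FieldRef = Arc<Field>;

pub struct Fields(Arc<[FieldRef]>);

impl Fields {
    // Illustrative constructor so the example is self-contained.
    pub fn from_names(names: &[&str]) -> Self {
        Self(
            names
                .iter()
                .map(|n| Arc::new(Field { name: n.to_string() }))
                .collect(),
        )
    }

    // Linear scan: returns the first field with a matching name and its index.
    pub fn find(&self, name: &str) -> Option<(usize, &FieldRef)> {
        self.0.iter().enumerate().find(|(_, f)| f.name == name)
    }
}
```

Because callers only see `find`, the scan could later be swapped for a hash-based index without changing call sites.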
 let struct_type =
-    DataType::Struct(vec![Field::new("data", DataType::Int64, false)]);
+    DataType::Struct(vec![Field::new("data", DataType::Int64, false)].into());
into works like Fields::from here?
Yeah, I switched between the two to make rustfmt happy 😅
/// A cheaply cloneable, owned slice of [`FieldRef`]
///
/// Similar to `Arc<Vec<FieldPtr>>` or `Arc<[FieldPtr]>`
FieldPtr? Do you mean FieldRef?
This abstraction looks good, and it should make datatype/schema manipulation more efficient.
Notified the mailing list about this - https://lists.apache.org/thread/pmxq5j864qlkp36lvxg8kvk0kct56r8m
Thank you @tustvold -- I think this looks in general very good.
My biggest concern is on the amount of API churn that this will generate -- I think there may be a way to reduce the churn and make this PR smaller, and I left comments to that effect.
Once we sort it out and get this merged, I think we should then try (almost immediately) to upgrade some other project that makes significant use of arrow-rs to see how painful the upgrade is (and if there are other ergonomic things that could be done to ease the transition pain)
Thank you again for pushing this through
@@ -178,10 +181,10 @@ mod tests {
 ),
 Field::new(
 "c25",
-DataType::Struct(vec![
+DataType::Struct(Fields::from(vec![
For anyone who uses DataType::Struct, this is now getting complicated to construct.
I wonder if we can ease the pain by having something like
impl DataType {
    fn new_struct(fields: impl Into<Fields>) -> Self {
        ..
    }
}
So then this could be
-DataType::Struct(Fields::from(vec![
+DataType::new_struct(vec![
I'm not convinced by this. The major reason for using the more verbose DataType::Struct(Fields::from(..)) was to reduce formatting churn; most downstreams will just be able to use .into().
I'll have a go upgrading DataFusion to assess the churn required
I guess in my mind it is about the cognitive load. Now I need to know what a Fields is, import it, construct one, etc.
Maybe the magic into() will make it ok
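The three spellings compared in this thread can be put side by side in a small sketch. The `DataType`, `Field`, and `Fields` types here are simplified hypothetical stand-ins, and `new_struct` is the helper proposed in the thread, written out as one possible implementation rather than an API confirmed by the PR.

```rust
use std::sync::Arc;

// Simplified, hypothetical stand-ins for the arrow types under discussion.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

pub type FieldRef = Arc<Field>;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fields(Arc<[FieldRef]>);

impl From<Vec<Field>> for Fields {
    fn from(value: Vec<Field>) -> Self {
        Self(value.into_iter().map(Arc::new).collect())
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DataType {
    Int64,
    Struct(Fields),
}

impl DataType {
    // The helper proposed in the thread: callers never have to name `Fields`.
    pub fn new_struct(fields: impl Into<Fields>) -> Self {
        DataType::Struct(fields.into())
    }
}

// Illustrative input shared by the three construction styles.
pub fn example_fields() -> Vec<Field> {
    vec![Field { name: "data".to_string(), data_type: DataType::Int64 }]
}
```

With these definitions, `DataType::new_struct(example_fields())`, `DataType::Struct(Fields::from(example_fields()))`, and `DataType::Struct(example_fields().into())` all build equal values; the difference is only in what the caller has to import and type.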
DataFusion upgrade PR - apache/datafusion#5782
There don't appear to be any objections to this, and there is plenty of time until the next release, so I am going to get this in before it develops merge conflicts. We can continue to iterate from there.
Which issue does this PR close?
Part of #3955
Rationale for this change
This adds a cheaply cloneable Fields abstraction that internally contains FieldRef within a reference-counted slice. This achieves a couple of things:
- FieldRef allows projecting / reconstructing schema without needing to copy Field
- Arc<[FieldRef]> allows cheap cloning of DataType, construction of DataType::Struct, etc... Schema and DataType
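The two benefits listed in the rationale can be illustrated with a short sketch: cloning shares the slice, and projection only copies `Arc` pointers. The types are simplified stand-ins, and `from_names`, `get`, `project`, and `shares_storage_with` are hypothetical helpers introduced for this example.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Field`.
#[derive(Debug, PartialEq, Eq)]
pub struct Field {
    pub name: String,
}

pub type FieldRef = Arc<Field>;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fields(Arc<[FieldRef]>);

impl Fields {
    // Illustrative constructor so the example is self-contained.
    pub fn from_names(names: &[&str]) -> Self {
        Self(
            names
                .iter()
                .map(|n| Arc::new(Field { name: n.to_string() }))
                .collect(),
        )
    }

    pub fn get(&self, index: usize) -> Option<&FieldRef> {
        self.0.get(index)
    }

    // Builds a new `Fields` from the given indices; only `Arc` pointers are
    // cloned, never the `Field` values. Returns `None` on an invalid index.
    pub fn project(&self, indices: &[usize]) -> Option<Fields> {
        let picked: Option<Vec<FieldRef>> =
            indices.iter().map(|&i| self.0.get(i).cloned()).collect();
        picked.map(|v| Fields(v.into()))
    }

    // True when both values point at the same underlying slice allocation.
    pub fn shares_storage_with(&self, other: &Fields) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}
```

A clone of `Fields` reports shared storage with its source, and each projected entry is pointer-equal to the entry it was taken from.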
What changes are included in this PR?
Are there any user-facing changes?
Yes, this makes a fundamental change to the schema representation