-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GODRIVER-3286 BSON Binary vector subtype support. #1919
base: master
Are you sure you want to change the base?
Conversation
API Change Report./v2/bsoncompatible changesBitVector: added ./v2/mongoincompatible changes(*ReplaceOneModel).SetSort: removed ./v2/mongo/optionsincompatible changes(*ReplaceOptionsBuilder).SetSort: removed |
a939853
to
cc4cc4d
Compare
cc4cc4d
to
73018a5
Compare
// BitVector represents a binary quantized (PACKED_BIT) vector of 0s and 1s | ||
// (bits). The Padding prescribes the number of bits to ignore in the final byte | ||
// of the Data. It should be 0 for an empty Data and always less than 8. | ||
type BitVector struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest merging BitVector into Vector, similar to python. Also, suggest not using generics since a function with a generic signature like NewVectorFromBinary[T ...]
cannot dynamically decide at runtime which T
to return and so must return an ambiguous type such as interface{}
or any
which is not ideal.
type VectorDType byte
// These constants are vector data types.
const (
Int8Vector VectorDType = 0x03
Float32Vector VectorDType = 0x27
PackedBitVector VectorDType = 0x10
)
// Vector represents a densely packed array of numbers / bits.
type Vector struct {
DType VectorDType
Int8Data []int8
Float32Data []float32
BitData []byte
BitPadding []uint8
}
Then the internal constructors would only return their associated type, rather than an entire struct:
func newInt8Vector(b []byte) ([]int8, error) {
if len(b) == 0 {
return nil, ErrInsufficientVectorData
}
if padding := b[0]; padding > 0 {
return nil, ErrNonZeroVectorPadding
}
s := make([]int8, 0, len(b)-1)
for i := 1; i < len(b); i++ {
s = append(s, int8(b[i]))
}
return s, nil
}
Finally, the vector constructor, avoiding the necessity of returning interface{}
/any
:
func NewVectorFromBinary(b Binary) (Vector, error) {
var vec Vector
if b.Subtype != TypeBinaryVector {
return vec, ErrNotVector
}
if len(b.Data) < 2 {
return vec, ErrInsufficientVectorData
}
switch VectorDType(b.Data[0]) {
case Int8Vector:
data, err := newInt8Vector(b.Data[1:])
if err != nil {
return vec, err
}
vec.DType = Int8Vector
vec.Int8Data = data
return vec, nil
default:
return vec, fmt.Errorf("invalid Vector data type: %d", b.Data[1:])
}
}
The con of this approach is that we would need to actual run invalid cases such as "Vector with float values PACKED_BIT", which we currently skip because it's not possible under the current design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation of the current design is to reduce run-time errors by shifting the burden onto the compiler. Likewise, it intends to minimize user mistakes, e.g., assigning a wrong/invalid DType
because Vector
and its fields are directly accessible.
Returning a general interface from a constructor is not ideal because it lacks a syntactical clue about the type. However, the name of the method also suggests the scope. On the other hand, the user has to identify the vector type by either the DType
or a type assertion anyway. The latter seems more instinctive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it intends to minimize user mistakes, e.g., assigning a wrong/invalid DType because Vector and its fields are directly accessible.
I have concerns about returning an empty interface from an exported function in stable API. Specifically "accept interfaces, return structs". While there are exceptions( io.LimitReader, etc), IMO the case described here should be avoided since we know what the function should return. The generic solution is fine as long as we return concrete types from exported constructors.
If we are concerned with users accessing types, we could privatize them:
// Vector represents a densely packed array of numbers / bits.
type Vector struct {
dType VectorDType
int8Data []int8
float32Data []float32
bitData []byte
bitPadding []uint8
}
// NewVector creates a new Vector from the provided data and padding, based on the type of the data.
func NewVector[T int8 | float32 | byte](data []T, padding []uint8) (Vector, error) {
switch any(data).(type) {
case []int8:
if len(padding) > 0 {
return Vector{}, errors.New("padding should not be set for int8 data")
}
return Vector{
dType: Int8Vector,
int8Data: any(data).([]int8), // Type assertion
}, nil
case []float32:
if len(padding) > 0 {
return Vector{}, errors.New("padding should not be set for float32 data")
}
return Vector{
dType: Float32Vector,
float32Data: any(data).([]float32), // Type assertion
}, nil
case []byte:
return Vector{
dType: PackedBitVector,
bitData: any(data).([]byte), // Type assertion
bitPadding: padding,
}, nil
default:
return Vector{}, errors.New("unsupported data type")
}
}
// GetInt8Data retrieves the int8 data if the Vector is of Int8Type
func (v *Vector) Int8Vector() ([]int8, error) { // maybe ok instead of error?
if v.dType != Int8Vector {
return nil, errors.New("data is not of type int8")
}
return v.int8Data, nil
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Privatizing data fields sounds like a better approach to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthewdale What are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that making Vector
a non-generic struct with unexported fields seems like the best choice. What's not clear is how to expose the vector data.
One option is to use a method like Decode
:
func (v Vector) Decode(v any) error {
// ...
}
In use, that would look like:
var v Vector
var floats []float32
err := v.Decode(&floats)
if err != nil {
// ...
Another option is to use an API similar to RawValue
's conversion methods:
func (v Vector) Int8() []int8
func (v Vector) Int8OK() ([]int8, bool)
func (v Vector) Float32() []float32
func (v Vector) Float32OK() ([]float32, bool)
func (v Vector) PackedBits() (bits []byte, padding byte)
func (v Vector) PackedBitsOK() (bits []byte, padding byte, ok bool)
In use, that would look like:
var v Vector
floats, ok := v.Float32OK()
if !ok {
// ...
I lean toward the latter because it's more conventional based on the existing RawValue
API.
For creating a vector, I recommend defining two methods:
func NewVector[T int8 | float32](data []T) Vector
func NewPackedBitsVector(bits []byte, padding byte) Vector
In use, that would look like:
vec := bson.NewVector([]int8{0, 1, 2, 3})
Additionally, I recommend making it possible to decode "int8" and "float32" vectors directly into []int8
and []float32
, respectively. For example, the following should work:
var coll *mongo.Collection
d := bson.D{{
"vec", bson.NewVector([]int8{0, 1, 2, 3}),
}}
coll.InsertOne(..., v)
var res struct {
Vec []int8
}
coll.FindOne(...).Decode(&res)
Additional questions:
- Should we also add a struct field annotation for
[]int8
and[]float32
fields that instructs the Go Driver to store them as vectors instead of arrays? That would allow a user to encode and decode vector data with a single struct without any additional decoding/conversion steps.
For example:
var coll *mongo.Collection
type MyType struct {
Vec []int8 `bson:"vec,asvector"`
}
coll.InsertOne(... MyType{Vec: []int8{0, 1, 2, 3})
var res MyType
coll.FindOne(...).Decode(&res)
- If we do the above, do we even need a
Vector
type? Maybe we only need aPackedBitVector
type, and we can use[]int8
and[]float32
for everything else? - Some of these questions would be easier to answer if we had examples of how vector data might be inserted, queried, and manipulated in a Go application. Do we have any example use cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer a Vector
type over the field annotation. Annotation lacks compile-time checks, which means a typo can result in a hard-to-find bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, the Vector
type is safer. In addition to what @qingyang-hu notes, it gives us a way to maintain state concerning the type in the unlikely case that is required. It also mirrors the python implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a non-generic Vector
type with unexported fields sounds like the best plan.
Based on our separate conversation, supporting unmarshaling directly to []int8
or []float32
is lower priority than being able to easily insert vector data in those formats. I've created GODRIVER-3472 to track that improvement.
ErrNotVector = errors.New("not a vector") | ||
ErrInsufficientVectorData = errors.New("insufficient data") | ||
ErrNonZeroVectorPadding = errors.New("padding must be 0") | ||
ErrVectorPaddingTooLarge = errors.New("padding larger than 7") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should users need to assert against these error types? Consider un-exporting them until there is a clear use case for exporting
} | ||
|
||
// NewBinaryFromVector converts a Vector into a BSON Binary. | ||
func NewBinaryFromVector[T BitVector | Vector[int8] | Vector[float32]](v T) (Binary, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this a method on the Vector
type instead?
E.g.
func (v Vector) Binary() Binary
GODRIVER-3286
Summary
BSON Binary vector subtype support.
Background & Motivation
specs: https://github.com/mongodb/specifications/blob/master/source/bson-binary-vector/bson-binary-vector.md
tests: https://github.com/mongodb/specifications/blob/master/source/bson-binary-vector/tests/README.md
The invalid cases in the spec tests are skipped because they are inappropriate for the implementation.
The motivation for using generic types is to reduce run-time errors by shifting the burden onto the compiler.