Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GODRIVER-3286 BSON Binary vector subtype support. #1919

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

qingyang-hu
Copy link
Collaborator

@qingyang-hu qingyang-hu commented Jan 10, 2025

GODRIVER-3286

Summary

BSON Binary vector subtype support.

Background & Motivation

specs: https://github.com/mongodb/specifications/blob/master/source/bson-binary-vector/bson-binary-vector.md
tests: https://github.com/mongodb/specifications/blob/master/source/bson-binary-vector/tests/README.md

The invalid cases in the spec tests are skipped because they are inappropriate for the implementation.

The motivation for using generic types is to reduce run-time errors by shifting the burden onto the compiler.

@mongodb-drivers-pr-bot mongodb-drivers-pr-bot bot added the priority-3-low Low Priority PR for Review label Jan 10, 2025
Copy link
Contributor

mongodb-drivers-pr-bot bot commented Jan 10, 2025

API Change Report

./v2/bson

compatible changes

BitVector: added
ErrInsufficientVectorData: added
ErrNonZeroVectorPadding: added
ErrNotVector: added
ErrVectorPaddingTooLarge: added
Float32Vector: added
Int8Vector: added
NewBinaryFromVector: added
NewVectorFromBinary: added
PackedBitVector: added
TypeBinaryVector: added
Vector: added

./v2/mongo

incompatible changes

(*ReplaceOneModel).SetSort: removed
(*UpdateOneModel).SetSort: removed
ReplaceOneModel.Sort: removed
UpdateOneModel.Sort: removed

./v2/mongo/options

incompatible changes

(*ReplaceOptionsBuilder).SetSort: removed
(*UpdateOneOptionsBuilder).SetSort: removed
ReplaceOptions.Sort: removed
UpdateOneOptions.Sort: removed

@qingyang-hu qingyang-hu marked this pull request as ready for review January 10, 2025 23:32
// BitVector represents a binary quantized (PACKED_BIT) vector of 0s and 1s
// (bits). The Padding prescribes the number of bits to ignore in the final byte
// of the Data. It should be 0 for an empty Data and always less than 8.
type BitVector struct {
Copy link
Collaborator

@prestonvasquez prestonvasquez Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest merging BitVector into Vector, similar to python. Also, suggest not using generics since a function with a generic signature like NewVectorFromBinary[T ...] cannot dynamically decide at runtime which T to return and so must return an ambiguous type such as interface{} or any which is not ideal.

type VectorDType byte

// These constants are vector data types.
const (
	Int8Vector      VectorDType = 0x03
	Float32Vector   VectorDType = 0x27
	PackedBitVector VectorDType = 0x10
)

// Vector represents a densely packed array of numbers / bits.
type Vector struct {
	DType       VectorDType
	Int8Data    []int8
	Float32Data []float32
	BitData     []byte
	BitPadding  []uint8
}

Then the internal constructors would only return their associated type, rather than an entire struct:

func newInt8Vector(b []byte) ([]int8, error) {
	if len(b) == 0 {
		return nil, ErrInsufficientVectorData
	}
	if padding := b[0]; padding > 0 {
		return nil, ErrNonZeroVectorPadding
	}
	s := make([]int8, 0, len(b)-1)
	for i := 1; i < len(b); i++ {
		s = append(s, int8(b[i]))
	}
	return s, nil
}

Finally, the vector constructor, avoiding the necessity of returning interface{}/any:

func NewVectorFromBinary(b Binary) (Vector, error) {
	var vec Vector

	if b.Subtype != TypeBinaryVector {
		return vec, ErrNotVector
	}
	if len(b.Data) < 2 {
		return vec, ErrInsufficientVectorData
	}

	switch VectorDType(b.Data[0]) {
	case Int8Vector:
		data, err := newInt8Vector(b.Data[1:])
		if err != nil {
			return vec, err
		}

		vec.DType = Int8Vector
		vec.Int8Data = data

		return vec, nil
	default:
		return vec, fmt.Errorf("invalid Vector data type: %d", b.Data[1:])
	}
}

The con of this approach is that we would need to actual run invalid cases such as "Vector with float values PACKED_BIT", which we currently skip because it's not possible under the current design.

Copy link
Collaborator Author

@qingyang-hu qingyang-hu Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation of the current design is to reduce run-time errors by shifting the burden onto the compiler. Likewise, it intends to minimize user mistakes, e.g., assigning a wrong/invalid DType because Vector and its fields are directly accessible.

Returning a general interface from a constructor is not ideal because it lacks a syntactical clue about the type. However, the name of the method also suggests the scope. On the other hand, the user has to identify the vector type by either the DType or a type assertion anyway. The latter seems more instinctive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it intends to minimize user mistakes, e.g., assigning a wrong/invalid DType because Vector and its fields are directly accessible.

I have concerns about returning an empty interface from an exported function in stable API. Specifically "accept interfaces, return structs". While there are exceptions( io.LimitReader, etc), IMO the case described here should be avoided since we know what the function should return. The generic solution is fine as long as we return concrete types from exported constructors.

If we are concerned with users accessing types, we could privatize them:

// Vector represents a densely packed array of numbers / bits.
type Vector struct {
	dType       VectorDType
	int8Data    []int8
	float32Data []float32
	bitData     []byte
	bitPadding  []uint8
}

// NewVector creates a new Vector from the provided data and padding, based on the type of the data.
func NewVector[T int8 | float32 | byte](data []T, padding []uint8) (Vector, error) {
	switch any(data).(type) {
	case []int8:
		if len(padding) > 0 {
			return Vector{}, errors.New("padding should not be set for int8 data")
		}
		return Vector{
			dType:    Int8Vector,
			int8Data: any(data).([]int8), // Type assertion
		}, nil

	case []float32:
		if len(padding) > 0 {
			return Vector{}, errors.New("padding should not be set for float32 data")
		}
		return Vector{
			dType:       Float32Vector,
			float32Data: any(data).([]float32), // Type assertion
		}, nil

	case []byte:
		return Vector{
			dType:      PackedBitVector,
			bitData:    any(data).([]byte), // Type assertion
			bitPadding: padding,
		}, nil

	default:
		return Vector{}, errors.New("unsupported data type")
	}
}

// GetInt8Data retrieves the int8 data if the Vector is of Int8Type
func (v *Vector) Int8Vector() ([]int8, error) { // maybe ok instead of error?
	if v.dType != Int8Vector {
		return nil, errors.New("data is not of type int8")
	}
	return v.int8Data, nil
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Privatizing data fields sounds like a better approach to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthewdale What are your thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that making Vector a non-generic struct with unexported fields seems like the best choice. What's not clear is how to expose the vector data.

One option is to use a method like Decode:

func (v Vector) Decode(v any) error {
	// ...
}

In use, that would look like:

var v Vector
var floats []float32
err := v.Decode(&floats)
if err != nil {
// ...

Another option is to use an API similar to RawValue's conversion methods:

func (v Vector) Int8() []int8
func (v Vector) Int8OK() ([]int8, bool)
func (v Vector) Float32() []float32
func (v Vector) Float32OK() ([]float32, bool)
func (v Vector) PackedBits() (bits []byte, padding byte)
func (v Vector) PackedBitsOK() (bits []byte, padding byte, ok bool)

In use, that would look like:

var v Vector
floats, ok := v.Float32OK()
if !ok {
// ...

I lean toward the latter because it's more conventional based on the existing RawValue API.

For creating a vector, I recommend defining two methods:

func NewVector[T int8 | float32](data []T) Vector
func NewPackedBitsVector(bits []byte, padding byte) Vector

In use, that would look like:

vec := bson.NewVector([]int8{0, 1, 2, 3})

Additionally, I recommend making it possible to decode "int8" and "float32" vectors directly into []int8 and []float32, respectively. For example, the following should work:

var coll *mongo.Collection
d := bson.D{{
	"vec", bson.NewVector([]int8{0, 1, 2, 3}),
}}
coll.InsertOne(..., v)

var res struct {
	Vec []int8
}
coll.FindOne(...).Decode(&res)

Additional questions:

  • Should we also add a struct field annotation for []int8 and []float32 fields that instructs the Go Driver to store them as vectors instead of arrays? That would allow a user to encode and decode vector data with a single struct without any additional decoding/conversion steps.
    For example:
var coll *mongo.Collection
type MyType struct {
	Vec []int8 `bson:"vec,asvector"`
}
coll.InsertOne(... MyType{Vec: []int8{0, 1, 2, 3})

var res MyType
coll.FindOne(...).Decode(&res)
  • If we do the above, do we even need a Vector type? Maybe we only need a PackedBitVector type, and we can use []int8 and []float32 for everything else?
  • Some of these questions would be easier to answer if we had examples of how vector data might be inserted, queried, and manipulated in a Go application. Do we have any example use cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer a Vector type over the field annotation. Annotation lacks compile-time checks, which means a typo can result in a hard-to-find bug.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the Vector type is safer. In addition to what @qingyang-hu notes, it gives us a way to maintain state concerning the type in the unlikely case that is required. It also mirrors the python implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a non-generic Vector type with unexported fields sounds like the best plan.

Based on our separate conversation, supporting unmarshaling directly to []int8 or []float32 is lower priority than being able to easily insert vector data in those formats. I've created GODRIVER-3472 to track that improvement.

Comment on lines +25 to +28
ErrNotVector = errors.New("not a vector")
ErrInsufficientVectorData = errors.New("insufficient data")
ErrNonZeroVectorPadding = errors.New("padding must be 0")
ErrVectorPaddingTooLarge = errors.New("padding larger than 7")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should users need to assert against these error types? Consider un-exporting them until there is a clear use case for exporting

bson/vector.go Show resolved Hide resolved
}

// NewBinaryFromVector converts a Vector into a BSON Binary.
func NewBinaryFromVector[T BitVector | Vector[int8] | Vector[float32]](v T) (Binary, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a method on the Vector type instead?

E.g.

func (v Vector) Binary() Binary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority-3-low Low Priority PR for Review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants