Handling huge embeddings databases. #261
-
Hey, while this is not strictly related to Human, it is related to how to manage Human's 1024-element embedding arrays. Most of the other face-recognition libraries I've used deal with 128 dimensions for their embeddings, but Human uses 1024-element arrays, and this complicates things a little. For now I've just used a 256 GB RAM server to store all the data in memory and handle face matching, but the dataset is now too large to keep in RAM, so I'm looking for ways to use a database instead. I was thinking of using PostgreSQL since it has native support for arrays, but 1024 dimensions is too much for it to handle; another option that comes to mind is MongoDB. Does anyone have any experience dealing with this? Thanks!
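For context, here is a minimal sketch of the brute-force in-memory approach described above, assuming descriptors are held as plain `Float32Array`s; the record shape and function names are illustrative, not Human's API. It also shows why memory adds up: each 1024-element float32 descriptor is about 4 KB before any object overhead.

```ts
// Illustrative sketch of brute-force in-memory matching (not Human's built-in matcher).
// Each 1024-dim float32 descriptor is 1024 * 4 bytes = 4 KB, so tens of millions of
// faces quickly reach hundreds of GB once object and index overhead is included.

interface Enrolled {
  id: string;               // hypothetical record id
  descriptor: Float32Array; // 1024 values from face description
}

function euclideanDistance(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Linear scan over everything held in RAM: simple, but the whole dataset must fit in memory.
function findClosest(probe: Float32Array, db: Enrolled[]): { id: string; distance: number } | null {
  let best: { id: string; distance: number } | null = null;
  for (const entry of db) {
    const distance = euclideanDistance(probe, entry.descriptor);
    if (!best || distance < best.distance) best = { id: entry.id, distance };
  }
  return best;
}
```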
-
I've used MongoDB in several projects; it's pretty trivial and quite fast. You could also reduce the number of computed dimensions. The matching algorithm only cares that the source and target descriptors have the same number of dimensions, but that number can be anything. Reducing dimensions does decrease precision, but you can experiment with what is acceptable to you (1024 -> 512 -> 256 -> 128). You can also normalize descriptors from …
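To illustrate the dimension-reduction idea, here is a minimal sketch. The block-averaging approach is my own assumption (the comment above does not prescribe how to reduce); the only hard requirement it demonstrates is that enrolled and probe descriptors go through the same reduction before matching.

```ts
// Hypothetical dimension reduction by averaging fixed-size blocks (one simple way to go
// from 1024 -> 256 dimensions). The key constraint is that the SAME reduction is applied
// to both enrolled and probe descriptors so they stay comparable.

function reduceDimensions(descriptor: Float32Array, targetDims: number): Float32Array {
  const blockSize = descriptor.length / targetDims; // e.g. 1024 / 256 = 4
  const reduced = new Float32Array(targetDims);
  for (let i = 0; i < targetDims; i++) {
    let sum = 0;
    for (let j = 0; j < blockSize; j++) sum += descriptor[i * blockSize + j];
    reduced[i] = sum / blockSize;
  }
  return reduced;
}

// Usage: reduce once at enrollment time (before storing) and once per probe at match time,
// then compare the reduced vectors with whatever distance/similarity function you already use.
// const stored = reduceDimensions(enrolledDescriptor, 256);
// const probe  = reduceDimensions(liveDescriptor, 256);
```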
-
Thank you, I'll do some testing with MongoDB and Elasticsearch, keeping the 1024 dimensions.
-
Just my $0.02 on MongoDB vs. Elasticsearch: I love ES. Its query performance is great for large datasets, and nowadays it's even resilient enough to use as a primary store (not just as the index database it was originally intended to be), but it's much harder to set up and maintain, and it uses a lot more resources when idle. On the other hand, using MongoDB from Node.js is trivial, and it requires very few resources when idle. I'd stick with MongoDB unless you're expecting a really extreme number of descriptors; but if your database size is expected to be in the multi-GB range, ES starts showing its advantages.
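For reference, a minimal sketch of the "MongoDB as plain descriptor storage" route using the official `mongodb` Node driver; the connection string, database, collection, and field names are placeholders. This sketch treats MongoDB purely as storage: the similarity matching still happens in application code after loading candidates.

```ts
import { MongoClient } from 'mongodb';

// Placeholder connection details; adjust for your deployment.
const client = new MongoClient('mongodb://localhost:27017');

async function enrollAndMatch(probe: number[]) {
  await client.connect();
  const faces = client.db('faces').collection('descriptors');

  // Enrollment: store the 1024-element descriptor as a plain number array.
  await faces.insertOne({ label: 'person-1', descriptor: probe });

  // Matching: load candidates (optionally pre-filtered by metadata) and scan
  // them in application code with a simple Euclidean distance.
  const candidates = await faces
    .find({}, { projection: { label: 1, descriptor: 1 } })
    .toArray();

  let best: { label: string; distance: number } | null = null;
  for (const c of candidates) {
    let sum = 0;
    for (let i = 0; i < probe.length; i++) {
      const d = probe[i] - c.descriptor[i];
      sum += d * d;
    }
    const distance = Math.sqrt(sum);
    if (!best || distance < best.distance) best = { label: c.label, distance };
  }

  await client.close();
  return best;
}
```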