---
title: Batched Plugin Queries
weight: 9
slug: 0009
extra:
rfd: 9
---

# Batched Plugin Queries

Now that we've made substantial progress in converting legacy Hipcheck analysis
implementations to plugins, we've identified the need for a way to "batch"
queries for performance reasons. For example, the entropy and churn analyses
rely on a "linguist" functionality that evaluates whether a given file in a repo
contains source code. The plugin design philosophy would dictate that we
separate the `linguist` functionality into its own plugin, with an endpoint
`is_likely_source_file(&Path) -> bool`, and have churn and entropy each query
that. But making a separate gRPC request from the `entropy` plugin for each
file in a potentially large repository is likely to incur significant runtime
overhead.

Plugins can expose query endpoints that take a `Vec<_>` of objects and return a
`Vec<_>` of responses, but because `salsa` memo-ization operates on the entire
query object, it would cache the entire `Vec<_>` as a single key. Future
`Vec`-based queries would therefore not benefit from the memo-ization unless
they contained exactly the same elements in the same order.
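
As a toy illustration of the problem (an ordinary `HashMap` standing in for
`salsa`'s memo-ization cache, not `salsa` itself), two overlapping batches
share no cached work when the whole `Vec` is the key:

```rust
use std::collections::HashMap;

fn main() {
    // Toy cache keyed on the entire batch, mimicking memo-izing a
    // `Vec<_>`-keyed query endpoint.
    let mut cache: HashMap<Vec<&str>, Vec<bool>> = HashMap::new();

    // The first batch is computed and cached under the whole Vec.
    cache.insert(vec!["a.rs", "b.rs"], vec![true, true]);

    // An overlapping batch misses entirely: "b.rs" was already computed,
    // but the cache key is the whole Vec, not its individual elements.
    assert!(cache.get(&vec!["b.rs", "c.rs"]).is_none());
}
```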

## Proposed Protocol Change

The current definition of a `Query` gRPC message in the Hipcheck v1 protocol
includes the following fields:

```protobuf
message Query {
  ...

  // The key for the query, as a JSON object. This is the key that
  // Hipcheck's incremental computation system will use to cache the
  // response.
  string key = 6;

  // The response for the query, as a JSON object. This will be cached by
  // Hipcheck for future queries matching the publisher name, plugin name,
  // query name, and key.
  string output = 7;

  // Concern chunking is the same as other fields.
  repeated string concern = 8;
}
```

We propose changing the `key` and `output` fields into `repeated` fields. From
a gRPC perspective, this means that multiple `key` and `output` values can
appear in each message. Compiling this protobuf definition to Rust will then
produce a `struct` with `key: Vec<String>` in place of `key: String`, and
likewise for `output`.
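
As a sketch of the effect on the generated code (assuming a `prost`-style code
generator, with attribute macros and unrelated fields elided):

```rust
mod current {
    // Generated from the existing definition: singular `key` and `output`.
    pub struct Query {
        pub key: String,
        pub output: String,
        pub concern: Vec<String>,
    }
}

mod proposed {
    // Generated once `key` and `output` become `repeated` fields.
    pub struct Query {
        pub key: Vec<String>,
        pub output: Vec<String>,
        pub concern: Vec<String>,
    }
}
```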

According to the `proto3` [language guide][proto3]:

> "For string, bytes, and message fields, singular is compatible with repeated.
> Given serialized data of a repeated field as input, clients that expect this
> field to be singular will take the last input value if it’s a primitive type
> field or merge all input elements if it’s a message type field. Note that this
> is not generally safe for numeric types, including bools and enums. Repeated
> fields of numeric types are serialized in the packed format by default, which
> will not be parsed correctly when a singular field is expected."

Therefore, making this change should not break compatibility with any plugins
compiled using the existing protocol definition.

## Associated Change to Querying Plugins from the Rust SDK

With the above protocol changes, under the hood the `key` field of a query
object will be a `Vec<String>` instead of a singular `String`.

We propose to leave the current `PluginEngine::query(key: Value)` function API
intact; under the hood, we will simply wrap `key` in a single-element vector.
We will add a second function, `PluginEngine::batch_query(keys: Vec<Value>)`,
which will do no additional wrapping: each key is JSON-serialized, and the
resulting vector is inserted as-is into the protobuf `Query` struct, as
sketched below.
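
A minimal sketch of this relationship, with the real SDK's session state,
query targets, and async plumbing elided, and the `send_query` helper assumed
for illustration:

```rust
use serde_json::Value;

// Simplified stand-in for the SDK's `PluginEngine`; the real type carries
// session state and a query target, which are elided here.
struct PluginEngine;

#[derive(Debug)]
struct Error;

impl PluginEngine {
    // Existing single-key API, left intact: the key is wrapped into a
    // single-element batch under the hood.
    fn query(&mut self, key: Value) -> Result<Value, Error> {
        let mut results = self.batch_query(vec![key])?;
        // A single-key query yields exactly one response.
        results.pop().ok_or(Error)
    }

    // New batch API: no extra wrapping, one protobuf `key` entry per element.
    fn batch_query(&mut self, keys: Vec<Value>) -> Result<Vec<Value>, Error> {
        // JSON-serialize each key for the wire format.
        let keys: Vec<String> = keys.iter().map(|k| k.to_string()).collect();
        // Hypothetical helper that builds the protobuf `Query` message and
        // awaits the matching response from Hipcheck core.
        self.send_query(keys)
    }

    // Placeholder for the gRPC send/receive machinery.
    fn send_query(&mut self, _keys: Vec<String>) -> Result<Vec<Value>, Error> {
        unimplemented!()
    }
}
```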

## Associated Change to Hipcheck Core

Since we want to make full use of `salsa` memo-ization, when the Hipcheck core
query engine receives a request from one plugin to another whose `key` field
has a length greater than one, it will split out each element and make a
separate `salsa` request for it.

As a result, **no plugin should expect to receive a query where the `key` field
has more than one element**; receiving one should be treated as an error.

Once each element of `key` has been queried, Hipcheck core will package the
results into an array of equal length, where the result at each index
corresponds to the key at the same index of `key`, and return the generated
`QueryResponse` object to the requesting plugin.
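
A rough sketch of this fan-out and reassembly, with `salsa_query` standing in
for the actual memoized query call inside Hipcheck core:

```rust
#[derive(Debug)]
struct QueryError;

// Hypothetical sketch of the core engine's batch handling; `salsa_query`
// stands in for the real memoized query function.
fn handle_batched_query(
    keys: &[String],
    salsa_query: impl Fn(&str) -> Result<String, QueryError>,
) -> Result<Vec<String>, QueryError> {
    keys.iter()
        // Each key becomes its own `salsa` request, so previously seen keys
        // hit the memo-ization cache regardless of how they were batched.
        .map(|key| salsa_query(key))
        // Collecting preserves order, so output[i] answers keys[i].
        .collect()
}
```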

[proto3]: https://protobuf.dev/programming-guides/proto3/