
test: Add completion models unit & matrix integration tests #68

Open
faassen opened this issue Oct 21, 2024 · 5 comments · May be fixed by #97
Labels: test (Improvements to tests coverage & depth)

faassen commented Oct 21, 2024

Given @0xMochan's comment on #66 I took the liberty of creating an issue for it. Here are some of my thoughts. I hope it's helpful at all.

Integration testing of external services is a tricky issue. Even without the stochastic nature of LLMs, there is no guarantee that replaying a request against the same external service is going to give the same response, as state could have changed downstream.

A generic mock service is very useful, but it only tests part of the integration; we also know specific integrations can break, and it's difficult to test them thoroughly by hand.

How would manual testing work?

There would be a bunch of scenarios, and you'd run them, eyeballing whether the correct behavior happens. It'd be nice if those scenarios could be generic, with specific clients (openai, anthropic, etc.) injected. This way most of the manual testing code can be reused.

This can also lead to at least some "smoke tests" - while we don't know the details of the responses, we know the response should be successful, and that the right type of response is returned (such as a request to use a tool).

How to automate such smoke tests?

Once a test passes against the real API, the requests (including parameters, body, etc.) as well as the full responses can be recorded and serialized into files. So the real clients would need to grow an optional recording feature. Then you could have fake versions of the various clients (openai, anthropic, etc.), which, instead of making a real request, check whether the request matches a recording and synthesize the response from that recording as well.

Now that I've written all that, I actually thought "wait, this could all happen at the reqwest layer", and I found a library that does something like what I just described:

https://docs.rs/reqwest_mock/latest

It turns out it's discontinued, but its README describes a bunch of alternatives to explore.

I'll read a bit about these next.

@faassen faassen added the feat label Oct 21, 2024
@0xMochan (Contributor)

Adding a recording feature to serialize real responses as part of the provider machinery would be quite interesting and perhaps a robust way to set up a nice in-between with smoke tests (which I think of as sanity checks) and proper integration testing. Mocking is always tricky (though most of these providers let you set a seed, it's not something to rely on).


faassen commented Oct 21, 2024

I've reviewed some of the various suggested solutions:

rvcr

https://github.com/chorusone/rvcr

Downloads: 13,289

This is centered around the record/replay feature but offers no other mocking features.

Builds on top of reqwest-middleware, which may be interesting in its own right. From grepping the source, I think it supports async requests.

wiremock-rs

https://github.com/lukemathwalker/wiremock-rs

Downloads: 9,591,535

This lets you mock an HTTP service, but offers no recording/replay.

Lists mockito and httpmock as prior art, and presents a table comparing features:

https://docs.rs/wiremock/0.6.2/wiremock/#prior-art

Not the same as, or compatible with, Java WireMock.

confusion warning

There's wiremock-rs, and there's Java WireMock (at wiremock.org), which is something else.

stubr

https://github.com/beltram/stubr

Downloads: 37,385

This is an adaptation (fork?) of wiremock-rs that supports Java WireMock JSON stubs as input. It aims to be a drop-in replacement for Java WireMock.

Hasn't been updated in over a year.

This adds recording with the record-reqwest feature

https://beltram.github.io/stubr/html/recording/reqwest.html

but async requests aren't supported and rig uses async requests.

httpmock

https://github.com/alexliesenfeld/httpmock

Downloads: 3,873,238

Mocks an HTTP service, but no recording/replay.

mockito

https://github.com/lipanski/mockito

Downloads: 5,595,547

Mocks an HTTP service, but no recording/replay.

Analysis

By popularity, wiremock-rs would be interesting, but it requires mocking the individual APIs rather than being able to record/replay. stubr fixes that and can integrate with reqwest, but relying on a less-maintained fork of wiremock-rs seems risky, plus rig requires async requests, which stubr doesn't support.

rvcr seems fairly simple, and if record/replay is chosen, I think it's the way to go. If record/replay doesn't work out, fall back on wiremock-rs: it's the most popular, and with some work fake APIs can be built on it.


faassen commented Oct 21, 2024

though most of these providers let you set a seed, it's not something to rely on

Indeed! Anthropic's API docs take forever to load (today?), but here's this note about temperature (I haven't come across a notion of a seed):

Note that even with temperature of 0.0, the results will not be fully deterministic.

https://docs.anthropic.com/en/api/messages


0xMochan commented Oct 21, 2024

I've reviewed some of the various suggested solutions:

This is some very high quality analysis and will be super useful as we approach this soon!

@mateobelanger mateobelanger changed the title feat: Integration Testing test(integration): Add hybrid smoke-integration tests for external providers Oct 22, 2024
@mateobelanger mateobelanger added test Improvements to tests coverage & depth and removed feat labels Oct 22, 2024
@0xMochan 0xMochan mentioned this issue Oct 29, 2024
@mateobelanger mateobelanger added feat and removed feat labels Oct 30, 2024
@mateobelanger mateobelanger self-assigned this Nov 12, 2024
@mateobelanger (Member)

Task

  • Create testing matrix for all completion model providers
    EX: with/out additional params, with/out tools, ...
  • integration tests

Research

https://rust.testcontainers.org/

@mateobelanger mateobelanger added this to the v0.5 milestone Nov 12, 2024
@mateobelanger mateobelanger changed the title test(integration): Add hybrid smoke-integration tests for external providers test(integration): Add completion models unit & matrix integration tests Nov 12, 2024
@mateobelanger mateobelanger changed the title test(integration): Add completion models unit & matrix integration tests test: Add completion models unit & matrix integration tests Nov 12, 2024