Issue: pull_prompt performance is not acceptable for real time chat #1441

Open
codekiln opened this issue Jan 21, 2025 · 0 comments

codekiln commented Jan 21, 2025

I would like to be able to rely on the Prompt Hub for prompts used in real-time chat. Unfortunately, the performance of the API endpoints and/or the Python SDK does not seem sufficient to enable this yet.

Given the following benchmark_prompthub_pull.py:

import time
import statistics

from dotenv import load_dotenv
from langsmith import Client


def measure_pull_prompt(prompt_name, repeat=5):
    """Pull the specified prompt `repeat` times and measure latency."""
    times = []
    c = Client()  # Fresh client per prompt, so earlier runs don't warm the connection.
    for _ in range(repeat):
        start = time.perf_counter()
        _ = c.pull_prompt(prompt_name)
        end = time.perf_counter()
        times.append(end - start)
    return times


def print_stats(label, times):
    """Print min, max, mean, and standard deviation for the recorded times."""
    print(f"\nStats for '{label}':")
    print(f"  All times: {times}")
    print(f"  Mean time:   {statistics.mean(times):.4f} s")
    print(
        f"  StdDev time: {statistics.stdev(times):.4f} s"
        if len(times) > 1
        else "  StdDev time: N/A (only one measurement)"
    )
    print(f"  Min time:    {min(times):.4f} s")
    print(f"  Max time:    {max(times):.4f} s")


def main():
    load_dotenv()  # Load environment variables, e.g. LangSmith credentials

    public_prompt = "rlm/rag-prompt"  # Example: a public prompt from the Hub
    private_prompt = (
        "brims-seller-2024-12-17"  # Example: a private prompt in your account
    )

    public_times = measure_pull_prompt(public_prompt, repeat=5)
    private_times = measure_pull_prompt(private_prompt, repeat=5)

    print_stats(public_prompt, public_times)
    print_stats(private_prompt, private_times)


if __name__ == "__main__":
    main()

When I run it, I get:

Stats for 'rlm/rag-prompt':
  All times: [0.7337127500213683, 0.16061124997213483, 0.16808425000635907, 0.43246658297721297, 0.2569338330067694]
  Mean time:   0.3504 s
  StdDev time: 0.2407 s
  Min time:    0.1606 s
  Max time:    0.7337 s

Stats for 'brims-seller-2024-12-17':
  All times: [0.23353637498803437, 0.1711990410112776, 1.7675039170426317, 0.2944247499690391, 0.4333985830307938]
  Mean time:   0.5800 s
  StdDev time: 0.6709 s
  Min time:    0.1712 s
  Max time:    1.7675 s

Based on this, five pulls of the private prompt ranged from 171 ms to 1.8 s, with a mean of about 580 ms. For the public prompt, times ranged from 161 ms to 734 ms, with a mean of about 350 ms. Note that rlm/rag-prompt is the most popular prompt on the Prompt Hub, so this is likely a best-case figure for public prompts.

Given that LLM response time is already a bottleneck, we don't have the latency budget to use the Prompt Hub at this time. Please prioritize this; it is blocking us from adopting the Prompt Hub, particularly on the LangGraph Platform, where we would like assistants to reference Prompt Hub templates.
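For completeness, one possible stopgap is to keep the hub call off the hot path by caching the pulled prompt in-process. A minimal sketch, assuming a slightly stale template is acceptable (pull_prompt_cached and the TTL value are illustrative, not part of the SDK):

import time
from langsmith import Client

_prompt_cache = {}  # prompt_name -> (timestamp, prompt)
_TTL_SECONDS = 300  # assumption: serving a template up to 5 minutes stale is acceptable


def pull_prompt_cached(client: Client, prompt_name: str):
    """Return a cached copy of the prompt if it is still fresh, otherwise pull from the hub."""
    now = time.monotonic()
    hit = _prompt_cache.get(prompt_name)
    if hit is not None and now - hit[0] < _TTL_SECONDS:
        return hit[1]  # fast path: no network round-trip
    prompt = client.pull_prompt(prompt_name)  # slow path: the hub round-trip measured above
    _prompt_cache[prompt_name] = (now, prompt)
    return prompt

That hides the latency for repeat requests, but it defeats the point of editing a prompt in the hub and having it take effect immediately, and it doesn't obviously help assistants on LangGraph Platform that resolve the template per invocation.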
