Issue: pull_prompt performance is not acceptable for real time chat #1441

Open
codekiln opened this issue Jan 21, 2025 · 0 comments

codekiln commented Jan 21, 2025

I would like to be able to rely on the Prompt Hub for prompts used in real-time chat. Unfortunately, the performance of the API endpoints and/or the Python SDK does not seem sufficient to enable this yet.

Given the following benchmark_prompthub_pull.py:

import time
import statistics

from dotenv import load_dotenv
from langsmith import Client


def measure_pull_prompt(prompt_name, repeat=5):
    """Pull the specified prompt `repeat` times and measure latency."""
    times = []
    c = Client()  # Fresh client per prompt, so earlier runs don't warm the connection.
    for _ in range(repeat):
        start = time.perf_counter()
        _ = c.pull_prompt(prompt_name)
        end = time.perf_counter()
        times.append(end - start)
    return times


def print_stats(label, times):
    """Print min, max, mean, and standard deviation for the recorded times."""
    print(f"\nStats for '{label}':")
    print(f"  All times: {times}")
    print(f"  Mean time:   {statistics.mean(times):.4f} s")
    print(
        f"  StdDev time: {statistics.stdev(times):.4f} s"
        if len(times) > 1
        else "  StdDev time: N/A (only one measurement)"
    )
    print(f"  Min time:    {min(times):.4f} s")
    print(f"  Max time:    {max(times):.4f} s")


def main():
    load_dotenv()  # Load environment variables, e.g. LangSmith credentials

    public_prompt = "rlm/rag-prompt"  # Example: a public prompt from the Hub
    private_prompt = (
        "brims-seller-2024-12-17"  # Example: a private prompt in your account
    )

    public_times = measure_pull_prompt(public_prompt, repeat=5)
    private_times = measure_pull_prompt(private_prompt, repeat=5)

    print_stats(public_prompt, public_times)
    print_stats(private_prompt, private_times)


if __name__ == "__main__":
    main()

When I run it, I get:

Stats for 'rlm/rag-prompt':
  All times: [0.7337127500213683, 0.16061124997213483, 0.16808425000635907, 0.43246658297721297, 0.2569338330067694]
  Mean time:   0.3504 s
  StdDev time: 0.2407 s
  Min time:    0.1606 s
  Max time:    0.7337 s

Stats for 'brims-seller-2024-12-17':
  All times: [0.23353637498803437, 0.1711990410112776, 1.7675039170426317, 0.2944247499690391, 0.4333985830307938]
  Mean time:   0.5800 s
  StdDev time: 0.6709 s
  Min time:    0.1712 s
  Max time:    1.7675 s

Based on this, five pulls of the private prompt ranged from 171 ms to 1.8 s, with a mean of about 580 ms. For the public prompt, times ranged from 161 ms to 734 ms, with a mean of about 350 ms. Note that rlm/rag-prompt is the most popular prompt on the Prompt Hub, so this is likely a best-case figure for public prompts.

Given that LLM response time is already a bottleneck, we don't have the latency budget to use the Prompt Hub at this time. Please prioritize this; it is blocking us from adopting the Prompt Hub, particularly on the LangGraph Platform, where we would like assistants to reference Prompt Hub templates.
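For completeness, one possible stopgap is to keep the hub call off the hot path by caching the pulled prompt in-process. A minimal sketch, assuming a slightly stale template is acceptable (pull_prompt_cached and the TTL value are illustrative, not part of the SDK):

import time
from langsmith import Client

_prompt_cache = {}  # prompt_name -> (timestamp, prompt)
_TTL_SECONDS = 300  # assumption: serving a template up to 5 minutes stale is acceptable


def pull_prompt_cached(client: Client, prompt_name: str):
    """Return a cached copy of the prompt if it is still fresh, otherwise pull from the hub."""
    now = time.monotonic()
    hit = _prompt_cache.get(prompt_name)
    if hit is not None and now - hit[0] < _TTL_SECONDS:
        return hit[1]  # fast path: no network round-trip
    prompt = client.pull_prompt(prompt_name)  # slow path: the hub round-trip measured above
    _prompt_cache[prompt_name] = (now, prompt)
    return prompt

That hides the latency for repeat requests, but it defeats the point of editing a prompt in the hub and having it take effect immediately, and it doesn't obviously help assistants on LangGraph Platform that resolve the template per invocation.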
