Bugfix/kickoff hangs when llm call fails #1943

Merged: 10 commits merged into main on Jan 22, 2025

Conversation

@bhancockio (Collaborator) commented on Jan 21, 2025

Pretty important bug fix for OSS:

Root issue:

  • We are not properly handling the case where LiteLLM fails to make a call because it doesn't have the proper API keys.
  • As a result, the crew retries the same LLM call until it hits the max iteration limit (20 by default).

Issues for users:

Solution:

  • We properly handle LiteLLM exceptions now and exit early.

Closes #1934

@joaomdmoura (Collaborator) commented:

Disclaimer: This review was made by a crew of AI Agents.

Code Review Comment for PR #1943

Overview

This PR significantly improves the handling of authentication errors and enhances logging mechanisms in the CrewAI codebase, specifically related to LiteLLM integration and error management.

Key Code Improvements

1. Logging Enhancements

In the current code, there are several print statements for debugging purposes. This approach is not suitable for production-level code:

Current Implementation:

print("Authentication error: Please check your API credentials")

Suggested Improvement:
Transition to the Python logging framework:

import logging

logging.error("Authentication error: Please check your API credentials")

This change will allow for different logging levels and better control over log output.
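For instance, keeping a named, module-level logger gives that control in one place. A minimal sketch (the function name is illustrative, not from the codebase):

import logging

# A named, module-level logger lets output be filtered or routed per module.
logger = logging.getLogger(__name__)

def report_auth_failure(provider: str) -> None:
    # Unlike print(), this respects the configured logging level and handlers.
    logger.error("Authentication error: check your %s API credentials", provider)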


2. Error Handling Improvement

The nested try-except structures could lead to decreased readability and maintenance challenges. It's advisable to simplify the error handling mechanism:

Current Implementation:

try:
    # operation
except LiteLLMAuthenticationError as auth_error:
    # handle auth error
except Exception as e:
    # handle general error

Suggested Improvement:
Encapsulate proper error-handling logic to reduce nesting:

def _handle_errors(self):
    try:
        # operation
    except LiteLLMAuthenticationError:
        self._handle_auth_error()
    except Exception as e:
        self._handle_generic_error(e)

This modular approach improves readability and can be reused in other parts of the code.
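As a rough, self-contained illustration of the same fail-fast idea this PR implements (assuming LiteLLMAuthenticationError is litellm's AuthenticationError; the wrapper name is hypothetical):

from litellm.exceptions import AuthenticationError as LiteLLMAuthenticationError

def invoke_llm_once(call_llm):
    """Run a single LLM call and fail fast on authentication problems."""
    try:
        return call_llm()
    except LiteLLMAuthenticationError:
        # Credentials will not fix themselves between retries, so surface
        # the problem immediately instead of looping to the iteration limit.
        raise
    except Exception as exc:
        # Other failures may be transient and can go through the normal retry path.
        raise RuntimeError(f"LLM call failed: {exc}") from exc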


3. Magic Constants

The code currently makes use of hard-coded strings, which should be defined as constants for maintainability:

Current Implementation:

self._printer.print(content="Authentication error with litellm occurred", color="red")

Suggested Improvement:
Define these as class constants:

class CrewAgentExecutor:
    ERROR_COLOR = "red"
    AUTH_ERROR_MESSAGE = "Authentication error with litellm occurred."

This practice fosters easier updates and consistency in error messaging.
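Usage at the call sites then stays consistent, for example:

self._printer.print(content=self.AUTH_ERROR_MESSAGE, color=self.ERROR_COLOR)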


Links to Historical Context and Learnings

While I couldn't fetch related PRs, it is important to note that previous pull requests have highlighted the need for consistency in error handling and logging strategy, as evidenced by the ongoing pattern of switching from print statements to a more robust logging framework.

Lessons learned from earlier PRs emphasize the significance of structured error management and a cohesive logging strategy that can greatly enhance debugging capabilities and system transparency.

Specific Recommendations

  1. Adopt a Consistent Logging Framework: Replace all debugging print statements with the appropriate logging levels to ensure production readiness.
  2. Enhance Documentation: Incorporate detailed docstrings for all new methods, including parameters, return types, and exceptional scenarios that may arise.
  3. Testing Enhancements: Implement unit tests for newly introduced error handling scenarios, and include integration tests focused specifically on LiteLLM authentication flows (see the sketch after this list).
  4. Centralized Configuration Management: Extract configurable parameters into a configuration file, allowing for easy modifications and environment-specific configurations.
  5. Security Review: Ensure sensitive information does not get logged, particularly around authentication errors, to mitigate risks of data leaks.
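For recommendation 3, a minimal test along these lines would pin down the fail-fast behavior (a sketch only, not the actual test added in this PR; it assumes an OpenAI-backed agent and forces an invalid key):

import pytest
from crewai import Agent, Crew, LLM, Task

def test_kickoff_fails_fast_on_bad_api_key(monkeypatch):
    # Force an authentication failure by pointing LiteLLM at an invalid key.
    monkeypatch.setenv("OPENAI_API_KEY", "invalid-key")

    agent = Agent(
        role="tester",
        goal="test goal",
        backstory="test backstory",
        llm=LLM(model="gpt-4"),
        max_retry_limit=0,  # mirrors the test in this PR
    )
    task = Task(description="say hi", expected_output="a greeting", agent=agent)
    crew = Crew(agents=[agent], tasks=[task])

    # Before this fix the loop would retry until the max-iteration limit;
    # now kickoff should raise promptly instead of hanging.
    with pytest.raises(Exception):
        crew.kickoff()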

Conclusion

Overall, the adjustments made in PR #1943 enhance code maintainability, reliability, and production readiness. Addressing the outlined recommendations will promote better practices and a higher quality codebase as we continue to develop the CrewAI project. This thoughtful enhancement and continued focus on code quality will pave the way for a robust error handling and logging framework that supports our ongoing efforts in machine learning and AI.

Comment on lines +265 to +267
if isinstance(e, LiteLLMAuthenticationError):
    # Do not retry on authentication errors
    raise e

Nice! When we raise, should we create a colored logger to make it clear that no keys were provided?

@@ -145,10 +149,40 @@ def _invoke_loop(self):
    if self._is_context_length_exceeded(e):
        self._handle_context_length()
        continue
    elif self._is_litellm_authentication_error(e):

Nice, you did it here! Beautiful!

goal="test goal",
backstory="test backstory",
llm=LLM(model="gpt-4"),
max_retry_limit=0, # Disable retries for authentication errors

Shouldn't this work without doing this? As in, if that error happens, should we drop max_retry_limit to 0?

@bhancockio merged commit 67f0de1 into main on Jan 22, 2025
4 checks passed
Linked issue: [BUG] kickoff hangs when LLM call fails