-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
observability: annotate Session+SessionPool events #1207
base: main
Are you sure you want to change the base?
Conversation
8d6f2b7
to
4bc937e
Compare
4bc937e
to
0d5bf26
Compare
0d5bf26
to
3617921
Compare
f71d05c
to
eb9dd5a
Compare
This change adds annotations for session and session pool events to aid customers in debugging latency issues with session pool malevolence and also for maintainers to figure out which session pool type is the most appropriate. Updates googleapis#1170
This allows us to avoid the flakes with OpenTelemetry's inMemorySpanExporter that might not have received the exported spans yet.
b1ea772
to
ab23f09
Compare
Kindly cc-ing you @harshachinta. |
start_time = time.time() | ||
current_span = get_current_span() | ||
current_span.add_event( | ||
"Waiting for a session to become available", {"kind": "pinging_pool"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this not "Acquiring a session"? Lets maintain span consistency across all the pools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because really it is waiting for a session to become available per
python-spanner/google/cloud/spanner_v1/pool.py
Lines 218 to 238 in b2e6762
def get(self, timeout=None): | |
"""Check a session out from the pool. | |
:type timeout: int | |
:param timeout: seconds to block waiting for an available session | |
:rtype: :class:`~google.cloud.spanner_v1.session.Session` | |
:returns: an existing session from the pool, or a newly-created | |
session. | |
:raises: :exc:`queue.Empty` if the queue is empty. | |
""" | |
if timeout is None: | |
timeout = self.default_timeout | |
session = self._sessions.get(block=True, timeout=timeout) | |
if not session.exists(): | |
session = self._database.session() | |
session.create() | |
return session |
and in our document requirements there is a clear need to add that case
Kindly please take another look @harshachinta, feedback addressed, thank you for the code review! |
raise | ||
else: | ||
span.set_status(Status(StatusCode.OK)) | ||
|
||
|
||
def set_span_error_and_record_exception(span, exc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is not used anywhere else, we can revert it back as it was.
resp = api.batch_create_sessions( | ||
request=request, | ||
metadata=metadata, | ||
with trace_call("CloudSpanner.BatchCreateSessions", self) as span: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like trace_call is not imported here.
from google.cloud.spanner_v1._opentelemetry_tracing import trace_call
google/cloud/spanner_v1/pool.py
Outdated
) | ||
|
||
if requested_session_count > 0: | ||
current_span.add_event( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_span is not initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sheesh, sorry I had a bunch of benchmarks running locally and they interrupted my nox -s unit-3.8
tests so didn't see those but all fixed now.
3c0c421
to
7cbbb0f
Compare
|
||
|
||
def get_current_span(): | ||
return trace.get_current_span() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am afraid what happens here if opentelemetry is not installed.
trace will be None
right? then all the places where this function will get invoked will get Null pointer exception.
) | ||
|
||
if requested_session_count > 0: | ||
span.add_event( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearly span
here could be None
if opentelemetry is not installed (see
# Empty context manager. Users will have to check if the generated value is None or a span |
In that case it can throw AttributeError: 'NoneType' object has no attribute 'add_event'
session = self._database.session() | ||
session.create() | ||
span_event_attributes["id"] = self._session_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this should be session.session_id
session = self._sessions.get(block=True, timeout=timeout) | ||
span_event_attributes["session.id"] = session.session_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here session.session_id
can be empty, if session does not exist. Move this attribute addignment to after the if block below, as it ensures that session will be available.
session = self._database.session() | ||
session.create() | ||
span_event_attributes["id"] = self._session_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the attribute we are using for session id?
In some place above it was session.id
, but here it is id
. Can we follow the attribure naming consistently?
|
||
span_event_attributes["time.elapsed"] = time.time() - start_time | ||
current_span.add_event("Acquired session", span_event_attributes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes more sense to have "Acquired Session" and then "Using Session".
Also I feel "Acquired Session" event should exist when a session lies in pool. If there is no session in pool then we are creating session and in that case "Acquired Session" does not make sense.
So the span event hierachy would something be like this.
If session exists in the pool, then
- Acquiring Session
- Waiting for a session to become available
- Acquired Session
- Using Session
If no session in the pool
- Acquiring Session
- Waiting for a session to become available
- No session available
- Creating Session
- CloudSpanner.CreateSession span
- Using Session
Does that sound good?
@surbhigarg92 Do you have any comments on this structure?
@@ -422,6 +480,18 @@ def bind(self, database): | |||
session_template=Session(creator_role=self.database_role), | |||
) | |||
|
|||
requested_session_count = request.session_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you not followed the BatchCreateSession span here as commented earlier somewhere above?
raise ValueError("No session ID set by back-end") | ||
err_message = "No session ID set by back-end" | ||
current_span = get_current_span() | ||
current_span.add_event(err_message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this error not get captured by parent span? Do we need to add it as event here?
if self._session_id is not None: | ||
raise ValueError("Session ID already set by back-end") | ||
err_message = "Session ID already set by back-end" | ||
current_span.add_event(err_message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here also, will this not be captured by current_span
?
@@ -174,6 +175,43 @@ def test_create_w_database_role(self): | |||
"CloudSpanner.CreateSession", attributes=TestSession.BASE_ATTRIBUTES | |||
) | |||
|
|||
def test_create_session_span_annotations(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add unit tests to verify spans for different pool class similar to what is done for create_session?
This change adds annotations for session and session pool events to aid customers in debugging latency issues with session pool malevolence and also for maintainers to figure out which session pool type is the most appropriate.
Updates #1170