Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RSDK-8292] Create Single App Connection and Use for Net Appender and Restart Checker #4746

Merged
merged 52 commits into from
Jan 29, 2025

Conversation

bashar-515
Copy link
Member

@bashar-515 bashar-515 commented Jan 27, 2025

This PR accomplishes two main things:

  1. creates a new type to implement the rpc.ClientConn interface that we'll use as our one global connection to app in unifying the many that already exist.
  2. uses said global connection to replace (or unify, rather) the two separate connections currently created in our NetAppender and RestartChecker.

The global connection works as follows. It's first instantiated - and thus made available to the entire RDK server - when the NetAppender is created in web/server/entrypoint.go. This could change in coming PR's as more connections are roped into this global one. A more important focal point is of this PR is how the connection is created [and managed]. When the grpc.NewAppConn() is called, a dial attempt to App is made. This initial attempt blocks for some time. If not successful, a Goroutine will spin in off in the background that'll repeatedly attempt to dial App. These attempts are serialized, and each times out after 5 seconds. Once a successful connection is created, the Goroutine is done.

Testing
NetAppender
Offline startup | NetAppender fails logging to network because it's "not connected."
Offline startup -> connection | NetAppender fails logging to network at first but is then able to, signifying background Goroutine continually attempting to dial App works.
Offline startup -> connection -> disconnection -> reconnection | NetAppender fails logging to network at first but is then able to. After disconnecting from the internet, NetAppender logs time out. Logs succeed when an internet connection returns.
Online startup -> disconnection -> connection | NetAppender is able to log at first. Once WiFi is cut out, logs time out but succeed after it returns.

RestartChecker
Same as above. Signal to restart was received after multiple disconnections/reconnections.

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Jan 27, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 27, 2025
)

type AppConn struct {
ReconfigurableClientConn
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that an AppConn instance attempts to establish a connection in a background Goroutine, all calls to its methods would require a nil check and the acquisition of a lock. This code/functionality is already written out in the ReconfigurableClientConn, so I figured it'd be best to just reuse that.

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 27, 2025
grpc/app_conn.go Outdated
continue
}

appConn.Err = err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I'm handling errors returned by the background dials by simply storing them in the AppConn struct and stopping the repeating dial attempts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my other question around this, I'm not sure if we need this field (yet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually we'd want to expose whether the connection is connected or not, but that can come later (the connection being not nil does not tell us whether it is disconnected or not)

@bashar-515 bashar-515 requested a review from cheukt January 27, 2025 17:12
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 27, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 27, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 27, 2025
grpc/app_conn.go Outdated
for {
appConn.connMu.Lock()

ctxWithTimeOut, ctxWithTimeOutCancel := context.WithTimeout(context.Background(), 5*time.Second)
Copy link
Member Author

@bashar-515 bashar-515 Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use the context.Context passed to NewAppConn() when creating this context.WithTimeout because the one passed as a function parameter has its own timeout and is cancelled when this function returns (during which this Goroutine might still be running).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using the context passed in would be good because then a context cancellation stemming from killing the viam-server will propagate properly

@@ -191,13 +192,17 @@ func RunServer(ctx context.Context, args []string, _ logging.Logger) (err error)
// Start remote logging with config from disk.
// This is to ensure we make our best effort to write logs for failures loading the remote config.
if cfgFromDisk.Cloud != nil && (cfgFromDisk.Cloud.LogPath != "" || cfgFromDisk.Cloud.AppAddress != "") {
ctxWithTimeout, ctxWithTimeoutCancel := config.GetTimeoutCtx(ctx, true, cfgFromDisk.Cloud.ID)
appConn, err := grpc.NewAppConn(ctxWithTimeout, cfgFromDisk.Cloud, logger) // TODO(RSDK-8292): [q] what logger should I pass here?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[qu]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a sublogger, possibly networking.app_connection

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 27, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 27, 2025
@viambot viambot removed the safe to test This pull request is marked safe to test from a trusted zone label Jan 27, 2025
@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Jan 29, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 29, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 29, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 29, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 29, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 29, 2025
@bashar-515 bashar-515 marked this pull request as ready for review January 29, 2025 19:12
@bashar-515 bashar-515 requested a review from cheukt January 29, 2025 19:12
Copy link
Member

@cheukt cheukt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

web/server/entrypoint.go Show resolved Hide resolved
grpc/app_conn.go Outdated
}
})

// if initial dial attempt fails due to time out, return nil error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment here

@@ -406,6 +406,7 @@ func (w *remoteLogWriterGRPC) write(ctx context.Context, logs []*commonpb.LogEnt
return nil
}

// TODO(RSDK-8292): [qu] do we need this anymore?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment since this isn't a question anymore

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 29, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 29, 2025
@bashar-515
Copy link
Member Author

@cheukt re: our convo offline. I tested with a config that doesn't have a "cloud" component, and it seems to work fine. The RDK continue to run as expected. While I was at it, I also attempted including cloud information but leaving out fields and having invalid values, also seems to be good.

No cloud config: RDK runs as normal
Cloud config but only app address: server returns bc missing ID
Cloud config but only app address and ID: server returns bc missing secret
Cloud config but invalid secret: connection does not get established, but RDK continues running
Cloud config but invalid ID: server returns bc missing cached config and can't retrieve config

@bashar-515 bashar-515 merged commit a5a072f into viamrobotics:main Jan 29, 2025
16 checks passed
@bashar-515 bashar-515 deleted the RSDK-8292 branch January 29, 2025 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants