Resource limit #55
base: master
Conversation
LogJobStart was not being logged - FIXED
…saurabhnanda#38)

Many small changes were required, but the majority were mechanical. With the exception of the `createJob`/`scheduleJob` family of functions, this should be mostly backwards compatible with existing user code that does not use "internal" functions. Further discussion about how to handle backwards compatibility is needed.

Note also that this commit by itself is NOT compatible with existing databases. A future commit will address migration from existing to new database definitions.

The intention is for this feature to have a negligible impact on performance when the resource-limiting features are not being used, and only a minor impact when they are used. However, performance testing has not yet been done.

Material changes include:

- New resource table
- New `resource_id` column in the job table
- New `cfgDefaultResourceLimit` configuration value
- New `createResourceJob` and `scheduleResourceJob` functions
- New `TableNames` type to carry the now multiple values needed
- Updated existing functions and queries to account for these changes
- Updated dequeueing queries to implement resource concurrency limits
- New `DevelDb` module in the `devel` executable to help ad hoc testing

TODO:

- Resource record management functions
- Migration functionality (need to discuss database versioning)
- Unit tests
- Performance tests
- Documentation updates
- Resource management UI
@@ -0,0 +1,49 @@
{-# LANGUAGE DeriveAnyClass #-}
This stuff was helpful for me during testing so I thought I'd leave it here.
@@ -35,8 +35,22 @@ import Control.Monad.Logger (LogLevel)
-- @
type TableName = PGS.Query
The changes for #53 will need to be integrated here.
pgEventName :: TableName -> Query
pgEventName tname = "job_created_" <> tname

-- | Specifies the table names used for job management.
data TableNames = TableNames
This seemed the obvious name, but if there are any future features that might require more database configuration than just table names, it might be worthwhile to make this name more general now.
Simpler to add this in the top-level `Config` object, with `cfgJobTableName` being mandatory and `cfgJobResourceTableName` being a `Maybe x`?
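For concreteness, a minimal sketch of what that suggested shape could look like — only the two field names come from this suggestion; the surrounding record is hypothetical, not code from the PR:

```haskell
import qualified Database.PostgreSQL.Simple as PGS

type TableName = PGS.Query

-- Hypothetical sketch of the suggestion above: keep the job table name
-- directly on the config (mandatory, as today) and make the resource table
-- optional, so that Nothing means resource limiting is not in use.
data ConfigSketch = ConfigSketch
  { cfgJobTableName         :: TableName
  , cfgJobResourceTableName :: Maybe TableName
  }
```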
newtype JobRunnerName = JobRunnerName { unJobRunnerName :: Text } deriving (Eq, Show, FromField, ToField, Generic, ToJSON, FromJSON)

newtype ResourceId = ResourceId { unResourceId :: Text } deriving (Eq, Show, FromField, ToField, Generic, ToJSON, FromJSON)
TODO: Add a comment on the type explaining its purpose as an opaque identifier.
Style nitpick - I've started preferring `rawX` instead of `unX`, so `rawResourceId`.
, cfgOnJobTimeout = (const $ pure ())
, cfgConcurrencyControl = ccControl
, cfgDefaultResourceLimit = 1
My thought is that `1` (meaning jobs with a resource with no limit would "allow execution with minimal impact") is a reasonable default. The other more or less reasonable defaults are `0` (meaning these jobs would never run) and `maxBound` (meaning there would effectively be no limit to the concurrency for these jobs). Thoughts?
src/OddJobs/ConfigBuilder.hs (outdated)
@@ -63,14 +63,15 @@ mkConfig logger tname dbpool ccControl jrunner configOverridesFn =
, cfgDbPool = dbpool
, cfgOnJobStart = (const $ pure ())
, cfgDefaultMaxAttempts = 10
, cfgTableName = tname
, cfgTableNames = simpleTableNames tname
This allows backwards compatibility, assuming that for simple use cases, users don't plan on using resources and don't care about controlling every table name.
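The definition of `simpleTableNames` is not shown in this excerpt; a plausible sketch of the intent, assuming it derives a resource-table name from the single job-table name users already pass (the `_resource` suffix is an assumption, not necessarily the PR's actual naming):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Database.PostgreSQL.Simple as PGS

type TableName = PGS.Query

-- Mirrors the PR's TableNames record as quoted in this review.
data TableNames = TableNames
  { tnJob      :: TableName
  , tnResource :: TableName
  }

-- Illustrative only: derive both table names from the one name that existing
-- users already supply, so old mkConfig call sites keep working unchanged.
simpleTableNames :: TableName -> TableNames
simpleTableNames jobTable = TableNames
  { tnJob      = jobTable
  , tnResource = jobTable <> "_resource"
  }
```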
"update " <> tnJob tnames <> | ||
"\n set status = ?, locked_at = now(), locked_by = ?, attempts=attempts+1" <> | ||
"\nwhere id in ( select id from " <> tnJob tnames <> " j_out" <> | ||
"\n where (run_at<=now() AND (status in ? OR (status = ? and locked_at < now() - ? * interval '1 second')))" <> |
I am strongly conditioned, by painful experience, to do datetime evaluation either on the database client or on the database server, but not both, wherever possible. Here, I've moved all the datetime logic into the query.
I agree about using the DB as the single source of truth for `current_timestamp`.
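To make the clock-mixing concern concrete, a small hypothetical contrast — the `jobs` table and its columns below are placeholders, not this project's schema:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Time (UTCTime)
import qualified Database.PostgreSQL.Simple as PGS

-- Mixing clocks: the cutoff is computed from the client's clock, while the
-- row's locked_at was written using the server's clock.
staleJobsMixedClocks :: PGS.Connection -> UTCTime -> IO [PGS.Only Int]
staleJobsMixedClocks conn cutoff =
  PGS.query conn "select id from jobs where locked_at < ?" (PGS.Only cutoff)

-- Single clock: only the timeout (in seconds) crosses the wire and the
-- server evaluates now() itself, as the query in this PR does.
staleJobsServerClock :: PGS.Connection -> Int -> IO [PGS.Only Int]
staleJobsServerClock conn lockTimeoutSecs =
  PGS.query conn
    "select id from jobs where locked_at < now() - ? * interval '1 second'"
    (PGS.Only lockTimeoutSecs)
```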
"\n and j_in.locked_at >= now() - ? * interval '1 second' )" <> | ||
"\n < coalesce(( select resource_limit from " <> tnResource tnames <> " r" <> | ||
"\n where r.resource_id = j_out.resource_id ), ?)" <> | ||
"\n end" <> |
This is the core of the feature. You can see that in the case where there is no resource, the performance should not really be affected. The sum and the resource limit are both supported by indices, so hopefully, they are also performant, but testing is needed.
By doing this all within the query, there is no need for any additional "bookkeeping" (and no need for the transaction management issues that would bring). This is why I was adamant about doing this inside the library. Of course all of this could be externalized, but, like transaction management, that could be rather disruptive.
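For readers without the full diff, here is a simplified, illustrative reconstruction of the shape of that guard; it is not the PR's exact SQL. The table-name accessors follow the fragments quoted above, the trailing `?` is the `cfgDefaultResourceLimit` fallback, and the real query counts "currently running" jobs by `locked_at` recency rather than the status check used here for brevity:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Database.PostgreSQL.Simple as PGS

-- Minimal context so the sketch stands alone; mirrors the PR's TableNames.
type TableName = PGS.Query
data TableNames = TableNames { tnJob :: TableName, tnResource :: TableName }

-- Simplified sketch of the dequeue guard: a candidate job (j_out) is
-- eligible when it has no resource, or when the number of jobs currently
-- holding the same resource is below that resource's limit, with coalesce
-- falling back to cfgDefaultResourceLimit when no resource row exists.
resourceGuardSketch :: TableNames -> PGS.Query
resourceGuardSketch tnames =
  "and case when j_out.resource_id is null then true" <>
  "\n    else ( select count(*) from " <> tnJob tnames <> " j_in" <>
  "\n           where j_in.resource_id = j_out.resource_id" <>
  "\n             and j_in.status = 'locked' )" <>
  "\n         < coalesce(( select resource_limit from " <> tnResource tnames <> " r" <>
  "\n                      where r.resource_id = j_out.resource_id ), ?)" <>
  "\n    end"
```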
Let's discuss just this SQL query first. The other stuff is about naming, style nit-picks, etc.
@@ -532,21 +557,33 @@ jobEventListener :: (HasJobRunner m)
jobEventListener = do
log LevelInfo $ LogText "Starting the job monitor via LISTEN/NOTIFY..."
pool <- getDbPool
tname <- getTableName
Here too, I moved the datetime logic into the database.
Suggested order for review:

- src/OddJobs/Types.hs
- src/OddJobs/ConfigBuilder.hs
- src/OddJobs/Job.hs
- Everything else
* setup a simple CI using GitHub actions
* describe a simple dev setup process
* build branches with pull requests against master, run build on multiple operating systems
* trigger CI for all branches
* always checkout the ref
* build only on pull_request events not to duplicate builds
* build only on Ubuntu
Because PostgreSQL limits notification payloads to 8000 bytes by default, passing the full job record in the notification artificially limits the job payload size. By passing only required fields in the notification, this limitation is removed.
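A sketch of the idea from the listener's side; the field set and names here are hypothetical, the point is only that the payload carries identifying fields rather than the whole job row:

```haskell
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric  #-}

import Data.Aeson (FromJSON, ToJSON)
import Data.Time (UTCTime)
import GHC.Generics (Generic)

-- Hypothetical minimal NOTIFY payload: just enough for the listener to
-- decide whether to fetch and lock the job. The job's own (possibly large)
-- payload stays in the table, so the 8000-byte notification limit no longer
-- constrains it.
data JobCreatedNotification = JobCreatedNotification
  { notifJobId :: Int
  , notifRunAt :: UTCTime
  } deriving (Show, Generic, ToJSON, FromJSON)
```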
Because there are unit tests that delete tables, and now that there are multiple table names in play, it seemed useful to explicitly implement a deletion function. Because this deserved a comment, I also added comments to the other externally useful functions in this module, and rearranged the file a bit to highlight these definitions.
Now, if the environment variable "ODD_JOBS_TEST_DB_CONNECT" is set, it will be used as the connection string for unit tests. If not, the existing hard-coded default connection will be used. This is useful when developing in a containerized environment.
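A minimal sketch of that lookup (the fallback connection string below is a placeholder, not the project's actual hard-coded value):

```haskell
import qualified Data.ByteString.Char8 as BS
import System.Environment (lookupEnv)

-- Prefer ODD_JOBS_TEST_DB_CONNECT when set; otherwise fall back to a
-- hard-coded default (placeholder shown here).
testDbConnectString :: IO BS.ByteString
testDbConnectString = do
  mConnStr <- lookupEnv "ODD_JOBS_TEST_DB_CONNECT"
  pure $ maybe (BS.pack "host=localhost dbname=odd_jobs_test user=postgres")
               BS.pack
               mConnStr
```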
This includes the changes that are on the `fix/tests` branch.
Do we have an `explain analyze` for the before/after versions of `jobPollingSql`? Let's start with the perf impact there.
@@ -158,9 +172,10 @@ instance FromJSON Status where
Left e -> fail e
Right r -> pure r

newtype JobRunnerName = JobRunnerName { unJobRunnerName :: Text } deriving (Eq, Show, FromField, ToField, Generic, ToJSON, FromJSON)
TODO: Change to `rawJobRunnerName`.
-- that unlike the limit of `cfgConcurrencyControl`, this IS a global limit
-- across the entire job-queue. This means that this limit will be applied
-- regardless of what machine the jobs are running on.
, cfgDefaultResourceLimit :: Int
Another thought: `cfgPerResourceConcurrencySettings :: Maybe (TableName, Int)`
"update " <> tnJob tnames <> | ||
"\n set status = ?, locked_at = now(), locked_by = ?, attempts=attempts+1" <> | ||
"\nwhere id in ( select id from " <> tnJob tnames <> " j_out" <> | ||
"\n where (run_at<=now() AND (status in ? OR (status = ? and locked_at < now() - ? * interval '1 second')))" <> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree about using the DB as the single source of truth for current_timetamp.
"\n and j_in.locked_at >= now() - ? * interval '1 second' )" <> | ||
"\n < coalesce(( select resource_limit from " <> tnResource tnames <> " r" <> | ||
"\n where r.resource_id = j_out.resource_id ), ?)" <> | ||
"\n end" <> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss just this SQL query first. The other stuff is about naming, style nit-picks, etc.
Apart from the perf concerns, I was also wondering if there is any way we can support a simpler use-case: concurrency limits on a job-type basis. With this PR as-is, users who have a simpler requirement of throttling certain job-types will have to take an additional step of attaching a resource-id to each job.
(@saurabhnanda, I missed this earlier, and I don't know how to reply to this inline above, so let's move more general stuff here.)
Yes, there would have to be an explicit resource created for each throttled job type. Where would the resource/job limit be specified and stored, if not there? I suppose you could have a helper function that would allow a user to declare a job-type-level limit in code, I guess as part of their code that manages migrations? Hmm, I'm having trouble wrapping my head around how this would work. Could you flesh out the requirements of what you would like to see?
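One possible shape for such a helper, purely as a strawman: derive one resource per job-type and attach it at enqueue time (via this PR's `createResourceJob`, whose exact signature is not shown in this excerpt), with a resource row whose `resource_limit` is the desired per-type concurrency.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

-- Re-stating the PR's newtype so the sketch stands alone.
newtype ResourceId = ResourceId { unResourceId :: Text }

-- Strawman for the "throttle by job-type" use-case: one ResourceId per
-- job-type. The "job-type:" prefix is arbitrary; the caller would still
-- need to insert a resource row with the desired resource_limit.
resourceForJobType :: Text -> ResourceId
resourceForJobType jobType = ResourceId ("job-type:" <> jobType)
```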
I have spent a bit of time working through validating the performance of the core query changes. I have an updated query and will post some analysis soon. Unfortunately, I've been distracted by other work and looming deadlines. I will get back to this as soon as I can.
This is based on the system in saurabhnanda#55, and aims to address saurabhnanda#38, where jobs need limited access to resources that control how many can run simultaneously. Unlike that PR, this implements a system where jobs can require access to zero or more resources, with different amounts of usage, because of a business need for jobs to be able to require multiple resources. The implementation is intended to have no performance impact on existing code wherever the user has not opted in to resource-based concurrency; this is done by having parallel implementations wherever necessary that switch based on the concurrency control chosen. This subsumes and replaces the job-type-based concurrency control, because it is possible to model that system using job types as resources.
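To illustrate the "zero or more resources, with different amounts of usage" idea described above, a hypothetical modelling (these names are not the follow-up PR's actual API):

```haskell
import Data.Text (Text)

newtype ResourceId = ResourceId { unResourceId :: Text }

-- Each job declares how much of each resource it consumes; a job may run
-- only while, for every resource it touches, the summed usage across
-- running jobs stays within that resource's limit.
data ResourceUsage = ResourceUsage
  { ruResource :: ResourceId
  , ruAmount   :: Int  -- ^ how much of the resource this job consumes
  }

-- A job requiring no resources carries an empty list, which is how the
-- non-resource-limited path can remain unaffected.
type JobResources = [ResourceUsage]
```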
Before going further with this, I thought I'd post a draft for design review. Does what I have here make sense? It meets all my requirements, and though this isn't fully backwards compatible, the changes required to integrate this into my project were straightforward.
Suggested order for review:

- src/OddJobs/Types.hs
- src/OddJobs/ConfigBuilder.hs
- src/OddJobs/Job.hs
Update 2020-10-20: I decided to move forward with some additional work, so I'll keep pushing commits to this review as I go. @saurabhnanda, I've merged your `fix/tests` branch from #4 into this branch so I could get the existing tests updated for my changes.