
Resource limit #55

Draft: wants to merge 15 commits into master

Conversation

mrputty (Contributor) commented Oct 18, 2020

Before going further with this, I thought I'd post a draft for design review. Does what I have here make sense? It meets all my requirements, and though this isn't fully backwards compatible, the changes required to integrate it into my project were straightforward.

Suggested order for review:

  1. src/OddJobs/Types.hs
  2. src/OddJobs/ConfigBuilder.hs
  3. src/OddJobs/Job.hs
  4. Everything else

Update 2020-10-20: I decided to move forward with some additional work, so I'll keep pushing commits to this review as I go. @saurabhnanda, I've merged your fix/tests branch from #4 into this branch so I could get the existing tests updated for my changes.

saurabhnanda and others added 4 commits October 16, 2020 23:34
LogJobStart was not being logged - FIXED
…saurabhnanda#38)

Many small changes were required, but the majority were mechanical
changes. With the exception of the `createJob`/`scheduleJob` family of
functions, this should be mostly backwards compatible with existing
user code that does not use "internal" functions. Further discussion
about how to handle backwards compatibility is needed.

Note also that this commit by itself is NOT compatible with existing
databases. A future commit will address migration from existing to new
database definitions.

The intention is for this feature to have a negligible impact on
performance when the resource limiting features are not being used,
and only a minor impact when they are used. However, performance
testing has not yet been done.

Material changes include:

- New resource table
- New `resource_id` column in the job table
- New `cfgDefaultResourceLimit` configuration value
- New `createResourceJob` and `scheduleResourceJob` functions
- New `TableNames` type to carry the now multiple values needed
- Updated existing functions and queries to account for these changes
- Updated dequeueing queries to implement resource concurrency limits
- New `DevelDb` module to the `devel` executable to help ad hoc testing

TODO:
- Resource record management functions
- Migration functionality (need to discuss database versioning)
- Unit tests
- Performance tests
- Documentation updates
- Resource management UI
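For orientation, the schema changes listed above could look roughly like the following. This is a hypothetical sketch only: the table and column names are inferred from the change list, not taken from the PR's actual migration.

```haskell
-- Hypothetical DDL for the new resource table and the new `resource_id`
-- column on the job table; names are illustrative, not the PR's actual
-- migration.
createResourceTableSql :: String
createResourceTableSql = unlines
  [ "create table resources"
  , "  ( resource_id    text primary key"
  , "  , resource_limit int not null"
  , "  );"
  ]

addResourceColumnSql :: String
addResourceColumnSql =
  "alter table jobs add column resource_id text references resources(resource_id);"

main :: IO ()
main = putStrLn createResourceTableSql >> putStrLn addResourceColumnSql
```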
@@ -0,0 +1,49 @@
{-# LANGUAGE DeriveAnyClass #-}
mrputty (Contributor Author):

This stuff was helpful for me during testing so I thought I'd leave it here.

@@ -35,8 +35,22 @@ import Control.Monad.Logger (LogLevel)
-- @
type TableName = PGS.Query
mrputty (Contributor Author):

The changes for #53 will need to be integrated here.

pgEventName :: TableName -> Query
pgEventName tname = "job_created_" <> tname
-- | Specifies the table names used for job management.
data TableNames = TableNames
mrputty (Contributor Author):

This seemed the obvious name, but if there are any future features that might require more database configuration than just table names, it might be worthwhile to make this name more general now.

saurabhnanda (Owner):

Simpler to add this in the top-level Config object with cfgJobTableName being mandatory, and cfgJobResourceTableName being a Maybe x?
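A minimal sketch of that suggestion, where resource limiting is opt-in per config. Only the two field names come from the comment; the record and helper are illustrative:

```haskell
-- Sketch: job table name mandatory, resource table optional (Maybe), so
-- the resource feature is enabled only when a resource table is given.
data Config = Config
  { cfgJobTableName         :: String
  , cfgJobResourceTableName :: Maybe String
  }

resourceLimitingEnabled :: Config -> Bool
resourceLimitingEnabled cfg =
  case cfgJobResourceTableName cfg of
    Nothing -> False
    Just _  -> True

main :: IO ()
main = print (resourceLimitingEnabled (Config "jobs" Nothing))
```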

newtype JobRunnerName = JobRunnerName { unJobRunnerName :: Text } deriving (Eq, Show, FromField, ToField, Generic, ToJSON, FromJSON)

newtype ResourceId = ResourceId { unResourceId :: Text } deriving (Eq, Show, FromField, ToField, Generic, ToJSON, FromJSON)
mrputty (Contributor Author):

TODO: Add a comment on the type explaining its purpose as an opaque identifier.

saurabhnanda (Owner):

Style nitpick - I've started preferring rawX instead of unX, so rawResourceId

, cfgOnJobTimeout = (const $ pure ())
, cfgConcurrencyControl = ccControl
, cfgDefaultResourceLimit = 1
mrputty (Contributor Author):

My thought is that 1 (meaning jobs with a resource with no limit would "allow execution with minimal impact") is a reasonable default. The other more or less reasonable defaults are 0 (meaning these jobs would never run) and maxBound (meaning there would effectively be no limit to the concurrency for these jobs). Thoughts?
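The three candidate defaults can be read off a pure mirror of the SQL `coalesce` fallback (the function name here is hypothetical, for illustration only):

```haskell
import Data.Maybe (fromMaybe)

-- Mirrors coalesce(resource_limit, default): a job whose resource has no
-- explicit limit gets the configured default limit.
effectiveLimit :: Maybe Int -> Int -> Int
effectiveLimit explicitLimit def = fromMaybe def explicitLimit

main :: IO ()
main = do
  print (effectiveLimit Nothing 0)         -- default 0: such jobs never run
  print (effectiveLimit Nothing 1)         -- default 1: minimal concurrency
  print (effectiveLimit Nothing maxBound)  -- default maxBound: effectively unlimited
  print (effectiveLimit (Just 5) 1)        -- an explicit limit always wins
```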

@@ -63,14 +63,15 @@ mkConfig logger tname dbpool ccControl jrunner configOverridesFn =
, cfgDbPool = dbpool
, cfgOnJobStart = (const $ pure ())
, cfgDefaultMaxAttempts = 10
, cfgTableName = tname
, cfgTableNames = simpleTableNames tname
mrputty (Contributor Author):

This allows backwards compatibility, assuming that for simple use cases, users don't plan on using resources and don't care about controlling every table name.

"update " <> tnJob tnames <>
"\n set status = ?, locked_at = now(), locked_by = ?, attempts=attempts+1" <>
"\nwhere id in ( select id from " <> tnJob tnames <> " j_out" <>
"\n where (run_at<=now() AND (status in ? OR (status = ? and locked_at < now() - ? * interval '1 second')))" <>
mrputty (Contributor Author):

I am strongly conditioned, by painful experience, to do datetime evaluation either on the database client or on the database server, but not both, wherever possible. Here, I've moved all the datetime logic into the query.

saurabhnanda (Owner):

I agree about using the DB as the single source of truth for current_timestamp.

"\n and j_in.locked_at >= now() - ? * interval '1 second' )" <>
"\n < coalesce(( select resource_limit from " <> tnResource tnames <> " r" <>
"\n where r.resource_id = j_out.resource_id ), ?)" <>
"\n end" <>
mrputty (Contributor Author):

This is the core of the feature. You can see that in the case where there is no resource, performance should not really be affected. The sum and the resource limit lookup are both supported by indices, so hopefully they are also performant, but testing is needed.

By doing this all within the query, there is no need for any additional "bookkeeping" (and no need for the transaction management issues that would bring). This is why I was adamant about doing this inside the library. Of course all of this could be externalized, but, like transaction management, that could be rather disruptive.
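For readers skimming the diff, the gate being described has roughly this shape. This is a simplified sketch with assumed table names (`jobs`, `resources`) and without the lock-timeout window; the real query is parameterized and built from `TableNames`:

```haskell
import Data.List (isInfixOf)

-- Simplified shape of the resource gate: a candidate row is eligible when
-- it has no resource, or when the count of currently locked jobs on the
-- same resource is below coalesce(resource_limit, default).
resourceGateSql :: String
resourceGateSql = unlines
  [ "j_out.resource_id is null"
  , "or ( select count(*) from jobs j_in"
  , "     where j_in.resource_id = j_out.resource_id"
  , "       and j_in.status = 'locked' )"
  , "   < coalesce(( select resource_limit from resources r"
  , "                where r.resource_id = j_out.resource_id ), 1)"
  ]

main :: IO ()
main = putStr resourceGateSql
```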

saurabhnanda (Owner):

Let's discuss just this SQL query first. The other stuff is about naming, style nit-picks, etc.

@@ -532,21 +557,33 @@ jobEventListener :: (HasJobRunner m)
jobEventListener = do
log LevelInfo $ LogText "Starting the job monitor via LISTEN/NOTIFY..."
pool <- getDbPool
tname <- getTableName
mrputty (Contributor Author):

Here too, I moved the datetime logic into the database.

mrputty (Contributor Author) left a comment:

Suggested order for review:

  1. src/OddJobs/Types.hs
  2. src/OddJobs/ConfigBuilder.hs
  3. src/OddJobs/Job.hs
  4. Everything else

mrputty mentioned this pull request Oct 18, 2020
kanagarajkm and others added 11 commits October 19, 2020 16:07
* setup a simple CI using GitHub actions

* describe a simple dev setup process

* build branches with pull requests against master, run build on multiple operating systems

* trigger CI for all branches

* always checkout the ref

* build only on pull_request events not to duplicate builds

* build only on Ubuntu
Because PostgreSQL limits notification payloads to 8000 bytes by
default, passing the full job record in the notification artificially
limits the job payload size.  By passing only required fields in the
notification, this limitation is removed.
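As an illustration of the approach (the field set and hand-rolled encoding here are assumptions, not the PR's actual payload):

```haskell
-- Build a small NOTIFY payload carrying only what's needed to locate the
-- job, instead of the full row, so PostgreSQL's 8000-byte notification
-- limit no longer constrains the job's own payload size.
notifyPayload :: Int -> String -> String
notifyPayload jobId runAt =
  "{\"id\":" <> show jobId <> ",\"run_at\":\"" <> runAt <> "\"}"

main :: IO ()
main = putStrLn (notifyPayload 42 "2020-10-19T16:07:00Z")
```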
…saurabhnanda#38)

Because there are unit tests that delete tables, and now that there are
multiple table names in play, it seemed useful to explicitly implement
a deletion function. Because this deserved a comment, I also added
comments to the other externally useful functions in this module, and
rearranged the file a bit to highlight these definitions.
Now, if the environment variable "ODD_JOBS_TEST_DB_CONNECT" is set, it
will be used as the connection string for unit tests. If not, the
existing hard-coded default connection will be used.

This is useful when developing in a containerized environment.
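The lookup described above amounts to the following; the default string here is a placeholder, not the project's actual hard-coded connection string:

```haskell
import Data.Maybe (fromMaybe)
import System.Environment (lookupEnv)

-- Pure part: use the env value when present, else the hard-coded default.
connStringFrom :: Maybe String -> String
connStringFrom = fromMaybe "host=localhost dbname=odd_jobs_test"

-- IO wrapper consulted during test setup.
testDbConnString :: IO String
testDbConnString = connStringFrom <$> lookupEnv "ODD_JOBS_TEST_DB_CONNECT"

main :: IO ()
main = testDbConnString >>= putStrLn
```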
This includes the changes that are on `fix/tests` branch.
saurabhnanda (Owner) left a comment:

Do we have an explain analyze for the before/after versions of jobPollingSql? Let's start with the perf impact there.


@@ -158,9 +172,10 @@ instance FromJSON Status where
Left e -> fail e
Right r -> pure r


newtype JobRunnerName = JobRunnerName { unJobRunnerName :: Text } deriving (Eq, Show, FromField, ToField, Generic, ToJSON, FromJSON)
saurabhnanda (Owner):

TODO: Change to rawJobRunnerName

-- that unlike the limit of `cfgConcurrencyControl`, this IS a global limit
-- across the entire job-queue. This means that this limit will be applied
-- regardless of what machine the jobs are running on.
, cfgDefaultResourceLimit :: Int
saurabhnanda (Owner):

Another thought:

cfgPerResourceConcurrencySettings :: Maybe (TableName, Int)

"update " <> tnJob tnames <>
"\n set status = ?, locked_at = now(), locked_by = ?, attempts=attempts+1" <>
"\nwhere id in ( select id from " <> tnJob tnames <> " j_out" <>
"\n where (run_at<=now() AND (status in ? OR (status = ? and locked_at < now() - ? * interval '1 second')))" <>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about using the DB as the single source of truth for current_timetamp.

"\n and j_in.locked_at >= now() - ? * interval '1 second' )" <>
"\n < coalesce(( select resource_limit from " <> tnResource tnames <> " r" <>
"\n where r.resource_id = j_out.resource_id ), ?)" <>
"\n end" <>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss just this SQL query first. The other stuff is about naming, style nit-picks, etc.

saurabhnanda (Owner):

Apart from the perf concerns, I was also wondering if there is any way we can support a simpler use-case: concurrency limits on a job-type basis. With this PR as-is, users who have a simpler requirement of throttling certain job-types will have to take an additional step of attaching a resource-id to each job.

mrputty (Contributor Author) commented Oct 22, 2020

(@saurabhnanda, I missed this earlier, and I don't know how to reply to this inline above, so let's move more general stuff here.)

Apart from the perf concerns, I was also wondering if there is any way we can support a simpler use-case: concurrency limits on a job-type basis. With this PR as-is, users who have a simpler requirement of throttling certain job-types will have to take an additional step of attaching a resource-id to each job.

Yes, there would have to be an explicit resource created for each throttled job type. Where would the resource/job limit be specified and stored if not for that? I suppose you could have a helper function that would allow a user to declare a job-level limit in code, I guess as part of their code that manages migrations?

Hmm, I'm having trouble wrapping my head around how this would work. Could you flesh out the requirements of what you would like to see?
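One possible shape for the helper being mused about here (entirely hypothetical, not code from the PR): derive the resource id deterministically from the job type, so throttling a job type reduces to creating a single resource row.

```haskell
newtype ResourceId = ResourceId { unResourceId :: String } deriving (Eq, Show)

-- Hypothetical helper: one resource per job type, with a stable id, so a
-- per-job-type concurrency limit reuses the resource mechanism unchanged.
jobTypeResource :: String -> ResourceId
jobTypeResource jobType = ResourceId ("job-type:" <> jobType)

main :: IO ()
main = print (jobTypeResource "SendEmail")
```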

mrputty (Contributor Author) commented Nov 5, 2020

I have spent a bit of time validating the performance of the core query changes. I have an updated query and will post some analysis soon. Unfortunately, I've been distracted by other work and looming deadlines. I will get back to this as soon as I can.

ivb-supercede added a commit to SupercedeTech/odd-jobs that referenced this pull request Sep 2, 2021
This is based on the system in saurabhnanda#55, and aims to
address saurabhnanda#38, where jobs need to have some limited
access to resources that control how many can run simultaneously.

Unlike that PR, this implements a system where jobs can require access
to 0 or more resources, with different amounts of usage. This is because
of a business need for jobs to be able to require multiple resources.

The implementation is intended to have no performance impact on existing
code wherever the user has not opted in to resource-based concurrency.
This is done by having parallel implementations wherever necessary that
switch based on the concurrency control chosen.

This subsumes and replaces the job-type-based concurrency control,
because it is possible to model that system using job types as
resources.
ivb-supercede added a commit to SupercedeTech/odd-jobs that referenced this pull request Sep 2, 2021
ivb-supercede added a commit to SupercedeTech/odd-jobs that referenced this pull request Feb 1, 2023