Problem: No persistent Quantum State #240

Closed
maennchen opened this issue Aug 11, 2017 · 20 comments

Comments

@maennchen
Member

maennchen commented Aug 11, 2017

Description

Currently quantum does not persist any state. Therefore, if an application is restarted, all dynamic jobs are lost. In addition, no jobs are executed during the time between the stop and the restart, and there is no way to let them redo that lost time.

Minimal Requirements

I don't think that Quantum should choose a storage backend and implement it, since this heavily depends on the application it is integrated into. I propose that we just develop protocols for such a storage and leave the implementation to the user of the library (who can decide whether to use redis / postgres / mnesia / whatever to ensure persistence).

  • Extendable Storage Protocol
  • Config Option to set a Storage Backend Module
  • Config Option whether to redo jobs that were not run between app stop & start (see the config sketch below)
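
For illustration, such a configuration could look roughly like the following. This is a hedged sketch only: the :storage and :redo_missed_jobs option names are assumptions made for this example, not an existing quantum API.

```elixir
# config/config.exs -- hypothetical sketch; option names are assumptions
use Mix.Config

config :my_app, MyApp.Scheduler,
  # user-provided module implementing the storage protocol / behaviour
  storage: MyApp.RedisStorage,
  # re-run jobs that were missed between app stop & start
  redo_missed_jobs: true,
  jobs: [
    {"@daily", {MyApp.Backup, :run, []}}
  ]
```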

Implementation

I strongly recommend waiting for PR #230 before implementing any changes.

The storage would have to be inserted in two places (a rough behaviour sketch follows the list below):

  • JobBroadcaster
    • read on init
      • check if there is a storage present
      • when present read storage instead of config
        • (I see no good way to both read the config and the storage. Since config jobs can be deleted at runtime, there is no safe way to merge storage & config.)
    • write on change
      • after init write config jobs
      • :add
      • :delete
      • :change_state
      • :delete_all
  • ExecutionBroadcaster
    • read on init
      • read last execution date from storage if present and config redo is active
    • write on :execute
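
A rough behaviour sketch covering those touch points could look like this. The module name, callback names, and signatures are illustrative assumptions, not a finished interface.

```elixir
defmodule StorageSketch do
  @moduledoc """
  Illustrative storage behaviour matching the touch points above.
  Names and signatures are assumptions, not the final API.
  """

  # JobBroadcaster: read on init (:not_applicable means "fall back to config jobs")
  @callback jobs(scheduler :: module()) :: :not_applicable | [Quantum.Job.t()]

  # JobBroadcaster: write on change
  @callback add_job(scheduler :: module(), job :: Quantum.Job.t()) :: :ok
  @callback delete_job(scheduler :: module(), name :: atom()) :: :ok
  @callback update_job_state(scheduler :: module(), name :: atom(), state :: :active | :inactive) :: :ok
  @callback purge(scheduler :: module()) :: :ok

  # ExecutionBroadcaster: read last execution date on init, write on :execute
  @callback last_execution_date(scheduler :: module()) :: :unknown | NaiveDateTime.t()
  @callback update_last_execution_date(scheduler :: module(), date :: NaiveDateTime.t()) :: :ok
end
```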

PR

Since this is a fairly big implementation, we'd be happy to receive a PR. If this is a popular request, please leave a 👍 so that we can see that it is important to our users and implement it ourselves.

@maennchen
Member Author

Replacement of #238

@egeersoz

Currently there are two popular job scheduling libraries for Elixir: Exq and Quantum-Elixir.

Exq is Redis-backed, so jobs don't get lost if the application restarts. However, it cannot be configured to run jobs at regular intervals.

Quantum-Elixir can run jobs at regular intervals, but it has no persistent state.

This is a predicament: robust, "real-life" job scheduling functionality is currently split between these two libraries, with neither one providing the full functionality.

That's why I think this feature should be high priority.

@Yamilquery

Yamilquery commented Jan 11, 2018

I think you could use ETS and PersistentEts as the state storage system, instead of leaving that implementation to the user.

@maennchen
Member Author

@Yamilquery I'm aware that there are many options for the storage backend. This is not really a complex feature, but it requires a decent amount of work. If you'd like to tackle this, you're very welcome to provide a PR.

As mentioned in the requirements section, please do not implement against one specific storage backend, but rather write a protocol so that other implementations can be built.
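
Purely as a sketch of what "leaving the implementation to the user" could look like, a backend module would implement such a behaviour along these lines. The module names are hypothetical, and a real backend would write to Redis / Postgres / Mnesia instead of discarding the data.

```elixir
defmodule MyApp.NoopStorage do
  # Hypothetical no-op backend implementing the StorageSketch behaviour
  # sketched in the issue description; every callback discards its input.
  @behaviour StorageSketch

  @impl true
  def jobs(_scheduler), do: :not_applicable

  @impl true
  def add_job(_scheduler, _job), do: :ok

  @impl true
  def delete_job(_scheduler, _name), do: :ok

  @impl true
  def update_job_state(_scheduler, _name, _state), do: :ok

  @impl true
  def purge(_scheduler), do: :ok

  @impl true
  def last_execution_date(_scheduler), do: :unknown

  @impl true
  def update_last_execution_date(_scheduler, _date), do: :ok
end
```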

@pyatkov
Contributor

pyatkov commented Feb 28, 2018

Why do you insist on this generic multi-backend implementation, as if you really know that it is critical for some use cases, while the whole persistence capability (which obviously is critical) is missing? Why not just start off with a feasible DETS/Mnesia implementation and go ahead with it?
I would love to get my hands on implementing a PR, but only for a feasible DETS/Mnesia implementation.
I know a configurable backend is a common and useful option, but don't you see it's blocking people from working on a PR because it makes the PR too big? Don't be so demanding when there is nothing in place; ease your requirements and the solution will come to you.

@maennchen
Member Author

@pyatkov

My plan was to implement a behaviour for the adapter plus one implementation (not sure yet which one I would pick first). The reason this isn't done yet is that I currently lack the time to do all of this.

If you'd like to volunteer, I could define the behaviour / config etc. in a PR and you could provide the first implementation of the storage. Does that sound like an option to you?

Don't be so demanding

I don't think that defining a behaviour is very demanding. This library is fairly popular by now and I don't see lowering standards as an option.

@maennchen
Member Author

@pyatkov
The basis is done on the storage branch. Do you want to implement a storage provider now?

https://github.com/c-rack/quantum-elixir/blob/storage/lib/quantum/storage/adapter.ex

Please tell me if there is something wrong with the interface or if you have ideas for improvements.

@maennchen
Member Author

maennchen commented Feb 28, 2018

@c-rack Would you prefer if storage providers live in their own package or if we manage them here?

Edit: Depending on the adapter type, this could introduce many new optional dependencies (like redix) which could collide with the installer's setup. I think it would be better if the adapters were in their own packages.

@pyatkov
Contributor

pyatkov commented Mar 1, 2018

@maennchen
I agree. I just wanted to stress that sometimes requirements should be less demanding, considering that the first step is always hard to take.
In fact, I'm already working on this PR. I was thinking through the options and by now I've settled on the behaviour, so I'm working on a generic implementation.
I will be open to various comments in something like a code-review phase, so details can be adjusted accordingly in the process.
Right now I'm thinking about job execution tracking and the redo feature as well.
It seems like the job should have something like a last_executed_at field holding the date and time the job was last executed.
Also it seems job.name is a unique field and could be used as a PK or similar for storing the job.
Hopefully, I will have a draft implementation on hand in the meantime, so there would be a concrete basis to elaborate on in this PR/issue.

@maennchen
Member Author

@pyatkov

In fact, I'm already working on this PR. I was thinking through the options and by now I've settled on the behaviour, so I'm working on a generic implementation.

Do you work based on https://github.com/c-rack/quantum-elixir/tree/storage ?

I will be open to various comments in something like a code-review phase, so details can be adjusted accordingly in the process.

Great, I'm eager to see your PR.

Right now I'm thinking about job execution tracking and the redo feature as well.

This is one of the core features of the current ExecutionBroadcaster. It keeps track of which jobs have been executed and, more importantly, of the time up to which it has run. (If your server crashes and you re-init it with the old time value, it will pick up from back then.)

Therefore the only interesting value for the storage is the last_execution_time of the executor, not that of the jobs. All other values can be extrapolated from there.

Also it seems job.name is a unique field and could be used as a PK or similar for storing the job.

Yes, we treat the job name as unique.

@pyatkov
Contributor

pyatkov commented Mar 1, 2018

@maennchen
I just looked through the definition of the Quantum.Storage.Adapter module and I don't get why every function takes in a scheduler_module. This means the behaviour implementation's functions would take the scheduler_module as a parameter and thus somehow depend on it, but how?
I just don't see it right now.
I was also considering having the functions accept a %Job{} struct, from which they could extract the name and state if they need to. I mean, I didn't plan to split name and state into separate function parameters.
The whole naming semantics also looked more storage-centric to me: insert, update, delete, get_all_jobs, plus save and save_all convenience functions. save would do either an insert or an update depending on whether the job is present in the storage (I also introduced an optional exist? function, which could be reduced to a get function retrieving the job from the storage). save_all, which could probably be optional as well, should perform the save operation on a [Job.t()] as an atomic operation from the storage standpoint.
From a bird's-eye view it's all more or less the same, but still quite different when you look into the naming and details.
Going further, I'm thinking of "attaching" last_execution_time or last_executed_at to the list of stored attributes for the job. A function retrieving this last execution time will still be needed.
I think we could discuss these starting points early in the development phase and develop a somewhat consistent view on the interface/behaviour and naming.

@pyatkov
Contributor

pyatkov commented Mar 1, 2018

@maennchen
As for how the behaviour would be dispatched to a concrete module: I see it such that the module name would be retrieved from the opts at the supervisor level and passed into JobBroadcaster and ExecutionBroadcaster, so that they could dispatch directly on that module (like storage_implementation.save(job)); there would then be no need to pass the module argument to the storage functions.
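
A minimal sketch of that dispatch, assuming the storage module arrives via the supervisor's options (all names here are illustrative):

```elixir
defmodule DispatchSketch do
  # Illustrative only: the broadcaster receives the configured storage
  # module via its opts and dispatches on it at runtime.
  def persist_added_job(opts, job) do
    case Keyword.fetch(opts, :storage) do
      {:ok, storage} -> storage.save(job)
      :error -> :ok
    end
  end
end
```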

@maennchen
Member Author

I just looked through the definition of the Quantum.Storage.Adapter module and I don't get why every function takes in a scheduler_module.

Since there could be multiple instances of the Quantum.Scheduler, I thought it makes sense to give the module name to the storage. This way the storage can be started multiple times and can differentiate between the schedulers.

I was also considering having the functions accept a %Job{} struct, from which they could extract the name and state if they need to. I mean, I didn't plan to split name and state into separate function parameters.

The internal structure of quantum heavily depends on messaging. Only the information relevant for a given action is part of the message. Therefore many of those events are not aware of the full job.

The whole naming semantics also looked more storage-centric to me: insert, update, delete, get_all_jobs, plus save and save_all convenience functions. save would do either an insert or an update depending on whether the job is present in the storage (I also introduced an optional exist? function, which could be reduced to a get function retrieving the job from the storage).

Those are the low-level functions needed to connect to quantum. We can certainly add more convenience functions to the storage.

Going further, I'm thinking of "attaching" last_execution_time or last_executed_at to the list of stored attributes for the job. A function retrieving this last execution time will still be needed.

In my opinion the storage is a backend of quantum and should not have any execution logic by itself. Why would you want to save information that is not actually needed by the scheduler?

It is also important to know how quantum is built internally (mostly for performance and to keep the parts small and maintainable). Not all stages have a full copy of a job, and they do not need one.
Since there are multiple stages, all job data outside the JobBroadcaster is a duplicate. If we begin to modify the job struct, we'll have to sync those copies between all the stages.

Currently jobs behave a lot like Elixir variables: they are immutable in the system (with the exception of the active / inactive state). If you add a new job with the same name, the old one will be overwritten.

As for how the behaviour would be dispatched to a concrete module: I see it such that the module name would be retrieved from the opts at the supervisor level and passed into JobBroadcaster and ExecutionBroadcaster, so that they could dispatch directly on that module (like storage_implementation.save(job))

If you look into the new branch, you can already see this functionality implemented.

there would then be no need to pass the module argument to the storage functions.

What happens if I want to use the same Storage Implementation on two schedulers?
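
To sketch that point, assuming the scheduler module is passed along: a single storage implementation can serve several schedulers by keying its data on the scheduler. The names below are illustrative, not the storage branch code.

```elixir
defmodule MyApp.SharedEtsStorage do
  # Illustrative only: one ETS table per scheduler, derived from the
  # scheduler module, so one storage module can serve two schedulers.
  def add_job(scheduler, job) do
    table = table_for(scheduler)

    if :ets.info(table) == :undefined do
      :ets.new(table, [:named_table, :set, :public])
    end

    :ets.insert(table, {job.name, job})
    :ok
  end

  def jobs(scheduler) do
    case :ets.info(table_for(scheduler)) do
      :undefined -> :not_applicable
      _info -> for {_name, job} <- :ets.tab2list(table_for(scheduler)), do: job
    end
  end

  # e.g. :"storage_Elixir.MyApp.SchedulerA"
  defp table_for(scheduler), do: :"storage_#{scheduler}"
end
```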

@pyatkov
Contributor

pyatkov commented Mar 1, 2018

@maennchen
Also, will it be sufficient to store the execution date per scheduler (so to speak)?
Could it happen that some job(s) did not get executed (while other jobs were), so that you might need to redo certain jobs? That might require finer granularity for storing the last execution date.

Also, I see 3 types of jobs regarding the redo option:

  1. transient jobs that do not need redoing even if their execution was skipped
  2. permanent jobs that should be executed once after the period the scheduler was down (no matter how many cycles they missed), like backup jobs
  3. permanent_consecutive (a better name is more than appreciated) jobs that need to be redone exactly the number of times they were not executed. I can imagine jobs incrementing some counters. I believe I could come up with a real-world scenario, but I need some time to think of one.

So, based on this job type (so to speak) and the redo switch, the system would decide which jobs should be executed, and how many times, after the scheduler is up and running again.

Also, probably not related to this issue (and PR), I was thinking of a neat way to execute one-time jobs that need to be fired at a given datetime and removed from the scheduler upon execution. I know it could be done by removing the job from the scheduler upon task completion, but a neater way to do it (like configuring the job to be a one-time job) would, I guess, be well appreciated by the lib's users.

@maennchen
Member Author

maennchen commented Mar 1, 2018

Also, will it be sufficient to store the execution date per scheduler (so to speak)?

Yes, because the execution broadcaster only moves forward in time once its work for the instant it is currently processing is done.

Could it happen that some job(s) did not get executed (while other jobs were), so that you might need to redo certain jobs? That might require finer granularity for storing the last execution date.

Only if you kill (not stop) the execution supervisor. (If, for example, you pull the plug.)

Also, I see 3 types of jobs regarding the redo option

Could we tackle that problem after the storage itself is done? This needs some work on other parts of the library as well, and I would like to stabilise the storage interface first. (Don't take this as a lack of interest; I think it is an important point.)

Also, probably not related to this issue (and PR), I was thinking of a neat way to execute one-time jobs that need to be fired at a given datetime and removed from the scheduler upon execution.

We currently have no option for that. I created #268 to see if there is interest in it.

@maennchen
Member Author

@pyatkov I created #314 for the redo option.

@pyatkov
Contributor

pyatkov commented Mar 1, 2018

Looked at PR #313 - solid work (especially on the test side and the logging stuff :-) )
I've read your comments carefully; I somehow didn't see them while I was continuously writing my own.
I will probably reconsider my findings based on the new branch and the #313 PR.
I believe the scheduler module name should act as a kind of partition for the jobs' data storage.
I will also look deeper into job execution tracking to better understand (or make sure) that only the last execution time per scheduler needs to be kept track of.
It's evening for me, so I'm looking forward to getting into it again tomorrow.

@maennchen
Member Author

Looked at PR #313 - solid work (especially on the test side and the logging stuff :-) )

Glad to hear that. I hope it is a good base to solve the problems. (I do expect that we'll need to change things, but it should be a good starting point.)

I believe the scheduler module name should act as a kind of partition for the jobs' data storage.

My idea was that you could either start multiple GenServers for the storage and pick the right one based on the scheduler_name, or use partitions inside the storage. (If you have a better idea of how to handle multiple schedulers, we can change that.)
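
A sketch of the first variant, one storage GenServer per scheduler, looked up by the scheduler's name; module and function names are assumptions.

```elixir
defmodule MyApp.PerSchedulerStorage do
  # Illustrative only: one registered storage process per scheduler,
  # e.g. MyApp.PerSchedulerStorage.MyApp.SchedulerA, with calls routed
  # by the scheduler module.
  use GenServer

  def start_link(scheduler),
    do: GenServer.start_link(__MODULE__, %{}, name: name_for(scheduler))

  def add_job(scheduler, job),
    do: GenServer.call(name_for(scheduler), {:add_job, job})

  defp name_for(scheduler), do: Module.concat(__MODULE__, scheduler)

  @impl true
  def init(jobs), do: {:ok, jobs}

  @impl true
  def handle_call({:add_job, job}, _from, jobs),
    do: {:reply, :ok, Map.put(jobs, job.name, job)}
end
```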

I will also look deeper into job execution tracking to better understand (or make sure) that only the last execution time per scheduler needs to be kept track of.

Since there are really clear boundaries between the stages, I hope you'll find everything you need. If you have any questions, do not hesitate to ask!

@pyatkov
Contributor

pyatkov commented Mar 5, 2018

@maennchen
I seem to have stabilized a draft implementation using PersistentEts. I haven't written any tests for this implementation specifically (so far), but I guess we could test it indirectly by specifying the implementation module as the value of the :storage option in the configuration.
I think it makes sense to start the code-review process soon, as I might be drawn into some other tasks and the whole process might otherwise get stuck on my side.
So, I will probably create a pull request for this issue to start the review process.
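
For that indirect testing, the test configuration could simply point the :storage option at the implementation module; this is a hedged sketch and both names are assumptions.

```elixir
# config/test.exs -- hypothetical; option and module names are assumptions
use Mix.Config

config :my_app, MyApp.Scheduler,
  storage: MyApp.PersistentEtsStorage
```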

@pyatkov
Contributor

pyatkov commented Mar 5, 2018

@maennchen
I've created PR #318 with my draft implementation of the Quantum.Storage.Adapter behaviour via PersistentEts. I would appreciate your comments on it.
