-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure replication running after replication restart #1422
Conversation
@andyedison I wonder if this fix is good enough for all cases. The sleep that exists now is a bit hacky
What should we be waiting for? I think you're saying the IO thread running. Whatever the answer, it would be safer if |
No I agree, I doubt this is good enough for all cases. This was a bit of an experiment to see if increasing this time would prevent the errors we were seeing in a particular environment from happening over a period of time. I believe it has, we just haven't had time to swing back to this and dig into it to find a more permanent solution |
I've updated the PR and title+description to better represent what this change is. Instead of assuming that replication has resumed successfully after the timeout, I made the change to instead check if it is running, and if not, wait an interval before trying to check again, erroring if we exceed a maximum wait time |
@@ -22,7 +22,8 @@ import ( | |||
"github.com/openark/golib/sqlutils" | |||
) | |||
|
|||
const startSlavePostWaitMilliseconds = 500 * time.Millisecond | |||
const startReplicationPostWait = 250 * time.Millisecond | |||
const startReplicationMaxWait = 2 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 seconds was somewhat arbitrary, I could be convinced to adjust this if others had strong opinions
Internal testing results look good, feel free to merge when ready 👍 |
A Pull Request should be associated with an Issue.
Related issue:
Description
This PR adds checks to function
restartReplication
that ensures that replication has started before continuing. Before adding this check, there as a hard coded 500ms wait time and then the program assumed that the replication threads were started and running (added in #337).We encountered situations in different environments that this wait time wasn't sufficient. As an experiment, we doubled this wait time and deployed it to our live environments to see if this resolves the issue. This did help solve the problem, so now we are coming back to find a better permanent fix.
example output from our logs:
old description
This PR increases the `startSlavePostWaitMilliseconds` as we are seeing an error when running `gh-ost` in some cloud environments that the `Slave_IO_Running` is `Connecting` rather than `Yes` as expected.We found this old PR that described the issue we're having #337 - as a first step we are increasing by doubling the value. If this test is successful, then we'll look into making this something that could be configured.
script/cibuild
returns with no formatting errors, build errors or unit test errors.