Unmount disconnected filesystem when quorum lost #14
base: master
Conversation
I hoped to get that done today but did not have time; on Monday I'm on vacation, so I'm afraid this has to wait till Tuesday. |
All good, have a nice vacation. |
Hm, if I get that right you basically assume that after quorum loss the FS is corrupted, right? That is true by default as of now, but that might even change. A little backstory, long story short: There are two settings and All in all I guess this improves the situation, but maybe it is not complete yet (and even if it is just setting |
My motivation here is that I encountered an issue where my VM had a network loss and the only way to recover was to SSH in and manually unmount the FS. I first tried using the This change only applies when |
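As a side note, the manual recovery mentioned above (unmounting a filesystem whose IO is frozen) typically comes down to a lazy unmount, the syscall behind `umount -l`. A minimal sketch, assuming Linux; the function name and mount path here are made up for illustration and are not part of the plugin:

```go
package main

import (
	"fmt"
	"syscall"
)

// lazyUnmount detaches the filesystem from the namespace immediately and
// lets the kernel clean up references once the mount is no longer busy.
// MNT_DETACH is what `umount -l` uses, and it works even when IO on the
// mount is frozen.
func lazyUnmount(target string) error {
	return syscall.Unmount(target, syscall.MNT_DETACH)
}

func main() {
	// "/mnt/example" is a hypothetical mount path used for illustration.
	if err := lazyUnmount("/mnt/example"); err != nil {
		fmt.Println("unmount failed:", err)
	}
}
```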
There have been fixes in this area; does that reproduce with the latest DRBD (9.1 or 9.0)? And that was triggered by |
I upgraded to the latest DRBD version, 9.0.29-1, then set the controller config option To reproduce the frozen container I ran on the node:
Then, in another session, dropped packets to trigger loss of quorum:
In the docker container running:
Then restoring quorum with:
The syslog after the quorum is restored:
|
Hi @beornf, I am trying to reproduce the issue (without Docker, just an XFS). So far I am failing to reproduce what you are describing. In my logs it looks like this:
In this case I interrupted the connection using After restoring quorum it looks like this in the logs:
and |
Hi @Philipp-Reisner, The output from
I probably should have mentioned the DRBD admin status from nodeA is:
I'm running the XFS mount on a diskless node so it can be unmounted and mounted on another node quickly. I'll try reproducing the issue without Docker and compare logs. Thanks, |
I've done further testing by running
|
Hi @beornf, Good catch! It might be unintuitive, but the on-no-data setting has precedence in this case. This needs to be mentioned in the documentation (or changed in the code if someone has good reasons). Still, what you posted on May 18 was something different; I'll continue to investigate the May 18 issue. Just letting you know what the reason for the IO error was when you are on a diskless primary... |
Hi @Philipp-Reisner, I've set both settings to suspend-io to reproduce the May 18 issue. This time I mounted to a folder on the diskless primary at After interrupting the connection with On restoring the connection the command did not unfreeze. The relevant errors from the kernel were:
|
Force-pushed from f2f2ef7 to c14374f
Hi @Philipp-Reisner, Just wondering if the new DRBD versions 9.0.30-1 or 9.1.3 might have addressed the frozen XFS device issue. Thanks, |
Hi @beornf, there was no patch merged in this regard. But let me share what I did. Just to be sure: the stack trace you see in the kernel logs is not a problem or a bug. Since IO is frozen, a task that waits for IO completion is blocked for more than 120 seconds, and the kernel prints this warning with a full stack trace. That is expected and okay. |
This addresses errors when unmounting a Linstor docker volume after quorum is lost on the primary. The kernel logs in this edge case are as follows:
When terminating the container, the resource definition may not be returned due to loss of connection to the controller. The presence of the mount path is checked as a fallback.
The error returned from checking the mount point indicates the filesystem is corrupted and hence is still mounted, with os.Stat returning input/output error. See https://github.com/kubernetes/kubernetes/blob/v1.14.6/pkg/util/mount/mount_helper_unix.go#L27.