Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] flock unlocking issues #13821

Closed
1 task done
michallenc opened this issue Oct 4, 2024 · 3 comments · Fixed by #13826
Closed
1 task done

[BUG] flock unlocking issues #13821

michallenc opened this issue Oct 4, 2024 · 3 comments · Fixed by #13826
Labels
Arch: arm Issues related to ARM (32-bit) architecture Area: File System File System issues

Comments

@michallenc
Copy link
Contributor

michallenc commented Oct 4, 2024

Description / Steps to reproduce the issue

File lock flock interface (and generally fs_lock.c implementation, I suppose the issue is valid for direct fnctl locks as well) does not seem to handle file unlocking correctly. A simple scenario where two threads open and access the file and both attempt to lock it with flock(fd, LOCK_EX); ends with the following result:

thread 1 locks file and accesses it
thread 2 locks file and blocks
thread 1 unlocks file
thread 2 should access the file, but it stays blocked

So far I have figured out two issues. One is the incorrect obtain of lock pid if the lock is taken from a POSIX thread created from the main process instead of a completely separate process. Imho task id (gettid()) should be used instead of pid (getpid()) to ensure every thread is treated independently and locking between threads is also possible. The following diff solves this.

diff --git a/fs/vfs/fs_lock.c b/fs/vfs/fs_lock.c
index 944b8a5ea1..45755c6c8f 100644
--- a/fs/vfs/fs_lock.c
+++ b/fs/vfs/fs_lock.c
@@ -670,12 +686,11 @@ int file_setlk(FAR struct file *filep, FAR struct flock *flock,
   if (ret < 0)
     {
       goto out_free;
     }
 
-  request.l_pid = getpid();
-
+  request.l_pid = gettid();

But the main issue I am facing is with the unlocking as described above. From my tests it seems the lock from thread 1 is unlocked successfully, deleted from the list and semaphore is posted, which releases the second thread. At this moment, the second thread jumps to retry label (see this line) and takes list_for_every_entry() which search through the list of active locks. This should be empty now as we released the only held lock. But for some reason the list returns an existing lock and goes to file_lock_is_conflict(). But the data in the returned lock are not valid! Or at least is seems to be that way, I get pid value of something like 541213036, so it seems we are accessing a bad part of the memory.

I have tested this while using SmartFS file system and NOR flash, but I suppose this is reproducible on other file systems as well as the issue seems to be in the common part of locking infrastructure or list implementation. One more change is required for SmartFS, I have not committed it yet to the mainline:

diff --git a/fs/smartfs/smartfs_smart.c b/fs/smartfs/smartfs_smart.c
index d41476065a..bc4e6477cd 100644
--- a/fs/smartfs/smartfs_smart.c
+++ b/fs/smartfs/smartfs_smart.c
@@ -1035,7 +1035,7 @@ static int smartfs_ioctl(FAR struct file *filep, int cmd, unsigned long arg)
         }
         break;
       default:
-        ret = -ENOSYS;
+        ret = -ENOTTY;
         break;
     }

I have CONFIG_FS_LOCK_BUCKET_SIZE=8, I suppose this is the only configuration needed.

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Ubuntu 22.04.5, 6.8.0-45-generic

NuttX Version

master

Issue Architecture

[Arch: arm]

Issue Area

[Area: File System]

Verification

  • I have verified before submitting the report.
@github-actions github-actions bot added Arch: arm Issues related to ARM (32-bit) architecture Area: File System File System issues labels Oct 4, 2024
@michallenc
Copy link
Contributor Author

Pinging the flock implementation author @crafcat7, any ideas?

crafcat7 added a commit to crafcat7/crafcat7-nuttx that referenced this issue Oct 5, 2024
Summary:
  Fixed the problem of releasing the bucket prematurely in multi-threaded flock scenarios.

A thread setlk
B thread setlk_wait
A thread releases lock but fails to determine if nwaiter causes the bucket to be released prematurely
post B thread causes crash due to heap use after free

apache#13821

Signed-off-by: chenrun1 <[email protected]>
@crafcat7
Copy link
Contributor

crafcat7 commented Oct 5, 2024

But the main issue I am facing is with the unlocking as described above. From my tests it seems the lock from thread 1 is unlocked successfully, deleted from the list and semaphore is posted, which releases the second thread. At this moment, the second thread jumps to retry label (see this line) and takes list_for_every_entry() which search through the list of active locks. This should be empty now as we released the only held lock. But for some reason the list returns an existing lock and goes to file_lock_is_conflict(). But the data in the returned lock are not valid! Or at least is seems to be that way, I get pid value of something like 541213036, so it seems we are accessing a bad part of the memory.

Hi, Thanks for providing the steps to reproduce the issue, which has been fixed in the PR #13826

So far I have figured out two issues. One is the incorrect obtain of lock pid if the lock is taken from a POSIX thread created from the main process instead of a completely separate process. Imho task id (gettid()) should be used instead of pid (getpid()) to ensure every thread is treated independently and locking between threads is also possible. The following diff solves this.

Regarding the design that gettid / getpid should be used in file locks, I expect the implementation to be consistent with that in Linux. In the file lock implementation in Linux, I read that they use groupid instead of threadid if I understand correctly (refer to https://github.com/torvalds/linux/blob/27cc6fdf720183dce1dbd293483ec5a9cb6b595e/fs/locks.c#L528-L533)

@xiaoxiang781216 xiaoxiang781216 linked a pull request Oct 5, 2024 that will close this issue
@michallenc
Copy link
Contributor Author

Hi, Thanks for providing the steps to reproduce the issue, which has been fixed in the PR #13826

Thanks for the quick fix!

Regarding the design that gettid / getpid should be used in file locks, I expect the implementation to be consistent with that in Linux. In the file lock implementation in Linux, I read that they use groupid instead of threadid if I understand correctly (refer to https://github.com/torvalds/linux/blob/27cc6fdf720183dce1dbd293483ec5a9cb6b595e/fs/locks.c#L528-L533)

Hmm, I have tested the same app against Linux (kernel version 6.8.0-45) and thread locking works there. But yes, from the code you sent it seems they are using thread group ID, which should be the same as what we get from getpid() in NuttX.

xiaoxiang781216 pushed a commit that referenced this issue Oct 5, 2024
Summary:
  Fixed the problem of releasing the bucket prematurely in multi-threaded flock scenarios.

A thread setlk
B thread setlk_wait
A thread releases lock but fails to determine if nwaiter causes the bucket to be released prematurely
post B thread causes crash due to heap use after free

#13821

Signed-off-by: chenrun1 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arch: arm Issues related to ARM (32-bit) architecture Area: File System File System issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants