[BUG] flock unlocking issues #13821

michallenc · 2024-10-04T12:21:10Z

Description / Steps to reproduce the issue

File lock flock interface (and generally fs_lock.c implementation, I suppose the issue is valid for direct fnctl locks as well) does not seem to handle file unlocking correctly. A simple scenario where two threads open and access the file and both attempt to lock it with flock(fd, LOCK_EX); ends with the following result:

thread 1 locks file and accesses it
thread 2 locks file and blocks
thread 1 unlocks file
thread 2 should access the file, but it stays blocked

So far I have figured out two issues. One is the incorrect obtain of lock pid if the lock is taken from a POSIX thread created from the main process instead of a completely separate process. Imho task id (gettid()) should be used instead of pid (getpid()) to ensure every thread is treated independently and locking between threads is also possible. The following diff solves this.

diff --git a/fs/vfs/fs_lock.c b/fs/vfs/fs_lock.c
index 944b8a5ea1..45755c6c8f 100644
--- a/fs/vfs/fs_lock.c
+++ b/fs/vfs/fs_lock.c
@@ -670,12 +686,11 @@ int file_setlk(FAR struct file *filep, FAR struct flock *flock,
   if (ret < 0)
     {
       goto out_free;
     }
 
-  request.l_pid = getpid();
-
+  request.l_pid = gettid();

But the main issue I am facing is with the unlocking as described above. From my tests it seems the lock from thread 1 is unlocked successfully, deleted from the list and semaphore is posted, which releases the second thread. At this moment, the second thread jumps to retry label (see this line) and takes list_for_every_entry() which search through the list of active locks. This should be empty now as we released the only held lock. But for some reason the list returns an existing lock and goes to file_lock_is_conflict(). But the data in the returned lock are not valid! Or at least is seems to be that way, I get pid value of something like 541213036, so it seems we are accessing a bad part of the memory.

I have tested this while using SmartFS file system and NOR flash, but I suppose this is reproducible on other file systems as well as the issue seems to be in the common part of locking infrastructure or list implementation. One more change is required for SmartFS, I have not committed it yet to the mainline:

diff --git a/fs/smartfs/smartfs_smart.c b/fs/smartfs/smartfs_smart.c
index d41476065a..bc4e6477cd 100644
--- a/fs/smartfs/smartfs_smart.c
+++ b/fs/smartfs/smartfs_smart.c
@@ -1035,7 +1035,7 @@ static int smartfs_ioctl(FAR struct file *filep, int cmd, unsigned long arg)
         }
         break;
       default:
-        ret = -ENOSYS;
+        ret = -ENOTTY;
         break;
     }

I have CONFIG_FS_LOCK_BUCKET_SIZE=8, I suppose this is the only configuration needed.

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Ubuntu 22.04.5, 6.8.0-45-generic

NuttX Version

master

Issue Architecture

[Arch: arm]

Issue Area

[Area: File System]

Verification

I have verified before submitting the report.

The text was updated successfully, but these errors were encountered:

michallenc · 2024-10-04T12:22:49Z

Pinging the flock implementation author @crafcat7, any ideas?

Summary: Fixed the problem of releasing the bucket prematurely in multi-threaded flock scenarios. A thread setlk B thread setlk_wait A thread releases lock but fails to determine if nwaiter causes the bucket to be released prematurely post B thread causes crash due to heap use after free apache#13821 Signed-off-by: chenrun1 <[email protected]>

crafcat7 · 2024-10-05T06:10:32Z

But the main issue I am facing is with the unlocking as described above. From my tests it seems the lock from thread 1 is unlocked successfully, deleted from the list and semaphore is posted, which releases the second thread. At this moment, the second thread jumps to retry label (see this line) and takes list_for_every_entry() which search through the list of active locks. This should be empty now as we released the only held lock. But for some reason the list returns an existing lock and goes to file_lock_is_conflict(). But the data in the returned lock are not valid! Or at least is seems to be that way, I get pid value of something like 541213036, so it seems we are accessing a bad part of the memory.

Hi, Thanks for providing the steps to reproduce the issue, which has been fixed in the PR #13826

So far I have figured out two issues. One is the incorrect obtain of lock pid if the lock is taken from a POSIX thread created from the main process instead of a completely separate process. Imho task id (gettid()) should be used instead of pid (getpid()) to ensure every thread is treated independently and locking between threads is also possible. The following diff solves this.

Regarding the design that gettid / getpid should be used in file locks, I expect the implementation to be consistent with that in Linux. In the file lock implementation in Linux, I read that they use groupid instead of threadid if I understand correctly (refer to https://github.com/torvalds/linux/blob/27cc6fdf720183dce1dbd293483ec5a9cb6b595e/fs/locks.c#L528-L533)

michallenc · 2024-10-05T09:51:27Z

Hi, Thanks for providing the steps to reproduce the issue, which has been fixed in the PR #13826

Thanks for the quick fix!

Regarding the design that gettid / getpid should be used in file locks, I expect the implementation to be consistent with that in Linux. In the file lock implementation in Linux, I read that they use groupid instead of threadid if I understand correctly (refer to https://github.com/torvalds/linux/blob/27cc6fdf720183dce1dbd293483ec5a9cb6b595e/fs/locks.c#L528-L533)

Hmm, I have tested the same app against Linux (kernel version 6.8.0-45) and thread locking works there. But yes, from the code you sent it seems they are using thread group ID, which should be the same as what we get from getpid() in NuttX.

Summary: Fixed the problem of releasing the bucket prematurely in multi-threaded flock scenarios. A thread setlk B thread setlk_wait A thread releases lock but fails to determine if nwaiter causes the bucket to be released prematurely post B thread causes crash due to heap use after free #13821 Signed-off-by: chenrun1 <[email protected]>

github-actions bot added Arch: arm Issues related to ARM (32-bit) architecture Area: File System File System issues labels Oct 4, 2024

crafcat7 mentioned this issue Oct 5, 2024

fs_lock:Check the nwaiter when deleting a bucket #13826

Merged

xiaoxiang781216 linked a pull request Oct 5, 2024 that will close this issue

fs_lock:Check the nwaiter when deleting a bucket #13826

Merged

xiaoxiang781216 closed this as completed in #13826 Oct 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] flock unlocking issues #13821

[BUG] flock unlocking issues #13821

michallenc commented Oct 4, 2024 •

edited

Loading

michallenc commented Oct 4, 2024

crafcat7 commented Oct 5, 2024 •

edited

Loading

michallenc commented Oct 5, 2024

[BUG] flock unlocking issues #13821

[BUG] flock unlocking issues #13821

Comments

michallenc commented Oct 4, 2024 • edited Loading

Description / Steps to reproduce the issue

On which OS does this issue occur?

What is the version of your OS?

NuttX Version

Issue Architecture

Issue Area

Verification

michallenc commented Oct 4, 2024

crafcat7 commented Oct 5, 2024 • edited Loading

michallenc commented Oct 5, 2024

michallenc commented Oct 4, 2024 •

edited

Loading

crafcat7 commented Oct 5, 2024 •

edited

Loading