Lustre / LU-11053

problem with loop device associated with lustre file

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Lustre's file open does not bring the file size into the in-core inode, and losetup does not stat() the backing Lustre file before issuing LOOP_SET_FD:

      open("/mnt/lustre/ext4image", O_RDWR)    = 3
      open("/dev/loop0", O_RDWR)              = 4
      ioctl(4, LOOP_SET_FD, 0x3)              = 0
      stat("/mnt/lustre/ext4image"...
      

      So losetup creates the loop block device with zero size, which eventually leads to a failure when unmounting the loop device, hitting BUG_ON(!buffer_mapped(bh)) in submit_bh():

      <2>kernel BUG at fs/buffer.c:3157!
      <4>invalid opcode: 0000 [#1] SMP
      <4>last sysfs file: /sys/devices/system/cpu/online
      <4>CPU 37
      ...
      <4>Pid: 6751, comm: umount Not tainted 2.6.32-696.18.7.el6.x86_64 #1 BULL bullx blade/CHPD
      <4>RIP: 0010:[<ffffffff811d0962>]  [<ffffffff811d0962>] submit_bh+0x152/0x1f0
      <4>RSP: 0018:ffff88107483bd68  EFLAGS: 00010246
      <4>RAX: 0000000000000005 RBX: ffff88087c7fbd60 RCX: 0000000000000017
      ...
      <4>Call Trace:
      <4> [<ffffffff811d2973>] __sync_dirty_buffer+0x53/0xf0
      <4> [<ffffffff811d2a23>] sync_dirty_buffer+0x13/0x20
      <4> [<ffffffffa0d1877b>] ext2_sync_super+0x5b/0x70 [ext2]
      <4> [<ffffffffa0d19733>] ext2_put_super+0x133/0x150 [ext2]
      <4> [<ffffffff8119cc4b>] generic_shutdown_super+0x5b/0xe0
      <4> [<ffffffff8119cd01>] kill_block_super+0x31/0x50
      <4> [<ffffffff8119d4d7>] deactivate_super+0x57/0x80
      <4> [<ffffffff811bd50f>] mntput_no_expire+0xbf/0x110
      <4> [<ffffffff811be05b>] sys_umount+0x7b/0x3a0
      
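      For reference, the failing sequence can be reproduced from userspace with a small test program along the lines of the "a.out" referenced in the comments below. The following is only a sketch of such a reproducer; the backing-file path, loop device number, and error handling are assumptions, not the original test case:

      /* Hypothetical reproducer: configure a loop device over a Lustre file
       * without stat()/fsync() first, then try to write to it. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <unistd.h>
      #include <linux/loop.h>

      int main(void)
      {
              const char msg[] = "hello, loop\n";
              int backing_fd, loop_fd;

              backing_fd = open("/mnt/lustre/backing_file.img", O_RDWR);
              if (backing_fd < 0) {
                      perror("open backing file");
                      return 1;
              }

              loop_fd = open("/dev/loop3", O_RDWR);
              if (loop_fd < 0) {
                      perror("open loop device");
                      return 1;
              }

              /* No stat()/fsync() of the backing file before this ioctl, so the
               * loop device is sized from a possibly stale in-core i_size. */
              if (ioctl(loop_fd, LOOP_SET_FD, backing_fd) < 0) {
                      perror("LOOP_SET_FD");
                      return 1;
              }

              /* Against a Lustre backing file this write fails with ENOSPC
               * when the loop device was configured with size 0. */
              if (write(loop_fd, msg, sizeof(msg)) < 0)
                      perror("write to loop device");
              else
                      printf("Wrote %zu bytes to /dev/loop3\n", sizeof(msg));

              ioctl(loop_fd, LOOP_CLR_FD, 0);
              close(loop_fd);
              close(backing_fd);
              return 0;
      }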

      The following modification to sanity.sh:test_54c (from Andrew Perepechko) illustrates the problem: cancelling the client's MDC and OSC locks after the write drops the cached attributes, so losetup configures the loop device before the file size has been revalidated.

      diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
      index c6b292b029..d2961dbf0e 100755
      --- a/lustre/tests/sanity.sh
      +++ b/lustre/tests/sanity.sh
      @@ -4656,10 +4656,15 @@ test_54c() {
              mknod $loopdev b 7 $LOOPNUM
              echo "make a loop file system with $DIR/$tfile on $loopdev ($LOOPNUM)."
              dd if=/dev/zero of=$DIR/$tfile bs=$(get_page_size client) seek=1024 count=1 > /dev/null
      +       mkfs.ext2 $DIR/$tfile  || error "mke2fs on $DIR/$tfile "
      +       test_mkdir $DIR/$tdir
      +
      +       cancel_lru_locks mdc
      +       cancel_lru_locks osc
      +
              losetup $loopdev $DIR/$tfile ||
                      error "can't set up $loopdev for $DIR/$tfile"
      -       mkfs.ext2 $loopdev || error "mke2fs on $loopdev"
      -       test_mkdir $DIR/$tdir
      +
              mount -t ext2 $loopdev $DIR/$tdir ||
                      error "error mounting $loopdev on $DIR/$tdir"
              dd if=/dev/zero of=$DIR/$tdir/tmp bs=$(get_page_size client) count=30 ||
      

      Attachments

        Activity

          [LU-11053] problem with loop device associated with lustre file
          rajeevm Rajeev Mishra added a comment - - edited

          patch sent

          git send-email --to="axboe@kernel.dk" --cc="linux-block@vger.kernel.org" --cc="linux-kernel@vger.kernel.org" --subject-prefix="PATCH" 0001-loop-sync-filesystem-cache-before-getting-file-size-.patch --smtp-debug
          0001-loop-sync-filesystem-cache-before-getting-file-size-.patch
          (mbox) Adding cc: Rajeev Mishra <rajeevm@hpe.com> from line 'From: Rajeev Mishra <rajeevm@hpe.com>'
          (body) Adding cc: Rajeev Mishra <rajeevm@hpe.com> from line 'Signed-off-by: Rajeev Mishra <rajeevm@hpe.com>'
          From: Rajeev Mishra <rajeevm@hpe.com>
          To: axboe@kernel.dk
          Cc: linux-block@vger.kernel.org,
          linux-kernel@vger.kernel.org,
          Rajeev Mishra <rajeevm@hpe.com>
          Subject: [PATCH 1/1] loop: sync filesystem cache before getting file size in get_size()
          Date: Thu, 7 Aug 2025 14:44:49 +0000
          Message-ID: <20250807144449.189076-1-rajeevm@hpe.com>
          X-Mailer: git-send-email 2.43.7
          MIME-Version: 1.0
          Content-Transfer-Encoding: 8bit
          The Cc list above has been expanded by additional
          addresses found in the patch commit message. By default
          send-email prompts before sending whenever this occurs.
          This behavior is controlled by the sendemail.confirm
          configuration setting.
          For additional information, run 'git send-email --help'.
          To retain the current behavior, but squelch this message,
          run 'git config --global sendemail.confirm auto'.
          Send this email? ([y]es|[n]o|[e]dit|[q]uit|[a]ll): y
          OK. Log says:
          Sendmail: /usr/sbin/sendmail -i axboe@kernel.dk linux-block@vger.kernel.org linux-kernel@vger.kernel.org rajeevm@hpe.com
          From: Rajeev Mishra <rajeevm@hpe.com>
          To: axboe@kernel.dk
          Cc: linux-block@vger.kernel.org,
          linux-kernel@vger.kernel.org,
          Rajeev Mishra <rajeevm@hpe.com>
          Subject: [PATCH 1/1] loop: sync filesystem cache before getting file size in get_size()
          Date: Thu, 7 Aug 2025 14:44:49 +0000
          Message-ID: <20250807144449.189076-1-rajeevm@hpe.com>
          X-Mailer: git-send-email 2.43.7
          MIME-Version: 1.0
          Content-Transfer-Encoding: 8bit
          Result: OK
           
          rajeevm Rajeev Mishra added a comment - - edited

          I am in the process of testing the fix using the following local case:
          https://jira-pro.it.hpe.com:8443/browse/USS-4481

          [rocky@test-build-fcntl linux]$ git diff cbe7bfa2a73ac71a1f608aecc864c57dc6b9b283^ cbe7bfa2a73ac71a1f608aecc864c57dc6b9b283
          diff --git a/drivers/block/loop.c b/drivers/block/loop.c
          index 1b6ee91f8eb9..15d5edbc69ce 100644
          --- a/drivers/block/loop.c
          +++ b/drivers/block/loop.c
          @@ -137,12 +137,39 @@ static void loop_global_unlock(struct loop_device *lo, bool global)
           static int max_part;
           static int part_shift;
           
          +/**
          + * get_size - calculate the effective size of a loop device
          + * @offset: offset into the backing file
          + * @sizelimit: user-specified size limit
          + * @file: the backing file
          + *
          + * Calculate the effective size of the loop device
          + *
          + * Returns: size in 512-byte sectors, or 0 if invalid
          + */
           static loff_t get_size(loff_t offset, loff_t sizelimit, struct file *file)
           {
          +       struct kstat stat;
                  loff_t loopsize;
          +       int ret;
          +
          +       /*
          +        * Get file attributes for validation. We use vfs_getattr() to ensure
          +        * we have up-to-date file size information.
          +        */
          +       ret = vfs_getattr_nosec(&file->f_path, &stat, STATX_SIZE, 
          +                               AT_STATX_SYNC_AS_STAT);
          +       if (ret) {
          +               /*
          +                * If we can't get attributes, fall back to i_size_read()
          +                * which should work for most cases.
          +                */
          +               loopsize = i_size_read(file->f_mapping->host);
          +       } else {
          +               /* Use the size from getattr for consistency */
          +               loopsize = stat.size;
          +       }
           
          -       /* Compute loopsize in bytes */
          -       loopsize = i_size_read(file->f_mapping->host);
                  if (offset > 0)
                          loopsize -= offset;
                  /* offset is beyond i_size, weird but possible */
          

          rajeevm Rajeev Mishra added a comment -

          Thanks @Andreas, will do the needful. The same issue is present on 5.14.0-503.34.1.el9_5.x86_64:
          [rocky@fcntl-rocky ~]$ uname -r
          5.14.0-503.34.1.el9_5.x86_64
          [rocky@fcntl-rocky ~]$ sudo dd if=/dev/zero bs=1 count=512 of=/mnt/lustre/backing_file.img
          512+0 records in
          512+0 records out
          512 bytes copied, 0.00277791 s, 184 kB/s
          sudo ./a.out
          write to loop device: No space left on device


          adilger Andreas Dilger added a comment -

          "I will make sure to clearly highlight that this is not a Lustre issue"

          Please do not mention Lustre at all in any upstream patch submission. This will likely result in the patch being rejected, regardless of whether it is correct or not.

          Just write something generic like:

          Ensure that the file data is synced to server and the file size is
          revalidated before configuring the loop device to avoid issues with
          loopback configuration if the client does not have the latest size.


          rajeevm Rajeev Mishra added a comment -

          @Andreas I have not been able to reproduce the issue on NFS, possibly because the NFS cache refreshes too quickly. However, on Lustre, a sync is required to get the correct file size.

          All my tests so far have been conducted on the following kernel version:

          uname -r
          4.18.0-425.3.1.el8.x86_64

          I have also set up a system with a newer kernel:

          [rocky@fcntl-rocky ~]$ uname -r
          5.14.0-503.34.1.el9_5.x86_64

          I will run tests on this system as well to see if there’s any difference in behavior.

          I will make sure to clearly highlight that this is not a Lustre issue, and that the fix is needed in the loop device to retrieve the correct file size. Thanks for your help and guidance.

          adilger Andreas Dilger added a comment - - edited

          I would still prefer to use "vfs_getattr_nosec(&file->f_path, &stat, STATX_SIZE, AT_STATX_FORCE_SYNC)" for two reasons:

          • this will also force an fsync() of the client cache under the covers if it is needed
          • it is sure to get the latest file size from the OSTs if needed
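
          A minimal sketch of that variant, assuming the declarations from the drivers/block/loop.c patch above (struct kstat stat; loff_t loopsize; int ret;) and keeping the same fallback logic:

          /* Force ->getattr() to revalidate the size with the server instead of
           * trusting cached attributes (flushing dirty client data if needed). */
          ret = vfs_getattr_nosec(&file->f_path, &stat, STATX_SIZE,
                                  AT_STATX_FORCE_SYNC);
          if (ret)
                  /* On error, fall back to the possibly-stale cached size. */
                  loopsize = i_size_read(file->f_mapping->host);
          else
                  loopsize = stat.size;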

          I don't think we should expect/require correct behavior from userspace applications. I'm not against patching losetup to call fsync() on the file before configuring it, but there will still be older versions of losetup around, as well as other userspace code that directly calls ioctl(fd, LOOP_SET_FD), as in the test program.

          It is probably best to do both - submit one patch to util-linux (where losetup lives) and one to the kernel, but please only mention NFS and not Lustre.

          rajeevm Rajeev Mishra added a comment - - edited

          This issue is related to caching and may not be visible on all filesystems; however, it is observed on Lustre.
          I have identified two possible fixes:
          1. Perform a sync within the loop driver to ensure the actual file size is obtained. This can be done by calling vfs_fsync() before reading the file size, as shown below (in drivers/block/loop.c):

          static loff_t get_size(loff_t offset, loff_t sizelimit, struct file *file)
          {
                loff_t loopsize;

                /* Sync to ensure an accurate file size */
                vfs_fsync(file, 0);    /* <<<< this is the added call */
                loopsize = i_size_read(file->f_mapping->host);
                ...


          2. Ensure the application performs a sync after writing, to guarantee that all data is properly accounted for. This can be done using fsync(backing_fd), as I did in the provided test case (see the sketch below).
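
          For option 2, the application-side change is just an fsync() on the backing file before handing it to the loop driver. A sketch, reusing the hypothetical backing_fd/loop_fd names from the reproducer above:

          /* Flush client-cached data so the server-side file size is current... */
          if (fsync(backing_fd) < 0)
                  perror("fsync backing file");

          /* ...then LOOP_SET_FD sizes the loop device from the real i_size. */
          if (ioctl(loop_fd, LOOP_SET_FD, backing_fd) < 0)
                  perror("LOOP_SET_FD");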


          The "loopsize=1" could potentially happen because the inode i_blocks count has not been updated due to the "dd" write not being synchronous and no blocks are allocated to the file yet. IIRC, the Lustre client will return the blocks count =1 while the file is only in the client cache, so that userspace applications (e.g. "cp") do not incorrectly assume the whole file is sparse and not try to read it at all...

          Does this problem reproduce if you use "dd ... oflag=sync" or "dd ... oflag=direct" so that the backing OST blocks are immediately allocated and the block count is synchronously returned to the client? Have you tried it with a patched kernel as suggested above to force losetup to fetch the correct file size and blocks from the OSTs?


          rajeevm Rajeev Mishra added a comment -

          The i_size_read() on the file reports the following:
          [ 1395.572922] Rajeev loop offset =0 loopsize=512 <<< in case of NFS 
          [ 1395.573708] loop3: detected capacity change from 0 to 512
          [ 1427.038818] Rajeev loop offset =0 loopsize=1 <<< in case of Lustre

          Checking further into why Lustre is reporting the wrong i_size.


          rajeevm Rajeev Mishra added a comment -

          The problem on NFS was different. I debugged it further and realized the block size of the file was the issue on NFS:
          sudo dd if=/dev/zero bs=1 count=512 of=./backing_file.img
          512+0 records in
          512+0 records out
          512 bytes copied, 1.25623 s, 0.4 kB/s
          [rocky@test2-fcntl test]$ sudo ./a.out_nfs
          data len =13, back file_size = 512 loop_file sz = 0
          debug=super warning error emerg console
          Wrote 13 bytes to /dev/loop3
          Lustre case:
          [rocky@test2-fcntl test]$ sudo dd if=/dev/zero bs=1 count=512 of=/mnt/lustre/backing_file.img
          512+0 records in
          512+0 records out
          512 bytes copied, 0.00515372 s, 99.3 kB/s
          [rocky@test2-fcntl test]$ sudo ./a.out_lustr
          data len =13, back file_size = 512 loop_file sz = 0
          debug=super warning error emerg console
          write to loop device: No space left on device


          People

            Assignee: Vladimir Saveliev
            Reporter: Vladimir Saveliev
            Votes: 0
            Watchers: 4
