[LU-15122] Lustre ASSERTION( iobuf->dr_rw == 0 ) crash on Arm server end Created: 18/Oct/21  Updated: 20/Nov/21  Resolved: 20/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Xinliang Liu Assignee: Xinliang Liu
Resolution: Fixed Votes: 0
Labels: arm, arm-server
Environment:

[centos@beegfs-test-01 ~]$ lsb_release -a
LSB Version: :core-4.1-aarch64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 8.3.2011
Release: 8.3.2011
Codename: n/a
[centos@beegfs-test-01 ~]$ uname -a
Linux beegfs-test-01.novalocal 4.18.0-305.7.1.el8_lustre.aarch64 #1 SMP Mon Jul 19 08:24:26 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
[centos@beegfs-test-01 ~]$


Issue Links:
Related
is related to LU-11787 sanityn test 71a fails with ‘data is ... Resolved
Epic/Theme: ldiskfs
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It is easy to enconter bellow crash when running some tests on Arm all-in-one(server and client) test evironment. Tests such as fsx tests: sanityn 16a, 16b etc.

[57580.041008] Lustre: DEBUG MARKER: -----============= acceptance-small: sanityn ============----- Tue Oct 12 02:22:12 UTC 2021
[57580.414512] Lustre: DEBUG MARKER: excepting tests: 28
[57580.528887] Lustre: Mounted lustre-client
[57580.868328] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing check_config_client /mnt/lustre
[57581.696557] Lustre: DEBUG MARKER: Using TIMEOUT=20
[57581.854832] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
[57582.331371] Lustre: DEBUG MARKER: == sanityn test 16b: 12500 iterations of dual-mount fsx at small size ========================================================== 02:22:15 (1634005335)
[57585.485613] LustreError: 5594:0:(osd_io.c:538:osd_do_bio()) ASSERTION( iobuf->dr_rw == 0 ) failed: page_idx 1, block_idx 16, i 13,start_blocks: 0, count: 29, npages: 2
[57585.488764] LustreError: 5594:0:(osd_io.c:538:osd_do_bio()) LBUG
[57585.490011] Pid: 5594, comm: ll_ost_io03_000 4.18.0-305.7.1.el8_lustre.aarch64 #1 SMP Mon Jul 19 08:24:26 UTC 2021
[57585.492109] Call Trace:
[57585.492633] [<0>] libcfs_call_trace+0xb8/0x118 [libcfs]
[57585.493719] [<0>] lbug_with_loc+0x60/0xa0 [libcfs]
[57585.494721] [<0>] osd_do_bio.constprop.21+0x6c4/0xcf8 [osd_ldiskfs]
[57585.496012] [<0>] osd_ldiskfs_map_write.constprop.20+0xb4/0xf0 [osd_ldiskfs]
[57585.497464] [<0>] osd_ldiskfs_map_inode_pages+0x66c/0x958 [osd_ldiskfs]
[57585.498880] [<0>] osd_write_commit+0x4fc/0xbb8 [osd_ldiskfs]
[57585.500047] [<0>] ofd_commitrw_write+0x658/0x1e68 [ofd]
[57585.501133] [<0>] ofd_commitrw+0x360/0xa18 [ofd]
[57585.502180] [<0>] tgt_brw_write+0x1624/0x2d90 [ptlrpc]
[57585.503299] [<0>] tgt_handle_request0+0xd0/0x978 [ptlrpc]
[57585.504466] [<0>] tgt_request_handle+0x7c0/0x1a38 [ptlrpc]
[57585.505663] [<0>] ptlrpc_server_handle_request+0x3bc/0x11e8 [ptlrpc]
[57585.507026] [<0>] ptlrpc_main+0xd28/0x15f0 [ptlrpc]
[57585.508034] [<0>] kthread+0x130/0x138
[57585.508861] [<0>] ret_from_fork+0x10/0x18
[57585.509697] Kernel panic - not syncing: LBUG
[57585.510570] CPU: 6 PID: 5594 Comm: ll_ost_io03_000 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-305.7.1.el8_lustre.aarch64 #1
[57585.513181] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[57585.514576] Call trace:
[57585.515080]  dump_backtrace+0x0/0x188
[57585.515826]  show_stack+0x24/0x30
[57585.516530]  dump_stack+0x9c/0xbc
[57585.517189]  panic+0x130/0x2f8
[57585.517802]  param_set_delay_minmax.isra.1+0x0/0xd0 [libcfs]
[57585.518972]  osd_do_bio.constprop.21+0x6c4/0xcf8 [osd_ldiskfs]
[57585.520204]  osd_ldiskfs_map_write.constprop.20+0xb4/0xf0 [osd_ldiskfs]
[57585.521607]  osd_ldiskfs_map_inode_pages+0x66c/0x958 [osd_ldiskfs]
[57585.522910]  osd_write_commit+0x4fc/0xbb8 [osd_ldiskfs]
[57585.524012]  ofd_commitrw_write+0x658/0x1e68 [ofd]
[57585.525023]  ofd_commitrw+0x360/0xa18 [ofd]
[57585.525982]  tgt_brw_write+0x1624/0x2d90 [ptlrpc]
[57585.527046]  tgt_handle_request0+0xd0/0x978 [ptlrpc]
[57585.528160]  tgt_request_handle+0x7c0/0x1a38 [ptlrpc]
[57585.529316]  ptlrpc_server_handle_request+0x3bc/0x11e8 [ptlrpc]
[57585.530622]  ptlrpc_main+0xd28/0x15f0 [ptlrpc]
[57585.531557]  kthread+0x130/0x138
[57585.532236]  ret_from_fork+0x10/0x18
[57585.533308] SMP: stopping secondary CPUs
[57585.539225] Starting crashdump kernel...
[57585.540042] Bye!



 Comments   
Comment by Xinliang Liu [ 18/Oct/21 ]

Add some print msgs in the blocks[] array assignment and grab the kernel log.

It seems the blocks[] array access is exceeding the assignment size.

Kernel msgs printing patch:

--- a/lustre/osd-ldiskfs/osd_io.c
+++ b/lustre/osd-ldiskfs/osd_io.c
@@ -1143,6 +1150,7 @@ cont_map: static int osd_ldiskfs_map_inode_pages(struct inode *inode, ## Assignment function
                         * want to avoid that as much as possible.
                         */
                        if (oh->oh_declared_ext <= 0) {
+                                CDEBUG(D_OTHER,"call osd_ldiskfs_map_write here!!\n");
                                rc = osd_ldiskfs_map_write(inode,
                                        iobuf, osd, start_blocks,
                                        count, &disk_size, user_size);

@@ -1188,6 +1198,18 @@ cont_map:
                        rc = 0;
                }+               if (create && iobuf->dr_rw) {
+                CDEBUG(D_OTHER,"Page_index=%lu: rc=%d, (blen=%ld, total=%ld, m_len=%d), blocks:{\n",
+                                fp->index, ret, blen, total, map.m_len);
+                for (c=0; c < total; c++){
+                        unsigned long long value =  *(blocks + c);
+                        if (value == 0)
+                                CDEBUG(D_OTHER,"blocks[%d]=%llu \n", c, value);
+                }
+                CDEBUG(D_OTHER,"}\n");
+                }
+
+

--- a/lustre/osd-ldiskfs/osd_io.c
+++ b/lustre/osd-ldiskfs/osd_io.c
@@ -530,12 +530,17 @@ static int osd_do_bio(struct osd_device *osd, struct inode *inode,
                        nblocks = 1;                        if (blocks[block_idx + i] == 0) {  /* hole */
-                               LASSERTF(iobuf->dr_rw == 0,
-                                        "page_idx %u, block_idx %u, i %u,"
-                                        "start_blocks: %llu, count: %llu, npages: %d\n",
-                                        page_idx, block_idx, i,
-                                        (unsigned long long)start_blocks,
-                                        (unsigned long long)count, npages);
+                               if (iobuf->dr_rw == 1) {
+                                       CDEBUG(D_OTHER,"Write with hole!!: blocks[%d]=0, iobuf_line=%d, iobuf_error=%d,"
+                                                "page_idx %u, block_idx %u, i %u,"
+                                                "start_blocks: %llu, count: %llu, npages: %d\n",
+                                                block_idx + i, iobuf->dr_init_at, iobuf->dr_error,
+                                                page_idx, block_idx, i,
+                                                (unsigned long long)start_blocks,
+                                                (unsigned long long)count, npages);
+                                       rc = -EFBIG;
+                                       goto out;
+                               }
                                memset(kmap(page) + page_offset, 0, blocksize);
                                kunmap(page);
                                continue;

Kernel log: 

blocks[] array assignment size is 25(see total var), so blocks[25] exceeds the assignment size.

(osd_io.c:1093:osd_ldiskfs_map_inode_pages()) inode 104: map 2 pages from 22
(osd_io.c:1095:osd_ldiskfs_map_inode_pages()) Enter inode 104: create:1, is_write:1 =======>
(osd_io.c:1203:osd_ldiskfs_map_inode_pages()) Page_index=22: rc=25, (blen=32, total=25, m_len=25), blocks:{
(osd_io.c:1209:osd_ldiskfs_map_inode_pages()) }
(osd_io.c:1153:osd_ldiskfs_map_inode_pages()) call osd_ldiskfs_map_write here!!
(osd_io.c:503:osd_do_bio()) Process entered
(osd_io.c:426:osd_bio_init()) Process entered
(osd_io.c:449:osd_bio_init()) Process leaving (rc=0 : 0 : 0)
(osd_io.c:540:osd_do_bio()) Write with hole!!: blocks[25]=0, iobuf_line=1593, iobuf_error=0,page_idx 1, block_idx 16, i 9,start_blocks: 0, count: 25, npages: 2
Comment by Gerrit Updater [ 19/Oct/21 ]

"xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/45288
Subject: LU-15122 ldiskfs: Fix ASSERTION( iobuf->dr_rw == 0 ) crash on arm
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6c695e86b8418a3b821062adf58000b5c47dcb69

Comment by Gerrit Updater [ 20/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45288/
Subject: LU-15122 osd-ldiskfs: Fix ASSERTION( iobuf->dr_rw == 0 ) with 64KB PAGE_SIZE
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ec03ee091f931125f3bbeb3628b0c5aaa4709930

Comment by Peter Jones [ 20/Nov/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:15:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.