[LU-11787] sanityn test 71a fails with ‘data is not flushed from client’ Created: 15/Dec/18 Updated: 26/Jan/24 Resolved: 11/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.2, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.15.0, Lustre 2.15.3 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Xinliang Liu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | arm, ppc64, ppc64le |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanityn test_71a fails with 'data is not flushed from client' for ARM clients. These failures started on October 26, 2018.

Looking at a recent failure, https://testing.whamcloud.com/test_sets/f1121dd4-fdef-11e8-b837-52540065bddc , we see the following in the client test_log:

== sanityn test 71a: correct file map just after write operation is finished ========================= 00:36:26 (1544488586)
CMD: trevis-24vm4 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.00437335 s, 9.4 MB/s
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.00155542 s, 26.3 MB/s
  File: '/mnt/lustre2/f71a.sanityn'
  Size: 163840    Blocks: 1    IO Block: 4194304    regular file
Device: 2c54f966h/743766374d    Inode: 144115205322894149    Links: 1
Access: (0644/-rw-r--r--)    Uid: (    0/    root)    Gid: (    0/    root)
Access: 2018-12-11 00:36:27.000000000 +0000
Modify: 2018-12-11 00:36:27.000000000 +0000
Change: 2018-12-11 00:36:27.000000000 +0000
 Birth: -
209708
fd: 3
extent in offset 0, length 122880
flags: 80000000
extent in offset 122880, length 40960
flags: 80000001
No unwritten extents, extents number 2, file size 163840, original size 81920
 sanityn test_71a: @@@@@@ FAIL: data is not flushed from client

There are no errors in any of the console logs.

Logs for more of these test failures are at |
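The "extent in offset ..." lines above come from the test's FIEMAP check of the file map. A rough manual approximation is sketched below; it is a simplification that assumes a single client mount at /mnt/lustre2 (sanityn really writes through one mount and checks through another) and uses filefrag from e2fsprogs in place of the test's own FIEMAP helper:

# Sketch only: reproduce the hole-data-hole-data pattern from the test,
# then ask the kernel for the extent map. The test expects the map to
# already reflect both writes without an explicit sync.
dd if=/dev/urandom of=/mnt/lustre2/f71a.sanityn bs=40K seek=1 count=1
dd if=/dev/urandom of=/mnt/lustre2/f71a.sanityn bs=40K seek=3 count=1
filefrag -v /mnt/lustre2/f71a.sanityn  # -v prints each extent's logical offset, length, and flags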
| Comments |
| Comment by Peter Jones [ 17/Dec/18 ] |
|
Jian, could you please investigate? Thanks. Peter |
| Comment by Sarah Liu [ 18/Oct/19 ] |
|
Hit a similar issue on a PPC client on 2.12.3:

== sanityn test 71a: correct file map just after write operation is finished ========================= 23:54:23 (1570578863)
CMD: trevis-55vm2 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.00459198 s, 8.9 MB/s
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.000819573 s, 50.0 MB/s
  File: '/mnt/lustre2/f71a.sanityn'
  Size: 163840    Blocks: 1    IO Block: 4194304    regular file
Device: 2c54f966h/743766374d    Inode: 144115205272562497    Links: 1
Access: (0644/-rw-r--r--)    Uid: (    0/    root)    Gid: (    0/    root)
Access: 2019-10-08 23:54:24.000000000 +0000
Modify: 2019-10-08 23:54:24.000000000 +0000
Change: 2019-10-08 23:54:24.000000000 +0000
 Birth: -
149703
fd: 3
error while ioctl 2
 sanityn test_71a: @@@@@@ FAIL: data is not flushed from client

OSS crash:

[17669.845624] Lustre: DEBUG MARKER: dmesg
[17670.427979] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 71a: correct file map just after write operation is finished ========================= 23:54:23 \(1570578863\)
[17670.623376] Lustre: DEBUG MARKER: == sanityn test 71a: correct file map just after write operation is finished ========================= 23:54:23 (1570578863)
[17671.194035] general protection fault: 0000 [#1] SMP
[17671.195205] Modules linked in: dm_flakey osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too crct10dif_pclmul crct10dif_common ata_piix crc32c_intel serio_raw
[17671.209748] libata virtio_pci virtio_ring virtio 8139cp mii floppy [last unloaded: dm_flakey]
[17671.211639] CPU: 1 PID: 27825 Comm: ps Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.x86_64 #1
[17671.213628] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[17671.214627] task: ffff9b3dd354c100 ti: ffff9b3debae0000 task.ti: ffff9b3debae0000
[17671.216012] RIP: 0010:[<ffffffffbccf7ab9>] [<ffffffffbccf7ab9>] cap_capable+0x29/0x80
[17671.217533] RSP: 0018:ffff9b3debae3c60 EFLAGS: 00010283
[17671.218461] RAX: ffffffffbccf7a90 RBX: 000000000000000d RCX: 0000000000000000
[17671.219703] RDX: 0000000000000013 RSI: 80d764bdffffffff RDI: ffff9b3dd3a0d900
[17671.220929] RBP: ffff9b3debae3c60 R08: ffffffffbd64d780 R09: ffff9b3dfd001300
[17671.222150] R10: ffff9b3dfd001300 R11: ffffffffbcbd753d R12: ffff9b3df8fd0000
[17671.223393] R13: 0000000000000000 R14: ffff9b3def57d980 R15: ffffffffbd654920
[17671.224582] FS: 00007fd3c6583880(0000) GS:ffff9b3dffd00000(0000) knlGS:0000000000000000
[17671.225989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17671.226966] CR2: 00007fd3c5aba060 CR3: 000000005b5da000 CR4: 00000000000606e0
[17671.228175] Call Trace:
[17671.228700] [<ffffffffbccf9e6b>] security_capable_noaudit+0x1b/0x20
[17671.229815] [<ffffffffbcaa6e45>] has_ns_capability_noaudit+0x15/0x20
[17671.230908] [<ffffffffbcaa7105>] ptrace_has_cap+0x35/0x40
[17671.231875] [<ffffffffbcaa7b51>] __ptrace_may_access+0x71/0x150
[17671.232896] [<ffffffffbcaa7efe>] ptrace_may_access+0x2e/0x50
[17671.233885] [<ffffffffbccc0841>] do_task_stat+0x91/0xbb0
[17671.234850] [<ffffffffbcc657a4>] ? mntput+0x24/0x40
[17671.235711] [<ffffffffbcc53ac2>] ? path_openat+0x172/0x640
[17671.236682] [<ffffffffbcc3f69a>] ? __check_object_size+0x1ca/0x250
[17671.237757] [<ffffffffbccc1e84>] proc_tgid_stat+0x14/0x20
[17671.238689] [<ffffffffbccbc062>] proc_single_show+0x52/0x90
[17671.239649] [<ffffffffbcc69fc0>] seq_read+0x130/0x440
[17671.240555] [<ffffffffbcc4256f>] vfs_read+0x9f/0x170
[17671.241425] [<ffffffffbcc4342f>] SyS_read+0x7f/0xf0
[17671.242295] [<ffffffffbd177d21>] ? system_call_after_swapgs+0xae/0x146
[17671.243447] [<ffffffffbd177ddb>] system_call_fastpath+0x22/0x27
[17671.244932] [<ffffffffbd177d21>] ? system_call_after_swapgs+0xae/0x146
[17671.246185] Code: 1f 00 66 66 66 66 90 55 4c 8b 87 80 00 00 00 48 89 e5 eb 09 66 0f 1f 44 00 00 48 89 c6 4c 39 c6 74 28 48 81 fe 80 d7 64 bd 74 3f <48> 8b 86 c8 00 00 00 4c 39 c0 75 e3 8b 86 d0 00 00 00 39 47 14
[17671.251802] RIP [<ffffffffbccf7ab9>] cap_capable+0x29/0x80
[17671.252816] RSP <ffff9b3debae3c60> |
| Comment by James Nunez (Inactive) [ 20/Feb/20 ] |
|
The patch that landed, https://review.whamcloud.com/37561/, puts sanityn tests 16a and 71a on the ALWAYS_EXCEPT list for PPC client testing. This ticket should remain open until those tests are fixed and the tests are taken off the list. |
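For reference, such exclusions live in the test script itself. A minimal sketch of what a PPC exception in lustre/tests/sanityn.sh can look like follows; the exact hunk in patch 37561 may differ:

# sketch only - see https://review.whamcloud.com/37561/ for the real change
if [[ $(uname -m) == ppc64* ]]; then
	# LU-11787: FIEMAP-based checks fail on 64KB PAGE_SIZE clients
	ALWAYS_EXCEPT="$ALWAYS_EXCEPT 16a 71a"
fi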
| Comment by Xinliang Liu [ 11/Oct/21 ] |
|
Arm crashes when running sanityn test 71a on the master branch:

[ 1330.973746] Lustre: DEBUG MARKER: Using TIMEOUT=20
[ 1333.254497] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
[ 1338.681588] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing check_logdir /tmp/test_logs/2021-10-11/020444
[ 1339.086489] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing yml_node
[ 1339.419862] Lustre: DEBUG MARKER: Client: 2.14.52.80
[ 1339.495670] Lustre: DEBUG MARKER: MDS: 2.14.52.80
[ 1339.567308] Lustre: DEBUG MARKER: OSS: 2.14.52.80
[ 1339.627097] Lustre: DEBUG MARKER: -----============= acceptance-small: sanityn ============----- Mon Oct 11 02:05:23 UTC 2021
[ 1339.980338] Lustre: DEBUG MARKER: excepting tests: 28
[ 1340.091948] Lustre: Mounted lustre-client
[ 1340.468327] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing check_config_client /mnt/lustre
[ 1341.262854] Lustre: DEBUG MARKER: Using TIMEOUT=20
[ 1341.443750] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
[ 1342.212768] Lustre: DEBUG MARKER: == sanityn test 71a: correct file map just after write operation is finished ========================================================== 02:05:25 (1633917925)
[ 1342.335774] LustreError: 4391:0:(osd_io.c:538:osd_do_bio()) ASSERTION( iobuf->dr_rw == 0 ) failed: page_idx 1, block_idx 16, i 14,start_blocks: 0, count: 30, npages: 3
[ 1342.338924] LustreError: 4391:0:(osd_io.c:538:osd_do_bio()) LBUG
[ 1342.340168] Pid: 4391, comm: ll_ost_io06_002 4.18.0-305.7.1.el8_lustre.aarch64 #1 SMP Mon Jul 19 08:24:26 UTC 2021
[ 1342.342344] Call Trace:
[ 1342.342858] [<0>] libcfs_call_trace+0xb8/0x118 [libcfs]
[ 1342.343908] [<0>] lbug_with_loc+0x60/0xa0 [libcfs]
[ 1342.344888] [<0>] osd_do_bio.constprop.21+0x6c4/0xcf8 [osd_ldiskfs]
[ 1342.346130] [<0>] osd_ldiskfs_map_write.constprop.20+0xb4/0xf0 [osd_ldiskfs]
[ 1342.347535] [<0>] osd_ldiskfs_map_inode_pages+0x66c/0x958 [osd_ldiskfs]
[ 1342.348841] [<0>] osd_write_commit+0x4fc/0xbb8 [osd_ldiskfs]
[ 1342.349971] [<0>] ofd_commitrw_write+0x658/0x1e68 [ofd]
[ 1342.351002] [<0>] ofd_commitrw+0x360/0xa18 [ofd]
[ 1342.352134] [<0>] tgt_brw_write+0x1624/0x2d90 [ptlrpc]
[ 1342.353229] [<0>] tgt_handle_request0+0xd0/0x978 [ptlrpc]
[ 1342.354356] [<0>] tgt_request_handle+0x7c0/0x1a38 [ptlrpc]
[ 1342.355513] [<0>] ptlrpc_server_handle_request+0x3bc/0x11e8 [ptlrpc]
[ 1342.356824] [<0>] ptlrpc_main+0xd28/0x15f0 [ptlrpc]
[ 1342.357786] [<0>] kthread+0x130/0x138
[ 1342.358508] [<0>] ret_from_fork+0x10/0x18
[ 1342.359295] Kernel panic - not syncing: LBUG
[ 1342.360146] CPU: 12 PID: 4391 Comm: ll_ost_io06_002 Kdump: loaded Tainted: P OE --------- - - 4.18.0-305.7.1.el8_lustre.aarch64 #1
[ 1342.362770] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 1342.364231] Call trace:
[ 1342.364751] dump_backtrace+0x0/0x188
[ 1342.365526] show_stack+0x24/0x30
[ 1342.366234] dump_stack+0x9c/0xbc
[ 1342.366937] panic+0x130/0x2f8
[ 1342.367608] param_set_delay_minmax.isra.1+0x0/0xd0 [libcfs]
[ 1342.368816] osd_do_bio.constprop.21+0x6c4/0xcf8 [osd_ldiskfs]
[ 1342.370058] osd_ldiskfs_map_write.constprop.20+0xb4/0xf0 [osd_ldiskfs]
[ 1342.371477] osd_ldiskfs_map_inode_pages+0x66c/0x958 [osd_ldiskfs]
[ 1342.372837] osd_write_commit+0x4fc/0xbb8 [osd_ldiskfs]
[ 1342.373951] ofd_commitrw_write+0x658/0x1e68 [ofd]
[ 1342.374970] ofd_commitrw+0x360/0xa18 [ofd]
[ 1342.375934] tgt_brw_write+0x1624/0x2d90 [ptlrpc]
[ 1342.376999] tgt_handle_request0+0xd0/0x978 [ptlrpc]
[ 1342.378123] tgt_request_handle+0x7c0/0x1a38 [ptlrpc]
[ 1342.379258] ptlrpc_server_handle_request+0x3bc/0x11e8 [ptlrpc]
[ 1342.380587] ptlrpc_main+0xd28/0x15f0 [ptlrpc]
[ 1342.381530] kthread+0x130/0x138
[ 1342.382242] ret_from_fork+0x10/0x18
[ 1342.383021] SMP: stopping secondary CPUs
[ 1342.389567] Starting crashdump kernel...
[ 1342.390356] Bye!
|
| Comment by Xinliang Liu [ 21/Oct/21 ] |
|
Note that the above crash is a separate issue, tracked in this bug:
|
| Comment by Gerrit Updater [ 20/Nov/21 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/45629 |
| Comment by Xinliang Liu [ 16/Dec/21 ] |
|
Currently, ldiskfs block allocation is PAGE_SIZE aligned (see osd_ldiskfs_map_inode_pages()), so the test case needs to be modified to use PAGE_SIZE-aligned offsets for Arm. |
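To illustrate what "PAGE_SIZE aligned" means for the test, the write offsets would need to be derived from the client page size instead of the hard-coded 40KB blocks. A sketch follows (getconf is standard; the specific pattern is illustrative, not the landed patch):

# write data - hole - data at page-aligned offsets so the hole survives
# any PAGE_SIZE (4KB on x86_64, 64KB on typical Arm/PPC kernels)
PAGE_SIZE=$(getconf PAGE_SIZE)
dd if=/dev/urandom of=$DIR1/$tfile bs=$PAGE_SIZE seek=0 count=1
dd if=/dev/urandom of=$DIR1/$tfile bs=$PAGE_SIZE seek=2 count=1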
| Comment by Andreas Dilger [ 29/Jan/22 ] |
|
Looking at the test itself, it seems clear that it won't work properly on a 64KB PAGE_SIZE system:

# write data this way: hole - data - hole - data
dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=1 count=1
dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=3 count=1

because the "hole" is from 80KB-120KB, and with 64KB PAGE_SIZE the 40-80KB write will actually dirty the 64KB-128KB page, so no hole is created... The test needs to be updated to use seek=0 for the first dd and seek=4 for the second dd, and then it should pass. It may need some adjustments to the later checks, but at least there will be a hole from 64KB-128KB for 64KB PAGE_SIZE clients, and 40-160KB for 4KB PAGE_SIZE clients. If we want to make the file consistent, it should use bs=64K. |
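Put together (seek=0/seek=4 plus bs=64K), the suggested write pattern would be roughly as follows; this is a sketch of the suggestion above, not the patch that eventually landed:

# data 0-64KB, hole 64KB-256KB, data 256KB-320KB: the hole stays a hole
# on both 4KB and 64KB PAGE_SIZE clients
dd if=/dev/urandom of=$DIR1/$tfile bs=64K seek=0 count=1
dd if=/dev/urandom of=$DIR1/$tfile bs=64K seek=4 count=1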
| Comment by Gerrit Updater [ 11/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45629/ |
| Comment by Peter Jones [ 11/Jun/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 12/Jun/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51287 |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51287/ |