[LU-11787] sanityn test 71a fails with ‘data is not flushed from client’ Created: 15/Dec/18  Updated: 26/Jan/24  Resolved: 11/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.2, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.15.0, Lustre 2.15.3
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Xinliang Liu
Resolution: Fixed Votes: 0
Labels: arm, ppc64, ppc64le

Issue Links:
Related
is related to LU-15122 Lustre ASSERTION( iobuf->dr_rw == 0 )... Resolved
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanityn test_71a fails with ‘data is not flushed from client’ for ARM clients. These failures started on October 26, 2018.

Looking at a recent failure, https://testing.whamcloud.com/test_sets/f1121dd4-fdef-11e8-b837-52540065bddc , we see the following in the client test_log

== sanityn test 71a: correct file map just after write operation is finished ========================= 00:36:26 (1544488586)
CMD: trevis-24vm4 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.00437335 s, 9.4 MB/s
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.00155542 s, 26.3 MB/s
  File: '/mnt/lustre2/f71a.sanityn'
  Size: 163840    	Blocks: 1          IO Block: 4194304 regular file
Device: 2c54f966h/743766374d	Inode: 144115205322894149  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-12-11 00:36:27.000000000 +0000
Modify: 2018-12-11 00:36:27.000000000 +0000
Change: 2018-12-11 00:36:27.000000000 +0000
 Birth: -
209708
fd: 3
extent in offset 0, length 122880
flags: 80000000
extent in offset 122880, length 40960
flags: 80000001
No unwritten extents, extents number 2, file size 163840, original size 81920
 sanityn test_71a: @@@@@@ FAIL: data is not flushed from client 

There are no errors in any of the console logs.

Logs for more of these test failures are at
https://testing.whamcloud.com/test_sets/8b2f4282-d9d0-11e8-b46b-52540065bddc
https://testing.whamcloud.com/test_sets/83526f8c-df5f-11e8-a871-52540065bddc
https://testing.whamcloud.com/test_sets/1de368ba-fa38-11e8-bb6b-52540065bddc



 Comments   
Comment by Peter Jones [ 17/Dec/18 ]

Jian

Could you please investigate?

Thanks

Peter

Comment by Sarah Liu [ 18/Oct/19 ]

Hit a similar issue on a PPC client on 2.12.3:
https://testing.whamcloud.com/test_sets/dd84edc0-eb0e-11e9-b62b-52540065bddc

== sanityn test 71a: correct file map just after write operation is finished ========================= 23:54:23 (1570578863)
CMD: trevis-55vm2 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.00459198 s, 8.9 MB/s
1+0 records in
1+0 records out
40960 bytes (41 kB) copied, 0.000819573 s, 50.0 MB/s
  File: '/mnt/lustre2/f71a.sanityn'
  Size: 163840    	Blocks: 1          IO Block: 4194304 regular file
Device: 2c54f966h/743766374d	Inode: 144115205272562497  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-08 23:54:24.000000000 +0000
Modify: 2019-10-08 23:54:24.000000000 +0000
Change: 2019-10-08 23:54:24.000000000 +0000
 Birth: -
149703
fd: 3
error while ioctl 2
 sanityn test_71a: @@@@@@ FAIL: data is not flushed from client 

OSS crash

[17669.845624] Lustre: DEBUG MARKER: dmesg
[17670.427979] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 71a: correct file map just after write operation is finished ========================= 23:54:23 \(1570578863\)
[17670.623376] Lustre: DEBUG MARKER: == sanityn test 71a: correct file map just after write operation is finished ========================= 23:54:23 (1570578863)
[17671.194035] general protection fault: 0000 [#1] SMP 
[17671.195205] Modules linked in: dm_flakey osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too crct10dif_pclmul crct10dif_common ata_piix crc32c_intel serio_raw
[17671.209748]  libata virtio_pci virtio_ring virtio 8139cp mii floppy [last unloaded: dm_flakey]
[17671.211639] CPU: 1 PID: 27825 Comm: ps Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.x86_64 #1
[17671.213628] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[17671.214627] task: ffff9b3dd354c100 ti: ffff9b3debae0000 task.ti: ffff9b3debae0000
[17671.216012] RIP: 0010:[<ffffffffbccf7ab9>]  [<ffffffffbccf7ab9>] cap_capable+0x29/0x80
[17671.217533] RSP: 0018:ffff9b3debae3c60  EFLAGS: 00010283
[17671.218461] RAX: ffffffffbccf7a90 RBX: 000000000000000d RCX: 0000000000000000
[17671.219703] RDX: 0000000000000013 RSI: 80d764bdffffffff RDI: ffff9b3dd3a0d900
[17671.220929] RBP: ffff9b3debae3c60 R08: ffffffffbd64d780 R09: ffff9b3dfd001300
[17671.222150] R10: ffff9b3dfd001300 R11: ffffffffbcbd753d R12: ffff9b3df8fd0000
[17671.223393] R13: 0000000000000000 R14: ffff9b3def57d980 R15: ffffffffbd654920
[17671.224582] FS:  00007fd3c6583880(0000) GS:ffff9b3dffd00000(0000) knlGS:0000000000000000
[17671.225989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17671.226966] CR2: 00007fd3c5aba060 CR3: 000000005b5da000 CR4: 00000000000606e0
[17671.228175] Call Trace:
[17671.228700]  [<ffffffffbccf9e6b>] security_capable_noaudit+0x1b/0x20
[17671.229815]  [<ffffffffbcaa6e45>] has_ns_capability_noaudit+0x15/0x20
[17671.230908]  [<ffffffffbcaa7105>] ptrace_has_cap+0x35/0x40
[17671.231875]  [<ffffffffbcaa7b51>] __ptrace_may_access+0x71/0x150
[17671.232896]  [<ffffffffbcaa7efe>] ptrace_may_access+0x2e/0x50
[17671.233885]  [<ffffffffbccc0841>] do_task_stat+0x91/0xbb0
[17671.234850]  [<ffffffffbcc657a4>] ? mntput+0x24/0x40
[17671.235711]  [<ffffffffbcc53ac2>] ? path_openat+0x172/0x640
[17671.236682]  [<ffffffffbcc3f69a>] ? __check_object_size+0x1ca/0x250
[17671.237757]  [<ffffffffbccc1e84>] proc_tgid_stat+0x14/0x20
[17671.238689]  [<ffffffffbccbc062>] proc_single_show+0x52/0x90
[17671.239649]  [<ffffffffbcc69fc0>] seq_read+0x130/0x440
[17671.240555]  [<ffffffffbcc4256f>] vfs_read+0x9f/0x170
[17671.241425]  [<ffffffffbcc4342f>] SyS_read+0x7f/0xf0
[17671.242295]  [<ffffffffbd177d21>] ? system_call_after_swapgs+0xae/0x146
[17671.243447]  [<ffffffffbd177ddb>] system_call_fastpath+0x22/0x27
[17671.244932]  [<ffffffffbd177d21>] ? system_call_after_swapgs+0xae/0x146
[17671.246185] Code: 1f 00 66 66 66 66 90 55 4c 8b 87 80 00 00 00 48 89 e5 eb 09 66 0f 1f 44 00 00 48 89 c6 4c 39 c6 74 28 48 81 fe 80 d7 64 bd 74 3f <48> 8b 86 c8 00 00 00 4c 39 c0 75 e3 8b 86 d0 00 00 00 39 47 14 
[17671.251802] RIP  [<ffffffffbccf7ab9>] cap_capable+0x29/0x80
[17671.252816]  RSP <ffff9b3debae3c60>
Comment by James Nunez (Inactive) [ 20/Feb/20 ]

The patch that landed, https://review.whamcloud.com/37561/, puts sanityn tests 16a and 71a on the ALWAYS_EXCEPT list for PPC client testing. This ticket should remain open until those tests are fixed and removed from the list.
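For reference, that kind of arch-conditional exclusion in lustre/tests/sanityn.sh takes roughly the following form (a sketch only; the exact bug-number comment and test list are in the landed patch):

if [[ $(uname -m) == ppc64* ]]; then
        # bug number for skipped tests: LU-11787
        ALWAYS_EXCEPT+=" 16a 71a"
fi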

Comment by Xinliang Liu [ 11/Oct/21 ]

ARM crash when running sanityn test 71a on the master branch:

[ 1330.973746] Lustre: DEBUG MARKER: Using TIMEOUT=20
[ 1333.254497] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
[ 1338.681588] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing check_logdir /tmp/test_logs/2021-10-11/020444
[ 1339.086489] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing yml_node
[ 1339.419862] Lustre: DEBUG MARKER: Client: 2.14.52.80
[ 1339.495670] Lustre: DEBUG MARKER: MDS: 2.14.52.80
[ 1339.567308] Lustre: DEBUG MARKER: OSS: 2.14.52.80
[ 1339.627097] Lustre: DEBUG MARKER: -----============= acceptance-small: sanityn ============----- Mon Oct 11 02:05:23 UTC 2021
[ 1339.980338] Lustre: DEBUG MARKER: excepting tests: 28
[ 1340.091948] Lustre: Mounted lustre-client
[ 1340.468327] Lustre: DEBUG MARKER: beegfs-test-01.novalocal: executing check_config_client /mnt/lustre
[ 1341.262854] Lustre: DEBUG MARKER: Using TIMEOUT=20
[ 1341.443750] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
[ 1342.212768] Lustre: DEBUG MARKER: == sanityn test 71a: correct file map just after write operation is finished ========================================================== 02:05:25 (1633917925)
[ 1342.335774] LustreError: 4391:0:(osd_io.c:538:osd_do_bio()) ASSERTION( iobuf->dr_rw == 0 ) failed: page_idx 1, block_idx 16, i 14,start_blocks: 0, count: 30, npages: 3
[ 1342.338924] LustreError: 4391:0:(osd_io.c:538:osd_do_bio()) LBUG
[ 1342.340168] Pid: 4391, comm: ll_ost_io06_002 4.18.0-305.7.1.el8_lustre.aarch64 #1 SMP Mon Jul 19 08:24:26 UTC 2021
[ 1342.342344] Call Trace:
[ 1342.342858] [<0>] libcfs_call_trace+0xb8/0x118 [libcfs]
[ 1342.343908] [<0>] lbug_with_loc+0x60/0xa0 [libcfs]
[ 1342.344888] [<0>] osd_do_bio.constprop.21+0x6c4/0xcf8 [osd_ldiskfs]
[ 1342.346130] [<0>] osd_ldiskfs_map_write.constprop.20+0xb4/0xf0 [osd_ldiskfs]
[ 1342.347535] [<0>] osd_ldiskfs_map_inode_pages+0x66c/0x958 [osd_ldiskfs]
[ 1342.348841] [<0>] osd_write_commit+0x4fc/0xbb8 [osd_ldiskfs]
[ 1342.349971] [<0>] ofd_commitrw_write+0x658/0x1e68 [ofd]
[ 1342.351002] [<0>] ofd_commitrw+0x360/0xa18 [ofd]
[ 1342.352134] [<0>] tgt_brw_write+0x1624/0x2d90 [ptlrpc]
[ 1342.353229] [<0>] tgt_handle_request0+0xd0/0x978 [ptlrpc]
[ 1342.354356] [<0>] tgt_request_handle+0x7c0/0x1a38 [ptlrpc]
[ 1342.355513] [<0>] ptlrpc_server_handle_request+0x3bc/0x11e8 [ptlrpc]
[ 1342.356824] [<0>] ptlrpc_main+0xd28/0x15f0 [ptlrpc]
[ 1342.357786] [<0>] kthread+0x130/0x138
[ 1342.358508] [<0>] ret_from_fork+0x10/0x18
[ 1342.359295] Kernel panic - not syncing: LBUG
[ 1342.360146] CPU: 12 PID: 4391 Comm: ll_ost_io06_002 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-305.7.1.el8_lustre.aarch64 #1
[ 1342.362770] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 1342.364231] Call trace:
[ 1342.364751]  dump_backtrace+0x0/0x188
[ 1342.365526]  show_stack+0x24/0x30
[ 1342.366234]  dump_stack+0x9c/0xbc
[ 1342.366937]  panic+0x130/0x2f8
[ 1342.367608]  param_set_delay_minmax.isra.1+0x0/0xd0 [libcfs]
[ 1342.368816]  osd_do_bio.constprop.21+0x6c4/0xcf8 [osd_ldiskfs]
[ 1342.370058]  osd_ldiskfs_map_write.constprop.20+0xb4/0xf0 [osd_ldiskfs]
[ 1342.371477]  osd_ldiskfs_map_inode_pages+0x66c/0x958 [osd_ldiskfs]
[ 1342.372837]  osd_write_commit+0x4fc/0xbb8 [osd_ldiskfs]
[ 1342.373951]  ofd_commitrw_write+0x658/0x1e68 [ofd]
[ 1342.374970]  ofd_commitrw+0x360/0xa18 [ofd]
[ 1342.375934]  tgt_brw_write+0x1624/0x2d90 [ptlrpc]
[ 1342.376999]  tgt_handle_request0+0xd0/0x978 [ptlrpc]
[ 1342.378123]  tgt_request_handle+0x7c0/0x1a38 [ptlrpc]
[ 1342.379258]  ptlrpc_server_handle_request+0x3bc/0x11e8 [ptlrpc]
[ 1342.380587]  ptlrpc_main+0xd28/0x15f0 [ptlrpc]
[ 1342.381530]  kthread+0x130/0x138
[ 1342.382242]  ret_from_fork+0x10/0x18
[ 1342.383021] SMP: stopping secondary CPUs
[ 1342.389567] Starting crashdump kernel...
[ 1342.390356] Bye!


Comment by Xinliang Liu [ 21/Oct/21 ]

Note that the above crash is a separate issue, tracked in LU-15122.


Comment by Gerrit Updater [ 20/Nov/21 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/45629
Subject: LU-11787 test: re-enable sanityn 71a for 64K page systems
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 49b892c53b16a7d02c29b4a041175d5303022cbb

Comment by Xinliang Liu [ 16/Dec/21 ]

Currently, ldiskfs block allocation is PAGE_SIZE aligned (see osd_ldiskfs_map_inode_pages()), so the test case needs to be modified to use PAGE_SIZE-aligned offsets for ARM.
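A quick way to see the misalignment on a 64K-page client is to check the offsets the test currently uses against the page size, e.g. (a sketch, not part of the test script):

PAGE_SIZE=$(getconf PAGE_SIZE)      # 65536 on a 64K-page ARM client, 4096 on x86_64
for off in 40960 81920 122880; do   # the 40K/80K/120K boundaries test_71a relies on
        if (( off % PAGE_SIZE )); then
                echo "offset $off is not page-aligned"
        fi
done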

Comment by Andreas Dilger [ 29/Jan/22 ]

Looking at the test itself, it seems clear that it won't work properly on a 64KB PAGE_SIZE system:

        # write data this way: hole - data - hole - data
        dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=1 count=1
        dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=3 count=1

because the "hole" is at 80KB-120KB, and with a 64KB PAGE_SIZE the 40-80KB write crosses the 64KB page boundary, dirtying the 64KB-128KB page, so no hole is created...

The test needs to be updated to use seek=0 for the first dd and seek=4 for the second dd, and it should then pass. It may need some adjustments to the later checks, but at least there will be a hole from 64KB-128KB for 64KB PAGE_SIZE clients, and 40-160KB for 4KB PAGE_SIZE clients. If we want to make the file consistent across page sizes, it should use bs=64K.
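In dd terms, the suggested adjustment would look roughly like this (a sketch of the suggestion above, not the hunk from the landed patch):

        # write data this way: data - hole - data, keeping the hole page-aligned
        dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=0 count=1    # data 0-40KB
        dd if=/dev/urandom of=$DIR1/$tfile bs=40K seek=4 count=1    # data 160-200KB
        # hole: 40KB-160KB on 4KB-page clients; still spans the untouched
        # 64KB-128KB page on 64KB-page clients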

Comment by Gerrit Updater [ 11/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45629/
Subject: LU-11787 test: Fix checkfilemap tests for 64K page
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7c88dfd28b5cc6114a85f187ecb2473657d42c9d

Comment by Peter Jones [ 11/Jun/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 12/Jun/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51287
Subject: LU-11787 test: Fix checkfilemap tests for 64K page
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 5755b9c9148cfe235fb2f4dcb066eb0c83fab274

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51287/
Subject: LU-11787 test: Fix checkfilemap tests for 64K page
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 66d9916dc947064b598f52476fb6482c0bbaff10
