[LU-14346] sanity-pcc test 7a hangs in multiop/down_read Created: 20/Jan/21  Updated: 08/Jun/22

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Kevin Zhao
Resolution: Unresolved Votes: 0
Labels: arm
Environment:

ARM Clients


Issue Links:
Related
is related to LU-14003 PCC: rework PCC mmap Open
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-pcc test_7a hangs in down_read for ARM client testing start on 19 DEC 2020 for Lustre 2.13.57.12 with https://testing.whamcloud.com/test_sets/99e85d3e-9f79-46c1-b0a0-be0eeca6ce02.

Looking at the latest failure at https://testing.whamcloud.com/test_sets/362b1cf0-6c36-4ef8-9627-817f93ab2818, the client2 (vm4) has the following in the console log

[205077.391310] Lustre: DEBUG MARKER: multiop /mnt/lustre/f7a.sanity-pcc OSMWUc
[205077.877415] Lustre: *** cfs_fail_loc=1412, val=0***
[205078.434483] WARNING: CPU: 0 PID: 743007 at kernel/locking/rwsem.c:283 down_read+0x5c/0xb8
[205078.440491] Modules linked in: loop nfsv3 nfs_acl obdecho(OE) ptlrpc_gss(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) crc32_generic rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm sunrpc iw_cm ib_core vfat fat crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_balloon ip_tables ext4 mbcache jbd2 virtio_net net_failover virtio_blk failover virtio_mmio [last unloaded: libcfs]
[205078.480885] CPU: 0 PID: 743007 Comm: multiop Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-193.28.1.el8_2.aarch64 #1
[205078.489182] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[205078.493966] pstate: 40000005 (nZcv daif -PAN -UAO)
[205078.497344] pc : down_read+0x5c/0xb8
[205078.499849] lr : down_read+0x28/0xb8
[205078.502347] sp : ffff000015c6fe10
[205078.504701] x29: ffff000015c6fe10 x28: 0000000000000002 
[205078.508440] x27: 0000000000000000 x26: ffff8000be9dfc60 
[205078.512107] x25: 000000009200004f x24: 0000000000000055 
[205078.515790] x23: ffff8000be9dfc00 x22: 000000009200004f 
[205078.519509] x21: 0000ffff95650000 x20: ffff80006ac7cc00 
[205078.523182] x19: ffff8000be9dfc60 x18: 0000000000000100 
[205078.526885] x17: ffffffffffffff00 x16: 0000000000000000 
[205078.530533] x15: 0000a1cc9b9b1c85 x14: 0000000000000000 
[205078.534284] x13: 0000ffff95363e38 x12: 0000000000000000 
[205078.537985] x11: 0000000000000028 x10: 0101010101010101 
[205078.541644] x9 : 0000000000000000 x8 : 00000000000000de 
[205078.545314] x7 : 0000000000000000 x6 : 0000000000000000 
[205078.548970] x5 : 0000000000000000 x4 : ffff0000109e0000 
[205078.552634] x3 : ffff0000109819b0 x2 : 0000000000000000 
[205078.556333] x1 : ffff8000be9dfc60 x0 : ffffffffffffff00 
[205078.560075] Call trace:
[205078.561815]  down_read+0x5c/0xb8
[205078.564088]  do_page_fault+0x120/0x470
[205078.566938]  do_mem_abort+0x50/0xa8
[205078.569436]  el0_da+0x24/0x28
[205078.571525] ---[ end trace 417884e623adee39 ]---

All the servers have the same trace

[111782.156099] sysrq: SysRq : Changing Loglevel
[111782.156972] sysrq: Loglevel set to 8
[111782.158967] sysrq: SysRq : Show State
[111782.159637]   task                        PC stack   pid father
[111782.160642] systemd         S    0     1      0 0x00000000
[111782.161590] Call Trace:
[111782.162092]  __schedule+0x2a6/0x700
[111782.162759]  schedule+0x38/0xa0
[111782.163307]  schedule_hrtimeout_range_clock+0xf7/0x110
[111782.164210]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
[111782.165145]  ep_poll+0x3c8/0x410
[111782.165741]  ? wake_up_q+0x80/0x80
[111782.166329]  do_epoll_wait+0xb0/0xd0
[111782.166954]  __x64_sys_epoll_wait+0x1a/0x20
[111782.167718]  do_syscall_64+0x5b/0x1a0
[111782.168363]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[111782.169239] RIP: 0033:0x7f75ca0b3257
[111782.169878] Code: Bad RIP value.
[111782.170433] RSP: 002b:00007fff37502d90 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[111782.171683] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f75ca0b3257
[111782.172879] RDX: 0000000000000056 RSI: 00007fff37502dd0 RDI: 0000000000000004
[111782.174060] RBP: 00007fff37502dd0 R08: 0000000000000000 R09: 7465677261742e79
[111782.175249] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000056
[111782.176432] R13: 00000000ffffffff R14: 00007fff37502dd0 R15: 0000000000000001


 Comments   
Comment by Kevin Zhao [ 26/Jan/22 ]

This issue first occured between the test: https://testing.whamcloud.com/test_sessions/bf5e3637-ddd0-404b-8514-d5a1b93f18e3 and https://testing.whamcloud.com/test_sets/f3c1b7d7-c9c8-4dc6-bdfc-f599f6dd937c . The Lustre version is between: 2.13.53.126 and 2.13.53.163

Comment by Kevin Zhao [ 11/Mar/22 ]

This bug can also be reproduced on X86_64 now if we change the test: https://github.com/lustre/lustre-release/blob/master/lustre/tests/multiop.c#L725 as below:

case 'W':
    for (i = 0; i < mmap_len && mmap_ptr; i += 4096){
        int test_value = mmap_ptr[i];
        test_value += junk++;
        mmap_ptr[i] += test_value;
    }
    break; 

The function is the same, but it can easily hit the hang issue for sanity-pcc 7a.

The hanging occurs when we try to write to mmap_ptr[i], and hang forever.

Generated at Sat Feb 10 03:08:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.