Details
- Type: Bug
- Resolution: Won't Fix
- Priority: Minor
- Labels: None
- Affects Version: Lustre 2.12.6
- Severity: 3
Description
parallel-scale-nfsv3 test_racer_on_nfs hangs waiting on locks. Looking at the failure at https://testing.whamcloud.com/test_sets/be1b9c8d-3095-43fa-aaca-e43c7cbb84ce, both client console logs show several hung dd and cp processes and repeated "nfs: server ... not responding" messages.
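For context, this test re-exports a Lustre client mount over NFS and runs racer from the NFS clients. The commands below are only a rough reconstruction of that topology from the test name and the paths in the logs; the MGS NID, fsname, and export options are placeholders, and the exact options used by the test harness may differ.

# On the re-export node (here trevis-57vm8, which is also the MDS1/MDS3 node):
# mount Lustre as a client, then export that mount over NFS.
mount -t lustre mgs@tcp:/lustre /mnt/lustre          # placeholder MGS NID and fsname
exportfs -o rw,no_root_squash '*:/mnt/lustre'

# On the NFS clients (vm5/vm6): mount the export with NFSv3 and run racer,
# matching the DEBUG MARKER line in the client log below.
mount -t nfs -o vers=3 trevis-57vm8:/mnt/lustre /mnt/lustre
MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh \
    /mnt/lustre/d0.parallel-scale-nfs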
From the client1 (vm5) console log:
[57611.728321] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:34:59 (1604093699)
[57611.932785] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[57614.148227] 8[14798]: segfault at 8 ip 00007f76d2c73798 sp 00007fff8d8b79f0 error 4 in ld-2.17.so[7f76d2c68000+22000]
[57639.708629] 19[3351]: segfault at 8 ip 00007f1847a8b798 sp 00007ffd4c308b80 error 4 in ld-2.17.so[7f1847a80000+22000]
[57641.608140] 6[7340]: segfault at 0 ip (null) sp 00007ffc2bb99258 error 14 in 6[400000+6000]
[57734.427404] 1[458]: segfault at 8 ip 00007f8013852798 sp 00007fff1727cc30 error 4 in ld-2.17.so[7f8013847000+22000]
[57740.031404] 16[11741]: segfault at 8 ip 00007f3ad2163798 sp 00007ffdce934730 error 4 in ld-2.17.so[7f3ad2158000+22000]
[58019.646614] nfs: server trevis-57vm8 not responding, still trying
[58019.647966] nfs: server trevis-57vm8 not responding, still trying
[58019.649123] nfs: server trevis-57vm8 not responding, still trying
[58020.038570] nfs: server trevis-57vm8 not responding, still trying
[58020.467569] nfs: server trevis-57vm8 not responding, still trying
[58020.911526] nfs: server trevis-57vm8 not responding, still trying
[58020.912834] nfs: server trevis-57vm8 not responding, still trying
[58020.916520] nfs: server trevis-57vm8 not responding, still trying
[58078.437386] INFO: task dd:19266 blocked for more than 120 seconds.
[58078.438624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58078.439924] dd D ffff971669d362a0 0 19266 1 0x00000084
[58078.441291] Call Trace:
[58078.441816] [<ffffffffa797f340>] ? bit_wait+0x50/0x50
[58078.442797] [<ffffffffa79811f9>] schedule+0x29/0x70
[58078.443700] [<ffffffffa797ed01>] schedule_timeout+0x221/0x2d0
[58078.444914] [<ffffffffc032f7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
[58078.445993] [<ffffffffc033e5d0>] ? rpc_put_task+0x10/0x20 [sunrpc]
[58078.447264] [<ffffffffa726d35e>] ? kvm_clock_get_cycles+0x1e/0x20
[58078.448360] [<ffffffffa7306392>] ? ktime_get_ts64+0x52/0xf0
[58078.449348] [<ffffffffa797f340>] ? bit_wait+0x50/0x50
[58078.450237] [<ffffffffa79808ed>] io_schedule_timeout+0xad/0x130
[58078.451240] [<ffffffffa7980988>] io_schedule+0x18/0x20
[58078.452200] [<ffffffffa797f351>] bit_wait_io+0x11/0x50
[58078.453120] [<ffffffffa797ee77>] __wait_on_bit+0x67/0x90
[58078.454100] [<ffffffffa73bcb91>] wait_on_page_bit+0x81/0xa0
[58078.455117] [<ffffffffa72c74d0>] ? wake_bit_function+0x40/0x40
[58078.456155] [<ffffffffa73bccc1>] __filemap_fdatawait_range+0x111/0x190
[58078.457363] [<ffffffffa73bcd54>] filemap_fdatawait_range+0x14/0x30
[58078.458460] [<ffffffffa73bf236>] filemap_write_and_wait_range+0x56/0x90
[58078.459672] [<ffffffffc05dc746>] nfs_file_fsync+0x86/0x110 [nfs]
[58078.460758] [<ffffffffa747fc2e>] vfs_fsync+0x2e/0x40
[58078.461674] [<ffffffffc05dcb86>] nfs_file_flush+0x46/0x60 [nfs]
[58078.462737] [<ffffffffa7447f87>] filp_close+0x37/0x90
[58078.463666] [<ffffffffa746b76c>] __close_fd+0x8c/0xb0
[58078.464578] [<ffffffffa7449aa3>] SyS_close+0x23/0x50
[58078.465473] [<ffffffffa798dede>] system_call_fastpath+0x25/0x2a
[58078.466528] [<ffffffffa798de21>] ? system_call_after_swapgs+0xae/0x146
[58092.571814] nfs: server trevis-57vm8 not responding, still trying
[58097.755630] nfs: server trevis-57vm8 not responding, still trying
[58198.462827] INFO: task dd:19266 blocked for more than 120 seconds.
[58198.464059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58198.465353] dd D ffff971669d362a0 0 19266 1 0x00000084
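The dd backtrace above shows the task stuck in close(): nfs_file_flush() calls vfs_fsync(), which waits for dirty pages to finish writeback, and the writeback RPCs never complete because the NFS server (the Lustre re-export node) has stopped responding. If this is reproduced live, a few quick checks on the client help confirm that picture; the PID and server name below are taken from this particular log and are only illustrative.

cat /proc/19266/stack                             # kernel stack of the hung dd (PID from the trace above)
echo w > /proc/sysrq-trigger                      # dump all D-state (blocked) tasks to the console
grep -A 3 'trevis-57vm8' /proc/self/mountstats    # per-mount NFS RPC counters for the stalled mount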
In the console log of the MDS1,3 node (vm8), we see:
[57623.525925] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:34:59 (1604093699)
[57631.121275] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x313e:0x0] get parent: rc = -2
[57636.101177] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3267:0x0] get parent: rc = -2
[57636.103551] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 6 previous similar messages
[57637.950585] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3288:0x0] get parent: rc = -2
[57647.690933] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x334a:0x0] get parent: rc = -2
[57656.183969] LustreError: 14919:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x33ce:0x0] get parent: rc = -2
[57656.186407] LustreError: 14919:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 25 previous similar messages
[57671.270437] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3542:0x0] get parent: rc = -2
[57671.273037] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 12 previous similar messages
[57692.317861] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3752:0x0] get parent: rc = -2
[57692.320930] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 46 previous similar messages
[57725.517726] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3ac4:0x0] get parent: rc = -2
[57725.521666] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 128 previous similar messages
[57754.052394] reconnect_path: npd != pd
[57790.136336] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3fe7:0x0] get parent: rc = -2
[57790.138707] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 207 previous similar messages
[57913.388418] LNet: Service thread pid 22912 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later.
Dumping the stack trace for debugging purposes:
[57913.391801] Pid: 22912, comm: mdt00_009 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
[57913.393552] Call Trace:
[57913.394117] [<ffffffffc0c80150>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[57913.395890] [<ffffffffc0c82181>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[57913.397261] [<ffffffffc10b369b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[57913.398722] [<ffffffffc10b3d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
[57913.400030] [<ffffffffc10b40c7>] mdt_object_lock_try+0x27/0xb0 [mdt]
[57913.401320] [<ffffffffc10dab68>] mdt_object_open_lock+0x688/0xac0 [mdt]
[57913.402637] [<ffffffffc10df3b7>] mdt_reint_open+0x15b7/0x3240 [mdt]
[57913.404007] [<ffffffffc10d31b3>] mdt_reint_rec+0x83/0x210 [mdt]
[57913.405289] [<ffffffffc10af383>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[57913.406619] [<ffffffffc10bc622>] mdt_intent_open+0x82/0x3a0 [mdt]
[57913.407807] [<ffffffffc10ba745>] mdt_intent_policy+0x435/0xd80 [mdt]
[57913.409088] [<ffffffffc0c66e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[57913.410443] [<ffffffffc0c8eba6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
[57913.411785] [<ffffffffc0d181a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[57913.413151] [<ffffffffc0d1efba>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[57913.414553] [<ffffffffc0cc392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[57913.415991] [<ffffffffc0cc7294>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[57913.417401] [<ffffffffaf4c6321>] kthread+0xd1/0xe0
[57913.418379] [<ffffffffafb8ed37>] ret_from_fork_nospec_end+0x0/0x39
[57913.419653] [<ffffffffffffffff>] 0xffffffffffffffff
[57913.420913] LustreError: dumping log to /tmp/lustre-log.1604093989.22912
[57913.858305] Pid: 19858, comm: mdt00_007 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
[57913.860079] Call Trace:
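On the server side, the ll_dir_get_parent_fid() rc = -2 (-ENOENT) messages and the "reconnect_path: npd != pd" line come from knfsd reconnecting file handles to dentries while racer is concurrently unlinking the tree, and the mdt00_* threads are stuck in ldlm_completion_ast() waiting for a DLM lock during an intent open. If someone revisits this, the binary debug dump referenced above can be decoded with lctl, and the MDS lock state can be sampled while the threads are hung; the output filenames below are arbitrary.

lctl debug_file /tmp/lustre-log.1604093989.22912 /tmp/lustre-log.1604093989.22912.txt
lctl get_param ldlm.namespaces.*.lock_count       # per-namespace granted lock counts
lctl get_param ldlm.namespaces.*.resource_count
lctl dk /tmp/lustre-debug.txt                     # snapshot the in-memory debug buffer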
This failure is from RHEL 7.7, DNE, ldiskfs testing.
This hang looks very similar to several other tickets, but in each case there is a call missing from the logs:
Attachments
Issue Links