[LU-14112] parallel-scale-nfsv3 test racer_on_nfs hangs in locking Created: 03/Nov/20  Updated: 04/Nov/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 1
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale-nfsv3 test_racer_on_nfs hangs in locking. Looking at the failure at https://testing.whamcloud.com/test_sets/be1b9c8d-3095-43fa-aaca-e43c7cbb84ce, both client console logs show several hung dd and cp processes along with repeated "nfs: server trevis-57vm8 not responding" messages.
From the client1 (vm5) console log:

[57611.728321] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:34:59 (1604093699)
[57611.932785] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[57614.148227] 8[14798]: segfault at 8 ip 00007f76d2c73798 sp 00007fff8d8b79f0 error 4 in ld-2.17.so[7f76d2c68000+22000]
[57639.708629] 19[3351]: segfault at 8 ip 00007f1847a8b798 sp 00007ffd4c308b80 error 4 in ld-2.17.so[7f1847a80000+22000]
[57641.608140] 6[7340]: segfault at 0 ip           (null) sp 00007ffc2bb99258 error 14 in 6[400000+6000]
[57734.427404] 1[458]: segfault at 8 ip 00007f8013852798 sp 00007fff1727cc30 error 4 in ld-2.17.so[7f8013847000+22000]
[57740.031404] 16[11741]: segfault at 8 ip 00007f3ad2163798 sp 00007ffdce934730 error 4 in ld-2.17.so[7f3ad2158000+22000]
[58019.646614] nfs: server trevis-57vm8 not responding, still trying
[58019.647966] nfs: server trevis-57vm8 not responding, still trying
[58019.649123] nfs: server trevis-57vm8 not responding, still trying
[58020.038570] nfs: server trevis-57vm8 not responding, still trying
[58020.467569] nfs: server trevis-57vm8 not responding, still trying
[58020.911526] nfs: server trevis-57vm8 not responding, still trying
[58020.912834] nfs: server trevis-57vm8 not responding, still trying
[58020.916520] nfs: server trevis-57vm8 not responding, still trying
[58078.437386] INFO: task dd:19266 blocked for more than 120 seconds.
[58078.438624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58078.439924] dd              D ffff971669d362a0     0 19266      1 0x00000084
[58078.441291] Call Trace:
[58078.441816]  [<ffffffffa797f340>] ? bit_wait+0x50/0x50
[58078.442797]  [<ffffffffa79811f9>] schedule+0x29/0x70
[58078.443700]  [<ffffffffa797ed01>] schedule_timeout+0x221/0x2d0
[58078.444914]  [<ffffffffc032f7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
[58078.445993]  [<ffffffffc033e5d0>] ? rpc_put_task+0x10/0x20 [sunrpc]
[58078.447264]  [<ffffffffa726d35e>] ? kvm_clock_get_cycles+0x1e/0x20
[58078.448360]  [<ffffffffa7306392>] ? ktime_get_ts64+0x52/0xf0
[58078.449348]  [<ffffffffa797f340>] ? bit_wait+0x50/0x50
[58078.450237]  [<ffffffffa79808ed>] io_schedule_timeout+0xad/0x130
[58078.451240]  [<ffffffffa7980988>] io_schedule+0x18/0x20
[58078.452200]  [<ffffffffa797f351>] bit_wait_io+0x11/0x50
[58078.453120]  [<ffffffffa797ee77>] __wait_on_bit+0x67/0x90
[58078.454100]  [<ffffffffa73bcb91>] wait_on_page_bit+0x81/0xa0
[58078.455117]  [<ffffffffa72c74d0>] ? wake_bit_function+0x40/0x40
[58078.456155]  [<ffffffffa73bccc1>] __filemap_fdatawait_range+0x111/0x190
[58078.457363]  [<ffffffffa73bcd54>] filemap_fdatawait_range+0x14/0x30
[58078.458460]  [<ffffffffa73bf236>] filemap_write_and_wait_range+0x56/0x90
[58078.459672]  [<ffffffffc05dc746>] nfs_file_fsync+0x86/0x110 [nfs]
[58078.460758]  [<ffffffffa747fc2e>] vfs_fsync+0x2e/0x40
[58078.461674]  [<ffffffffc05dcb86>] nfs_file_flush+0x46/0x60 [nfs]
[58078.462737]  [<ffffffffa7447f87>] filp_close+0x37/0x90
[58078.463666]  [<ffffffffa746b76c>] __close_fd+0x8c/0xb0
[58078.464578]  [<ffffffffa7449aa3>] SyS_close+0x23/0x50
[58078.465473]  [<ffffffffa798dede>] system_call_fastpath+0x25/0x2a
[58078.466528]  [<ffffffffa798de21>] ? system_call_after_swapgs+0xae/0x146
[58092.571814] nfs: server trevis-57vm8 not responding, still trying
[58097.755630] nfs: server trevis-57vm8 not responding, still trying
[58198.462827] INFO: task dd:19266 blocked for more than 120 seconds.
[58198.464059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58198.465353] dd              D ffff971669d362a0     0 19266      1 0x00000084
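
For reference, the racer invocation recorded in the DEBUG MARKER above can be re-run by hand against an NFSv3 re-export of a Lustre client mount. The sketch below is only an approximation: the MGS NID, export options, and the /mnt/nfs mount point are assumptions rather than values from the test script (the test itself runs racer against /mnt/lustre/d0.parallel-scale-nfs on the NFS clients); the host roles follow from the "server trevis-57vm8 not responding" messages above.

# Sketch of a manual reproducer; parallel-scale-nfs.sh normally does this setup itself.
# On the node re-exporting Lustre over NFS (trevis-57vm8 in this run):
mount -t lustre <mgs_nid>:/lustre /mnt/lustre
mkdir -p /mnt/lustre/d0.parallel-scale-nfs
exportfs -o rw,no_root_squash '*:/mnt/lustre'

# On the NFS clients (trevis-57vm5 and the second client):
mount -t nfs -o vers=3 trevis-57vm8:/mnt/lustre /mnt/nfs

# Racer command line from the DEBUG MARKER, pointed at the NFS mount:
MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs \
  /usr/lib64/lustre/tests/racer/racer.sh /mnt/nfs/d0.parallel-scale-nfs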

On the MDS1,3 (vm8) console log, we see repeated ll_dir_get_parent_fid() failures (rc = -2, i.e. -ENOENT) and, later, an MDT service thread apparently stuck in ldlm_completion_ast():

[57623.525925] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:34:59 (1604093699)
[57631.121275] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x313e:0x0] get parent: rc = -2
[57636.101177] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3267:0x0] get parent: rc = -2
[57636.103551] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 6 previous similar messages
[57637.950585] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3288:0x0] get parent: rc = -2
[57647.690933] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x334a:0x0] get parent: rc = -2
[57656.183969] LustreError: 14919:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x33ce:0x0] get parent: rc = -2
[57656.186407] LustreError: 14919:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 25 previous similar messages
[57671.270437] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3542:0x0] get parent: rc = -2
[57671.273037] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 12 previous similar messages
[57692.317861] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3752:0x0] get parent: rc = -2
[57692.320930] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 46 previous similar messages
[57725.517726] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3ac4:0x0] get parent: rc = -2
[57725.521666] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 128 previous similar messages
[57754.052394] reconnect_path: npd != pd
[57790.136336] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3fe7:0x0] get parent: rc = -2
[57790.138707] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 207 previous similar messages
[57913.388418] LNet: Service thread pid 22912 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[57913.391801] Pid: 22912, comm: mdt00_009 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
[57913.393552] Call Trace:
[57913.394117]  [<ffffffffc0c80150>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[57913.395890]  [<ffffffffc0c82181>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[57913.397261]  [<ffffffffc10b369b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[57913.398722]  [<ffffffffc10b3d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
[57913.400030]  [<ffffffffc10b40c7>] mdt_object_lock_try+0x27/0xb0 [mdt]
[57913.401320]  [<ffffffffc10dab68>] mdt_object_open_lock+0x688/0xac0 [mdt]
[57913.402637]  [<ffffffffc10df3b7>] mdt_reint_open+0x15b7/0x3240 [mdt]
[57913.404007]  [<ffffffffc10d31b3>] mdt_reint_rec+0x83/0x210 [mdt]
[57913.405289]  [<ffffffffc10af383>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[57913.406619]  [<ffffffffc10bc622>] mdt_intent_open+0x82/0x3a0 [mdt]
[57913.407807]  [<ffffffffc10ba745>] mdt_intent_policy+0x435/0xd80 [mdt]
[57913.409088]  [<ffffffffc0c66e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[57913.410443]  [<ffffffffc0c8eba6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
[57913.411785]  [<ffffffffc0d181a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[57913.413151]  [<ffffffffc0d1efba>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[57913.414553]  [<ffffffffc0cc392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[57913.415991]  [<ffffffffc0cc7294>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[57913.417401]  [<ffffffffaf4c6321>] kthread+0xd1/0xe0
[57913.418379]  [<ffffffffafb8ed37>] ret_from_fork_nospec_end+0x0/0x39
[57913.419653]  [<ffffffffffffffff>] 0xffffffffffffffff
[57913.420913] LustreError: dumping log to /tmp/lustre-log.1604093989.22912
[57913.858305] Pid: 19858, comm: mdt00_007 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
[57913.860079] Call Trace:

This failure occurred in RHEL 7.7, DNE, ldiskfs testing.
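
If the hang reproduces, the following is a sketch (not taken from this run) of commands that could be run on the MDS to gather more lock-state detail alongside the watchdog stack dumps; output file names are illustrative.

# Decode the binary debug log dumped by the watchdog
# ("dumping log to /tmp/lustre-log.1604093989.22912" above):
lctl debug_file /tmp/lustre-log.1604093989.22912 /tmp/lustre-log.22912.txt

# LDLM lock counts and granted locks per namespace on the MDT:
lctl get_param ldlm.namespaces.*.lock_count
lctl get_param ldlm.namespaces.*.pool.granted

# Stack traces of all tasks, to catch the lock holder as well as the waiters:
echo t > /proc/sysrq-trigger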

This hang looks very similar to several other tickets, but in each case a distinguishing call is missing from the logs (see the grep sketch after this list):

  • It looks very similar to LU-11359, but I don't see "mdt_dom_discard_data" in the MDS console logs/call traces.
  • It looks very similar to LU-11358, but I don't see "mdt_brw_enqueue" in the MDS console logs/call traces.
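
The check above is just a text search of the console logs attached to the test_set; a sketch of how to repeat it on local copies (the log file name below is hypothetical):

# Search the MDS console log for the functions that characterize the other tickets:
grep -c mdt_dom_discard_data trevis-57vm8.console.log   # LU-11359 signature
grep -c mdt_brw_enqueue      trevis-57vm8.console.log   # LU-11358 signature
# Neither symbol appears in the call traces dumped for this failure.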
