[LU-14304] parallel-scale-nfsv3 test racer_on_nfs hangs - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Won't Fix
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.14.0
Labels:
None
Environment:
DNE

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

parallel-scale-nfsv3 test_racer_on_nfs hangs. It’s not clear what the issue is, but there are plenty of hung process on the clients and error messages on the servers.

Looking at the failure at https://testing.whamcloud.com/test_sets/aad48503-18ba-4503-a214-24f726916d88, let’s look at the hung processes and the state of the NFS server. From client1 (vm1) console log we see rm and multiple ls hangs

[93747.775094] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:14:16 (1608952456)
[93748.134348] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[93982.476241] nfs: server trevis-200vm4 not responding, still trying
[93982.476262] nfs: server trevis-200vm4 not responding, still trying
[93982.477552] nfs: server trevis-200vm4 not responding, still trying
[93982.478644] nfs: server trevis-200vm4 not responding, still trying
[93982.479619] nfs: server trevis-200vm4 not responding, still trying
[93982.480631] nfs: server trevis-200vm4 not responding, still trying
[93982.481706] nfs: server trevis-200vm4 not responding, still trying
[93982.482691] nfs: server trevis-200vm4 not responding, still trying
[93982.483756] nfs: server trevis-200vm4 not responding, still trying
[93982.484736] nfs: server trevis-200vm4 not responding, still trying
[94002.957142] INFO: task rm:2413546 blocked for more than 120 seconds.
[94002.958274]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[94002.959614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94002.960881] rm              D    0 2413546 2260953 0x00000080
[94002.961828] Call Trace:
[94002.962297]  __schedule+0x2a6/0x700
[94002.962917]  schedule+0x38/0xa0
[94002.963473]  rwsem_down_write_slowpath+0x32d/0x4e0
[94002.964283]  ? __wake_up_common_lock+0x89/0xc0
[94002.965039]  do_rmdir+0x164/0x200
[94002.965624]  do_syscall_64+0x5b/0x1a0
[94002.966256]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[94002.967107] RIP: 0033:0x7f3b2117c3fb
[94002.967727] Code: Bad RIP value.
[94002.968298] RSP: 002b:00007ffecd9bc0e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000107
[94002.969520] RAX: ffffffffffffffda RBX: 000055ecbfa422b0 RCX: 00007f3b2117c3fb
[94002.970676] RDX: 0000000000000200 RSI: 000055ecbfa41080 RDI: 00000000ffffff9c
[94002.971831] RBP: 000055ecbfa40ff0 R08: 0000000000000003 R09: 0000000000000000
[94002.972984] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffecd9bc2d0
[94002.974140] R13: 0000000000000001 R14: 000055ecbfa422b0 R15: 0000000000000001
[94002.975302] INFO: task ls:2413949 blocked for more than 120 seconds.
[94002.976341]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[94002.977665] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94002.978929] ls              D    0 2413949 2260977 0x00004080
[94002.979877] Call Trace:
[94002.980325]  __schedule+0x2a6/0x700
[94002.980930]  schedule+0x38/0xa0
[94002.981481]  rwsem_down_read_slowpath+0x3e5/0x520
[94002.982267]  path_openat+0xa26/0x14f0
[94002.982903]  ? mem_cgroup_commit_charge+0x7a/0x550
[94002.983701]  ? mem_cgroup_try_charge+0x8b/0x190
[94002.984465]  ? __raw_spin_unlock+0x5/0x10
[94002.985148]  do_filp_open+0x93/0x100
[94002.985766]  ? __check_object_size+0xa8/0x16b
[94002.986505]  do_sys_open+0x184/0x220
[94002.987123]  do_syscall_64+0x5b/0x1a0
[94002.987754]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[94002.988593] RIP: 0033:0x7f4c64715a65
[94002.989210] Code: Bad RIP value.
[94002.989773] RSP: 002b:00007ffcfdeb3dc0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[94002.990988] RAX: ffffffffffffffda RBX: 000055e701449e30 RCX: 00007f4c64715a65
[94002.992141] RDX: 0000000000090800 RSI: 000055e701449f00 RDI: 00000000ffffff9c
[94002.993301] RBP: 0000000000000000 R08: 000055e701440012 R09: 0000000000000004
[94002.994458] R10: 0000000000000000 R11: 0000000000000287 R12: 00007f4c650314d8
[94002.995610] R13: 0000000000000000 R14: 000055e701449f00 R15: 000055e70003e564
[94002.996764] INFO: task ls:2413950 blocked for more than 120 seconds.
…

similar issue on client2 (vm2), we see ln, file_concat.sh, mv, etc. hang

[93752.122166] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:14:16 (1608952456)
[93752.457755] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[93866.016959] nfsacl: server trevis-200vm4 not responding, still trying
[93866.018297] nfsacl: server trevis-200vm4 not responding, still trying
[93866.019382] nfsacl: server trevis-200vm4 not responding, still trying
[93986.845019] nfs: server trevis-200vm4 not responding, still trying
[93986.845452] nfs: server trevis-200vm4 not responding, still trying
[93986.846287] nfs: server trevis-200vm4 not responding, still trying
[93986.846305] nfs: server trevis-200vm4 not responding, still trying
[93986.846361] nfs: server trevis-200vm4 not responding, still trying
[93986.847504] nfs: server trevis-200vm4 not responding, still trying
[93986.848460] nfs: server trevis-200vm4 not responding, still trying
[93986.848469] nfs: server trevis-200vm4 not responding, still trying
[93986.848510] nfs: server trevis-200vm4 not responding, still trying
[93986.848519] nfs: server trevis-200vm4 not responding, still trying
[94003.228890] INFO: task ln:1775239 blocked for more than 120 seconds.
[94003.229997]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[94003.231330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94003.232590] ln              D    0 1775239 1614050 0x00000080
[94003.233530] Call Trace:
[94003.234007]  __schedule+0x2a6/0x700
[94003.234624]  schedule+0x38/0xa0
[94003.235179]  rwsem_down_write_slowpath+0x32d/0x4e0
[94003.235993]  ? getname_flags+0x4a/0x1e0
[94003.236650]  ? __check_object_size+0xa8/0x16b
[94003.237380]  filename_create+0x90/0x180
[94003.238048]  do_symlinkat+0x5b/0xe0
[94003.238653]  do_syscall_64+0x5b/0x1a0
[94003.239281]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[94003.240130] RIP: 0033:0x7fbc377a333b
[94003.240749] Code: Bad RIP value.
[94003.241309] RSP: 002b:00007ffdea90e658 EFLAGS: 00000246 ORIG_RAX: 000000000000010a
[94003.242519] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc377a333b
[94003.243666] RDX: 00007ffdea90ff50 RSI: 00000000ffffff9c RDI: 00007ffdea90ff4a
[94003.244811] RBP: 00000000ffffff9c R08: 0000000000000000 R09: 0000000000000000
[94003.245959] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdea90ff50
[94003.247107] R13: 00007ffdea90ff4a R14: 0000000000000000 R15: 0000000000000000
[94003.248257] INFO: task file_concat.sh:1775657 blocked for more than 120 seconds.
[94003.249446]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[94003.250785] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94003.252041] file_concat.sh  D    0 1775657 1614052 0x00000080
[94003.252986] Call Trace:
[94003.253430]  __schedule+0x2a6/0x700
[94003.254040]  schedule+0x38/0xa0
[94003.254589]  rwsem_down_write_slowpath+0x32d/0x4e0
[94003.255438]  ? nfs_permission+0x96/0x1c0 [nfs]
[94003.256191]  ? inode_permission+0xa6/0x130
[94003.256889]  path_openat+0x347/0x14f0
[94003.257514]  do_filp_open+0x93/0x100
[94003.258133]  ? __check_object_size+0xa8/0x16b
[94003.258872]  do_sys_open+0x184/0x220
[94003.259486]  do_syscall_64+0x5b/0x1a0
[94003.260117]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[94003.260957] RIP: 0033:0x7f55a02a1552
[94003.261577] Code: Bad RIP value.
[94003.262138] RSP: 002b:00007ffce3887ea0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[94003.263351] RAX: ffffffffffffffda RBX: 0000563b0715d130 RCX: 00007f55a02a1552
[94003.264500] RDX: 0000000000000441 RSI: 0000563b0715fed0 RDI: 00000000ffffff9c
[94003.265652] RBP: 00007ffce3887fa0 R08: 0000000000000020 R09: 0000000000000001
[94003.266797] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000003
[94003.267946] R13: 0000000000000001 R14: 0000000000000001 R15: 0000563b0715fed0
[94003.269095] INFO: task file_concat.sh:1775765 blocked for more than 120 seconds.
[94003.270280]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[94003.271600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94003.272851] file_concat.sh  D    0 1775765 1614016 0x00000080
[94003.273795] Call Trace:
[94003.274238]  __schedule+0x2a6/0x700
[94003.274845]  schedule+0x38/0xa0
[94003.275391]  rwsem_down_write_slowpath+0x32d/0x4e0
[94003.276195]  ? nfs_permission+0x96/0x1c0 [nfs]
[94003.276948]  ? inode_permission+0xa6/0x130
[94003.277643]  path_openat+0x347/0x14f0
[94003.278271]  do_filp_open+0x93/0x100
[94003.278892]  ? __check_object_size+0xa8/0x16b
[94003.279625]  do_sys_open+0x184/0x220
[94003.280239]  do_syscall_64+0x5b/0x1a0
[94003.280875]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[94003.281713] RIP: 0033:0x7feccb277552
[94003.282327] Code: Bad RIP value.
[94003.282894] RSP: 002b:00007ffc6f329830 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[94003.284111] RAX: ffffffffffffffda RBX: 000055d4028529e0 RCX: 00007feccb277552
[94003.285272] RDX: 0000000000000441 RSI: 000055d402853bd0 RDI: 00000000ffffff9c
[94003.286422] RBP: 00007ffc6f329930 R08: 0000000000000020 R09: 0000000000000003
[94003.287567] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000003
[94003.288712] R13: 0000000000000001 R14: 0000000000000001 R15: 000055d402853bd0
[94003.289860] INFO: task ln:1775786 blocked for more than 120 seconds.
[94003.290893]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[94003.292218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94003.293474] ln              D    0 1775786 1614032 0x00000080
[94003.294417] Call Trace:
[94003.294869]  __schedule+0x2a6/0x700
[94003.295467]  schedule+0x38/0xa0
[94003.296024]  rwsem_down_write_slowpath+0x32d/0x4e0
[94003.296827]  ? cp_new_stat+0x150/0x180
[94003.297468]  ? __check_object_size+0xa8/0x16b
[94003.298205]  filename_create+0x90/0x180
[94003.298869]  do_symlinkat+0x5b/0xe0
[94003.299468]  do_syscall_64+0x5b/0x1a0
[94003.300102]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[94003.300943] RIP: 0033:0x7f488d16b33b
[94003.301557] Code: Bad RIP value.
[94003.302119] RSP: 002b:00007fffa7038f98 EFLAGS: 00000246 ORIG_RAX: 000000000000010a
[94003.303336] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f488d16b33b
[94003.304481] RDX: 00007fffa703af4f RSI: 00000000ffffff9c RDI: 00007fffa703af46
[94003.305632] RBP: 00000000ffffff9c R08: 0000000000000000 R09: 0000000000000000
[94003.306776] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffa703af4f
[94003.307917] R13: 00007fffa703af46 R14: 0000000000000000 R15: 0000000000000000
[94003.309062] INFO: task mv:1775793 blocked for more than 120 seconds.
[94003.310096]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[94003.311429] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94003.312694] mv              D    0 1775793 1614030 0x00000080
[94003.313636] Call Trace:
[94003.314084]  __schedule+0x2a6/0x700
[94003.314687]  schedule+0x38/0xa0
[94003.315241]  rwsem_down_write_slowpath+0x32d/0x4e0
[94003.316050]  lock_rename+0xac/0xb0
[94003.316639]  do_renameat2+0x2a9/0x530
[94003.317266]  __x64_sys_renameat2+0x20/0x30
[94003.317968]  do_syscall_64+0x5b/0x1a0
[94003.318610]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[94003.319446] RIP: 0033:0x7f333f86678d
[94003.320071] Code: Bad RIP value.
[94003.320631] RSP: 002b:00007ffd35fe1af8 EFLAGS: 00000246 ORIG_RAX: 000000000000013c
[94003.321841] RAX: ffffffffffffffda RBX: 00007ffd35fe2f2b RCX: 00007f333f86678d
[94003.322996] RDX: 00000000ffffff9c RSI: 00007ffd35fe2f2b RDI: 00000000ffffff9c
[94003.324151] RBP: 0000000000000001 R08: 0000000000000001 R09: 00007ffd35fe1b60
[94003.325307] R10: 00007ffd35fe2f50 R11: 0000000000000246 R12: 00007ffd35fe2f50
[94003.326462] R13: 00000000ffffff9c R14: 00000000ffffff9c R15: 0000000000000000
[94003.327617] INFO: task mkdir:1775806 blocked for more than 120 seconds.
[94003.328695]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
[94003.330011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94003.331269] mkdir           D    0 1775806 1614028 0x00000080
[94003.332215] Call Trace:
[94003.332661]  __schedule+0x2a6/0x700
[94003.333264]  schedule+0x38/0xa0
[94003.333820]  rwsem_down_write_slowpath+0x32d/0x4e0

On the console log for the OSS (vm3), we see network issues

[93760.059200] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:14:16 (1608952456)
[93859.822917] Lustre: ll_ost_io00_011: service thread pid 1987754 was inactive for 41.129 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[93859.823180] Pid: 1987802, comm: ll_ost_io00_059 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[93859.825924] Lustre: Skipped 1 previous similar message
[93859.827537] Call Trace TBD:
[93859.827550] Pid: 1987820, comm: ll_ost_io00_073 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[93859.830375] Call Trace TBD:
[93859.830827] Pid: 1987754, comm: ll_ost_io00_011 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[93859.832435] Call Trace TBD:
[93880.302810] Lustre: ll_ost_io00_029: service thread pid 1987772 was inactive for 41.131 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[94414.764053] Lustre: 18276:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1608952511/real 1608952511]  req@000000001d0da127 x1687011117363904/t0(0) o601->lustre-MDT0000-lwp-OST0006@10.9.7.104@tcp:23/10 lens 336/336 e 24 to 1 dl 1608953112 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'ll_ost_io00_074.0'
[94414.768730] Lustre: 18276:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
[94414.768758] Lustre: lustre-MDT0000-lwp-OST0004: Connection to lustre-MDT0000 (at 10.9.7.104@tcp) was lost; in progress operations using this service will wait for recovery to complete
[94414.772724] Lustre: Skipped 7 previous similar messages
[94414.774802] Lustre: lustre-MDT0000-lwp-OST0006: Connection restored to 10.9.7.104@tcp (at 10.9.7.104@tcp)
…
[97184.670860] Lustre: 18275:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1608955281/real 1608955281]  req@000000001d27bbb1 x1687011118521856/t0(0) o601->lustre-MDT0000-lwp-OST0000@10.9.7.104@tcp:23/10 lens 336/336 e 1 to 1 dl 1608955882 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'lquota_wb_lustr.0'
[97184.675543] Lustre: 18275:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
[97184.677059] Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.9.7.104@tcp) was lost; in progress operations using this service will wait for recovery to complete
[97184.679520] Lustre: Skipped 6 previous similar messages
[97184.681454] Lustre: lustre-MDT0000-lwp-OST0000: Connection restored to 10.9.7.104@tcp (at 10.9.7.104@tcp)
[97184.683017] Lustre: Skipped 6 previous similar messages
[97410.756198] sysrq: SysRq : Changing Loglevel
[97410.756931] sysrq: Loglevel set to 8
[97410.758707] sysrq: SysRq : Show State
[97410.759323]   task                        PC stack   pid father
[97410.760245] systemd         S    0     1      0 0x00000000
[97410.761099] Call Trace:
[97410.761514]  __schedule+0x2a6/0x700
[97410.762085]  schedule+0x38/0xa0
[97410.762580]  schedule_hrtimeout_range_clock+0xf7/0x110
[97410.763385]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
[97410.764244]  ep_poll+0x3c8/0x410
[97410.764774]  ? wake_up_q+0x80/0x80
[97410.765451]  do_epoll_wait+0xb0/0xd0
[97410.766030]  __x64_sys_epoll_wait+0x1a/0x20
[97410.766704]  do_syscall_64+0x5b/0x1a0
[97410.767302]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[97410.768114] RIP: 0033:0x7fd4589a8257
[97410.768681] Code: Bad RIP value.
[97410.769208] RSP: 002b:00007ffcd50e51b0 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[97410.770394] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fd4589a8257
[97410.771491] RDX: 0000000000000055 RSI: 00007ffcd50e51f0 RDI: 0000000000000004
[97410.772608] RBP: 00007ffcd50e51f0 R08: 0000000000000000 R09: 7465677261742e79
[97410.773708] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000055
[97410.774808] R13: 00000000ffffffff R14: 00007ffcd50e51f0 R15: 0000000000000001
[97410.775913] kthreadd        S    0     2      0 0x80004000
…

In the console log for MDS1,3 (vm4) we see hung threads

[74660.282956] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:14:16 (1608952456)
[74756.313777] Lustre: mdt00_003: service thread pid 2745633 was inactive for 42.397 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[74756.314045] Pid: 2762099, comm: mdt00_006 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[74756.316659] Lustre: Skipped 1 previous similar message
[74756.318154] Call Trace TBD:
[74756.319399] Pid: 2745633, comm: mdt00_003 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[74756.320999] Call Trace TBD:
[74756.321446] Pid: 2765712, comm: mdt00_010 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[74756.322936] Call Trace TBD:
[74756.323372] Lustre: mdt00_004: service thread pid 2748823 was inactive for 42.485 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[74784.985640] Lustre: mdt_rdpg00_002: service thread pid 2742038 was inactive for 66.038 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[74784.988019] Lustre: Skipped 6 previous similar messages
[74797.273572] Lustre: mdt_rdpg00_009: service thread pid 2782530 was inactive for 62.973 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[74805.465517] Lustre: mdt_rdpg00_010: service thread pid 2782531 was inactive for 66.045 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[74805.467849] Lustre: Skipped 4 previous similar messages
[74854.617294] Lustre: mdt_rdpg00_014: service thread pid 2782535 was inactive for 114.687 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[74957.016770] INFO: task jbd2/dm-6-8:2740887 blocked for more than 120 seconds.
[74957.018159]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[74957.019471] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74957.020650] jbd2/dm-6-8     D    0 2740887      2 0x80004080
[74957.021512] Call Trace:
[74957.021934]  __schedule+0x2a6/0x700
[74957.022480]  ? __wake_up_common_lock+0x89/0xc0
[74957.023174]  ? finish_wait+0x80/0x80
[74957.023728]  schedule+0x38/0xa0
[74957.024236]  jbd2_journal_commit_transaction+0x25d/0x19f0 [jbd2]
[74957.025240]  ? __switch_to_asm+0x35/0x70
[74957.025855]  ? __switch_to_asm+0x41/0x70
[74957.026452]  ? __switch_to_asm+0x35/0x70
[74957.027059]  ? __switch_to_asm+0x41/0x70
[74957.027654]  ? __switch_to_asm+0x35/0x70
[74957.028258]  ? __switch_to_asm+0x41/0x70
[74957.028875]  ? __switch_to_asm+0x35/0x70
[74957.029468]  ? __switch_to_asm+0x41/0x70
[74957.030082]  ? __switch_to_asm+0x41/0x70
[74957.030676]  ? __switch_to_asm+0x35/0x70
[74957.031280]  ? __switch_to_asm+0x41/0x70
[74957.031887]  ? __switch_to_asm+0x35/0x70
[74957.032482]  ? __switch_to_asm+0x41/0x70
[74957.033088]  ? __switch_to_asm+0x41/0x70
[74957.033688]  ? finish_wait+0x80/0x80
[74957.034244]  ? finish_task_switch+0x77/0x2a0
[74957.034908]  ? lock_timer_base+0x67/0x80
[74957.035508]  kjournald2+0xbd/0x270 [jbd2]
[74957.036136]  ? finish_wait+0x80/0x80
[74957.036682]  ? commit_timeout+0x10/0x10 [jbd2]
[74957.037369]  kthread+0x112/0x130
[74957.037882]  ? kthread_flush_work_fn+0x10/0x10
[74957.038556]  ret_from_fork+0x35/0x40
[74957.039135] INFO: task mdt_rdpg00_000:2740909 blocked for more than 120 seconds.
[74957.040257]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[74957.041559] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74957.042732] mdt_rdpg00_000  D    0 2740909      2 0x80004080
[74957.043586] Call Trace:
[74957.043986]  __schedule+0x2a6/0x700
[74957.044519]  schedule+0x38/0xa0
[74957.045014]  schedule_preempt_disabled+0xa/0x10
[74957.045699]  __mutex_lock.isra.5+0x2d0/0x4a0
[74957.046701]  ldlm_handle_enqueue0+0xe0a/0x16d0 [ptlrpc]
[74957.047559]  tgt_enqueue+0xa4/0x1f0 [ptlrpc]
[74957.048281]  tgt_request_handle+0xc78/0x1910 [ptlrpc]
[74957.049093]  ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[74957.050030]  ptlrpc_main+0xba4/0x14a0 [ptlrpc]
[74957.050754]  ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[74957.051601]  kthread+0x112/0x130
[74957.052106]  ? kthread_flush_work_fn+0x10/0x10
[74957.052788]  ret_from_fork+0x35/0x40
[74957.053339] INFO: task mdt_rdpg00_001:2740910 blocked for more than 120 seconds.
[74957.054452]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[74957.055758] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74957.056940] mdt_rdpg00_001  D    0 2740910      2 0x80004080
[74957.057799] Call Trace:
[74957.058186]  __schedule+0x2a6/0x700
[74957.058725]  schedule+0x38/0xa0
[74957.059213]  wait_transaction_locked+0x89/0xc0 [jbd2]
[74957.059994]  ? finish_wait+0x80/0x80
[74957.060539]  add_transaction_credits+0xd4/0x290 [jbd2]
[74957.061331]  ? xas_load+0x8/0x80
[74957.061839]  ? find_get_entry+0x51/0x170
[74957.062437]  start_this_handle+0x10a/0x510 [jbd2]
[74957.063160]  ? _cond_resched+0x15/0x30
[74957.063743]  jbd2__journal_start+0xea/0x1f0 [jbd2]
[74957.064548]  ? osd_trans_start+0x2cc/0x530 [osd_ldiskfs]
[74957.065411]  __ldiskfs_journal_start_sb+0x6e/0x130 [ldiskfs]
[74957.066299]  osd_trans_start+0x2cc/0x530 [osd_ldiskfs]
[74957.067148]  qmt_trans_start_with_slv+0x4ec/0x780 [lquota]
[74957.068003]  qmt_set_with_lqe+0x7e/0xca0 [lquota]
[74957.068758]  ? lprocfs_counter_sub+0xd8/0x120 [obdclass]
[74957.069583]  ? iam_lfix_start+0x20/0x20 [osd_ldiskfs]
[74957.070373]  ? iam_path_release+0x32/0x50 [osd_ldiskfs]
[74957.071179]  qmt_lqe_set_default+0x1e7/0x230 [lquota]
[74957.071959]  qmt_lqe_read+0x2a8/0x360 [lquota]
[74957.072656]  ? lprocfs_counter_add+0xd2/0x140 [obdclass]
[74957.073482]  lqe_locate_find+0x2b0/0x7c0 [lquota]
[74957.074219]  qmt_pool_lqe_lookup+0xb3/0x230 [lquota]
[74957.074988]  qmt_lvbo_init+0x2e8/0x7d0 [lquota]
[74957.075703]  ldlm_handle_enqueue0+0xe33/0x16d0 [ptlrpc]
[74957.076543]  tgt_enqueue+0xa4/0x1f0 [ptlrpc]
[74957.077249]  tgt_request_handle+0xc78/0x1910 [ptlrpc]
[74957.078060]  ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[74957.078994]  ptlrpc_main+0xba4/0x14a0 [ptlrpc]
[74957.079704]  ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[74957.080556]  kthread+0x112/0x130
[74957.081062]  ? kthread_flush_work_fn+0x10/0x10
[74957.081743]  ret_from_fork+0x35/0x40
[74957.082298] INFO: task qmt_reba_lustre:2740927 blocked for more than 120 seconds.
[74957.083421]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[74957.084725] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74957.085911] qmt_reba_lustre D    0 2740927      2 0x80004080
[74957.086768] Call Trace:
[74957.087157]  __schedule+0x2a6/0x700
[74957.087689]  ? __switch_to_asm+0x35/0x70
[74957.088305]  ? __switch_to_asm+0x35/0x70
[74957.089016]  schedule+0x38/0xa0
[74957.089497]  rwsem_down_write_slowpath+0x32d/0x4e0
[74957.090246]  qmt_reba_thread+0x6ca/0x9b0 [lquota]
[74957.090979]  ? qmt_glimpse_lock.isra.19+0xfb0/0xfb0 [lquota]
[74957.091842]  kthread+0x112/0x130
[74957.092340]  ? kthread_flush_work_fn+0x10/0x10
[74957.093025]  ret_from_fork+0x35/0x40
[74957.093571] INFO: task osp-syn-0-0:2740936 blocked for more than 120 seconds.
[74957.094650]       Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[74957.095959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74957.097136] osp-syn-0-0     D    0 2740936      2 0x80004080
[74957.097998] Call Trace:
[74957.098379]  __schedule+0x2a6/0x700
[74957.098925]  schedule+0x38/0xa0
[74957.099409]  wait_transaction_locked+0x89/0xc0 [jbd2]
[74957.100184]  ? finish_wait+0x80/0x80
[74957.100748]  add_transaction_credits+0xd4/0x290 [jbd2]
…

MDS2,4 (vm5) does not show hung processes

[93775.605131] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 03:14:16 (1608952456)
[97440.358977] sysrq: SysRq : Changing Loglevel
[97440.359720] sysrq: Loglevel set to 8
[97440.361502] sysrq: SysRq : Show State
[97440.362101]   task                        PC stack   pid father
[97440.363127] systemd         S    0     1      0 0x00000000
[97440.364358] Call Trace:
[97440.364902]  __schedule+0x2a6/0x700
[97440.365470]  schedule+0x38/0xa0
[97440.365963]  schedule_hrtimeout_range_clock+0xf7/0x110
[97440.366765]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
[97440.367615]  ep_poll+0x3c8/0x410
[97440.368135]  ? wake_up_q+0x80/0x80
[97440.368668]  do_epoll_wait+0xb0/0xd0
[97440.369238]  __x64_sys_epoll_wait+0x1a/0x20
[97440.369884]  do_syscall_64+0x5b/0x1a0
[97440.370466]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[97440.371257] RIP: 0033:0x7f5e4901b257
[97440.371813] Code: Bad RIP value.
[97440.372329] RSP: 002b:00007ffc19fc2c70 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[97440.373488] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f5e4901b257
[97440.374579] RDX: 0000000000000055 RSI: 00007ffc19fc2cb0 RDI: 0000000000000004
[97440.375673] RBP: 00007ffc19fc2cb0 R08: 0000000000000000 R09: 7465677261742e79
[97440.376764] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000055
[97440.377855] R13: 00000000ffffffff R14: 00007ffc19fc2cb0 R15: 0000000000000001
[97440.378956] kthreadd        S    0     2      0 0x80004000

There are some similarities with the racer hang in the same test session at https://testing.whamcloud.com/test_sets/ee656977-4dbb-4d0d-a70f-9bb722f665a8 .

Attachments

Issue Links

is related to

LU-14187 LU-14134 causes many racer timeouts

Resolved

LU-16163 parallel-scale-nfsv3 test racer_on_nfs hangs with ‘general protection fault’ in nfs3_proc_setacls()

Resolved

LU-14134 reduce credits for new writing potentially

Resolved

mentioned in: Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...

(17 mentioned in)

Activity

[LU-14304] parallel-scale-nfsv3 test racer_on_nfs hangs

Andreas Dilger added a comment - 19/Apr/24 6:42 AM

patch https://review.whamcloud.com/0579 "LU-16163 tests: skip racer_on_nfs for NFSv3" pushed to exclude this test

Andreas Dilger added a comment - 19/Apr/24 6:42 AM patch https://review.whamcloud.com/0579 " LU-16163 tests: skip racer_on_nfs for NFSv3 " pushed to exclude this test

John Hammond added a comment - 07/Jan/21 4:00 PM

Potentially fixed by the landing of ~~LU-14187~~ (osd-ldiskfs: fix locking in write commit) to master.

John Hammond added a comment - 07/Jan/21 4:00 PM Potentially fixed by the landing of LU-14187 (osd-ldiskfs: fix locking in write commit) to master.

People

Assignee:: WC Triage

Reporter:: James Nunez (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 07/Jan/21 2:35 AM

Updated:: 19/Apr/24 6:43 AM

Resolved:: 19/Apr/24 6:43 AM