Details
- Type: Bug
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: Lustre 2.13.0
- Severity: 3
Description
recovery-small test_110k hangs. These hangs appear to have started on 21 July 2019 with Lustre 2.12.56.17 and are only seen on master/b2_13.
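To reproduce a single subtest, the usual route is the Lustre test framework; a minimal sketch, assuming an already-configured test cluster and the in-tree lustre/tests scripts (the exact invocation is illustrative, not taken from the failed runs):

# run only test_110k from the recovery-small suite
cd lustre/tests
ONLY=110k sh recovery-small.sh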
Looking at the console logs for the hang at https://testing.whamcloud.com/test_sets/b732f8c4-e37a-11e9-add9-52540065bddc, we see mds_statfs requests failing with rc = -107 (-ENOTCONN) and, in the SysRq task dumps, kworkers blocked in key_garbage_collector:
client1 (vm11)
[18823.080898] Lustre: DEBUG MARKER: == recovery-small test 110k: FID_QUERY failed during recovery ======================================== 15:24:43 (1569770683)
[18826.019012] LustreError: 11-0: lustre-MDT0001-mdc-ffff986e5f303800: operation mds_statfs to node 10.9.4.165@tcp failed: rc = -107
[18826.020320] LustreError: Skipped 4 previous similar messages
[22427.043134] sysrq: SysRq : Changing Loglevel
[22427.043724] sysrq: Loglevel set to 8
[22427.046389] sysrq: SysRq : Show State
[22427.046821] task PC stack pid father
[22427.047454] systemd S 0 1 0 0x00000000
[22427.048036] Call Trace:
[22427.048363] ? __schedule+0x254/0x840
[22427.048753] schedule+0x28/0x70
[22427.049116] schedule_hrtimeout_range_clock+0x18d/0x1a0
[22427.049667] ? ep_scan_ready_list.constprop.22+0x20c/0x230
[22427.050253] ep_poll+0x3c8/0x410
[22427.050622] ? wake_up_q+0x70/0x70
[22427.050994] do_epoll_wait+0xb0/0xd0
[22427.051394] __x64_sys_epoll_wait+0x1a/0x20
[22427.051842] do_syscall_64+0x5b/0x1b0
[22427.052260] entry_SYSCALL_64_after_hwframe+0x65/0xca
[22427.052789] RIP: 0033:0x7f49dea94d97
[22427.053197] Code: Bad RIP value.
[22427.053545] RSP: 002b:00007fffe0967b50 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[22427.054327] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f49dea94d97
[22427.055061] RDX: 000000000000005c RSI: 00007fffe0967b90 RDI: 0000000000000004
[22427.055777] RBP: 00007fffe0967b90 R08: 0000000000000000 R09: 7465677261742e79
[22427.056507] R10: 00000000ffffffff R11: 0000000000000293 R12: 000000000000005c
[22427.057238] R13: 00000000ffffffff R14: 00007fffe0967b90 R15: 0000000000000001
[22427.057954] kthreadd S 0 2 0 0x80000000
[22427.058521] Call Trace:
[22427.058804] ? __schedule+0x254/0x840
[22427.059207] schedule+0x28/0x70
[22427.059550] kthreadd+0x2e4/0x300
[22427.059912] ? kthread_create_on_cpu+0xa0/0xa0
[22427.060397] ret_from_fork+0x35/0x40
[22427.060794] rcu_gp I 0 3 2 0x80000000
[22427.061372] Workqueue: (null) (rcu_gp)
[22427.061855] Call Trace:
[22427.062147] ? __schedule+0x254/0x840
[22427.062544] schedule+0x28/0x70
[22427.062886] rescuer_thread+0x2ee/0x340
[22427.063319] ? flush_rcu_work+0x30/0x30
[22427.063721] kthread+0x112/0x130
[22427.064084] ? kthread_bind+0x30/0x30
[22427.064476] ret_from_fork+0x35/0x40
[22427.064854] rcu_par_gp I 0 4 2 0x80000000
[22427.065429] Call Trace:
[22427.065708] ? __schedule+0x254/0x840
[22427.066113] schedule+0x28/0x70
[22427.066455] rescuer_thread+0x2ee/0x340
[22427.066855] ? flush_rcu_work+0x30/0x30
[22427.067282] kthread+0x112/0x130
[22427.067637] ? kthread_bind+0x30/0x30
[22427.068038] ret_from_fork+0x35/0x40
…
[22428.500219] kworker/1:1 D 0 16467 2 0x80000080
[22428.500791] Workqueue: events key_garbage_collector
[22428.501318] Call Trace:
[22428.501596] ? __schedule+0x254/0x840
[22428.501993] schedule+0x28/0x70
[22428.502343] schedule_timeout+0x26d/0x390
[22428.502766] ? check_preempt_curr+0x62/0x90
[22428.503222] wait_for_completion+0x11f/0x190
[22428.503667] ? wake_up_q+0x70/0x70
[22428.504050] __wait_rcu_gp+0x12c/0x160
[22428.504453] synchronize_rcu+0x76/0x80
[22428.504853] ? __call_rcu+0x280/0x280
[22428.505260] ? __bpf_trace_rcu_utilization+0x10/0x10
[22428.505773] key_garbage_collector+0x1d7/0x3e0
[22428.506268] process_one_work+0x1a7/0x360
[22428.506690] worker_thread+0x30/0x390
[22428.507102] ? pwq_unbound_release_workfn+0xd0/0xd0
[22428.507608] kthread+0x112/0x130
[22428.507958] ? kthread_bind+0x30/0x30
[22428.508363] ret_from_fork+0x35/0x40
client2 (vm12)
[18827.149580] Lustre: DEBUG MARKER: == recovery-small test 110k: FID_QUERY failed during recovery ======================================== 15:24:43 (1569770683)
[22433.833049] sysrq: SysRq : Changing Loglevel
[22433.833608] sysrq: Loglevel set to 8
[22433.838786] sysrq: SysRq : Show State
[22433.839253] task PC stack pid father
[22433.839871] systemd S 0 1 0 0x00000000
[22433.840457] Call Trace:
[22433.840780] ? __schedule+0x254/0x840
[22433.841198] schedule+0x28/0x70
[22433.841551] schedule_hrtimeout_range_clock+0x18d/0x1a0
[22433.842134] ? ep_scan_ready_list.constprop.22+0x20c/0x230
[22433.842706] ep_poll+0x3c8/0x410
[22433.843101] ? wake_up_q+0x70/0x70
[22433.843477] do_epoll_wait+0xb0/0xd0
[22433.843864] __x64_sys_epoll_wait+0x1a/0x20
[22433.844330] do_syscall_64+0x5b/0x1b0
[22433.844735] entry_SYSCALL_64_after_hwframe+0x65/0xca
[22433.845288] RIP: 0033:0x7fd7332c2d97
[22433.845678] Code: Bad RIP value.
[22433.846045] RSP: 002b:00007ffdb0a7b080 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[22433.846810] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fd7332c2d97
[22433.847548] RDX: 0000000000000059 RSI: 00007ffdb0a7b0c0 RDI: 0000000000000004
[22433.848290] RBP: 00007ffdb0a7b0c0 R08: 0000000000000000 R09: 7465677261742e79
[22433.849032] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000059
[22433.849755] R13: 00000000ffffffff R14: 00007ffdb0a7b0c0 R15: 0000000000000001
[22433.850497] kthreadd S 0 2 0 0x80000000
[22433.851081] Call Trace:
[22433.851364] ? __schedule+0x254/0x840
[22433.851756] schedule+0x28/0x70
[22433.852118] kthreadd+0x2e4/0x300
[22433.852494] ? kthread_create_on_cpu+0xa0/0xa0
[22433.852965] ret_from_fork+0x35/0x40
[22433.853384] rcu_gp I 0 3 2 0x80000000
[22433.853959] Call Trace:
[22433.854259] ? __schedule+0x254/0x840
[22433.854655] schedule+0x28/0x70
[22433.855020] rescuer_thread+0x2ee/0x340
[22433.855443] ? flush_rcu_work+0x30/0x30
[22433.855856] kthread+0x112/0x130
[22433.856239] ? kthread_bind+0x30/0x30
[22433.856635] ret_from_fork+0x35/0x40
…
[ 8826.737348] kworker/1:1 D 0 5439 2 0x80000080
[ 8826.737911] Workqueue: events key_garbage_collector
[ 8826.738582] Call Trace:
[ 8826.738989] ? __schedule+0x254/0x840
[ 8826.739599] ? pcpu_next_unpop+0x37/0x50
[ 8826.740216] schedule+0x28/0x70
[ 8826.740713] schedule_timeout+0x26d/0x390
[ 8826.741367] ? _cond_resched+0x15/0x30
[ 8826.741859] ? cpumask_next+0x17/0x20
[ 8826.742372] ? kmem_cache_open+0x3f0/0x430
[ 8826.742804] wait_for_completion+0x11f/0x190
[ 8826.743451] ? wake_up_q+0x70/0x70
[ 8826.743999] __wait_rcu_gp+0x12c/0x160
[ 8826.744654] synchronize_rcu+0x76/0x80
[ 8826.745261] ? __call_rcu+0x280/0x280
[ 8826.745857] ? __bpf_trace_rcu_utilization+0x10/0x10
[ 8826.746413] key_garbage_collector+0x1d7/0x3e0
[ 8826.747031] process_one_work+0x1a7/0x360
[ 8826.747687] worker_thread+0x1fa/0x390
[ 8826.748294] ? pwq_unbound_release_workfn+0xd0/0xd0
[ 8826.748839] kthread+0x112/0x130
[ 8826.749434] ? kthread_bind+0x30/0x30
[ 8826.750218] ret_from_fork+0x35/0x40
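On both clients the telling entry is the kworker stuck in uninterruptible sleep (D state) running key_garbage_collector, which waits in synchronize_rcu() for a grace period that never completes. As a rough aid (standard procps, nothing Lustre-specific), such tasks and their kernel wait channels can be listed on a live node without a full SysRq dump:

# show uninterruptible (D-state) tasks and where in the kernel they sleep
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'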
MDS1,3 (vm6)
[18819.270132] Lustre: DEBUG MARKER: == recovery-small test 110k: FID_QUERY failed during recovery ======================================== 15:24:43 (1569770683)
[18822.297574] LustreError: 11-0: lustre-MDT0001-osp-MDT0002: operation mds_statfs to node 10.9.4.165@tcp failed: rc = -107
[18822.299461] LustreError: Skipped 6 previous similar messages
MDS2,4 (vm7)
[18824.196956] Lustre: DEBUG MARKER: == recovery-small test 110k: FID_QUERY failed during recovery ======================================== 15:24:43 (1569770683)
[18824.510121] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
[18825.124512] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
[18825.423041] Lustre: Failing over lustre-MDT0001
[18825.423922] Lustre: Skipped 7 previous similar messages
[18825.657887] Lustre: server umount lustre-MDT0001 complete
[18825.658935] Lustre: Skipped 7 previous similar messages
[18826.212991] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[18826.212991] lctl dl | grep ' ST ' || true
[18826.797370] Lustre: DEBUG MARKER: modprobe dm-flakey;
[18826.797370] dmsetup targets | grep -q flakey
[18826.868811] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.165@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[18826.871737] LustreError: Skipped 187 previous similar messages
[18903.428904] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.165@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[18903.432265] LustreError: Skipped 197 previous similar messages
[19053.669128] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.165@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[19053.672181] LustreError: Skipped 387 previous similar messages
[19354.148751] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.165@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[19354.152173] LustreError: Skipped 777 previous similar messages
[19954.814825] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.118@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[19954.817893] LustreError: Skipped 1553 previous similar messages
[20556.069346] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.165@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[20556.072539] LustreError: Skipped 1555 previous similar messages
[21157.029130] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.165@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[21157.032971] LustreError: Skipped 1552 previous similar messages
[21757.094270] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.118@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[21757.097259] LustreError: Skipped 1553 previous similar messages
[22358.949018] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.9.4.165@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[22358.952110] LustreError: Skipped 1555 previous similar messages
[22443.966949] SysRq : Changing Loglevel
[22443.967688] Loglevel set to 8
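The "Show State" dumps above were captured through the kernel's magic SysRq interface; for reference, a minimal sketch of the equivalent manual capture on a hung node (plain kernel sysrq-trigger usage, not taken from the test logs):

# raise the console loglevel so the dump is not filtered ("Loglevel set to 8" above)
echo 8 > /proc/sysrq-trigger
# dump the state and stack of every task ("SysRq : Show State" above)
echo t > /proc/sysrq-trigger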
We have several instances of this hang:
https://testing.whamcloud.com/test_sets/74cea80c-03e1-11ea-b934-52540065bddc
https://testing.whamcloud.com/test_sets/304f5830-03f9-11ea-a9d7-52540065bddc