Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.0, Lustre 2.12.2
- Labels: None
- Severity: 3
- Rank: 9223372036854775807
Description
racer test_1 hangs in locking. This issue looks a lot like LU-10852, but the stack traces are slightly different and this failure is not in a DNE environment, so we are opening a new ticket.
Looking at the logs at https://testing.whamcloud.com/test_sets/1c19f7d0-f9d7-11e8-b216-52540065bddc, we can see that many processes on the clients are blocked. From client 2 (vm9), in the client console we see:
[46895.114548] LustreError: 11-0: lustre-MDT0000-mdc-ffff9d96fa9a7800: operation ldlm_enqueue to node 10.2.8.143@tcp failed: rc = -107
[46895.115945] Lustre: lustre-MDT0000-mdc-ffff9d96fa9a7800: Connection to lustre-MDT0000 (at 10.2.8.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
[46895.119762] LustreError: 167-0: lustre-MDT0000-mdc-ffff9d96fa9a7800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[46895.133418] LustreError: 2635:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000403:0x942:0x0] error: rc = -5
[46895.140482] LustreError: 2167:0:(vvp_io.c:1495:vvp_io_init()) lustre: refresh file layout [0x200000405:0x70a:0x0] error -108.
[46895.146261] LustreError: 3286:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can't alloc new fid, rc -19
[46895.147487] LustreError: 18765:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
[46895.168736] LustreError: 18884:0:(mdc_request.c:1427:mdc_read_page()) lustre-MDT0000-mdc-ffff9d96fa9a7800: [0x200000402:0x23c:0x0] lock enqueue fails: rc = -108
[46895.208950] LustreError: 19579:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff9d96fa9a7800: inode [0x200000402:0x23c:0x0] mdc close failed: rc = -108
[46895.222023] LustreError: 12501:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff9d96fa9a7800: namespace resource [0x200000402:0x23d:0x0].0x0 (ffff9d96f98e9e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
[46895.633613] LustreError: 13031:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -108
[46895.634939] LustreError: 13031:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 609 previous similar messages
[46895.635482] Lustre: lustre-MDT0000-mdc-ffff9d96fa9a7800: Connection restored to 10.2.8.143@tcp (at 10.2.8.143@tcp)
[46920.333393] INFO: task setfattr:32731 blocked for more than 120 seconds.
[46920.334207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.335014] setfattr D ffff9d9695c36eb0 0 32731 638 0x00000080
[46920.335773] Call Trace:
[46920.336068] [<ffffffffae22ae2e>] ? generic_permission+0x15e/0x1d0
[46920.336704] [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70
[46920.337337] [<ffffffffae717c17>] __mutex_lock_slowpath+0xc7/0x1d0
[46920.337958] [<ffffffffae716fff>] mutex_lock+0x1f/0x2f
[46920.338471] [<ffffffffae24700f>] vfs_removexattr+0x5f/0x130
[46920.339043] [<ffffffffae247135>] removexattr+0x55/0x80
[46920.339578] [<ffffffffae230a7d>] ? putname+0x3d/0x60
[46920.340101] [<ffffffffae231c92>] ? user_path_at_empty+0x72/0xc0
[46920.340711] [<ffffffffae2221c8>] ? __sb_start_write+0x58/0x110
[46920.341297] [<ffffffffae72056c>] ? __do_page_fault+0x1bc/0x4f0
[46920.341900] [<ffffffffae241d9c>] ? mnt_want_write+0x2c/0x50
[46920.342455] [<ffffffffae247f34>] SyS_removexattr+0x94/0xd0
[46920.343017] [<ffffffffae72579b>] system_call_fastpath+0x22/0x27
[46920.343616] [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.344266] INFO: task cp:3161 blocked for more than 120 seconds.
[46920.344878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.345642] cp D ffff9d96e3c45ee0 0 3161 645 0x00000080
[46920.346377] Call Trace:
[46920.346659] [<ffffffffae2d5152>] ? security_inode_permission+0x22/0x30
[46920.347301] [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70
[46920.347946] [<ffffffffae717c17>] __mutex_lock_slowpath+0xc7/0x1d0
[46920.348543] [<ffffffffae716fff>] mutex_lock+0x1f/0x2f
[46920.349083] [<ffffffffae710bee>] lookup_slow+0x33/0xa7
[46920.349608] [<ffffffffae22e7a8>] path_lookupat+0x838/0x8b0
[46920.350178] [<ffffffffae1970cb>] ? unlock_page+0x2b/0x30
[46920.350732] [<ffffffffae1faf39>] ? kmem_cache_alloc+0x179/0x1f0
[46920.351320] [<ffffffffae230aef>] ? getname_flags+0x4f/0x1a0
[46920.351894] [<ffffffffae22e84b>] filename_lookup+0x2b/0xc0
[46920.352442] [<ffffffffae231c87>] user_path_at_empty+0x67/0xc0
[46920.353036] [<ffffffffae127b72>] ? from_kgid_munged+0x12/0x20
[46920.353630] [<ffffffffae2251df>] ? cp_new_stat+0x14f/0x180
[46920.354180] [<ffffffffae231cf1>] user_path_at+0x11/0x20
[46920.354718] [<ffffffffae224cd3>] vfs_fstatat+0x63/0xc0
[46920.355234] [<ffffffffae22523e>] SYSC_newstat+0x2e/0x60
[46920.355776] [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.356427] [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.357093] [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.357748] [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.358394] [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.359050] [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.359715] [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.360355] [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.361018] [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.361676] [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.362314] [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.362967] [<ffffffffae22551e>] SyS_newstat+0xe/0x10
[46920.363469] [<ffffffffae72579b>] system_call_fastpath+0x22/0x27
[46920.364082] [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.364739] INFO: task ln:3169 blocked for more than 120 seconds.
[46920.365340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.366106] ln D ffff9d96d8966eb0 0 3169 616 0x00000080
[46920.366863] Call Trace:
[46920.367122] [<ffffffffae22e092>] ? path_lookupat+0x122/0x8b0
[46920.367698] [<ffffffffae1f8e19>] ? ___slab_alloc+0x209/0x4f0
[46920.368264] [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70
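The "blocked for more than 120 seconds" messages come from the kernel's hung-task watchdog, as the log itself notes. As a triage aid (a minimal sketch using standard kernel interfaces, not commands taken from this ticket), the watchdog threshold can be tuned and the stacks of all D-state tasks dumped on demand for comparison with the traces above:

# Dump the stacks of all uninterruptible (D-state) tasks to the console
# (requires sysrq to be enabled):
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
# Disable (0) or raise the hung-task warning threshold, in seconds:
echo 0 > /proc/sys/kernel/hung_task_timeout_secs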
On client 1 (vm10), we see dir_create, ls, mkdir, and other calls blocked:
[46775.380960] LustreError: 167-0: lustre-MDT0000-mdc-ffff9b35bbd7d000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[46775.392616] LustreError: 14241:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000401:0x122:0x0] error: rc = -5
[46775.394009] LustreError: 14241:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
[46775.395224] LustreError: 30588:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
[46775.402186] LustreError: 18530:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff9b35bbd7d000: inode [0x200000404:0x148:0x0] mdc close failed: rc = -108
[46775.403682] LustreError: 18530:0:(file.c:216:ll_close_inode_openhandle()) Skipped 1 previous similar message
[46775.415258] LustreError: 17871:0:(llite_lib.c:1547:ll_md_setattr()) md_setattr fails: rc = -108
[46775.421674] LustreError: 18956:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can't alloc new fid, rc -19
[46775.431989] Lustre: lustre-MDT0000-mdc-ffff9b35bbd7d000: Connection restored to 10.2.8.143@tcp (at 10.2.8.143@tcp)
[46920.246223] INFO: task dir_create.sh:4489 blocked for more than 120 seconds.
[46920.247112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.247952] dir_create.sh D ffff9b35baf1af70 0 4489 4447 0x00000080
[46920.248755] Call Trace:
[46920.249111] [<ffffffffc0ea57e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
[46920.249780] [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70
[46920.250439] [<ffffffff94317c17>] __mutex_lock_slowpath+0xc7/0x1d0
[46920.251074] [<ffffffff94316fff>] mutex_lock+0x1f/0x2f
[46920.251612] [<ffffffff94310bee>] lookup_slow+0x33/0xa7
[46920.252169] [<ffffffff93e2d01f>] link_path_walk+0x80f/0x8b0
[46920.252746] [<ffffffff93e30205>] path_openat+0xb5/0x640
[46920.253285] [<ffffffff93e31dbd>] do_filp_open+0x4d/0xb0
[46920.253848] [<ffffffff93e3f1e4>] ? __alloc_fd+0xc4/0x170
[46920.254401] [<ffffffff93e1e0d7>] do_sys_open+0x137/0x240
[46920.254959] [<ffffffff943256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.255631] [<ffffffff93e1e1fe>] SyS_open+0x1e/0x20
[46920.256150] [<ffffffff9432579b>] system_call_fastpath+0x22/0x27
[46920.256760] [<ffffffff943256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.257485] INFO: task ls:6624 blocked for more than 120 seconds.
[46920.258108] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.258901] ls D ffff9b359eb6eeb0 0 6624 4495 0x00000080
[46920.259661] Call Trace:
[46920.259932] [<ffffffff93cc6081>] ? in_group_p+0x31/0x40
[46920.260487] [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70
[46920.261139] [<ffffffff94317c17>] __mutex_lock_slowpath+0xc7/0x1d0
[46920.261777] [<ffffffff94316fff>] mutex_lock+0x1f/0x2f
[46920.262300] [<ffffffff93e2f11f>] do_last+0x28f/0x12c0
[46920.262826] [<ffffffff93e30227>] path_openat+0xd7/0x640
[46920.263362] [<ffffffff93e30aef>] ? getname_flags+0x4f/0x1a0
[46920.263957] [<ffffffffc0eb2159>] ? ll_file_data_put+0x89/0x180 [lustre]
[46920.264625] [<ffffffff93e31dbd>] do_filp_open+0x4d/0xb0
[46920.265173] [<ffffffff93e3f1e4>] ? __alloc_fd+0xc4/0x170
[46920.265739] [<ffffffff93e1e0d7>] do_sys_open+0x137/0x240
[46920.266281] [<ffffffff943256d5>] ? system_call_after_swapgs+0xa2/0x146
[46920.266939] [<ffffffff93e1e214>] SyS_openat+0x14/0x20
[46920.267473] [<ffffffff9432579b>] system_call_fastpath+0x22/0x27
[46920.268081] [<ffffffff943256e1>] ? system_call_after_swapgs+0xae/0x146
[46920.268778] INFO: task ln:8693 blocked for more than 120 seconds.
[46920.269388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.270180] ln D ffff9b3595468fd0 0 8693 4475 0x00000080
[46920.270940] Call Trace:
[46920.271201] [<ffffffff93e2e092>] ? path_lookupat+0x122/0x8b0
[46920.271790] [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70
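The -107 (ENOTCONN) and -108 (ESHUTDOWN) errors above are consistent with the eviction and reconnect messages. To confirm the import state from the client side, the standard lctl parameters can be inspected; a sketch of common triage commands (not taken from the ticket logs), with the device name pattern matching the MDC instances shown above:

# Current import state and recent connection history for the MDC device:
lctl get_param mdc.lustre-MDT0000-mdc-*.import
lctl get_param mdc.lustre-MDT0000-mdc-*.state
# Dump the client-side Lustre debug buffer to a file for later analysis:
lctl dk /tmp/client-debug.log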
On the MDS (vm12), we see many stack traces like the following:
[46559.329558] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-41vm10,onyx-41vm9.onyx.whamcloud.com DURATION=900 ============ 09:34:14 (1544088854)
[46564.017497] LustreError: 5798:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 544 for EA 568 (max_mdsize 568): rc = -34
[46643.846755] LNet: Service thread pid 5790 was inactive for 62.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[46643.848574] Pid: 5790, comm: mdt00_011 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Wed Dec 5 03:41:24 UTC 2018
[46643.849563] Call Trace:
[46643.849882] [<ffffffffc0ef8031>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[46643.850848] [<ffffffffc0ef8dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[46643.851624] [<ffffffffc11e951b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[46643.852657] [<ffffffffc11e9ba0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[46643.853470] [<ffffffffc11e9f47>] mdt_object_lock_try+0x27/0xb0 [mdt]
[46643.854167] [<ffffffffc11eb687>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
[46643.854970] [<ffffffffc11f2ba5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[46643.855694] [<ffffffffc11efa08>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[46643.856459] [<ffffffffc0edeec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[46643.857209] [<ffffffffc0f078a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[46643.858036] [<ffffffffc0f8e242>] tgt_enqueue+0x62/0x210 [ptlrpc]
[46643.858866] [<ffffffffc0f9529a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[46643.859648] [<ffffffffc0f3991b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[46643.860560] [<ffffffffc0f3d24c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[46643.861315] [<ffffffff95cbdf21>] kthread+0xd1/0xe0
[46643.861876] [<ffffffff963255f7>] ret_from_fork_nospec_end+0x0/0x39
[46643.862590] [<ffffffffffffffff>] 0xffffffffffffffff
[46643.863234] LustreError: dumping log to /tmp/lustre-log.1544088939.5790
…
[47031.433294] Pid: 5782, comm: mdt00_004 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Wed Dec 5 03:41:24 UTC 2018
[47031.434325] Call Trace:
[47031.434609] [<ffffffffc0ef8031>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[47031.435489] [<ffffffffc0ef8dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[47031.436322] [<ffffffffc11e951b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[47031.437360] [<ffffffffc11e9ba0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[47031.438194] [<ffffffffc11ec09a>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
[47031.438983] [<ffffffffc120aa5e>] mdt_reint_setxattr+0x1ce/0xfd0 [mdt]
[47031.439782] [<ffffffffc12085d3>] mdt_reint_rec+0x83/0x210 [mdt]
[47031.440507] [<ffffffffc11e51a3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[47031.441242] [<ffffffffc11f0487>] mdt_reint+0x67/0x140 [mdt]
[47031.441942] [<ffffffffc0f9529a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[47031.442833] [<ffffffffc0f3991b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[47031.443725] [<ffffffffc0f3d24c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[47031.444586] [<ffffffff95cbdf21>] kthread+0xd1/0xe0
[47031.445234] [<ffffffff963255f7>] ret_from_fork_nospec_end+0x0/0x39
[47031.445925] [<ffffffffffffffff>] 0xffffffffffffffff
[47031.446664] LustreError: dumping log to /tmp/lustre-log.1544089327.5782
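Each "dumping log to /tmp/lustre-log.*" line refers to a binary Lustre debug dump written when the service-thread watchdog fires. These can be converted to text with lctl; a sketch using the dump file names from the console output above, plus standard ldlm namespace counters that can help spot a resource with a stuck waiting queue:

# Decode a binary debug dump on the MDS into readable text:
lctl debug_file /tmp/lustre-log.1544088939.5790 /tmp/lustre-log.1544088939.5790.txt
# Per-namespace granted-lock and resource counts on the server:
lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.resource_count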
Additional logs for this issue are at:
https://testing.whamcloud.com/sub_tests/04a3b75e-f729-11e8-bfe1-52540065bddc
https://testing.whamcloud.com/sub_tests/24b8ffb6-f9ce-11e8-bb6b-52540065bddc
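For anyone attempting to reproduce, racer can be run through the standard lustre/tests harness; a minimal sketch assuming an installed test suite and a mounted filesystem (the DURATION value matches the test banner in the MDS console above; the exact invocation may differ per setup):

# Run only racer test_1 for 900 seconds, as in the failed sessions:
cd /usr/lib64/lustre/tests
DURATION=900 ./auster -v racer --only 1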
Issue Links
- is related to
  - LU-11359 racer test 1 times out with client hung in dir_create.sh, ls, … and MDS in ldlm_completion_ast() (Resolved)
  - LU-11285 don't stop on the first blocked lock in ldlm_reprocess_queue() (Resolved)