Lustre / LU-11751

racer deadlocks due to DOM glimpse request


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.2

    Description

      racer test_1 hangs on lock acquisition. This issue looks a lot like LU-10852, but the stack traces are a bit different and this failure is not in a DNE environment, so I'm opening a new ticket.

      Looking at the logs at https://testing.whamcloud.com/test_sets/1c19f7d0-f9d7-11e8-b216-52540065bddc, we can see that many processes on the clients are blocked. On client 2 (vm9), the client console shows

      [46895.114548] LustreError: 11-0: lustre-MDT0000-mdc-ffff9d96fa9a7800: operation ldlm_enqueue to node 10.2.8.143@tcp failed: rc = -107
      [46895.115945] Lustre: lustre-MDT0000-mdc-ffff9d96fa9a7800: Connection to lustre-MDT0000 (at 10.2.8.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [46895.119762] LustreError: 167-0: lustre-MDT0000-mdc-ffff9d96fa9a7800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [46895.133418] LustreError: 2635:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000403:0x942:0x0] error: rc = -5
      [46895.140482] LustreError: 2167:0:(vvp_io.c:1495:vvp_io_init()) lustre: refresh file layout [0x200000405:0x70a:0x0] error -108.
      [46895.146261] LustreError: 3286:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can't alloc new fid, rc -19
      [46895.147487] LustreError: 18765:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
      [46895.168736] LustreError: 18884:0:(mdc_request.c:1427:mdc_read_page()) lustre-MDT0000-mdc-ffff9d96fa9a7800: [0x200000402:0x23c:0x0] lock enqueue fails: rc = -108
      [46895.208950] LustreError: 19579:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff9d96fa9a7800: inode [0x200000402:0x23c:0x0] mdc close failed: rc = -108
      [46895.222023] LustreError: 12501:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff9d96fa9a7800: namespace resource [0x200000402:0x23d:0x0].0x0 (ffff9d96f98e9e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [46895.633613] LustreError: 13031:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -108
      [46895.634939] LustreError: 13031:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 609 previous similar messages
      [46895.635482] Lustre: lustre-MDT0000-mdc-ffff9d96fa9a7800: Connection restored to 10.2.8.143@tcp (at 10.2.8.143@tcp)
      [46920.333393] INFO: task setfattr:32731 blocked for more than 120 seconds.
      [46920.334207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [46920.335014] setfattr        D ffff9d9695c36eb0     0 32731    638 0x00000080
      [46920.335773] Call Trace:
      [46920.336068]  [<ffffffffae22ae2e>] ? generic_permission+0x15e/0x1d0
      [46920.336704]  [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70
      [46920.337337]  [<ffffffffae717c17>] __mutex_lock_slowpath+0xc7/0x1d0
      [46920.337958]  [<ffffffffae716fff>] mutex_lock+0x1f/0x2f
      [46920.338471]  [<ffffffffae24700f>] vfs_removexattr+0x5f/0x130
      [46920.339043]  [<ffffffffae247135>] removexattr+0x55/0x80
      [46920.339578]  [<ffffffffae230a7d>] ? putname+0x3d/0x60
      [46920.340101]  [<ffffffffae231c92>] ? user_path_at_empty+0x72/0xc0
      [46920.340711]  [<ffffffffae2221c8>] ? __sb_start_write+0x58/0x110
      [46920.341297]  [<ffffffffae72056c>] ? __do_page_fault+0x1bc/0x4f0
      [46920.341900]  [<ffffffffae241d9c>] ? mnt_want_write+0x2c/0x50
      [46920.342455]  [<ffffffffae247f34>] SyS_removexattr+0x94/0xd0
      [46920.343017]  [<ffffffffae72579b>] system_call_fastpath+0x22/0x27
      [46920.343616]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.344266] INFO: task cp:3161 blocked for more than 120 seconds.
      [46920.344878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [46920.345642] cp              D ffff9d96e3c45ee0     0  3161    645 0x00000080
      [46920.346377] Call Trace:
      [46920.346659]  [<ffffffffae2d5152>] ? security_inode_permission+0x22/0x30
      [46920.347301]  [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70
      [46920.347946]  [<ffffffffae717c17>] __mutex_lock_slowpath+0xc7/0x1d0
      [46920.348543]  [<ffffffffae716fff>] mutex_lock+0x1f/0x2f
      [46920.349083]  [<ffffffffae710bee>] lookup_slow+0x33/0xa7
      [46920.349608]  [<ffffffffae22e7a8>] path_lookupat+0x838/0x8b0
      [46920.350178]  [<ffffffffae1970cb>] ? unlock_page+0x2b/0x30
      [46920.350732]  [<ffffffffae1faf39>] ? kmem_cache_alloc+0x179/0x1f0
      [46920.351320]  [<ffffffffae230aef>] ? getname_flags+0x4f/0x1a0
      [46920.351894]  [<ffffffffae22e84b>] filename_lookup+0x2b/0xc0
      [46920.352442]  [<ffffffffae231c87>] user_path_at_empty+0x67/0xc0
      [46920.353036]  [<ffffffffae127b72>] ? from_kgid_munged+0x12/0x20
      [46920.353630]  [<ffffffffae2251df>] ? cp_new_stat+0x14f/0x180
      [46920.354180]  [<ffffffffae231cf1>] user_path_at+0x11/0x20
      [46920.354718]  [<ffffffffae224cd3>] vfs_fstatat+0x63/0xc0
      [46920.355234]  [<ffffffffae22523e>] SYSC_newstat+0x2e/0x60
      [46920.355776]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.356427]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
      [46920.357093]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.357748]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
      [46920.358394]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.359050]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
      [46920.359715]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.360355]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
      [46920.361018]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.361676]  [<ffffffffae7256d5>] ? system_call_after_swapgs+0xa2/0x146
      [46920.362314]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.362967]  [<ffffffffae22551e>] SyS_newstat+0xe/0x10
      [46920.363469]  [<ffffffffae72579b>] system_call_fastpath+0x22/0x27
      [46920.364082]  [<ffffffffae7256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.364739] INFO: task ln:3169 blocked for more than 120 seconds.
      [46920.365340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [46920.366106] ln              D ffff9d96d8966eb0     0  3169    616 0x00000080
      [46920.366863] Call Trace:
      [46920.367122]  [<ffffffffae22e092>] ? path_lookupat+0x122/0x8b0
      [46920.367698]  [<ffffffffae1f8e19>] ? ___slab_alloc+0x209/0x4f0
      [46920.368264]  [<ffffffffae719e59>] schedule_preempt_disabled+0x29/0x70
      

      On client 1 (vm10), we see dir_create, ls, mkdir, and other processes blocked

      [46775.380960] LustreError: 167-0: lustre-MDT0000-mdc-ffff9b35bbd7d000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [46775.392616] LustreError: 14241:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000401:0x122:0x0] error: rc = -5
      [46775.394009] LustreError: 14241:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
      [46775.395224] LustreError: 30588:0:(mdc_locks.c:1257:mdc_intent_getattr_async_interpret()) ldlm_cli_enqueue_fini: -108
      [46775.402186] LustreError: 18530:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff9b35bbd7d000: inode [0x200000404:0x148:0x0] mdc close failed: rc = -108
      [46775.403682] LustreError: 18530:0:(file.c:216:ll_close_inode_openhandle()) Skipped 1 previous similar message
      [46775.415258] LustreError: 17871:0:(llite_lib.c:1547:ll_md_setattr()) md_setattr fails: rc = -108
      [46775.421674] LustreError: 18956:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can't alloc new fid, rc -19
      [46775.431989] Lustre: lustre-MDT0000-mdc-ffff9b35bbd7d000: Connection restored to 10.2.8.143@tcp (at 10.2.8.143@tcp)
      [46920.246223] INFO: task dir_create.sh:4489 blocked for more than 120 seconds.
      [46920.247112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [46920.247952] dir_create.sh   D ffff9b35baf1af70     0  4489   4447 0x00000080
      [46920.248755] Call Trace:
      [46920.249111]  [<ffffffffc0ea57e2>] ? ll_dcompare+0x72/0x2e0 [lustre]
      [46920.249780]  [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70
      [46920.250439]  [<ffffffff94317c17>] __mutex_lock_slowpath+0xc7/0x1d0
      [46920.251074]  [<ffffffff94316fff>] mutex_lock+0x1f/0x2f
      [46920.251612]  [<ffffffff94310bee>] lookup_slow+0x33/0xa7
      [46920.252169]  [<ffffffff93e2d01f>] link_path_walk+0x80f/0x8b0
      [46920.252746]  [<ffffffff93e30205>] path_openat+0xb5/0x640
      [46920.253285]  [<ffffffff93e31dbd>] do_filp_open+0x4d/0xb0
      [46920.253848]  [<ffffffff93e3f1e4>] ? __alloc_fd+0xc4/0x170
      [46920.254401]  [<ffffffff93e1e0d7>] do_sys_open+0x137/0x240
      [46920.254959]  [<ffffffff943256d5>] ? system_call_after_swapgs+0xa2/0x146
      [46920.255631]  [<ffffffff93e1e1fe>] SyS_open+0x1e/0x20
      [46920.256150]  [<ffffffff9432579b>] system_call_fastpath+0x22/0x27
      [46920.256760]  [<ffffffff943256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.257485] INFO: task ls:6624 blocked for more than 120 seconds.
      [46920.258108] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [46920.258901] ls              D ffff9b359eb6eeb0     0  6624   4495 0x00000080
      [46920.259661] Call Trace:
      [46920.259932]  [<ffffffff93cc6081>] ? in_group_p+0x31/0x40
      [46920.260487]  [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70
      [46920.261139]  [<ffffffff94317c17>] __mutex_lock_slowpath+0xc7/0x1d0
      [46920.261777]  [<ffffffff94316fff>] mutex_lock+0x1f/0x2f
      [46920.262300]  [<ffffffff93e2f11f>] do_last+0x28f/0x12c0
      [46920.262826]  [<ffffffff93e30227>] path_openat+0xd7/0x640
      [46920.263362]  [<ffffffff93e30aef>] ? getname_flags+0x4f/0x1a0
      [46920.263957]  [<ffffffffc0eb2159>] ? ll_file_data_put+0x89/0x180 [lustre]
      [46920.264625]  [<ffffffff93e31dbd>] do_filp_open+0x4d/0xb0
      [46920.265173]  [<ffffffff93e3f1e4>] ? __alloc_fd+0xc4/0x170
      [46920.265739]  [<ffffffff93e1e0d7>] do_sys_open+0x137/0x240
      [46920.266281]  [<ffffffff943256d5>] ? system_call_after_swapgs+0xa2/0x146
      [46920.266939]  [<ffffffff93e1e214>] SyS_openat+0x14/0x20
      [46920.267473]  [<ffffffff9432579b>] system_call_fastpath+0x22/0x27
      [46920.268081]  [<ffffffff943256e1>] ? system_call_after_swapgs+0xae/0x146
      [46920.268778] INFO: task ln:8693 blocked for more than 120 seconds.
      [46920.269388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [46920.270180] ln              D ffff9b3595468fd0     0  8693   4475 0x00000080
      [46920.270940] Call Trace:
      [46920.271201]  [<ffffffff93e2e092>] ? path_lookupat+0x122/0x8b0
      [46920.271790]  [<ffffffff94319e59>] schedule_preempt_disabled+0x29/0x70
      

      On the MDS (vm12), we see many stack traces like the following

      [46559.329558] Lustre: DEBUG MARKER: == racer test 1: racer on clients: onyx-41vm10,onyx-41vm9.onyx.whamcloud.com DURATION=900 ============ 09:34:14 (1544088854)
      [46564.017497] LustreError: 5798:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 544 for EA 568 (max_mdsize 568): rc = -34
      [46643.846755] LNet: Service thread pid 5790 was inactive for 62.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [46643.848574] Pid: 5790, comm: mdt00_011 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Wed Dec 5 03:41:24 UTC 2018
      [46643.849563] Call Trace:
      [46643.849882]  [<ffffffffc0ef8031>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [46643.850848]  [<ffffffffc0ef8dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [46643.851624]  [<ffffffffc11e951b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [46643.852657]  [<ffffffffc11e9ba0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [46643.853470]  [<ffffffffc11e9f47>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [46643.854167]  [<ffffffffc11eb687>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
      [46643.854970]  [<ffffffffc11f2ba5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [46643.855694]  [<ffffffffc11efa08>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [46643.856459]  [<ffffffffc0edeec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [46643.857209]  [<ffffffffc0f078a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [46643.858036]  [<ffffffffc0f8e242>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [46643.858866]  [<ffffffffc0f9529a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [46643.859648]  [<ffffffffc0f3991b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [46643.860560]  [<ffffffffc0f3d24c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
      [46643.861315]  [<ffffffff95cbdf21>] kthread+0xd1/0xe0
      [46643.861876]  [<ffffffff963255f7>] ret_from_fork_nospec_end+0x0/0x39
      [46643.862590]  [<ffffffffffffffff>] 0xffffffffffffffff
      [46643.863234] LustreError: dumping log to /tmp/lustre-log.1544088939.5790
      …
      [47031.433294] Pid: 5782, comm: mdt00_004 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Wed Dec 5 03:41:24 UTC 2018
      [47031.434325] Call Trace:
      [47031.434609]  [<ffffffffc0ef8031>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [47031.435489]  [<ffffffffc0ef8dcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [47031.436322]  [<ffffffffc11e951b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [47031.437360]  [<ffffffffc11e9ba0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [47031.438194]  [<ffffffffc11ec09a>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
      [47031.438983]  [<ffffffffc120aa5e>] mdt_reint_setxattr+0x1ce/0xfd0 [mdt]
      [47031.439782]  [<ffffffffc12085d3>] mdt_reint_rec+0x83/0x210 [mdt]
      [47031.440507]  [<ffffffffc11e51a3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [47031.441242]  [<ffffffffc11f0487>] mdt_reint+0x67/0x140 [mdt]
      [47031.441942]  [<ffffffffc0f9529a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [47031.442833]  [<ffffffffc0f3991b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [47031.443725]  [<ffffffffc0f3d24c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
      [47031.444586]  [<ffffffff95cbdf21>] kthread+0xd1/0xe0
      [47031.445234]  [<ffffffff963255f7>] ret_from_fork_nospec_end+0x0/0x39
      [47031.445925]  [<ffffffffffffffff>] 0xffffffffffffffff
      [47031.446664] LustreError: dumping log to /tmp/lustre-log.1544089327.5782
      

      Additional logs for this issue are at
      https://testing.whamcloud.com/sub_tests/04a3b75e-f729-11e8-bfe1-52540065bddc
      https://testing.whamcloud.com/sub_tests/24b8ffb6-f9ce-11e8-bb6b-52540065bddc
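
      For hangs like this, a few standard diagnostics are usually worth capturing on the stuck nodes before recovery or reboot. This is a minimal sketch using stock Lustre/kernel facilities (lctl and sysrq); exact parameter availability depends on the installed Lustre version, and the output paths are illustrative:

      ```shell
      # Dump the Lustre kernel debug buffer on the hung MDS -- the same
      # mechanism behind the automatic "dumping log to /tmp/lustre-log.*"
      # messages in the console output above.
      lctl dk /tmp/lustre-debug.$(date +%s).log

      # Show per-namespace LDLM lock counts, to see which namespace is
      # accumulating locks while the service threads are blocked.
      lctl get_param ldlm.namespaces.*.lock_count

      # Dump stack traces for all tasks (like the hung_task output above,
      # but for every thread, blocked or not); appears in dmesg.
      echo t > /proc/sysrq-trigger
      ```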

      Attachments

        1. fir-md1-s1-kernel-20190428.log
          971 kB
          Stephane Thiell
        2. fir-md1-s2-kernel-20190428.log
          627 kB
          Stephane Thiell


            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: James Nunez (jamesanunez) (Inactive)
              Votes: 0
              Watchers: 6
