  Lustre / LU-7992

HSM: mutual deadlock when killing a copytool


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.3, Lustre 2.8.0
    • Labels: None
    • Severity: 3

    Description

      If something tries to kill a copytool while it is processing requests, the copytool deadlocks. It never exits and remains unkillable, even with -KILL. The Lustre client has to be rebooted to recover.

      The Lustre kernel thread:

      [62266.422298] LNet: Service thread pid 30493 was inactive for 62.18s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [62266.424940] Pid: 30493, comm: ldlm_cb03_001
      [62266.426214] 
      Call Trace:
      [62266.428573]  [<ffffffff81661eb9>] schedule+0x29/0x70
      [62266.429763]  [<ffffffff811f3266>] pipe_wait+0x76/0xd0
      [62266.430928]  [<ffffffff810a8040>] ? autoremove_wake_function+0x0/0x40
      [62266.432062]  [<ffffffff811f357f>] pipe_write+0x25f/0x600
      [62266.433174]  [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80
      [62266.434270]  [<ffffffff8101cd89>] ? sched_clock+0x9/0x10
      [62266.435351]  [<ffffffff81310823>] ? number.isra.2+0x323/0x360
      [62266.436418]  [<ffffffff811e9d7d>] do_sync_write+0x8d/0xd0
      [62266.437455]  [<ffffffff811ea5d0>] vfs_write+0xc0/0x1f0
      [62266.438570]  [<ffffffffa07ba119>] libcfs_kkuc_msg_put+0xa9/0x250 [obdclass]
      [62266.439598]  [<ffffffffa07ba33e>] libcfs_kkuc_group_put+0x7e/0x190 [obdclass]
      [62266.440604]  [<ffffffffa0ba69bd>] mdc_set_info_async+0x7ad/0x820 [mdc]
      [62266.441692]  [<ffffffffa09d1c81>] ldlm_callback_handler.part.24+0x1781/0x2140 [ptlrpc]
      [62266.442721]  [<ffffffffa068d937>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [62266.443712]  [<ffffffffa09d267a>] ldlm_callback_handler+0x3a/0xd0 [ptlrpc]
      [62266.444699]  [<ffffffffa09ff081>] ptlrpc_server_handle_request+0x231/0xad0 [ptlrpc]
      [62266.445672]  [<ffffffffa09fcba5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [62266.446599]  [<ffffffff810e5b7f>] ? lock_release_holdtime.part.24+0xf/0x180
      [62266.447533]  [<ffffffffa0a032c0>] ptlrpc_main+0xab0/0x1e30 [ptlrpc]
      [62266.448416]  [<ffffffff810b2ba4>] ? finish_task_switch+0x44/0x180
      [62266.449313]  [<ffffffffa0a02810>] ? ptlrpc_main+0x0/0x1e30 [ptlrpc]
      [62266.450138]  [<ffffffff810a6e6d>] kthread+0xed/0x100
      [62266.450950]  [<ffffffff810a6d80>] ? kthread+0x0/0x100
      [62266.451757]  [<ffffffff8166d4d8>] ret_from_fork+0x58/0x90
      [62266.452539]  [<ffffffff810a6d80>] ? kthread+0x0/0x100
      

      The copytool:

      [62338.455850] INFO: task tas_cmm:25225 blocked for more than 120 seconds.
      [62338.457167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [62338.458499] tas_cmm         D ffff88062d8ec650     0 25225      1 0x00000084
      [62338.459736]  ffff8800a0463a78 0000000000000046 ffff88027685ddc0 ffff8800a0463fd8
      [62338.460889]  ffff8800a0463fd8 ffff8800a0463fd8 ffff88027685ddc0 ffff88027685ddc0
      [62338.462227]  ffffffffa0806640 ffffffffa0806648 ffffffff00000000 ffffffffa0806690
      [62338.463510] Call Trace:
      [62338.464693]  [<ffffffff81661eb9>] schedule+0x29/0x70
      [62338.465447]  [<ffffffff81663555>] rwsem_down_write_failed+0x105/0x1c0
      [62338.466116]  [<ffffffff81312400>] ? pointer.isra.19+0x190/0x4c0
      [62338.466954]  [<ffffffff81314f43>] call_rwsem_down_write_failed+0x13/0x20
      [62338.468232]  [<ffffffff81661135>] ? down_write+0x85/0x9c
      [62338.469526]  [<ffffffffa07ba4c9>] ? libcfs_kkuc_group_rem+0x79/0x290 [obdclass]
      [62338.470827]  [<ffffffffa07ba4c9>] libcfs_kkuc_group_rem+0x79/0x290 [obdclass]
      [62338.471585]  [<ffffffffa094e895>] ? obd_iocontrol+0xd5/0x350 [lmv]
      [62338.472276]  [<ffffffffa0954bbe>] lmv_iocontrol+0x78e/0x2720 [lmv]
      [62338.472973]  [<ffffffff810bd815>] ? local_clock+0x25/0x30
      [62338.473657]  [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0
      [62338.474318]  [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80
      [62338.475000]  [<ffffffff8101cd89>] ? sched_clock+0x9/0x10
      [62338.475678]  [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0
      [62338.476329]  [<ffffffff811cd743>] ? kmem_cache_alloc_trace+0x1b3/0x1d0
      [62338.477018]  [<ffffffffa118b7cf>] ? copy_and_ioctl.constprop.22+0x2f/0x4f0 [lustre]
      [62338.477714]  [<ffffffffa118b98b>] copy_and_ioctl.constprop.22+0x1eb/0x4f0 [lustre]
      [62338.478393]  [<ffffffffa1191a88>] ll_dir_ioctl+0x2438/0x6420 [lustre]
      [62338.479074]  [<ffffffff81657e30>] ? __slab_free+0x10e/0x277
      [62338.479771]  [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80
      [62338.480410]  [<ffffffff8101cd89>] ? sched_clock+0x9/0x10
      [62338.481077]  [<ffffffff810bd815>] ? local_clock+0x25/0x30
      [62338.481736]  [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0
      [62338.482361]  [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80
      [62338.483012]  [<ffffffff8101cd89>] ? sched_clock+0x9/0x10
      [62338.483662]  [<ffffffff810bd815>] ? local_clock+0x25/0x30
      [62338.484280]  [<ffffffff810e5b7f>] ? lock_release_holdtime.part.24+0xf/0x180
      [62338.484939]  [<ffffffff810af97f>] ? lg_local_unlock+0x1f/0x60
      [62338.485565]  [<ffffffff8120b98e>] ? mntput_no_expire+0x3e/0x120
      [62338.486220]  [<ffffffff811fe795>] do_vfs_ioctl+0x2e5/0x4d0
      [62338.486863]  [<ffffffff811fea21>] SyS_ioctl+0xa1/0xc0
      [62338.487451]  [<ffffffff8166d589>] system_call_fastpath+0x16/0x1b
      

      Lock debug reports:

      [62362.788685] 1 lock held by ldlm_cb03_001/30493:
      [62362.789459]  #0:  (kg_sem){......}, at: [<ffffffffa07ba307>] libcfs_kkuc_group_put+0x47/0x190 [obdclass]
      [62362.800051] 1 lock held by tas_cmm/25225:
      [62362.800886]  #0:  (kg_sem){......}, at: [<ffffffffa07ba4c9>] libcfs_kkuc_group_rem+0x79/0x290 [obdclass]
      

      It appears that ldlm_cb03_001 is blocked in pipe_wait, holding kg_sem while it waits for room in the copytool's pipe. The copytool never reads the message because it is exiting: it is blocked in libcfs_kkuc_group_rem, waiting for the very lock that ldlm_cb03_001 holds, so neither thread can make progress. A minimal userspace sketch of this pattern follows.
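
      To make the cycle easier to see, here is a minimal userspace sketch of the same pattern. It is an illustration only, not the Lustre code: the pthread rwlock stands in for kg_sem, an ordinary pipe stands in for the kkuc pipe, and all names are invented; the read-side/write-side split only mirrors the rwsem_down_write_failed frame in the copytool's trace, not the libcfs source. Built with "gcc -pthread", the program hangs forever in the same way.

      /* deadlock_sketch.c: illustration only, not Lustre source. */
      #include <pthread.h>
      #include <string.h>
      #include <unistd.h>

      static pthread_rwlock_t group_lock = PTHREAD_RWLOCK_INITIALIZER; /* stands in for kg_sem */
      static int pipe_fd[2];                                           /* stands in for the kkuc pipe */

      /* Role of ldlm_cb03_001: deliver messages to the copytool's pipe
       * while holding the group lock (libcfs_kkuc_group_put in the trace). */
      static void *ldlm_cb_thread(void *arg)
      {
              char msg[4096];

              memset(msg, 'x', sizeof(msg));
              pthread_rwlock_rdlock(&group_lock);
              for (;;) {
                      /* Blocks inside write() once the pipe buffer is full,
                       * still holding group_lock. */
                      if (write(pipe_fd[1], msg, sizeof(msg)) < 0)
                              break;
              }
              pthread_rwlock_unlock(&group_lock);
              return NULL;
      }

      /* Role of the exiting copytool: stop reading the pipe and unregister
       * from the group (libcfs_kkuc_group_rem in the trace). */
      static void *copytool_thread(void *arg)
      {
              sleep(1);                           /* let the pipe fill up first */
              pthread_rwlock_wrlock(&group_lock); /* waits forever: the holder can only
                                                   * release the lock after someone drains
                                                   * the pipe, and this is the only reader */
              close(pipe_fd[0]);
              pthread_rwlock_unlock(&group_lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              if (pipe(pipe_fd))
                      return 1;
              pthread_create(&a, NULL, ldlm_cb_thread, NULL);
              pthread_create(&b, NULL, copytool_thread, NULL);
              pthread_join(a, NULL);              /* never returns */
              pthread_join(b, NULL);
              return 0;
      }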

          People

            Assignee: John Hammond (jhammond)
            Reporter: Frank Zago (fzago) (Inactive)
