LU-4840: Deadlock when truncating file during lfs migrate

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.4.2
    • 3
    • 13336

    Description

      If a process tries to truncate a file while it is being migrated with "lfs migrate", the lfs migrate process and the truncating process deadlock.

      Neither process ever finishes (unless it is killed), and the kernel logs watchdog messages saying that the processes have made no progress for the last XXX seconds.

      Here is a reproducer:

      [root@lustre24cli ~]# cat reproducer.sh
      #!/bin/sh
      
      FS=/test
      FILE=${FS}/file
      
      rm -f ${FILE}
      # Create a file on OST 1 of size 512M
      lfs setstripe -o 1 -c 1 ${FILE}
      dd if=/dev/zero of=${FILE} bs=1M count=512
      
      echo 3 > /proc/sys/vm/drop_caches
      
      # Launch a migrate to OST 0 and a bit later open it for write
      lfs migrate -i 0 --block ${FILE} &
      sleep 2
      dd if=/dev/zero of=${FILE} bs=1M count=512 
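
Because the hang never resolves on its own, it can help to put a deadline on the second step when the reproducer runs unattended. The following is a minimal sketch and not part of the original report: `run_with_deadline` is a hypothetical helper built on coreutils `timeout`, which exits with status 124 when the command is still running at the deadline. It assumes the hung process can still be terminated by a signal, as the description suggests.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original reproducer.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND under coreutils `timeout` and returns its exit status.
# Status 124 means the command was still running when the deadline expired.
run_with_deadline() {
    limit=$1
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "TIMEOUT after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

For example, the final dd of the reproducer could be run as `run_with_deadline 300 dd if=/dev/zero of=${FILE} bs=1M count=512`; the 300-second limit is an arbitrary choice.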
      

      Once the last dd tries to open the file (dd truncates its output file on open by default, hence the do_truncate frame below), both the lfs and dd processes hang forever with these stacks:

      lfs stack:

      [<ffffffff8128e864>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffa08d98dd>] ll_file_io_generic+0x29d/0x600 [lustre]
      [<ffffffffa08d9d7f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa08da61c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff811896b5>] vfs_read+0xb5/0x1a0
      [<ffffffff811897f1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      dd stack:

      [<ffffffffa03436fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa04779fa>] cl_lock_state_wait+0x1aa/0x320 [obdclass]
      [<ffffffffa04781eb>] cl_enqueue_locked+0x15b/0x1f0 [obdclass]
      [<ffffffffa0478d6e>] cl_lock_request+0x7e/0x270 [obdclass]
      [<ffffffffa047e00c>] cl_io_lock+0x3cc/0x560 [obdclass]
      [<ffffffffa047e242>] cl_io_loop+0xa2/0x1b0 [obdclass]
      [<ffffffffa092a8c8>] cl_setattr_ost+0x208/0x2c0 [lustre]
      [<ffffffffa08f8a0e>] ll_setattr_raw+0x9ce/0x1000 [lustre]
      [<ffffffffa08f909b>] ll_setattr+0x5b/0xf0 [lustre]
      [<ffffffff811a7348>] notify_change+0x168/0x340
      [<ffffffff81187074>] do_truncate+0x64/0xa0
      [<ffffffff8119bcc1>] do_filp_open+0x861/0xd20
      [<ffffffff81185d39>] do_sys_open+0x69/0x140
      [<ffffffff81185e50>] sys_open+0x20/0x30
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
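
When comparing stacks like the two above, the function names usually matter more than the addresses. As a rough sketch (the helper name `stack_symbols` is made up for illustration), the following filter reduces such a trace to one symbol per line. It assumes the frame format shown above, with an optional `? ` marking an unreliable frame, and it drops lines that carry no symbol.

```shell
#!/bin/sh
# Hypothetical helper: print one symbol name per stack frame read from stdin.
# Expected input lines look like:
#   [<ffffffffa08d98dd>] ll_file_io_generic+0x29d/0x600 [lustre]
# Lines without a "symbol+0xOFFSET" part (for example the final
# 0xffffffffffffffff marker) produce no output.
stack_symbols() {
    sed -n 's/^.*\[<[0-9a-f]*>\] \(? \)\{0,1\}\([A-Za-z_][A-Za-z0-9_.]*\)+0x.*$/\2/p'
}
```

On kernels that expose it, the input could come from the stack file of the hung process, read as root: `stack_symbols < /proc/$PID/stack`, where `$PID` stands for the process ID of lfs or dd.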
      

          Activity

            adilger Andreas Dilger added a comment -

            Frank, Henri, Jinshan,
            according to Oleg's last comments, he was still able to hit this deadlock even with the patch applied, which raises the concern of whether landing this complex patch is worth the risk at this late stage.

            Could you please confirm that the current patch has resolved the deadlock in your testing? It may be that Oleg is hitting a second issue that is not directly related.

            The second question is whether you are currently running with this patch in your other testing and can confirm that it doesn't introduce other problems.

            jay Jinshan Xiong (Inactive) added a comment -

            I couldn't reproduce the deadlock problem on the MDT. Please collect a core dump when you see the deadlock issue again.
            green Oleg Drokin added a comment -

            For the record: using Jinshan's patch did not help all that much, and I was still seeing deadlocks on the MDT.
            fzago Frank Zago (Inactive) added a comment - - edited

            Patch that adds some tests for the new API: http://review.whamcloud.com/13441/
            It has a couple of questions left (see BUG?? in source) but is otherwise complete.
            It has to be applied on top of Henri's patch.

            jay Jinshan Xiong (Inactive) added a comment -

            Please apply http://review.whamcloud.com/13344 to your tree. It worked well in my test after that patch was applied.
            jay Jinshan Xiong (Inactive) added a comment - - edited

            I'm investigating this issue.


            adilger Andreas Dilger added a comment -

            Dropping this from Blocker to Critical, since it is not a new issue for 2.7.0 (it has existed since migrate was added in 2.4.0) and only affects a subset of users of the migrate functionality, and no one else.
            green Oleg Drokin added a comment -

            Just to draw attention to my comment in Gerrit:
            the latest patch still deadlocks in racer on the MDS, and it also seems to be leaking OST locks at times.

            8.832781] LNet: Service thread pid 26108 was inactive for 62.00s. The thread mig
            8.833657] Pid: 26108, comm: mdt00_007
            8.833906] 
            8.833907] Call Trace:
            8.834350]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            8.834649]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            8.834934]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            8.835310]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            8.835629]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            8.836070]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            8.836267]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            8.836475]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            8.836735]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            8.836950]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            8.837190]  [<ffffffffa0574805>] mdt_object_local_lock+0x3c5/0xa80 [mdt]
            8.837391]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            8.837638]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            8.837852]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            8.838074]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            8.838265]  [<ffffffffa059328a>] mdt_reint_unlink+0x20a/0x10c0 [mdt]
            8.838485]  [<ffffffffa120fa80>] ? lu_ucred+0x20/0x30 [obdclass]
            8.838676]  [<ffffffffa056ad25>] ? mdt_ucred+0x15/0x20 [mdt]
            8.838898]  [<ffffffffa05858bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
            8.839232]  [<ffffffffa1434e02>] ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
            8.839566]  [<ffffffffa0589aad>] mdt_reint_rec+0x5d/0x200 [mdt]
            8.839881]  [<ffffffffa056f5ab>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
            8.840205]  [<ffffffffa056fe0b>] mdt_reint+0x6b/0x120 [mdt]
            8.840550]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            8.840915]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            8.841304]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            8.841624]  [<ffffffff81098c06>] kthread+0x96/0xa0
            8.841917]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            8.842199]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
            8.842492]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
            8.842780] 
            8.842997] LustreError: dumping log to /tmp/lustre-log.1420492523.26108
            9.015282] Pid: 9643, comm: mdt00_006
            9.015565] 
            9.015566] Call Trace:
            9.016033]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            9.017088]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            9.017352]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            9.017687]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            9.018074]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            9.018593]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            9.018880]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            9.019196]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            9.019615]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.019940]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.020220]  [<ffffffffa05745fb>] mdt_object_local_lock+0x1bb/0xa80 [mdt]
            9.020558]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.020860]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.021150]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            9.021477]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            9.022987]  [<ffffffffa05806fc>] mdt_getattr_name_lock+0x103c/0x1ab0 [mdt]
            9.023297]  [<ffffffff8128863a>] ? strlcpy+0x4a/0x60
            9.023573]  [<ffffffffa140ff84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            9.023888]  [<ffffffffa14116d0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
            9.024182]  [<ffffffffa0581692>] mdt_intent_getattr+0x292/0x470 [mdt]
            9.024493]  [<ffffffffa056e064>] mdt_intent_policy+0x494/0xce0 [mdt]
            9.024789]  [<ffffffffa13c305f>] ldlm_lock_enqueue+0x12f/0x950 [ptlrpc]
            9.025140]  [<ffffffffa10b9201>] ? cfs_hash_for_each_enter+0x1/0xa0 [libcfs]
            9.025454]  [<ffffffffa13eedeb>] ldlm_handle_enqueue0+0x51b/0x13e0 [ptlrpc]
            9.025747]  [<ffffffffa146dc72>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            9.026616]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            9.026970]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            9.027313]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            9.027622]  [<ffffffff81098c06>] kthread+0x96/0xa0
            9.028084]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            9.029247] Pid: 6818, comm: mdt01_002
            9.029453] 
            9.029453] Call Trace:
            9.029739]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            9.029980]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            9.030164]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            9.030341]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            9.030579]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            9.031058]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            9.031335]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            9.031630]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            9.032198]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.032491]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.032773]  [<ffffffffa05745fb>] mdt_object_local_lock+0x1bb/0xa80 [mdt]
            9.033094]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.033387]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.033675]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            9.033988]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            9.034264]  [<ffffffffa05806fc>] mdt_getattr_name_lock+0x103c/0x1ab0 [mdt]
            9.034544]  [<ffffffff8128863a>] ? strlcpy+0x4a/0x60
            9.034891]  [<ffffffffa140ff84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            9.035203]  [<ffffffffa14116d0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
            9.035495]  [<ffffffffa0581692>] mdt_intent_getattr+0x292/0x470 [mdt]
            9.035774]  [<ffffffffa056e064>] mdt_intent_policy+0x494/0xce0 [mdt]
            9.036192]  [<ffffffffa13c305f>] ldlm_lock_enqueue+0x12f/0x950 [ptlrpc]
            9.036479]  [<ffffffffa10b9201>] ? cfs_hash_for_each_enter+0x1/0xa0 [libcfs]
            9.036788]  [<ffffffffa13eedeb>] ldlm_handle_enqueue0+0x51b/0x13e0 [ptlrpc]
            9.037136]  [<ffffffffa146dc72>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            9.037429]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            9.037735]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            9.039938]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            9.040213]  [<ffffffff81098c06>] kthread+0x96/0xa0
            9.040456]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            9.040707]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
            9.040980]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
            9.041230] 
            9.041415] Pid: 6815, comm: mdt00_002
            9.041637] 
            9.041638] Call Trace:
            9.042070]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            9.042351]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            9.042615]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            9.042890]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            9.043183]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            9.043656]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            9.044075]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            9.044366]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            9.044677]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.045145]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.045420]  [<ffffffffa05745fb>] mdt_object_local_lock+0x1bb/0xa80 [mdt]
            9.045705]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.046019]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.046309]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            9.046596]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            9.046873]  [<ffffffffa05806fc>] mdt_getattr_name_lock+0x103c/0x1ab0 [mdt]
            9.047152]  [<ffffffff8128863a>] ? strlcpy+0x4a/0x60
            9.047427]  [<ffffffffa140ff84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            9.047736]  [<ffffffffa14116d0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
            9.048031]  [<ffffffffa0581692>] mdt_intent_getattr+0x292/0x470 [mdt]
            9.048308]  [<ffffffffa056e064>] mdt_intent_policy+0x494/0xce0 [mdt]
            9.048604]  [<ffffffffa13c305f>] ldlm_lock_enqueue+0x12f/0x950 [ptlrpc]
            9.048918]  [<ffffffffa10b9201>] ? cfs_hash_for_each_enter+0x1/0xa0 [libcfs]
            9.049232]  [<ffffffffa13eedeb>] ldlm_handle_enqueue0+0x51b/0x13e0 [ptlrpc]
            9.049544]  [<ffffffffa146dc72>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            9.049852]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            9.050160]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            9.050453]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            9.050720]  [<ffffffff81098c06>] kthread+0x96/0xa0
            9.050970]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            9.051895] Pid: 6817, comm: mdt01_001
            9.052114] 
            9.052114] Call Trace:
            9.052817]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            9.053127]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            9.053391]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            9.053653]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            9.053952]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            9.054484]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            9.054768]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            9.055087]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            9.055394]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.055697]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.056558]  [<ffffffffa05745fb>] mdt_object_local_lock+0x1bb/0xa80 [mdt]
            9.056893]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.057247]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.057488]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            9.057687]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            9.057949]  [<ffffffffa05806fc>] mdt_getattr_name_lock+0x103c/0x1ab0 [mdt]
            9.058223]  [<ffffffff8128863a>] ? strlcpy+0x4a/0x60
            9.058530]  [<ffffffffa140ff84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            9.058752]  [<ffffffffa14116d0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
            9.058958]  [<ffffffffa0581692>] mdt_intent_getattr+0x292/0x470 [mdt]
            9.059174]  [<ffffffffa056e064>] mdt_intent_policy+0x494/0xce0 [mdt]
            9.059398]  [<ffffffffa13c305f>] ldlm_lock_enqueue+0x12f/0x950 [ptlrpc]
            9.059629]  [<ffffffffa10b9201>] ? cfs_hash_for_each_enter+0x1/0xa0 [libcfs]
            9.059858]  [<ffffffffa13eedeb>] ldlm_handle_enqueue0+0x51b/0x13e0 [ptlrpc]
            9.060079]  [<ffffffffa146dc72>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            9.060286]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            9.060501]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            9.060710]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            9.060914]  [<ffffffff81098c06>] kthread+0x96/0xa0
            9.061085]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            9.061275]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
            9.061443]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
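
The traces above are nearly identical: every thread is waiting in ldlm_completion_ast beneath mdt_object_lock, and they differ mainly in which function called mdt_object_lock. A small filter can make that difference visible. This is only a sketch under the assumption that the log uses the "Pid: N, comm: NAME" headers and frame format shown above; `callers_of` is a hypothetical name, not a Lustre tool.

```shell
#!/bin/sh
# Hypothetical helper: for each thread in a watchdog dump read from stdin,
# print "PID COMM CALLER", where CALLER is the first reliable frame
# (one without a leading "?") that follows the frame named FUNC.
# Usage: callers_of FUNC < console.log
callers_of() {
    awk -v func_name="$1" '
        /Pid: [0-9]+, comm: / {
            # Thread header, e.g. "Pid: 26108, comm: mdt00_007".
            sub(/^.*Pid: /, "")
            split($0, f, ", comm: ")
            pid = f[1]; comm = f[2]; want = 0
            next
        }
        /\[<[0-9a-f]+>\] / {
            line = $0
            sub(/^.*\[<[0-9a-f]+>\] /, "", line)   # drop timestamp and address
            if (line ~ /^\? /) next                 # skip unreliable frames
            sub(/\+0x.*$/, "", line)                # keep the bare symbol name
            if (want) { print pid, comm, line; want = 0 }
            else if (line == func_name) want = 1
        }
    '
}
```

Run as `callers_of mdt_object_lock` over the dump above, this should report mdt_reint_unlink for pid 26108 and mdt_getattr_name_lock for the other four threads, which is the pattern to look for when checking whether a later patch still shows the same MDT hang.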
            

            paf Patrick Farrell (Inactive) added a comment -

            Thanks for the response, Henri. I'm glad to hear the group lock option was retained, and I see the deadlock with truncate was resolved as well.

            hdoreau Henri Doreau (Inactive) added a comment -

            Yes, it is still possible. Although an early version of the patch removed grouplock-protected migration, it has now been re-introduced. Migration can be either grouplock-protected and blocking (as before), or based on exclusive open and non-blocking (it safely aborts if a concurrent process opens the file). We would need file leases to provide a notion of "group" to be able to implement non-blocking parallel migration too.
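
To illustrate the non-blocking mode described in this comment: since such a migration aborts when another process opens the file, a caller would typically retry it. The sketch below is an illustration only. It assumes that the patched `lfs migrate` accepts a `--non-block` option and returns a non-zero status when it aborts; the helper name `migrate_nonblock_retry`, the `LFS` variable, and the retry policy are all hypothetical.

```shell
#!/bin/sh
# LFS can be overridden (for example with a stub during testing);
# it defaults to the lfs binary found in PATH.
LFS=${LFS:-lfs}

# Hypothetical helper: try a non-blocking migration up to ATTEMPTS times.
# Usage: migrate_nonblock_retry TARGET_OST_INDEX FILE [ATTEMPTS]
# Returns 0 as soon as one attempt succeeds, 1 if all attempts fail.
migrate_nonblock_retry() {
    ost=$1
    file=$2
    attempts=${3:-3}
    n=1
    while [ "$n" -le "$attempts" ]; do
        # Assumption: --non-block makes lfs migrate abort with a non-zero
        # status if a concurrent process opens the file.
        if "$LFS" migrate -i "$ost" --non-block "$file"; then
            return 0
        fi
        echo "attempt $n aborted, retrying" >&2
        n=$((n + 1))
        sleep 1
    done
    return 1
}
```

With the file from the reproducer this would be called as `migrate_nonblock_retry 0 /test/file`. A blocking migration with `--block`, as in the reproducer, needs no retry loop because it holds off concurrent access instead of aborting.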

            paf Patrick Farrell (Inactive) added a comment - - edited

            One advantage of the old approach of using group locks for migration was that it was theoretically possible to create a version of lfs migrate that could migrate a file in parallel using multiple clients. Is this still possible with the new approach?

            People

              bobijam Zhenyu Xu
              patrick.valentin Patrick Valentin (Inactive)
              Votes: 0
              Watchers: 20