
Deadlock when truncating file during lfs migrate

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.4.2
    • Severity: 3
    • Rank: 13336

    Description

      If a process truncates a file while "lfs migrate" is migrating it, both the lfs migrate process and the truncating process deadlock.

      Neither process ever finishes unless it is killed, and the kernel logs watchdog messages reporting that the processes have made no progress for the last XXX seconds.

      Here is a reproducer:

      [root@lustre24cli ~]# cat reproducer.sh
      #!/bin/sh
      
      FS=/test
      FILE=${FS}/file
      
      rm -f ${FILE}
      # Create a file on OST 1 of size 512M
      lfs setstripe -o 1 -c 1 ${FILE}
      dd if=/dev/zero of=${FILE} bs=1M count=512
      
      echo 3 > /proc/sys/vm/drop_caches
      
      # Start a blocking migration to OST 0 and, a bit later, open the file for write
      # (dd opens its output with O_TRUNC, which is what triggers the truncate)
      lfs migrate -i 0 --block ${FILE} &
      sleep 2
      dd if=/dev/zero of=${FILE} bs=1M count=512 
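
Because the reproducer never returns once the deadlock triggers, it can help to bound the second dd with a deadline when running it unattended. The following is only a sketch: `run_with_deadline` is a hypothetical helper name, it assumes GNU coreutils `timeout` is installed, and a thread blocked uninterruptibly in the kernel may not react to the signal that `timeout` sends.

```shell
#!/bin/sh
# Hypothetical helper: run a command, but give up after a deadline in seconds.
# Assumes GNU coreutils `timeout`, which exits with status 124 on expiry.
run_with_deadline() {
    deadline=$1
    shift
    timeout "$deadline" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # The command was still running when the deadline expired.
        echo "no progress after ${deadline}s: $*" >&2
    fi
    return "$status"
}
```

In the reproducer, the last line could then become something like: run_with_deadline 120 dd if=/dev/zero of=${FILE} bs=1M count=512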
      

      Once the last dd tries to open the file, both the lfs and dd processes hang indefinitely with the following stacks:

      lfs stack:

      [<ffffffff8128e864>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffa08d98dd>] ll_file_io_generic+0x29d/0x600 [lustre]
      [<ffffffffa08d9d7f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa08da61c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff811896b5>] vfs_read+0xb5/0x1a0
      [<ffffffff811897f1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      dd stack:

      [<ffffffffa03436fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa04779fa>] cl_lock_state_wait+0x1aa/0x320 [obdclass]
      [<ffffffffa04781eb>] cl_enqueue_locked+0x15b/0x1f0 [obdclass]
      [<ffffffffa0478d6e>] cl_lock_request+0x7e/0x270 [obdclass]
      [<ffffffffa047e00c>] cl_io_lock+0x3cc/0x560 [obdclass]
      [<ffffffffa047e242>] cl_io_loop+0xa2/0x1b0 [obdclass]
      [<ffffffffa092a8c8>] cl_setattr_ost+0x208/0x2c0 [lustre]
      [<ffffffffa08f8a0e>] ll_setattr_raw+0x9ce/0x1000 [lustre]
      [<ffffffffa08f909b>] ll_setattr+0x5b/0xf0 [lustre]
      [<ffffffff811a7348>] notify_change+0x168/0x340
      [<ffffffff81187074>] do_truncate+0x64/0xa0
      [<ffffffff8119bcc1>] do_filp_open+0x861/0xd20
      [<ffffffff81185d39>] do_sys_open+0x69/0x140
      [<ffffffff81185e50>] sys_open+0x20/0x30
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
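
The two stacks above have the same layout as the contents of /proc/<pid>/stack on Linux. As a sketch of how such traces can be collected from a hung process (the helper name `dump_kernel_stack` is hypothetical; it assumes the kernel exposes /proc/<pid>/comm and /proc/<pid>/stack, and reading the latter normally requires root):

```shell
#!/bin/sh
# Hypothetical helper: print the kernel stack of one process.
# Assumes a Linux /proc with per-process "comm" and "stack" files.
dump_kernel_stack() {
    pid=$1
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    echo "pid $pid ($(cat "/proc/$pid/comm")) stack:"
    # The stack file is usually readable by root only.
    cat "/proc/$pid/stack" 2>/dev/null || echo "  (stack not readable; try as root)"
}
```

For example, as root: for p in $(pidof lfs dd); do dump_kernel_stack "$p"; done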
      

          Activity

            [LU-4840] Deadlock when truncating file during lfs migrate

            jay Jinshan Xiong (Inactive) added a comment -

            Please apply http://review.whamcloud.com/13344 to your tree. In my test, it worked well once that patch was applied.
            jay Jinshan Xiong (Inactive) added a comment - edited

            I'm investigating this issue.

            adilger Andreas Dilger added a comment -

            Dropping this from Blocker to Critical: it is not a new issue in 2.7.0 (it has existed since migrate was added in 2.4.0), and it affects only a subset of users of the migrate functionality.
            green Oleg Drokin added a comment -

            Just to draw attention to my comment in Gerrit:
            the latest patch still deadlocks in racer on the MDS, and it also seems to be leaking OST locks at times.

            8.832781] LNet: Service thread pid 26108 was inactive for 62.00s. The thread mig
            8.833657] Pid: 26108, comm: mdt00_007
            8.833906] 
            8.833907] Call Trace:
            8.834350]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            8.834649]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            8.834934]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            8.835310]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            8.835629]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            8.836070]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            8.836267]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            8.836475]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            8.836735]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            8.836950]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            8.837190]  [<ffffffffa0574805>] mdt_object_local_lock+0x3c5/0xa80 [mdt]
            8.837391]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            8.837638]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            8.837852]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            8.838074]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            8.838265]  [<ffffffffa059328a>] mdt_reint_unlink+0x20a/0x10c0 [mdt]
            8.838485]  [<ffffffffa120fa80>] ? lu_ucred+0x20/0x30 [obdclass]
            8.838676]  [<ffffffffa056ad25>] ? mdt_ucred+0x15/0x20 [mdt]
            8.838898]  [<ffffffffa05858bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
            8.839232]  [<ffffffffa1434e02>] ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
            8.839566]  [<ffffffffa0589aad>] mdt_reint_rec+0x5d/0x200 [mdt]
            8.839881]  [<ffffffffa056f5ab>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
            8.840205]  [<ffffffffa056fe0b>] mdt_reint+0x6b/0x120 [mdt]
            8.840550]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            8.840915]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            8.841304]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            8.841624]  [<ffffffff81098c06>] kthread+0x96/0xa0
            8.841917]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            8.842199]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
            8.842492]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
            8.842780] 
            8.842997] LustreError: dumping log to /tmp/lustre-log.1420492523.26108
            9.015282] Pid: 9643, comm: mdt00_006
            9.015565] 
            9.015566] Call Trace:
            9.016033]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            9.017088]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            9.017352]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            9.017687]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            9.018074]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            9.018593]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            9.018880]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            9.019196]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            9.019615]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.019940]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.020220]  [<ffffffffa05745fb>] mdt_object_local_lock+0x1bb/0xa80 [mdt]
            9.020558]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.020860]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.021150]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            9.021477]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            9.022987]  [<ffffffffa05806fc>] mdt_getattr_name_lock+0x103c/0x1ab0 [mdt]
            9.023297]  [<ffffffff8128863a>] ? strlcpy+0x4a/0x60
            9.023573]  [<ffffffffa140ff84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            9.023888]  [<ffffffffa14116d0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
            9.024182]  [<ffffffffa0581692>] mdt_intent_getattr+0x292/0x470 [mdt]
            9.024493]  [<ffffffffa056e064>] mdt_intent_policy+0x494/0xce0 [mdt]
            9.024789]  [<ffffffffa13c305f>] ldlm_lock_enqueue+0x12f/0x950 [ptlrpc]
            9.025140]  [<ffffffffa10b9201>] ? cfs_hash_for_each_enter+0x1/0xa0 [libcfs]
            9.025454]  [<ffffffffa13eedeb>] ldlm_handle_enqueue0+0x51b/0x13e0 [ptlrpc]
            9.025747]  [<ffffffffa146dc72>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            9.026616]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            9.026970]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            9.027313]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            9.027622]  [<ffffffff81098c06>] kthread+0x96/0xa0
            9.028084]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            9.029247] Pid: 6818, comm: mdt01_002
            9.029453] 
            9.029453] Call Trace:
            9.029739]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            9.029980]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            9.030164]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            9.030341]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            9.030579]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            9.031058]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            9.031335]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            9.031630]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            9.032198]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.032491]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.032773]  [<ffffffffa05745fb>] mdt_object_local_lock+0x1bb/0xa80 [mdt]
            9.033094]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.033387]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.033675]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            9.033988]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            9.034264]  [<ffffffffa05806fc>] mdt_getattr_name_lock+0x103c/0x1ab0 [mdt]
            9.034544]  [<ffffffff8128863a>] ? strlcpy+0x4a/0x60
            9.034891]  [<ffffffffa140ff84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            9.035203]  [<ffffffffa14116d0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
            9.035495]  [<ffffffffa0581692>] mdt_intent_getattr+0x292/0x470 [mdt]
            9.035774]  [<ffffffffa056e064>] mdt_intent_policy+0x494/0xce0 [mdt]
            9.036192]  [<ffffffffa13c305f>] ldlm_lock_enqueue+0x12f/0x950 [ptlrpc]
            9.036479]  [<ffffffffa10b9201>] ? cfs_hash_for_each_enter+0x1/0xa0 [libcfs]
            9.036788]  [<ffffffffa13eedeb>] ldlm_handle_enqueue0+0x51b/0x13e0 [ptlrpc]
            9.037136]  [<ffffffffa146dc72>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            9.037429]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            9.037735]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            9.039938]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            9.040213]  [<ffffffff81098c06>] kthread+0x96/0xa0
            9.040456]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            9.040707]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
            9.040980]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
            9.041230] 
            9.041415] Pid: 6815, comm: mdt00_002
            9.041637] 
            9.041638] Call Trace:
            9.042070]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            9.042351]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            9.042615]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            9.042890]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            9.043183]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            9.043656]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            9.044075]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            9.044366]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            9.044677]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.045145]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.045420]  [<ffffffffa05745fb>] mdt_object_local_lock+0x1bb/0xa80 [mdt]
            9.045705]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.046019]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.046309]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            9.046596]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            9.046873]  [<ffffffffa05806fc>] mdt_getattr_name_lock+0x103c/0x1ab0 [mdt]
            9.047152]  [<ffffffff8128863a>] ? strlcpy+0x4a/0x60
            9.047427]  [<ffffffffa140ff84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            9.047736]  [<ffffffffa14116d0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
            9.048031]  [<ffffffffa0581692>] mdt_intent_getattr+0x292/0x470 [mdt]
            9.048308]  [<ffffffffa056e064>] mdt_intent_policy+0x494/0xce0 [mdt]
            9.048604]  [<ffffffffa13c305f>] ldlm_lock_enqueue+0x12f/0x950 [ptlrpc]
            9.048918]  [<ffffffffa10b9201>] ? cfs_hash_for_each_enter+0x1/0xa0 [libcfs]
            9.049232]  [<ffffffffa13eedeb>] ldlm_handle_enqueue0+0x51b/0x13e0 [ptlrpc]
            9.049544]  [<ffffffffa146dc72>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            9.049852]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            9.050160]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            9.050453]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            9.050720]  [<ffffffff81098c06>] kthread+0x96/0xa0
            9.050970]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            9.051895] Pid: 6817, comm: mdt01_001
            9.052114] 
            9.052114] Call Trace:
            9.052817]  [<ffffffffa13be503>] ? _ldlm_lock_debug+0x2e3/0x670 [ptlrpc]
            9.053127]  [<ffffffff81516894>] ? _spin_lock_irqsave+0x24/0x30
            9.053391]  [<ffffffff81514231>] schedule_timeout+0x191/0x2e0
            9.053653]  [<ffffffff81081e50>] ? process_timeout+0x0/0x10
            9.053952]  [<ffffffffa13decf0>] ? ldlm_expired_completion_wait+0x0/0x370 [ptlrpc
            9.054484]  [<ffffffffa13e3841>] ldlm_completion_ast+0x5e1/0x9b0 [ptlrpc]
            9.054768]  [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
            9.055087]  [<ffffffffa13e2c8e>] ldlm_cli_enqueue_local+0x21e/0x7f0 [ptlrpc]
            9.055394]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.055697]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.056558]  [<ffffffffa05745fb>] mdt_object_local_lock+0x1bb/0xa80 [mdt]
            9.056893]  [<ffffffffa056bbc0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            9.057247]  [<ffffffffa13e3260>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
            9.057488]  [<ffffffffa0575245>] mdt_object_lock_internal+0x65/0x360 [mdt]
            9.057687]  [<ffffffffa0575604>] mdt_object_lock+0x14/0x20 [mdt]
            9.057949]  [<ffffffffa05806fc>] mdt_getattr_name_lock+0x103c/0x1ab0 [mdt]
            9.058223]  [<ffffffff8128863a>] ? strlcpy+0x4a/0x60
            9.058530]  [<ffffffffa140ff84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            9.058752]  [<ffffffffa14116d0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
            9.058958]  [<ffffffffa0581692>] mdt_intent_getattr+0x292/0x470 [mdt]
            9.059174]  [<ffffffffa056e064>] mdt_intent_policy+0x494/0xce0 [mdt]
            9.059398]  [<ffffffffa13c305f>] ldlm_lock_enqueue+0x12f/0x950 [ptlrpc]
            9.059629]  [<ffffffffa10b9201>] ? cfs_hash_for_each_enter+0x1/0xa0 [libcfs]
            9.059858]  [<ffffffffa13eedeb>] ldlm_handle_enqueue0+0x51b/0x13e0 [ptlrpc]
            9.060079]  [<ffffffffa146dc72>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            9.060286]  [<ffffffffa146e85e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            9.060501]  [<ffffffffa141fd64>] ptlrpc_main+0xdf4/0x1940 [ptlrpc]
            9.060710]  [<ffffffffa141ef70>] ? ptlrpc_main+0x0/0x1940 [ptlrpc]
            9.060914]  [<ffffffff81098c06>] kthread+0x96/0xa0
            9.061085]  [<ffffffff8100c24a>] child_rip+0xa/0x20
            9.061275]  [<ffffffff81098b70>] ? kthread+0x0/0xa0
            9.061443]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
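
The traces above are easier to compare once reduced to function names. In kernel stack dumps, frames marked with "?" are entries the unwinder could not verify, so the sketch below keeps only the unmarked frames and groups them by thread. The helper name `reliable_frames` is hypothetical, and the filter assumes the "Pid: N, comm: NAME" and "[<address>] function+0xoff/0xsize" line shapes seen in this log.

```shell
#!/bin/sh
# Hypothetical helper: reduce a console log (on stdin) to one header line
# per thread followed by the names of its reliable (non-"?") frames.
reliable_frames() {
    awk '
        /Pid: [0-9]+, comm: / {
            sub(/.*comm: /, "")             # keep only the thread name
            print "thread " $0
            next
        }
        /\[<[0-9a-f]+>\] [^?]/ {
            sub(/.*\[<[0-9a-f]+>\] /, "")   # drop timestamp and address
            sub(/\+0x.*/, "")               # drop offset, size and module
            print "  " $0
        }
    '
}
```

Run as, for example, reliable_frames < console.log (the file name is a placeholder). In the log above, this makes it easier to see that mdt00_007 is waiting under mdt_reint_unlink while the other four threads are waiting under mdt_getattr_name_lock.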
            

            paf Patrick Farrell (Inactive) added a comment -

            Thanks for the response, Henri. I'm glad to hear the group lock option was retained, and I see the deadlock with truncate was resolved as well.

            hdoreau Henri Doreau (Inactive) added a comment -

            Yes, it is still possible. Although an early version of the patch removed grouplock-protected migration, it has since been reintroduced. Migration can now be either grouplock-protected and blocking (as before), or based on exclusive open and non-blocking (it safely aborts if a concurrent process opens the file). To implement non-blocking parallel migration as well, we would need file leases to provide a notion of "group".
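
Since the non-blocking mode aborts when another process opens the file, a caller may want to retry it a few times. The following is only a sketch of such a wrapper: `retry` is a hypothetical helper, and it assumes that an aborted migration makes lfs exit with a non-zero status, which the comments here do not confirm.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds, at most $1 times.
# Assumption: the command signals an aborted attempt with a non-zero status.
retry() {
    attempts=$1
    shift
    n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep 1   # leave the competing opener some time to finish
    done
}
```

A possible invocation is: retry 5 lfs migrate -i 0 --non-block /test/file. The `--non-block` spelling is an assumption here (only `--block` appears in the reproducer), so check `lfs help migrate` on the installed version first.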

            paf Patrick Farrell (Inactive) added a comment - edited

            One advantage of the old approach of using group locks for migration was that it was theoretically possible to create a version of lfs migrate that migrates a file in parallel using multiple clients. Is this still possible with the new approach?

            adilger Andreas Dilger added a comment -

            Henri, I agree with Frank that we should not be landing a patch with significant known defects, since this would break the code for anyone testing it. Please merge the patches.

            hdoreau Henri Doreau (Inactive) added a comment -

            Follow-up patch, which fixes numerous issues with the first one: http://review.whamcloud.com/#/c/12616/
            Both patches can be merged if need be; just let me know which is preferred.

            fzago Frank Zago (Inactive) added a comment -

            Thanks. The fix works, and I can migrate a file between OSTs now.

            Regarding the junk output, I found the bug in llapi_file_open_param(). I'll submit a patch soon.

            hdoreau Henri Doreau (Inactive) added a comment -

            Thanks, Frank. The cause was a null pointer (sbi) dereference in ll_mdscapa_get(), fixed in patchset #13. The file content ending up in the console remains unexplained to me so far. You said it was present before; is there an open ticket for it?

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: patrick.valentin Patrick Valentin (Inactive)
              Votes: 0
              Watchers: 20
