Details

    • 3
    • 4759

    Description

      When I rebooted two OSS to apply a patch for bug LU-874 on the servers, quite a few of the clients appear to have deadlocked in recovery. Here's a backtrace of ptlrpcd-rcv on one client:

      crash> bt 5077
      PID: 5077   TASK: ffff88082da834c0  CPU: 8   COMMAND: "ptlrpcd-rcv"
       #0 [ffff88082da85430] schedule at ffffffff814ee3b2
       #1 [ffff88082da854f8] io_schedule at ffffffff814eeba3
       #2 [ffff88082da85518] sync_page at ffffffff81110fbd
       #3 [ffff88082da85528] __wait_on_bit_lock at ffffffff814ef40a
       #4 [ffff88082da85578] __lock_page at ffffffff81110f57
       #5 [ffff88082da855d8] vvp_page_own at ffffffffa093bf6a [lustre]
       #6 [ffff88082da855f8] cl_page_own0 at ffffffffa0601d3b [obdclass]
       #7 [ffff88082da85678] cl_page_own at ffffffffa0601fa0 [obdclass]
       #8 [ffff88082da85688] cl_page_gang_lookup at ffffffffa0603bb7 [obdclass]
       #9 [ffff88082da85758] cl_lock_page_out at ffffffffa06096fc [obdclass]
      #10 [ffff88082da85808] osc_lock_flush at ffffffffa0858e8f [osc]
      #11 [ffff88082da85858] osc_lock_cancel at ffffffffa0858f2a [osc]
      #12 [ffff88082da858d8] cl_lock_cancel0 at ffffffffa0604665 [obdclass]
      #13 [ffff88082da85928] cl_lock_cancel at ffffffffa06051ab [obdclass]
      #14 [ffff88082da85968] osc_ldlm_blocking_ast at ffffffffa0859cf8 [osc]
      #15 [ffff88082da859f8] ldlm_cancel_callback at ffffffffa06a1ba3 [ptlrpc]
      #16 [ffff88082da85a18] ldlm_lock_cancel at ffffffffa06a1c89 [ptlrpc]
      #17 [ffff88082da85a58] ldlm_cli_cancel_list_local at ffffffffa06bede8 [ptlrpc]
      #18 [ffff88082da85ae8] ldlm_cancel_lru_local at ffffffffa06bf255 [ptlrpc]
      #19 [ffff88082da85b08] ldlm_replay_locks at ffffffffa06bf385 [ptlrpc]
      #20 [ffff88082da85bb8] ptlrpc_import_recovery_state_machine at ffffffffa070ceea [ptlrpc]
      #21 [ffff88082da85c38] ptlrpc_connect_interpret at ffffffffa070db38 [ptlrpc]
      #22 [ffff88082da85d08] ptlrpc_check_set at ffffffffa06dd870 [ptlrpc]
      #23 [ffff88082da85de8] ptlrpcd_check at ffffffffa07113b8 [ptlrpc]
      #24 [ffff88082da85e48] ptlrpcd at ffffffffa071175b [ptlrpc]
      #25 [ffff88082da85f48] kernel_thread at ffffffff8100c14a
      

      I will need to do more investigation, but that's a start.

          Activity

            [LU-948] Client recovery hang

            Old ticket for unsupported version

            simmonsja James A Simmons added the comment above.

            Sorry for the delayed response. This must be a race between readahead pages and the DLM lock; I'm looking into it.

            jay Jinshan Xiong (Inactive) added the comment above.

            Attached seqlac2_console.txt showing backtraces of all processes on a PPC64 client.

            morrone Christopher Morrone (Inactive) added the comment above.

            Jinshan, here is a console log showing all backtraces on the system. I'm not entirely certain if this will be useful, because this node was a PPC64 box running the older 2.1.0-13chaos code. 2.1.0-13chaos did NOT have the http://review.whamcloud.com/1955 fix.

            But the backtrace is the same for ptlrpcd-rcv, and you can see some other processes with backtraces under lustre ("cp" looks like an interesting one). I will attach seqlac2_console.txt.

            morrone Christopher Morrone (Inactive) added the comment above.

            Hi Ned, can you please show me the backtrace of other processes when this issue happened?

            jay Jinshan Xiong (Inactive) added the comment above.

            Thanks, that patch cleared up the vvp_page_unmap() assertion. We're now testing recovery with this tag https://github.com/chaos/lustre/commits/2.1.2-3chaos plus the following patches.

            • c882f3a LU-1059 clio: to not try to discard freeing pages
            • 9286c90 LU-948 clio: add a callback to cl_page_gang_lookup()
            • 82a8948 LU-1716 ptlrpc: Race in updating of connection flags on client.
            • 8812d32 LU-1299 clio: set the lock to CLS_NEW if unuse returns error

            Unfortunately, I just found another client deadlocked in recovery with a stack trace similar to the one in the description. So I'm afraid http://review.whamcloud.com/#change,1955 does not fix the problem.

            crash> bt 6257
            PID: 6257   TASK: ffff8808317ccaa0  CPU: 4   COMMAND: "ptlrpcd-rcv"
             #0 [ffff88083084b680] schedule at ffffffff814ef052
             #1 [ffff88083084b748] io_schedule at ffffffff814ef833
             #2 [ffff88083084b768] sync_page at ffffffff81110d5d
             #3 [ffff88083084b778] __wait_on_bit_lock at ffffffff814f009a
             #4 [ffff88083084b7c8] __lock_page at ffffffff81110cf7
             #5 [ffff88083084b828] vvp_page_own at ffffffffa09ce18a [lustre]
             #6 [ffff88083084b858] cl_page_own0 at ffffffffa05ace9b [obdclass]
             #7 [ffff88083084b8a8] cl_page_own at ffffffffa05ad070 [obdclass]
             #8 [ffff88083084b8b8] check_and_discard_cb at ffffffffa05b26af [obdclass]
             #9 [ffff88083084b908] cl_page_gang_lookup at ffffffffa05ae763 [obdclass]
            #10 [ffff88083084b9b8] cl_lock_page_out at ffffffffa05afe3b [obdclass]
            #11 [ffff88083084ba28] osc_lock_flush at ffffffffa08b697f [osc]
            #12 [ffff88083084ba78] osc_lock_cancel at ffffffffa08b6a19 [osc]
            #13 [ffff88083084bac8] cl_lock_cancel0 at ffffffffa05af085 [obdclass]
            #14 [ffff88083084baf8] cl_lock_cancel at ffffffffa05af8b3 [obdclass]
            #15 [ffff88083084bb18] osc_ldlm_blocking_ast at ffffffffa08b7673 [osc]
            #16 [ffff88083084bb88] ldlm_cancel_callback at ffffffffa06a0083 [ptlrpc]
            #17 [ffff88083084bba8] ldlm_lock_cancel at ffffffffa06a0165 [ptlrpc]
            #18 [ffff88083084bbc8] ldlm_cli_cancel_list_local at ffffffffa06b7698 [ptlrpc]
            #19 [ffff88083084bc28] ldlm_cancel_lru_local at ffffffffa06b7a95 [ptlrpc]
            #20 [ffff88083084bc48] ldlm_replay_locks at ffffffffa06b7b9f [ptlrpc]
            #21 [ffff88083084bcc8] ptlrpc_import_recovery_state_machine at ffffffffa06f5c26 [ptlrpc]
            #22 [ffff88083084bd08] ptlrpc_connect_interpret at ffffffffa06f65bd [ptlrpc]
            #23 [ffff88083084bda8] ptlrpc_check_set at ffffffffa06d0955 [ptlrpc]
            #24 [ffff88083084be48] ptlrpcd_check at ffffffffa06f8ee0 [ptlrpc]
            #25 [ffff88083084be78] ptlrpcd at ffffffffa06f91fb [ptlrpc]
            #26 [ffff88083084bf48] kernel_thread at ffffffff8100c14a
            
            nedbass Ned Bass (Inactive) added the comment above.
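For anyone comparing the two ptlrpcd-rcv traces in this ticket by hand, the comparison can be scripted. The following is only a minimal sketch: the helper names `parse_frames` and `extra_frames` are hypothetical, and the regular expression assumes the frame layout of the crash `bt` output pasted above, nothing more. It reports functions that appear in one trace but not in the other; for the excerpts below that is the `check_and_discard_cb` frame.

```python
import re

# One frame of crash `bt` output as it appears in this ticket, e.g.
#   " #5 [ffff88082da855d8] vvp_page_own at ffffffffa093bf6a [lustre]"
# The trailing "[module]" is absent for core kernel functions.
FRAME_RE = re.compile(
    r"#(?P<idx>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)\s+at\s+"
    r"(?P<addr>[0-9a-f]+)(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_frames(bt_text):
    """Return (function, module) tuples in the order the frames appear."""
    return [
        (m.group("func"), m.group("mod") or "kernel")
        for m in FRAME_RE.finditer(bt_text)
    ]


def extra_frames(old, new):
    """Functions present in `new` but absent from `old`, in stack order."""
    old_funcs = {func for func, _ in old}
    return [func for func, _ in new if func not in old_funcs]


# Excerpts copied from the two traces in this ticket.
OLD_BT = """
 #7 [ffff88082da85678] cl_page_own at ffffffffa0601fa0 [obdclass]
 #8 [ffff88082da85688] cl_page_gang_lookup at ffffffffa0603bb7 [obdclass]
 #9 [ffff88082da85758] cl_lock_page_out at ffffffffa06096fc [obdclass]
"""
NEW_BT = """
 #7 [ffff88083084b8a8] cl_page_own at ffffffffa05ad070 [obdclass]
 #8 [ffff88083084b8b8] check_and_discard_cb at ffffffffa05b26af [obdclass]
 #9 [ffff88083084b908] cl_page_gang_lookup at ffffffffa05ae763 [obdclass]
#10 [ffff88083084b9b8] cl_lock_page_out at ffffffffa05afe3b [obdclass]
"""

if __name__ == "__main__":
    print(extra_frames(parse_frames(OLD_BT), parse_frames(NEW_BT)))
```

Because the pattern is applied with `finditer`, it should also cope with a trace that has been flattened onto a single line, as happens when a comment is copied out of the tracker.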

            Please apply the patch from LU-1059.

            jay Jinshan Xiong (Inactive) added the comment above.

            We hit this assertion during testing after cherry-picking http://review.whamcloud.com/#change,1955 into our 2.1.2 branch.

            LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) ASSERTION(PageLocked(vmpage)) failed
            LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) LBUG
            
            PID: 3846   TASK: ffff88054af26aa0  CPU: 3   COMMAND: "ldlm_bl_10"
             #0 [ffff880567895948] machine_kexec at ffffffff8103216b
             #1 [ffff8805678959a8] crash_kexec at ffffffff810b8d12
             #2 [ffff880567895a78] panic at ffffffff814ee999
             #3 [ffff880567895af8] lbug_with_loc at ffffffffa0515e1b [libcfs]
             #4 [ffff880567895b18] libcfs_assertion_failed at ffffffffa051f42d [libcfs]
             #5 [ffff880567895b38] vvp_page_unmap at ffffffffa0aebc8c [lustre]
             #6 [ffff880567895b68] cl_page_invoke at ffffffffa06982f8 [obdclass]
             #7 [ffff880567895ba8] cl_page_unmap at ffffffffa0698383 [obdclass]
             #8 [ffff880567895bb8] check_and_discard_cb at ffffffffa069f6be [obdclass]
             #9 [ffff880567895c08] cl_page_gang_lookup at ffffffffa069b763 [obdclass]
            #10 [ffff880567895cb8] cl_lock_page_out at ffffffffa069ce3b [obdclass]
            #11 [ffff880567895d28] osc_lock_flush at ffffffffa09c197f [osc]
            #12 [ffff880567895d78] osc_lock_cancel at ffffffffa09c1a19 [osc]
            #13 [ffff880567895dc8] cl_lock_cancel0 at ffffffffa069c085 [obdclass]
            #14 [ffff880567895df8] cl_lock_cancel at ffffffffa069c8b3 [obdclass]
            #15 [ffff880567895e18] osc_ldlm_blocking_ast at ffffffffa09c2673 [osc]
            #16 [ffff880567895e88] ldlm_handle_bl_callback at ffffffffa07a7db4 [ptlrpc]
            #17 [ffff880567895eb8] ldlm_bl_thread_main at ffffffffa07a8139 [ptlrpc]
            #18 [ffff880567895f48] kernel_thread at ffffffff8100c14a
            
            nedbass Ned Bass (Inactive) added the comment above.
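When triaging a console log, it can help to pull the PID and source location out of a `LustreError` line so it can be matched against the `PID:` header of a crash backtrace (3846 in both places here). The sketch below is an assumption-laden illustration: the function name `parse_lustre_error` is hypothetical, and the pattern is inferred only from the two console lines quoted in this ticket, not from any Lustre documentation of the log format.

```python
import re

# Layout inferred from the console lines in this ticket:
#   LustreError: <pid>:<n>:(<file>:<line>:<func>()) <message>
# The meaning of the second numeric field is not stated here, so it is
# matched but not captured.
LUSTRE_ERR_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)


def parse_lustre_error(text):
    """Return a dict with pid, file, line, func and msg, or None if no match."""
    m = LUSTRE_ERR_RE.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


SAMPLE = (
    "LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) "
    "ASSERTION(PageLocked(vmpage)) failed"
)

if __name__ == "__main__":
    print(parse_lustre_error(SAMPLE))
```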
            bogl Bob Glossman (Inactive) added a comment - http://review.whamcloud.com/#change,2690 back port to b2_1
            Landed for 2.2

            pjones Peter Jones added the comment above.

            People

              jay Jinshan Xiong (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 7
