Details


    Description

      When I rebooted two OSSs to apply the patch for LU-874 on the servers, quite a few of the clients appear to have deadlocked in recovery. Here is a backtrace of ptlrpcd-rcv on one client:

      crash> bt 5077
      PID: 5077   TASK: ffff88082da834c0  CPU: 8   COMMAND: "ptlrpcd-rcv"
       #0 [ffff88082da85430] schedule at ffffffff814ee3b2
       #1 [ffff88082da854f8] io_schedule at ffffffff814eeba3
       #2 [ffff88082da85518] sync_page at ffffffff81110fbd
       #3 [ffff88082da85528] __wait_on_bit_lock at ffffffff814ef40a
       #4 [ffff88082da85578] __lock_page at ffffffff81110f57
       #5 [ffff88082da855d8] vvp_page_own at ffffffffa093bf6a [lustre]
       #6 [ffff88082da855f8] cl_page_own0 at ffffffffa0601d3b [obdclass]
       #7 [ffff88082da85678] cl_page_own at ffffffffa0601fa0 [obdclass]
       #8 [ffff88082da85688] cl_page_gang_lookup at ffffffffa0603bb7 [obdclass]
       #9 [ffff88082da85758] cl_lock_page_out at ffffffffa06096fc [obdclass]
      #10 [ffff88082da85808] osc_lock_flush at ffffffffa0858e8f [osc]
      #11 [ffff88082da85858] osc_lock_cancel at ffffffffa0858f2a [osc]
      #12 [ffff88082da858d8] cl_lock_cancel0 at ffffffffa0604665 [obdclass]
      #13 [ffff88082da85928] cl_lock_cancel at ffffffffa06051ab [obdclass]
      #14 [ffff88082da85968] osc_ldlm_blocking_ast at ffffffffa0859cf8 [osc]
      #15 [ffff88082da859f8] ldlm_cancel_callback at ffffffffa06a1ba3 [ptlrpc]
      #16 [ffff88082da85a18] ldlm_lock_cancel at ffffffffa06a1c89 [ptlrpc]
      #17 [ffff88082da85a58] ldlm_cli_cancel_list_local at ffffffffa06bede8 [ptlrpc]
      #18 [ffff88082da85ae8] ldlm_cancel_lru_local at ffffffffa06bf255 [ptlrpc]
      #19 [ffff88082da85b08] ldlm_replay_locks at ffffffffa06bf385 [ptlrpc]
      #20 [ffff88082da85bb8] ptlrpc_import_recovery_state_machine at ffffffffa070ceea [ptlrpc]
      #21 [ffff88082da85c38] ptlrpc_connect_interpret at ffffffffa070db38 [ptlrpc]
      #22 [ffff88082da85d08] ptlrpc_check_set at ffffffffa06dd870 [ptlrpc]
      #23 [ffff88082da85de8] ptlrpcd_check at ffffffffa07113b8 [ptlrpc]
      #24 [ffff88082da85e48] ptlrpcd at ffffffffa071175b [ptlrpc]
      #25 [ffff88082da85f48] kernel_thread at ffffffff8100c14a
      

      I will need to do more investigation, but that's a start.
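      In the trace, ptlrpcd-rcv (the thread that processes RPC replies) is inside ldlm_replay_locks() -> ldlm_cancel_lru_local() and ends up sleeping in __lock_page() via cl_page_own(). One plausible reading, not confirmed in this report, is that whoever holds that page lock is waiting on an RPC that only ptlrpcd-rcv can complete, so neither side can progress. The general remedy for that shape of deadlock is to never sleep on a page lock from such a thread. Below is a minimal user-space sketch of that idea, assuming a pthread mutex stands in for the page lock; `struct demo_page` and `flush_pages_try()` are hypothetical names, not Lustre code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached page: the mutex plays the role of the
 * page lock that __lock_page() sleeps on in the trace above. */
struct demo_page {
        pthread_mutex_t lock;
        bool            dirty;
};

/* Flush for a thread that must never sleep on a page lock, for example the
 * thread that also completes the RPCs other threads are waiting for.
 * Busy pages are skipped instead of waited on; the return value counts them
 * so the caller can retry later from a context where blocking is safe. */
static size_t flush_pages_try(struct demo_page *pages, size_t n)
{
        size_t skipped = 0;

        assert(pages != NULL || n == 0);
        for (size_t i = 0; i < n; i++) {
                /* pthread_mutex_trylock() returns EBUSY instead of sleeping. */
                if (pthread_mutex_trylock(&pages[i].lock) != 0) {
                        skipped++;
                        continue;
                }
                pages[i].dirty = false;   /* stand-in for writing the page out */
                pthread_mutex_unlock(&pages[i].lock);
        }
        return skipped;
}
```

      A caller would repeat the flush, from a context that is allowed to wait, until it returns 0. Whether the actual fix skips, retries, or defers busy pages is not shown here.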

          Activity

            [LU-948] Client recovery hang

            jay Jinshan Xiong (Inactive) added a comment - Please apply the patch from LU-1059.

            nedbass Ned Bass (Inactive) added a comment - We hit this assertion during testing after cherry-picking http://review.whamcloud.com/#change,1955 into our 2.1.2 branch.

            LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) ASSERTION(PageLocked(vmpage)) failed
            LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) LBUG
            
            PID: 3846   TASK: ffff88054af26aa0  CPU: 3   COMMAND: "ldlm_bl_10"
             #0 [ffff880567895948] machine_kexec at ffffffff8103216b
             #1 [ffff8805678959a8] crash_kexec at ffffffff810b8d12
             #2 [ffff880567895a78] panic at ffffffff814ee999
             #3 [ffff880567895af8] lbug_with_loc at ffffffffa0515e1b [libcfs]
             #4 [ffff880567895b18] libcfs_assertion_failed at ffffffffa051f42d [libcfs]
             #5 [ffff880567895b38] vvp_page_unmap at ffffffffa0aebc8c [lustre]
             #6 [ffff880567895b68] cl_page_invoke at ffffffffa06982f8 [obdclass]
             #7 [ffff880567895ba8] cl_page_unmap at ffffffffa0698383 [obdclass]
             #8 [ffff880567895bb8] check_and_discard_cb at ffffffffa069f6be [obdclass]
             #9 [ffff880567895c08] cl_page_gang_lookup at ffffffffa069b763 [obdclass]
            #10 [ffff880567895cb8] cl_lock_page_out at ffffffffa069ce3b [obdclass]
            #11 [ffff880567895d28] osc_lock_flush at ffffffffa09c197f [osc]
            #12 [ffff880567895d78] osc_lock_cancel at ffffffffa09c1a19 [osc]
            #13 [ffff880567895dc8] cl_lock_cancel0 at ffffffffa069c085 [obdclass]
            #14 [ffff880567895df8] cl_lock_cancel at ffffffffa069c8b3 [obdclass]
            #15 [ffff880567895e18] osc_ldlm_blocking_ast at ffffffffa09c2673 [osc]
            #16 [ffff880567895e88] ldlm_handle_bl_callback at ffffffffa07a7db4 [ptlrpc]
            #17 [ffff880567895eb8] ldlm_bl_thread_main at ffffffffa07a8139 [ptlrpc]
            #18 [ffff880567895f48] kernel_thread at ffffffff8100c14a
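            The assertion fires in vvp_page_unmap(), which requires the VM page to be locked (ASSERTION(PageLocked(vmpage))), when it is reached from check_and_discard_cb(). The sketch below only illustrates that ordering invariant in plain C, assuming a boolean flag models the page lock: a discard callback must take ownership of the page before unmapping it, and must leave the page alone if ownership cannot be obtained. All names (`struct model_page`, `model_*`) are hypothetical; this is not the code of change 1955 or its backport.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page model: `locked` stands in for PageLocked(vmpage). */
struct model_page {
        bool locked;
        bool mapped;
        bool discarded;
};

/* Take ownership without sleeping; fails if the page is already locked. */
static bool model_page_own_try(struct model_page *pg)
{
        if (pg->locked)
                return false;
        pg->locked = true;
        return true;
}

static void model_page_disown(struct model_page *pg)
{
        assert(pg->locked);
        pg->locked = false;
}

/* Same precondition as the failing assertion: the caller must own the page. */
static void model_page_unmap(struct model_page *pg)
{
        assert(pg->locked);
        pg->mapped = false;
}

/* Discard callback: own -> unmap -> discard -> disown, or skip the page.
 * Returns true if the page was discarded. */
static bool model_check_and_discard(struct model_page *pg)
{
        if (!model_page_own_try(pg))
                return false;   /* busy: never unmap it without the lock */
        model_page_unmap(pg);
        pg->discarded = true;
        model_page_disown(pg);
        return true;
}
```

            Calling model_page_unmap() on a page that was not owned first trips the same kind of assertion seen in the LBUG above.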
            bogl Bob Glossman (Inactive) added a comment - Backport to b2_1: http://review.whamcloud.com/#change,2690
            pjones Peter Jones added a comment - Landed for 2.2

            Integrated in lustre-master » i686,client,el5,ofa #440
            LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

            Result = SUCCESS
            Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
            Files :

            • lustre/obdclass/cl_lock.c
            • lustre/obdclass/cl_internal.h
            • lustre/include/cl_object.h
            • lustre/obdclass/cl_page.c
            • lustre/osc/osc_lock.c
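            The change that landed is titled "clio: add a callback to cl_page_gang_lookup()" and touches cl_page.c, cl_lock.c, cl_object.h and osc_lock.c, but this report does not show the new interface. The following is therefore only a sketch of the general pattern that title suggests, under stated assumptions: a lookup walks the pages in a range, hands each one to a caller-supplied callback, and lets the callback stop the walk and record where to resume. The enum values, `gang_lookup()`, `mark_done_cb()` and the half-open range convention are hypothetical illustrations, not the Lustre API.

```c
#include <stddef.h>

/* Hypothetical status codes a per-page callback can return. */
enum gang_status {
        GANG_OKAY,      /* page handled, keep walking */
        GANG_AGAIN,     /* page busy: stop so the caller can retry later */
};

struct gang_page {
        size_t index;
        int    busy;    /* nonzero: the callback cannot handle it right now */
        int    done;
};

typedef enum gang_status (*gang_cb_t)(struct gang_page *pg, void *cbdata);

/* Walk pages[start, end) and invoke cb on each page.  On GANG_AGAIN the
 * walk stops and *next holds the index to resume from; when the whole
 * range has been visited, *next is set to end. */
static enum gang_status gang_lookup(struct gang_page *pages, size_t start,
                                    size_t end, gang_cb_t cb, void *cbdata,
                                    size_t *next)
{
        for (size_t i = start; i < end; i++) {
                if (cb(&pages[i], cbdata) == GANG_AGAIN) {
                        *next = i;
                        return GANG_AGAIN;
                }
        }
        *next = end;
        return GANG_OKAY;
}

/* Example callback: handle idle pages, request a retry on busy ones,
 * and count handled pages through cbdata. */
static enum gang_status mark_done_cb(struct gang_page *pg, void *cbdata)
{
        size_t *handled = cbdata;

        if (pg->busy)
                return GANG_AGAIN;
        pg->done = 1;
        (*handled)++;
        return GANG_OKAY;
}
```

            Moving the per-page decision into the callback lets each caller choose whether a busy page means "wait", "skip" or "retry later", while the walk itself stays generic.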

            Integrated in lustre-master » i686,client,el5,inkernel #440
            LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

            Result = SUCCESS
            Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
            Files :

            • lustre/obdclass/cl_page.c
            • lustre/include/cl_object.h
            • lustre/osc/osc_lock.c
            • lustre/obdclass/cl_lock.c
            • lustre/obdclass/cl_internal.h

            Integrated in lustre-master » i686,server,el5,inkernel #440
            LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

            Result = SUCCESS
            Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
            Files :

            • lustre/osc/osc_lock.c
            • lustre/include/cl_object.h
            • lustre/obdclass/cl_page.c
            • lustre/obdclass/cl_lock.c
            • lustre/obdclass/cl_internal.h

            Integrated in lustre-master » i686,server,el5,ofa #440
            LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

            Result = SUCCESS
            Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
            Files :

            • lustre/osc/osc_lock.c
            • lustre/include/cl_object.h
            • lustre/obdclass/cl_lock.c
            • lustre/obdclass/cl_page.c
            • lustre/obdclass/cl_internal.h

            Integrated in lustre-master » x86_64,server,el6,inkernel #440
            LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

            Result = SUCCESS
            Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
            Files :

            • lustre/obdclass/cl_page.c
            • lustre/include/cl_object.h
            • lustre/obdclass/cl_lock.c
            • lustre/obdclass/cl_internal.h
            • lustre/osc/osc_lock.c

            Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #440
            LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

            Result = SUCCESS
            Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
            Files :

            • lustre/include/cl_object.h
            • lustre/obdclass/cl_page.c
            • lustre/obdclass/cl_internal.h
            • lustre/obdclass/cl_lock.c
            • lustre/osc/osc_lock.c

            Integrated in lustre-master » i686,client,el6,inkernel #440
            LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

            Result = SUCCESS
            Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
            Files :

            • lustre/include/cl_object.h
            • lustre/obdclass/cl_page.c
            • lustre/obdclass/cl_internal.h
            • lustre/osc/osc_lock.c
            • lustre/obdclass/cl_lock.c

            People

              jay Jinshan Xiong (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 7
