    Description

      When I rebooted two OSSes to put a patch for bug LU-874 on the servers, quite a few of the clients appear to have gotten deadlocked in recovery. Here's a backtrace of ptlrpcd-rcv on one client:

      crash> bt 5077
      PID: 5077   TASK: ffff88082da834c0  CPU: 8   COMMAND: "ptlrpcd-rcv"
       #0 [ffff88082da85430] schedule at ffffffff814ee3b2
       #1 [ffff88082da854f8] io_schedule at ffffffff814eeba3
       #2 [ffff88082da85518] sync_page at ffffffff81110fbd
       #3 [ffff88082da85528] __wait_on_bit_lock at ffffffff814ef40a
       #4 [ffff88082da85578] __lock_page at ffffffff81110f57
       #5 [ffff88082da855d8] vvp_page_own at ffffffffa093bf6a [lustre]
       #6 [ffff88082da855f8] cl_page_own0 at ffffffffa0601d3b [obdclass]
       #7 [ffff88082da85678] cl_page_own at ffffffffa0601fa0 [obdclass]
       #8 [ffff88082da85688] cl_page_gang_lookup at ffffffffa0603bb7 [obdclass]
       #9 [ffff88082da85758] cl_lock_page_out at ffffffffa06096fc [obdclass]
      #10 [ffff88082da85808] osc_lock_flush at ffffffffa0858e8f [osc]
      #11 [ffff88082da85858] osc_lock_cancel at ffffffffa0858f2a [osc]
      #12 [ffff88082da858d8] cl_lock_cancel0 at ffffffffa0604665 [obdclass]
      #13 [ffff88082da85928] cl_lock_cancel at ffffffffa06051ab [obdclass]
      #14 [ffff88082da85968] osc_ldlm_blocking_ast at ffffffffa0859cf8 [osc]
      #15 [ffff88082da859f8] ldlm_cancel_callback at ffffffffa06a1ba3 [ptlrpc]
      #16 [ffff88082da85a18] ldlm_lock_cancel at ffffffffa06a1c89 [ptlrpc]
      #17 [ffff88082da85a58] ldlm_cli_cancel_list_local at ffffffffa06bede8 [ptlrpc]
      #18 [ffff88082da85ae8] ldlm_cancel_lru_local at ffffffffa06bf255 [ptlrpc]
      #19 [ffff88082da85b08] ldlm_replay_locks at ffffffffa06bf385 [ptlrpc]
      #20 [ffff88082da85bb8] ptlrpc_import_recovery_state_machine at ffffffffa070ceea [ptlrpc]
      #21 [ffff88082da85c38] ptlrpc_connect_interpret at ffffffffa070db38 [ptlrpc]
      #22 [ffff88082da85d08] ptlrpc_check_set at ffffffffa06dd870 [ptlrpc]
      #23 [ffff88082da85de8] ptlrpcd_check at ffffffffa07113b8 [ptlrpc]
      #24 [ffff88082da85e48] ptlrpcd at ffffffffa071175b [ptlrpc]
      #25 [ffff88082da85f48] kernel_thread at ffffffff8100c14a
      

      I will need to do more investigation, but that's a start.
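
      One plausible reading of the trace: while cancelling an extent lock during replay, ptlrpcd-rcv sleeps in __lock_page() on a page that some in-flight I/O already owns, and that I/O in turn cannot complete until this same thread finishes recovery, so neither side can make progress. The toy two-thread model below (plain userspace C with pthreads; every name is invented for illustration and none of it is Lustre code) shows the cycle, and how a nonblocking trylock on the recovery side breaks it:

      /*
       * Hypothetical model of the hang, not Lustre code.
       * Build with: cc -pthread -o model model.c
       */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for PG_locked */
      static pthread_mutex_t state_mtx = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  state_cnd = PTHREAD_COND_INITIALIZER;
      static bool recovery_done = false;

      /* An in-flight write: it holds the page lock and cannot finish
       * until recovery completes (its RPC needs a healthy import). */
      static void *io_thread(void *arg)
      {
              (void)arg;
              pthread_mutex_lock(&page_lock);
              pthread_mutex_lock(&state_mtx);
              while (!recovery_done)
                      pthread_cond_wait(&state_cnd, &state_mtx);
              pthread_mutex_unlock(&state_mtx);
              pthread_mutex_unlock(&page_lock);
              return NULL;
      }

      /* The ptlrpcd-rcv role: cancelling an extent lock means visiting
       * its pages. A blocking pthread_mutex_lock(&page_lock) here would
       * reproduce the hang in the backtrace; trylock skips the busy
       * page so recovery can finish and the I/O can drain afterwards. */
      static void *recovery_thread(void *arg)
      {
              (void)arg;
              if (pthread_mutex_trylock(&page_lock) == 0) {
                      /* page was free: "flush" it */
                      pthread_mutex_unlock(&page_lock);
              } else {
                      printf("page busy; skipping instead of deadlocking\n");
              }
              pthread_mutex_lock(&state_mtx);
              recovery_done = true;
              pthread_cond_broadcast(&state_cnd);
              pthread_mutex_unlock(&state_mtx);
              return NULL;
      }

      int main(void)
      {
              pthread_t io, rec;

              pthread_create(&io, NULL, io_thread, NULL);
              sleep(1); /* let the I/O take the page lock first */
              pthread_create(&rec, NULL, recovery_thread, NULL);
              pthread_join(rec, NULL);
              pthread_join(io, NULL);
              printf("recovery completed\n");
              return 0;
      }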

Activity

[LU-948] Client recovery hang

hudson Build Master (Inactive) added a comment:

Integrated in lustre-master » x86_64,client,el6,inkernel #440
LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

Result = SUCCESS
Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
Files:

• lustre/include/cl_object.h
• lustre/osc/osc_lock.c
• lustre/obdclass/cl_internal.h
• lustre/obdclass/cl_lock.c
• lustre/obdclass/cl_page.c

hudson Build Master (Inactive) added a comment:

Integrated in lustre-master » x86_64,client,el5,ofa #440
LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

Result = SUCCESS
Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
Files:

• lustre/osc/osc_lock.c
• lustre/obdclass/cl_lock.c
• lustre/obdclass/cl_internal.h
• lustre/include/cl_object.h
• lustre/obdclass/cl_page.c

hudson Build Master (Inactive) added a comment:

Integrated in lustre-master » x86_64,server,el5,inkernel #440
LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

Result = SUCCESS
Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
Files:

• lustre/include/cl_object.h
• lustre/obdclass/cl_internal.h
• lustre/obdclass/cl_page.c
• lustre/osc/osc_lock.c
• lustre/obdclass/cl_lock.c

hudson Build Master (Inactive) added a comment:

Integrated in lustre-master » i686,server,el6,inkernel #440
LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

Result = SUCCESS
Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
Files:

• lustre/osc/osc_lock.c
• lustre/include/cl_object.h
• lustre/obdclass/cl_page.c
• lustre/obdclass/cl_lock.c
• lustre/obdclass/cl_internal.h

hudson Build Master (Inactive) added a comment:

Integrated in lustre-master » x86_64,client,el5,inkernel #440
LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

Result = SUCCESS
Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
Files:

• lustre/obdclass/cl_lock.c
• lustre/obdclass/cl_page.c
• lustre/obdclass/cl_internal.h
• lustre/osc/osc_lock.c
• lustre/include/cl_object.h

hudson Build Master (Inactive) added a comment:

Integrated in lustre-master » x86_64,server,el5,ofa #440
LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

Result = SUCCESS
Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
Files:

• lustre/osc/osc_lock.c
• lustre/obdclass/cl_internal.h
• lustre/obdclass/cl_page.c
• lustre/obdclass/cl_lock.c
• lustre/include/cl_object.h

hudson Build Master (Inactive) added a comment:

Integrated in lustre-master » x86_64,client,sles11,inkernel #440
LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

Result = SUCCESS
Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
Files:

• lustre/obdclass/cl_lock.c
• lustre/osc/osc_lock.c
• lustre/include/cl_object.h
• lustre/obdclass/cl_page.c
• lustre/obdclass/cl_internal.h
jay Jinshan Xiong (Inactive) added a comment:

Patch is at: http://review.whamcloud.com/1955
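
The commit subject says the fix adds a callback to cl_page_gang_lookup(). The sketch below does not reproduce the actual interface from that patch; it is a self-contained userspace illustration (all names and types invented) of the pattern the subject implies: the page walker itself never blocks, and the caller's callback decides whether to skip a busy page and request another pass rather than sleeping on it:

    /* Hypothetical callback-driven gang lookup, not the Lustre API. */
    #include <stdbool.h>
    #include <stdio.h>

    enum gang_res { GANG_OKAY, GANG_AGAIN, GANG_ABORT };

    struct page { int index; bool locked; bool done; };

    typedef enum gang_res (*gang_cb_t)(struct page *pg, void *data);

    /* Visit every page and let the callback decide what to do with it;
     * the walker only aggregates the results and never sleeps itself. */
    static enum gang_res gang_lookup(struct page *pages, int n,
                                     gang_cb_t cb, void *data)
    {
            enum gang_res res = GANG_OKAY;

            for (int i = 0; i < n; i++) {
                    enum gang_res r = cb(&pages[i], data);

                    if (r == GANG_ABORT)
                            return GANG_ABORT;
                    if (r == GANG_AGAIN)
                            res = GANG_AGAIN; /* caller should retry later */
            }
            return res;
    }

    /* Nonblocking "own": skip pages somebody else holds instead of
     * sleeping on them, and report that another pass is needed. */
    static enum gang_res discard_cb(struct page *pg, void *data)
    {
            (void)data;
            if (pg->done)
                    return GANG_OKAY;
            if (pg->locked)
                    return GANG_AGAIN;
            printf("discarded page %d\n", pg->index);
            pg->done = true;
            return GANG_OKAY;
    }

    int main(void)
    {
            struct page pages[] = { {0, false, false},
                                    {1, true,  false},
                                    {2, false, false} };

            while (gang_lookup(pages, 3, discard_cb, NULL) == GANG_AGAIN) {
                    /* in real code: yield to the lock holder, then retry */
                    pages[1].locked = false;
            }
            return 0;
    }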

morrone Christopher Morrone (Inactive) added a comment:

We only implemented some of the 1.8 version. The CLIO version for 2.x doesn't look at all familiar to me.

In production, clients will need to replay tens of thousands of locks, which completely overwhelms the servers. Since most of those locks are completely unused, it is better to drop the unused locks rather than replay them. If they are needed again in the future, the load of recreating them on demand is easier to deal with than the flood of lock replays at recovery time.

At the time, we could see the problem even with just one or a few clients. If you did something like a Linux kernel compilation out of Lustre, you would wind up with tens of thousands of locks on just that one node.

I don't think we really want to abandon this ability.
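
A minimal sketch of the policy described above (invented structures, not Lustre's ldlm code): before replay, walk the lock LRU, cancel everything with no active users, and replay only the locks still in use:

    /* Hypothetical pre-replay filter over a client's lock list. */
    #include <stdio.h>

    struct lock { int id; int readers; int writers; };

    static int lock_in_use(const struct lock *lk)
    {
            return lk->readers > 0 || lk->writers > 0;
    }

    int main(void)
    {
            struct lock lru[] = { {1, 0, 0}, {2, 1, 0}, {3, 0, 0}, {4, 0, 1} };
            int replayed = 0, cancelled = 0;

            for (int i = 0; i < 4; i++) {
                    if (lock_in_use(&lru[i])) {
                            printf("replaying lock %d\n", lru[i].id);
                            replayed++;
                    } else {
                            /* unused: cheaper to re-acquire on demand later
                             * than to flood the server with replays now */
                            cancelled++;
                    }
            }
            printf("replayed %d, cancelled %d\n", replayed, cancelled);
            return 0;
    }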

jay Jinshan Xiong (Inactive) added a comment:

I think this issue can be fixed by reverting commit 6fd5e00ff03d41b427eec5d70efaef4bbdd8d59c, which was added in bug 16774 to address clients replaying lots of unused locks during recovery.

Since that issue was filed and its fix implemented by you guys, can you please tell me what the side effect is of clients replaying unused locks during recovery?

morrone Christopher Morrone (Inactive) added a comment:

With LU-874 much improved, this is our next concern for 2.1. Have you had any time to look at the fix for this?

People

  jay Jinshan Xiong (Inactive)
  morrone Christopher Morrone (Inactive)

  Votes: 0
  Watchers: 7
