Lustre / LU-6273

Hard Failover replay-dual test_17: Failover OST mount hang

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: client and server: lustre-master build # 2856, zfs
    • Severity: 3
    • 17590

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/0429703c-ba58-11e4-8053-5254006e85c2.

      The sub-test test_17 failed with the following error:

      test failed to respond and timed out
      

      OST dmesg:

      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.2.4.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 120 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
            Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.lustre  D 0000000000000000     0  3630   3629 0x00000080
       ffff88006edf9718 0000000000000082 0000000000000000 ffff88006ef82040
       ffff88006edf9698 ffffffff81055783 ffff88007e4c2ad8 ffff880002216880
       ffff88006ef825f8 ffff88006edf9fd8 000000000000fbc8 ffff88006ef825f8
      Call Trace:
       [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
       [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
       [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
       [<ffffffff8152a213>] wait_for_common+0x123/0x180
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
       [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
       [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
       [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
       [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
       [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
       [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
       [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
       [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
       [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
       [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
       [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
       [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
       [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
       [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
       [<ffffffff811ad74b>] do_mount+0x2fb/0x930
       [<ffffffff811ade10>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565647/real 1424565647]  req@ffff880070449080 x1493765169611180/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565672 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565712/real 1424565712]  req@ffff880070449680 x1493765169611316/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565737 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
            Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.lustre  D 0000000000000000     0  3630   3629 0x00000080
       ffff88006edf9718 0000000000000082 0000000000000000 ffff88006ef82040
       ffff88006edf9698 ffffffff81055783 ffff88007e4c2ad8 ffff880002216880
       ffff88006ef825f8 ffff88006edf9fd8 000000000000fbc8 ffff88006ef825f8
      Call Trace:
       [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
       [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
       [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
       [<ffffffff8152a213>] wait_for_common+0x123/0x180
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
       [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
       [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
       [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
       [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
       [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
       [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
       [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
       [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
       [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
       [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
       [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
       [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
       [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
       [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
       [<ffffffff811ad74b>] do_mount+0x2fb/0x930
       [<ffffffff811ade10>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.2.4.156@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 304 previous similar messages
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565842/real 1424565842]  req@ffff880070449c80 x1493765169611592/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565867 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
      

    Activity

            pjones Peter Jones added a comment -

            Landed for 2.8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16303/
            Subject: LU-6273 lwp: notify LWP users in dedicated thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b1848aa5b23fd332362e9ae3d5aab31d8dd9d920
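
            For context, here is a minimal sketch of what the "dedicated thread" approach named in the patch subject could look like: instead of calling lustre_notify_lwp_list() inline from the connect path (where a notification callback may block on an FLD RPC), the notification is handed off to a short-lived kernel thread, so the mount process cannot stall on it. This is only an illustration of the idea, not the code from the patch; lwp_notify_main() and lwp_notify_users_async() are hypothetical names.

            /* Illustrative sketch only, not the change from review 16303.
             * Run the LWP-user notification in its own kthread so that a
             * blocking FLD RPC issued from a notification callback cannot
             * hang the mount thread still processing the config llog. */
            #include <linux/kthread.h>
            #include <linux/err.h>
            #include <obd_class.h>  /* obd_export, class_export_get/put, lustre_notify_lwp_list */

            /* hypothetical thread body; data is the LWP export */
            static int lwp_notify_main(void *data)
            {
                    struct obd_export *exp = data;

                    /* may block (e.g. on the FLD RPC), but only this thread waits */
                    lustre_notify_lwp_list(exp);
                    class_export_put(exp);  /* drop the reference taken by the caller */
                    return 0;
            }

            /* hypothetical helper, called from lustre_lwp_connect() instead of
             * invoking lustre_notify_lwp_list() directly */
            static int lwp_notify_users_async(struct obd_export *exp)
            {
                    struct task_struct *task;

                    class_export_get(exp);  /* keep the export alive for the thread */
                    task = kthread_run(lwp_notify_main, exp, "lwp_notify_%s",
                                       exp->exp_obd->obd_name);
                    if (IS_ERR(task)) {
                            class_export_put(exp);
                            return PTR_ERR(task);
                    }
                    return 0;
            }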

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16304
            Subject: LU-6273 lwp: notify LWP is ready after llog processing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1be148d882a209137e92e525bc69b601e114646c

            di.wang Di Wang (Inactive) added a comment -

            > I think we can just move the lustre_notify_lwp_list() to the lustre_start_lwp(), right after lsi_lwp_started is set to 1. At this point the llog is processed and lwp should be fully setup.

            No, lustre_start_lwp() does not guarantee that the LWP will be set up accordingly, because MDT0 might not have been added to the client config log yet; for example, the OST might be started before MDT0.

            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16303
            Subject: LU-6273 fld: update local FLDB in a separate thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 92feb9161586f00cc03c481c81598f3b77847fb3

            tappro Mikhail Pershin added a comment -

            I think we can just move the lustre_notify_lwp_list() call into lustre_start_lwp(), right after lsi_lwp_started is set to 1. At that point the llog has been processed and the LWP should be fully set up.
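
            Concretely, the suggested move would look roughly like the sketch below. This is only a paraphrase of the idea, not code from the tree; lwp_process_client_llog() and lsi_lwp_export are made-up placeholders. As the reply above points out, the move is not sufficient on its own, because MDT0 may not have been added to the client config log yet.

            /* Sketch of the proposal only; lustre_start_lwp() here is a
             * paraphrase, not the real function body. */
            static int lustre_start_lwp(struct super_block *sb)
            {
                    struct lustre_sb_info *lsi = s2lsi(sb);
                    int rc;

                    /* process the client config llog that sets up the LWP
                     * (placeholder for the existing logic) */
                    rc = lwp_process_client_llog(sb);       /* hypothetical */
                    if (rc != 0)
                            return rc;

                    lsi->lsi_lwp_started = 1;

                    /* proposed addition: the llog has been processed, so the
                     * LWP should be fully set up; notify its users here */
                    if (lsi->lsi_lwp_export != NULL)        /* hypothetical field */
                            lustre_notify_lwp_list(lsi->lsi_lwp_export);

                    return 0;
            }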

            di.wang Di Wang (Inactive) added a comment - - edited

            I discussed this with Niu a bit, and this patch is probably not enough, according to Niu's comment above. We should probably move this FLDB update to a separate thread; otherwise LWP config log processing will not continue and the LWP will never be able to connect. Yes, there is a local FLDB on each target, but this FLDB update only happens during first initialization (checking the lsf_new flag) or on upgrade from 2.4 (DNE) to the current DNE version (2.8). And I do not know why it only happens on ZFS; it may be that fld_index_init is slow on ZFS, so the OST is restarted before fld_index_init finishes.

            tappro Mikhail Pershin added a comment -

            Yes, that looks like the reason, but I still wonder why ldiskfs setups are not affected. Di, IIRC you implemented the local FLDB in the past; shouldn't the OST have a local copy of the FLDB?

            di.wang Di Wang (Inactive) added a comment -

            Hmm, I actually hit this problem in one of my patch tests:
            http://review.whamcloud.com/#/c/15275/ (patch set 7)

            See this failure https://testing.hpdd.intel.com/test_sets/e2a60dca-3ff5-11e5-afd4-5254006e85c2

            We should probably set the no_delay and allow_replay flags on the FLD update RPC, so I think this patch should fix the problem:

            diff --git a/lustre/fld/fld_request.c b/lustre/fld/fld_request.c
            index a59ab95..c82b098 100644
            --- a/lustre/fld/fld_request.c
            +++ b/lustre/fld/fld_request.c
            @@ -399,6 +399,14 @@ again:
             
                            req_capsule_set_size(&req->rq_pill, &RMF_GENERIC_DATA,
                                                 RCL_SERVER, PAGE_CACHE_SIZE);
            +
            +               /* This might happen before the import state becomes to FULL,
            +                * let's set allow_replay for this request to avoid deadlock
            +                * see LU-6273 */
            +               req->rq_allow_replay = 1;
            +               /* This will always use LWP connection, let's send the req
            +                * with no_delay flags, see above */
            +               req->rq_no_delay = 1;
                            break;
                    default:
                            rc = -EINVAL;
            

            I will push the fix to http://review.whamcloud.com/#/c/15275/

            niu Niu Yawei (Inactive) added a comment -

            I now see why the LWP on the OST can't connect to the failover MDT:

            On OST mount, we process the client config log to set up the LWP as follows (see the simplified sketch below):
            1> Process the LCFG_ADD_UUID record to set up the LWP device, then connect to the MDT (see lustre_lwp_setup());
            2> Process the LCFG_ADD_CONN record to add the failover connection.

            We can see that if the mount process blocks at step 1, it never gets a chance to add the failover connection, so the LWP can never switch to the failover node.
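
            A simplified model of that ordering (this is not the actual client_lwp_config_process(); the two helpers are made-up names): the llog records are handled one by one in the mount thread, so if the LCFG_ADD_UUID step blocks, the LCFG_ADD_CONN record that would add the failover NID is never reached.

            /* Simplified illustration only, not the actual config handler. */
            static int lwp_config_step(struct lustre_cfg *lcfg)
            {
                    switch (lcfg->lcfg_command) {
                    case LCFG_ADD_UUID:
                            /* step 1: set up the LWP device and connect to the
                             * MDT.  If this ends up waiting on an RPC that can
                             * only succeed once the failover NID is known, we
                             * never return from here... */
                            return lwp_setup_and_connect(lcfg);   /* hypothetical */
                    case LCFG_ADD_CONN:
                            /* step 2: ...so this record, which would add the
                             * failover connection, is never processed. */
                            return lwp_add_failover_conn(lcfg);   /* hypothetical */
                    default:
                            return 0;
                    }
            }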

            Unfortunately, the process can indeed block at step 1; see lustre_lwp_setup() -> lustre_lwp_connect():

                     rc = obd_connect(&env, &exp, lwp, uuid, data, NULL);
                    if (rc != 0) {
                            CERROR("%s: connect failed: rc = %d\n", lwp->obd_name, rc);
                    } else {
                            if (unlikely(lwp->obd_lwp_export != NULL))
                                    class_export_put(lwp->obd_lwp_export);
                            lwp->obd_lwp_export = class_export_get(exp);
                            lustre_notify_lwp_list(exp);
                    }
            

            lustre_notify_lwp_list() is called after obd_connect() returns successfully (which only means the CONNECT request could be sent, not that the connection is established). lustre_notify_lwp_list() then invokes the callback registered by ofd_register_lwp_callback(), which sends the FLD RPC; please note that at this point it is trying to send the RPC to the primary MDT0 (the connection to the failover MDT0 has not been added yet).

            Let's look at fld_client_rpc():

                    if (rc != 0) {
                            if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
                                    /* Since LWP is not replayable, so it will keep
                                     * trying unless umount happens, otherwise it would
                                     * cause unecessary failure of the application. */
                                    ptlrpc_req_finished(req);
                                    rc = 0;
                                    goto again;
                            }
                            GOTO(out_req, rc);
                    }
            

            It keeps trying to send the RPC to the primary MDT0 again and again in a loop, so in the end the mount process is blocked here forever (the failover connection never gets added).

            I'll think further about how to resolve this. Di, any thoughts? Thanks.

    People

      Assignee: tappro Mikhail Pershin
      Reporter: maloo Maloo
      Votes: 0
      Watchers: 12
