Lustre / LU-6273

Hard Failover replay-dual test_17: Failover OST mount hang

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: client and server: lustre-master build # 2856, zfs
    • Severity: 3
    • 17590

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/0429703c-ba58-11e4-8053-5254006e85c2.

      The sub-test test_17 failed with the following error:

      test failed to respond and timed out
      

      OST dmesg:

      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.2.4.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 120 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
            Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.lustre  D 0000000000000000     0  3630   3629 0x00000080
       ffff88006edf9718 0000000000000082 0000000000000000 ffff88006ef82040
       ffff88006edf9698 ffffffff81055783 ffff88007e4c2ad8 ffff880002216880
       ffff88006ef825f8 ffff88006edf9fd8 000000000000fbc8 ffff88006ef825f8
      Call Trace:
       [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
       [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
       [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
       [<ffffffff8152a213>] wait_for_common+0x123/0x180
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
       [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
       [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
       [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
       [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
       [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
       [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
       [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
       [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
       [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
       [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
       [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
       [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
       [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
       [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
       [<ffffffff811ad74b>] do_mount+0x2fb/0x930
       [<ffffffff811ade10>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565647/real 1424565647]  req@ffff880070449080 x1493765169611180/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565672 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565712/real 1424565712]  req@ffff880070449680 x1493765169611316/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565737 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
            Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.lustre  D 0000000000000000     0  3630   3629 0x00000080
       ffff88006edf9718 0000000000000082 0000000000000000 ffff88006ef82040
       ffff88006edf9698 ffffffff81055783 ffff88007e4c2ad8 ffff880002216880
       ffff88006ef825f8 ffff88006edf9fd8 000000000000fbc8 ffff88006ef825f8
      Call Trace:
       [<ffffffff81055783>] ? set_next_buddy+0x43/0x50
       [<ffffffff8152a595>] schedule_timeout+0x215/0x2e0
       [<ffffffff81069f15>] ? enqueue_entity+0x125/0x450
       [<ffffffff8152a213>] wait_for_common+0x123/0x180
       [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
       [<ffffffffa090cd00>] ? client_lwp_config_process+0x0/0x1948 [obdclass]
       [<ffffffff8152a32d>] wait_for_completion+0x1d/0x20
       [<ffffffffa0898e14>] llog_process_or_fork+0x354/0x540 [obdclass]
       [<ffffffffa0899014>] llog_process+0x14/0x30 [obdclass]
       [<ffffffffa08c81d4>] class_config_parse_llog+0x1e4/0x330 [obdclass]
       [<ffffffffa10314f2>] mgc_process_log+0xeb2/0x1970 [mgc]
       [<ffffffffa102b1f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
       [<ffffffffa0ad0860>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
       [<ffffffffa1032ef8>] mgc_process_config+0x658/0x1210 [mgc]
       [<ffffffffa08d9383>] lustre_process_log+0x7e3/0x1130 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d514f>] ? server_name2fsname+0x6f/0x90 [obdclass]
       [<ffffffffa0907496>] server_start_targets+0x12b6/0x1af0 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dbfe6>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
       [<ffffffffa07891c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa08d3390>] ? class_config_llog_handler+0x0/0x1a70 [obdclass]
       [<ffffffffa090c255>] server_fill_super+0xbe5/0x1690 [obdclass]
       [<ffffffffa0783818>] ? libcfs_log_return+0x28/0x40 [libcfs]
       [<ffffffffa08dde90>] lustre_fill_super+0x560/0xa80 [obdclass]
       [<ffffffffa08dd930>] ? lustre_fill_super+0x0/0xa80 [obdclass]
       [<ffffffff8118c56f>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa08d4ee5>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff8118bbcb>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff8118bd72>] do_kern_mount+0x52/0x130
       [<ffffffff8119e972>] ? vfs_ioctl+0x22/0xa0
       [<ffffffff811ad74b>] do_mount+0x2fb/0x930
       [<ffffffff811ade10>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.2.4.156@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 304 previous similar messages
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1424565842/real 1424565842]  req@ffff880070449c80 x1493765169611592/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.2.4.158@tcp:12/10 lens 400/544 e 0 to 1 dl 1424565867 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3053:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
      INFO: task mount.lustre:3630 blocked for more than 120 seconds.
      

    Activity

            pjones Peter Jones added a comment -

            Landed for 2.8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16303/
            Subject: LU-6273 lwp: notify LWP users in dedicated thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b1848aa5b23fd332362e9ae3d5aab31d8dd9d920
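
            For context, here is a minimal sketch of what the "dedicated thread" approach named in the patch subject could look like: instead of calling lustre_notify_lwp_list() inline from the connect path (where a notification callback may block on an FLD RPC), the notification is handed off to a short-lived kernel thread, so the mount process cannot stall on it. This is only an illustration of the idea, not the code from the patch; lwp_notify_main() and lwp_notify_users_async() are hypothetical names.

            /* Illustrative sketch only, not the change from review 16303.
             * Run the LWP-user notification in its own kthread so that a
             * blocking FLD RPC issued from a notification callback cannot
             * hang the mount thread still processing the config llog. */
            #include <linux/kthread.h>
            #include <linux/err.h>
            #include <obd_class.h>  /* obd_export, class_export_get/put, lustre_notify_lwp_list */

            /* hypothetical thread body; data is the LWP export */
            static int lwp_notify_main(void *data)
            {
                    struct obd_export *exp = data;

                    /* may block (e.g. on the FLD RPC), but only this thread waits */
                    lustre_notify_lwp_list(exp);
                    class_export_put(exp);  /* drop the reference taken by the caller */
                    return 0;
            }

            /* hypothetical helper, called from lustre_lwp_connect() instead of
             * invoking lustre_notify_lwp_list() directly */
            static int lwp_notify_users_async(struct obd_export *exp)
            {
                    struct task_struct *task;

                    class_export_get(exp);  /* keep the export alive for the thread */
                    task = kthread_run(lwp_notify_main, exp, "lwp_notify_%s",
                                       exp->exp_obd->obd_name);
                    if (IS_ERR(task)) {
                            class_export_put(exp);
                            return PTR_ERR(task);
                    }
                    return 0;
            }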

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16304
            Subject: LU-6273 lwp: notify LWP is ready after llog processing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1be148d882a209137e92e525bc69b601e114646c

            di.wang Di Wang (Inactive) added a comment -

            > I think we can just move the lustre_notify_lwp_list() to the lustre_start_lwp(), right after lsi_lwp_started is set to 1. At this point the llog is processed and lwp should be fully setup.

            No, lustre_start_lwp() does not guarantee that the LWP will be set up accordingly, because MDT0 might not have been added to the client config log yet; for example, the OST might be started before MDT0.

            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16303
            Subject: LU-6273 fld: update local FLDB in a separate thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 92feb9161586f00cc03c481c81598f3b77847fb3

            tappro Mikhail Pershin added a comment -

            I think we can just move the lustre_notify_lwp_list() call into lustre_start_lwp(), right after lsi_lwp_started is set to 1. At that point the llog has been processed and the LWP should be fully set up.
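
            Concretely, the suggested move would look roughly like the sketch below. This is only a paraphrase of the idea, not code from the tree; lwp_process_client_llog() and lsi_lwp_export are made-up placeholders. As the reply above points out, the move is not sufficient on its own, because MDT0 may not have been added to the client config log yet.

            /* Sketch of the proposal only; lustre_start_lwp() here is a
             * paraphrase, not the real function body. */
            static int lustre_start_lwp(struct super_block *sb)
            {
                    struct lustre_sb_info *lsi = s2lsi(sb);
                    int rc;

                    /* process the client config llog that sets up the LWP
                     * (placeholder for the existing logic) */
                    rc = lwp_process_client_llog(sb);       /* hypothetical */
                    if (rc != 0)
                            return rc;

                    lsi->lsi_lwp_started = 1;

                    /* proposed addition: the llog has been processed, so the
                     * LWP should be fully set up; notify its users here */
                    if (lsi->lsi_lwp_export != NULL)        /* hypothetical field */
                            lustre_notify_lwp_list(lsi->lsi_lwp_export);

                    return 0;
            }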

            di.wang Di Wang (Inactive) added a comment - - edited

            I discussed this with Niu a bit, and this patch is probably not enough, according to Niu's comment above. We should probably move this FLDB update to a separate thread; otherwise LWP config log processing will not continue and the LWP will never be able to connect. Yes, there is a local FLDB on each target, but this FLDB update only happens during first initialization (checking the lsf_new flag) or on upgrade from 2.4 (DNE) to the current DNE version (2.8). And I do not know why it only happens on ZFS; it may be that fld_index_init is slow on ZFS, so the OST is restarted before fld_index_init finishes.

            tappro Mikhail Pershin added a comment -

            Yes, that looks like the reason, but I still wonder why ldiskfs setups are not affected. Di, IIRC you implemented the local FLDB in the past; shouldn't the OST have a local copy of the FLDB?

            di.wang Di Wang (Inactive) added a comment -

            Hmm, I actually hit this problem in one of my patch tests:
            http://review.whamcloud.com/#/c/15275/ (patch set 7)

            See this failure https://testing.hpdd.intel.com/test_sets/e2a60dca-3ff5-11e5-afd4-5254006e85c2

            We should probably set the no_delay and allow_replay flags on the FLD update RPC, so I think this patch should fix the problem:

            diff --git a/lustre/fld/fld_request.c b/lustre/fld/fld_request.c
            index a59ab95..c82b098 100644
            --- a/lustre/fld/fld_request.c
            +++ b/lustre/fld/fld_request.c
            @@ -399,6 +399,14 @@ again:
             
                            req_capsule_set_size(&req->rq_pill, &RMF_GENERIC_DATA,
                                                 RCL_SERVER, PAGE_CACHE_SIZE);
            +
            +               /* This might happen before the import state becomes to FULL,
            +                * let's set allow_replay for this request to avoid deadlock
            +                * see LU-6273 */
            +               req->rq_allow_replay = 1;
            +               /* This will always use LWP connection, let's send the req
            +                * with no_delay flags, see above */
            +               req->rq_no_delay = 1;
                            break;
                    default:
                            rc = -EINVAL;
            

            I will push the fix to http://review.whamcloud.com/#/c/15275/

            niu Niu Yawei (Inactive) added a comment -

            I now see why the LWP on the OST can't connect to the failover MDT:

            On OST mount, we process the client config log to set up the LWP as follows (see the simplified sketch below):
            1> Process the LCFG_ADD_UUID record to set up the LWP device, then connect to the MDT (see lustre_lwp_setup());
            2> Process the LCFG_ADD_CONN record to add the failover connection.

            We can see that if the mount process blocks at step 1, it never gets a chance to add the failover connection, so the LWP can never switch to the failover node.
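
            A simplified model of that ordering (this is not the actual client_lwp_config_process(); the two helpers are made-up names): the llog records are handled one by one in the mount thread, so if the LCFG_ADD_UUID step blocks, the LCFG_ADD_CONN record that would add the failover NID is never reached.

            /* Simplified illustration only, not the actual config handler. */
            static int lwp_config_step(struct lustre_cfg *lcfg)
            {
                    switch (lcfg->lcfg_command) {
                    case LCFG_ADD_UUID:
                            /* step 1: set up the LWP device and connect to the
                             * MDT.  If this ends up waiting on an RPC that can
                             * only succeed once the failover NID is known, we
                             * never return from here... */
                            return lwp_setup_and_connect(lcfg);   /* hypothetical */
                    case LCFG_ADD_CONN:
                            /* step 2: ...so this record, which would add the
                             * failover connection, is never processed. */
                            return lwp_add_failover_conn(lcfg);   /* hypothetical */
                    default:
                            return 0;
                    }
            }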

            Unfortunately, the process can indeed block at step 1; see lustre_lwp_setup() -> lustre_lwp_connect():

                     rc = obd_connect(&env, &exp, lwp, uuid, data, NULL);
                    if (rc != 0) {
                            CERROR("%s: connect failed: rc = %d\n", lwp->obd_name, rc);
                    } else {
                            if (unlikely(lwp->obd_lwp_export != NULL))
                                    class_export_put(lwp->obd_lwp_export);
                            lwp->obd_lwp_export = class_export_get(exp);
                            lustre_notify_lwp_list(exp);
                    }
            

            lustre_notify_lwp_list() is called after obd_connect() returns successfully (which only means the CONNECT request could be sent, not that the connection is established). lustre_notify_lwp_list() then invokes the callback registered by ofd_register_lwp_callback(), which sends the FLD RPC; please note that at this point it is trying to send the RPC to the primary MDT0 (the connection to the failover MDT0 has not been added yet).

            Let's look at fld_client_rpc():

                    if (rc != 0) {
                            if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
                                    /* Since LWP is not replayable, so it will keep
                                     * trying unless umount happens, otherwise it would
                                     * cause unecessary failure of the application. */
                                    ptlrpc_req_finished(req);
                                    rc = 0;
                                    goto again;
                            }
                            GOTO(out_req, rc);
                    }
            

            It keeps trying to send the RPC to the primary MDT0 again and again in a loop, so in the end the mount process is blocked here forever (the failover connection never gets added).

            I'll think further about how to resolve this. Di, any thoughts? Thanks.

    People

      Assignee: tappro Mikhail Pershin
      Reporter: maloo Maloo
      Votes: 0
      Watchers: 12
