Lustre / LU-7903

recovery-small test_23: hang on umount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f72c79d4-efbb-11e5-8ddc-5254006e85c2.

      The sub-test test_23 failed with the following error:

      test failed to respond and timed out
      

      Please provide additional information about the failure here.

      Info required for matching: recovery-small 23

    Attachments

    Issue Links

    Activity

            [LU-7903] recovery-small test_23: hang on umount
            pjones Peter Jones added a comment -

            Landed for 2.9


            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22696/
            Subject: LU-7903 ptlrpc: leaked rs on difficult reply
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bd1441b92e31cd3c01cc2453aa627d139a1207f7

            niu Niu Yawei (Inactive) added a comment -

            The patch http://review.whamcloud.com/#/c/12530/ was merged to master on Jan 29, 2016. We then searched the Maloo failures: https://testing.hpdd.intel.com/test_sets/query?commit=Update+results&hosts=&page=9&query_bugs=&status%5B%5D=TIMEOUT&test_node%5Barchitecture_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node%5Bos_type_id%5D=&test_node_network%5Bnetwork_type_id%5D=&test_session%5Bend_date%5D=&test_session%5Bquery_recent_period%5D=&test_session%5Bstart_date%5D=&test_session%5Buser_id%5D=&test_set%5Btest_set_script_id%5D=f36cabd0-32c3-11e0-a61c-52540025f9ae&utf8=%E2%9C%93&warn%5Bnotice%5D=true

            This bug showed up immediately after that time point (it was reported as LU-7172 before, so I think LU-7172 is a duplicate of this one).
            laisiyao Lai Siyao added a comment -

            Indeed, this occurs when REPLY-ACK is disabled.

            In http://review.whamcloud.com/#/c/12530/, REPLY-ACK is disabled to save local locks; however, Commit-on-Sharing doesn't take effect for local operations, so we may not be able to replay in transno order if a reply is lost. That is a separate issue, and I'll fix it in another ticket.


            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/22696
            Subject: LU-7903 ptlrpc: leaked rs on difficult reply
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 802f41d1a78f588ab99ef8e1d081ac3da7cb8551


            niu Niu Yawei (Inactive) added a comment -

            It looks like there is a race that can cause an export reference leak on a difficult reply (when rs_no_ack is enabled):

            MDT umount calls server_disconnect_export() to disconnect its exports. server_disconnect_export() tries to complete all of the outstanding (difficult) replies, which ends up calling ptlrpc_handle_rs(), where the reply state is removed from the uncommitted list and LNetMDUnlink() is called to trigger an unlink event. reply_out_callback() catches the event and tries to finalize the reply state (releasing the export held by the reply state) with the following code:

                            spin_lock(&svcpt->scp_rep_lock);
                            spin_lock(&rs->rs_lock);
            
                            rs->rs_on_net = 0;
                            if (!rs->rs_no_ack ||
                                rs->rs_transno <=
                                rs->rs_export->exp_obd->obd_last_committed)
                                    ptlrpc_schedule_difficult_reply(rs);
            
                            spin_unlock(&rs->rs_lock);
                            spin_unlock(&svcpt->scp_rep_lock);
            

            We can see that if rs_no_ack is false (COS not enabled), or the transaction has been committed, the reply state is finalized here; otherwise it relies on the commit callback. However, for an export that has already been disconnected by umount (or for whatever other reason), the reply state has already been removed from the uncommitted list, so nobody will ever finalize the reply state (until the ptlrpc module is unloaded). I think adding one more check here, to see whether the reply state is still on the uncommitted list, is necessary.
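
            As a rough sketch of that extra check (assuming the uncommitted-replies linkage is rs->rs_obd_list, the list walked by the commit callback; this is only an illustration of the suggestion above, not necessarily the exact landed change):

                            spin_lock(&svcpt->scp_rep_lock);
                            spin_lock(&rs->rs_lock);

                            rs->rs_on_net = 0;
                            if (!rs->rs_no_ack ||
                                rs->rs_transno <=
                                rs->rs_export->exp_obd->obd_last_committed ||
                                /* rs already dropped from the uncommitted list
                                 * (e.g. export disconnected on umount), so no
                                 * commit callback will ever schedule it */
                                list_empty(&rs->rs_obd_list))
                                    ptlrpc_schedule_difficult_reply(rs);

                            spin_unlock(&rs->rs_lock);
                            spin_unlock(&svcpt->scp_rep_lock);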

            COS is usually not enabled, so this defect isn't triggered easily; however, with the change from LU-3538, rs_no_ack is set unintentionally, regardless of whether COS is enabled:

            @@ -2530,24 +2589,27 @@ static void mdt_save_lock(struct mdt_thread_info *info, struct lustre_handle *h,
                                    struct mdt_device *mdt = info->mti_mdt;
                                    struct ldlm_lock *lock = ldlm_handle2lock(h);
                                    struct ptlrpc_request *req = mdt_info_req(info);
            -                       int no_ack = 0;
            +                       int cos;
            +
            +                       cos = (mdt_cos_is_enabled(mdt) ||
            +                              mdt_slc_is_enabled(mdt));
            
                                    LASSERTF(lock != NULL, "no lock for cookie "LPX64"\n",
                                             h->cookie);
            +
                                    /* there is no request if mdt_object_unlock() is called
                                     * from mdt_export_cleanup()->mdt_add_dirty_flag() */
                                    if (likely(req != NULL)) {
                                            CDEBUG(D_HA, "request = %p reply state = %p"
                                                   " transno = "LPD64"\n", req,
                                                   req->rq_reply_state, req->rq_transno);
            -                               if (mdt_cos_is_enabled(mdt)) {
            -                                       no_ack = 1;
            +                               if (cos) {
                                                    ldlm_lock_downgrade(lock, LCK_COS);
                                                    mode = LCK_COS;
                                            }
            -                               ptlrpc_save_lock(req, h, mode, no_ack);
            +                               ptlrpc_save_lock(req, h, mode, cos);
                                    } else {
            -                               ldlm_lock_decref(h, mode);
            +                               mdt_fid_unlock(h, mode);
                                    }
            

            That could be the reason why it can be triggered more easily today.


            adilger Andreas Dilger added a comment -

            Is this bug a duplicate of LU-7858?

            niu Niu Yawei (Inactive) added a comment -

            Thank you for the information, James; I'm looking at this issue. The console log shows that no pending RPC, lock, or commit callback is holding the export, so I'm wondering whether it is because someone missed putting a reference.
            jamesanunez James Nunez (Inactive) added a comment - edited

            Niu,
            Here is a failure that should include your debug patch: https://testing.hpdd.intel.com/test_sets/38e4f102-79bf-11e6-b08e-5254006e85c2

            From the MDS 1 log:

            09:49:55:[26605.283675] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            09:49:55:[26605.288046] Lustre: Skipped 24 previous similar messages
            09:49:55:[26607.056252] LustreError: 30936:0:(client.c:1168:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880046125200 x1545343953854000/t0(0) o13->lustre-OST0005-osc-MDT0000@10.9.4.148@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
            09:49:55:[26607.064057] LustreError: 30936:0:(client.c:1168:ptlrpc_import_delay_req()) Skipped 23 previous similar messages
            09:49:55:[26618.703121] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
            09:49:55:[26618.711819] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
            09:49:55:[26634.716121] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
            09:49:55:[26634.723386] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
            09:49:55:[26666.728048] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
            09:49:55:[26666.734510] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
            09:49:55:[26680.276322] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
            09:49:55:[26680.281813] Lustre: Skipped 231 previous similar messages
            09:49:55:[26730.739132] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
            09:49:55:[26730.745251] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
            09:49:55:[26831.719420] Lustre: lustre-MDT0000: Not available for connect from 10.9.4.146@tcp (stopping)
            09:49:55:[26831.723246] Lustre: Skipped 481 previous similar messages
            09:49:55:[26858.750122] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
            09:49:55:[26858.761521] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
            09:49:55:[27000.758142] INFO: task umount:13385 blocked for more than 120 seconds.
            09:49:55:[27000.762771] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            09:49:55:[27000.765657] umount          D 000000000000f918     0 13385  13384 0x00000080
            

            Another at https://testing.hpdd.intel.com/test_sets/1dbc2ffc-7b23-11e6-8afd-5254006e85c2


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 18
