[LU-7903] recovery-small test_23: hang on umount Created: 22/Mar/16  Updated: 12/Dec/16  Resolved: 14/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7022 recovery-small test_100: hung on umount Open
is related to LU-7172 replay-single test_70d hung on MDT un... Resolved
is related to LU-7858 test_23: 'test failed to respond and ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f72c79d4-efbb-11e5-8ddc-5254006e85c2.

The sub-test test_23 failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

Info required for matching: recovery-small 23



 Comments   
Comment by Bob Glossman (Inactive) [ 22/Mar/16 ]

this may be a dup of LU-7172

Comment by Richard Henwood (Inactive) [ 04/Apr/16 ]

just seen on Master running review-dne-part-1:

https://testing.hpdd.intel.com/test_sets/47205c78-f91f-11e5-a22e-5254006e85c2

I've just seen this and it doesn't look like a dup of LU-7172 to me: the call trace from the MDT starts:

22:02:10:Call Trace:
22:02:10: [] schedule_timeout+0x192/0x2e0
22:02:10: [] ? process_timeout+0x0/0x10
22:02:10: [] obd_exports_barrier+0xb6/0x190 [obdclass]
22:02:10: [] mdt_device_fini+0x6ab/0x12e0 [mdt]
22:02:10: [] ? class_disconnect_exports+0x17d/0x2f0 [obdclass]
22:02:10: [] class_cleanup+0x572/0xd20 [obdclass]
22:02:10: [] ? class_name2dev+0x56/0xe0 [obdclass]
22:02:10: [] class_process_config+0x1b66/0x24c0 [obdclass]
22:02:10: [] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:02:10: [] class_manual_cleanup+0x4bf/0xc90 [obdclass]
22:02:10: [] ? class_name2dev+0x56/0xe0 [obdclass]
22:02:10: [] server_put_super+0x8bc/0xcd0 [obdclass]
22:02:10: [] generic_shutdown_super+0x5b/0xe0
22:02:10: [] kill_anon_super+0x16/0x60
22:02:10: [] lustre_kill_super+0x36/0x60 [obdclass]
22:02:10: [] deactivate_super+0x57/0x80
22:02:10: [] mntput_no_expire+0xbf/0x110
22:02:10: [] sys_umount+0x7b/0x3a0
22:02:10: [] system_call_fastpath+0x16/0x1b
22:37:59:********** Timeout by autotest system **********
Comment by Bruno Faccini (Inactive) [ 22/Apr/16 ]

+1 at https://testing.hpdd.intel.com/test_sets/4e388a9a-0869-11e6-b5f1-5254006e85c2 during a master patch review.

Comment by Emoly Liu [ 22/Apr/16 ]

Another failure at https://testing.hpdd.intel.com/test_sets/5626d550-0845-11e6-9e5d-5254006e85c2

Comment by Bob Glossman (Inactive) [ 22/Apr/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/d2808eda-086c-11e6-9e5d-5254006e85c2

Comment by Richard Henwood (Inactive) [ 26/Apr/16 ]

another failure on master, from review-dne-part-1

https://testing.hpdd.intel.com/test_sets/7d9d1360-0b0b-11e6-855a-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 31/May/16 ]

Another failure on master, from review-dne-part-1:
https://testing.hpdd.intel.com/test_sets/48332280-248f-11e6-aac3-5254006e85c2
This issue has occurred around 46 times in the past 30 days.

Comment by Jian Yu [ 06/Jun/16 ]

More failures on master branch:
https://testing.hpdd.intel.com/test_sets/54653454-2ba5-11e6-80b9-5254006e85c2
https://testing.hpdd.intel.com/test_sets/61f7efde-2b3d-11e6-a0ce-5254006e85c2

Comment by Lai Siyao [ 14/Jun/16 ]
Jun  5 15:27:19 trevis-10vm4 kernel: INFO: task umount:3347 blocked for more than 120 seconds.
Jun  5 15:27:19 trevis-10vm4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  5 15:27:19 trevis-10vm4 kernel: umount          D 000000000000d918     0  3347   3346 0x00000080
Jun  5 15:27:19 trevis-10vm4 kernel: ffff88004d367aa0 0000000000000082 ffff880029f20b80 ffff88004d367fd8
Jun  5 15:27:19 trevis-10vm4 kernel: ffff88004d367fd8 ffff88004d367fd8 ffff880029f20b80 ffffffff81daaa00
Jun  5 15:27:19 trevis-10vm4 kernel: ffff88004d367ad0 0000000101bb1f02 ffffffff81daaa00 000000000000d918
Jun  5 15:27:19 trevis-10vm4 kernel: Call Trace:
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff8163b349>] schedule+0x29/0x70
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff81638fa5>] schedule_timeout+0x175/0x2d0
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff8108bf30>] ? internal_add_timer+0x70/0x70
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07b53d4>] obd_exports_barrier+0xc4/0x1a0 [obdclass]
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffffa0de06a0>] mdt_device_fini+0x310/0xfc0 [mdt]
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07ce3cc>] class_cleanup+0x8dc/0xd70 [obdclass]
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07d0dac>] class_process_config+0x1e2c/0x2f70 [obdclass]
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffffa05ee957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07d1fdf>] class_manual_cleanup+0xef/0x810 [obdclass]
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffffa08032de>] server_put_super+0x8de/0xcd0 [obdclass]
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff811e0ca6>] generic_shutdown_super+0x56/0xe0
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff811e1082>] kill_anon_super+0x12/0x20
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07d5742>] lustre_kill_super+0x32/0x50 [obdclass]
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff811e1439>] deactivate_locked_super+0x49/0x60
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff811e1a36>] deactivate_super+0x46/0x60
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff811fe9a5>] mntput_no_expire+0xc5/0x120
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff811ffadf>] SyS_umount+0x9f/0x3c0
Jun  5 15:27:19 trevis-10vm4 kernel: [<ffffffff816463c9>] system_call_fastpath+0x16/0x1

The MDT umount hung, but I can't find anything useful in the logs. dump_exports() was called long before the log dump, so the export information is missing from the logs; I wonder whether this can be improved.

Besides, I can't reproduce it in my local test environment.

Comment by Bruno Faccini (Inactive) [ 04/Jul/16 ]

+1 on master : https://testing.hpdd.intel.com/test_sets/0709014c-407f-11e6-acf3-5254006e85c2

Comment by Bob Glossman (Inactive) [ 18/Jul/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/5e5bed64-4ce7-11e6-a80f-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 26/Jul/16 ]

+1 on master at https://testing.hpdd.intel.com/test_sets/ce2d9280-52fc-11e6-bf87-5254006e85c2

Comment by Emoly Liu [ 26/Jul/16 ]

Another failure on master:
https://testing.hpdd.intel.com/test_sets/9e634b92-5263-11e6-bf87-5254006e85c2

Comment by Joseph Gmitter (Inactive) [ 29/Jul/16 ]

Niu will be cooking a debug patch.

Comment by Gerrit Updater [ 29/Jul/16 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/21599
Subject: LU-7903 mdt: dump exports information on console
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0fd2cb8ad3591fc1db3f077ba466b26d0a6171e8

Comment by Niu Yawei (Inactive) [ 29/Jul/16 ]

Unfortunately, the export information was always being truncated. I cooked a debug patch which dumps the export information on the console: http://review.whamcloud.com/21599

Comment by nasf (Inactive) [ 12/Aug/16 ]

Hit it on master:
https://testing.hpdd.intel.com/test_sets/202938f2-600c-11e6-906c-5254006e85c2

Comment by Emoly Liu [ 12/Aug/16 ]

Hit on master: https://testing.hpdd.intel.com/test_sets/d7dfaba2-5e1d-11e6-b2e2-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 13/Aug/16 ]

+1 on master : https://testing.hpdd.intel.com/test_sets/7d8efe92-60ec-11e6-b2e2-5254006e85c2

Comment by Niu Yawei (Inactive) [ 15/Aug/16 ]

The problem is reproduced with more debug information: https://testing.hpdd.intel.com/test_sets/34ddc070-605b-11e6-906c-5254006e85c2

02:35:51:[ 1394.285106] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
02:35:51:[ 1394.295103] Lustre: lustre-MDT0000: UNLINKED ffff8800429cb400 ec87ba17-1f41-6540-2071-22cb49f34d84 10.9.5.174@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369

It looks like 10.9.5.174 is a Lustre client; probably the export is held by some lock? I'll scrutinize that part of the code later.

Comment by Niu Yawei (Inactive) [ 16/Aug/16 ]

The debug message shows the export refcount is only 1, which means no locks are attached to the export; I'm afraid we missed putting an export reference somewhere. Here is one such place in the HSM code:

diff --git a/lustre/mdt/mdt_hsm_cdt_agent.c b/lustre/mdt/mdt_hsm_cdt_agent.c
index 38bc739..dc8e9ad 100644
--- a/lustre/mdt/mdt_hsm_cdt_agent.c
+++ b/lustre/mdt/mdt_hsm_cdt_agent.c
@@ -574,6 +574,8 @@ int mdt_hsm_agent_send(struct mdt_thread_info *mti,
         */
        exp = cfs_hash_lookup(mdt2obd_dev(mdt)->obd_uuid_hash, &uuid);
        if (exp == NULL || exp->exp_disconnected) {
+               if (exp != NULL)
+                       class_export_put(exp);
                /* This should clean up agents on evicted exports */
                rc = -ENOENT;
                CERROR("%s: agent uuid (%s) not found, unregistering:"

Not sure if it's the real cause of this failure, but it needs to be fixed anyway.

Comment by Gerrit Updater [ 16/Aug/16 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/21942
Subject: LU-7903 hsm: leaked export refcount
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dab8cd2e6918eb11a5d656ff2687ff9067620786

Comment by nasf (Inactive) [ 20/Aug/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/db682194-6662-11e6-aa74-5254006e85c2

Comment by Gerrit Updater [ 22/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21942/
Subject: LU-7903 hsm: leaked export refcount
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6d57484b718a986aea0ac83802b1b966df576ed8

Comment by Bruno Faccini (Inactive) [ 22/Aug/16 ]

Hello Niu,
I also encountered a similar problem with one of my master review at https://testing.hpdd.intel.com/test_sessions/4168f756-60ec-11e6-b2e2-5254006e85c2f. The failed session had been linked to this ticket as the cause. But looking into the whole tests suite/sequence, there was no previous execution of sanity-hsm and my own/concerned change is only affecting conf-sanity/test_41c that had been run before but with multiple MDT stop/fail successful in between. So I may be wrong, but unless there is some HSM activity outside sanity-hsm in others tests suites, I wonder if, as you already suspected, there may be other places in our code where we miss to put an export reference.

Comment by Niu Yawei (Inactive) [ 24/Aug/16 ]

Bruno, indeed, I agree with you: it's not necessarily caused by the reference leak in the HSM code. I've updated the debug patch and hope it can get us more useful information once this is hit again.

Comment by Gerrit Updater [ 08/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21599/
Subject: LU-7903 mdt: dump exports information on console
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f1c9166dc3172beecbd659686fe419f52138cefd

Comment by Peter Jones [ 08/Sep/16 ]

Landed for 2.9

Comment by Niu Yawei (Inactive) [ 08/Sep/16 ]

The landed patch is to improve the debug information, not fixing the real problem.

Comment by nasf (Inactive) [ 13/Sep/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/62e575f2-7682-11e6-b08e-5254006e85c2

Comment by James Nunez (Inactive) [ 13/Sep/16 ]

Niu,
Here is a failure that should include your debug patch: https://testing.hpdd.intel.com/test_sets/38e4f102-79bf-11e6-b08e-5254006e85c2

From the MDS 1 log:

09:49:55:[26605.283675] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
09:49:55:[26605.288046] Lustre: Skipped 24 previous similar messages
09:49:55:[26607.056252] LustreError: 30936:0:(client.c:1168:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880046125200 x1545343953854000/t0(0) o13->lustre-OST0005-osc-MDT0000@10.9.4.148@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
09:49:55:[26607.064057] LustreError: 30936:0:(client.c:1168:ptlrpc_import_delay_req()) Skipped 23 previous similar messages
09:49:55:[26618.703121] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26618.711819] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
09:49:55:[26634.716121] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26634.723386] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
09:49:55:[26666.728048] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26666.734510] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
09:49:55:[26680.276322] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
09:49:55:[26680.281813] Lustre: Skipped 231 previous similar messages
09:49:55:[26730.739132] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26730.745251] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
09:49:55:[26831.719420] Lustre: lustre-MDT0000: Not available for connect from 10.9.4.146@tcp (stopping)
09:49:55:[26831.723246] Lustre: Skipped 481 previous similar messages
09:49:55:[26858.750122] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26858.761521] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0:           (null)  4294967369 stale:0
09:49:55:[27000.758142] INFO: task umount:13385 blocked for more than 120 seconds.
09:49:55:[27000.762771] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
09:49:55:[27000.765657] umount          D 000000000000f918     0 13385  13384 0x00000080

Another at https://testing.hpdd.intel.com/test_sets/1dbc2ffc-7b23-11e6-8afd-5254006e85c2

Comment by Niu Yawei (Inactive) [ 19/Sep/16 ]

Thank you for the information, James; I'm looking at this issue. The console log shows that no pending RPC, lock, or commit callback is holding the export, so I'm wondering if it's because someone missed putting a reference.

Comment by Andreas Dilger [ 20/Sep/16 ]

Is this bug a duplicate of LU-7858?

Comment by Niu Yawei (Inactive) [ 23/Sep/16 ]

It looks like there is a race that can cause an export reference leak on a difficult reply (when rs_no_ack is enabled):

MDT umount calls server_disconnect_export() to disconnect its exports. server_disconnect_exports() tries to complete all outstanding (difficult) replies, which eventually calls into ptlrpc_handle_rs(), where the reply state is removed from the uncommitted list and LNetMDUnlink() is called to trigger an unlink event. reply_out_callback() catches the event and tries to finalize the reply state (releasing the export reference held by the reply state) with the following code:

                spin_lock(&svcpt->scp_rep_lock);
                spin_lock(&rs->rs_lock);

                rs->rs_on_net = 0;
                if (!rs->rs_no_ack ||
                    rs->rs_transno <=
                    rs->rs_export->exp_obd->obd_last_committed)
                        ptlrpc_schedule_difficult_reply(rs);

                spin_unlock(&rs->rs_lock);
                spin_unlock(&svcpt->scp_rep_lock);

We can see that if rs_no_ack is false (COS not enabled), or the transaction has been committed, the reply state is finalized here; otherwise, finalization relies on the commit callback. However, for an export that has already been disconnected by umount (or for whatever other reason), the reply state has already been removed from the uncommitted list, so nobody will ever finalize it (until the ptlrpc module is unloaded). I think adding one more check here, to see whether the reply state is still on the uncommitted list, is necessary.

COS is usually not enabled, so this defect isn't easily triggered; however, with the change from LU-3538, rs_no_ack is set inadvertently regardless of whether COS is enabled:

@@ -2530,24 +2589,27 @@ static void mdt_save_lock(struct mdt_thread_info *info, struct lustre_handle *h,
                        struct mdt_device *mdt = info->mti_mdt;
                        struct ldlm_lock *lock = ldlm_handle2lock(h);
                        struct ptlrpc_request *req = mdt_info_req(info);
-                       int no_ack = 0;
+                       int cos;
+
+                       cos = (mdt_cos_is_enabled(mdt) ||
+                              mdt_slc_is_enabled(mdt));

                        LASSERTF(lock != NULL, "no lock for cookie "LPX64"\n",
                                 h->cookie);
+
                        /* there is no request if mdt_object_unlock() is called
                         * from mdt_export_cleanup()->mdt_add_dirty_flag() */
                        if (likely(req != NULL)) {
                                CDEBUG(D_HA, "request = %p reply state = %p"
                                       " transno = "LPD64"\n", req,
                                       req->rq_reply_state, req->rq_transno);
-                               if (mdt_cos_is_enabled(mdt)) {
-                                       no_ack = 1;
+                               if (cos) {
                                        ldlm_lock_downgrade(lock, LCK_COS);
                                        mode = LCK_COS;
                                }
-                               ptlrpc_save_lock(req, h, mode, no_ack);
+                               ptlrpc_save_lock(req, h, mode, cos);
                        } else {
-                               ldlm_lock_decref(h, mode);
+                               mdt_fid_unlock(h, mode);
                        }

That could be why it is triggered more easily today.

Comment by Gerrit Updater [ 23/Sep/16 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/22696
Subject: LU-7903 ptlrpc: leaked rs on difficult reply
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 802f41d1a78f588ab99ef8e1d081ac3da7cb8551

Comment by Lai Siyao [ 23/Sep/16 ]

Indeed, this occurs when REPLY-ACK is disabled.

In http://review.whamcloud.com/#/c/12530/, REPLY-ACK is disabled to save local locks; however, for local operations Commit-on-Sharing doesn't take effect, so we may not be able to replay in transno order if a reply is lost. That is a separate issue; I'll fix it in another ticket.

Comment by Niu Yawei (Inactive) [ 23/Sep/16 ]

The patch http://review.whamcloud.com/#/c/12530/ was merged to master on Jan 29, 2016. Searching the Maloo failures:

https://testing.hpdd.intel.com/test_sets/query?commit=Update+results&hosts=&page=9&query_bugs=&status%5B%5D=TIMEOUT&test_node%5Barchitecture_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node%5Bos_type_id%5D=&test_node_network%5Bnetwork_type_id%5D=&test_session%5Bend_date%5D=&test_session%5Bquery_recent_period%5D=&test_session%5Bstart_date%5D=&test_session%5Buser_id%5D=&test_set%5Btest_set_script_id%5D=f36cabd0-32c3-11e0-a61c-52540025f9ae&utf8=%E2%9C%93&warn%5Bnotice%5D=true

This bug showed up immediately after that point (it was reported as LU-7172 before, so I think LU-7172 is a dup of this).

Comment by Gerrit Updater [ 13/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22696/
Subject: LU-7903 ptlrpc: leaked rs on difficult reply
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bd1441b92e31cd3c01cc2453aa627d139a1207f7

Comment by Peter Jones [ 14/Oct/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:12:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.