[LU-7903] recovery-small test_23: hang on umount Created: 22/Mar/16 Updated: 12/Dec/16 Resolved: 14/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f72c79d4-efbb-11e5-8ddc-5254006e85c2.

The sub-test test_23 failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

Info required for matching: recovery-small 23 |
| Comments |
| Comment by Bob Glossman (Inactive) [ 22/Mar/16 ] |
|
this may be a dup of |
| Comment by Richard Henwood (Inactive) [ 04/Apr/16 ] |
|
just seen on Master running review-dne-part-1: https://testing.hpdd.intel.com/test_sets/47205c78-f91f-11e5-a22e-5254006e85c2

I've just seen this and it doesn't look like a dup of

22:02:10:Call Trace:
22:02:10: [] schedule_timeout+0x192/0x2e0
22:02:10: [] ? process_timeout+0x0/0x10
22:02:10: [] obd_exports_barrier+0xb6/0x190 [obdclass]
22:02:10: [] mdt_device_fini+0x6ab/0x12e0 [mdt]
22:02:10: [] ? class_disconnect_exports+0x17d/0x2f0 [obdclass]
22:02:10: [] class_cleanup+0x572/0xd20 [obdclass]
22:02:10: [] ? class_name2dev+0x56/0xe0 [obdclass]
22:02:10: [] class_process_config+0x1b66/0x24c0 [obdclass]
22:02:10: [] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:02:10: [] class_manual_cleanup+0x4bf/0xc90 [obdclass]
22:02:10: [] ? class_name2dev+0x56/0xe0 [obdclass]
22:02:10: [] server_put_super+0x8bc/0xcd0 [obdclass]
22:02:10: [] generic_shutdown_super+0x5b/0xe0
22:02:10: [] kill_anon_super+0x16/0x60
22:02:10: [] lustre_kill_super+0x36/0x60 [obdclass]
22:02:10: [] deactivate_super+0x57/0x80
22:02:10: [] mntput_no_expire+0xbf/0x110
22:02:10: [] sys_umount+0x7b/0x3a0
22:02:10: [] system_call_fastpath+0x16/0x1b

22:37:59:********** Timeout by autotest system ********** |
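For context on where umount is stuck in the trace above: obd_exports_barrier() in obdclass simply waits until the device's list of unlinked exports (exports removed from the hash but still holding references) drains to empty. Below is a rough, simplified sketch of that wait loop; the loop structure, timings, and message format are assumptions for illustration rather than the verbatim genops.c source.

/* Rough sketch only -- not the verbatim obdclass source.  The umount
 * thread seen above is blocked here: obd_exports_barrier() keeps
 * sleeping until obd_unlinked_exports is empty, so a single leaked
 * export reference makes umount hang forever and eventually trips the
 * autotest timeout. */
static void exports_barrier_sketch(struct obd_device *obd)
{
	int waited = 2;

	spin_lock(&obd->obd_dev_lock);
	while (!list_empty(&obd->obd_unlinked_exports)) {
		spin_unlock(&obd->obd_dev_lock);

		/* sleep, then re-check; the interval doubles so the
		 * console is not flooded with warnings */
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_timeout(cfs_time_seconds(waited));
		if (waited > 5)
			LCONSOLE_WARN("%s is waiting for obd_unlinked_exports more than %d seconds.\n",
				      obd->obd_name, waited);
		waited *= 2;

		spin_lock(&obd->obd_dev_lock);
	}
	spin_unlock(&obd->obd_dev_lock);
}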
| Comment by Bruno Faccini (Inactive) [ 22/Apr/16 ] |
|
+1 at https://testing.hpdd.intel.com/test_sets/4e388a9a-0869-11e6-b5f1-5254006e85c2 during a master patch review. |
| Comment by Emoly Liu [ 22/Apr/16 ] |
|
Another failure at https://testing.hpdd.intel.com/test_sets/5626d550-0845-11e6-9e5d-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 22/Apr/16 ] |
|
another on master: |
| Comment by Richard Henwood (Inactive) [ 26/Apr/16 ] |
|
another failure on master, from review-dne-part-1 https://testing.hpdd.intel.com/test_sets/7d9d1360-0b0b-11e6-855a-5254006e85c2 |
| Comment by Saurabh Tandan (Inactive) [ 31/May/16 ] |
|
| Another failure on master, from review-dne-part-1: |
| Comment by Jian Yu [ 06/Jun/16 ] |
|
More failures on master branch: |
| Comment by Lai Siyao [ 14/Jun/16 ] |
Jun 5 15:27:19 trevis-10vm4 kernel: INFO: task umount:3347 blocked for more than 120 seconds.
Jun 5 15:27:19 trevis-10vm4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 5 15:27:19 trevis-10vm4 kernel: umount D 000000000000d918 0 3347 3346 0x00000080
Jun 5 15:27:19 trevis-10vm4 kernel: ffff88004d367aa0 0000000000000082 ffff880029f20b80 ffff88004d367fd8
Jun 5 15:27:19 trevis-10vm4 kernel: ffff88004d367fd8 ffff88004d367fd8 ffff880029f20b80 ffffffff81daaa00
Jun 5 15:27:19 trevis-10vm4 kernel: ffff88004d367ad0 0000000101bb1f02 ffffffff81daaa00 000000000000d918
Jun 5 15:27:19 trevis-10vm4 kernel: Call Trace:
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff8163b349>] schedule+0x29/0x70
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff81638fa5>] schedule_timeout+0x175/0x2d0
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff8108bf30>] ? internal_add_timer+0x70/0x70
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07b53d4>] obd_exports_barrier+0xc4/0x1a0 [obdclass]
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffffa0de06a0>] mdt_device_fini+0x310/0xfc0 [mdt]
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07ce3cc>] class_cleanup+0x8dc/0xd70 [obdclass]
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07d0dac>] class_process_config+0x1e2c/0x2f70 [obdclass]
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffffa05ee957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07d1fdf>] class_manual_cleanup+0xef/0x810 [obdclass]
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffffa08032de>] server_put_super+0x8de/0xcd0 [obdclass]
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff811e0ca6>] generic_shutdown_super+0x56/0xe0
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff811e1082>] kill_anon_super+0x12/0x20
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffffa07d5742>] lustre_kill_super+0x32/0x50 [obdclass]
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff811e1439>] deactivate_locked_super+0x49/0x60
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff811e1a36>] deactivate_super+0x46/0x60
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff811fe9a5>] mntput_no_expire+0xc5/0x120
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff811ffadf>] SyS_umount+0x9f/0x3c0
Jun 5 15:27:19 trevis-10vm4 kernel: [<ffffffff816463c9>] system_call_fastpath+0x16/0x1

MDT umount hung, but I can't find anything useful in the logs; dump_exports() was called long before the log dump, so its output is missing from the logs. I'm wondering whether this can be improved. Besides, I can't reproduce it in my local test environment. |
| Comment by Bruno Faccini (Inactive) [ 04/Jul/16 ] |
|
+1 on master : https://testing.hpdd.intel.com/test_sets/0709014c-407f-11e6-acf3-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 18/Jul/16 ] |
|
another on master: |
| Comment by Bruno Faccini (Inactive) [ 26/Jul/16 ] |
|
+1 on master at https://testing.hpdd.intel.com/test_sets/ce2d9280-52fc-11e6-bf87-5254006e85c2 |
| Comment by Emoly Liu [ 26/Jul/16 ] |
|
Another failure on master: |
| Comment by Joseph Gmitter (Inactive) [ 29/Jul/16 ] |
|
Niu will be cooking a debug patch. |
| Comment by Gerrit Updater [ 29/Jul/16 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/21599 |
| Comment by Niu Yawei (Inactive) [ 29/Jul/16 ] |
|
Unfortunately, the export information was always being truncated. I cooked a debug patch which dumps the export information to the console: http://review.whamcloud.com/21599 |
| Comment by nasf (Inactive) [ 12/Aug/16 ] |
|
Hit it on master: |
| Comment by Emoly Liu [ 12/Aug/16 ] |
|
Hit on master: https://testing.hpdd.intel.com/test_sets/d7dfaba2-5e1d-11e6-b2e2-5254006e85c2 |
| Comment by Bruno Faccini (Inactive) [ 13/Aug/16 ] |
|
+1 on master : https://testing.hpdd.intel.com/test_sets/7d8efe92-60ec-11e6-b2e2-5254006e85c2 |
| Comment by Niu Yawei (Inactive) [ 15/Aug/16 ] |
|
The problem is reproduced with more debug information: https://testing.hpdd.intel.com/test_sets/34ddc070-605b-11e6-906c-5254006e85c2

02:35:51:[ 1394.285106] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
02:35:51:[ 1394.295103] Lustre: lustre-MDT0000: UNLINKED ffff8800429cb400 ec87ba17-1f41-6540-2071-22cb49f34d84 10.9.5.174@tcp 1 (0 0 0) 1 0 0 0: (null) 4294967369

It looks like 10.9.5.174 is a Lustre client; probably the export is held by some lock? I'll scrutinize that part of the code later. |
| Comment by Niu Yawei (Inactive) [ 16/Aug/16 ] |
|
The debug message shows the export refcount is only 1, which means no locks are attached to the export; I'm afraid we missed putting an export reference somewhere. Here is one such place in the HSM code:

diff --git a/lustre/mdt/mdt_hsm_cdt_agent.c b/lustre/mdt/mdt_hsm_cdt_agent.c
index 38bc739..dc8e9ad 100644
--- a/lustre/mdt/mdt_hsm_cdt_agent.c
+++ b/lustre/mdt/mdt_hsm_cdt_agent.c
@@ -574,6 +574,8 @@ int mdt_hsm_agent_send(struct mdt_thread_info *mti,
 	 */
 	exp = cfs_hash_lookup(mdt2obd_dev(mdt)->obd_uuid_hash, &uuid);
 	if (exp == NULL || exp->exp_disconnected) {
+		if (exp != NULL)
+			class_export_put(exp);
 		/* This should clean up agents on evicted exports */
 		rc = -ENOENT;
 		CERROR("%s: agent uuid (%s) not found, unregistering:"

Not sure if it is the real cause of this failure, but it needs to be fixed anyway. |
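The rule the diff above restores: cfs_hash_lookup() returns the export with an extra reference taken, so every exit path, including the "disconnected" error path, must balance it with class_export_put(); otherwise the export never reaches zero references and umount waits forever in obd_exports_barrier(). A hypothetical, simplified illustration of that discipline follows (the function name, error handling, and overall shape are illustrative, not taken from the patch):

/* Hedged sketch of the lookup/put discipline the patch enforces. */
static int use_export_sketch(struct obd_device *obd, struct obd_uuid *uuid)
{
	struct obd_export *exp;
	int rc = 0;

	exp = cfs_hash_lookup(obd->obd_uuid_hash, uuid);
	if (exp == NULL)
		return -ENOENT;

	if (exp->exp_disconnected) {
		rc = -ENOENT;
		goto out_put;	/* we still own the lookup reference here */
	}

	/* ... use the export ... */

out_put:
	class_export_put(exp);	/* balance the reference from cfs_hash_lookup() */
	return rc;
}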
| Comment by Gerrit Updater [ 16/Aug/16 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/21942 |
| Comment by nasf (Inactive) [ 20/Aug/16 ] |
|
Another failure instance on master: |
| Comment by Gerrit Updater [ 22/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21942/ |
| Comment by Bruno Faccini (Inactive) [ 22/Aug/16 ] |
|
Hello Niu, |
| Comment by Niu Yawei (Inactive) [ 24/Aug/16 ] |
|
Bruno, indeed, I agree with you; it's not necessarily caused by the reference leak in the HSM code. I've updated the debug patch and hope it can get us more useful information once the problem is hit again. |
| Comment by Gerrit Updater [ 08/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21599/ |
| Comment by Peter Jones [ 08/Sep/16 ] |
|
Landed for 2.9 |
| Comment by Niu Yawei (Inactive) [ 08/Sep/16 ] |
|
The landed patch only improves the debug information; it does not fix the real problem. |
| Comment by nasf (Inactive) [ 13/Sep/16 ] |
|
Another failure instance on master: |
| Comment by James Nunez (Inactive) [ 13/Sep/16 ] |
|
Niu,

From the MDS 1 log:

09:49:55:[26605.283675] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
09:49:55:[26605.288046] Lustre: Skipped 24 previous similar messages
09:49:55:[26607.056252] LustreError: 30936:0:(client.c:1168:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880046125200 x1545343953854000/t0(0) o13->lustre-OST0005-osc-MDT0000@10.9.4.148@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
09:49:55:[26607.064057] LustreError: 30936:0:(client.c:1168:ptlrpc_import_delay_req()) Skipped 23 previous similar messages
09:49:55:[26618.703121] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26618.711819] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0: (null) 4294967369 stale:0
09:49:55:[26634.716121] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26634.723386] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0: (null) 4294967369 stale:0
09:49:55:[26666.728048] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26666.734510] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0: (null) 4294967369 stale:0
09:49:55:[26680.276322] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
09:49:55:[26680.281813] Lustre: Skipped 231 previous similar messages
09:49:55:[26730.739132] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26730.745251] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0: (null) 4294967369 stale:0
09:49:55:[26831.719420] Lustre: lustre-MDT0000: Not available for connect from 10.9.4.146@tcp (stopping)
09:49:55:[26831.723246] Lustre: Skipped 481 previous similar messages
09:49:55:[26858.750122] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
09:49:55:[26858.761521] Lustre: lustre-MDT0000: UNLINKED ffff8800471ee000 dbac7ff5-8b18-2c65-a249-73e04ce7d6a8 10.9.4.146@tcp 1 (0 0 0) 1 0 0 0: (null) 4294967369 stale:0
09:49:55:[27000.758142] INFO: task umount:13385 blocked for more than 120 seconds.
09:49:55:[27000.762771] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
09:49:55:[27000.765657] umount D 000000000000f918 0 13385 13384 0x00000080

Another at https://testing.hpdd.intel.com/test_sets/1dbc2ffc-7b23-11e6-8afd-5254006e85c2 |
| Comment by Niu Yawei (Inactive) [ 19/Sep/16 ] |
|
Thank you for the information, James; I'm looking into this issue. The console log shows that no pending RPC, lock, or commit callback is holding the export, so I'm wondering if it's because someone missed putting a reference. |
| Comment by Andreas Dilger [ 20/Sep/16 ] |
|
Is this bug a duplicate of |
| Comment by Niu Yawei (Inactive) [ 23/Sep/16 ] |
|
Looks like there is a race that can cause an export reference leak on a difficult reply (when rs_no_ack is enabled): MDT umount calls server_disconnect_export() to disconnect its exports, and server_disconnect_export() tries to complete all of the export's outstanding (difficult) replies. That ends up calling ptlrpc_handle_rs(), where the reply state is removed from the uncommitted list and LNetMDUnlink() is called to trigger an unlink event. reply_out_callback() catches the event and tries to finalize the reply state (which would release the export reference held by the reply state) with the following code:

	spin_lock(&svcpt->scp_rep_lock);
	spin_lock(&rs->rs_lock);
	rs->rs_on_net = 0;
	if (!rs->rs_no_ack ||
	    rs->rs_transno <=
	    rs->rs_export->exp_obd->obd_last_committed)
		ptlrpc_schedule_difficult_reply(rs);
	spin_unlock(&rs->rs_lock);
	spin_unlock(&svcpt->scp_rep_lock);

We can see that if rs_no_ack is false (COS not enabled), or the transaction has been committed, the reply state is finalized here; otherwise it relies on the commit callback. However, for an export that has already been disconnected by umount (or for whatever other reason), the reply state has already been removed from the uncommitted list, so nobody will ever finalize the reply state (until the ptlrpc module is unloaded). I think adding one more check here, to see whether the reply state is still on the uncommitted list, is necessary.

COS is usually not enabled, so this defect won't be triggered easily; however, the following change made it much easier to hit:

@@ -2530,24 +2589,27 @@ static void mdt_save_lock(struct mdt_thread_info *info, struct lustre_handle *h,
 	struct mdt_device *mdt = info->mti_mdt;
 	struct ldlm_lock *lock = ldlm_handle2lock(h);
 	struct ptlrpc_request *req = mdt_info_req(info);
-	int no_ack = 0;
+	int cos;
+
+	cos = (mdt_cos_is_enabled(mdt) ||
+	       mdt_slc_is_enabled(mdt));
 
 	LASSERTF(lock != NULL, "no lock for cookie "LPX64"\n", h->cookie);
+
 	/* there is no request if mdt_object_unlock() is called
 	 * from mdt_export_cleanup()->mdt_add_dirty_flag() */
 	if (likely(req != NULL)) {
 		CDEBUG(D_HA, "request = %p reply state = %p"
 		       " transno = "LPD64"\n", req,
 		       req->rq_reply_state, req->rq_transno);
-		if (mdt_cos_is_enabled(mdt)) {
-			no_ack = 1;
+		if (cos) {
 			ldlm_lock_downgrade(lock, LCK_COS);
 			mode = LCK_COS;
 		}
-		ptlrpc_save_lock(req, h, mode, no_ack);
+		ptlrpc_save_lock(req, h, mode, cos);
 	} else {
-		ldlm_lock_decref(h, mode);
+		mdt_fid_unlock(h, mode);
 	}

That could be the reason why it can be triggered more easily today. |
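A minimal sketch of the extra check proposed above, assuming the reply state is linked onto the obd's uncommitted-replies list via a field named rs_obd_list; the actual fix is in http://review.whamcloud.com/22696 and may differ in detail:

	/* Hedged sketch only: also schedule the difficult reply when it
	 * has already been taken off the uncommitted list (e.g. by
	 * server_disconnect_export() during umount), since no commit
	 * callback will ever finalize it then and the export reference
	 * it holds would leak. */
	spin_lock(&svcpt->scp_rep_lock);
	spin_lock(&rs->rs_lock);
	rs->rs_on_net = 0;
	if (!rs->rs_no_ack ||
	    rs->rs_transno <=
	    rs->rs_export->exp_obd->obd_last_committed ||
	    list_empty(&rs->rs_obd_list))	/* assumed linkage field */
		ptlrpc_schedule_difficult_reply(rs);
	spin_unlock(&rs->rs_lock);
	spin_unlock(&svcpt->scp_rep_lock);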
| Comment by Gerrit Updater [ 23/Sep/16 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/22696 |
| Comment by Lai Siyao [ 23/Sep/16 ] |
|
Indeed, this occurs when REPLY-ACK is disabled. In http://review.whamcloud.com/#/c/12530/, REPLY-ACK is disabled to save local locks; however, for local operations Commit-on-Sharing doesn't take effect, so we may not be able to replay in transno order if the reply is lost. This is another issue; I'll fix it in another ticket. |
| Comment by Niu Yawei (Inactive) [ 23/Sep/16 ] |
|
The patch http://review.whamcloud.com/#/c/12530/ was merged to master on Jan 29, 2016; searching the maloo failures, this bug showed up immediately after that time point (was reported as |
| Comment by Gerrit Updater [ 13/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22696/ |
| Comment by Peter Jones [ 14/Oct/16 ] |
|
Landed for 2.9 |