[LU-6705] MDT hung at umount under DNE mode Created: 10/Jun/15  Updated: 14/Jul/15  Resolved: 14/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: nasf (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6720 recovery-small test_111: mds crashed ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It can be reproduced with the following steps:
1) Create some files across multiple MDTs.
2) umount the MDTs
3) mount the MDTs
4) umount the MDTs
5) mount the MDTs
6) umount the MDTs

The last umount then hangs.



 Comments   
Comment by Di Wang [ 10/Jun/15 ]

Just discussed this with Fan Yong. It seems the umount thread is blocked because it cannot stop the update recovery thread, which is trying to retrieve the update records from a remote MDT that has already been shut down.

Stack trace of the update recovery thread:

[<ffffffffa0526c18>] ? ptlrpc_set_wait+0x188/0x900 [ptlrpc]
 [<ffffffffa051c5b0>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
 [<ffffffffa051da74>] ? ptlrpc_request_pack+0x24/0x70 [ptlrpc]
 [<ffffffffa0527411>] ? ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa073655b>] ? fld_client_rpc+0x15b/0x510 [fld]
 [<ffffffffa073c42e>] ? fld_server_lookup+0x14e/0x330 [fld]
 [<ffffffffa0c804ff>] ? lod_fld_lookup+0x34f/0x520 [lod]
 [<ffffffff8116fef2>] ? kmem_cache_alloc+0x182/0x190
 [<ffffffffa0c958e3>] ? lod_object_init+0x103/0x3c0 [lod]
 [<ffffffffa1294298>] ? lu_object_alloc+0xd8/0x320 [obdclass]
 [<ffffffffa12957a1>] ? lu_object_find_try+0x151/0x260 [obdclass]
 [<ffffffffa1295961>] ? lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffffa0d06fc2>] ? dt_update_request_destroy+0x1c2/0x270 [osp]
 [<ffffffffa1296ebc>] ? dt_locate_at+0x1c/0xa0 [obdclass]
 [<ffffffffa125346f>] ? llog_osd_open+0xdf/0xde0 [obdclass]
 [<ffffffffa124a375>] ? llog_open+0x145/0x470 [obdclass]
 [<ffffffffa0caa82e>] ? lod_sub_prep_llog+0x19e/0x7a0 [lod]
 [<ffffffffa0c8088e>] ? lod_sub_recovery_thread+0x1be/0x980 [lod]
 [<ffffffff81061c62>] ? default_wake_function+0x12/0x20
 [<ffffffffa0c806d0>] ? lod_sub_recovery_thread+0x0/0x980 [lod]
 [<ffffffff8109ab46>] ? kthread+0x96/0xa0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109aab0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

So we need to stop the recovery thread during umount:

static void mdt_fini()
{
.....
        target_recovery_fini(obd);
        ping_evictor_stop();
        mdt_stack_pre_fini(env, m, md2lu_dev(m->mdt_child));
......

Simply moving mdt_stack_pre_fini() before target_recovery_fini() should make the hang go away, but that would cause other problems.
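
The hang pattern described in this comment can be modelled in userspace. The following is a minimal sketch using POSIX threads, not the Lustre code and not the patch for this ticket. All names (struct recovery_thread, recovery_start, recovery_deliver_reply, recovery_stop) are hypothetical. It assumes the worker waits for a reply from a remote peer, and shows that the stop path has to wake that wait itself, because a peer that is already shut down will never send the reply.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for an update recovery thread. */
struct recovery_thread {
	pthread_t       tid;
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            reply_arrived;  /* the remote peer answered */
	bool            stop_requested; /* the umount path wants the thread gone */
	int             result;         /* 0 = records fetched, -1 = aborted */
};

static void *recovery_main(void *arg)
{
	struct recovery_thread *rt = arg;

	pthread_mutex_lock(&rt->lock);
	/* Wait for the remote reply, but also wake up on a stop request.
	 * Waiting on reply_arrived alone would block forever once the
	 * remote peer is gone, and the join in recovery_stop() would hang. */
	while (!rt->reply_arrived && !rt->stop_requested)
		pthread_cond_wait(&rt->cond, &rt->lock);
	rt->result = rt->reply_arrived ? 0 : -1;
	pthread_mutex_unlock(&rt->lock);
	return NULL;
}

/* Initialise the state and start the worker; returns 0 on success. */
static int recovery_start(struct recovery_thread *rt)
{
	assert(rt != NULL);
	pthread_mutex_init(&rt->lock, NULL);
	pthread_cond_init(&rt->cond, NULL);
	rt->reply_arrived = false;
	rt->stop_requested = false;
	rt->result = 0;
	return pthread_create(&rt->tid, NULL, recovery_main, rt);
}

/* Simulates the remote peer answering the request. */
static void recovery_deliver_reply(struct recovery_thread *rt)
{
	pthread_mutex_lock(&rt->lock);
	rt->reply_arrived = true;
	pthread_cond_broadcast(&rt->cond);
	pthread_mutex_unlock(&rt->lock);
}

/* Umount path: request the stop, wake the waiter, then join the thread. */
static int recovery_stop(struct recovery_thread *rt)
{
	pthread_mutex_lock(&rt->lock);
	rt->stop_requested = true;
	pthread_cond_broadcast(&rt->cond);
	pthread_mutex_unlock(&rt->lock);
	pthread_join(rt->tid, NULL);
	pthread_mutex_destroy(&rt->lock);
	pthread_cond_destroy(&rt->cond);
	return rt->result;
}
```

In this model, calling recovery_start() followed directly by recovery_stop() returns -1 without blocking, which corresponds to unmounting while the remote side is already down. Calling recovery_deliver_reply() before recovery_stop() returns 0.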

Comment by Gerrit Updater [ 11/Jun/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15210
Subject: LU-6705 lod: re-order lodsub recovery cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eee8fdfd69dc66e4224507ce9284022f3a1758a9

Comment by Gerrit Updater [ 08/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15210/
Subject: LU-6705 lod: re-order lodsub recovery cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8299bdd484ad44d3ed25dcc68e8440242c155c80
