[LU-2701] recovery-small test 27 umount hang Created: 29/Jan/13  Updated: 15/Mar/13  Resolved: 15/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: MB

Severity: 3
Rank (Obsolete): 6294

 Description   

A recent landing introduced a problem in osp cleanup.
Test 27 of recovery-small in particular seems to be affected.
This test deliberately breaks osc communication, and perhaps osp is not able to recover?
Trace of the hung umount:

PID: 28642  TASK: ffff880097b0a140  CPU: 6   COMMAND: "umount"
 #0 [ffff88007b6c1898] schedule at ffffffff814f7c98
 #1 [ffff88007b6c1960] osp_sync_fini at ffffffffa069d09d [osp]
 #2 [ffff88007b6c19c0] osp_process_config at ffffffffa06972c0 [osp]
 #3 [ffff88007b6c1a20] lod_cleanup_desc_tgts at ffffffffa05ed564 [lod]
 #4 [ffff88007b6c1a70] lod_process_config at ffffffffa05f0266 [lod]
 #5 [ffff88007b6c1af0] mdd_process_config at ffffffffa0427c4b [mdd]
 #6 [ffff88007b6c1b50] mdt_stack_fini at ffffffffa0726b21 [mdt]
 #7 [ffff88007b6c1bb0] mdt_device_fini at ffffffffa072799a [mdt]
 #8 [ffff88007b6c1bf0] class_cleanup at ffffffffa0fb5247 [obdclass]
 #9 [ffff88007b6c1c70] class_process_config at ffffffffa0fb6b2c [obdclass]
#10 [ffff88007b6c1d00] class_manual_cleanup at ffffffffa0fb7869 [obdclass]
#11 [ffff88007b6c1dc0] server_put_super at ffffffffa0fc83bc [obdclass]
#12 [ffff88007b6c1e30] generic_shutdown_super at ffffffff8117d6ab
#13 [ffff88007b6c1e50] kill_anon_super at ffffffff8117d796
#14 [ffff88007b6c1e70] lustre_kill_super at ffffffffa0fb9666 [obdclass]
#15 [ffff88007b6c1e90] deactivate_super at ffffffff8117e825
#16 [ffff88007b6c1eb0] mntput_no_expire at ffffffff8119a89f
#17 [ffff88007b6c1ee0] sys_umount at ffffffff8119b34b

After this nothing could progress:

[145371.090429] LustreError: 22398:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 407 sleeping for 10000ms
[145380.552626] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 1 client reconnects
[145380.573743] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
[145380.588321] Lustre: lustre-OST0001: deleting orphan objects from 0x0:176 to 192
[145380.588747] Lustre: Skipped 1 previous similar message
[145381.093065] LustreError: 22398:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 407 awake
[145459.804339] Lustre: Failing over lustre-MDT0000
[145459.809747] LustreError: 11-0: lustre-MDT0000-mdc-ffff88008c61dbf0: Communicating with 0@lo, operation mds_reint failed with -19.
[145459.810324] LustreError: Skipped 5 previous similar messages
[145460.115626] LustreError: 20940:0:(client.c:1039:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88008fd2ebf0 x1425344870597376/t0(0) o6->lustre-OST0000-osc-MDT0000@0@lo:28/4 lens 664/432 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[145460.157395] LustreError: 20938:0:(client.c:1039:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880073692bf0 x1425344870597409/t0(0) o6->lustre-OST0001-osc-MDT0000@0@lo:28/4 lens 664/432 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[145460.158377] LustreError: 20938:0:(client.c:1039:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
[145460.608530] LustreError: 137-5: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[145465.605136] LustreError: 137-5: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[145465.606017] LustreError: Skipped 3 previous similar messages
...

I have a crashdump.



 Comments   
Comment by Oleg Drokin [ 29/Jan/13 ]

This problem seems to have been introduced by commit 74ec68346e14851ad8a1912185e1dccd3e6d12cd, also from Wangdi: LU-1187 lod: Fix config log and setup process for DNE

Comment by Jodi Levi (Inactive) [ 29/Jan/13 ]

Di is looking into this one as well.

Comment by Oleg Drokin [ 29/Jan/13 ]

Hm, actually, looking through earlier logs, I have seen a very similar stack before for umount, but not triggering as reliably as it does in recovery-small test 27.

Comment by Oleg Drokin [ 29/Jan/13 ]

First appearance I see was on Jan 19, so still a recent thing.

Comment by Di Wang [ 29/Jan/13 ]

Hmm, if the first appearance was on Jan 19th, most changes in this area had not landed yet. I think most of the patches here landed on Jan 22nd. Also, it seems the cleanup process is waiting for the unlink log threads to stop, so it is unlikely to be a FID-on-OST problem, IMHO. I will look into it more deeply; sigh, I cannot reproduce it locally.
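
For readers unfamiliar with this class of hang, here is a minimal user-space sketch of the pattern described above: a cleanup path that waits for a worker thread to stop, while the worker is blocked on a reply that will never arrive. It is an illustration only, written in Python rather than kernel C. The names `run_cleanup`, `stop_requested` and `reply_arrived` are hypothetical and do not correspond to anything in the osp code, and the "fixed" variant is just one generic way to avoid such a hang, not a description of the patch that eventually landed.

```python
import threading


def run_cleanup(worker_wakes_on_stop: bool, timeout: float = 0.5) -> bool:
    """Return True if cleanup finished, False if it would have hung.

    Hypothetical model: the worker stands in for a log-processing thread,
    and reply_arrived stands in for an RPC reply from an unreachable peer.
    """
    stop_requested = threading.Event()  # set by cleanup: "please stop"
    reply_arrived = threading.Event()   # stays unset while cleanup is waiting

    def worker() -> None:
        if worker_wakes_on_stop:
            # Variant that cannot hang: wait for the reply in short slices
            # and re-check the stop request between slices.
            while not stop_requested.is_set():
                if reply_arrived.wait(timeout=0.05):
                    break
        else:
            # Hang variant: block only on the reply and ignore the stop request.
            reply_arrived.wait()

    t = threading.Thread(target=worker, daemon=True)
    t.start()

    stop_requested.set()  # cleanup asks the worker to stop
    t.join(timeout)       # a cleanup path without a timeout would block here
    finished = not t.is_alive()

    reply_arrived.set()   # release a stuck worker so the demo exits cleanly
    return finished
```

Calling `run_cleanup(False)` models the situation in the backtrace, where the waiter never gets woken; `run_cleanup(True)` models a worker that notices the stop request even though the reply never comes.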

Comment by Andreas Dilger [ 07/Feb/13 ]

Problem appears to be in LOD/OSP code, not DNE.

Comment by Alex Zhuravlev [ 19/Feb/13 ]

http://review.whamcloud.com/5463

Comment by Alex Zhuravlev [ 15/Mar/13 ]

The patch has landed. The issue is hopefully solved.
