[LU-187] umount hang when running conf-sanity test 29 Created: 02/Apr/11  Updated: 30/May/13  Resolved: 01/Jun/11

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Jian Yu
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File syslog     File trace    
Severity: 3
Rank (Obsolete): 8477

 Description   

umount of the MDS hangs when running conf-sanity test 29

MDS:

INFO: task umount:24781 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount D ffffffff80150d6a 0 24781 24780 (NOTLB)
ffff8102f2283978 0000000000000086 0000003000000030 ffff8102f22839e0
ffff8102f22838d8 0000000000000007 ffff8103389b67a0 ffff81034aae77a0
000017607cb40f62 0000000002787517 ffff8103389b6988 0000000700005486
Call Trace:
[<ffffffff80064613>] __down_write_nested+0x7a/0x92
[<ffffffff88bcf960>] :mdd:mdd_init_capa_ctxt+0xe0/0x1a0
[<ffffffff889213e0>] :fld:fld_cache_fini+0x350/0x360
[<ffffffff88c84066>] :cmm:cmm_init_capa_ctxt+0xd6/0x170
[<ffffffff88bf32d2>] :mdt:mdt_fld_fini+0x1b2/0x1c0
[<ffffffff88c0087e>] :mdt:mdt_device_fini+0x36e/0x650
[<ffffffff88738aea>] :obdclass:class_cleanup+0xbca/0xe40
[<ffffffff8873cc4c>] :obdclass:class_process_config+0x1e5c/0x3200
[<ffffffff8873f647>] :obdclass:class_manual_cleanup+0xad7/0xe80
[<ffffffff8874d10c>] :obdclass:server_put_super+0x5ac/0xe70
[<ffffffff80051848>] vfs_quota_sync+0x14b/0x15a
[<ffffffff80034ee9>] dispose_list+0xc7/0xe0
[<ffffffff800ec87a>] invalidate_inodes+0xce/0xe0
[<ffffffff800e3f71>] generic_shutdown_super+0x79/0xfb
[<ffffffff800e4041>] kill_anon_super+0x9/0x35
[<ffffffff800e40f2>] deactivate_super+0x6a/0x82
[<ffffffff800edf47>] sys_umount+0x245/0x27b
[<ffffffff800236ba>] sys_newstat+0x19/0x31
[<ffffffff8005d116>] system_call+0x7e/0x83



 Comments   
Comment by Peter Jones [ 02/Apr/11 ]

Yu Jian

Can you please try and understand the cause of this hang?

Thanks

Peter

Comment by Jian Yu [ 20/Apr/11 ]

From the syslog of the MDS node, we could see:

Lustre: Permanently reactivating lustre-OST0001
Lustre: lustre-OST0001-osc-MDT0000: Connection to service lustre-OST0001 via nid 192.168.4.129@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
Lustre: lustre-OST0001-osc-MDT0000: Connection restored to service lustre-OST0001 using nid 192.168.4.129@o2ib.
LustreError: 11-0: an error occurred while communicating with 192.168.4.129@o2ib. The ost_set_info operation failed with -107

After lustre-OST0001 was reactivated, communication between the MDS and the OSS hit a -107 error (-ENOTCONN, Transport endpoint is not connected).
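
For reference, the -107 in the log is a negated Linux errno value. A minimal userspace sketch (illustrative only, not Lustre code) that prints the corresponding message, assuming the standard Linux errno numbering where ENOTCONN is 107:

/* errno_demo.c -- illustrative only; build with: gcc -o errno_demo errno_demo.c */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* The kernel reports errors as negative errno values, so the
         * "-107" in the MDS log corresponds to ENOTCONN. */
        printf("errno %d (ENOTCONN): %s\n", ENOTCONN, strerror(ENOTCONN));
        return 0;
}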

Sarah, do you have the dmesg/syslog from the OSS node? I could not reproduce this issue on the Toro cluster, so I could not figure out what happened on the OSS node.

Comment by Sarah Liu [ 20/Apr/11 ]

Sorry, I don't have those logs. I will try to reproduce this bug and gather the logs.

Comment by Jian Yu [ 01/Jun/11 ]

The issue has not been reproduced for two months. Let's close the ticket. If you hit the issue again, please feel free to reopen it.

Comment by Colin Faber [X] (Inactive) [ 22/Jun/11 ]

Hi,

I'm able to reproduce this pretty reliably:

INFO: task touch:22268 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
touch D ffff880038923a18 0 22268 22237 0x00000080
ffff880038923938 0000000000000082 ffff8800389238e8 ffffffff810117bc
0000000000000000 0000000000000000 0000000000000000 ffff88002ff9c100
ffff880039cbbb18 ffff880038923fd8 0000000000010518 ffff880039cbbb18
Call Trace:
[<ffffffff810117bc>] ? __switch_to+0x1ac/0x320
[<ffffffff814c95f6>] ? thread_return+0x4e/0x778
[<ffffffff814ca335>] schedule_timeout+0x225/0x2f0
[<ffffffff814c9fa3>] wait_for_common+0x123/0x180
[<ffffffff8105c580>] ? default_wake_function+0x0/0x20
[<ffffffff814ca0bd>] wait_for_completion+0x1d/0x20
[<ffffffffa08b3558>] osc_io_setattr_end+0x28/0x90 [osc]
[<ffffffffa0691260>] cl_io_end+0x60/0x130 [obdclass]
[<ffffffffa0905c86>] lov_io_end_wrapper+0x116/0x120 [lov]
[<ffffffffa0905b70>] ? lov_io_end_wrapper+0x0/0x120 [lov]
[<ffffffffa0905b70>] ? lov_io_end_wrapper+0x0/0x120 [lov]
[<ffffffffa0905a88>] lov_io_call+0x58/0x140 [lov]
[<ffffffffa09065a6>] ? lov_io_start+0xc6/0x170 [lov]
[<ffffffffa0906457>] lov_io_end+0x47/0xd0 [lov]
[<ffffffffa0691260>] cl_io_end+0x60/0x130 [obdclass]
[<ffffffffa06951de>] cl_io_loop+0x11e/0x1d0 [obdclass]
[<ffffffffa0693a14>] ? cl_io_init+0x64/0xc0 [obdclass]
[<ffffffffa0b24f3b>] cl_setattr_ost+0x1fb/0x290 [lustre]
[<ffffffffa0afc4ea>] ll_setattr_raw+0xa4a/0x1020 [lustre]
[<ffffffffa0afcb1f>] ll_setattr+0x5f/0x100 [lustre]
[<ffffffff81188f98>] notify_change+0x168/0x340
[<ffffffff8119c2cc>] utimes_common+0xdc/0x1b0
[<ffffffff8119c439>] do_utimes+0x99/0xf0
[<ffffffff810d4372>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff8119c592>] sys_utimensat+0x32/0x90
[<ffffffff81013172>] system_call_fastpath+0x16/0x1b
LustreError: 9781:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (19) req@ffff88000baf3400 x1372341458271667/t0(0) o-1><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1308766557 ref 1 fl Interpret:/ffffffff/ffffffff rc -19/-1
LustreError: 9781:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 51 previous similar messages

Comment by Stephen Champion [ 30/May/13 ]

I tripped over this running acceptance on a SLES11SP1 / Lustre 2.1.5 client:

crash> bt 24884
PID: 24884 TASK: ffff8802f9688aa0 CPU: 5 COMMAND: "umount"
#0 [ffff8802f23438d8] schedule at ffffffff8150ba32
#1 [ffff8802f23439a0] rwsem_down_failed_common at ffffffff8150dfe5
#2 [ffff8802f2343a00] rwsem_down_write_failed at ffffffff8150e143
#3 [ffff8802f2343a40] call_rwsem_down_write_failed at ffffffff81281463
#4 [ffff8802f2343aa0] mdd_init_capa_ctxt at ffffffffa07da6c8 [mdd]
#5 [ffff8802f2343b00] cmm_init_capa_ctxt at ffffffffa091b873 [cmm]
#6 [ffff8802f2343b40] mdt_device_fini at ffffffffa0834abf [mdt]
#7 [ffff8802f2343b80] class_cleanup at ffffffffa0b730f7 [obdclass]
#8 [ffff8802f2343c00] class_process_config at ffffffffa0b75103 [obdclass]
#9 [ffff8802f2343ca0] class_manual_cleanup at ffffffffa0b75e49 [obdclass]
#10 [ffff8802f2343d60] server_put_super at ffffffffa0b81cec [obdclass]
#11 [ffff8802f2343e30] generic_shutdown_super at ffffffff811828fb
#12 [ffff8802f2343e50] kill_anon_super at ffffffff811829e6
#13 [ffff8802f2343e70] lustre_kill_super at ffffffffa0b77a86 [obdclass]
#14 [ffff8802f2343e90] deactivate_super at ffffffff81183187
#15 [ffff8802f2343eb0] mntput_no_expire at ffffffff811a11af
#16 [ffff8802f2343ee0] sys_umount at ffffffff811a1c1b
#17 [ffff8802f2343f80] system_call_fastpath at ffffffff8100b072

I doubt I'll have time to spare chasing it down, but I can make the core and logs available to anyone who wants to take a look.
