[LU-9113] insanity test_0 umount fails for /mnt/lustre-mds1, "Fail all nodes" test can't start Created: 14/Feb/17  Updated: 12/Apr/17  Resolved: 07/Apr/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Casper Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

onyx-30vm1-3/7/8, Full Group test,
master branch, v2.9.52, b3520,
DNE, ZFS


Issue Links:
Related
is related to LU-8502 replay-vbr: umount hangs waiting for ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sets/3c80e50a-efe9-11e6-8c0d-5254006e85c2

The client tries multiple times, unsuccessfully, to unmount mds1, and the operation eventually times out.
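
A hang like this is visible on the affected node roughly as follows (an illustrative sketch; the mount point is from this ticket, and PID 19374 is taken from the console log below):

# umount never returns; the process sits in uninterruptible sleep ("D" state)
umount /mnt/lustre-mds1
ps -o pid,stat,wchan:30,cmd -p 19374          # STAT shows "D"
dmesg | grep -A 25 'blocked for more than'    # hung-task watchdog report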

From MDS console:

02:50:15:[ 4080.084137] INFO: task umount:19374 blocked for more than 120 seconds.
02:50:15:[ 4080.086174] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
02:50:15:[ 4080.088282] umount          D ffff8800793b7fc0     0 19374  19373 0x00000080
02:50:15:[ 4080.090408]  ffff880056d43bd0 0000000000000086 ffff8800422ebec0 ffff880056d43fd8
02:50:15:[ 4080.092523]  ffff880056d43fd8 ffff880056d43fd8 ffff8800422ebec0 ffff8800793b7fb8
02:50:15:[ 4080.094612]  ffff8800793b7fbc ffff8800422ebec0 00000000ffffffff ffff8800793b7fc0
02:50:15:[ 4080.096724] Call Trace:
02:50:15:[ 4080.098391]  [<ffffffff8168cad9>] schedule_preempt_disabled+0x29/0x70
02:50:15:[ 4080.100447]  [<ffffffff8168a735>] __mutex_lock_slowpath+0xc5/0x1c0
02:50:15:[ 4080.102429]  [<ffffffff81689b9f>] mutex_lock+0x1f/0x2f
02:50:15:[ 4080.104322]  [<ffffffffa0ce6a56>] mgc_process_config+0x7d6/0x1400 [mgc]
02:50:15:[ 4080.106336]  [<ffffffff810bc064>] ? __wake_up+0x44/0x50
02:50:15:[ 4080.108272]  [<ffffffffa0b37225>] obd_process_config.constprop.14+0x85/0x2d0 [obdclass]
02:50:15:[ 4080.110413]  [<ffffffffa0b375f0>] ? lustre_cfg_new+0x180/0x400 [obdclass]
02:50:15:[ 4080.112481]  [<ffffffffa0b39440>] lustre_end_log+0xf0/0x5c0 [obdclass]
02:50:15:[ 4080.114533]  [<ffffffffa0b61d2e>] server_put_super+0x7de/0xcd0 [obdclass]
02:50:15:[ 4080.116595]  [<ffffffff81200802>] generic_shutdown_super+0x72/0xf0
02:50:15:[ 4080.118594]  [<ffffffff81200bd2>] kill_anon_super+0x12/0x20
02:50:15:[ 4080.120545]  [<ffffffffa0b36db2>] lustre_kill_super+0x32/0x50 [obdclass]
02:50:15:[ 4080.122589]  [<ffffffff81200f89>] deactivate_locked_super+0x49/0x60
02:50:15:[ 4080.124609]  [<ffffffff81201586>] deactivate_super+0x46/0x60
02:50:15:[ 4080.126559]  [<ffffffff8121e9c5>] mntput_no_expire+0xc5/0x120
02:50:15:[ 4080.128491]  [<ffffffff8121fb00>] SyS_umount+0xa0/0x3b0
02:50:15:[ 4080.130375]  [<ffffffff81696949>] system_call_fastpath+0x16/0x1b
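
Reading the trace bottom-up: umount enters SyS_umount, server_put_super() calls lustre_end_log(), which reaches mgc_process_config(), and the task blocks in mutex_lock(), presumably waiting on a mutex held by another thread. The hung-task watchdog reports any task stuck in uninterruptible sleep longer than kernel.hung_task_timeout_secs (120 s here). A sketch of how the same state can be inspected on demand, without waiting for the watchdog:

# Read the hung task's kernel stack directly (needs root and CONFIG_STACKTRACE):
cat /proc/19374/stack
# The 120 s threshold is a sysctl; the log message itself shows how to silence it:
sysctl kernel.hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_timeout_secs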


 Comments   
Comment by James Casper [ 03/Apr/17 ]

Just saw this with a patch test that was trying to run replay-dual five times in a row:

https://testing.hpdd.intel.com/test_sessions/a32a6368-1702-4ce5-a99b-a7375a0aea8b

Replay-dual passes consistently when run immediately after Lustre init (with subtest 21b excepted), but not when it follows a passing replay-dual test set.
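
For context, "21b is excepted" means subtest 21b is skipped via the test framework's exception list. An invocation along these lines, with the variable name assumed from the usual per-suite SUITE_EXCEPT convention in Lustre's test-framework.sh:

# Assumed form; REPLAY_DUAL_EXCEPT is the conventional per-suite skip list:
REPLAY_DUAL_EXCEPT="21b" bash lustre/tests/replay-dual.sh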

Comment by James Casper [ 07/Apr/17 ]

I looked at the stack traces below the one pasted above; they contain the following frame:

mgs_ir_fini_fs+0x27e/0x2ec [mgs]

Closing this ticket as a dupe of LU-8502.
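
One way to confirm which thread holds the contended mutex (here, the one sitting in mgs_ir_fini_fs) is to dump the stacks of all blocked tasks via magic SysRq; a sketch:

# SysRq 'w' dumps every task in uninterruptible ("D") state to the kernel log:
echo w > /proc/sysrq-trigger
dmesg | less    # look for the thread whose stack includes mgs_ir_fini_fs [mgs]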

Comment by James Casper [ 07/Apr/17 ]

Dupe of LU-8502.
