[LU-7022] recovery-small test_100: hung on umount Created: 19/Aug/15  Updated: 16/Nov/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7903 recovery-small test_23: hang on umount Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/cbf89e7c-4675-11e5-bedf-5254006e85c2.

The sub-test test_100 failed with the following error:

test failed to respond and timed out

Syslog from the OST shows:

obd refcount = 4. Is it stuck?
Aug 19 01:10:10 onyx-30vm4 kernel: Lustre: lustre-OST0000: Not available for connect from 10.2.4.97@tcp (stopping)
Aug 19 01:10:10 onyx-30vm4 kernel: Lustre: Skipped 77 previous similar messages
Aug 19 01:13:53 onyx-30vm4 kernel: INFO: task umount:9683 blocked for more than 120 seconds.
Aug 19 01:13:53 onyx-30vm4 kernel:      Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.g107be2b.x86_64 #1
Aug 19 01:13:53 onyx-30vm4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 01:13:53 onyx-30vm4 kernel: umount        D 0000000000000001     0  9683   9682 0x00000080
Aug 19 01:13:53 onyx-30vm4 kernel: ffff880043057a78 0000000000000082 0000000000000000 ffff880043057a18
Aug 19 01:13:53 onyx-30vm4 kernel: ffff8800430579d8 ffffffffa21c8983 0000100ed9610762 0000000000000000
Aug 19 01:13:53 onyx-30vm4 kernel: ffff8800657dd044 000000010108d4b9 ffff88006308e5f8 ffff880043057fd8
Aug 19 01:13:53 onyx-30vm4 kernel: Call Trace:
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff81087540>] ? process_timeout+0x0/0x10
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa2157c66>] obd_exports_barrier+0xb6/0x190 [obdclass]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa0a4556f>] ofd_device_fini+0x5f/0x250 [ofd]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa21747b2>] class_cleanup+0x572/0xd30 [obdclass]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa2154726>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa2176e06>] class_process_config+0x1e96/0x2800 [obdclass]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa1e11c01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa2177c2f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa2154726>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa21b10b2>] server_put_super+0x9e2/0xeb0 [obdclass]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff81190b7b>] generic_shutdown_super+0x5b/0xe0
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff81190c66>] kill_anon_super+0x16/0x60
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffffa217aae6>] lustre_kill_super+0x36/0x60 [obdclass]
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff81191407>] deactivate_super+0x57/0x80
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff811b10df>] mntput_no_expire+0xbf/0x110
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff811b1c2b>] sys_umount+0x7b/0x3a0
Aug 19 01:13:53 onyx-30vm4 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Aug 19 01:14:25 onyx-30vm4 kernel: Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 4. Is it stuck?
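For context, the stack shows umount blocked in obd_exports_barrier(), which waits for every unlinked export to drop its reference before ofd_device_fini() can proceed; the repeated "Is it stuck?" warning comes from that wait loop. A simplified userspace sketch of the pattern (illustrative only, not the actual Lustre source; names, counts, and intervals here are assumptions):

/* Sketch of the obd_exports_barrier() wait pattern: poll until all
 * unlinked exports are gone, doubling the wait and warning at
 * power-of-two intervals. Build with: cc demo.c */
#include <stdio.h>
#include <unistd.h>

static int unlinked_exports = 4;  /* stand-in for obd->obd_unlinked_exports */

static void exports_barrier(const char *obd_name)
{
        int waited = 2;

        while (unlinked_exports > 0) {
                sleep(waited);  /* schedule_timeout() in the kernel */
                if (waited > 5 && !(waited & (waited - 1)))  /* power of two */
                        fprintf(stderr,
                                "%s is waiting for obd_unlinked_exports more than %d seconds. "
                                "The obd refcount = %d. Is it stuck?\n",
                                obd_name, waited, unlinked_exports);
                waited *= 2;
                unlinked_exports--;  /* simulate exports draining; in this bug
                                      * they never do, so the real loop spins
                                      * until the hung-task watchdog fires */
        }
}

int main(void)
{
        exports_barrier("lustre-OST0000");
        return 0;
}

The "blocked for more than 120 seconds" message above is the kernel's hung-task watchdog noticing that umount has sat in uninterruptible sleep inside this loop past hung_task_timeout_secs.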

Info required for matching: recovery-small 100



 Comments   
Comment by Bob Glossman (Inactive) [ 17/Oct/17 ]

Another occurrence on master:
https://testing.hpdd.intel.com/test_sets/82c40750-b2f2-11e7-943d-5254006e85c2

From the OST console log:

[18891.658372] Lustre: lustre-OST0001: Export ffff880067fdd800 already connecting from 10.2.8.140@tcp
[18891.662227] Lustre: Skipped 51 previous similar messages
[18960.200464] INFO: task umount:12854 blocked for more than 120 seconds.
[18960.201372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[18960.202152] umount          D ffffffffc0ede848     0 12854  12853 0x00000080
[18960.202932]  ffff88005e097b20 0000000000000086 ffff880079aacf10 ffff88005e097fd8
[18960.203750]  ffff88005e097fd8 ffff88005e097fd8 ffff880079aacf10 ffffffffc0ede840
[18960.204544]  ffffffffc0ede844 ffff880079aacf10 00000000ffffffff ffffffffc0ede848
[18960.205363] Call Trace:
[18960.205887]  [<ffffffff816aa3e9>] schedule_preempt_disabled+0x29/0x70
[18960.206664]  [<ffffffff816a8317>] __mutex_lock_slowpath+0xc7/0x1d0
[18960.207297]  [<ffffffff816a772f>] mutex_lock+0x1f/0x2f
[18960.208775]  [<ffffffffc0e4a04d>] nm_config_file_deregister_tgt+0x3d/0x1f0 [ptlrpc]
[18960.209658]  [<ffffffffc10fc66e>] ofd_device_fini+0xce/0x2d0 [ofd]
[18960.210740]  [<ffffffffc0b7f4dc>] class_cleanup+0x86c/0xc40 [obdclass]
[18960.211411]  [<ffffffffc0b818b6>] class_process_config+0x1996/0x23e0 [obdclass]
[18960.212279]  [<ffffffffc05ddba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[18960.212951]  [<ffffffffc0b824c6>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[18960.213746]  [<ffffffffc0bb203e>] server_put_super+0x8de/0xcd0 [obdclass]
[18960.214470]  [<ffffffff81203722>] generic_shutdown_super+0x72/0x100
[18960.215086]  [<ffffffff81203af2>] kill_anon_super+0x12/0x20
[18960.215648]  [<ffffffffc0b84dc2>] lustre_kill_super+0x32/0x50 [obdclass]
[18960.216305]  [<ffffffff81203ea9>] deactivate_locked_super+0x49/0x60
[18960.216910]  [<ffffffff81204616>] deactivate_super+0x46/0x60
[18960.217475]  [<ffffffff8122185f>] cleanup_mnt+0x3f/0x80
[18960.217991]  [<ffffffff812218f2>] __cleanup_mnt+0x12/0x20
[18960.218582]  [<ffffffff810ad265>] task_work_run+0xc5/0xf0
[18960.219146]  [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
[18960.219722]  [<ffffffff816b527d>] int_signal+0x12/0x17
[18994.563233] Lustre: lustre-OST0000: Not available for connect from 10.2.8.134@tcp (stopping)
[18994.564200] Lustre: Skipped 77 previous similar messages
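Note that this hang differs from the original report: here umount is blocked acquiring a mutex inside nm_config_file_deregister_tgt() (nodemap config cleanup) rather than waiting in obd_exports_barrier(). A minimal userspace illustration of that failure mode (hypothetical, not Lustre code; thread names are invented): a thread that never releases the config mutex leaves the unmounting thread blocked indefinitely, which is what the hung-task watchdog then reports.

/* Build with: cc demo.c -lpthread. The program deliberately hangs at
 * pthread_join(), mirroring the hung umount in the trace above. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t config_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *config_holder(void *arg)
{
        pthread_mutex_lock(&config_mutex);  /* e.g. a stuck config thread */
        pause();                            /* never releases the lock */
        return NULL;
}

static void *umount_path(void *arg)
{
        /* analogous to nm_config_file_deregister_tgt() taking the
         * config mutex during ofd_device_fini() */
        pthread_mutex_lock(&config_mutex);  /* blocks here, like umount:12854 */
        printf("deregistered\n");           /* never reached */
        pthread_mutex_unlock(&config_mutex);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, config_holder, NULL);
        sleep(1);                           /* let the holder win the lock */
        pthread_create(&t2, NULL, umount_path, NULL);
        pthread_join(t2, NULL);             /* hangs forever */
        return 0;
}
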
Comment by Jinshan Xiong (Inactive) [ 16/Nov/17 ]

happened again at: https://testing.hpdd.intel.com/sub_tests/77f3d8ac-cad3-11e7-8027-52540065bddc
