[LU-5161] replay-single test_42: OBD refcount is 5
Created: 09/Jun/14  Updated: 29/Jul/16  Resolved: 29/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-4772 MGS is waiting for obd_unlinked_exports Resolved
Severity: 3
Rank (Obsolete): 14231

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
http://maloo.whamcloud.com/test_sets/32dbb85c-ef0d-11e3-b8c2-52540035b04c.

The sub-test test_42 failed with the following error:

test failed to respond and timed out

Info required for matching: replay-single 42

OST Console log:

20:09:19:Lustre: DEBUG MARKER: umount -d /mnt/ost1
20:09:19:Lustre: Failing over lustre-OST0000
20:09:19:Lustre: lustre-OST0000: Not available for connect from 10.1.4.231@tcp (stopping)
20:09:19:Lustre: Skipped 2 previous similar messages
20:09:19:Lustre: lustre-OST0000: Not available for connect from 10.1.4.224@tcp (stopping)
20:09:19:Lustre: Skipped 1 previous similar message
20:09:19:Lustre: lustre-OST0000: Not available for connect from 10.1.4.225@tcp (stopping)
20:09:19:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
20:09:19:Lustre: lustre-OST0000: Not available for connect from 10.1.4.231@tcp (stopping)
20:09:19:Lustre: Skipped 4 previous similar messages
20:09:19:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
20:09:19:Lustre: lustre-OST0000: Not available for connect from 10.1.4.225@tcp (stopping)
20:09:19:Lustre: Skipped 9 previous similar messages
20:09:19:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
20:09:19:Lustre: lustre-OST0000: Not available for connect from 10.1.4.231@tcp (stopping)
20:09:19:Lustre: Skipped 19 previous similar messages
20:20:16:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
20:20:16:Lustre: lustre-OST0000: Not available for connect from 10.1.4.231@tcp (stopping)
20:20:16:Lustre: Skipped 38 previous similar messages
20:20:16:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
20:20:16:Lustre: lustre-OST0000: Not available for connect from 10.1.4.231@tcp (stopping)
20:20:16:Lustre: Skipped 77 previous similar messages
20:20:16:INFO: task umount:32610 blocked for more than 120 seconds.
20:20:16:      Tainted: P        W  ---------------    2.6.32-431.17.1.el6_lustre.g357ff2e.x86_64 #1
20:20:16:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
20:20:16:umount        D 0000000000000001     0 32610  32609 0x00000080
20:20:16: ffff88001f005aa8 0000000000000082 ffff88001f005a08 ffff880074636800
20:20:16: ffffffffa078dbf7 0000000000000000 ffff88001bc2cfd4 ffffffffa078dbf7
20:20:16: ffff880021917ab8 ffff88001f005fd8 000000000000fbc8 ffff880021917ab8
20:20:16:Call Trace:
20:20:16: [<ffffffff81528e82>] schedule_timeout+0x192/0x2e0
20:20:16: [<ffffffff81083e90>] ? process_timeout+0x0/0x10
20:20:16: [<ffffffffa0710a5b>] obd_exports_barrier+0xab/0x180 [obdclass]
20:20:16: [<ffffffffa0f5e46f>] ofd_device_fini+0x5f/0x260 [ofd]
20:20:16: [<ffffffffa0737443>] class_cleanup+0x573/0xd30 [obdclass]
20:20:16: [<ffffffffa0712836>] ? class_name2dev+0x56/0xe0 [obdclass]
20:20:16: [<ffffffffa073916a>] class_process_config+0x156a/0x1ad0 [obdclass]
20:20:16: [<ffffffffa073150b>] ? lustre_cfg_new+0x2cb/0x680 [obdclass]
20:20:16: [<ffffffffa0739849>] class_manual_cleanup+0x179/0x6f0 [obdclass]
20:20:16: [<ffffffffa0712836>] ? class_name2dev+0x56/0xe0 [obdclass]
20:20:16: [<ffffffffa0777039>] server_put_super+0x8f9/0xe50 [obdclass]
20:20:16: [<ffffffff8118af0b>] generic_shutdown_super+0x5b/0xe0
20:20:16: [<ffffffff8118aff6>] kill_anon_super+0x16/0x60
20:20:16: [<ffffffffa073b726>] lustre_kill_super+0x36/0x60 [obdclass]
20:20:16: [<ffffffff8118b797>] deactivate_super+0x57/0x80
20:20:16: [<ffffffff811aa79f>] mntput_no_expire+0xbf/0x110
20:20:16: [<ffffffff811ab2eb>] sys_umount+0x7b/0x3a0
20:20:16: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
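
The console log above shows the obd_unlinked_exports warning repeating at 8, 16, 32, 64 and 128 seconds while the reported refcount stays at 5. As a triage aid, the following is a minimal sketch (not part of the original report) that pulls those warnings out of a console log and checks whether the refcount ever changes. The function names and the SAMPLE list are hypothetical; the regular expression assumes the message wording is exactly as it appears in this log.

```python
import re

# Assumption: message format matches the console log in this ticket.
BARRIER_RE = re.compile(
    r"(?P<target>\S+) is waiting for obd_unlinked_exports "
    r"more than (?P<seconds>\d+) seconds\. "
    r"The obd refcount = (?P<refcount>\d+)"
)


def find_barrier_warnings(lines):
    """Return (target, seconds_waited, refcount) for each barrier warning."""
    warnings = []
    for line in lines:
        match = BARRIER_RE.search(line)
        if match:
            warnings.append(
                (
                    match.group("target"),
                    int(match.group("seconds")),
                    int(match.group("refcount")),
                )
            )
    return warnings


def refcount_never_changes(warnings):
    """True when there are at least two warnings and all report the same refcount."""
    refcounts = {refcount for _, _, refcount in warnings}
    return len(warnings) >= 2 and len(refcounts) == 1


# Hypothetical sample: a few lines copied from the console log above.
SAMPLE = [
    "20:09:19:Lustre: Failing over lustre-OST0000",
    "20:09:19:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports "
    "more than 8 seconds. The obd refcount = 5. Is it stuck?",
    "20:09:19:Lustre: Skipped 4 previous similar messages",
    "20:09:19:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports "
    "more than 16 seconds. The obd refcount = 5. Is it stuck?",
]

if __name__ == "__main__":
    found = find_barrier_warnings(SAMPLE)
    for target, seconds, refcount in found:
        print(f"{target}: waited > {seconds}s, refcount = {refcount}")
    print("refcount never changes:", refcount_never_changes(found))
```

A refcount that stays constant across successive warnings, as it does here, suggests the remaining export references are not being released at all rather than draining slowly.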


 Comments   
Comment by James Nunez (Inactive) [ 02/Mar/16 ]

There's another possible instance of this error seen on replay-single test_120. The MDS is the server that cannot unmount. Logs are at https://testing.hpdd.intel.com/test_sets/55181848-dfc5-11e5-9400-5254006e85c2

Comment by Andreas Dilger [ 29/Jul/16 ]

Given the lack of activity on this ticket, and the fact that there is a fix for a similar problem in LU-4772, I'm going to assume this is the same issue as LU-4772.

Generated at Sat Feb 10 01:49:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.