[LU-7326] ost-pools hangs on OST unmount Created: 21/Oct/15  Updated: 22/Oct/15  Resolved: 22/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

autotest


Attachments: File ost-pools.test_complete.console.shadow-6vm7.log    
Issue Links:
Related
is related to LU-7038 obdfilter-survey test_3a: (lu_object.... Resolved
is related to LU-7221 replay-ost-single test_3: ASSERTION( ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ost-pools hangs on unmount of an OST. No individual test fails, and there are no errors in the last test run (test 26 in ost-pools); only the unmount of one of the OSTs during test cleanup hangs. Logs are at https://testing.hpdd.intel.com/test_sets/ea392e2a-776b-11e5-a00c-5254006e85c2

The last thing we see in the suite_stdout log is:

16:16:45:Stopping /mnt/ost7 (opts:-f) on shadow-6vm7
16:16:45:CMD: shadow-6vm7 umount -d -f /mnt/ost7
16:16:56:CMD: shadow-6vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
16:16:56:CMD: shadow-6vm7 grep -c /mnt/ost8' ' /proc/mounts
16:16:56:Stopping /mnt/ost8 (opts:-f) on shadow-6vm7
16:16:56:CMD: shadow-6vm7 umount -d -f /mnt/ost8
17:14:51:********** Timeout by autotest system **********
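
As a quick check of how long the run sat between the last umount command and the autotest timeout, the two timestamps above can be differenced. This is only a minimal sketch: it assumes the `HH:MM:SS:` prefix format shown in suite_stdout and that both lines fall on the same day, and the helper names `parse_ts` and `hang_duration` are hypothetical, not part of any Lustre or autotest tooling.

```python
from datetime import datetime, timedelta

def parse_ts(line):
    """Parse the leading HH:MM:SS stamp of a suite_stdout line (assumed format)."""
    return datetime.strptime(line[:8], "%H:%M:%S")

def hang_duration(last_cmd_line, timeout_line):
    """Time between the last logged command and the timeout (same day assumed)."""
    return parse_ts(timeout_line) - parse_ts(last_cmd_line)

# Lines copied from the suite_stdout excerpt above.
last_cmd = "16:16:56:CMD: shadow-6vm7 umount -d -f /mnt/ost8"
timeout = "17:14:51:********** Timeout by autotest system **********"

gap = hang_duration(last_cmd, timeout)
print(gap)
```

For the excerpt above this comes to just under 58 minutes with no further output.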

This failure was on the master branch in review-dne-part-2. There are other, similar hangs on unmount of the OSTs in ost-pools for some 'full' group test sessions. Logs for these are at:
2015-10-03 02:46:25 - https://testing.hpdd.intel.com/test_sets/a436aad2-69ed-11e5-9fbf-5254006e85c2
2015-10-07 04:38:39 - https://testing.hpdd.intel.com/test_sets/c7a37340-6d19-11e5-ab7f-5254006e85c2
2015-10-07 05:32:11 - https://testing.hpdd.intel.com/test_sets/4398325c-6cd8-11e5-96b4-5254006e85c2



 Comments   
Comment by James Nunez (Inactive) [ 22/Oct/15 ]

I found the console logs that capture the activity between the last test completing and the start of the new test suite. In the attached file, ost-pools.test_complete.console.shadow-6vm7.log, we can see the stack trace from umount:

16:17:07:Lustre: DEBUG MARKER: umount -d -f /mnt/ost8
16:17:07:LustreError: 6705:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3
16:17:07:LustreError: 6705:0:(lu_object.c:1224:lu_device_fini()) LBUG
16:17:07:Pid: 6705, comm: umount
16:17:07:
16:17:07:Call Trace:
16:17:07: [<ffffffffa049b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
16:17:07: [<ffffffffa049be77>] lbug_with_loc+0x47/0xb0 [libcfs]
16:17:07: [<ffffffffa05f0618>] lu_device_fini+0xb8/0xc0 [obdclass]
16:17:07: [<ffffffffa05d1efd>] ls_device_put+0x7d/0x2e0 [obdclass]
16:17:07: [<ffffffffa05d22d2>] local_oid_storage_fini+0x172/0x410 [obdclass]
16:17:07: [<ffffffffa0dc876f>] lfsck_instance_cleanup+0x20f/0x7e0 [lfsck]
16:17:07: [<ffffffffa0dcaf7b>] lfsck_degister+0x4b/0x60 [lfsck]
16:17:07: [<ffffffffa0e935cb>] ofd_device_fini+0xab/0x260 [ofd]
16:17:07: [<ffffffffa05dfb82>] class_cleanup+0x572/0xd20 [obdclass]
16:17:07: [<ffffffffa05e2206>] class_process_config+0x1ed6/0x2830 [obdclass]
16:17:07: [<ffffffffa04a7b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
16:17:07: [<ffffffff811788ec>] ? __kmalloc+0x21c/0x230
16:17:07: [<ffffffffa05e301f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
16:17:07: [<ffffffffa05c0746>] ? class_name2dev+0x56/0xe0 [obdclass]
16:17:07: [<ffffffffa061aeec>] server_put_super+0xa0c/0xed0 [obdclass]
16:17:07: [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190
16:17:07: [<ffffffff8119437b>] generic_shutdown_super+0x5b/0xe0
16:17:07: [<ffffffff81194466>] kill_anon_super+0x16/0x60
16:17:07: [<ffffffffa05e5ed6>] lustre_kill_super+0x36/0x60 [obdclass]
16:17:07: [<ffffffff81194c07>] deactivate_super+0x57/0x80
16:17:07: [<ffffffff811b4a7f>] mntput_no_expire+0xbf/0x110
16:17:07: [<ffffffff811b55cb>] sys_umount+0x7b/0x3a0
16:17:07: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
16:17:07:
16:17:07:Kernel panic - not syncing: LBUG
16:17:07:Pid: 6705, comm: umount Not tainted 2.6.32-573.7.1.el6_lustre.gef63c03.x86_64 #1
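
When comparing a trace like this against other tickets, it can help to pull out the assertion site and the call-trace symbols programmatically. The following is a rough sketch, not an existing Lustre tool: the regular expressions are assumptions fitted only to the line formats visible in the log above, and the names `find_assertion` and `reliable_frames` are hypothetical. Frames printed with a leading `?` are ones the kernel could not verify as part of the call chain, so the sketch skips them.

```python
import re

# Fitted to the LustreError ASSERTION line format seen in the console log above.
ASSERT_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"ASSERTION\( (?P<expr>.+?) \) failed: (?P<msg>.*)"
)
# One call-trace frame: [<addr>] [? ]symbol+0xoff/0xsize [module]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unreliable>\? )?(?P<sym>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+)\])?"
)

def find_assertion(lines):
    """Return the fields of the first ASSERTION line, or None if there is none."""
    for line in lines:
        m = ASSERT_RE.search(line)
        if m:
            return m.groupdict()
    return None

def reliable_frames(lines):
    """Return (symbol, module) pairs, skipping frames marked with '?'."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m and not m.group("unreliable"):
            frames.append((m.group("sym"), m.group("mod")))
    return frames

# A few lines copied from the console log above.
sample = [
    "16:17:07:LustreError: 6705:0:(lu_object.c:1224:lu_device_fini()) "
    "ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 3",
    "16:17:07: [<ffffffffa05f0618>] lu_device_fini+0xb8/0xc0 [obdclass]",
    "16:17:07: [<ffffffffa05d1efd>] ls_device_put+0x7d/0x2e0 [obdclass]",
    "16:17:07: [<ffffffff811788ec>] ? __kmalloc+0x21c/0x230",
]

info = find_assertion(sample)
frames = reliable_frames(sample)
print(info["file"], info["line"], info["func"], info["msg"])
print(frames)
```

The assertion function plus the first few reliable frames make a compact signature for matching against related tickets.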
Comment by James Nunez (Inactive) [ 22/Oct/15 ]

With the stack trace, we can see this is a duplicate of LU-7038.
