[LU-7157] sanity test_27z: cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed Created: 14/Sep/15 Updated: 23/Nov/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>.

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ccc00f28-5b17-11e5-af09-5254006e85c2.

The sub-test test_27z failed with the following error:

test failed to respond and timed out

I think an OST crashed & rebooted during test 27z, but I'm not sure. No console logs were captured; console logs might have given better clues.

Info required for matching: sanity 27z |
| Comments |
| Comment by Bob Glossman (Inactive) [ 14/Sep/15 ] |
|
I think the missing console logs have been misplaced onto lustre-init, as has been seen before on el7 test runs. This isn't el7, it's sles11sp4, but the same thing may be happening here. If I look at the OST console log recorded in lustre-init I do in fact see a panic:

16:57:44:Welcome to SUSE Linux Enterprise Server 11 SP4 (x86_64) - Kernel 3.0.101-63_lustre.g031dbf9-default (console).
16:57:44:shadow-7vm11 login: [ 1817.654651] LustreError: 17681:0:(hash.c:554:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
16:57:44:[ 1817.658147] LustreError: 17681:0:(hash.c:554:cfs_hash_bd_del_locked()) LBUG
16:57:44:[ 1817.661508] Kernel panic - not syncing: LBUG
16:57:44:[ 1817.662478] Pid: 17681, comm: umount Tainted: G EN 3.0.101-63_lustre.g031dbf9-default #1
16:57:44:[ 1817.664474] Call Trace:
16:57:44:[ 1817.665214] [<ffffffff81004b95>] dump_trace+0x75/0x300
16:57:44:[ 1817.666278] [<ffffffff81466093>] dump_stack+0x69/0x6f
16:57:44:[ 1817.667349] [<ffffffff8146612c>] panic+0x93/0x201
16:57:44:[ 1817.669684] [<ffffffffa0726db3>] lbug_with_loc+0xa3/0xb0 [libcfs]
16:57:44:[ 1817.672157] [<ffffffffa0737ccd>] cfs_hash_bd_del_locked+0xdd/0x120 [libcfs]
16:57:44:[ 1817.675226] [<ffffffffa0a6516e>] __ldlm_resource_putref_final+0x3e/0xc0 [ptlrpc]
16:57:44:[ 1817.679206] [<ffffffffa0a652d2>] ldlm_resource_putref_locked+0xe2/0x3f0 [ptlrpc]
16:57:44:[ 1817.681626] [<ffffffffa073852a>] cfs_hash_for_each_relax+0x1da/0x330 [libcfs]
16:57:44:[ 1817.683368] [<ffffffffa073a6ba>] cfs_hash_for_each_nolock+0x7a/0x1e0 [libcfs]
16:57:44:[ 1817.685157] [<ffffffffa0a63be9>] ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
16:57:44:[ 1817.686515] [<ffffffffa0a660f2>] __ldlm_namespace_free+0x52/0x580 [ptlrpc]
16:57:44:[ 1817.687817] [<ffffffffa0a66682>] ldlm_namespace_free_prior+0x62/0x230 [ptlrpc]
16:57:44:[ 1817.689594] [<ffffffffa0fff0a8>] ofd_fini+0x58/0x190 [ofd]
16:57:44:[ 1817.690725] [<ffffffffa0fff211>] ofd_device_fini+0x31/0xf0 [ofd]
16:57:44:[ 1817.692021] [<ffffffffa086872d>] class_cleanup+0x9bd/0xd40 [obdclass]
16:57:44:[ 1817.694026] [<ffffffffa0869c91>] class_process_config+0x11e1/0x1910 [obdclass]
16:57:44:[ 1817.698374] [<ffffffffa086a8bf>] class_manual_cleanup+0x4ff/0x8c0 [obdclass]
16:57:44:[ 1817.701394] [<ffffffffa08a6477>] server_put_super+0x607/0xb00 [obdclass]
16:57:44:[ 1817.702686] [<ffffffff811603fb>] generic_shutdown_super+0x6b/0x100
16:57:44:[ 1817.703907] [<ffffffff81160519>] kill_anon_super+0x9/0x20
16:57:44:[ 1817.705052] [<ffffffff81160b83>] deactivate_locked_super+0x33/0x90
16:57:44:[ 1817.706654] [<ffffffff8117cc0c>] sys_umount+0x6c/0xd0
16:57:44:[ 1817.707718] [<ffffffff814710f2>] system_call_fastpath+0x16/0x1b
16:57:44:[ 1817.708906] [<00007fd1fa0b16f7>] 0x7fd1fa0b16f6
16:57:44:[ 0.000000] Initializing cgroup subsys cpuset
16:57:44:[ 0.000000] Initializing cgroup subsys cpu

Hoping this will give somebody a clue. |
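For context on what that assertion means, here is a minimal userspace sketch, not the libcfs source; the names bucket_t, bucket_add() and bucket_del() are invented for illustration. Each hash bucket keeps a count of the elements linked into it, and removing an entry from a bucket whose count is already zero means either the entry was deleted twice or the counter was corrupted, which is what ASSERTION( bd->bd_bucket->hsb_count > 0 ) guards against.

```c
/*
 * Minimal userspace model of the cfs_hash_bd_del_locked() check
 * (illustrative only; bucket_t, bucket_add, bucket_del are made-up names).
 * A per-bucket element count must stay non-negative: deleting from a
 * bucket that already reports zero entries trips the assertion.
 */
#include <assert.h>
#include <stdio.h>

typedef struct {
    unsigned int hsb_count;    /* elements currently linked in this bucket */
} bucket_t;

static void bucket_add(bucket_t *b)
{
    b->hsb_count++;
}

static void bucket_del(bucket_t *b)
{
    /* stand-in for ASSERTION( bd->bd_bucket->hsb_count > 0 ) */
    assert(b->hsb_count > 0);
    b->hsb_count--;
}

int main(void)
{
    bucket_t b = { 0 };

    bucket_add(&b);
    bucket_del(&b);    /* fine: count goes 1 -> 0 */
    bucket_del(&b);    /* deleting the same entry again: assertion fires */
    printf("not reached\n");
    return 0;
}
```

In the panic above the failing delete happens in the final putref of an LDLM resource during OST umount (__ldlm_resource_putref_final()), so the open question is presumably how that resource, or its bucket count, was already dropped once before the last put. Nothing in the captured logs confirms which.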
| Comment by Alex Zhuravlev [ 07/Oct/15 ] |
|
https://testing.hpdd.intel.com/test_logs/959bcc54-6bba-11e5-8e3b-5254006e85c2/show_text - not exactly the same, but the same test and at umount.

13:06:41:Lustre: DEBUG MARKER: umount -d /mnt/ost1
13:06:41:Lustre: Failing over lustre-OST0000
13:06:41:Lustre: lustre-OST0000: Not available for connect from 10.1.4.19@tcp (stopping)
13:06:41:LustreError: 6276:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
13:06:41:LustreError: 6276:0:(genops.c:815:class_export_put()) LBUG |
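This one is a different sanity check: class_export_put() verifies that the export's reference count is still positive and does not look like poisoned (freed) memory before dropping it, so "value: 0" suggests an extra put on an export whose last reference was already released. A rough userspace model follows, with invented names (export_put(), POISON) and the 64-bit poison word simplified to 0x5a5a5a5a.

```c
/*
 * Userspace sketch of the class_export_put() refcount sanity check
 * (assumed, simplified names; not the obdclass source).  The refcount
 * must be positive and must not match the freed-memory poison pattern
 * when a reference is dropped.
 */
#include <assert.h>
#include <stdio.h>

#define POISON 0x5a5a5a5a    /* simplified stand-in for the poison value in the log */

struct export {
    int refcount;
};

static void export_put(struct export *exp)
{
    int v = exp->refcount;

    /* stand-in for ASSERTION( __v > 0 && __v < POISON ), which failed with value: 0 */
    assert(v > 0 && v < POISON);
    exp->refcount = v - 1;
}

int main(void)
{
    struct export exp = { .refcount = 1 };

    export_put(&exp);    /* balanced put: 1 -> 0 */
    export_put(&exp);    /* unbalanced extra put: value is 0, assertion fires */
    printf("not reached\n");
    return 0;
}
```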
| Comment by James Nunez (Inactive) [ 06/Nov/15 ] |
|
Similar failure on OST unmount on master at |
| Comment by Andreas Dilger [ 11/Feb/16 ] |
|
Recent failures of this test:

09:37:21:Lustre: Failing over lustre-OST0000
09:37:21:general protection fault: 0000 [#1] SMP
09:37:21:Pid: 10695, comm: umount Not tainted 2.6.32-573.12.1.el6_lustre.gd68d18b.x86_64 #1 Red Hat KVM
09:37:21:RIP: 0010:[<ffffffffa076151b>] [<ffffffffa076151b>] ldlm_resource_putref_locked+0x1b/0x3f0 [ptlrpc]
09:37:21:Process umount (pid: 10695, threadinfo ffff88005b4d0000, task ffff88005cac8040)
09:37:21:Call Trace:
09:37:21: [<ffffffffa0761902>] ldlm_res_hop_put_locked+0x12/0x20 [ptlrpc]
09:37:21: [<ffffffffa0478779>] cfs_hash_for_each_relax+0x199/0x350 [libcfs]
09:37:21: [<ffffffffa047a6ac>] cfs_hash_for_each_nolock+0x8c/0x1d0 [libcfs]
09:37:21: [<ffffffffa075ff30>] ldlm_namespace_cleanup+0x30/0xc0 [ptlrpc]
09:37:21: [<ffffffffa0762444>] __ldlm_namespace_free+0x54/0x560 [ptlrpc]
09:37:21: [<ffffffffa07629bf>] ldlm_namespace_free_prior+0x6f/0x220 [ptlrpc]
09:37:21: [<ffffffffa0de85bb>] ofd_device_fini+0x7b/0x260 [ofd]
09:37:21: [<ffffffffa056f282>] class_cleanup+0x572/0xd20 [obdclass]
09:37:21: [<ffffffffa0571906>] class_process_config+0x1ed6/0x2830 [obdclass]
09:37:21: [<ffffffffa057271f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
09:37:21: [<ffffffffa05aae3c>] server_put_super+0xa0c/0xed0 [obdclass]
09:37:21: [<ffffffff811944bb>] generic_shutdown_super+0x5b/0xe0
09:37:21: [<ffffffff811945a6>] kill_anon_super+0x16/0x60
09:37:21: [<ffffffffa05755d6>] lustre_kill_super+0x36/0x60 [obdclass]
09:37:21: [<ffffffff81194d47>] deactivate_super+0x57/0x80
09:37:21: [<ffffffff811b4d3f>] mntput_no_expire+0xbf/0x110
09:37:21: [<ffffffff811b588b>] sys_umount+0x7b/0x3a0

and https://testing.hpdd.intel.com/test_sets/008f7a1e-839f-11e5-b1ba-5254006e85c2:

01:11:57:Lustre: DEBUG MARKER: umount -d /mnt/ost4
01:11:57:LustreError: 15917:0:(hash.c:554:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
01:11:57:LustreError: 15917:0:(hash.c:554:cfs_hash_bd_del_locked()) LBUG
01:11:57:Pid: 15917, comm: umount
01:11:57:Call Trace:
01:11:57: [<ffffffffa046c875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
01:11:57: [<ffffffffa046ce77>] lbug_with_loc+0x47/0xb0 [libcfs]
01:11:57: [<ffffffffa047de60>] cfs_hash_bd_del_locked+0xc0/0x100 [libcfs]
01:11:57: [<ffffffffa07663d8>] __ldlm_resource_putref_final+0x48/0xc0 [ptlrpc]
01:11:57: [<ffffffffa076652d>] ldlm_resource_putref_locked+0xdd/0x3f0 [ptlrpc]
01:11:57: [<ffffffffa0766852>] ldlm_res_hop_put_locked+0x12/0x20 [ptlrpc]
01:11:57: [<ffffffffa05782cf>] ? class_manual_cleanup+0x4bf/0x8e0 [obdclass]
01:11:57: [<ffffffffa05559f6>] ? class_name2dev+0x56/0xe0 [obdclass]
01:11:57: [<ffffffffa05b019c>] ? server_put_super+0xa0c/0xed0 [obdclass]
01:11:57: [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190
01:11:57: [<ffffffff8119437b>] ? generic_shutdown_super+0x5b/0xe0
01:11:57: [<ffffffff81194466>] ? kill_anon_super+0x16/0x60
01:11:57: [<ffffffffa057b186>] ? lustre_kill_super+0x36/0x60 [obdclass]
01:11:57: [<ffffffff81194c07>] ? deactivate_super+0x57/0x80
01:11:57: [<ffffffff811b4a7f>] ? mntput_no_expire+0xbf/0x110
01:11:57: [<ffffffff811b55cb>] ? sys_umount+0x7b/0x3a0 |
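Both of these traces crash while dropping the per-item reference inside the hash walk. As far as I understand the libcfs iteration (a simplified reading, not a root-cause claim), cfs_hash_for_each_nolock() ends up in cfs_hash_for_each_relax(), which pins each item with a reference, releases the bucket lock around the callback, then re-takes the lock and drops its reference; that last drop is the ldlm_res_hop_put_locked() frame where both the general protection fault and the hsb_count assertion appear. A structural sketch of that pattern, with invented names and simplified locking:

```c
/*
 * Structural sketch of a "relaxed" hash walk (invented names; not the
 * libcfs source).  Each item is pinned with a reference, the bucket
 * lock is released around the callback so it may sleep, then the lock
 * is re-taken and the iterator's reference is dropped -- the point
 * corresponding to ldlm_res_hop_put_locked() in the traces above.
 */
#include <pthread.h>
#include <stdio.h>

struct item {
    struct item *next;
    int refcount;
};

struct bucket {
    pthread_mutex_t lock;
    struct item *head;
};

typedef void (*iter_cb_t)(struct item *it);

static void item_get(struct item *it) { it->refcount++; }
static void item_put(struct item *it) { it->refcount--; }  /* the real put may unlink and free */

/* Walk one bucket, releasing the lock around each callback. */
static void for_each_relax(struct bucket *b, iter_cb_t cb)
{
    pthread_mutex_lock(&b->lock);
    for (struct item *it = b->head; it != NULL; it = it->next) {
        item_get(it);                    /* pin the item before unlocking */
        pthread_mutex_unlock(&b->lock);
        cb(it);                          /* other threads can run against the bucket here */
        pthread_mutex_lock(&b->lock);
        item_put(it);                    /* <- where the traces above blow up */
    }
    pthread_mutex_unlock(&b->lock);
}

static void visit(struct item *it)
{
    printf("visiting item, refcount=%d\n", it->refcount);
}

int main(void)
{
    struct item second = { .next = NULL, .refcount = 1 };
    struct item first  = { .next = &second, .refcount = 1 };
    struct bucket b    = { .head = &first };

    pthread_mutex_init(&b.lock, NULL);
    for_each_relax(&b, visit);
    pthread_mutex_destroy(&b.lock);
    return 0;
}
```

(Compile with -pthread; the sketch is single-threaded, so the race window is only marked by comments.) The point is just that the failing reference drop runs against a bucket whose contents may have changed while the lock was released.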
| Comment by Andreas Dilger [ 13/Jul/16 ] |
|
Another failure: https://testing.hpdd.intel.com/test_sets/0cc7cb54-4872-11e6-8968-5254006e85c2 |
| Comment by John Hammond [ 13/Jul/16 ] |
|
Andreas, this issue is an oops; your link is for a softlockup. See |
| Comment by Niu Yawei (Inactive) [ 27/Sep/16 ] |
|
Another failure: https://testing.hpdd.intel.com/test_sets/fb94d724-8420-11e6-a35f-5254006e85c2 |
| Comment by Oleg Drokin [ 04/Oct/16 ] |
|
This last one seems to be a different failure that's worth filing a separate ticket for. |