[LU-7157] sanity test_27z: cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed Created: 14/Sep/15  Updated: 23/Nov/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ccc00f28-5b17-11e5-af09-5254006e85c2.

The sub-test test_27z failed with the following error:

test failed to respond and timed out

I think an OST crashed and rebooted during test 27z, but I'm not sure. No console logs were captured; console logs might have given better clues.

Info required for matching: sanity 27z



 Comments   
Comment by Bob Glossman (Inactive) [ 14/Sep/15 ]

I think the missing console logs have been misplaced onto lustre-init, as has been seen before on el7 test runs. This isn't el7, it's sles11sp4, but the same thing may be happening here.

If I look at the OST console log recorded in lustre-init, I do in fact see a panic:

16:57:44:Welcome to SUSE Linux Enterprise Server 11 SP4  (x86_64) - Kernel 3.0.101-63_lustre.g031dbf9-default (console).
16:57:44:
16:57:44:
16:57:44:shadow-7vm11 login: [ 1817.654651] LustreError: 17681:0:(hash.c:554:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
16:57:44:[ 1817.658147] LustreError: 17681:0:(hash.c:554:cfs_hash_bd_del_locked()) LBUG
16:57:44:[ 1817.661508] Kernel panic - not syncing: LBUG
16:57:44:[ 1817.662478] Pid: 17681, comm: umount Tainted: G           EN  3.0.101-63_lustre.g031dbf9-default #1
16:57:44:[ 1817.664474] Call Trace:
16:57:44:[ 1817.665214]  [<ffffffff81004b95>] dump_trace+0x75/0x300
16:57:44:[ 1817.666278]  [<ffffffff81466093>] dump_stack+0x69/0x6f
16:57:44:[ 1817.667349]  [<ffffffff8146612c>] panic+0x93/0x201
16:57:44:[ 1817.669684]  [<ffffffffa0726db3>] lbug_with_loc+0xa3/0xb0 [libcfs]
16:57:44:[ 1817.672157]  [<ffffffffa0737ccd>] cfs_hash_bd_del_locked+0xdd/0x120 [libcfs]
16:57:44:[ 1817.675226]  [<ffffffffa0a6516e>] __ldlm_resource_putref_final+0x3e/0xc0 [ptlrpc]
16:57:44:[ 1817.679206]  [<ffffffffa0a652d2>] ldlm_resource_putref_locked+0xe2/0x3f0 [ptlrpc]
16:57:44:[ 1817.681626]  [<ffffffffa073852a>] cfs_hash_for_each_relax+0x1da/0x330 [libcfs]
16:57:44:[ 1817.683368]  [<ffffffffa073a6ba>] cfs_hash_for_each_nolock+0x7a/0x1e0 [libcfs]
16:57:44:[ 1817.685157]  [<ffffffffa0a63be9>] ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
16:57:44:[ 1817.686515]  [<ffffffffa0a660f2>] __ldlm_namespace_free+0x52/0x580 [ptlrpc]
16:57:44:[ 1817.687817]  [<ffffffffa0a66682>] ldlm_namespace_free_prior+0x62/0x230 [ptlrpc]
16:57:44:[ 1817.689594]  [<ffffffffa0fff0a8>] ofd_fini+0x58/0x190 [ofd]
16:57:44:[ 1817.690725]  [<ffffffffa0fff211>] ofd_device_fini+0x31/0xf0 [ofd]
16:57:44:[ 1817.692021]  [<ffffffffa086872d>] class_cleanup+0x9bd/0xd40 [obdclass]
16:57:44:[ 1817.694026]  [<ffffffffa0869c91>] class_process_config+0x11e1/0x1910 [obdclass]
16:57:44:[ 1817.698374]  [<ffffffffa086a8bf>] class_manual_cleanup+0x4ff/0x8c0 [obdclass]
16:57:44:[ 1817.701394]  [<ffffffffa08a6477>] server_put_super+0x607/0xb00 [obdclass]
16:57:44:[ 1817.702686]  [<ffffffff811603fb>] generic_shutdown_super+0x6b/0x100
16:57:44:[ 1817.703907]  [<ffffffff81160519>] kill_anon_super+0x9/0x20
16:57:44:[ 1817.705052]  [<ffffffff81160b83>] deactivate_locked_super+0x33/0x90
16:57:44:[ 1817.706654]  [<ffffffff8117cc0c>] sys_umount+0x6c/0xd0
16:57:44:[ 1817.707718]  [<ffffffff814710f2>] system_call_fastpath+0x16/0x1b
16:57:44:[ 1817.708906]  [<00007fd1fa0b16f7>] 0x7fd1fa0b16f6
16:57:44:[    0.000000] Initializing cgroup subsys cpuset
16:57:44:[    0.000000] Initializing cgroup subsys cpu

Hoping this will give somebody a clue.
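
For readers not familiar with the libcfs hash internals, here is a minimal userspace sketch (an assumed simplification, not the actual hash.c source) of the bookkeeping the failed assertion guards: every bucket keeps an element count, a delete asserts that count is still positive before decrementing it, and a second removal of the same entry underflows the count and trips exactly this kind of check.

#include <assert.h>
#include <stdio.h>

/* hypothetical stand-in for a libcfs hash bucket; only the counter matters here */
struct bucket {
        int count;              /* models hsb_count */
};

static void bucket_add(struct bucket *b)
{
        b->count++;
}

static void bucket_del(struct bucket *b)
{
        assert(b->count > 0);   /* models ASSERTION( bd->bd_bucket->hsb_count > 0 ) */
        b->count--;
}

int main(void)
{
        struct bucket b = { 0 };

        bucket_add(&b);
        bucket_del(&b);         /* fine: count goes 1 -> 0 */
        bucket_del(&b);         /* double removal: the assertion fires, analogous to the LBUG */
        printf("not reached\n");
        return 0;
}

In the crash above, the second decrement would be reached via __ldlm_resource_putref_final() during the umount-time namespace cleanup shown in the trace.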

Comment by Alex Zhuravlev [ 07/Oct/15 ]

https://testing.hpdd.intel.com/test_logs/959bcc54-6bba-11e5-8e3b-5254006e85c2/show_text - not exactly the same failure, but the same test and also at umount.

13:06:41:Lustre: DEBUG MARKER: umount -d /mnt/ost1
13:06:41:Lustre: Failing over lustre-OST0000
13:06:41:Lustre: lustre-OST0000: Not available for connect from 10.1.4.19@tcp (stopping)
13:06:41:LustreError: 6276:0:(genops.c:815:class_export_put()) ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0
13:06:41:LustreError: 6276:0:(genops.c:815:class_export_put()) LBUG
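
The assertion here reads like a refcount sanity check: the counter must be positive and must not look like 0x5a5a... poison before the reference is dropped, and "value: 0" means the export's refcount was already zero when class_export_put() ran, i.e. one put too many somewhere on the failover/umount path. A hedged, simplified model (assumed names, not the obdclass source):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical stand-in for an export; only the reference counter matters */
struct export {
        int64_t refcount;
};

static void export_put(struct export *exp)
{
        int64_t v = exp->refcount;

        /* models ASSERTION( __v > 0 && __v < 0x5a5a5a5a5a5a5a5a ): positive and not poison */
        assert(v > 0 && v < 0x5a5a5a5a5a5a5a5aLL);
        exp->refcount = v - 1;
        if (exp->refcount == 0)
                printf("last reference dropped; the export would be freed here\n");
}

int main(void)
{
        struct export exp = { .refcount = 1 };

        export_put(&exp);       /* legitimate final put */
        export_put(&exp);       /* extra put: value is 0, the assertion fires as in the log */
        return 0;
}
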
Comment by James Nunez (Inactive) [ 06/Nov/15 ]

Similar failures on OST unmount on master at
2015-11-05 00:43:03 - https://testing.hpdd.intel.com/test_sets/008f7a1e-839f-11e5-b1ba-5254006e85c2
2015-11-11 19:37:53 - https://testing.hpdd.intel.com/test_sets/fa66faee-88ef-11e5-8ba4-5254006e85c2

Comment by Andreas Dilger [ 11/Feb/16 ]

Recent failures of this test:
https://testing.hpdd.intel.com/test_sets/260b405e-cf8d-11e5-9923-5254006e85c2

09:37:21:Lustre: Failing over lustre-OST0000
09:37:21:general protection fault: 0000 [#1] SMP 
09:37:21:Pid: 10695, comm: umount Not tainted 2.6.32-573.12.1.el6_lustre.gd68d18b.x86_64 #1 Red Hat KVM
09:37:21:RIP: 0010:[<ffffffffa076151b>]  [<ffffffffa076151b>] ldlm_resource_putref_locked+0x1b/0x3f0 [ptlrpc]
09:37:21:Process umount (pid: 10695, threadinfo ffff88005b4d0000, task ffff88005cac8040)
09:37:21:Call Trace:
09:37:21: [<ffffffffa0761902>] ldlm_res_hop_put_locked+0x12/0x20 [ptlrpc]
09:37:21: [<ffffffffa0478779>] cfs_hash_for_each_relax+0x199/0x350 [libcfs]
09:37:21: [<ffffffffa047a6ac>] cfs_hash_for_each_nolock+0x8c/0x1d0 [libcfs]
09:37:21: [<ffffffffa075ff30>] ldlm_namespace_cleanup+0x30/0xc0 [ptlrpc]
09:37:21: [<ffffffffa0762444>] __ldlm_namespace_free+0x54/0x560 [ptlrpc]
09:37:21: [<ffffffffa07629bf>] ldlm_namespace_free_prior+0x6f/0x220 [ptlrpc]
09:37:21: [<ffffffffa0de85bb>] ofd_device_fini+0x7b/0x260 [ofd]
09:37:21: [<ffffffffa056f282>] class_cleanup+0x572/0xd20 [obdclass]
09:37:21: [<ffffffffa0571906>] class_process_config+0x1ed6/0x2830 [obdclass]
09:37:21: [<ffffffffa057271f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
09:37:21: [<ffffffffa05aae3c>] server_put_super+0xa0c/0xed0 [obdclass]
09:37:21: [<ffffffff811944bb>] generic_shutdown_super+0x5b/0xe0
09:37:21: [<ffffffff811945a6>] kill_anon_super+0x16/0x60
09:37:21: [<ffffffffa05755d6>] lustre_kill_super+0x36/0x60 [obdclass]
09:37:21: [<ffffffff81194d47>] deactivate_super+0x57/0x80
09:37:21: [<ffffffff811b4d3f>] mntput_no_expire+0xbf/0x110
09:37:21: [<ffffffff811b588b>] sys_umount+0x7b/0x3a0

and https://testing.hpdd.intel.com/test_sets/008f7a1e-839f-11e5-b1ba-5254006e85c2

01:11:57:Lustre: DEBUG MARKER: umount -d /mnt/ost4
01:11:57:LustreError: 15917:0:(hash.c:554:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed: 
01:11:57:LustreError: 15917:0:(hash.c:554:cfs_hash_bd_del_locked()) LBUG
01:11:57:Pid: 15917, comm: umount
01:11:57:Call Trace:
01:11:57: [<ffffffffa046c875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
01:11:57: [<ffffffffa046ce77>] lbug_with_loc+0x47/0xb0 [libcfs]
01:11:57: [<ffffffffa047de60>] cfs_hash_bd_del_locked+0xc0/0x100 [libcfs]
01:11:57: [<ffffffffa07663d8>] __ldlm_resource_putref_final+0x48/0xc0 [ptlrpc]
01:11:57: [<ffffffffa076652d>] ldlm_resource_putref_locked+0xdd/0x3f0 [ptlrpc]
01:11:57: [<ffffffffa0766852>] ldlm_res_hop_put_locked+0x12/0x20 [ptlrpc]
01:11:57: [<ffffffffa05782cf>] ? class_manual_cleanup+0x4bf/0x8e0 [obdclass]
01:11:57: [<ffffffffa05559f6>] ? class_name2dev+0x56/0xe0 [obdclass]
01:11:57: [<ffffffffa05b019c>] ? server_put_super+0xa0c/0xed0 [obdclass]
01:11:57: [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190
01:11:57: [<ffffffff8119437b>] ? generic_shutdown_super+0x5b/0xe0
01:11:57: [<ffffffff81194466>] ? kill_anon_super+0x16/0x60
01:11:57: [<ffffffffa057b186>] ? lustre_kill_super+0x36/0x60 [obdclass]
01:11:57: [<ffffffff81194c07>] ? deactivate_super+0x57/0x80
01:11:57: [<ffffffff811b4a7f>] ? mntput_no_expire+0xbf/0x110
01:11:57: [<ffffffff811b55cb>] ? sys_umount+0x7b/0x3a0
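
Both traces in this comment go through the same umount-time path as the original report: ldlm_namespace_cleanup() walks the namespace resource hash and drops one reference per resource, and whichever put turns out to be the last one also unlinks the resource from the hash. Below is a minimal sketch of that pattern (assumed names, not the ptlrpc source); if something had already consumed what should have been the last reference, the final-unlink branch would run a second time, surfacing either as the hsb_count LBUG or as a GPF on an already-freed resource.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TABLE_SIZE 4

/* hypothetical stand-in for an LDLM resource held in a namespace hash */
struct entry {
        int refcount;
        int linked;             /* 1 while still linked in the table */
};

static struct entry *table[TABLE_SIZE];
static int table_count;         /* analogue of a per-bucket hsb_count */

static void entry_putref(struct entry *e)
{
        assert(e->refcount > 0);
        if (--e->refcount == 0) {
                /* final put: unlink from the table (the real code would then free) */
                assert(e->linked);       /* a second "final" put would trip this */
                assert(table_count > 0); /* analogue of the hash.c:554 check */
                table_count--;
                e->linked = 0;
        }
}

/* analogue of the cleanup walk: visit every entry and drop one reference */
static void namespace_cleanup(void)
{
        for (size_t i = 0; i < TABLE_SIZE; i++)
                if (table[i] != NULL && table[i]->linked)
                        entry_putref(table[i]);
}

int main(void)
{
        static struct entry res[2] = {
                { .refcount = 1, .linked = 1 },
                { .refcount = 1, .linked = 1 },
        };

        table[0] = &res[0];
        table[1] = &res[1];
        table_count = 2;

        namespace_cleanup();    /* clean run: both final puts unlink cleanly */
        printf("entries left in table: %d\n", table_count);
        return 0;
}
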
Comment by Andreas Dilger [ 13/Jul/16 ]

Another failure: https://testing.hpdd.intel.com/test_sets/0cc7cb54-4872-11e6-8968-5254006e85c2

Comment by John Hammond [ 13/Jul/16 ]

Andreas, this issue is an oops; your link is for a soft lockup. See LU-8392.

Comment by Niu Yawei (Inactive) [ 27/Sep/16 ]

Another failure: https://testing.hpdd.intel.com/test_sets/fb94d724-8420-11e6-a35f-5254006e85c2

Comment by Oleg Drokin [ 04/Oct/16 ]

This last one seems to be a different failure that's worth filing a separate ticket for.
