[LU-8663] LustreError: 106058:0:(hash.c:554:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed Created: 03/Oct/16  Updated: 16/Feb/17  Resolved: 18/Dec/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: llnl
Environment:

lustre-2.8.0_0.0.llnlpreview.41-2.ch6.x86_64


Issue Links:
Duplicate
duplicates LU-6304 crash on umount in cleanup_resource Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While the filesystem was mounted and active, we began power-cycling OSSes to verify that failover worked properly.

Several OSS nodes crashed, with traces like this one:

2016-09-30 14:40:14 [11785.975211] BUG: unable to handle kernel paging request at 00000000deadbeef
2016-09-30 14:40:14 [11785.984272] IP: [<ffffffff81334259>] memset+0x9/0xb0
2016-09-30 14:40:14 [11785.986998] LustreError: 106058:0:(hash.c:554:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
2016-09-30 14:40:14 [11785.986999] LustreError: 106058:0:(hash.c:554:cfs_hash_bd_del_locked()) LBUG
2016-09-30 14:40:14 [11785.987000] Pid: 106058, comm: ldlm_bl_10
2016-09-30 14:40:14 [11785.987000]
2016-09-30 14:40:14 [11786.490434] Call Trace:
2016-09-30 14:40:14 [11786.493875]  [<ffffffffa0d2e00d>] ? ofd_lvbo_free+0x4d/0xe0 [ofd]
2016-09-30 14:40:14 [11786.501409]  [<ffffffffa1099643>] ldlm_resource_putref_locked+0x133/0x430 [ptlrpc]
2016-09-30 14:40:14 [11786.510593]  [<ffffffffa1099952>] ldlm_res_hop_put_locked+0x12/0x20 [ptlrpc]
2016-09-30 14:40:14 [11786.519183]  [<ffffffffa08d1b74>] cfs_hash_for_each_relax+0x1b4/0x3d0 [libcfs]
2016-09-30 14:40:14 [11786.527976]  [<ffffffffa1096d60>] ? cleanup_resource+0x370/0x370 [ptlrpc]
2016-09-30 14:40:14 [11786.536291]  [<ffffffffa1096d60>] ? cleanup_resource+0x370/0x370 [ptlrpc]
2016-09-30 14:40:14 [11786.544587]  [<ffffffffa08d4dc5>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
2016-09-30 14:40:14 [11786.553382]  [<ffffffffa1094eb0>] ldlm_namespace_cleanup+0x30/0xc0 [ptlrpc]
2016-09-30 14:40:14 [11786.561886]  [<ffffffffa1095d5f>] __ldlm_namespace_free+0x5f/0x5c0 [ptlrpc]
2016-09-30 14:40:14 [11786.570385]  [<ffffffffa0c660e4>] ? lfsck_instance_find+0x74/0xb0 [lfsck]
2016-09-30 14:40:14 [11786.578678]  [<ffffffff8169d015>] ? mutex_lock+0x25/0x42
2016-09-30 14:40:14 [11786.585329]  [<ffffffffa0c6a0a8>] ? lfsck_stop+0x1b8/0x4f0 [lfsck]
2016-09-30 14:40:14 [11786.592951]  [<ffffffff811e5fd6>] ? kmem_cache_alloc_trace+0x226/0x250
2016-09-30 14:40:14 [11786.600978]  [<ffffffffa109631a>] ldlm_namespace_free_prior+0x5a/0x210 [ptlrpc]
2016-09-30 14:40:14 [11786.609869]  [<ffffffffa0d1089a>] ofd_device_fini+0x8a/0x2a0 [ofd]
2016-09-30 14:40:14 [11786.617527]  [<ffffffffa0a0a21c>] class_cleanup+0x8dc/0xd70 [obdclass]
2016-09-30 14:40:14 [11786.625561]  [<ffffffffa0a0cbfc>] class_process_config+0x1e2c/0x2f70 [obdclass]
2016-09-30 14:40:14 [11786.634454]  [<ffffffff811e5a63>] ? __kmalloc+0x233/0x280
2016-09-30 14:40:14 [11786.641219]  [<ffffffffa0a0611b>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
2016-09-30 14:40:14 [11786.649424]  [<ffffffffa0a0de2f>] class_manual_cleanup+0xef/0x810 [obdclass]
2016-09-30 14:40:14 [11786.658007]  [<ffffffffa0a3fece>] server_put_super+0x8de/0xcd0 [obdclass]
2016-09-30 14:40:14 [11786.666272]  [<ffffffff81209572>] generic_shutdown_super+0x72/0xf0
2016-09-30 14:40:14 [11786.673841]  [<ffffffff81209942>] kill_anon_super+0x12/0x20
2016-09-30 14:40:14 [11786.680720]  [<ffffffffa0a11592>] lustre_kill_super+0x32/0x50 [obdclass]
2016-09-30 14:40:14 [11786.688840]  [<ffffffff81209cf9>] deactivate_locked_super+0x49/0x60
2016-09-30 14:40:14 [11786.696457]  [<ffffffff8120a2f6>] deactivate_super+0x46/0x60
2016-09-30 14:40:14 [11786.703375]  [<ffffffff812282c5>] mntput_no_expire+0xc5/0x120
2016-09-30 14:40:14 [11786.710372]  [<ffffffff81229440>] SyS_umount+0xa0/0x3b0
2016-09-30 14:40:14 [11786.716770]  [<ffffffff816aa4c9>] system_call_fastpath+0x16/0x1b
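
For readers not familiar with the libcfs hash code, the failed assertion guards the per-bucket element counter: removing an entry from a bucket whose count is already zero means that entry was either never hashed or has already been removed once. A minimal illustrative sketch of that invariant in kernel-style C (names are simplified; this is not the verbatim cfs_hash source):

    /* Illustrative only; simplified from the cfs_hash design,
     * not the actual libcfs implementation. */
    #include <linux/list.h>
    #include <linux/bug.h>

    struct hash_bucket {
            unsigned int            hsb_count;   /* entries currently in this bucket */
            struct hlist_head       hsb_head;
            /* ... bucket lock ... */
    };

    /* caller holds the bucket lock */
    static void bucket_del_locked(struct hash_bucket *bkt, struct hlist_node *node)
    {
            /* deleting from an empty bucket means the layer above removed
             * (or freed) this entry once already */
            BUG_ON(bkt->hsb_count == 0);         /* the ASSERTION(hsb_count > 0) */

            hlist_del_init(node);
            bkt->hsb_count--;
    }

So the LBUG itself is a symptom: something removed (or freed) the same ldlm resource twice while the namespace hash was being torn down during umount.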


 Comments   
Comment by Olaf Faaland [ 03/Oct/16 ]

Looks like LU-7157, but I wasn't sure whether to put notes from a production-system issue into a ticket about a test failure.

Comment by Olaf Faaland [ 03/Oct/16 ]

Console output preceding this assertion:

2016-09-30 14:39:50 [11762.576750] Lustre: 99059:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1475271484/real 1475271484]  req@ffff881782684500 x1546922312654596/t0(0) o400->MGC172.19.3.1@o2ib600@172.19.3.1@o2ib600:26/25 lens 224/224 e 0 to 1 dl 1475271590 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
2016-09-30 14:39:50 [11762.611814] Lustre: 99059:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
2016-09-30 14:39:50 [11762.623781] LustreError: 166-1: MGC172.19.3.1@o2ib600: Connection to MGS (at 172.19.3.1@o2ib600) was lost; in progress operations using this service will fail
2016-09-30 14:40:13 [11785.220067] LNetError: 94499:0:(o2iblnd_cb.c:3134:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
2016-09-30 14:40:13 [11785.232466] LNetError: 94499:0:(o2iblnd_cb.c:3197:kiblnd_check_conns()) Timed out RDMA with 172.19.3.1@o2ib600 (53): c: 0, oc: 0, rc: 8
2016-09-30 14:40:13 [11785.249479] Lustre: Failing over lsh-OST0002
2016-09-30 14:40:13 [11785.266567] Lustre: lsh-OST0002: Not available for connect from 192.168.136.240@o2ib27 (stopping)
2016-09-30 14:40:13 [11785.279740] LustreError: 124427:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) ldlm_cancel from 192.168.135.99@o2ib27 arrived at 1475271613 with bad export cookie 9157528843916976698
2016-09-30 14:40:13 [11785.301432] LustreError: 124427:0:(ldlm_lock.c:2597:ldlm_lock_dump_handle()) ### ### ns: filter-lsh-OST0002_UUID lock: ffff881f8fe6ea00/0x7f1613f702807cf5 lrc: 3/0,0 mode: PW/PW res: [0xc0466a:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x40000000000000 nid: 192.168.135.99@o2ib27 remote: 0x320df3ff8e6289ab expref: 1777 pid: 99124 timeout: 0 lvb_type: 0
2016-09-30 14:40:14 [11785.787090] Lustre: lsh-OST0002: Not available for connect from 192.168.134.161@o2ib27 (stopping)
2016-09-30 14:40:14 [11785.791589] LustreError: 104316:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) ldlm_cancel from 192.168.134.141@o2ib27 arrived at 1475271614 with bad export cookie 9157528843916977104
2016-09-30 14:40:14 [11785.791591] LustreError: 104316:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) Skipped 2 previous similar messages
2016-09-30 14:40:14 [11785.834020] Lustre: Skipped 74 previous similar messages
2016-09-30 14:40:14 [11785.975211] BUG: unable to handle kernel paging request at 00000000deadbeef
2016-09-30 14:40:14 [11785.984272] IP: [<ffffffff81334259>] memset+0x9/0xb0
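
Two details here point at a use-after-free rather than a bug in the hash table itself: the fault address 0x00000000deadbeef is a classic poison value written into pointers when their object is torn down, and the faulting memset() sits at the top of a call trace that goes through ofd_lvbo_free() (see the description). A small illustrative sketch of why pointer poisoning produces exactly this signature (PTR_POISON and lvb_holder are hypothetical names, not the actual Lustre identifiers):

    #include <stddef.h>

    #define PTR_POISON ((void *)0xdeadbeef)      /* hypothetical poison marker */

    struct lvb_holder {
            void    *lvb_data;
            size_t   lvb_len;
    };

    static void holder_free_data(struct lvb_holder *h, void (*real_free)(void *))
    {
            real_free(h->lvb_data);
            h->lvb_data = PTR_POISON;
            /* any later memset(h->lvb_data, 0, h->lvb_len) now faults at
             * 0xdeadbeef instead of silently scribbling on reused memory */
    }
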
Comment by Olaf Faaland [ 03/Oct/16 ]

The OS on the OSS is a RHEL 7.3 derivative.

Comment by Bruno Faccini (Inactive) [ 03/Oct/16 ]

Olaf, do you mean that the OSTs were being unmounted as part of power-cycling the OSSes?
And yes, it definitely looks like LU-7157, which still seems to be under investigation.

Comment by Olaf Faaland [ 03/Oct/16 ]

Bruno,
I'll have to get the sysadmin to describe what he was doing in more detail. You're right, there's SyS_umount at the bottom of the stack, so he must have been running umount; possibly HA did this on his behalf.

Comment by Cameron Harr [ 03/Oct/16 ]

It's difficult to say what was happening at the time of the crash, as I was working on sibling OSSes and rebooting a couple of them several times. I believe the MDS node was rebooted as well. When I looked at the status of the other OSS nodes (which had been up and healthy), I noticed these crashes.

Comment by Peter Jones [ 03/Oct/16 ]

Oleg is looking into this

Comment by Oleg Drokin [ 04/Oct/16 ]

I believe this is a dup of LU-6304 and the patch is here: http://review.whamcloud.com/13908
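
Without restating the actual change (the Gerrit link above is authoritative), the failure mode LU-6304 describes fits the traces here: the umount-time namespace cleanup walks the resource hash and drops references while other paths can concurrently drop references to the same resources. A purely hypothetical sketch of that race class, i.e. why an unsynchronized "is it still hashed?" check can delete the same entry twice and underflow the bucket count (all names below are made up for illustration, not taken from the patch):

    #include <stdbool.h>

    struct res {
            bool hashed;
    };

    /* Called from both the cleanup iterator and the last-putref path.
     * If the test and the delete are not done under the bucket lock,
     * both callers can observe hashed == true; the second delete then
     * trips ASSERTION(hsb_count > 0) in the hash code. */
    static void unhash_if_needed_racy(struct res *r, void (*hash_del)(struct res *))
    {
            if (r->hashed) {                     /* unlocked check-then-act */
                    hash_del(r);
                    r->hashed = false;
            }
    }
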

Comment by Peter Jones [ 18/Dec/16 ]

This is believed to be a duplicate of LU-6304, whose fix has landed for 2.8.1.
