[LU-8663] LustreError: 106058:0:(hash.c:554:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed Created: 03/Oct/16 Updated: 16/Feb/17 Resolved: 18/Dec/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Oleg Drokin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl |
| Environment: |
lustre-2.8.0_0.0.llnlpreview.41-2.ch6.x86_64 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While the filesystem was mounted and active, we began power cycling OSSs to verify failover worked properly. Several OSS nodes crashed, with traces like this one:

2016-09-30 14:40:14 [11785.975211] BUG: unable to handle kernel paging request at 00000000deadbeef
2016-09-30 14:40:14 [11785.984272] IP: [<ffffffff81334259>] memset+0x9/0xb0
2016-09-30 14:40:14 [11785.986998] LustreError: 106058:0:(hash.c:554:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
2016-09-30 14:40:14 [11785.986999] LustreError: 106058:0:(hash.c:554:cfs_hash_bd_del_locked()) LBUG
2016-09-30 14:40:14 [11785.987000] Pid: 106058, comm: ldlm_bl_10
2016-09-30 14:40:14 [11785.987000]
2016-09-30 14:40:14 [11786.490434] Call Trace:
2016-09-30 14:40:14 [11786.493875] [<ffffffffa0d2e00d>] ? ofd_lvbo_free+0x4d/0xe0 [ofd]
2016-09-30 14:40:14 [11786.501409] [<ffffffffa1099643>] ldlm_resource_putref_locked+0x133/0x430 [ptlrpc]
2016-09-30 14:40:14 [11786.510593] [<ffffffffa1099952>] ldlm_res_hop_put_locked+0x12/0x20 [ptlrpc]
2016-09-30 14:40:14 [11786.519183] [<ffffffffa08d1b74>] cfs_hash_for_each_relax+0x1b4/0x3d0 [libcfs]
2016-09-30 14:40:14 [11786.527976] [<ffffffffa1096d60>] ? cleanup_resource+0x370/0x370 [ptlrpc]
2016-09-30 14:40:14 [11786.536291] [<ffffffffa1096d60>] ? cleanup_resource+0x370/0x370 [ptlrpc]
2016-09-30 14:40:14 [11786.544587] [<ffffffffa08d4dc5>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
2016-09-30 14:40:14 [11786.553382] [<ffffffffa1094eb0>] ldlm_namespace_cleanup+0x30/0xc0 [ptlrpc]
2016-09-30 14:40:14 [11786.561886] [<ffffffffa1095d5f>] __ldlm_namespace_free+0x5f/0x5c0 [ptlrpc]
2016-09-30 14:40:14 [11786.570385] [<ffffffffa0c660e4>] ? lfsck_instance_find+0x74/0xb0 [lfsck]
2016-09-30 14:40:14 [11786.578678] [<ffffffff8169d015>] ? mutex_lock+0x25/0x42
2016-09-30 14:40:14 [11786.585329] [<ffffffffa0c6a0a8>] ? lfsck_stop+0x1b8/0x4f0 [lfsck]
2016-09-30 14:40:14 [11786.592951] [<ffffffff811e5fd6>] ? kmem_cache_alloc_trace+0x226/0x250
2016-09-30 14:40:14 [11786.600978] [<ffffffffa109631a>] ldlm_namespace_free_prior+0x5a/0x210 [ptlrpc]
2016-09-30 14:40:14 [11786.609869] [<ffffffffa0d1089a>] ofd_device_fini+0x8a/0x2a0 [ofd]
2016-09-30 14:40:14 [11786.617527] [<ffffffffa0a0a21c>] class_cleanup+0x8dc/0xd70 [obdclass]
2016-09-30 14:40:14 [11786.625561] [<ffffffffa0a0cbfc>] class_process_config+0x1e2c/0x2f70 [obdclass]
2016-09-30 14:40:14 [11786.634454] [<ffffffff811e5a63>] ? __kmalloc+0x233/0x280
2016-09-30 14:40:14 [11786.641219] [<ffffffffa0a0611b>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
2016-09-30 14:40:14 [11786.649424] [<ffffffffa0a0de2f>] class_manual_cleanup+0xef/0x810 [obdclass]
2016-09-30 14:40:14 [11786.658007] [<ffffffffa0a3fece>] server_put_super+0x8de/0xcd0 [obdclass]
2016-09-30 14:40:14 [11786.666272] [<ffffffff81209572>] generic_shutdown_super+0x72/0xf0
2016-09-30 14:40:14 [11786.673841] [<ffffffff81209942>] kill_anon_super+0x12/0x20
2016-09-30 14:40:14 [11786.680720] [<ffffffffa0a11592>] lustre_kill_super+0x32/0x50 [obdclass]
2016-09-30 14:40:14 [11786.688840] [<ffffffff81209cf9>] deactivate_locked_super+0x49/0x60
2016-09-30 14:40:14 [11786.696457] [<ffffffff8120a2f6>] deactivate_super+0x46/0x60
2016-09-30 14:40:14 [11786.703375] [<ffffffff812282c5>] mntput_no_expire+0xc5/0x120
2016-09-30 14:40:14 [11786.710372] [<ffffffff81229440>] SyS_umount+0xa0/0x3b0
2016-09-30 14:40:14 [11786.716770] [<ffffffff816aa4c9>] system_call_fastpath+0x16/0x1b |
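For context on where the assertion sits: cfs_hash_bd_del_locked() in libcfs maintains a per-bucket element count and trips the LBUG when a deletion would drive that count below zero, i.e. a node is being removed from a bucket that already believes it is empty. Below is a minimal sketch of that invariant; it is not the libcfs source. Only the field and function names in the assertion text come from the log, and the simplified structures are assumptions:

#include <assert.h>

/* Hypothetical, simplified stand-ins for the libcfs hash structures named
 * in the assertion; the real definitions in libcfs/hash.c differ. */
struct cfs_hash_bucket {
	unsigned int hsb_count;             /* nodes currently in this bucket */
};

struct cfs_hash_bd {
	struct cfs_hash_bucket *bd_bucket;  /* bucket descriptor -> bucket */
};

/* Sketch of the invariant behind hash.c:554: deleting a node must not
 * underflow the per-bucket count.  In libcfs the check is an LASSERT()
 * that ends in LBUG(), which is what the console log shows. */
static void bucket_del_sketch(struct cfs_hash_bd *bd)
{
	assert(bd->bd_bucket->hsb_count > 0);
	bd->bd_bucket->hsb_count--;
}

Given the call trace (ldlm_namespace_cleanup -> ldlm_res_hop_put_locked -> cfs_hash_bd_del_locked during umount), the underflow suggests the same ldlm resource was dropped from the namespace hash twice while the OST was being torn down.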
| Comments |
| Comment by Olaf Faaland [ 03/Oct/16 ] |
|
Looks like LU-7157, but I wasn't sure whether to put notes from a production system issue into a ticket about a test failure. |
| Comment by Olaf Faaland [ 03/Oct/16 ] |
|
Console output preceding this assertion:

2016-09-30 14:39:50 [11762.576750] Lustre: 99059:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1475271484/real 1475271484] req@ffff881782684500 x1546922312654596/t0(0) o400->MGC172.19.3.1@o2ib600@172.19.3.1@o2ib600:26/25 lens 224/224 e 0 to 1 dl 1475271590 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
2016-09-30 14:39:50 [11762.611814] Lustre: 99059:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
2016-09-30 14:39:50 [11762.623781] LustreError: 166-1: MGC172.19.3.1@o2ib600: Connection to MGS (at 172.19.3.1@o2ib600) was lost; in progress operations using this service will fail
2016-09-30 14:40:13 [11785.220067] LNetError: 94499:0:(o2iblnd_cb.c:3134:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
2016-09-30 14:40:13 [11785.232466] LNetError: 94499:0:(o2iblnd_cb.c:3197:kiblnd_check_conns()) Timed out RDMA with 172.19.3.1@o2ib600 (53): c: 0, oc: 0, rc: 8
2016-09-30 14:40:13 [11785.249479] Lustre: Failing over lsh-OST0002
2016-09-30 14:40:13 [11785.266567] Lustre: lsh-OST0002: Not available for connect from 192.168.136.240@o2ib27 (stopping)
2016-09-30 14:40:13 [11785.279740] LustreError: 124427:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) ldlm_cancel from 192.168.135.99@o2ib27 arrived at 1475271613 with bad export cookie 9157528843916976698
2016-09-30 14:40:13 [11785.301432] LustreError: 124427:0:(ldlm_lock.c:2597:ldlm_lock_dump_handle()) ### ### ns: filter-lsh-OST0002_UUID lock: ffff881f8fe6ea00/0x7f1613f702807cf5 lrc: 3/0,0 mode: PW/PW res: [0xc0466a:0x0:0x0].0x0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x40000000000000 nid: 192.168.135.99@o2ib27 remote: 0x320df3ff8e6289ab expref: 1777 pid: 99124 timeout: 0 lvb_type: 0
2016-09-30 14:40:14 [11785.787090] Lustre: lsh-OST0002: Not available for connect from 192.168.134.161@o2ib27 (stopping)
2016-09-30 14:40:14 [11785.791589] LustreError: 104316:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) ldlm_cancel from 192.168.134.141@o2ib27 arrived at 1475271614 with bad export cookie 9157528843916977104
2016-09-30 14:40:14 [11785.791591] LustreError: 104316:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) Skipped 2 previous similar messages
2016-09-30 14:40:14 [11785.834020] Lustre: Skipped 74 previous similar messages
2016-09-30 14:40:14 [11785.975211] BUG: unable to handle kernel paging request at 00000000deadbeef
2016-09-30 14:40:14 [11785.984272] IP: [<ffffffff81334259>] memset+0x9/0xb0 |
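The faulting address 00000000deadbeef is worth noting: it is a common poison value, and a fault inside a memset reached via ofd_lvbo_free() is consistent with the resource's LVB pointer having already been freed and poisoned, i.e. the same resource being torn down twice during shutdown. Below is a generic illustration of the poison-pointer pattern; the macro and struct names are assumptions for illustration, not Lustre's actual memory-debugging macros:

#include <stdlib.h>
#include <string.h>

/* Debug allocators commonly overwrite a freed pointer with a recognizable
 * poison value so that any later use faults at an obvious address such as
 * 0xdeadbeef instead of silently corrupting reallocated memory. */
#define PTR_POISON ((void *)0xdeadbeefUL)

struct lvb_holder {
	void *lvb;                 /* hypothetical per-resource LVB buffer */
	size_t lvb_len;
};

static void lvb_free_sketch(struct lvb_holder *h)
{
	/* Scrub the buffer, release it, then poison the stored pointer. */
	memset(h->lvb, 0x5a, h->lvb_len);
	free(h->lvb);
	h->lvb = PTR_POISON;
}

A second call to lvb_free_sketch() on the same holder would then memset() through 0xdeadbeef and fault exactly like the "unable to handle kernel paging request at 00000000deadbeef" line above, which fits the double-deletion picture suggested by the hsb_count assertion.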
| Comment by Olaf Faaland [ 03/Oct/16 ] |
|
The O/S on the OSS is a RHEL 7.3 derivative. |
| Comment by Bruno Faccini (Inactive) [ 03/Oct/16 ] |
|
Olaf, do you mean umounting the OSTs by power-cycling the OSSs? |
| Comment by Olaf Faaland [ 03/Oct/16 ] |
|
Bruno, |
| Comment by Cameron Harr [ 03/Oct/16 ] |
|
It's difficult to say what was happening at the time of the crash as I was working on sibling OSSs and rebooting a couple of them several times. I believe the MDS node was rebooted as well. When I looked at the status of the other OSS nodes (which had been up and healthy), I noticed these crashes. |
| Comment by Peter Jones [ 03/Oct/16 ] |
|
Oleg is looking into this |
| Comment by Oleg Drokin [ 04/Oct/16 ] |
|
I believe this is a dup of |
| Comment by Peter Jones [ 18/Dec/16 ] |
|
This is believed to be a duplicate of |