Bug
Resolution: Fixed
Critical
It looks like something broke in our pdirops support on the ldiskfs side. There are several customer reports and I am seeing this in my testing as well.
The crash looks like this:
[11832.408445] ------------[ cut here ]------------
[11832.452194] kernel BUG at /home/green/git/lustre-release/ldiskfs/htree_lock.c:665!
[11832.452194] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[11832.452194] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic crct10dif_common pcspkr virtio_console virtio_balloon i2c_piix4 ip_tables rpcsec_gss_krb5 ata_generic pata_acpi drm_kms_helper ttm drm drm_panel_orientation_quirks ata_piix i2c_core serio_raw virtio_blk libata floppy [last unloaded: libcfs]
[11832.452194] CPU: 5 PID: 3350 Comm: mdt02_004 Kdump: loaded Tainted: P OE ------------ 3.10.0-7.6-debug #1
[11832.452194] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[11832.452194] task: ffff8803237c87c0 ti: ffff88007b130000 task.ti: ffff88007b130000
[11832.452194] RIP: 0010:[<ffffffffa0bae653>] [<ffffffffa0bae653>] htree_unlock_internal.isra.8+0x133/0x140 [ldiskfs]
[11832.452194] RSP: 0000:ffff88007b1339c0 EFLAGS: 00010246
[11832.452194] RAX: ffff8802f2c68744 RBX: ffff88008e984e00 RCX: 0000000000000000
[11832.452194] RDX: 0000000000000000 RSI: ffff88008e984e1c RDI: ffff8802f2c68740
[11832.452194] RBP: ffff88007b1339f8 R08: 0000000000000058 R09: ffff88029b9686e0
[11832.452194] R10: 0000000000000000 R11: ffff88029b9686d8 R12: 0000000000000000
[11832.502820] R13: ffff8802f2c68740 R14: ffff88008e984e00 R15: ffff880073e54000
[11832.502820] FS: 0000000000000000(0000) GS:ffff88033db40000(0000) knlGS:0000000000000000
[11832.502820] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11832.502820] CR2: 00007f4ccab8bb78 CR3: 00000000ae87a000 CR4: 00000000000006e0
[11832.502820] Call Trace:
[11832.502820] [<ffffffffa0bafa22>] htree_unlock+0x52/0xc0 [ldiskfs]
[11832.502820] [<ffffffffa0ac93d9>] osd_index_ea_lookup+0x3b9/0xec0 [osd_ldiskfs]
[11832.502820] [<ffffffffa03e07de>] ? lu_ucred+0x1e/0x30 [obdclass]
[11832.502820] [<ffffffffa0d95ad5>] lod_lookup+0x25/0x30 [lod]
[11832.502820] [<ffffffffa0c51d72>] __mdd_lookup.isra.18+0x2b2/0x460 [mdd]
[11832.502820] [<ffffffffa0c51fcf>] mdd_lookup+0xaf/0x170 [mdd]
[11832.502820] [<ffffffffa0cd1cdf>] mdt_lookup_version_check+0x6f/0x2c0 [mdt]
[11832.502820] [<ffffffffa0cd68f7>] mdt_reint_unlink+0x217/0x14d0 [mdt]
[11832.502820] [<ffffffffa0cdd8d0>] mdt_reint_rec+0x80/0x210 [mdt]
[11832.502820] [<ffffffffa0cba723>] mdt_reint_internal+0x6e3/0xab0 [mdt]
[11832.502820] [<ffffffffa0cc2924>] ? mdt_thread_info_init+0xa4/0x1e0 [mdt]
[11832.502820] [<ffffffffa0cc58e7>] mdt_reint+0x67/0x140 [mdt]
[11832.502820] [<ffffffffa0663e65>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
[11832.502820] [<ffffffffa0236fa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[11832.502820] [<ffffffffa0607f06>] ptlrpc_server_handle_request+0x256/0xad0 [ptlrpc]
[11832.502820] [<ffffffff810bfbd8>] ? __wake_up_common+0x58/0x90
[11832.544484] [<ffffffff813fb7bb>] ? do_raw_spin_unlock+0x4b/0x90
[11832.545188] [<ffffffffa060bdf9>] ptlrpc_main+0xa99/0x1f60 [ptlrpc]
[11832.545188] [<ffffffff810c32ed>] ? finish_task_switch+0x5d/0x1b0
[11832.545188] [<ffffffff817b6cd0>] ? __schedule+0x410/0xa00
[11832.545188] [<ffffffffa060b360>] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[11832.545188] [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[11832.545188] [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[11832.545188] [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[11832.545188] [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[11832.545188] Code: 67 20 e8 f1 c3 51 e0 4c 3b 65 c8 49 8b 47 20 8b 4d d4 48 8d 70 e0 75 80 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8d
[11832.545188] RIP [<ffffffffa0bae653>] htree_unlock_internal.isra.8+0x133/0x140 [ldiskfs]
The crash location is:
htree_unlock_internal(struct htree_lock *lck)
{
	struct htree_lock_head *lhead = lck->lk_head;
	struct htree_lock *tmp;
	struct htree_lock *tmp2;
	int granted = 0;
	int i;

	BUG_ON(lhead->lh_ngranted[lck->lk_mode] == 0); /* <=========== HERE */
So it's pretty clear that we are trying to unlock a lock that is not currently granted in the mode we are unlocking it with: the lock head's granted count for that mode is already zero.
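To make the failing check concrete, here is a minimal user-space sketch of the same kind of invariant: a lock head keeps a per-mode count of granted locks, and unlocking in a mode whose count is zero is a bug. All names here (`toy_mode`, `toy_lock_head`, `toy_grant`, `toy_unlock`) are hypothetical and are not the ldiskfs API; the sketch returns `false` where the kernel code would hit `BUG_ON`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified lock modes; not the real ldiskfs enum. */
enum toy_mode { TOY_EX, TOY_PW, TOY_PR, TOY_CW, TOY_CR, TOY_MODE_MAX };

/* Hypothetical lock head: one granted counter per mode. */
struct toy_lock_head {
	unsigned int ngranted[TOY_MODE_MAX];
};

/* Hypothetical lock handle. */
struct toy_lock {
	struct toy_lock_head *head;
	enum toy_mode mode;
};

/* Record a grant: bump the counter for the granted mode. */
static void toy_grant(struct toy_lock *lck, struct toy_lock_head *head,
		      enum toy_mode mode)
{
	lck->head = head;
	lck->mode = mode;
	head->ngranted[mode]++;
}

/*
 * Release a lock. Returns false if nothing is granted in this mode,
 * which is the condition the kernel code treats as a fatal bug.
 */
static bool toy_unlock(struct toy_lock *lck)
{
	if (lck->head->ngranted[lck->mode] == 0)
		return false;
	lck->head->ngranted[lck->mode]--;
	return true;
}
```

In this model the check fails in two ways: unlocking a lock that was never granted, or unlocking one whose mode field no longer matches the mode it was granted in. Which of these (if either) corresponds to the real crash is not established here.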
I wrote a debugging/fix patch, but it does not seem to help here: https://review.whamcloud.com/#/c/33068/
The suspicion was that we were waking the waiter wrongly, and that the waiter always assumed a wakeup meant the lock had been granted.
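The suspected pattern can be illustrated with a small single-threaded model. This is only a sketch of the hypothesis, not the ldiskfs code, and the patch based on this suspicion did not resolve the crash. The names `toy_event`, `toy_waiter` and `toy_wait` are hypothetical. A waiter receives a sequence of wakeups, only some of which actually grant the lock; with `recheck` false it trusts the first wakeup, and with `recheck` true it keeps waiting until the grant flag is really set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical wakeup kinds: a bare wakeup, or a wakeup that follows a grant. */
enum toy_event { TOY_WAKE_ONLY, TOY_GRANT_AND_WAKE };

/* Hypothetical waiter state: set only when the lock is really granted. */
struct toy_waiter {
	bool granted;
};

/*
 * Process wakeups in order. Returns the index of the wakeup after which
 * the waiter stops waiting, or -1 if it is still waiting at the end.
 * With recheck == false, any wakeup is taken to mean "granted".
 */
static int toy_wait(struct toy_waiter *w, const enum toy_event *evs, int n,
		    bool recheck)
{
	int i;

	for (i = 0; i < n; i++) {
		if (evs[i] == TOY_GRANT_AND_WAKE)
			w->granted = true;
		if (!recheck || w->granted)
			return i;
	}
	return -1;
}
```

A waiter that returns early without `granted` set would go on to unlock a lock it never held, which is the kind of state the check in `htree_unlock_internal` rejects.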
I have some crashdumps, but nothing in them is conclusive once the situation has already happened. The problem was first seen in my testing on Aug 16th and, for some reason, comes in waves.
- is related to: LU-13054 MDS kernel BUG at ldiskfs/htree_lock.c:429! (Resolved)