Details
Type: Bug
Priority: Minor
Resolution: Not a Bug
Version: Lustre 2.9.0
Environment: Lustre 2.8.60, kernel-3.10.0-327.3.1.el7_lustre.x86_64
Severity: 3
Description
Hi,
During some relatively heavy IOR benchmarks on a system that is not yet in production, we hit an OSS crash in a Lustre I/O service thread (ll_ost_io01_012). So far we have not been able to reproduce the issue, but we wanted to report it in case somebody has some hints.
We're using Lustre servers based on community Lustre 2.9 (tag 2.8.60) + kernel-3.10.0-327.3.1.el7_lustre.x86_64 (CentOS 7.2), and our clients are IEEL 3.0.0.0 (Lustre 2.7) connected through a single Lustre IB FDR/FDR router.
The impacted server has 3 md RAID6 10-disk arrays as OST backends.
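For context, the workload was a heavy sequential-write IOR run. The exact command line is not recorded in this report, so the following is only a hypothetical sketch of that kind of load (file-per-process, large block size, POSIX API); the rank count, sizes and output path are assumptions, not the actual run:

# Hypothetical illustration only -- ranks, sizes and path are assumed, not the real benchmark
mpirun -np 128 ior -a POSIX -w -F -b 16g -t 1m -e -o /mnt/lustre/ior_test/file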
[495232.536334] kernel BUG at drivers/md/raid5.c:529!
[495232.541676] invalid opcode: 0000 [#1] SMP
The stack is:
crash> bt 103442
PID: 103442 TASK: ffff883e0c564500 CPU: 23 COMMAND: "ll_ost_io01_012"
#0 [ffff883e099f3268] machine_kexec at ffffffff81051beb
#1 [ffff883e099f32c8] crash_kexec at ffffffff810f2542
#2 [ffff883e099f3398] oops_end at ffffffff8163e368
#3 [ffff883e099f33c0] die at ffffffff8101859b
#4 [ffff883e099f33f0] do_trap at ffffffff8163da20
#5 [ffff883e099f3440] do_invalid_op at ffffffff81015204
#6 [ffff883e099f34f0] invalid_op at ffffffff8164721e
[exception RIP: get_active_stripe+1838]
RIP: ffffffffa0f2721e RSP: ffff883e099f35a8 RFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff883fa118e820 RCX: dead000000200200
RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff883f40875a30
RBP: ffff883e099f3650 R8: ffff883f40875a40 R9: 0000000000000080
R10: 0000000000000007 R11: 0000000000000000 R12: ffff883fa118e800
R13: ffff883f40875a30 R14: ffff883fa118e800 R15: 00000000aff93430
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff883e099f35a0] get_active_stripe at ffffffffa0f26e2a [raid456]
#8 [ffff883e099f3658] make_request at ffffffffa0f2cbd5 [raid456]
#9 [ffff883e099f3748] md_make_request at ffffffff814b7af5
#10 [ffff883e099f37a8] generic_make_request at ffffffff812c73e2
#11 [ffff883e099f37e0] submit_bio at ffffffff812c74a1
#12 [ffff883e099f3838] osd_submit_bio at ffffffffa0eb89ac [osd_ldiskfs]
#13 [ffff883e099f3848] osd_do_bio at ffffffffa0ebade7 [osd_ldiskfs]
#14 [ffff883e099f3968] osd_write_commit at ffffffffa0ebb974 [osd_ldiskfs]
#15 [ffff883e099f3a08] ofd_commitrw_write at ffffffffa120d774 [ofd]
#16 [ffff883e099f3a80] ofd_commitrw at ffffffffa1210f2d [ofd]
#17 [ffff883e099f3b08] obd_commitrw at ffffffffa10175a5 [ptlrpc]
#18 [ffff883e099f3b70] tgt_brw_write at ffffffffa0feff81 [ptlrpc]
#19 [ffff883e099f3cd8] tgt_request_handle at ffffffffa0fec235 [ptlrpc]
#20 [ffff883e099f3d20] ptlrpc_server_handle_request at ffffffffa0f981bb [ptlrpc]
#21 [ffff883e099f3de8] ptlrpc_main at ffffffffa0f9c270 [ptlrpc]
#22 [ffff883e099f3ec8] kthread at ffffffff810a5aef
#23 [ffff883e099f3f50] ret_from_fork at ffffffff81645a58
The BUG is triggered by the assertion at line 529 of raid5.c, BUG_ON(sh->batch_head):
crash> l drivers/md/raid5.c:526
521 static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
522 {
523         struct r5conf *conf = sh->raid_conf;
524         int i, seq;
525
526         BUG_ON(atomic_read(&sh->count) != 0);
527         BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
528         BUG_ON(stripe_operations_active(sh));
529         BUG_ON(sh->batch_head);
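Since the vmcore is available, one way to dig further (a sketch of what we could try, not something we have already done) would be to dump the offending stripe_head from the crash session and confirm that batch_head is indeed non-NULL, assuming the stripe_head pointer is the value held in RDI/R13 (ffff883f40875a30) at the time of the exception:

crash> struct stripe_head ffff883f40875a30
crash> struct stripe_head.batch_head,state,count ffff883f40875a30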
Please note that we do have a crash dump of the OSS available.
Best,
- Stephane