Details
Type: Bug
Priority: Minor
Resolution: Not a Bug
Version: Lustre 2.9.0
Environment: Lustre 2.8.60, kernel-3.10.0-327.3.1.el7_lustre.x86_64
Severity: 3
Description
Hi,
During some relatively heavy IOR benchmarks on a system that is not yet in production, we hit an OSS crash in a Lustre I/O service thread (ll_ost_io01_012). So far we have not been able to reproduce the issue, but we wanted to report it in case somebody has some hints.
We're using Lustre servers based on community Lustre 2.9 (tag 2.8.60) + kernel-3.10.0-327.3.1.el7_lustre.x86_64 (CentOS 7.2), and our clients are IEEL 3.0.0.0 (Lustre 2.7) connected through a single Lustre IB FDR/FDR router.
The impacted server has 3 md RAID6 10-disk arrays as OST backends.
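For context, the workload was a heavy sequential-write IOR run. The exact command line is not recorded in this report, so the following is only a hypothetical sketch of that kind of load (file-per-process, large block size, POSIX API); the rank count, sizes and output path are assumptions, not the actual run:

# Hypothetical illustration only -- ranks, sizes and path are assumed, not the real benchmark
mpirun -np 128 ior -a POSIX -w -F -b 16g -t 1m -e -o /mnt/lustre/ior_test/file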
[495232.536334] kernel BUG at drivers/md/raid5.c:529!
[495232.541676] invalid opcode: 0000 [#1] SMP
The stack is:
crash> bt 103442
PID: 103442 TASK: ffff883e0c564500 CPU: 23 COMMAND: "ll_ost_io01_012"
#0 [ffff883e099f3268] machine_kexec at ffffffff81051beb
#1 [ffff883e099f32c8] crash_kexec at ffffffff810f2542
#2 [ffff883e099f3398] oops_end at ffffffff8163e368
#3 [ffff883e099f33c0] die at ffffffff8101859b
#4 [ffff883e099f33f0] do_trap at ffffffff8163da20
#5 [ffff883e099f3440] do_invalid_op at ffffffff81015204
#6 [ffff883e099f34f0] invalid_op at ffffffff8164721e
[exception RIP: get_active_stripe+1838]
RIP: ffffffffa0f2721e RSP: ffff883e099f35a8 RFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff883fa118e820 RCX: dead000000200200
RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff883f40875a30
RBP: ffff883e099f3650 R8: ffff883f40875a40 R9: 0000000000000080
R10: 0000000000000007 R11: 0000000000000000 R12: ffff883fa118e800
R13: ffff883f40875a30 R14: ffff883fa118e800 R15: 00000000aff93430
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff883e099f35a0] get_active_stripe at ffffffffa0f26e2a [raid456]
#8 [ffff883e099f3658] make_request at ffffffffa0f2cbd5 [raid456]
#9 [ffff883e099f3748] md_make_request at ffffffff814b7af5
#10 [ffff883e099f37a8] generic_make_request at ffffffff812c73e2
#11 [ffff883e099f37e0] submit_bio at ffffffff812c74a1
#12 [ffff883e099f3838] osd_submit_bio at ffffffffa0eb89ac [osd_ldiskfs]
#13 [ffff883e099f3848] osd_do_bio at ffffffffa0ebade7 [osd_ldiskfs]
#14 [ffff883e099f3968] osd_write_commit at ffffffffa0ebb974 [osd_ldiskfs]
#15 [ffff883e099f3a08] ofd_commitrw_write at ffffffffa120d774 [ofd]
#16 [ffff883e099f3a80] ofd_commitrw at ffffffffa1210f2d [ofd]
#17 [ffff883e099f3b08] obd_commitrw at ffffffffa10175a5 [ptlrpc]
#18 [ffff883e099f3b70] tgt_brw_write at ffffffffa0feff81 [ptlrpc]
#19 [ffff883e099f3cd8] tgt_request_handle at ffffffffa0fec235 [ptlrpc]
#20 [ffff883e099f3d20] ptlrpc_server_handle_request at ffffffffa0f981bb [ptlrpc]
#21 [ffff883e099f3de8] ptlrpc_main at ffffffffa0f9c270 [ptlrpc]
#22 [ffff883e099f3ec8] kthread at ffffffff810a5aef
#23 [ffff883e099f3f50] ret_from_fork at ffffffff81645a58
The BUG is triggered by the assertion at line 529 of raid5.c, BUG_ON(sh->batch_head):
crash> l drivers/md/raid5.c:526
521 static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
522 {
523         struct r5conf *conf = sh->raid_conf;
524         int i, seq;
525
526         BUG_ON(atomic_read(&sh->count) != 0);
527         BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
528         BUG_ON(stripe_operations_active(sh));
529         BUG_ON(sh->batch_head);
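Since the vmcore is available, one way to dig further (a sketch of what we could try, not something we have already done) would be to dump the offending stripe_head from the crash session and confirm that batch_head is indeed non-NULL, assuming the stripe_head pointer is the value held in RDI/R13 (ffff883f40875a30) at the time of the exception:

crash> struct stripe_head ffff883f40875a30
crash> struct stripe_head.batch_head,state,count ffff883f40875a30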
Please note that we do have a crash dump of the OSS available.
Best,
- Stephane