[LU-8917] OSS crash with kernel BUG at drivers/md/raid5.c:529! Created: 07/Dec/16  Updated: 05/Feb/18  Resolved: 06/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Stephane Thiell Assignee: Jinshan Xiong (Inactive)
Resolution: Not a Bug Votes: 1
Labels: None
Environment:

Lustre 2.8.60, kernel-3.10.0-327.3.1.el7_lustre.x86_64


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

During some relatively heavy IOR benchmarks on a system that is not yet in production, we hit an OSS crash in a Lustre thread (ll_ost_io01_012). So far, we have not been able to reproduce the issue, but we wanted to report it in case somebody has some hints.

We're using Lustre servers based on community Lustre 2.9 (tag 2.8.60) + kernel-3.10.0-327.3.1.el7_lustre.x86_64 (CentOS 7.2) and our clients are IEEL 3.0.0.0 (Lustre 2.7) connected through a single Lustre IB FDR/FDR router.

The impacted server has 3 md RAID-6 10-disk arrays as OST backends.
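For reference, each OST backend here is simply an md RAID-6 array with an ldiskfs Lustre OST on top. A hypothetical layout for one of them would look roughly like the following (device names, index and fsname are placeholders, not taken from this system):

# hypothetical example only -- devices, fsname and index are placeholders
mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]
mkfs.lustre --ost --backfstype=ldiskfs --fsname=<fsname> --index=<N> --mgsnode=<MGS NID> /dev/md0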
 

[495232.536334] kernel BUG at drivers/md/raid5.c:529!
[495232.541676] invalid opcode: 0000 [#1] SMP




The stack is:

crash> bt 103442
PID: 103442  TASK: ffff883e0c564500  CPU: 23  COMMAND: "ll_ost_io01_012"
 #0 [ffff883e099f3268] machine_kexec at ffffffff81051beb
 #1 [ffff883e099f32c8] crash_kexec at ffffffff810f2542
 #2 [ffff883e099f3398] oops_end at ffffffff8163e368
 #3 [ffff883e099f33c0] die at ffffffff8101859b
 #4 [ffff883e099f33f0] do_trap at ffffffff8163da20
 #5 [ffff883e099f3440] do_invalid_op at ffffffff81015204
 #6 [ffff883e099f34f0] invalid_op at ffffffff8164721e
    [exception RIP: get_active_stripe+1838]
    RIP: ffffffffa0f2721e  RSP: ffff883e099f35a8  RFLAGS: 00010086
    RAX: 0000000000000000  RBX: ffff883fa118e820  RCX: dead000000200200
    RDX: 0000000000000000  RSI: 0000000000000006  RDI: ffff883f40875a30
    RBP: ffff883e099f3650   R8: ffff883f40875a40   R9: 0000000000000080
    R10: 0000000000000007  R11: 0000000000000000  R12: ffff883fa118e800
    R13: ffff883f40875a30  R14: ffff883fa118e800  R15: 00000000aff93430
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff883e099f35a0] get_active_stripe at ffffffffa0f26e2a [raid456]
 #8 [ffff883e099f3658] make_request at ffffffffa0f2cbd5 [raid456]
 #9 [ffff883e099f3748] md_make_request at ffffffff814b7af5
#10 [ffff883e099f37a8] generic_make_request at ffffffff812c73e2
#11 [ffff883e099f37e0] submit_bio at ffffffff812c74a1
#12 [ffff883e099f3838] osd_submit_bio at ffffffffa0eb89ac [osd_ldiskfs]
#13 [ffff883e099f3848] osd_do_bio at ffffffffa0ebade7 [osd_ldiskfs]
#14 [ffff883e099f3968] osd_write_commit at ffffffffa0ebb974 [osd_ldiskfs]
#15 [ffff883e099f3a08] ofd_commitrw_write at ffffffffa120d774 [ofd]
#16 [ffff883e099f3a80] ofd_commitrw at ffffffffa1210f2d [ofd]
#17 [ffff883e099f3b08] obd_commitrw at ffffffffa10175a5 [ptlrpc]
#18 [ffff883e099f3b70] tgt_brw_write at ffffffffa0feff81 [ptlrpc]
#19 [ffff883e099f3cd8] tgt_request_handle at ffffffffa0fec235 [ptlrpc]
#20 [ffff883e099f3d20] ptlrpc_server_handle_request at ffffffffa0f981bb [ptlrpc]
#21 [ffff883e099f3de8] ptlrpc_main at ffffffffa0f9c270 [ptlrpc]
#22 [ffff883e099f3ec8] kthread at ffffffff810a5aef
#23 [ffff883e099f3f50] ret_from_fork at ffffffff81645a58





The BUG is triggered by line 529 in raid5.c, BUG_ON(sh->batch_head), in init_stripe():

crash> l drivers/md/raid5.c:526
521     static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
522     {
523             struct r5conf *conf = sh->raid_conf;
524             int i, seq;
525     
526             BUG_ON(atomic_read(&sh->count) != 0);
527             BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
528             BUG_ON(stripe_operations_active(sh));
529             BUG_ON(sh->batch_head);




Please note that we do have a crash dump of the OSS available.
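In case it helps, inspecting it only needs the matching vmlinux with debuginfo, something along the lines of (paths are just an example):

crash /usr/lib/debug/lib/modules/3.10.0-327.3.1.el7_lustre.x86_64/vmlinux /var/crash/<timestamp>/vmcore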

Best,

Stephane


 Comments   
Comment by Peter Jones [ 09/Dec/16 ]

Jinshan

Could you please advise on this one?

Thanks

Peter

Comment by Stephane Thiell [ 09/May/17 ]

Hi,

I got another occurrence of this issue on a newer el7 kernel (3.10.0-514.10.2.el7_lustre.x86_64). It now seems clear to me that this is a (rare) MD RAID bug, but I would like to continue documenting the problem here, as it is triggered when using Lustre.

The issue results in a panic due to the BUG_ON(sh->batch_head) in raid5.c. Last weekend, the issue occurred during a weekly raid-check (of all volumes). This time, the crashing backtrace doesn't contain any Lustre frames. The RAID volumes (12 in total) are fairly new; no mismatches or hardware errors have been detected. After failover, the filesystem is still running well.
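For context, the weekly raid-check on EL7 is just a cron job that starts a scrub pass through sysfs, roughly the following per array (script paths vary, this is the essential part), which is what spawns the mdX_resync threads seen below:

echo check > /sys/block/md0/md/sync_action    # start a scrub pass; md spawns md0_resync
cat /sys/block/md0/md/mismatch_cnt            # checked once the pass has completed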

The panic:

[535089.369357] kernel BUG at drivers/md/raid5.c:527!
[535089.374700] invalid opcode: 0000 [#1] SMP 
[535089.379384] Modules linked in: ...
[535089.503509] CPU: 34 PID: 280061 Comm: md0_resync Tainted: G           OE  ------------   3.10.0-514.10.2.el7_lustre.x86_64 #1

I have a crash dump, this is the backtrace:

crash> bt 280061
PID: 280061  TASK: ffff8800757cde20  CPU: 34  COMMAND: "md0_resync"
#0 [ffff88024e217830] machine_kexec at ffffffff81059bdb
#1 [ffff88024e217890] __crash_kexec at ffffffff81105382
#2 [ffff88024e217960] crash_kexec at ffffffff81105470
#3 [ffff88024e217978] oops_end at ffffffff8168f508
#4 [ffff88024e2179a0] die at ffffffff8102e93b
#5 [ffff88024e2179d0] do_trap at ffffffff8168ebc0
#6 [ffff88024e217a20] do_invalid_op at ffffffff8102b144
#7 [ffff88024e217ad0] invalid_op at ffffffff816984de
   [exception RIP: raid5_get_active_stripe+1809]
   RIP: ffffffffa0e4ed71  RSP: ffff88024e217b88  RFLAGS: 00010086
   RAX: 0000000000000000  RBX: ffff883fe5e40408  RCX: dead000000000200
   RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff8823ebf45ca0
   RBP: ffff88024e217c30   R8: ffff8823ebf45cb0   R9: 0000000000000080
   R10: 0000000000000006  R11: 0000000000000000  R12: ffff883fe5e40400
   R13: ffff8823ebf45ca0  R14: 0000000000000000  R15: 0000000000000000
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#8 [ffff88024e217b80] raid5_get_active_stripe at ffffffffa0e4e996 [raid456]
#9 [ffff88024e217be0] raid5_release_stripe at ffffffffa0e48f24 [raid456]
#10 [ffff88024e217c38] raid5_sync_request at ffffffffa0e53c4b [raid456]
#11 [ffff88024e217ca8] md_do_sync at ffffffff814fef9b
#12 [ffff88024e217e50] md_thread at ffffffff814fb1b5
#13 [ffff88024e217ec8] kthread at ffffffff810b06ff
#14 [ffff88024e217f50] ret_from_fork at ffffffff81696b98

It appears to be triggered by BUG_ON(sh->batch_head) in init_stripe():

crash> l drivers/md/raid5.c:524
519     static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
520     {
521             struct r5conf *conf = sh->raid_conf;
522             int i, seq;
523     
524             BUG_ON(atomic_read(&sh->count) != 0);
525             BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
526             BUG_ON(stripe_operations_active(sh));
527             BUG_ON(sh->batch_head);    <<<
528 

Other I/Os were being processed at this time, but I am not sure how to check whether they were actually targeting the same md device (one possible way to check from the dump is sketched after the listing below):

crash> ps | grep ">"
<snip>
59684      2  28  ffff88407fdc0fb0  RU   0.0       0      0  [md0_raid6]
61479      2  17  ffff883e46e80000  UN   0.0       0      0  [ll_ost_io01_001]
220748      2  23  ffff881fb7ab8fb0  UN   0.0       0      0  [ll_ost_io01_011]
220750      2  19  ffff881fb7abce70  UN   0.0       0      0  [ll_ost_io01_013]
279158      2  14  ffff883ab46b4e70  RU   0.0       0      0  [md22_resync]
280061      2  34  ffff8800757cde20  RU   0.0       0      0  [md0_resync]
280829      2   6  ffff881c72296dd0  RU   0.0       0      0  [md6_resync]
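A possible way to correlate these threads with a specific md from the dump is to walk from the stripe_head back to its r5conf and mddev. This is only a sketch: it assumes R13 in the crashing frame holds the stripe_head pointer, which would need to be confirmed against the disassembly of raid5_get_active_stripe:

crash> struct stripe_head.raid_conf ffff8823ebf45ca0    # stripe_head from the crashing thread (assumed from R13)
crash> struct r5conf.mddev <raid_conf address>          # r5conf -> mddev
crash> struct mddev.gendisk <mddev address>             # gendisk identifies which /dev/mdX this is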

Example of a possibly concurrent writing thread:

crash> bt 61479
PID: 61479  TASK: ffff883e46e80000  CPU: 17  COMMAND: "ll_ost_io01_001"
#0 [ffff883ffc805e58] crash_nmi_callback at ffffffff8104d2e2
#1 [ffff883ffc805e68] nmi_handle at ffffffff8168f699
#2 [ffff883ffc805eb0] do_nmi at ffffffff8168f813
#3 [ffff883ffc805ef0] end_repeat_nmi at ffffffff8168ead3
   [exception RIP: _raw_spin_lock_irq+63]
   RIP: ffffffff8168e09f  RSP: ffff883e46e3f588  RFLAGS: 00000002
   RAX: 00000000000044c2  RBX: ffff883fe5e40408  RCX: 000000000000c464
   RDX: 000000000000c468  RSI: 000000000000c468  RDI: ffff883fe5e40408
   RBP: ffff883e46e3f588   R8: 0000000000000000   R9: 0000000000000080
   R10: 0000000000000002  R11: 0000000000000000  R12: ffff883fe5e40400
   R13: 0000000000000000  R14: ffff883fe0e61900  R15: 0000000000000000
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#4 [ffff883e46e3f588] _raw_spin_lock_irq at ffffffff8168e09f
#5 [ffff883e46e3f590] raid5_get_active_stripe at ffffffffa0e4e6cb [raid456]
#6 [ffff883e46e3f648] raid5_make_request at ffffffffa0e4ef55 [raid456]
#7 [ffff883e46e3f738] md_make_request at ffffffff814f7dfc
#8 [ffff883e46e3f798] generic_make_request at ffffffff812ee939
#9 [ffff883e46e3f7e0] submit_bio at ffffffff812eea81
#10 [ffff883e46e3f838] osd_submit_bio at ffffffffa10a0bcc [osd_ldiskfs]
#11 [ffff883e46e3f848] osd_do_bio at ffffffffa10a3007 [osd_ldiskfs]
#12 [ffff883e46e3f968] osd_write_commit at ffffffffa10a3b94 [osd_ldiskfs]
#13 [ffff883e46e3fa08] ofd_commitrw_write at ffffffffa1113774 [ofd]
#14 [ffff883e46e3fa80] ofd_commitrw at ffffffffa1116f2d [ofd]
#15 [ffff883e46e3fb08] obd_commitrw at ffffffffa0c43c22 [ptlrpc]
#16 [ffff883e46e3fb70] tgt_brw_write at ffffffffa0c1bfc1 [ptlrpc]
#17 [ffff883e46e3fcd8] tgt_request_handle at ffffffffa0c18275 [ptlrpc]
#18 [ffff883e46e3fd20] ptlrpc_server_handle_request at ffffffffa0bc41fb [ptlrpc]
#19 [ffff883e46e3fde8] ptlrpc_main at ffffffffa0bc82b0 [ptlrpc]
#20 [ffff883e46e3fec8] kthread at ffffffff810b06ff
#21 [ffff883e46e3ff50] ret_from_fork at ffffffff81696b98

I found a similar occurrence of this bug on this page https://lkml.org/lkml/2016/12/23/205 (not with Lustre though).

I have also described the problem on the linux-raid mailing list, but have not received any reply so far.

Stephane

Comment by Stanford Research Computing Center [ 15/Jun/17 ]

This issue hit us one more time on the same OSS. Could you assist in debugging this a little?

[143161.778606] ------------[ cut here ]------------
[143161.783857] kernel BUG at drivers/md/raid5.c:527!
[143161.789199] invalid opcode: 0000 [#1] SMP 
[143161.793883] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq libcrc32c async_xor xor async_tx ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) uas iTCO_wdt iTCO_vendor_support vfat fat dcdbas mxm_wmi intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr usb_storage mpt2sas mptctl mptbase rpcsec_gss_krb5 nfsv4 dell_rbu dns_resolver nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core lpc_ich mei_me mei dm_service_time ses enclosure
[143161.873461]  ipmi_devintf sg ipmi_si ipmi_msghandler shpchp acpi_power_meter wmi dm_multipath dm_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci libahci drm mpt3sas tg3 raid_class mlx4_core crct10dif_pclmul crct10dif_common ptp crc32c_intel libata megaraid_sas i2c_core devlink scsi_transport_sas pps_core fjes
[143161.918022] CPU: 14 PID: 253667 Comm: ll_ost_io00_017 Tainted: G           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
[143161.931213] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.3.4 11/08/2016
[143161.939660] task: ffff883fac2b0fb0 ti: ffff883912b30000 task.ti: ffff883912b30000
[143161.948107] RIP: 0010:[<ffffffffa0fffd71>]  [<ffffffffa0fffd71>] raid5_get_active_stripe+0x711/0x750 [raid456]
[143161.959387] RSP: 0018:ffff883912b33598  EFLAGS: 00010082
[143161.965409] RAX: 0000000000000000 RBX: ffff883fadb7800c RCX: dead000000000200
[143161.973467] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff881e5a980f70
[143161.981525] RBP: ffff883912b33640 R08: ffff881e5a980f80 R09: 0000000000000080
[143161.989584] R10: 0000000000000000 R11: 0000000000000000 R12: ffff883fadb78000
[143161.997642] R13: ffff881e5a980f70 R14: 0000000000000000 R15: 0000000000000000
[143162.005702] FS:  0000000000000000(0000) GS:ffff881ffcfc0000(0000) knlGS:0000000000000000
[143162.014828] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[143162.021335] CR2: 00000000027b8f78 CR3: 00000000019be000 CR4: 00000000001407e0
[143162.029394] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[143162.037453] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[143162.045511] Stack:
[143162.047848]  ffff883fadb78000 0000000000000000 ffff881ff9b9d900 0000000059298000
[143162.056234]  ffff883fadb78004 00000000aec652c8 ffff883fadb781d8 ffff883900000001
[143162.064619]  ffff883fadb78000 0000000000000000 ffff881e59298000 ffff883912b33640
[143162.073004] Call Trace:
[143162.075834]  [<ffffffffa0ffff55>] raid5_make_request+0x1a5/0xd70 [raid456]
[143162.083620]  [<ffffffffa0fafbbc>] ? ldiskfs_ext_insert_extent+0x47c/0x12e0 [ldiskfs]
[143162.092363]  [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[143162.098971]  [<ffffffff814f838c>] md_make_request+0xdc/0x230
[143162.105385]  [<ffffffff812eedc9>] generic_make_request+0x109/0x1e0
[143162.112378]  [<ffffffff812eef11>] submit_bio+0x71/0x150
[143162.118337]  [<ffffffffa09b36b2>] ? lprocfs_oh_tally+0x32/0x40 [obdclass]
[143162.126029]  [<ffffffffa10a4bcc>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
[143162.133613]  [<ffffffffa10a73c1>] osd_do_bio.isra.23+0x791/0x830 [osd_ldiskfs]
[143162.141782]  [<ffffffffa0f6011c>] ? __ldiskfs_journal_stop+0x3c/0xb0 [ldiskfs]
[143162.149951]  [<ffffffffa10a7b94>] osd_write_commit+0x344/0x940 [osd_ldiskfs]
[143162.157915]  [<ffffffffa116b774>] ofd_commitrw_write+0x1014/0x1c20 [ofd]
[143162.165496]  [<ffffffffa116ef2d>] ofd_commitrw+0x51d/0xa40 [ofd]
[143162.172345]  [<ffffffffa0c8dc22>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[143162.179479]  [<ffffffffa0c65fc1>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
[143162.186766]  [<ffffffff810ce474>] ? update_curr+0x104/0x190
[143162.193081]  [<ffffffff810c9ba8>] ? __enqueue_entity+0x78/0x80
[143162.199712]  [<ffffffffa0bbc560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[143162.207903]  [<ffffffffa0c62275>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[143162.215705]  [<ffffffffa0c0e1fb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[143162.224355]  [<ffffffffa08c4158>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[143162.232057]  [<ffffffffa0c0bdb8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[143162.239730]  [<ffffffff810c54d2>] ? default_wake_function+0x12/0x20
[143162.246821]  [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[143162.253456]  [<ffffffffa0c122b0>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[143162.260575]  [<ffffffffa0c11810>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[143162.268926]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[143162.274465]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[143162.281846]  [<ffffffff81697218>] ret_from_fork+0x58/0x90
[143162.287967]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[143162.295346] Code: c4 04 75 a6 0f 0b e8 ff 5c 08 e0 0f 0b 48 8b 55 80 48 c7 c6 c8 a2 00 a1 48 c7 c7 90 c0 00 a1 31 c0 e8 34 cc 33 e0 e9 45 fc ff ff <0f> 0b 0f 0b 48 c7 c6 35 95 00 a1 48 c7 c7 68 c0 00 a1 31 c0 e8 
[143162.317044] RIP  [<ffffffffa0fffd71>] raid5_get_active_stripe+0x711/0x750 [raid456]
[143162.325696]  RSP <ffff883912b33598>

Thanks,
Stephane
 

Comment by Stephane Thiell [ 06/Sep/17 ]

We fixed this issue with the linux-raid kernel folks. Please see http://www.spinics.net/lists/raid/msg59072.html for more details about the fix.

Feel free to close this LU.

Best,

Stephane

Comment by Peter Jones [ 06/Sep/17 ]

OK, thanks Stephane. Will this fix be included in an upcoming RHEL 7.x update?

Comment by Stephane Thiell [ 06/Sep/17 ]

Hi Peter,

I don't know yet, but it clearly should be. Usually Red Hat and SUSE developers track linux-raid changes, so we'll see. We're using CentOS 7, so I can't really open a support case for this. If you have a way to encourage the Red Hat folks to backport this patch, that would be great; it is very easy to apply since raid5.c hasn't changed much. The race is very rare unless you are using Lustre or another setup that adds some kind of latency (virtualization, for instance).

Another link to the patch is https://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git/commit/?h=for-next&id=3664847d95e60a9a943858b7800f8484669740fc
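For anyone who wants to carry it locally until a distro update ships it, applying the patch is roughly the following (the kernel source path is a placeholder):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git
cd md
git format-patch -1 3664847d95e60a9a943858b7800f8484669740fc
patch -p1 -d <path-to-kernel-source-tree> < 0001-*.patch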

Thanks!

Stephane

Comment by Peter Jones [ 06/Sep/17 ]

Stephane

We will pass along the details to SUSE/Red Hat in the hope that this fix can be included in a future release.

Peter

Comment by Stephane Thiell [ 06/Sep/17 ]

Thanks Peter, much appreciated!

Comment by Stephane Thiell [ 10/Oct/17 ]

Landed in RHEL 7.5 and should be backported in an upcoming RHEL 7.4 kernel update; see https://bugzilla.redhat.com/show_bug.cgi?id=1496836

 

It has also landed in SLES12 SP2, even more quickly; see http://kernel.suse.com/cgit/kernel-source/commit/?h=SLE12-SP2&id=5bedfa64185da4ab9cd7b1166a0097546c928234

Comment by Stephane Thiell [ 05/Feb/18 ]

RHEL 7.4: fixed in kernel-3.10.0-693.18.1.el7
