LU-8917

OSS crash with kernel BUG at drivers/md/raid5.c:529!


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Environment: Lustre 2.8.60, kernel-3.10.0-327.3.1.el7_lustre.x86_64
    • Severity: 3

    Description

      Hi,

      During some relatively heavy IOR benchmarks on a system that is not yet in production, we hit an OSS crash in a Lustre thread (ll_ost_io01_012). So far we have not been able to reproduce the issue, but we wanted to report it in case somebody has some hints.

      We're using Lustre servers based on community Lustre 2.9 (tag 2.8.60) + kernel-3.10.0-327.3.1.el7_lustre.x86_64 (CentOS 7.2) and our clients are IEEL 3.0.0.0 (Lustre 2.7) connected through a single Lustre IB FDR/FDR router.

      The impacted server uses three 10-disk md RAID6 arrays as OST backends.
       

      [495232.536334] kernel BUG at drivers/md/raid5.c:529!
      [495232.541676] invalid opcode: 0000 [#1] SMP
      
      
      
      
      

      The stack is:

      crash> bt 103442
      PID: 103442  TASK: ffff883e0c564500  CPU: 23  COMMAND: "ll_ost_io01_012"
       #0 [ffff883e099f3268] machine_kexec at ffffffff81051beb
       #1 [ffff883e099f32c8] crash_kexec at ffffffff810f2542
       #2 [ffff883e099f3398] oops_end at ffffffff8163e368
       #3 [ffff883e099f33c0] die at ffffffff8101859b
       #4 [ffff883e099f33f0] do_trap at ffffffff8163da20
       #5 [ffff883e099f3440] do_invalid_op at ffffffff81015204
       #6 [ffff883e099f34f0] invalid_op at ffffffff8164721e
          [exception RIP: get_active_stripe+1838]
          RIP: ffffffffa0f2721e  RSP: ffff883e099f35a8  RFLAGS: 00010086
          RAX: 0000000000000000  RBX: ffff883fa118e820  RCX: dead000000200200
          RDX: 0000000000000000  RSI: 0000000000000006  RDI: ffff883f40875a30
          RBP: ffff883e099f3650   R8: ffff883f40875a40   R9: 0000000000000080
          R10: 0000000000000007  R11: 0000000000000000  R12: ffff883fa118e800
          R13: ffff883f40875a30  R14: ffff883fa118e800  R15: 00000000aff93430
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffff883e099f35a0] get_active_stripe at ffffffffa0f26e2a [raid456]
       #8 [ffff883e099f3658] make_request at ffffffffa0f2cbd5 [raid456]
       #9 [ffff883e099f3748] md_make_request at ffffffff814b7af5
      #10 [ffff883e099f37a8] generic_make_request at ffffffff812c73e2
      #11 [ffff883e099f37e0] submit_bio at ffffffff812c74a1
      #12 [ffff883e099f3838] osd_submit_bio at ffffffffa0eb89ac [osd_ldiskfs]
      #13 [ffff883e099f3848] osd_do_bio at ffffffffa0ebade7 [osd_ldiskfs]
      #14 [ffff883e099f3968] osd_write_commit at ffffffffa0ebb974 [osd_ldiskfs]
      #15 [ffff883e099f3a08] ofd_commitrw_write at ffffffffa120d774 [ofd]
      #16 [ffff883e099f3a80] ofd_commitrw at ffffffffa1210f2d [ofd]
      #17 [ffff883e099f3b08] obd_commitrw at ffffffffa10175a5 [ptlrpc]
      #18 [ffff883e099f3b70] tgt_brw_write at ffffffffa0feff81 [ptlrpc]
      #19 [ffff883e099f3cd8] tgt_request_handle at ffffffffa0fec235 [ptlrpc]
      #20 [ffff883e099f3d20] ptlrpc_server_handle_request at ffffffffa0f981bb [ptlrpc]
      #21 [ffff883e099f3de8] ptlrpc_main at ffffffffa0f9c270 [ptlrpc]
      #22 [ffff883e099f3ec8] kthread at ffffffff810a5aef
      #23 [ffff883e099f3f50] ret_from_fork at ffffffff81645a58
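
One detail in the register dump above may be worth a second look: RCX holds dead000000200200. On x86_64 kernels of this generation that value should correspond to LIST_POISON2, the marker that list_del() writes into the prev pointer of a removed list entry. The following is a minimal sketch for classifying such values when reading a dump. The constants are assumed to match include/linux/poison.h in a 3.10-based kernel with CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000, and list_poison_kind() is a hypothetical helper, not a kernel or crash-utility function.

```c
#include <stdint.h>

/* Assumed values for a 3.10-based x86_64 kernel (see include/linux/poison.h).
 * Newer kernels use different offsets, so check the source tree in use. */
#define POISON_POINTER_DELTA_X86_64 UINT64_C(0xdead000000000000)
#define LIST_POISON1_3_10 (UINT64_C(0x00100100) + POISON_POINTER_DELTA_X86_64)
#define LIST_POISON2_3_10 (UINT64_C(0x00200200) + POISON_POINTER_DELTA_X86_64)

/* Hypothetical helper: returns 1 if reg equals LIST_POISON1 (written to
 * ->next by list_del()), 2 if it equals LIST_POISON2 (written to ->prev),
 * and 0 for any other value. */
static int list_poison_kind(uint64_t reg)
{
    if (reg == LIST_POISON1_3_10)
        return 1;
    if (reg == LIST_POISON2_3_10)
        return 2;
    return 0;
}
```

Under these assumptions, the RCX value is classified as LIST_POISON2, while RBX and R12 look like ordinary kernel pointers. This only shows that a poisoned list pointer was loaded into a register at some point before the trap; it may be a leftover and does not by itself identify which list entry was involved.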
      
      
      
      
      
      

      The BUG is triggered at line 529 of raid5.c, BUG_ON(sh->batch_head), in init_stripe():

      crash> l drivers/md/raid5.c:526
      521     static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
      522     {
      523             struct r5conf *conf = sh->raid_conf;
      524             int i, seq;
      525     
      526             BUG_ON(atomic_read(&sh->count) != 0);
      527             BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
      528             BUG_ON(stripe_operations_active(sh));
      529             BUG_ON(sh->batch_head);
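
For readers less familiar with this code path, the four checks in the listing above assert that a stripe_head being re-initialised is completely idle: no references, not queued for handling, no operations in flight, and not attached to a stripe batch. The crash at line 529 means the last of these did not hold, that is, sh->batch_head was still non-NULL. The sketch below is a user-space model of that invariant only. struct stripe_model and first_failed_check() are hypothetical names introduced for illustration, and the three integer fields are simplified stand-ins for the kernel's atomic counter, state bit, and stripe_operations_active() result.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct stripe_head.
 * Only what the four BUG_ON() checks look at is modelled. */
struct stripe_model {
    int count;                       /* stands in for atomic_t count */
    int handle_pending;              /* stands in for the STRIPE_HANDLE bit */
    int ops_active;                  /* stands in for stripe_operations_active() */
    struct stripe_model *batch_head; /* non-NULL while part of a batch */
};

/* Returns the raid5.c line number (526..529, as in the listing) of the
 * first check that would fire, or 0 if the stripe is safe to reuse. */
static int first_failed_check(const struct stripe_model *sh)
{
    if (sh->count != 0)
        return 526;
    if (sh->handle_pending)
        return 527;
    if (sh->ops_active)
        return 528;
    if (sh->batch_head != NULL)
        return 529;
    return 0;
}
```

In this model, a stripe with all fields zero passes, and setting only batch_head reproduces the condition reported here (529). Why the real stripe still had a batch head when it was picked for reuse is not something the model explains.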
      
      
      
      
      

      Please note that we do have a crash dump of the OSS available.

      Best,

      Stephane

          People

            Assignee: Jinshan Xiong (jay, Inactive)
            Reporter: Stephane Thiell (sthiell)