Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.5.3
    • None
    • 2
    • 9223372036854775807

    Description

      On Tuesday, July 26th, the RAID controller firmware was updated, causing two OSSes to drop into KDB due to missing paths. Since then, we have had a spate of crashes on the file system:

      Tue Jul 26 - 394 - 2028 KDB (I/O)

      • 395 - 2000 KDB (I/O)
        Fri jUl 29 - s393 - 0910 KDB (aborting journal)
        Fri Jul 29 - s394 - 1922 KDB (disk)
        Fri Jul 29 - s393 - 2123 KDB (disk)

      s394 Tue Jul 26 - 2028 :

      Buffer I/O error on device dm-21, logical block 137126919
      lost page write due to I/O error on dm-21
      end_request: I/O error, dev dm-21, sector 1073743416
      LDISKFS-fs error (device dm-21): ldiskfs_read_block_bitmap: Cannot read block bitmap - block_group = 4295, block_bitmap = 134217927
      Entering kdb (current=0xffff880360d92ab0, pid 63652) on processor 0 Oops: (null)
      due to oops @ 0x0
      kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
      [0]kdb> [-- MARK -- Tue Jul 26 18:00:00 2016]
      

      s395 Tue Jul 26 - 2000 :

      end_request: I/O error, dev dm-18, sector 759488504
      end_request: I/O error, dev dm-18, sector 759488504
      end_request: I/O error, dev dm-18, sector 295072
      LustreError: 63719:0:(osd_io.c:1179:osd_ldiskfs_write_record()) dm-18: error reading offset 1342464 (block 327): rc = -5
      LustreError: 63719:0:(osd_handler.c:1051:osd_trans_stop()) Failure in transaction hook: -5
      ------------[ cut here ]------------
      WARNING: at fs/buffer.c:1186 mark_buffer_dirty+0x82/0xa0() (Tainted: G        W  ---------------   )
      Hardware name: C1104G-RP5
      Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) sunrpc dm_round_robin bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_srp(U) scsi_transport_srp(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_multipath microcode acpi_pad iTCO_wdt iTCO_vendor_support igb hwmon dca i2c_algo_bit ptp pps_core i2c_i801 i2c_core sg lpc_ich mfd_core tcp_bic ext3 jbd sd_mod crc_t10dif ahci isci libsas scsi_transport_sas wmi mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) memtrack(U) dm_mirror dm_region_hash dm_log dm_mod gru [last unloaded: scsi_wait_scan]
      Pid: 36266, comm: jbd2/dm-6 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
      Call Trace:
       [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
       [<ffffffff8107417a>] ? warn_slowpath_null+0x1a/0x20
       [<ffffffff811c1ca2>] ? mark_buffer_dirty+0x82/0xa0
       [<ffffffffa0b6b385>] ? __jbd2_journal_temp_unlink_buffer+0xa5/0x140 [jbd2]
       [<ffffffffa0b6b436>] ? __jbd2_journal_unfile_buffer+0x16/0x30 [jbd2]
       [<ffffffffa0b6b798>] ? __jbd2_journal_refile_buffer+0xc8/0xf0 [jbd2]
       [<ffffffffa0b6e548>] ? jbd2_journal_commit_transaction+0xdc8/0x15a0 [jbd2]
       [<ffffffff810873eb>] ? try_to_del_timer_sync+0x7b/0xe0
       [<ffffffffa0b73c48>] ? kjournald2+0xb8/0x220 [jbd2]
       [<ffffffff8109e120>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0b73b90>] ? kjournald2+0x0/0x220 [jbd2]
       [<ffffffff8109dc8e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c28a>] ? child_rip+0xa/0x20
       [<ffffffff8109dbf0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      ---[ end trace 8870bed99a68f953 ]---
      srp_daemon[3088]: No response to inform info registration
      srp_daemon[3088]: Fail to register to traps, maybe there is no opensm running on fabric or IB port is down
      srp_daemon[3088]: bad MAD status (110) from lid 1
      scsi host7: ib_srp: Got failed path rec status -110
      scsi host7: ib_srp: Path record query failed
      scsi host7: reconnect attempt 4 failed (-110)
      ------------[ cut here ]------------
      kernel BUG at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/mballoc.c:3189!
      
      Entering kdb (current=0xffff881ae3470040, pid 63719) on processor 9 due to KDB_ENTER()
      [9]kdb> [-- MARK -- Tue Jul 26 18:00:00 2016]
      

      s393 Fri Jul 29 - 0910:

      LDISKFS-fs error (device dm-18): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff8801ce41bf28] [phy 1437833216] [logic 334848] [len 1024] [free 542] [error 0] [inode 207111] [freed 1024]
      Aborting journal on device dm-4.
      Kernel panic - not syncing: LDISKFS-fs (device dm-18): panic forced after error
      
      Pid: 165, comm: kswapd1 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
      Call Trace:
       [<ffffffff81564fb9>] ? panic+0xa7/0x190
       [<ffffffffa0bc5118>] ? ldiskfs_commit_super+0x188/0x210 [ldiskfs]
       [<ffffffffa0bc5724>] ? ldiskfs_handle_error+0xc4/0xd0 [ldiskfs]
       [<ffffffffa0bc5ac2>] ? __ldiskfs_error+0x82/0x90 [ldiskfs]
       [<ffffffffa0ba920f>] ? ldiskfs_mb_release_inode_pa+0x26f/0x360 [ldiskfs]
      
      Entering kdb (current=0xffff881065789520, pid 165) on processor 8 Oops: (null)
      due to oops @ 0x0
      kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
      [8]kdb> [-- Fri Jul 29 08:55:12 2016]
      

      s394 Fri Jul 29 - 1922:

      Buffer I/O error on device dm-21, logical block 137126919
      lost page write due to I/O error on dm-21
      end_request: I/O error, dev dm-21, sector 1073743416
      LDISKFS-fs error (device dm-21): ldiskfs_read_block_bitmap: Cannot read block bitmap - block_group = 4295, block_bitmap = 134217927
      Entering kdb (current=0xffff880360d92ab0, pid 63652) on processor 0 Oops: (null)
      due to oops @ 0x0
      kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
      [0]kdb> [-- MARK -- Tue Jul 26 18:00:00 2016]
      

      s393 Fri Jul 29 -2123:

      [-- MARK -- Fri Jul 29 18:00:00 2016]
      LDISKFS-fs error (device dm-27): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 4891corrupted: 21438 blocks free in bitmap, 25535 - in gd
      
      Aborting journal on device dm-2.
      LDISKFS-fs error (device dm-27): ldiskfs_journal_start_sb:
      Kernel panic - not syncing: LDISKFS-fs (device dm-27): panic forced after error
      
      Pid: 95204, comm: ll_ost_io01_021 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
      Call Trace:
       [<ffffffff81564fb9>] ? panic+0xa7/0x190
       [<ffffffffa0eb0118>] ? ldiskfs_commit_super+0x188/0x210 [ldiskf s ]14
       out of 16 cpus in kdb, waiting for the rest, timeout in 10 second(s)
       [<ffffffffa0eb0724>] ? ldiskfs_handle_error+0xc4/0xd0 [ldiskfs]
       [<ffffffffa0eb0ac2>] ? __ldiskfs_error+0x82/0x90 [ldiskfs]
       [<ffffffff810f2059>] ? delayacct_end+0x89/0xa0
       [<ffffffffa0e92a29>] ? ldiskfs_mb_check_ondisk_bitmap+0x149/0x150 [ldiskfs]
       [<ffffffffa0e92aad>] ? ldiskfs_mb_generate_from_pa+0x7d/0x180 [ldiskfs]
       [<ffffffff8109e1a0>] ? wake_bit_function+0x0/0x50
       [<ffffffffa0e9528b>] ? ldiskfs_mb_init_cache+0x55b/0xa30 [ldiskfs]
       [<ffffffffa0e9587e>] ? ldiskfs_mb_init_group+0x11e/0x210 [ldiskfs]
       [<ffffffffa0e95dd5>] ? ldiskfs_mb_load_buddy+0x355/0x390 [ldiskfs]
       [<ffffffffa0e7c00b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
       [<ffffffffa0e96b7d>] ? ldiskfs_mb_find_by_goal+0x6d/0x2e0 [ldiskfs]
       [<ffffffffa0e97019>] ? ldiskfs_mb_regular_allocator+0x59/0x410 [ldiskfs]
       [<ffffffffa0eb12e8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
       [<ffffffffa0e91872>] ? ldiskfs_mb_normalize_request+0x2c2/0x3d0 [ldiskfs]
       [<ffffffffa0e9923d>] ? ldiskfs_mb_new_blocks+0x47d/0x630 [ldiskfs]
       [<ffffffffa0eb1390>] ? ldiskfs_journal_start_sb+0x70/0x170 [ldiskfs]
       [<ffffffffa02369ff>] ? ldiskfs_ext_new_extent_cb+0x57f/0x6cc [fsfilt_ldiskfs]
       [<ffffffffa0e7ecd2>] ? ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
       [<ffffffffa0236480>] ? ldiskfs_ext_new_extent_cb+0x0/0x6cc [fsfilt_ldiskfs]
       [<ffffffffa02361cc>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
       [<ffffffffa02362f0>] ? fsfilt_ldiskfs_map_ext_inode_pages+0x100/0x200 [fsfilt_ldiskfs]
       [<ffffffffa0236475>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
       [<ffffffffa0eb1884>] ? ldiskfs_dquot_initialize+0x94/0xd0 [ldiskfs]
       [<ffffffffa0f89342>] ? osd_write_commit+0x302/0x620 [osd_ldiskfs]
       [<ffffffffa137c094>] ? ofd_commitrw_write+0x684/0x11b0 [ofd]
       [<ffffffffa137edfd>] ? ofd_commitrw+0x5cd/0xbb0 [ofd]
       [<ffffffffa0372861>] ? lprocfs_counter_add+0x151/0x1d6 [lvfs]
       [<ffffffffa131319d>] ? obd_commitrw+0x11d/0x390 [ost]
       [<ffffffffa131cf71>] ? ost_brw_write+0xea1/0x15d0 [ost]
       [<ffffffff812b9076>] ? vsnprintf+0x336/0x5e0
       [<ffffffffa0686500>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa13235cc>] ? ost_handle+0x439c/0x44d0 [ost]
       [<ffffffffa03f8a44>] ? libcfs_id2str+0x74/0xb0 [libcfs]
       [<ffffffffa06d60c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
       [<ffffffffa03fe8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
       [<ffffffffa06cea69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
       [<ffffffffa06d889d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
       [<ffffffff8100c28a>] ? child_rip+0xa/0x20
       [<ffffffffa06d7da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      LDISKFS-fs error (device dm-27): ldiskfs_journal_start_sb: Detected aborted journal [<ffffffff8100c280>] ? child_rip+0x0/0x20
      
      .  15 out of 16 cpus in kdb, waiting for the rest, timeout in 9 second(s)
      ..1 cpu is not in kdb, its state is unknown
      
      Entering kdb (current=0xffff880e989cc040, pid 95204) on processor 7 Oops: (null)
      due to oops @ 0x0
      kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
      [7]kdb> [-- MARK -- Fri Jul 29 19:00:00 2016]
      

      Attachments

        Issue Links

          Activity

            [LU-8462] OSS keeps dropping into KDB

            http://review.whamcloud.com/#/c/21705/ has been landed to b2_7_fe.

            yong.fan nasf (Inactive) added a comment - http://review.whamcloud.com/#/c/21705/ has been landed to b2_7_fe.

            We have not tried the patch at NASA yet, were waiting on completed review and testing and landing. Since we fixed the on-disk corruption with e2fsck, I don't believe we have a test case for it anymore anyway. As you described it seems like it will be sufficient though, so please proceed with getting it landed in the 2.7 FE branch and we will pull it into a future build. Thanks!

            ndauchy Nathan Dauchy (Inactive) added a comment - We have not tried the patch at NASA yet, were waiting on completed review and testing and landing. Since we fixed the on-disk corruption with e2fsck, I don't believe we have a test case for it anymore anyway. As you described it seems like it will be sufficient though, so please proceed with getting it landed in the 2.7 FE branch and we will pull it into a future build. Thanks!

            I mean that have you tried the patch? If yes, any results? Besides, any others you want us to do?

            yong.fan nasf (Inactive) added a comment - I mean that have you tried the patch? If yes, any results? Besides, any others you want us to do?

            What kind of feedback were you looking for?

            We did run e2fsck on the related servers and it corrected some errors.

            It looks like the backport patch requires review and landing to the 2.7 FE branch.

            ndauchy Nathan Dauchy (Inactive) added a comment - What kind of feedback were you looking for? We did run e2fsck on the related servers and it corrected some errors. It looks like the backport patch requires review and landing to the 2.7 FE branch.

            Any feedback? Thanks!

            yong.fan nasf (Inactive) added a comment - Any feedback? Thanks!

            With the patch applied, there may be still some message. But the message will be warning (#define KERN_WARNING "<4>"), not error (#define KERN_CRIT "<2>").

            yong.fan nasf (Inactive) added a comment - With the patch applied, there may be still some message. But the message will be warning (#define KERN_WARNING "<4>"), not error (#define KERN_CRIT "<2>").

            With the patch, will we still get error messages/warnings like below? Want to create a SEC rule to notify us if it happens again.

            LDISKFS-fs error (device dm-27): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 4891corrupted: 21438 blocks free in bitmap, 25535 - in gd

            hyeung Herbert Yeung added a comment - With the patch, will we still get error messages/warnings like below? Want to create a SEC rule to notify us if it happens again. LDISKFS-fs error (device dm-27): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 4891corrupted: 21438 blocks free in bitmap, 25535 - in gd
            yong.fan nasf (Inactive) added a comment - - edited

            Originally, the case of the bitmap corruption is fatal, just like this ticket described, the system becomes unavailable. The patch from LU-7114 is NOT to fix bitmap corruption, instead, it marks the ground(s) with bad bitmap(s) and skip errors, then the system is still usable under such trouble case. I have back ported the patch to b2_7_fe (http://review.whamcloud.com/21705), but it just make the system to be available, the customer still needs to run e2fsck to fix the corrupted bitmap(s).

            yong.fan nasf (Inactive) added a comment - - edited Originally, the case of the bitmap corruption is fatal, just like this ticket described, the system becomes unavailable. The patch from LU-7114 is NOT to fix bitmap corruption, instead, it marks the ground(s) with bad bitmap(s) and skip errors, then the system is still usable under such trouble case. I have back ported the patch to b2_7_fe ( http://review.whamcloud.com/21705 ), but it just make the system to be available, the customer still needs to run e2fsck to fix the corrupted bitmap(s).
            pjones Peter Jones added a comment -

            Fan Yong

            Is the fix from LU-7114 sufficient to address this issue? If so could you please port it to the 2.7 FE branch?

            Thanks

            Peter

            pjones Peter Jones added a comment - Fan Yong Is the fix from LU-7114 sufficient to address this issue? If so could you please port it to the 2.7 FE branch? Thanks Peter
            green Oleg Drokin added a comment -

            From the messages it looks like the bitmaps got corrupted in your case.
            Patches in bug LU-7114 should convert that from a fatal even into a non-fatal. Recent e2fsck should also fix that I believe.

            green Oleg Drokin added a comment - From the messages it looks like the bitmaps got corrupted in your case. Patches in bug LU-7114 should convert that from a fatal even into a non-fatal. Recent e2fsck should also fix that I believe.

            People

              yong.fan nasf (Inactive)
              hyeung Herbert Yeung
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: