
MDS kernel panic after aborting journal

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.3
    • Component/s: None
    • Environment: CentOS 6.5, Linux 2.6.32-431.23.3.el6_lustre.x86_64
    • Severity: 3

    Description

      We're having an issue with our MDS crashing. This started after recovering from a full MDT filesystem. We've been deleting files from storage to free up metadata space, but have run into these kernel panics.

      dmesg logs have the following:

      <2>LDISKFS-fs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 0 corrupted: 57 blocks free in bitmap, 6 - in gd
      <4>
      <3>Aborting journal on device md0-8.
      <2>LDISKFS-fs error (device md0): ldiskfs_journal_start_sb: Detected aborted journal
      <2>LDISKFS-fs error (device md0) in iam_txn_add: Journal has aborted
      <2>LDISKFS-fs (md0): Remounting filesystem read-only
      <2>LDISKFS-fs (md0): Remounting filesystem read-only
      <3>LustreError: 6919:0:(osd_io.c:1173:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      <3>LustreError: 6919:0:(osd_handler.c:1054:osd_trans_stop()) Failure in transaction hook: -30
      <3>LustreError: 6919:0:(osd_handler.c:1063:osd_trans_stop()) Failure to stop transaction: -30
      <2>LDISKFS-fs error (device md0): ldiskfs_mb_new_blocks: Updating bitmap error: [err -30] [pa ffff8860350c8ba8] [phy 34992896] [logic 256] [len 256] [free 256] [error 1] [inode 1917]
      <3>LustreError: 8967:0:(osd_io.c:1166:osd_ldiskfs_write_record()) md0: error reading offset 2093056 (block 511): rc = -30
      <3>LustreError: 8967:0:(llog_osd.c:156:llog_osd_write_blob()) echo-MDT0000-osd: error writing log record: rc = -30
      <2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
      <2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
      <3>LustreError: 8967:0:(llog_cat.c:356:llog_cat_add_rec()) llog_write_rec -30: lh=ffff88601d1e4b40
      <4>
      <3>LustreError: 5801:0:(osd_handler.c:863:osd_trans_commit_cb()) transaction @0xffff882945fc28c0 commit error: 2
      <0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 11 changes, 31 in progress, 0 in flight: -5
      <0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) LBUG
      <4>Pid: 6145, comm: osp-syn-98-0
      <4>
      <4>Call Trace:
      <4> [<ffffffffa03b3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa03b3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0eff2e3>] osp_sync_thread+0x753/0x7d0 [osp]
      <4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
      <4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <3>LustreError: 6135:0:(llog.c:159:llog_cancel_rec()) echo-OST005d-osc-MDT0000: fail to write header for llog #0x5552:1#00000000: rc = -30
      <3>LustreError: 6135:0:(llog_cat.c:538:llog_cat_cancel_records()) echo-OST005d-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
      <3>LustreError: 6135:0:(osp_sync.c:721:osp_sync_process_committed()) echo-OST005d-osc-MDT0000: can't cancel record: -30
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6145, comm: osp-syn-98-0 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8152896c>] ? panic+0xa7/0x16f
      <4> [<ffffffffa03b3eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0eff2e3>] ? osp_sync_thread+0x753/0x7d0 [osp]
      <4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
      <4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Attachments

        1. dir.3040644673.bin
          4 kB
        2. inode.3040644673.bin
          4 kB
        3. vmcore-dmesg.20160607.txt
          138 kB
        4. vmcore-dmesg.20160608.txt
          157 kB


          Activity

            [LU-8252] MDS kernel panic after aborting journal

            adilger Andreas Dilger added a comment -

            There were a couple of patches landed for 2.8.0 that make ldiskfs more robust in the face of similar corruption:

            http://review.whamcloud.com/16679 "LU-1026 ldiskfs: make bitmaps corruption not fatal"
            http://review.whamcloud.com/16312 "LU-7114 ldiskfs: corrupted bitmaps handling patches"

            adilger Andreas Dilger added a comment -

            The ext2fs_warn_bitmap() function is the source of the message "Illegal inode number passed to ext2fs_test_inode_bitmap #0 for in-use inode map", but it turns out this isn't the reason e2fsck aborted, since in the first run this message appeared some time before the abort. The abort is caused by the later error "Internal error: couldn't find dir_info for 3040644673", which was one of the inodes repaired in a previous run.

            The "Internal error" message did appear for a different inode the first time, so it does seem possible that e2fsck is repairing these inodes on each run, but not as thoroughly as it should, causing an abort each run.

            ys Yang Sheng added a comment -

            Looking into the e2fsck code, it fails at:

            int ext2fs_test_inode_bitmap_range(ext2fs_inode_bitmap bitmap,
                                               ino_t inode, int num)
            {
                    EXT2_CHECK_MAGIC(bitmap, EXT2_ET_MAGIC_INODE_BITMAP);
                    if ((inode < bitmap->start) || (inode+num-1 > bitmap->real_end)) {
                            ext2fs_warn_bitmap(EXT2_ET_BAD_INODE_TEST,
                                               inode, bitmap->description);
                            return 0;
                    }
                    return ext2fs_test_clear_generic_bitmap_range((ext2fs_generic_bitmap)
                                                                  bitmap, inode, num);
            }
            
            

            So it looks like the metadata is inconsistent. It may pass after being fixed manually.


            adilger Andreas Dilger added a comment -

            Looking into this issue, it appears that the MDT filesystem is becoming full because of the use of hard-link trees for backup. This results in each file having a large link xattr that spills into an external block, as well as multiple directories referencing each file.

            Inode count:              3079569408
            Free inodes:              2166646521  == 912922887 files used
            Block count:              1539776448
            Free blocks:              41729879 == 159GB free, 5992186276 KB used == 5714 GB used, 6721 bytes used/inode
            

            It would normally not be possible to have more than about 4500 bytes used per inode, even with the external xattr block, but the extra directory trees are consuming this space.
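            For reference, the arithmetic behind those figures can be reproduced as a quick sanity check (not from the ticket; it only reuses the superblock numbers above and assumes the MDT's 4 KB ldiskfs block size):

            # values taken from the superblock summary above
            total_blocks=1539776448
            free_blocks=41729879
            total_inodes=3079569408
            free_inodes=2166646521

            used_kb=$(( (total_blocks - free_blocks) * 4 ))       # 5992186276 KB used
            echo "$(( used_kb / 1024 / 1024 )) GB used"           # ~5714 GB
            echo "$(( free_blocks * 4 / 1024 / 1024 )) GB free"   # ~159 GB
            files_used=$(( total_inodes - free_inodes ))          # 912922887 files in use
            echo "$(( used_kb * 1024 / files_used )) bytes used per inode"   # ~6721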

            A file taken at random has a link count of 10 and an external xattr block (the File ACL block):

            debugfs:  stat F_006_fx_smokep
            F_006_fx_smokep: File not found by ext2_lookup
            debugfs:  stat F_006_fx_smokeplumes_0010_comp_v001_01.nk
            Inode: 2467900774   Type: regular    Mode:  0666   Flags: 0x0
            Generation: 3837434978    Version: 0x00000014:f756e940
            User:     0   Group:     0   Size: 0
            File ACL: 1234063101    Directory ACL: 0
            Links: 10   Blockcount: 8
            Fragment:  Address: 0    Number: 0    Size: 0
             ctime: 0x57541f8e:00000000 -- Sun Jun  5 05:48:14 2016
             atime: 0x571b5195:00000000 -- Sat Apr 23 03:42:29 2016
             mtime: 0x539eb883:00000000 -- Mon Jun 16 02:27:31 2014
            crtime: 0x571b5195:7f0d72f8 -- Sat Apr 23 03:42:29 2016
            Size of extra inode fields: 28
            Extended attributes stored in inode body:
              lma = "00 00 00 00 00 00 00 00 40 66 00 00 02 00 00 00 81 b6 01 00 00 00 00 00
             " (24)
              lma: fid=[0x200006640:0x1b681:0x0] compat=0 incompat=0
              lov = "d0 0b d1 0b 01 00 00 00 81 b6 01 00 00 00 00 00 40 66 00 00 02 00 00 00
             00 00 10 00 01 00 00 00 6c 7c 0e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
            0 00 56 00 00 00 " (56)
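
            (For background on the "hard-link trees for backup" pattern mentioned above: each snapshot directory hard-links unchanged files against the previous snapshot, so a file that survives ten snapshots ends up with ten parent directories, a link count of 10, and ten entries in its Lustre link xattr. A minimal illustration, assuming rsync-style snapshots; the paths and file name below are hypothetical:)

            # first snapshot: full copy
            rsync -a /scratch/project/ /scratch/backup/2016-06-01/
            # later snapshots: unchanged files are hard-linked from the previous snapshot
            rsync -a --link-dest=/scratch/backup/2016-06-01 /scratch/project/ /scratch/backup/2016-06-02/
            # each extra snapshot adds one hard link (and one link-xattr entry) per unchanged file
            stat -c '%h hard links: %n' /scratch/backup/2016-06-02/some_file.nk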
            

            As for the e2fsck problem, I haven't been able to debug it yet because the problem takes several hours to hit. In the meantime, I've fixed the (first) problem that was causing the MDT to be remounted read-only:

            Jun  8 00:56:41 emds1 kernel: LDISKFS-fs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 0 corrupted: 57 blocks free in bitmap, 6 in gd
            Jun  8 00:56:41 emds1 kernel: 
            Jun  8 00:56:41 emds1 kernel: Aborting journal on device md0-8.
            Jun  8 00:56:41 emds1 kernel: LDISKFS-fs (md0): Remounting filesystem read-only
            

            I manually marked all of the blocks in group 0 used, and changed the group summary to match, as well as recomputed the block group checksum, which is the safest workaround given that I don't know which block(s) are actually in use, or which of those values is correct. It isn't clear if there are more errors like this, but I verified the next few groups had consistent block counts in the bitmap and group descriptors.
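
            (A rough sketch of the kind of debugfs session that fix corresponds to. The exact commands and field names here are assumptions rather than a record of what was actually run, and this should only ever be attempted on an unmounted device:)

            debugfs -w /dev/md0
            debugfs:  setb 0 32768                  # mark every block of group 0 in use (32768 blocks/group at 4 KB)
            debugfs:  set_bg 0 free_blocks_count 0  # make the group descriptor summary match the bitmap
            debugfs:  set_bg 0 checksum calc        # recompute the block group checksum
            debugfs:  quit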

            There is an e2fsck read-only check running under GDB to hopefully be able to debug the problem, but it will take about 8h to hit the point of the prior corruption.
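
            (A minimal sketch of such a GDB run, assuming the abort goes through e2fsck's fatal_error() helper and that e2fsck was built with debug symbols:)

            gdb --args e2fsck -fn -C 0 /dev/md0
            (gdb) break fatal_error
            (gdb) run
            # ...several hours later, once the "Internal error" abort is reached:
            (gdb) bt full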

            pjones Peter Jones added a comment -

            Stephen

            I've sent you an email to get this going

            Peter


            sfw Stephen Willey (Inactive) added a comment -

            We can get that in place. We're actually in Vancouver, so Pacific hours. I'll set up SSH access and get back to you. Do you have an incoming IP I can restrict this to?

            Please mail me on sfw@dneg.com - Thanks.

            pjones Peter Jones added a comment -

            Stephen

            How long would it take to get remote access in place? I know that you are based in the UK so what hours could a contact on site be available to work in realtime with one of our engineers?

            Peter


            sfw Stephen Willey (Inactive) added a comment -

            We're quickly reaching the point where we'll have to consider clearing everything and starting from scratch, which is something we'd really rather not do. We'd appreciate any other options you can present. If remote access would be useful, we can provide that.

            If clearing things is the only real option here, can we provide any extra info to determine why this might have happened? As far as we can tell, the only thing that happened was the MDT filling up.


            sfw Stephen Willey (Inactive) added a comment -

            The e2fsck seemed to stop at the same place again:

            emds1 /root # e2fsck -fvy -C 0 /dev/md0
            e2fsck 1.42.13.wc5 (15-Apr-2016)
            echo-MDT0000: recovering journal
            Pass 1: Checking inodes, blocks, and sizes
            Pass 2: Checking directory structure
            Illegal inode number passed to ext2fs_test_inode_bitmap #0 for in-use inode map
            Internal error: couldn't find dir_info for 3040644673.
            e2fsck: aborted

            Any suggestions...?


            adilger Andreas Dilger added a comment -

            It looks like you will need to run "e2fsck -fy" again to repair the corrupted directory. I'm not sure why e2fsck aborted; I haven't seen a problem like that before.

            pjones Peter Jones added a comment -

            Yang Sheng

            Could you please assist with this issue?

            Thanks

            Peter


            People

              Assignee: ys Yang Sheng
              Reporter: cyb Cory Brassington (Inactive)
              Votes: 0
              Watchers: 8
