
MDS kernel panic after aborting journal

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Environment: CentOS 6.5,
      Linux 2.6.32-431.23.3.el6_lustre.x86_64
    • Severity: 3

    Description

      We're having an issue with our MDS crashing. This started after recovering from a full MDT filesystem: we've been deleting files to free up metadata space, but have run into these kernel panics.
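
      For reference, free metadata space is typically monitored per target with lfs; a minimal sketch (the mount point below is illustrative, not taken from this report):

      # Inode usage per target: a full MDT shows IUse% near 100% on the MDT line.
      lfs df -i /mnt/lustre
      # Block usage per target, for comparison.
      lfs df -h /mnt/lustre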

      dmesg logs have the following:

      <2>LDISKFS-fs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 0 corrupted: 57 blocks free in bitmap, 6 - in gd
      <4>
      <3>Aborting journal on device md0-8.
      <2>LDISKFS-fs error (device md0): ldiskfs_journal_start_sb: Detected aborted journal
      <2>LDISKFS-fs error (device md0) in iam_txn_add: Journal has aborted
      <2>LDISKFS-fs (md0): Remounting filesystem read-only
      <2>LDISKFS-fs (md0): Remounting filesystem read-only
      <3>LustreError: 6919:0:(osd_io.c:1173:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      <3>LustreError: 6919:0:(osd_handler.c:1054:osd_trans_stop()) Failure in transaction hook: -30
      <3>LustreError: 6919:0:(osd_handler.c:1063:osd_trans_stop()) Failure to stop transaction: -30
      <2>LDISKFS-fs error (device md0): ldiskfs_mb_new_blocks: Updating bitmap error: [err -30] [pa ffff8860350c8ba8] [phy 34992896] [logic 256] [len 256] [free 256] [error 1] [inode 1917]
      <3>LustreError: 8967:0:(osd_io.c:1166:osd_ldiskfs_write_record()) md0: error reading offset 2093056 (block 511): rc = -30
      <3>LustreError: 8967:0:(llog_osd.c:156:llog_osd_write_blob()) echo-MDT0000-osd: error writing log record: rc = -30
      <2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
      <2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
      <3>LustreError: 8967:0:(llog_cat.c:356:llog_cat_add_rec()) llog_write_rec -30: lh=ffff88601d1e4b40
      <4>
      <3>LustreError: 5801:0:(osd_handler.c:863:osd_trans_commit_cb()) transaction @0xffff882945fc28c0 commit error: 2
      <0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 11 changes, 31 in progress, 0 in flight: -5
      <0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) LBUG
      <4>Pid: 6145, comm: osp-syn-98-0
      <4>
      <4>Call Trace:
      <4> [<ffffffffa03b3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa03b3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0eff2e3>] osp_sync_thread+0x753/0x7d0 [osp]
      <4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
      <4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <3>LustreError: 6135:0:(llog.c:159:llog_cancel_rec()) echo-OST005d-osc-MDT0000: fail to write header for llog #0x5552:1#00000000: rc = -30
      <3>LustreError: 6135:0:(llog_cat.c:538:llog_cat_cancel_records()) echo-OST005d-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
      <3>LustreError: 6135:0:(osp_sync.c:721:osp_sync_process_committed()) echo-OST005d-osc-MDT0000: can't cancel record: -30
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6145, comm: osp-syn-98-0 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8152896c>] ? panic+0xa7/0x16f
      <4> [<ffffffffa03b3eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0eff2e3>] ? osp_sync_thread+0x753/0x7d0 [osp]
      <4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
      <4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Attachments

        1. dir.3040644673.bin (4 kB, Stephen Willey)
        2. inode.3040644673.bin (4 kB, Stephen Willey)
        3. vmcore-dmesg.20160607.txt (138 kB, Cory Brassington)
        4. vmcore-dmesg.20160608.txt (157 kB, Cory Brassington)


          Activity

            [LU-8252] MDS kernel panic after aborting journal

            daire Daire Byrne (Inactive) added a comment -

            Well, I was toying with the idea of using a 3 x NVMe (6.4TB) RAID5 for the MDT... We need lots of inodes and some redundancy. Maybe I should just use 4 cards and RAID10.

            adilger Andreas Dilger added a comment -

            The LBUG was fixed as part of LU-6696.

            As for Lustre with a 4KB RAID chunk size, I don't think it will really affect MDT performance much, since the MDT is (presumably) configured as RAID-1+0 and not RAID-6.

            daire Daire Byrne (Inactive) added a comment -

            Okay, thanks. We'll give that a try on our next filesystem. I wonder how that will interact with a 4k RAID stripe too? But then we are using NVMe to back MDTs now, so I doubt there is much RAID overhead any more.

            adilger Andreas Dilger added a comment -

            Daire, if you have a large number of links per file, then increasing the amount of space per inode would improve your space efficiency somewhat. Something like --mkfsoptions="-i 4608" would provide room for one external xattr block per file.
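
            For illustration, a minimal sketch of how that option might be passed when formatting a new MDT (the device path, fsname, and index below are hypothetical placeholders, not from this ticket):

            # Format a combined MGS/MDT, reserving 4608 bytes of space per inode
            # so each file has room for one external xattr block.
            mkfs.lustre --fsname=testfs --mgs --mdt --index=0 \
                --mkfsoptions="-i 4608" /dev/sda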

            daire Daire Byrne (Inactive) added a comment -

            So it seems like this was eventually resolved (we have not had further issues), but I do have a related question.

            If we were to build this filesystem again, what would be the optimal inode size/ratio for performance considering we are likely to have 10+ hard links per file? Or does it not really matter and just ensuring we have more space available would be sufficient?
            adilger Andreas Dilger added a comment - edited

            I tried to create a two-level subdirectory test/test2, then zeroed out the leaf block of test so that there was no entry pointing to test2, and also set the ".." entry of test2 to <2>, but this didn't show anything different. Another test with just changing ".." of the test3 subdirectory to <2> didn't cause the problem.

            e2fsck 1.42.13.wc5 (15-Apr-2016)
            Pass 1: Checking inodes, blocks, and sizes
            Pass 2: Checking directory structure
            Directory inode 12, block #0, offset 0: directory corrupted
            Salvage<y>? yes
            Missing '.' in directory inode 12.
            Fix<y>? yes
            Setting filetype for entry '.' in ??? (12) to 2.
            Missing '..' in directory inode 12.
            Fix<y>? yes
            Setting filetype for entry '..' in ??? (12) to 2.
            Pass 3: Checking directory connectivity
            '..' in /test (12) is <The NULL inode> (0), should be / (2).
            Fix<y>? yes
            Unconnected directory inode 13 (/test/???)
            Connect to /lost+found<y>? yes
            '..' in /test2/test3 (15) is / (2), should be /test2 (14).
            Fix<y>? yes
            

            I guess it is worthwhile to look into how directories are inserted into the parent directory list that e2fsck_dir_info_get_parent() searches. Something is preventing inode 3040644673 from being found in the parent directory list.

            bobijam Zhenyu Xu added a comment - edited

            I've tried to create a /mnt/lustre/dir1/dir2 which is disconnected, and e2fsck reports this:

            # e2fsck -fvy /dev/sdb
            e2fsck 1.42.3.wc3 (15-Aug-2012)
            Pass 1: Checking inodes, blocks, and sizes
            Pass 2: Checking directory structure
            Pass 3: Checking directory connectivity
            Unconnected directory inode 25047 (/???)
            Connect to /lost+found? yes
            
            Pass 4: Checking reference counts
            Inode 2 ref count is 12, should be 13.  Fix? yes
            
            Inode 25047 ref count is 3, should be 2.  Fix? yes
            
            Pass 5: Checking group summary information
            
            lustre-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
            
                 246 inodes used (0.25%)
                   3 non-contiguous files (1.2%)
                   0 non-contiguous directories (0.0%)
                     # of inodes with ind/dind/tind blocks: 0/0/0
               17038 blocks used (34.08%)
                   0 bad blocks
                   1 large file
            
                 119 regular files
                 118 directories
                   0 character device files
                   0 block device files
                   0 fifos
            4294967295 links
                   0 symbolic links (0 fast symbolic links)
                   0 sockets
            --------
                 236 files
            

            It does not complain about anything in pass 2.

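
            For anyone reproducing this, one hedged way to fabricate such a disconnected directory on a scratch ldiskfs/ext4 image (the device and paths here are hypothetical) is to drop the parent's entry with debugfs, whose unlink command removes a directory entry without touching the child inode:

            mkdir -p /mnt/test/dir1/dir2
            umount /mnt/test
            # Remove dir1's entry for dir2; dir2's inode is left unconnected.
            debugfs -w -R "unlink /dir1/dir2" /dev/sdb
            e2fsck -fvy /dev/sdb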

            adilger Andreas Dilger added a comment -

            The problematic directory inode is 3040644673 as reported by e2fsck, and it hasn't been fixed after several runs:

            debugfs:  stat <3040644673>
            Inode: 3040644673   Type: directory    Mode:  0777   Flags: 0x0
            Generation: 2336434627    Version: 0x00000012:75b668cd
            User:   518   Group:    20   Size: 4096
            File ACL: 0    Directory ACL: 0
            Links: 2   Blockcount: 8
            Fragment:  Address: 0    Number: 0    Size: 0
             ctime: 0x57421268:00000000 -- Sun May 22 13:11:20 2016
             atime: 0x57421268:00000000 -- Sun May 22 13:11:20 2016
             mtime: 0x56b2143f:00000000 -- Wed Feb  3 06:52:47 2016
            crtime: 0x57421268:888eaf54 -- Sun May 22 13:11:20 2016
            Size of extra inode fields: 28
            Extended attributes stored in inode body:
              lma = "00 00 00 00 00 00 00 00 b5 83 00 00 02 00 00 00 df a8 00 00 00 00 00 00
             " (24)
              lma: fid=[0x2000083b5:0xa8df:0x0] compat=0 incompat=0
              link = "df f1 ea 11 01 00 00 00 31 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
            0 00 19 00 00 00 02 00 00 83 b5 00 00 a8 bf 00 00 00 00 61 63 6c 6f 63 61 6c " (
            49)
            BLOCKS:
            (0):14090
            TOTAL: 1
            

            Dumping the directory block shows nothing obviously broken, except that the parent entry ".." is inode <2>, the ext4 root inode, which should never be used by Lustre directories but shouldn't cause any problems for e2fsck:

            debugfs:  block_dump 14090
            0000  418e 3cb5 0c00 0102 2e00 0000 0200 0000  A.<.............
            0020  f40f 0202 2e2e 0000 2537 0000 1a37 0000  ........%7...7..
            0040  2037 0000 2f37 0000 2e37 0000 2637 0000   7../7...7..&7..
            0060  2d37 0000 0f37 0000 2737 0000 1337 0000  -7...7..'7...7..
            0100  1237 0000 1b37 0000 1e37 0000 2c37 0000  .7...7...7..,7..
            0120  3037 0000 0000 0000 1f37 0000 0d37 0000  07.......7...7..
            0140  3137 0000 1837 0000 3437 0000 1037 0000  17...7..47...7..
            0160  2937 0000 3237 0000 3337 0000 1d37 0000  )7..27..37...7..
            0200  1137 0000 3537 0000 2b37 0000 3637 0000  .7..57..+7..67..
            0220  2837 0000 3737 0000 0000 0000 0000 0000  (7..77..........
            0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
            *
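
            Hand-decoding the first 16 bytes of that dump against the standard ext4 directory entry layout (inode: 4 bytes little-endian, rec_len: 2, name_len: 1, file_type: 1, then the name) confirms the same picture:

            # 41 8e 3c b5 -> inode 0xb53c8e41 = 3040644673    entry "."
            # 0c 00       -> rec_len 12
            # 01 02       -> name_len 1, file_type 2 (directory)
            # 2e 00 00 00 -> "." plus padding
            # 02 00 00 00 -> inode 2 (the ext4 root)          entry ".."
            # f4 0f       -> rec_len 4084 (covers the rest of the block)
            # 02 02       -> name_len 2, file_type 2 (directory)
            # 2e 2e 00 00 -> ".." plus padding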
            

            Even listing the directory shows nothing wrong with it except the ".." inode number:

            debugfs:  ls <3040644673>
             3040644673  (12) .    2  (4084) ..
            

            The code that is triggering the abort is in e2fsck/pass2.c::check_dir_block(), where e2fsck_dir_info_get_parent() fails:

                                    /* Look up the recorded parent of this
                                     * subdirectory in the dir_info list built
                                     * while scanning directory inodes; if no
                                     * entry exists, e2fsck cannot continue. */
                                    if (e2fsck_dir_info_get_parent(ctx, dirent->inode,
                                                                   &subdir_parent)) {
                                            cd->pctx.ino = dirent->inode;
                                            fix_problem(ctx, PR_2_NO_DIRINFO, &cd->pctx);
                                            goto abort_free_dict;
                                    }
            

            after which e2fsck immediately aborts.

            I'll need to dig into the population of the dir_info list to see why it is choking on "2" as the parent, since every directory under the root "/" directory also has <2> as the parent. Maybe that is special-cased in the code? Another possibility is that the parent directory "2" isn't the problem at all, but rather that this directory is not connected anywhere; I'd think that would be handled by adding it to lost+found or similar. It may be possible to debug this by creating a test filesystem that is similarly corrupted (a non-root, possibly disconnected directory with <2> as the parent) to see what it is that triggers e2fsck to be unhappy.


            adilger Andreas Dilger added a comment -

            There were a couple of patches landed for 2.8.0 that make ldiskfs more robust in the face of similar corruption:

            http://review.whamcloud.com/16679 "LU-1026 ldiskfs: make bitmaps corruption not fatal"
            http://review.whamcloud.com/16312 "LU-7114 ldiskfs: corrupted bitmaps handling patches"

            adilger Andreas Dilger added a comment -

            The ext2fs_warn_bitmap() function is the source of the message "Illegal inode number passed to ext2fs_test_inode_bitmap #0 for in-use inode map", but it turns out that this isn't the reason e2fsck aborted, since in the first run this message appeared some time before the abort. The abort is caused by the later error "Internal error: couldn't find dir_info for 3040644673", which names one of the inodes repaired in a previous run.

            The "Internal error" message did appear for a different inode the first time, so it does seem possible that e2fsck is repairing these inodes on each run, but not as completely as it should, causing an abort on every run.
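
            As a practical aside, when e2fsck keeps finding new problems, a common pattern is to rerun it until it exits clean, watching the exit status (a sketch; the device path is illustrative):

            # e2fsck exit codes: 0 = clean, 1/2 = errors were corrected,
            # 4 and up = uncorrected errors or an internal abort as above.
            while :; do
                e2fsck -fy /dev/md0; rc=$?
                [ "$rc" -eq 0 ] && break                      # clean run
                [ "$rc" -ge 4 ] && { echo "e2fsck rc=$rc"; break; }
            done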

            People

              Assignee: Yang Sheng
              Reporter: Cory Brassington (Inactive)
              Votes: 0
              Watchers: 8
