
[LU-2041] SWL ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Labels: None
    • Environment: SWL - Hyperion/LLNL
    • Severity: 3
    • 4217

    Description

      While running the SWL test, after ~6 hours we are seeing an ldiskfs failure and EROFS on the MDT.

       
      Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4918:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
      Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4233:0:(service.c:2105:ptlrpc_handle_rs()) All locks stolen from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
      Sep 27 20:16:39 hyperion-rst6 kernel: Aborting journal on device md1-8.
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: Remounting filesystem read-only
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
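
      For reference (from the ext4/ldiskfs on-disk format, not specific to this
      ticket): the "rec_len is smaller than minimal" check rejects any live
      directory entry whose rec_len is below EXT4_DIR_REC_LEN(name_len), so the
      rec_len=0, name_len=4 entry above is plainly corrupt.

      /* On-disk directory entry (ext4; ldiskfs uses the same layout with an
       * LDISKFS_ prefix), shown for reference. */
      struct ext4_dir_entry_2 {
              __le32  inode;                  /* inode number; 0 means unused */
              __le16  rec_len;                /* distance to the next entry */
              __u8    name_len;
              __u8    file_type;
              char    name[EXT4_NAME_LEN];    /* name, not NUL-terminated */
      };

      /* Minimal legal rec_len: 8-byte header + name, rounded up to 4 bytes. */
      #define EXT4_DIR_PAD            4
      #define EXT4_DIR_ROUND          (EXT4_DIR_PAD - 1)
      #define EXT4_DIR_REC_LEN(name_len)      (((name_len) + 8 + \
                                                EXT4_DIR_ROUND) & ~EXT4_DIR_ROUND)

      /* Here EXT4_DIR_REC_LEN(4) == 12, but the on-disk rec_len is 0. */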
      

      Ran fsck on the device; it recovered:

      LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
      LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
      Lustre: MGC192.168.127.6@o2ib: Reactivating import
      Lustre: lustre-MDT0000: used disk, loading
      Lustre: 5311:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 4541171590
      Lustre: lustre-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
      Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Skipped 2 previous similar messages
      Lustre: lustre-MDT0000: Will be in recovery for at least 2:30, or until 91 clients reconnect
      LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541171591, ql: 62, comp: 29, conn: 91, next: 4541177085, last_committed: 4541171590)
      LustreError: 5315:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
      LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541177086, ql: 62, comp: 29, conn: 91, next: 4541181324, last_committed: 4541171590)
      Lustre: lustre-MDT0000: disconnecting 1 stale clients
      Lustre: lustre-MDT0000: Recovery over after 1:24, of 91 clients 90 recovered and 1 was evicted.
      Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0028_UUID now active, resetting orphans
      
      

      Will continue to run and see if we get a repeat.

      Activity


            Liang Zhen (Inactive) added a comment -

            Just posted a patch: http://review.whamcloud.com/#change,4186
            It fixes a corner case in osd_it_ea_rec() that could cause an inconsistent OI cache. I'm not 100% sure it is the fix for this bug, but at least the last SWL run went much longer than before without hitting this bug after the patch was applied.

            Liang Zhen (Inactive) added a comment -

            Also added another fix from wangdi to the debug patch; the latest build is: http://build.whamcloud.com/job/lustre-reviews/9641/

            Liang Zhen (Inactive) added a comment -

            I merged the fix for LU-1951, the fix for LU-2041, and the debug patch from LU-1948 together (http://review.whamcloud.com/#change,4153); here is its build: http://build.whamcloud.com/job/lustre-reviews/9619/. I think it's the next thing we should try on Hyperion. Cliff, could you please try it? Thanks.

            Liang Zhen (Inactive) added a comment -

            I rebased my debug patch and made it depend on the LU-1951 patch; the build is ready:
            http://build.whamcloud.com/job/lustre-reviews/9573/
            Liang Zhen (Inactive) added a comment - edited

            I checked the PDO patch again but found nothing. A directory created by fdtree has only 20 entries, so it is still a plain directory without an index, and I didn't change that code path at all. Also, directories created by fdtree have hostname+pid as part of the directory name, so each directory should be unique to its thread; if there were a bug that messes up file creation within a single thread, we would likely have seen it months ago, because PDO is a 2.2 feature.

            However, I can't explain why we see two failures in different places (wangdi's previous comment). The only possibility that comes to mind is that the OSD created two instances of htree_lock_head for the directory, or, in a rare case, even two lu_objects for the same directory.

            The patch I posted in the previous comment should give the information needed to verify this (http://review.whamcloud.com/#change,4020, build: http://build.whamcloud.com/job/lustre-reviews/9547/), so we may have to wait for another run with this patch.
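
            A toy model of that hypothesis (hypothetical names, not actual OSD
            code): if an object-lookup race hands two threads separate lock
            objects for the same on-disk directory, the PDO lock no longer
            excludes them from each other:

            #include <pthread.h>

            /* Stand-in for htree_lock_head: one lock per in-memory object. */
            struct dir_object {
                    pthread_mutex_t lock;
            };

            static void insert_entry(struct dir_object *o)
            {
                    pthread_mutex_lock(&o->lock);
                    /* ... read, split, and rewrite the shared directory block ... */
                    pthread_mutex_unlock(&o->lock);
            }

            /* If thread A looked up object "a" while thread B raced and got a
             * duplicate object "b" for the SAME directory, insert_entry(a) and
             * insert_entry(b) serialize on different mutexes and can interleave
             * freely on the shared block -- an unserialized double insert of
             * the kind that could corrupt rec_len. */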

            Liang Zhen (Inactive) added a comment - edited

            fdtree actually creates 10 subdirectories and 10 files, so the directory has 20 entries but is still a plain directory without an index; I tried this on my local workstation and verified it.
            The question is why we only see this while running fdtree. The logs on this ticket don't print the filename, but name_len is 4, which matches the name length fdtree uses, so I guess the directory was created by fdtree.

            Di Wang added a comment -

            Please see the trace on LU-1948

            2012-09-18 14:13:37 [<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs]
            2012-09-18 14:13:37 [<ffffffffa1001689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs]
            2012-09-18 14:13:37 [<ffffffffa100dbeb>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs]
            2012-09-18 14:13:37 [<ffffffffa0eee977>] __mdd_index_insert_only+0x147/0x150 [mdd]
            2012-09-18 14:13:37 [<ffffffffa0eef9b1>] __mdd_index_insert+0x51/0x1f0 [mdd]
            2012-09-18 14:13:37 [<ffffffffa0ef55e3>] mdd_create+0x19a3/0x20c0 [mdd]
            2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            2012-09-18 14:13:37 [<ffffffffa06a4637>] cml_create+0x97/0x250 [cmm]
            2012-09-18 14:13:37 [<ffffffffa0f8bb9f>] mdt_reint_open+0x108f/0x18a0 [mdt]
            2012-09-18 14:13:37 [<ffffffffa0f75151>] mdt_reint_rec+0x41/0xe0 [mdt]

            Then in ldiskfs_add_entry: since there are only about 10 entries per directory, it should not be a dx (indexed) directory, and the directory size should be exactly 4096 (a single block).

            blocks = dir->i_size >> sb->s_blocksize_bits;
            for (block = 0; block < blocks; block++) {
                    bh = ldiskfs_bread(handle, dir, block, 0, &retval);
                    if (!bh)
                            return retval;
                    retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh);
                    /* <--- it should go into add_dirent_to_buf here, since the
                     * initial size should be 4096 and adding 10 entries should
                     * not go over 4096. */
                    if (retval != -ENOSPC) {
                            brelse(bh);
                            return retval;
                    }
                    if (blocks == 1 && !dx_fallback &&
                        LDISKFS_HAS_COMPAT_FEATURE(sb, LDISKFS_FEATURE_COMPAT_DIR_INDEX))
                            return make_indexed_dir(handle, dentry, inode, bh);
                    brelse(bh);
            }
            bh = ldiskfs_append(handle, dir, &block, &retval);
            if (!bh)
                    return retval;
            de = (struct ldiskfs_dir_entry_2 *)bh->b_data;
            de->inode = 0;
            de->rec_len = ldiskfs_rec_len_to_disk(blocksize, blocksize);
            retval = add_dirent_to_buf(handle, dentry, inode, de, bh);
            /* <---- even if it enters from here, de->inode = 0 makes it
             * impossible to reach the BUG_ON. */


            nasf (Inactive) added a comment -

            Di, why do you think that "If thread A and B are serialized for this function, both of them should fail at ldiskfs_check_dir_entry"?

            If thread A calls add_dirent_to_buf() with a NULL "de", it will call ldiskfs_check_dir_entry(); if thread B calls add_dirent_to_buf() with a non-NULL "de", it will not call ldiskfs_check_dir_entry().

            So thread A and thread B may fail at different positions. Maybe I am missing something; please correct me.
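
            For reference, an abridged skeleton of add_dirent_to_buf() from the
            ext4 code of that era (details elided; the ldiskfs version is
            patched but keeps the same shape) shows why entry validation runs
            only on the de == NULL path:

            static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
                                         struct inode *inode, struct ext4_dir_entry_2 *de,
                                         struct buffer_head *bh)
            {
                    /* ... local declarations elided ... */
                    if (!de) {
                            /* No target entry given: scan the block, validating
                             * every entry along the way. */
                            de = (struct ext4_dir_entry_2 *)bh->b_data;
                            top = bh->b_data + blocksize - reclen;
                            while ((char *)de <= top) {
                                    if (!ext4_check_dir_entry("ext4_add_entry",
                                                              dir, de, bh, offset))
                                            return -EIO;
                                    /* ... stop at an entry with enough room,
                                     * else advance: ... */
                                    de = (struct ext4_dir_entry_2 *)((char *)de + rlen);
                                    offset += rlen;
                            }
                            if ((char *)de > top)
                                    return -ENOSPC;
                    }
                    /* A non-NULL "de" (e.g. the fresh tail entry set up after
                     * ldiskfs_append()) skips the scan, so nothing calls
                     * ext4_check_dir_entry() on that path; the split/insert
                     * logic below runs either way. */
                    /* ... */
            }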


            Liang Zhen (Inactive) added a comment -

            Nasf, I would suggest running wangdi's patch on LU-1948 to gather more information instead of your patch; actually, I just added a little more debug information to Di's patch: http://review.whamcloud.com/#change,4020
            What do you think?


            Liang Zhen (Inactive) added a comment -

            For a directory without an index, the PDO lock is equivalent to an RW semaphore that is exclusive for writing, which means PDO is effectively disabled for a small directory like this. But I will check my code again...
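
            As a loose analogy (not the actual htree_lock API): with no index
            there is nothing to lock per-key, so for a plain directory the PDO
            path should reduce to a whole-directory rwlock, and two inserts into
            the same small directory should never run concurrently:

            #include <pthread.h>

            static pthread_rwlock_t dir_lock = PTHREAD_RWLOCK_INITIALIZER;

            static void plain_dir_insert(void)
            {
                    pthread_rwlock_wrlock(&dir_lock);  /* exclusive for writing */
                    /* ... modify the single directory block ... */
                    pthread_rwlock_unlock(&dir_lock);
            }

            static void plain_dir_lookup(void)
            {
                    pthread_rwlock_rdlock(&dir_lock);  /* shared for reading */
                    /* ... scan the block ... */
                    pthread_rwlock_unlock(&dir_lock);
            }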

            Di Wang added a comment -

            Sorry for stepping in here. Clearly LU-1948 is not fixed by the LU-1976 patch. I just re-investigated the results Cliff got on LU-1948. It seems to me that two threads concurrently accessed the directory.

            Thread A added an entry to directory 78644697 and hit a zero rec_len here:

             
            2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent:
            2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4
            2012-09-18 14:13:37 Aborting journal on device md1-8.
            

            Thread B hit the BUG_ON at the later point that I added for debugging the problem:

            @@ -1387,7 +1751,17 @@
                    if (de->inode) {
                            struct ext4_dir_entry_2 *de1 = (struct ext4_dir_entry_2 *)((char *)de + nlen);
                            de1->rec_len = ext4_rec_len_to_disk(rlen - nlen, blocksize);
            +               if (de1->rec_len == 0) {
            +                       __ext4_error(dir->i_sb, function, "%lu: name %s, lck %p, name_len=%d\n",
            +                                    dir->i_ino, name, name_len);
            +                       BUG_ON(de1->rec_len == 0);
            +               }
                            de->rec_len = ext4_rec_len_to_disk(nlen, blocksize);
            +               if (de->rec_len == 0) {
            +                       __ext4_error(dir->i_sb, function, "%lu: name %s, lck %p name_len=%d\n",
            +                                    dir->i_ino, name, name_len);
            +                       BUG_ON(de->rec_len == 0);
            +               }
                            de = de1;
                    }
                    de->file_type = EXT4_FT_UNKNOWN;
            
            

            The console message when the BUG was hit:

             
            2012-09-18 14:13:37 78644697: name L0F7, name_len=4
            2012-09-18 14:13:37
            2012-09-18 14:13:37 ------------[ cut here ]------------
            2012-09-18 14:13:37 kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757!
            2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP
            2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port
            2012-09-18 14:13:37 CPU 7
            ......
            

            If threads A and B were serialized through this function, both of them should have failed at ldiskfs_check_dir_entry. So this still looks like a problem in the ldiskfs pdir patch. Liang, could you please have another look at this one?

            Also, for this test each directory only includes 10 files, so it should not involve index splitting IMHO; all the directories are just single-block directories.
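
            To make the unserialized interleaving concrete, here is a standalone
            sketch (an assumed interleaving, not the actual ldiskfs code; the
            struct and macro are reduced versions of the real ones): if thread B
            re-reads the tail entry after thread A has already shrunk it, B's
            split produces exactly the rec_len=0, name_len=4 entry seen in the
            logs. The sequential double call below models B reading "de" after
            A's write:

            #include <stdio.h>
            #include <string.h>

            /* ext4/ldiskfs rounding: 8-byte header + name, rounded up to 4 */
            #define DIR_REC_LEN(name_len)   (((name_len) + 8 + 3) & ~3)

            struct dirent2 {                /* reduced ldiskfs_dir_entry_2 */
                    unsigned short rec_len;
                    unsigned char  name_len;
            };

            /* The split step of add_dirent_to_buf(): shrink *de to its minimal
             * length and turn the remainder into a new tail entry after it. */
            static void split_entry(char *block, unsigned int off)
            {
                    struct dirent2 *de = (struct dirent2 *)(block + off);
                    unsigned int nlen = DIR_REC_LEN(de->name_len);
                    unsigned int rlen = de->rec_len;   /* read current length */
                    struct dirent2 *de1 = (struct dirent2 *)((char *)de + nlen);

                    de1->rec_len = rlen - nlen;        /* new tail gets the rest */
                    de->rec_len  = nlen;               /* de shrinks to minimal */
            }

            int main(void)
            {
                    char block[4096];
                    struct dirent2 *tail = (struct dirent2 *)block;

                    memset(block, 0, sizeof(block));
                    tail->rec_len  = 4096;             /* one big tail entry */
                    tail->name_len = 4;                /* fdtree-style name */

                    split_entry(block, 0);             /* thread A's split */
                    split_entry(block, 0);             /* thread B, racing:
                                                        * rlen == nlen now, so
                                                        * the new tail entry
                                                        * gets rec_len == 0 */

                    printf("tail rec_len after racing splits: %u\n",  /* 0 */
                           (unsigned)((struct dirent2 *)(block + DIR_REC_LEN(4)))->rec_len);
                    return 0;
            }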


            People

              Assignee: nasf (Inactive)
              Reporter: Cliff White (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: