[LU-477] Oops: RIP: ldiskfs:ldiskfs_clear_inode+0x81/0xb0 Created: 02/Jul/11  Updated: 21/Jul/11  Resolved: 21/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug
Priority: Blocker
Reporter: Jian Yu
Assignee: Jian Yu
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Lustre Branch: master
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/192/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/arch=x86_64,distro=el5/
Distro/Arch: CentOS5.6/x86_64
Kernel Version: 2.6.18-238.12.1.el5_lustre.g6a3d997


Severity: 3
Rank (Obsolete): 4949

 Description   

While formatting a 128TB OST on DDN SFA10KE with the following command:

mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=10.0.2.15@tcp --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_16tb' /dev/large_vg/ost_lv

mkfs.lustre hit a kernel panic as follows:

Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv =====================
init dynlocks cache
ldiskfs created from ext4-2.6-rhel5
LDISKFS-fs (dm-3): warning: maximal mount count reached, running e2fsck is recommended
LDISKFS-fs: can't allocate buddy meta group
LDISKFS-fs (dm-3): failed to initalize mballoc (-12)
LDISKFS-fs (dm-3): mount failed
Unable to handle kernel NULL pointer dereference at 00000000000001c8 RIP: 
 [<ffffffff887421f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
PGD 7c3436067 PUD 7c0051067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /block/ram0/dev
CPU 3 
Modules linked in: ldiskfs(U) jbd2(U) crc16(U) lnet(U) libcfs(U) raid0(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxgb3i(U) cxgb3(U) 8021q(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) floppy(U) 8139too(U) mlx4_en(U) tpm_tis(U) ide_cd(U) i2c_piix4(U) tpm(U) parport_pc(U) sfablkdrvr(U) parport(U) 8139cp(U) mlx4_core(U) serio_raw(U) tpm_bios(U) cdrom(U) pcspkr(U) i2c_core(U) mii(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 3290, comm: mkfs.lustre Tainted: G      2.6.18-238.12.1.el5_lustre.g6a3d997 #1
RIP: 0010:[<ffffffff887421f1>]  [<ffffffff887421f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
RSP: 0018:ffff8107c0bc7ad8  EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffff8104d1978990 RCX: ffff8107c01b2cc0
RDX: ffff8107c01b2cc0 RSI: ffff8104d1978b98 RDI: ffff8104d1978990
RBP: ffff8104d1978890 R08: ffff810000032600 R09: 7fffffffffffffff
R10: ffff8107c0bc78a8 R11: ffffffff80039e56 R12: ffff8107c0050948
R13: 0000000000000000 R14: ffff8107d908d000 R15: ffffffff88742600
FS:  00002aaed55fa6e0(0000) GS:ffff81011bbdb640(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000001c8 CR3: 00000007c6925000 CR4: 00000000000006e0
Process mkfs.lustre (pid: 3290, threadinfo ffff8107c0bc6000, task ffff8107d77507a0)
Stack:  7fffffffffffffff ffff8104d1978990 ffff8107c01b2c00 ffffffff8002303b
 ffff8104d1978990 ffffffff80039f9c 0000000000000000 ffff8107c00508e8
 0000000000000000 ffffffff800ede72 ffff8107c01b2c00 ffffffff88764d00
Call Trace:
 [<ffffffff8002303b>] clear_inode+0xd2/0x123
 [<ffffffff80039f9c>] generic_drop_inode+0x146/0x15a
 [<ffffffff800ede72>] shrink_dcache_for_umount_subtree+0x1f2/0x21e
 [<ffffffff800ee40c>] shrink_dcache_for_umount+0x35/0x43
 [<ffffffff800e636b>] generic_shutdown_super+0x1b/0xfb
 [<ffffffff800e647c>] kill_block_super+0x31/0x45
 [<ffffffff800e654a>] deactivate_super+0x6a/0x82
 [<ffffffff800e6c6f>] get_sb_bdev+0x121/0x16c
 [<ffffffff800e65f5>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800e66be>] do_kern_mount+0x36/0x4d
 [<ffffffff800f0fc6>] do_mount+0x6a9/0x719
 [<ffffffff8002b502>] flush_tlb_page+0xac/0xda
 [<ffffffff8001125b>] do_wp_page+0x3f8/0x91e
 [<ffffffff88030d09>] :jbd:do_get_write_access+0x4f9/0x530
 [<ffffffff80019de3>] __getblk+0x25/0x236
 [<ffffffff800096d4>] __handle_mm_fault+0xf6b/0x1039
 [<ffffffff88030804>] :jbd:journal_stop+0x249/0x255
 [<ffffffff800ce756>] zone_statistics+0x3e/0x6d
 [<ffffffff8000f41e>] __alloc_pages+0x78/0x308
 [<ffffffff800eadb4>] sys_mkdirat+0xd1/0xe4
 [<ffffffff8004c74a>] sys_mount+0x8a/0xcd
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 48 8b b8 c8 01 00 00 48 85 ff 74 13 48 83 c4 08 48 8d b5 30 
RIP  [<ffffffff887421f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
 RSP <ffff8107c0bc7ad8>

The issue was also described in LU-136 #comment-14649, #comment-17082 and #comment-14650.



 Comments   
Comment by Jian Yu [ 02/Jul/11 ]

The core dump showed that:

[root@localhost ~]# crash /usr/lib/debug/lib/modules/2.6.18-238.12.1.el5_lustre.g6a3d997/vmlinux /mnt/var/crash/2011-07-01-10\:36/vmcore
<~snip~>
WARNING: cannot determine pgdat list for this kernel/architecture

please wait... (gathering kmem slab cache data)
crash: invalid size request: 0  type: "array cache array"

crash: unable to initialize kmem slab cache subsystem

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-238.12.1.el5_lustre.g6a3d997/vmlinux
    DUMPFILE: /mnt/var/crash/2011-07-01-10:36/vmcore
        CPUS: 4
        DATE: Fri Jul  1 10:35:51 2011
      UPTIME: 00:05:55
LOAD AVERAGE: 1.46, 0.54, 0.21
       TASKS: 150
    NODENAME: localhost.localdomain
     RELEASE: 2.6.18-238.12.1.el5_lustre.g6a3d997
     VERSION: #1 SMP Thu Jun 23 12:18:56 PDT 2011
     MACHINE: x86_64  (2667 Mhz)
      MEMORY: 0
       PANIC: ""
         PID: 3290
     COMMAND: "mkfs.lustre"
        TASK: ffff8107d77507a0  [THREAD_INFO: ffff8107c0bc6000]
         CPU: 3
       STATE: TASK_RUNNING (PANIC)

crash> bt -l 3290
PID: 3290   TASK: ffff8107d77507a0  CPU: 3   COMMAND: "mkfs.lustre"
 #0 [ffff8107c0bc7830] crash_kexec at ffffffff800af898
    include/asm/system.h: 161
 #1 [ffff8107c0bc78f0] __die at ffffffff80065117
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/traps.c: 566
 #2 [ffff8107c0bc7930] do_page_fault at ffffffff8006748d
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 582
 #3 [ffff8107c0bc7a20] error_exit at ffffffff8005dde9
    include/linux/bitops.h: 42
    [exception RIP: ldiskfs_clear_inode+129]
    RIP: ffffffff887421f1  RSP: ffff8107c0bc7ad8  RFLAGS: 00010296
    RAX: 0000000000000000  RBX: ffff8104d1978990  RCX: ffff8107c01b2cc0
    RDX: ffff8107c01b2cc0  RSI: ffff8104d1978b98  RDI: ffff8104d1978990
    RBP: ffff8104d1978890   R8: ffff810000032600   R9: 7fffffffffffffff
    R10: ffff8107c0bc78a8  R11: ffffffff80039e56  R12: ffff8107c0050948
    R13: 0000000000000000  R14: ffff8107d908d000  R15: ffffffff88742600
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff8107c0bc7af0] clear_inode at ffffffff8002303b
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/inode.c: 257
 #5 [ffff8107c0bc7b00] generic_drop_inode at ffffffff80039f9c
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/inode.c: 1091
 #6 [ffff8107c0bc7b20] shrink_dcache_for_umount_subtree at ffffffff800ede72
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/dcache.c: 642
 #7 [ffff8107c0bc7b40] shrink_dcache_for_umount at ffffffff800ee40c
    include/linux/list.h: 732
 #8 [ffff8107c0bc7b50] generic_shutdown_super at ffffffff800e636b
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 243
 #9 [ffff8107c0bc7b70] kill_block_super at ffffffff800e647c
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 756
#10 [ffff8107c0bc7b90] deactivate_super at ffffffff800e654a
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 184
#11 [ffff8107c0bc7bb0] get_sb_bdev at ffffffff800e6c6f
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 728
#12 [ffff8107c0bc7c20] vfs_kern_mount at ffffffff800e65f5
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 845
#13 [ffff8107c0bc7c60] do_kern_mount at ffffffff800e66be
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 879
#14 [ffff8107c0bc7c90] do_mount at ffffffff800f0fc6
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/namespace.c: 1105
#15 [ffff8107c0bc7f30] sys_mount at ffffffff8004c74a
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/namespace.c: 1600
#16 [ffff8107c0bc7f80] tracesys at ffffffff8005d28d (via system_call)
    include/linux/bitops.h: 42
    RIP: 00000031e78d4a0a  RSP: 00007fff20effc88  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000407593  RSI: 00007fff20f00d10  RDI: 00007fff20f03d60
    RBP: 0000000000613bc0   R8: 00007fff20f01d60   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 000000000060a040
    R13: 00007fff20f00d60  R14: 00007fff20f03d60  R15: 0000000000000000
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

crash> l *0xffffffff887421f1
No source file for address 0xffffffff887421f1.
gdb: gdb request failed: l *0xffffffff887421f1

crash> l *0xffffffff88742272
No source file for address 0xffffffff88742272.
gdb: gdb request failed: l *0xffffffff88742272

crash> l *0xffffffff8002303b
0xffffffff8002303b is in clear_inode (fs/inode.c:257).
252             BUG_ON(inode->i_state & I_CLEAR);
253             wait_on_inode(inode);
254             DQUOT_DROP(inode);
255             if (inode->i_sb && inode->i_sb->s_op->clear_inode)
256                     inode->i_sb->s_op->clear_inode(inode);
257             if (S_ISBLK(inode->i_mode) && inode->i_bdev)
258                     bd_forget(inode);
259             if (S_ISCHR(inode->i_mode) && inode->i_cdev)
260                     cd_forget(inode);
261             inode->i_state = I_CLEAR;
crash> 

The backtrace showed neither the source file containing ldiskfs_clear_inode() nor the source line corresponding to offset +129, where the exception occurred.

By running gdb, I got:

[root@localhost ~]# gdb /lib/modules/2.6.18-238.12.1.el5_lustre.g6a3d997/updates/kernel/fs/lustre-ldiskfs/ldiskfs.ko
<~snip~>
(gdb) l ldiskfs_clear_inode
814     /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el5/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/super.c: No such file or directory.
        in /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el5/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/super.c
(gdb) 

I could not obtain the super.c file referenced above, so I rebuilt from the kernel and Lustre sources and got:

ldiskfs/ldiskfs/super.c:

    813 static void destroy_inodecache(void)
    814 {
    815         kmem_cache_destroy(ldiskfs_inode_cachep);
    816 }
    817 
    818 static void ldiskfs_clear_inode(struct inode *inode)
    819 {
    820 #ifdef CONFIG_LDISKFS_FS_POSIX_ACL
    821         if (LDISKFS_I(inode)->i_acl &&
    822                         LDISKFS_I(inode)->i_acl != LDISKFS_ACL_NOT_CACHED) {
    823                 posix_acl_release(LDISKFS_I(inode)->i_acl);
    824                 LDISKFS_I(inode)->i_acl = LDISKFS_ACL_NOT_CACHED;
    825         }
    826         if (LDISKFS_I(inode)->i_default_acl &&
    827                         LDISKFS_I(inode)->i_default_acl != LDISKFS_ACL_NOT_CACHED) {
    828                 posix_acl_release(LDISKFS_I(inode)->i_default_acl);
    829                 LDISKFS_I(inode)->i_default_acl = LDISKFS_ACL_NOT_CACHED;
    830         }
    831 #endif
    832         ldiskfs_discard_preallocations(inode);
    833         if (LDISKFS_JOURNAL(inode))
    834                 jbd2_journal_release_jbd_inode(LDISKFS_SB(inode->i_sb)->s_journal,
    835                                        &LDISKFS_I(inode)->jinode);
    836 }

Andreas, could you please give some suggestions here? I'm not sure how to investigate further to find the exact location of the exception.

Comment by Andreas Dilger [ 03/Jul/11 ]

It is possible to use something like:

gdb> list *(gdb_list_inodes + 123)

to figure out the line number within the function. This offset should be printed in the original oops message.

Alternately, it should hopefully be possible to look at the stack trace to see where the code was running before it crashed.
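
As a worked example of matching the two notations: the oops prints the faulting location as symbol+offset/size in hex (ldiskfs_clear_inode+0x81/0xb0), while crash prints the same offset in decimal (+129). The following is only a minimal sketch; parse_oops_location() and gdb_list_expr() are hypothetical helper names, not part of gdb, crash, or the kernel.

```c
#include <stdio.h>

/* Hypothetical helper types and functions for illustration only. */
struct oops_location {
    char symbol[64];
    unsigned long offset;   /* bytes from the start of the function */
    unsigned long size;     /* total size of the function in bytes */
};

/* Splits "symbol+0xOFF/0xSIZE" into its parts.
 * Returns 0 on success, -1 if the text does not match that shape. */
static int parse_oops_location(const char *text, struct oops_location *loc)
{
    /* %63[^+] reads the symbol name up to the '+' separator. */
    if (sscanf(text, "%63[^+]+0x%lx/0x%lx",
               loc->symbol, &loc->offset, &loc->size) != 3)
        return -1;
    return 0;
}

/* Builds the expression to type at the gdb prompt (decimal offset). */
static void gdb_list_expr(const struct oops_location *loc,
                          char *buf, size_t len)
{
    snprintf(buf, len, "list *(%s+%lu)", loc->symbol, loc->offset);
}
```

Feeding it the string from this oops yields an offset of 129 and a function size of 176 bytes, so the gdb expression would be list *(ldiskfs_clear_inode+129).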

Comment by Jian Yu [ 05/Jul/11 ]

Here:

# gdb /lib/modules/2.6.18-238.12.1.el5_lustre/updates/kernel/fs/lustre-ldiskfs/ldiskfs.ko
<~snip~>
Reading symbols from /lib/modules/2.6.18-238.12.1.el5_lustre/updates/kernel/fs/lustre-ldiskfs/ldiskfs.ko...done.
(gdb) l *(ldiskfs_clear_inode+129)
0x291f1 is in ldiskfs_clear_inode (/mnt/src/lustre-release/build/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/super.c:833).
828                     posix_acl_release(LDISKFS_I(inode)->i_default_acl);
829                     LDISKFS_I(inode)->i_default_acl = LDISKFS_ACL_NOT_CACHED;
830             }
831     #endif
832             ldiskfs_discard_preallocations(inode);
833             if (LDISKFS_JOURNAL(inode))
834                     jbd2_journal_release_jbd_inode(LDISKFS_SB(inode->i_sb)->s_journal,
835                                            &LDISKFS_I(inode)->jinode);
836     }
837

The oops occurred here:

833             if (LDISKFS_JOURNAL(inode))

Here are the definitions of LDISKFS_JOURNAL and LDISKFS_SB:

#define LDISKFS_JOURNAL(inode)  (LDISKFS_SB((inode)->i_sb)->s_journal)

static inline struct ldiskfs_sb_info *LDISKFS_SB(struct super_block *sb)
{               
        return sb->s_fs_info;
}

Comment by Alex Zhuravlev [ 05/Jul/11 ]

Given the following:

LDISKFS-fs: can't allocate buddy meta group
LDISKFS-fs (dm-3): failed to initalize mballoc (-12)
LDISKFS-fs (dm-3): mount failed

it should be easy to reproduce the problem? It also gives a hint that the inode was probably not fully initialized yet.

Comment by Jian Yu [ 05/Jul/11 ]

it should be easy to reproduce the problem? It also gives a hint that the inode was probably not fully initialized yet.

Yes, the memory allocation failure and the oops can be easily reproduced while formatting a 128TB OST.

The following kmalloc() call in fs/ext4/mballoc.c produced the memory allocation failure:

static int ext4_mb_init_backend(struct super_block *sb)
{
    //......
    sbi->s_group_info = kmalloc(array_size, GFP_KERNEL);
    if (sbi->s_group_info == NULL) {
        printk(KERN_ERR "EXT4-fs: can't allocate buddy meta group\n");
        return -ENOMEM;
    }
    //......
}

I'm changing the code to fall back to vmalloc() when kmalloc() fails to allocate enough memory.
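
The fallback idea can be sketched in userspace as follows. This is only a model under stated assumptions, not the actual patch: contig_alloc() and virt_alloc() are hypothetical stand-ins for kmalloc() and vmalloc(), and the 128 KiB limit is an assumed value chosen to mirror the figure discussed in this ticket.

```c
#include <stdlib.h>

/* Assumed limit for the contiguous allocator in this model. */
#define CONTIG_LIMIT (128UL * 1024)

enum alloc_kind { ALLOC_NONE, ALLOC_CONTIG, ALLOC_VIRT };

struct big_buf {
    void *ptr;
    enum alloc_kind kind;   /* remembers which free routine applies */
};

/* Models kmalloc(): fails for requests above the contiguous limit. */
static void *contig_alloc(size_t size)
{
    return size > CONTIG_LIMIT ? NULL : malloc(size);
}

/* Models vmalloc(): no physical contiguity requirement. */
static void *virt_alloc(size_t size)
{
    return malloc(size);
}

/* Try the contiguous allocator first, then fall back. */
static struct big_buf big_alloc(size_t size)
{
    struct big_buf buf = { contig_alloc(size), ALLOC_CONTIG };

    if (buf.ptr == NULL) {
        buf.ptr = virt_alloc(size);
        buf.kind = buf.ptr ? ALLOC_VIRT : ALLOC_NONE;
    }
    return buf;
}

/* Kernel code would choose kfree() or vfree() based on buf->kind. */
static void big_free(struct big_buf *buf)
{
    free(buf->ptr);
    buf->ptr = NULL;
    buf->kind = ALLOC_NONE;
}
```

Recording which allocator succeeded matters because, in the kernel, memory from vmalloc() must be released with vfree() rather than kfree().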

Comment by Alex Zhuravlev [ 05/Jul/11 ]

No, no. What I'm saying is that, to understand and fix the oops, you can replace that kmalloc() with just "return -ENOMEM" and catch the oops.

Comment by Andreas Dilger [ 05/Jul/11 ]

It should be possible to find and fix this issue just through code inspection. Initially I thought it might be the buddy inode, but that isn't allocated until after the failed kmalloc(), so it shouldn't be the cause of the problem. It is possible to determine whether it is i_sb or s_fs_info that is NULL, by checking which one has an offset of 0x1c8 in the struct, due to the oops message "NULL pointer dereference at 00000000000001c8".

I'm looking through this code and have found some other issues:

  • in ext4_mb_init_backend() it is calling the generic get_next_ino() function to assign an inode number to the inode that is holding the buddy bitmap cache. However, this inode number may be any random value, and may conflict with a real in-use inode number. As a result, it looks like it could potentially cause data corruption with the on-disk inode of the same number and/or the in-memory buddy bitmap if iget(sb, ino) finds the wrong inode. It would be better to use EXT4_BAD_INO for this and add a comment that this inode is not hashed, so iget() shouldn't find it.
  • in ext4_mb_init() if ext4_mb_init_backend() succeeds, but there is an error later on in the function (e.g. s_locality_groups), the buddy inode and group cache allocated in ext4_mb_init_backend() are not freed. Moving the call to ext4_mb_init_backend() just above s_proc setup would avoid this problem.

Even

Comment by Jian Yu [ 06/Jul/11 ]

It is possible to determine whether it is i_sb or s_fs_info that is NULL, by checking which one has an offset of 0x1c8 in the struct, due to the oops message "NULL pointer dereference at 00000000000001c8".

(gdb) p &((struct inode *)0).i_sb
$11 = (struct super_block **) 0xf8
(gdb) p &((struct super_block *)0).s_fs_info
$12 = (void **) 0x260
(gdb) p &((struct ldiskfs_sb_info *)0).s_journal
$13 = (struct journal_s **) 0x1c8

So, s_fs_info is NULL.
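
The reasoning behind that conclusion: reading a member through a NULL base pointer faults at address 0 plus the member's offset, and the oops reports CR2 = 0x1c8 with RAX = 0. The sketch below illustrates this with a mock struct; only the 0x1c8 offset comes from the gdb output above, and the padding member is a hypothetical stand-in for the real fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock layout: the padding replaces the real, unlisted members so that
 * s_journal lands at the offset gdb reported for ldiskfs_sb_info. */
struct mock_sb_info {
    char other_members[0x1c8];
    void *s_journal;
};

/* Address the CPU would touch when reading base->s_journal. */
static uintptr_t journal_field_address(uintptr_t base)
{
    return base + offsetof(struct mock_sb_info, s_journal);
}
```

With a base of 0 this returns 0x1c8, which matches the faulting address, whereas the i_sb (0xf8) and s_fs_info (0x260) offsets do not.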

In ldiskfs_fill_super():

{
    //......
    root = ldiskfs_iget(sb, LDISKFS_ROOT_INO);
    //......
    err = ldiskfs_mb_init(sb, needs_recovery);
    if (err) {
        ldiskfs_msg(sb, KERN_ERR, "failed to initalize mballoc (%d)",
            err);
        goto failed_mount4;
    }
    //......
failed_mount4:
    ldiskfs_msg(sb, KERN_ERR, "mount failed");
    destroy_workqueue(LDISKFS_SB(sb)->dio_unwritten_wq);
    //......
out_fail:
    sb->s_fs_info = NULL;
    kfree(sbi);
    lock_kernel();
    return ret;
}

The root inode was not released with iput() before "sb->s_fs_info = NULL". When it was finally dropped during superblock teardown, ldiskfs_clear_inode() dereferenced the NULL s_fs_info and crashed.
It's the same issue as https://bugzilla.kernel.org/show_bug.cgi?id=26752, which was fixed in the following commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=32a9bb57d7c1fd04ae0f72b8f671501f000a0e9f

Thanks Li Wei for helping investigate this.

I'll incorporate that patch along with the other changes.
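
The ordering problem can be modelled in userspace roughly as below. This is a sketch only: every mock_* and fail_mount_* name is hypothetical, and mock_clear_inode() returns -1 at the point where the real ldiskfs_clear_inode() would oops.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures involved. */
struct mock_fs_info { int journal_refs; };
struct mock_super   { struct mock_fs_info *s_fs_info; };
struct mock_inode   { struct mock_super *i_sb; int cleared; };

/* Models ldiskfs_clear_inode(): it reads a field through sb->s_fs_info.
 * Returns -1 where the real code would dereference NULL. */
static int mock_clear_inode(struct mock_inode *inode)
{
    struct mock_fs_info *sbi = inode->i_sb->s_fs_info;

    if (sbi == NULL)
        return -1;
    (void)sbi->journal_refs;
    inode->cleared = 1;
    return 0;
}

/* Failing order: private data is cleared first, and the root inode is
 * only dropped afterwards, during generic superblock teardown. */
static int fail_mount_buggy(struct mock_super *sb, struct mock_inode *root)
{
    sb->s_fs_info = NULL;
    return mock_clear_inode(root);
}

/* Fixed order: release the root inode while s_fs_info is still valid. */
static int fail_mount_fixed(struct mock_super *sb, struct mock_inode *root)
{
    int rc = mock_clear_inode(root);

    sb->s_fs_info = NULL;
    return rc;
}
```

In this model the first variant fails and the second succeeds, which corresponds to calling iput(root) in the failed_mount4 path before s_fs_info is reset.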

Comment by Jian Yu [ 06/Jul/11 ]

In the above patch:

        sb->s_root = d_alloc_root(root);
        if (!sb->s_root) {
                ext4_msg(sb, KERN_ERR, "get root dentry failed");
-               iput(root);
                ret = -ENOMEM;
                goto failed_mount4;
        }
@@ -3647,6 +3646,8 @@ cantfind_ext4:
        goto failed_mount;
 
 failed_mount4:
+       iput(root);
+       sb->s_root = NULL;

After the root dentry is obtained via "sb->s_root = d_alloc_root(root);", sb->s_root is simply set to NULL in the mount failure path. Should d_free() be called to free the dentry here?

Comment by Andreas Dilger [ 06/Jul/11 ]

Yu Jian, you are correct. It looks like the dentry is leaked in this failure case.

However, it also looks like the upstream kernel has coincidentally fixed the original oops by not dereferencing s_fs_info in ext4_clear_inode() in this case, so the original patch could be reverted.

I think the effort to fix this correctly for the older kernels is not worthwhile, because it would mean adding an extra check in ext4_clear_inode() that is virtually always unnecessary. I think we should just use the upstream fix from commit 32a9bb57d7c1fd04ae0f72b8f671501f000a0e9f for our kernel (less risk and effort for us), and work separately to fix the code correctly in the upstream kernel. I've sent an email to that effect, and CC'd you.

Can you please verify that with the fix from 32a9bb57d7c1fd04ae0f72b8f671501f000a0e9f there are no longer crashes on mount when ENOMEM is hit? After that it makes sense to add the patch from http://review.whamcloud.com/#change,545 and any other vmalloc-or-kmalloc changes that are needed to mount the filesystem at > 128 TB. Even if we do full testing for 128TB LUNs, doing mount testing with 129TB LUNs ensures that smaller LUNs can still mount in case of memory fragmentation (as was seen here with 128TB LUNs, and could be hit at even smaller sizes).

Comment by Jian Yu [ 07/Jul/11 ]

Can you please verify that with the fix from 32a9bb57d7c1fd04ae0f72b8f671501f000a0e9f there are no longer crashes on mount when ENOMEM is hit?

Sure. Here is the result:

# time mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=10.0.2.15@tcp --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_16tb' /dev/large_vg/ost_lv
mkfs.lustre: Unable to mount /dev/large_vg/ost_lv: Invalid argument

mkfs.lustre FATAL: failed to write local files

   Permanent disk data:
Target:     largefs-OSTffff
Index:      unassigned
Lustre FS:  largefs
Mount type: ldiskfs
Flags:      0x72
              (OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc,force_over_16tb
Parameters: mgsnode=10.0.2.15@tcp

device size = 134217728MB
formatting backing filesystem ldiskfs on /dev/large_vg/ost_lv
        target name  largefs-OSTffff
        4k blocks     34359738368
        options        -J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init, -F
mkfs_cmd = mke2fs -j -b 4096 -L largefs-OSTffff  -J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init, -F /dev/large_vg/ost_lv 34359738368
mkfs.lustre: exiting with 22 (Invalid argument)

Console log:

Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv =====================
LDISKFS-fs (dm-3): warning: maximal mount count reached, running e2fsck is recommended
LDISKFS-fs: can't allocate buddy meta group
LDISKFS-fs (dm-3): failed to initalize mballoc (-12)
LDISKFS-fs (dm-3): mount failed

No crash occurred. One small issue is that ldiskfs_fill_super() returned its default error "-EINVAL" instead of the "-ENOMEM" returned by ldiskfs_mb_init().
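
That mismatch is visible above: the console log shows -12 (ENOMEM) while mkfs.lustre exits with 22 (EINVAL). A minimal sketch of how this can happen, and of one possible adjustment, follows. The function names are hypothetical and this is not the actual ldiskfs_fill_super() code.

```c
#include <errno.h>

/* Stand-in for ldiskfs_mb_init(); fails the way the console log shows. */
static int mock_mb_init(void)
{
    return -ENOMEM;
}

/* Observed behaviour: ret keeps its default value on the goto path,
 * so the caller never sees the real error code. */
static int fill_super_default_ret(void)
{
    int ret = -EINVAL;
    int err = mock_mb_init();

    if (err)
        goto failed;
    ret = 0;
failed:
    return ret;
}

/* Possible adjustment: copy err into ret before jumping. */
static int fill_super_propagating(void)
{
    int ret = -EINVAL;
    int err = mock_mb_init();

    if (err) {
        ret = err;
        goto failed;
    }
    ret = 0;
failed:
    return ret;
}
```

The first variant returns -EINVAL and the second returns -ENOMEM, so only the second would let mkfs.lustre report an out-of-memory condition.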

I'm verifying the vmalloc patch.

Comment by Jian Yu [ 07/Jul/11 ]

I'm verifying the vmalloc patch.

The patch for master branch is in http://review.whamcloud.com/1071.
I verified it by formatting and mounting 24TB, 128TB, 129TB and 198TB OSTs, and it worked.

Comment by Andreas Dilger [ 07/Jul/11 ]

Can you please also submit a patch with the iput() changes. It should contain the upstream commit hash in the commit comment for future reference.

Also, please make a force-over-128tb patch for master.

Comment by Jian Yu [ 07/Jul/11 ]

in ext4_mb_init_backend() it is calling the generic get_next_ino() function to assign an inode number to the inode that is holding the buddy bitmap cache. However, this inode number may be any random value, and may conflict with a real in-use inode number. As a result, it looks like it could potentially cause data corruption with the on-disk inode of the same number and/or the in-memory buddy bitmap if iget(sb, ino) finds the wrong inode. It would be better to use EXT4_BAD_INO for this and add a comment that this inode is not hashed, so iget() shouldn't find it.

I did not find the get_next_ino() function in kernel linux-2.6.18-238.12.1. The code that gets the buddy cache inode in this kernel version is as follows:

        sbi->s_buddy_cache = new_inode(sb);
        if (sbi->s_buddy_cache == NULL) {
                printk(KERN_ERR "EXT4-fs: can't get new inode\n");
                goto err_freesgi;
        }

If it's better to use EXT4_BAD_INO for the buddy cache inode number, could you please review whether the following patch is correct to get the inode?

--- ext4.h.orig 2011-07-07 14:50:14.000000000 +0800
+++ ext4.h      2011-07-07 14:52:24.000000000 +0800
 static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 {
-       return ino == EXT4_ROOT_INO ||
+       return ino == EXT4_BAD_INO ||
+               ino == EXT4_ROOT_INO ||
                ino == EXT4_JOURNAL_INO ||
                ino == EXT4_RESIZE_INO ||
                (ino >= EXT4_FIRST_INO(sb) &&
--- mballoc.c.orig      2011-07-05 19:20:05.000000000 +0800
+++ mballoc.c   2011-07-07 18:34:00.000000000 +0800
-       sbi->s_buddy_cache = new_inode(sb);
-       if (sbi->s_buddy_cache == NULL) {
-               printk(KERN_ERR "EXT4-fs: can't get new inode\n");
+
+       /*
+        * To avoid conflicting with an on-disk inode of the same number,
+        * EXT4_BAD_INO is used here as the number of the buddy cache inode,
+        * which is not hashed in the inode cache, and then would not be found
+        * by iget().
+        */
+       sbi->s_buddy_cache = ext4_iget(sb, EXT4_BAD_INO);
+       if (IS_ERR(sbi->s_buddy_cache)) {
+               printk(KERN_ERR "EXT4-fs: can't get buddy cache inode\n");
+               ret = PTR_ERR(sbi->s_buddy_cache);
+               sbi->s_buddy_cache = NULL;
                goto err_freesgi;
        }

Comment by Jian Yu [ 07/Jul/11 ]

Can you please also submit a patch with the iput() changes. It should contain the upstream commit hash in the commit comment for future reference.

It's included in http://review.whamcloud.com/1071.

Also, please make a force-over-128tb patch for master.

OK, will do.

Comment by Andreas Dilger [ 07/Jul/11 ]

I did not find get_next_ino() function in kernel linux-2.6.18-238.12.1.

Sorry, I was looking at a newer kernel, where the i_ino assignment was moved out from new_inode() and calls:

sbi->s_buddy_cache->i_ino = get_next_ino();

This was probably done as part of some generic search & replace operation. However, I also don't think it is desirable to use "ext4_iget(EXT4_BAD_INO)" to read the real on-disk inode, since one of the reasons for excluding EXT4_BAD_INO from the "valid" inode range is that it shouldn't be accessed from within the kernel. All I wanted was to make sure that the allocated inode has an inode number that does not collide with a valid on-disk inode number, and EXT4_BAD_INO is a relatively safe choice. I would simply assign that value after new_inode() is finished:

sbi->s_buddy_cache->i_ino = EXT4_BAD_INO; /* avoid potential confusion */

According to Alex, the inode allocated by new_inode() does not actually exist in the inode hash table, so as long as this isn't changed to use ext4_iget() it is safe from being found from another ext4_iget() operation.

Comment by Jian Yu [ 08/Jul/11 ]

Also, please make a force-over-128tb patch for master.

Patch for master branch: http://review.whamcloud.com/1073.

Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el5,ofa #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el5,ofa #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #198
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel5.patch
Comment by Peter Jones [ 08/Jul/11 ]

Can this issue be marked as resolved now or does some further work still remain outstanding?

Comment by Jian Yu [ 11/Jul/11 ]

Can this issue be marked as resolved now or does some further work still remain outstanding?

A similar patch is also needed for the RHEL6 ldiskfs patch series. I'm working on it.

In addition, should I port the RHEL5 series patch in http://review.whamcloud.com/1071 to b1_8?

Comment by Jian Yu [ 11/Jul/11 ]

Hello Andreas,
While looking into the RHEL6 kernel 2.6.32-131.2.1, I found:

include/linux/slab.h:
/*
 * The largest kmalloc size supported by the slab allocators is
 * 32 megabyte (2^25) or the maximum allocatable page order if that is
 * less than 32 MB.
 *
 * WARNING: Its not easy to increase this value since the allocators have
 * to do various tricks to work around compiler limitations in order to
 * ensure proper constant folding.
 */
#define KMALLOC_SHIFT_HIGH      ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
                                (MAX_ORDER + PAGE_SHIFT - 1) : 25)

#define KMALLOC_MAX_SIZE        (1UL << KMALLOC_SHIFT_HIGH)
#define KMALLOC_MAX_ORDER       (KMALLOC_SHIFT_HIGH - PAGE_SHIFT)

The above code was introduced by the following upstream kernel commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0aa817f078b655d0ae36669169d73a5c8a388016

So, 128KB is not an allocation limit for kmalloc() in this kernel. Do I still need to make the kmalloc+vmalloc changes for this kernel, or just the iput+EXT4_BAD_INO changes?

Comment by Peter Jones [ 11/Jul/11 ]

Yu Jian

We are not likely to extend the largest LUN size beyond 24TB for 1.8.x. Is this fix relevant for <=24TB LUNs?

Peter

Comment by Jian Yu [ 11/Jul/11 ]

"We are not likely to extend the largest LUN size beyond 24TB for 1.8.x. Is this fix relevant for <=24TB LUNs?"

The patch in http://review.whamcloud.com/1071 mainly fixes the out-of-memory issue seen while formatting >=128TB LUNs. LUNs of <=24TB would not hit this issue, so the patch is not needed on 1.8.x unless it is going to support >=128TB LUNs.

Comment by Peter Jones [ 11/Jul/11 ]

ok yujian then it sounds like all that remains is the RHEL6 version of the ldiskfs patch.

Comment by Andreas Dilger [ 13/Jul/11 ]

I've submitted a version of this patch upstream, and hopefully it will be included in the Linux 3.1 kernel. Even for the RHEL6 kernel the vmalloc() patch is needed, since large kmalloc() requests can, and will, fail due to memory fragmentation even when a large enough kmalloc() is possible.

Comment by Jian Yu [ 14/Jul/11 ]

The patch for the RHEL6 ldiskfs patch series is at http://review.whamcloud.com/1095.

Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Peter Jones [ 21/Jul/11 ]

Please reopen if any problems are observed running with 128TB LUNs on RHEL6.

Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » i686,server,el5,ofa #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » i686,client,el5,ofa #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
Comment by Build Master (Inactive) [ 21/Jul/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #222
LU-477 allocate memory for s_group_desc and s_group_info by vmalloc()

Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
Files :

  • ldiskfs/kernel_patches/patches/ext4-vmalloc-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series