[LU-477] Oops: RIP: ldiskfs:ldiskfs_clear_inode+0x81/0xb0 Created: 02/Jul/11 Updated: 21/Jul/11 Resolved: 21/Jul/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Branch: master |
||
| Severity: | 3 |
| Rank (Obsolete): | 4949 |
| Description |
|
While formatting a 128TB OST on DDN SFA10KE with the following command:
mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=10.0.2.15@tcp --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_16tb' /dev/large_vg/ost_lv
It hit a kernel panic as follows:
Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv =====================
init dynlocks cache
ldiskfs created from ext4-2.6-rhel5
LDISKFS-fs (dm-3): warning: maximal mount count reached, running e2fsck is recommended
LDISKFS-fs: can't allocate buddy meta group
LDISKFS-fs (dm-3): failed to initalize mballoc (-12)
LDISKFS-fs (dm-3): mount failed
Unable to handle kernel NULL pointer dereference at 00000000000001c8 RIP:
[<ffffffff887421f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
PGD 7c3436067 PUD 7c0051067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 3
Modules linked in: ldiskfs(U) jbd2(U) crc16(U) lnet(U) libcfs(U) raid0(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxgb3i(U) cxgb3(U) 8021q(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) floppy(U) 8139too(U) mlx4_en(U) tpm_tis(U) ide_cd(U) i2c_piix4(U) tpm(U) parport_pc(U) sfablkdrvr(U) parport(U) 8139cp(U) mlx4_core(U) serio_raw(U) tpm_bios(U) cdrom(U) pcspkr(U) i2c_core(U) mii(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 3290, comm: mkfs.lustre Tainted: G 2.6.18-238.12.1.el5_lustre.g6a3d997 #1
RIP: 0010:[<ffffffff887421f1>] [<ffffffff887421f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
RSP: 0018:ffff8107c0bc7ad8 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffff8104d1978990 RCX: ffff8107c01b2cc0
RDX: ffff8107c01b2cc0 RSI: ffff8104d1978b98 RDI: ffff8104d1978990
RBP: ffff8104d1978890 R08: ffff810000032600 R09: 7fffffffffffffff
R10: ffff8107c0bc78a8 R11: ffffffff80039e56 R12: ffff8107c0050948
R13: 0000000000000000 R14: ffff8107d908d000 R15: ffffffff88742600
FS: 00002aaed55fa6e0(0000) GS:ffff81011bbdb640(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000001c8 CR3: 00000007c6925000 CR4: 00000000000006e0
Process mkfs.lustre (pid: 3290, threadinfo ffff8107c0bc6000, task ffff8107d77507a0)
Stack: 7fffffffffffffff ffff8104d1978990 ffff8107c01b2c00 ffffffff8002303b
ffff8104d1978990 ffffffff80039f9c 0000000000000000 ffff8107c00508e8
0000000000000000 ffffffff800ede72 ffff8107c01b2c00 ffffffff88764d00
Call Trace:
[<ffffffff8002303b>] clear_inode+0xd2/0x123
[<ffffffff80039f9c>] generic_drop_inode+0x146/0x15a
[<ffffffff800ede72>] shrink_dcache_for_umount_subtree+0x1f2/0x21e
[<ffffffff800ee40c>] shrink_dcache_for_umount+0x35/0x43
[<ffffffff800e636b>] generic_shutdown_super+0x1b/0xfb
[<ffffffff800e647c>] kill_block_super+0x31/0x45
[<ffffffff800e654a>] deactivate_super+0x6a/0x82
[<ffffffff800e6c6f>] get_sb_bdev+0x121/0x16c
[<ffffffff800e65f5>] vfs_kern_mount+0x93/0x11a
[<ffffffff800e66be>] do_kern_mount+0x36/0x4d
[<ffffffff800f0fc6>] do_mount+0x6a9/0x719
[<ffffffff8002b502>] flush_tlb_page+0xac/0xda
[<ffffffff8001125b>] do_wp_page+0x3f8/0x91e
[<ffffffff88030d09>] :jbd:do_get_write_access+0x4f9/0x530
[<ffffffff80019de3>] __getblk+0x25/0x236
[<ffffffff800096d4>] __handle_mm_fault+0xf6b/0x1039
[<ffffffff88030804>] :jbd:journal_stop+0x249/0x255
[<ffffffff800ce756>] zone_statistics+0x3e/0x6d
[<ffffffff8000f41e>] __alloc_pages+0x78/0x308
[<ffffffff800eadb4>] sys_mkdirat+0xd1/0xe4
[<ffffffff8004c74a>] sys_mount+0x8a/0xcd
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
Code: 48 8b b8 c8 01 00 00 48 85 ff 74 13 48 83 c4 08 48 8d b5 30
RIP [<ffffffff887421f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
RSP <ffff8107c0bc7ad8>
The issue was also described in |
| Comments |
| Comment by Jian Yu [ 02/Jul/11 ] |
|
The core dump showed that:
[root@localhost ~]# crash /usr/lib/debug/lib/modules/2.6.18-238.12.1.el5_lustre.g6a3d997/vmlinux /mnt/var/crash/2011-07-01-10\:36/vmcore
<~snip~>
WARNING: cannot determine pgdat list for this kernel/architecture
please wait... (gathering kmem slab cache data)
crash: invalid size request: 0 type: "array cache array"
crash: unable to initialize kmem slab cache subsystem
KERNEL: /usr/lib/debug/lib/modules/2.6.18-238.12.1.el5_lustre.g6a3d997/vmlinux
DUMPFILE: /mnt/var/crash/2011-07-01-10:36/vmcore
CPUS: 4
DATE: Fri Jul 1 10:35:51 2011
UPTIME: 00:05:55
LOAD AVERAGE: 1.46, 0.54, 0.21
TASKS: 150
NODENAME: localhost.localdomain
RELEASE: 2.6.18-238.12.1.el5_lustre.g6a3d997
VERSION: #1 SMP Thu Jun 23 12:18:56 PDT 2011
MACHINE: x86_64 (2667 Mhz)
MEMORY: 0
PANIC: ""
PID: 3290
COMMAND: "mkfs.lustre"
TASK: ffff8107d77507a0 [THREAD_INFO: ffff8107c0bc6000]
CPU: 3
STATE: TASK_RUNNING (PANIC)
crash> bt -l 3290
PID: 3290 TASK: ffff8107d77507a0 CPU: 3 COMMAND: "mkfs.lustre"
#0 [ffff8107c0bc7830] crash_kexec at ffffffff800af898
include/asm/system.h: 161
#1 [ffff8107c0bc78f0] __die at ffffffff80065117
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/traps.c: 566
#2 [ffff8107c0bc7930] do_page_fault at ffffffff8006748d
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 582
#3 [ffff8107c0bc7a20] error_exit at ffffffff8005dde9
include/linux/bitops.h: 42
[exception RIP: ldiskfs_clear_inode+129]
RIP: ffffffff887421f1 RSP: ffff8107c0bc7ad8 RFLAGS: 00010296
RAX: 0000000000000000 RBX: ffff8104d1978990 RCX: ffff8107c01b2cc0
RDX: ffff8107c01b2cc0 RSI: ffff8104d1978b98 RDI: ffff8104d1978990
RBP: ffff8104d1978890 R8: ffff810000032600 R9: 7fffffffffffffff
R10: ffff8107c0bc78a8 R11: ffffffff80039e56 R12: ffff8107c0050948
R13: 0000000000000000 R14: ffff8107d908d000 R15: ffffffff88742600
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#4 [ffff8107c0bc7af0] clear_inode at ffffffff8002303b
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/inode.c: 257
#5 [ffff8107c0bc7b00] generic_drop_inode at ffffffff80039f9c
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/inode.c: 1091
#6 [ffff8107c0bc7b20] shrink_dcache_for_umount_subtree at ffffffff800ede72
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/dcache.c: 642
#7 [ffff8107c0bc7b40] shrink_dcache_for_umount at ffffffff800ee40c
include/linux/list.h: 732
#8 [ffff8107c0bc7b50] generic_shutdown_super at ffffffff800e636b
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 243
#9 [ffff8107c0bc7b70] kill_block_super at ffffffff800e647c
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 756
#10 [ffff8107c0bc7b90] deactivate_super at ffffffff800e654a
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 184
#11 [ffff8107c0bc7bb0] get_sb_bdev at ffffffff800e6c6f
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 728
#12 [ffff8107c0bc7c20] vfs_kern_mount at ffffffff800e65f5
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 845
#13 [ffff8107c0bc7c60] do_kern_mount at ffffffff800e66be
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/super.c: 879
#14 [ffff8107c0bc7c90] do_mount at ffffffff800f0fc6
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/namespace.c: 1105
#15 [ffff8107c0bc7f30] sys_mount at ffffffff8004c74a
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/namespace.c: 1600
#16 [ffff8107c0bc7f80] tracesys at ffffffff8005d28d (via system_call)
include/linux/bitops.h: 42
RIP: 00000031e78d4a0a RSP: 00007fff20effc88 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff
RDX: 0000000000407593 RSI: 00007fff20f00d10 RDI: 00007fff20f03d60
RBP: 0000000000613bc0 R8: 00007fff20f01d60 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000060a040
R13: 00007fff20f00d60 R14: 00007fff20f03d60 R15: 0000000000000000
ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
crash> l *0xffffffff887421f1
No source file for address 0xffffffff887421f1.
gdb: gdb request failed: l *0xffffffff887421f1
crash> l *0xffffffff88742272
No source file for address 0xffffffff88742272.
gdb: gdb request failed: l *0xffffffff88742272
crash> l *0xffffffff8002303b
0xffffffff8002303b is in clear_inode (fs/inode.c:257).
252 BUG_ON(inode->i_state & I_CLEAR);
253 wait_on_inode(inode);
254 DQUOT_DROP(inode);
255 if (inode->i_sb && inode->i_sb->s_op->clear_inode)
256 inode->i_sb->s_op->clear_inode(inode);
257 if (S_ISBLK(inode->i_mode) && inode->i_bdev)
258 bd_forget(inode);
259 if (S_ISCHR(inode->i_mode) && inode->i_cdev)
260 cd_forget(inode);
261 inode->i_state = I_CLEAR;
crash>
The stack backtrace did not show which file ldiskfs_clear_inode() was located in, or where inside ldiskfs_clear_inode() (offset +129) the exception occurred. By running gdb, I got:
[root@localhost ~]# gdb /lib/modules/2.6.18-238.12.1.el5_lustre.g6a3d997/updates/kernel/fs/lustre-ldiskfs/ldiskfs.ko
<~snip~>
(gdb) l ldiskfs_clear_inode
814 /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el5/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/super.c: No such file or directory.
in /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el5/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/super.c
(gdb)
I could not get the above super.c file, so I rebuilt from the kernel and Lustre sources and got:
ldiskfs/ldiskfs/super.c:
813 static void destroy_inodecache(void)
814 {
815 kmem_cache_destroy(ldiskfs_inode_cachep);
816 }
817
818 static void ldiskfs_clear_inode(struct inode *inode)
819 {
820 #ifdef CONFIG_LDISKFS_FS_POSIX_ACL
821 if (LDISKFS_I(inode)->i_acl &&
822 LDISKFS_I(inode)->i_acl != LDISKFS_ACL_NOT_CACHED) {
823 posix_acl_release(LDISKFS_I(inode)->i_acl);
824 LDISKFS_I(inode)->i_acl = LDISKFS_ACL_NOT_CACHED;
825 }
826 if (LDISKFS_I(inode)->i_default_acl &&
827 LDISKFS_I(inode)->i_default_acl != LDISKFS_ACL_NOT_CACHED) {
828 posix_acl_release(LDISKFS_I(inode)->i_default_acl);
829 LDISKFS_I(inode)->i_default_acl = LDISKFS_ACL_NOT_CACHED;
830 }
831 #endif
832 ldiskfs_discard_preallocations(inode);
833 if (LDISKFS_JOURNAL(inode))
834 jbd2_journal_release_jbd_inode(LDISKFS_SB(inode->i_sb)->s_journal,
835 &LDISKFS_I(inode)->jinode);
836 }
Andreas, could you please give some suggestions here? I'm not sure how to investigate further to find the exact location of the exception. |
| Comment by Andreas Dilger [ 03/Jul/11 ] |
|
It is possible to use something like:
(gdb) list *(gdb_list_inodes + 123)
to figure out the line number within the function. The function name and offset should be printed in the original oops message. Alternately, it should hopefully be possible to look at the stack trace to see where the code was running before it crashed. |
| Comment by Jian Yu [ 05/Jul/11 ] |
|
Here:
# gdb /lib/modules/2.6.18-238.12.1.el5_lustre/updates/kernel/fs/lustre-ldiskfs/ldiskfs.ko
<~snip~>
Reading symbols from /lib/modules/2.6.18-238.12.1.el5_lustre/updates/kernel/fs/lustre-ldiskfs/ldiskfs.ko...done.
(gdb) l *(ldiskfs_clear_inode+129)
0x291f1 is in ldiskfs_clear_inode (/mnt/src/lustre-release/build/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/super.c:833).
828 posix_acl_release(LDISKFS_I(inode)->i_default_acl);
829 LDISKFS_I(inode)->i_default_acl = LDISKFS_ACL_NOT_CACHED;
830 }
831 #endif
832 ldiskfs_discard_preallocations(inode);
833 if (LDISKFS_JOURNAL(inode))
834 jbd2_journal_release_jbd_inode(LDISKFS_SB(inode->i_sb)->s_journal,
835 &LDISKFS_I(inode)->jinode);
836 }
837
The oops occurred here:
833 if (LDISKFS_JOURNAL(inode))
Here are the definitions of LDISKFS_JOURNAL and LDISKFS_SB:
#define LDISKFS_JOURNAL(inode) (LDISKFS_SB((inode)->i_sb)->s_journal)
static inline struct ldiskfs_sb_info *LDISKFS_SB(struct super_block *sb)
{
return sb->s_fs_info;
}
|
| Comment by Alex Zhuravlev [ 05/Jul/11 ] |
|
Given the following:
LDISKFS-fs: can't allocate buddy meta group
it should be easy to reproduce the problem? It also gives a hint that probably the inode was not |
| Comment by Jian Yu [ 05/Jul/11 ] |
Yes, the memory allocation failure and oops could be easily reproduced while formatting a 128TB OST. The following kmalloc code in fs/ext4/mballoc.c produced the memory allocation failure:
static int ext4_mb_init_backend(struct super_block *sb)
{
//......
sbi->s_group_info = kmalloc(array_size, GFP_KERNEL);
if (sbi->s_group_info == NULL) {
printk(KERN_ERR "EXT4-fs: can't allocate buddy meta group\n");
return -ENOMEM;
}
//......
}
I'm changing the code to fall back to vmalloc in case kmalloc fails to allocate enough memory. |
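The fallback described above might look roughly like the following. This is only a userspace sketch, not the actual patch: mock_kmalloc(), mock_vmalloc(), MOCK_KMALLOC_LIMIT and struct group_info_buf are hypothetical stand-ins so the pattern can be compiled outside the kernel, and the 128KB limit is an arbitrary value chosen for the mock.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical limit for the mock: requests above this "fail" as if no
 * physically contiguous region were available. Not a real kernel constant. */
#define MOCK_KMALLOC_LIMIT (128UL * 1024UL)

/* Userspace stand-in for kmalloc(): refuses large requests. */
static void *mock_kmalloc(size_t size)
{
	if (size > MOCK_KMALLOC_LIMIT)
		return NULL;
	return malloc(size);
}

/* Userspace stand-in for vmalloc(): only needs virtually contiguous memory. */
static void *mock_vmalloc(size_t size)
{
	return malloc(size);
}

/* Remembers which allocator produced the buffer so that the matching
 * free routine can be chosen later. */
struct group_info_buf {
	void *ptr;
	int from_vmalloc;
};

/* Try the cheaper allocator first and fall back only when it fails. */
static int alloc_group_info(struct group_info_buf *buf, size_t array_size)
{
	assert(buf != NULL);
	buf->from_vmalloc = 0;
	buf->ptr = mock_kmalloc(array_size);
	if (buf->ptr == NULL) {
		buf->ptr = mock_vmalloc(array_size);
		buf->from_vmalloc = 1;
	}
	return buf->ptr != NULL ? 0 : -ENOMEM;
}

/* In the kernel this would call vfree() or kfree() depending on
 * from_vmalloc; the mock uses free() for both. */
static void free_group_info(struct group_info_buf *buf)
{
	free(buf->ptr);
	buf->ptr = NULL;
}
```

The point of tracking from_vmalloc is that the two kernel allocators have different free routines, so the cleanup path has to know which one succeeded.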
| Comment by Alex Zhuravlev [ 05/Jul/11 ] |
|
No, no. I'm saying that to understand and fix that oops, you could replace that kmalloc() with just "return -ENOMEM" and catch the oops. |
| Comment by Andreas Dilger [ 05/Jul/11 ] |
|
It should be possible to find and fix this issue just through code inspection. Initially I thought it might be the buddy inode, but that isn't allocated until after the failed kmalloc(), so it shouldn't be the cause of the problem. It is possible to determine whether it is i_sb or s_fs_info that is NULL, by checking which one has an offset of 0x1c8 in the struct, due to the oops message "NULL pointer dereference at 00000000000001c8". I'm looking through this code and have found some other issues:
Even |
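The offset check suggested above can be illustrated with offsetof(). This is a sketch under stated assumptions: the mock_* structs are hypothetical stand-ins padded for illustration (the real kernel structures are much larger and their layouts must be read from the debug symbols), and only the 0x1c8 value comes from the oops message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures. */
struct mock_journal {
	int unused;
};

struct mock_sb_info {
	char pad[0x1c8];                /* padding so s_journal sits at 0x1c8 */
	struct mock_journal *s_journal;
};

struct mock_super_block {
	char pad[0x40];                 /* arbitrary offset, illustration only */
	void *s_fs_info;
};

/* Loading base->field through a NULL base faults at the field's offset,
 * so the faulting address identifies which struct pointer was NULL. */
static int null_sb_info_explains(uintptr_t fault_addr)
{
	return fault_addr == (uintptr_t)offsetof(struct mock_sb_info, s_journal);
}

static int null_super_block_explains(uintptr_t fault_addr)
{
	return fault_addr == (uintptr_t)offsetof(struct mock_super_block, s_fs_info);
}
```

With these mock layouts, a fault at 0x1c8 is explained by a NULL struct mock_sb_info pointer and not by a NULL struct mock_super_block pointer. The same comparison against the real struct offsets is what narrows down the bad pointer.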
| Comment by Jian Yu [ 06/Jul/11 ] |
(gdb) p &((struct inode *)0).i_sb
$11 = (struct super_block **) 0xf8
(gdb) p &((struct super_block *)0).s_fs_info
$12 = (void **) 0x260
(gdb) p &((struct ldiskfs_sb_info *)0).s_journal
$13 = (struct journal_s **) 0x1c8
The faulting address 0x1c8 matches the offset of s_journal in struct ldiskfs_sb_info, so s_fs_info is NULL. In ldiskfs_fill_super():
{
//......
root = ldiskfs_iget(sb, LDISKFS_ROOT_INO);
//......
err = ldiskfs_mb_init(sb, needs_recovery);
if (err) {
ldiskfs_msg(sb, KERN_ERR, "failed to initalize mballoc (%d)",
err);
goto failed_mount4;
}
//......
failed_mount4:
ldiskfs_msg(sb, KERN_ERR, "mount failed");
destroy_workqueue(LDISKFS_SB(sb)->dio_unwritten_wq);
//......
out_fail:
sb->s_fs_info = NULL;
kfree(sbi);
lock_kernel();
return ret;
}
The missing iput of the root inode before "sb->s_fs_info = NULL" caused the crash in ldiskfs_clear_inode(). Thanks to Li Wei for helping investigate this. I'll incorporate this fix along with the other changes. |
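The ordering problem can be shown with a small userspace model. This is only a sketch: the mock_* types and functions are hypothetical, and mock_clear_inode() returns -1 at the point where the real ldiskfs_clear_inode() would dereference a NULL s_fs_info and oops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct mock_sbi {
	void *s_journal;
};

struct mock_sb {
	struct mock_sbi *s_fs_info;
};

struct mock_inode {
	struct mock_sb *i_sb;
	int cleared;
};

/* Same access pattern as ldiskfs_clear_inode(): it reads
 * s_fs_info->s_journal. Returns -1 where the real code would oops. */
static int mock_clear_inode(struct mock_inode *inode)
{
	if (inode->i_sb->s_fs_info == NULL)
		return -1;
	(void)inode->i_sb->s_fs_info->s_journal;
	inode->cleared = 1;
	return 0;
}

/* Models the failed-mount cleanup. put_root_first selects the order:
 * nonzero drops the root inode while s_fs_info is still valid, zero
 * leaves it for the generic unmount path after s_fs_info was cleared. */
static int mock_failed_mount(int put_root_first)
{
	struct mock_sbi sbi = { NULL };
	struct mock_sb sb = { &sbi };
	struct mock_inode root = { &sb, 0 };
	int rc = 0;

	if (put_root_first)
		rc = mock_clear_inode(&root);   /* iput(root) in the error path */

	sb.s_fs_info = NULL;                    /* out_fail: sb->s_fs_info = NULL */

	if (!put_root_first)
		rc = mock_clear_inode(&root);   /* root dropped too late */

	return rc;
}
```

In this model, mock_failed_mount(1) tears the inode down safely, while mock_failed_mount(0) hits the NULL check, which corresponds to the crash seen in the backtrace.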
| Comment by Jian Yu [ 06/Jul/11 ] |
|
In the above patch:
sb->s_root = d_alloc_root(root);
if (!sb->s_root) {
ext4_msg(sb, KERN_ERR, "get root dentry failed");
- iput(root);
ret = -ENOMEM;
goto failed_mount4;
}
@@ -3647,6 +3646,8 @@ cantfind_ext4:
goto failed_mount;
failed_mount4:
+ iput(root);
+ sb->s_root = NULL;
After getting root dentry by running "sb->s_root = d_alloc_root(root);", it was set to NULL directly in the mount failure path. Should d_free() be called to free the dentry here? |
| Comment by Andreas Dilger [ 06/Jul/11 ] |
|
Yu Jian, you are correct. It looks like the dentry is leaked in this failure case. However, it also looks like the upstream kernel has coincidentally fixed the original oops by not dereferencing s_fs_info in ext4_clear_inode() in this case, so the original patch could be reverted. I think the effort to fix this correctly for the older kernels is not worthwhile, because it would mean adding an extra check in ext4_clear_inode() that is virtually always unnecessary. I think we should just use the upstream fix from commit 32a9bb57d7c1fd04ae0f72b8f671501f000a0e9f for our kernel (less risk and effort for us), and work separately to fix the code correctly in the upstream kernel. I've sent an email to that effect, and CC'd you. Can you please verify that with the fix from 32a9bb57d7c1fd04ae0f72b8f671501f000a0e9f there are no longer crashes on mount when ENOMEM is hit? After that it makes sense to add the patch from http://review.whamcloud.com/#change,545 and any other vmalloc-or-kmalloc changes that are needed to mount the filesystem at > 128 TB. Even if we do full testing for 128TB LUNs, doing mount testing with 129TB LUNs ensures that smaller LUNs can still mount in case of memory fragmentation (as was seen here with 128TB LUNs, and could be hit at even smaller sizes). |
| Comment by Jian Yu [ 07/Jul/11 ] |
Sure. Here is the result:
# time mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=10.0.2.15@tcp --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_16tb' /dev/large_vg/ost_lv
mkfs.lustre: Unable to mount /dev/large_vg/ost_lv: Invalid argument
mkfs.lustre FATAL: failed to write local files
Permanent disk data:
Target: largefs-OSTffff
Index: unassigned
Lustre FS: largefs
Mount type: ldiskfs
Flags: 0x72
(OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc,force_over_16tb
Parameters: mgsnode=10.0.2.15@tcp
device size = 134217728MB
formatting backing filesystem ldiskfs on /dev/large_vg/ost_lv
target name largefs-OSTffff
4k blocks 34359738368
options -J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init, -F
mkfs_cmd = mke2fs -j -b 4096 -L largefs-OSTffff -J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init, -F /dev/large_vg/ost_lv 34359738368
mkfs.lustre: exiting with 22 (Invalid argument)
Console log:
Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv =====================
LDISKFS-fs (dm-3): warning: maximal mount count reached, running e2fsck is recommended
LDISKFS-fs: can't allocate buddy meta group
LDISKFS-fs (dm-3): failed to initalize mballoc (-12)
LDISKFS-fs (dm-3): mount failed
No crash occurred. A small issue was that ldiskfs_fill_super() returned the default error number "-EINVAL" instead of the "-ENOMEM" returned from ldiskfs_mb_init(). I'm verifying the vmalloc patch. |
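The error-number issue mentioned above comes down to whether the failure path copies err into ret before jumping to the cleanup label. The following is a minimal userspace sketch of both variants. mock_mb_init() and the two fill_super_* functions are hypothetical and only mirror the control flow, not the real ldiskfs_fill_super().

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for ldiskfs_mb_init() failing with -ENOMEM as in the log. */
static int mock_mb_init(int fail)
{
	return fail ? -ENOMEM : 0;
}

/* Reported behaviour: ret keeps its default value on failure. */
static int fill_super_default_ret(int fail)
{
	int ret = -EINVAL;
	int err = mock_mb_init(fail);

	if (err)
		goto failed_mount;
	return 0;

failed_mount:
	return ret;
}

/* Variant that propagates the callee's error to the caller. */
static int fill_super_propagate(int fail)
{
	int ret = -EINVAL;
	int err = mock_mb_init(fail);

	if (err) {
		ret = err;
		goto failed_mount;
	}
	return 0;

failed_mount:
	return ret;
}
```

With the first variant the caller sees "Invalid argument", as mkfs.lustre did, even though the underlying failure was a memory allocation. The second variant would surface -ENOMEM instead.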
| Comment by Jian Yu [ 07/Jul/11 ] |
The patch for master branch is in http://review.whamcloud.com/1071. |
| Comment by Andreas Dilger [ 07/Jul/11 ] |
|
Can you please also submit a patch with the iput() changes? It should contain the upstream commit hash in the commit comment for future reference. Also, please make a force-over-128tb patch for master. |
| Comment by Jian Yu [ 07/Jul/11 ] |
I did not find get_next_ino() function in kernel linux-2.6.18-238.12.1. The code for getting a buddy cache inode in this kernel version is as follows:
sbi->s_buddy_cache = new_inode(sb);
if (sbi->s_buddy_cache == NULL) {
printk(KERN_ERR "EXT4-fs: can't get new inode\n");
goto err_freesgi;
}
If it's better to use EXT4_BAD_INO for the buddy cache inode number, could you please review whether the following patch is correct to get the inode?
--- ext4.h.orig 2011-07-07 14:50:14.000000000 +0800
+++ ext4.h 2011-07-07 14:52:24.000000000 +0800
static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
{
- return ino == EXT4_ROOT_INO ||
+ return ino == EXT4_BAD_INO ||
+ ino == EXT4_ROOT_INO ||
ino == EXT4_JOURNAL_INO ||
ino == EXT4_RESIZE_INO ||
(ino >= EXT4_FIRST_INO(sb) &&
--- mballoc.c.orig 2011-07-05 19:20:05.000000000 +0800
+++ mballoc.c 2011-07-07 18:34:00.000000000 +0800
- sbi->s_buddy_cache = new_inode(sb);
- if (sbi->s_buddy_cache == NULL) {
- printk(KERN_ERR "EXT4-fs: can't get new inode\n");
+
+ /*
+ * To avoid conflicting with an on-disk inode of the same number,
+ * EXT4_BAD_INO is used here as the number of the buddy cache inode,
+ * which is not hashed in the inode cache, and then would not be found
+ * by iget().
+ */
+ sbi->s_buddy_cache = ext4_iget(sb, EXT4_BAD_INO);
+ if (IS_ERR(sbi->s_buddy_cache)) {
+ printk(KERN_ERR "EXT4-fs: can't get buddy cache inode\n");
+ ret = PTR_ERR(sbi->s_buddy_cache);
+ sbi->s_buddy_cache = NULL;
goto err_freesgi;
}
|
| Comment by Jian Yu [ 07/Jul/11 ] |
It's included in http://review.whamcloud.com/1071.
OK, will do. |
| Comment by Andreas Dilger [ 07/Jul/11 ] |
|
"I did not find get_next_ino() function in kernel linux-2.6.18-238.12.1."
Sorry, I was looking at a newer kernel, where the i_ino assignment was moved out from new_inode() and calls:
sbi->s_buddy_cache->i_ino = get_next_ino();
This was probably done as part of some generic search & replace operation. However, I also don't think it is desirable to use "ext4_iget(EXT4_BAD_INO)" to read the real on-disk inode, since one of the reasons for excluding EXT4_BAD_INO from the "valid" inode range is that it shouldn't be accessed from within the kernel. All I wanted was to make sure that the allocated inode has an inode number that does not collide with a valid on-disk inode number, and EXT4_BAD_INO is a relatively safe choice. I would simply assign that value after new_inode() is finished:
sbi->s_buddy_cache->i_ino = EXT4_BAD_INO; /* avoid potential confusion */
According to Alex, the inode allocated by new_inode() does not actually exist in the inode hash table, so as long as this isn't changed to use ext4_iget() it is safe from being found from another ext4_iget() operation. |
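A small userspace model may help show why tagging the in-memory inode with EXT4_BAD_INO avoids a collision. This is a sketch under assumptions: the MOCK_* values mirror what I understand the ext4 reserved inode numbers to be (bad blocks 1, root 2, resize 7, journal 8), the upper bound of the range check is assumed to be the filesystem's inode count, and first_ino and inodes_count are passed in as plain parameters instead of being read from a superblock.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed reserved inode numbers; check the real ext4 headers. */
#define MOCK_BAD_INO      1UL
#define MOCK_ROOT_INO     2UL
#define MOCK_RESIZE_INO   7UL
#define MOCK_JOURNAL_INO  8UL

/* Hypothetical reduced inode with only the field of interest. */
struct mock_inode {
	unsigned long i_ino;
};

/* Same shape as the unpatched ext4_valid_inum() quoted earlier in the
 * thread, with the superblock-derived limits passed in directly. */
static int mock_valid_inum(unsigned long ino, unsigned long first_ino,
			   unsigned long inodes_count)
{
	return ino == MOCK_ROOT_INO ||
	       ino == MOCK_JOURNAL_INO ||
	       ino == MOCK_RESIZE_INO ||
	       (ino >= first_ino && ino <= inodes_count);
}

/* Tag an already-allocated in-memory inode; nothing is read from disk. */
static void mark_buddy_cache_inode(struct mock_inode *inode)
{
	assert(inode != NULL);
	inode->i_ino = MOCK_BAD_INO;
}
```

Because the bad-blocks inode number is outside the range accepted by the validity check, an inode tagged this way cannot be mistaken for one that a normal lookup would accept.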
| Comment by Jian Yu [ 08/Jul/11 ] |
Patch for master branch: http://review.whamcloud.com/1073. |
| Comment by Build Master (Inactive) [ 08/Jul/11 ] |
|
Integrated in Oleg Drokin : e2d082eb4451488baea54be34410371122adf0d5
|
| Comment by Peter Jones [ 08/Jul/11 ] |
|
Can this issue be marked as resolved now or does some further work still remain outstanding? |
| Comment by Jian Yu [ 11/Jul/11 ] |
A similar patch is also needed for the RHEL6 ldiskfs patch series. I'm working on it. In addition, should I port the RHEL5 series patch in http://review.whamcloud.com/1071 to b1_8? |
| Comment by Jian Yu [ 11/Jul/11 ] |
|
Hello Andreas, include/linux/slab.h:
/*
* The largest kmalloc size supported by the slab allocators is
* 32 megabyte (2^25) or the maximum allocatable page order if that is
* less than 32 MB.
*
* WARNING: Its not easy to increase this value since the allocators have
* to do various tricks to work around compiler limitations in order to
* ensure proper constant folding.
*/
#define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
(MAX_ORDER + PAGE_SHIFT - 1) : 25)
#define KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_HIGH)
#define KMALLOC_MAX_ORDER (KMALLOC_SHIFT_HIGH - PAGE_SHIFT)
The above code was introduced by the following upstream kernel commit: So, 128KB is not an allocation limit for kmalloc() in this kernel. Do I still need to make the kmalloc+vmalloc changes to this kernel, or just the iput+EXT4_BAD_INO changes? |
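The macros quoted above can be evaluated by hand. The following sketch reproduces the same arithmetic as plain functions. The inputs MAX_ORDER = 11 and PAGE_SHIFT = 12 used in the note below are assumptions (common x86_64 values as far as I know), so the actual kernel configuration should be checked before relying on the result.

```c
#include <assert.h>

/* Mirrors KMALLOC_SHIFT_HIGH: the shift is capped at 25 (32 MB). */
static unsigned int kmalloc_shift_high(unsigned int max_order,
				       unsigned int page_shift)
{
	unsigned int shift = max_order + page_shift - 1;

	return shift <= 25 ? shift : 25;
}

/* Mirrors KMALLOC_MAX_SIZE: 1UL << KMALLOC_SHIFT_HIGH. */
static unsigned long kmalloc_max_size(unsigned int max_order,
				      unsigned int page_shift)
{
	return 1UL << kmalloc_shift_high(max_order, page_shift);
}
```

Under those assumed inputs the shift is 11 + 12 - 1 = 22, so the largest kmalloc() request would be 4 MB. That is an upper bound on what can be requested, not a guarantee that a request of that size succeeds on a fragmented system.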
| Comment by Peter Jones [ 11/Jul/11 ] |
|
Yu Jian, we are not likely to extend the largest LUN size beyond 24TB for 1.8.x. Is this fix relevant for <=24TB LUNs? Peter |
| Comment by Jian Yu [ 11/Jul/11 ] |
The patch in http://review.whamcloud.com/1071 is mainly for fixing the out-of-memory issue while formatting >=128TB LUNs. LUNs of <=24TB would not hit this issue, so the patch is not needed on 1.8.x as long as 1.8.x does not support >=128TB LUNs. |
| Comment by Peter Jones [ 11/Jul/11 ] |
|
ok yujian then it sounds like all that remains is the RHEL6 version of the ldiskfs patch. |
| Comment by Andreas Dilger [ 13/Jul/11 ] |
|
I've submitted a version of this patch upstream, and hopefully it will be included in the Linux 3.1 kernel. Even for the RHEL6 kernel the vmalloc() patch is needed, since large kmalloc() requests can, and will, fail due to memory fragmentation even when a large enough kmalloc() is possible. |
| Comment by Jian Yu [ 14/Jul/11 ] |
|
The patch for RHEL6 ldiskfs patch series is in http://review.whamcloud.com/1095. |
| Comment by Build Master (Inactive) [ 21/Jul/11 ] |
|
Integrated in Oleg Drokin : 0081295f9a0095e52aaa3c39d72172be61d93de6
|
| Comment by Peter Jones [ 21/Jul/11 ] |
|
Please reopen if any problems observed running with 128TB LUNs on RHEL6 |