[LU-136] test e2fsprogs-1.42.wc1 against 32TB+ ldiskfs filesystems Created: 17/Mar/11 Updated: 13/Sep/11 Resolved: 01/Sep/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Task | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Bugzilla ID: | 16038 |
| Rank (Obsolete): | 4966 |
| Description |
|
In order for Lustre to use OSTs larger than 16TB, the e2fsprogs "master" branch needs to be tested against such large LUNs. The "master" branch has unreleased modifications that should allow mke2fs, e2fsck, and the other tools to handle LUNs over 16TB, but it has not been heavily tested at this point. Bruce, I believe we previously discussed a test plan for this work, using llverdev and llverfs. Please attach a document or comment here with the details. The testing for 16TB LUNs is documented in https://bugzilla.lustre.org/show_bug.cgi?id=16038. After the local ldiskfs filesystem testing is complete, obdfilter-survey and full Lustre client testing are needed. |
| Comments |
| Comment by Bruce Cassidy (Inactive) [ 24/Mar/11 ] |
|
Hardware for testing: DDN RAID5 disk arrays. Each array is made up of 9 x 2TB disks with a chunk size of 128KiB, and each array has a size of 14.5TB.
Making a large LUN: Since the DDN controller does not support RAID0 or RAID50, the individual RAID5 arrays will have to be combined using software RAID. A create command such as "mdadm --create /dev/md10 --level=0 --chunk=1024 ..." will combine the RAID5 arrays into one large array. The 1MiB chunk size matches the full stripe size of each RAID5 array. The number of arrays to use will depend on availability.
Testing the large LUN:
Verify the raw device with Lustre's device verification tool: "llverdev -vpf {dev}"
Create the filesystem: "script mkfs.log mkfs.lustre --mkfsoptions='-t ext4' --ost --index=0 --mgsnode={mgsnode} {dev}"
Mount the filesystem locally for testing: "mount -t ldiskfs {dev} {mnt}"
Quick verify of the locally mounted ldiskfs filesystem: "script llverfs-vp.ldiskfs.log llverfs -vp {mnt}"
Check the filesystem: "script e2fsck.ldiskfs-vp.log time e2fsck -fn {dev}"
Full verify of the locally mounted ldiskfs filesystem: "script llverfs-vl.ldiskfs.log llverfs -vl {mnt}"
Check the filesystem: "script e2fsck.ldiskfs-vl.log time e2fsck -fn {dev}"
Mount the filesystem as an OST: "mount -t lustre {dev} {mnt}"
Mount the client filesystem: "mount -t lustre {mgsnode}:/lustre /mnt/lustre"
Quick read/write test on the Lustre filesystem: "script llverfs-vp.lustre.log llverfs -vp /mnt/lustre"
Check the filesystem: "script e2fsck.lustre-vp.log time e2fsck -fn {dev}"
Full read/write test on the Lustre filesystem: "llverfs -vl /mnt/lustre"
Check the filesystem: "script e2fsck.lustre-vl.log time e2fsck -fn {dev}" |
| Comment by Andreas Dilger [ 24/Mar/11 ] |
|
Bruce, one thing to watch out for later when you are using llverfs is that it might have a bug that causes it to exit at the end of the write phase, before starting the read phase. There have been a couple of problem reports about this in Bugzilla. The workaround is to start a read-only test with the same parameters and an explicit "timestamp" to match the write test. Even better would be to determine why the write test is exiting and fix it. I'd suggest testing llverfs on a much smaller filesystem while you are waiting for llverdev on the huge device to pass. Cheers, Andreas |
| Comment by Jian Yu [ 03/May/11 ] |
|
Hello Andreas, |
| Comment by Andreas Dilger [ 03/May/11 ] |
|
Please submit patch inspection requests via Gerrit. Cheers, Andreas |
| Comment by Jian Yu [ 04/May/11 ] |
The patches for b1_8 are in http://review.whamcloud.com/487. |
| Comment by Jian Yu [ 05/May/11 ] |
|
Branch: master
Bug 24017 was not reproduced while running llverfs in full mode on a Lustre filesystem with one 400GB OST and a 40GB MGS/MDT:
As per bug 24017 comment #32, the issue was also hit on a 1TB filesystem. However, the server nodes in the Toro cluster which have >1TB of storage are all used by the autotest system, so I'll set up a 1TB Lustre filesystem on the DDN SFA10KE storage system and try to reproduce the issue there. In addition, since the llverdev and llverfs updates have been pushed to b1_8, I will start the testing on b1_8. |
| Comment by Jian Yu [ 10/May/11 ] |
|
Quoted Andreas' comments from
I adjusted the test script accordingly and ran it quickly on a small (8GB) OST to verify the correctness of the script. Andreas, could you please review the following report? If the steps are correct, I'd add the llverdev part before formatting the devices, and run the script on a 1TB LUN and then on >16TB LUNs against the latest Lustre b1_8 branch (after
In addition, I found that the existing Storage Pools (RAID Groups) on the DDN SFA10KE were all configured as RAID5 with 9 x 1863GB SATA disks in each pool, and there were 16 pools exported as 16 Virtual Disks presented to the Virtual Machine. Each VD has a size of 14.5TB. Should I re-create the storage pools as RAID6 with 8+2 disks in each pool, or just use the current RAID5 VDs and create software RAID0 across them to get 29TB and 203TB LUNs separately? |
| Comment by Andreas Dilger [ 10/May/11 ] |
|
I looked at the updated test results. At a minimum you need to run "sync" before running e2fsck, in order to flush the dirty data to disk. It would be even better to unmount the filesystem before running e2fsck, so that we are sure to get a consistent state for the check. Otherwise e2fsck can report false errors; in particular, the free block and inode counts are not reliable for a mounted filesystem.
For large filesystems it looks like it is faster, after the llverfs "full" test, to reformat the filesystem than to remount it and delete all of the test files. You may also want to consider limiting the number of inodes on the filesystem to speed up the mke2fs time. Using "-t ext4 -T largefile" for the OST is fine for this testing - it will create one inode per 1MB of space in the filesystem (the current default is one inode per 16kB of space on the OST). Once
As for the DDN LUN configuration, I wouldn't bother changing it to RAID-6, since that won't affect the outcome of this test but could consume a lot of time to reconfigure. Instead, add the multiple LUNs to an LVM VG and then create an LV of the required size using DM RAID-0, or use MD RAID-0. |
| Comment by Jian Yu [ 11/May/11 ] |
I modified the test script to reformat the OST after the llverfs "full" test on the "ldiskfs" filesystem, and kept the remount-and-delete approach after the llverfs "partial" tests on both the "ldiskfs" and "lustre" filesystems. Here is the new test report: Could you please review it? |
| Comment by Andreas Dilger [ 11/May/11 ] |
|
The test script looks good. The only minor issue is that the "--mkfsoptions -T largefile" is not working as expected (bug in mkfs.lustre). Instead, please use "--mkfsoptions -t ext4 -i 1058576". Also, after mkfs.lustre is run (before full llverfs) it would be good to run "dumpe2fs -h {ostdev}" just to record the exact parameters used for creating the filesystem (ext4 features, layout, etc). I don't think it is necessary to re-run the small test just for this change, or at least you don't need to wait for my review before starting on the larger tests. Please start the full test runs ASAP. If you have 2 nodes that can access the DDN then it would be desirable to run the 24TB and 2xxTB tests in parallel. |
| Comment by Jian Yu [ 11/May/11 ] |
The 192TB and 24TB LUN tests against Lustre b1_8 on CentOS5.6/x86_64 are running in parallel on DDN SFA10KE App Stacks 01 and 04. Testing start time: |
| Comment by Jian Yu [ 12/May/11 ] |
|
Formatting the 192TB OST failed as follows:
===================== format the OST /dev/large_vg/ost_lv =====================
# time mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=192.168.77.1@o2ib --mkfsoptions='-t ext4 -i 1058576' --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_16tb' /dev/large_vg/ost_lv
Permanent disk data:
Target: largefs-OSTffff
Index: unassigned
Lustre FS: largefs
Mount type: ldiskfs
Flags: 0x72
(OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc,force_over_16tb
Parameters: mgsnode=192.168.77.1@o2ib
device size = 201326592MB
2 6 18
formatting backing filesystem ldiskfs on /dev/large_vg/ost_lv
target name largefs-OSTffff
4k blocks 51539607552
options -t ext4 -i 1058576 -J size=400 -I 256 -q -O dir_index,extents,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L largefs-OSTffff -t ext4 -i 1058576 -J size=400 -I 256 -q -O dir_index,extents,uninit_groups -F /dev/large_vg/ost_lv 51539607552
mkfs.lustre: Unable to mount /dev/large_vg/ost_lv: Invalid argument
mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with 22 (Invalid argument)
Dmesg showed:
Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv =====================
LDISKFS-fs (dm-3): not enough memory
Memory status in the system:
# free
total used free shared buffers cached
Mem: 30897304 205012 30692292 0 332 22564
-/+ buffers/cache: 182116 30715188
Swap: 9601016 152 9600864
The current virtual machine has 30GB of memory in total.
# dumpe2fs -h /dev/large_vg/ost_lv
dumpe2fs 1.41.90.wc1 (18-Mar-2011)
Filesystem volume name:   largefs-OSTffff
Last mounted on:          <not available>
Filesystem UUID:          0a8d234a-94b8-4c61-be65-1ffc8a3b9d57
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype extent 64bit flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              201326592
Block count:              51539607552
Reserved block count:     2576980377
Free blocks:              51523063773
Free inodes:              201326581
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         128
Inode blocks per group:   8
Flex block group size:    16
Filesystem created:       Thu May 12 20:30:48 2011
Last mount time:          n/a
Last write time:          Thu May 12 20:33:12 2011
Mount count:              0
Maximum mount count:      0
Last checked:             Thu May 12 20:30:48 2011
Check interval:           0 (<none>)
Lifetime writes:          51 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      dd954dcc-5d06-4ca4-a06f-d32aa07f6d13
Journal backup:           inode blocks
Journal features:         (none)
Journal size:             400M
Journal length:           102400
Journal sequence:         0x00000001
Journal start:            0
Andreas, could you please tell me how to calculate the amount of memory needed to mount such a large device? |
| Comment by Andreas Dilger [ 12/May/11 ] |
|
Yu Jian,
For filesystems larger than 16TB, each group descriptor is 64 bytes in size (struct ext4_group_desc), and with 4kB blocks there is one group descriptor for every 128MB of the filesystem (4096-byte block bitmap * 8 bits/byte * 4096 bytes per block = 128MB). For a 192TB filesystem there are 192TB / 128MB = 1.5M block groups * 64 bytes / 4096 bytes/block = 24576 blocks of group descriptors, and 24576 blocks * 8 bytes per pointer = 192kB for the kmalloc. On 2.6.32 kernels it is possible to kmalloc() up to 4MB, but on older kernels (e.g. RHEL5) it is only possible to kmalloc() up to 128kB. Also, large kmalloc() calls (larger than 16kB) may fail if memory is fragmented.
Could you please make a patch which tries kmalloc() first, but falls back to vmalloc() if that fails. It should set a flag in ext4_sb_info recording whether kmalloc() or vmalloc() was used, so that when s_group_desc is freed it knows whether to call kfree() or vfree() on the memory.
Also, it makes sense to improve this error message, like:
if (sbi->s_group_desc == NULL) {
	printk(KERN_ERR "EXT4-fs: %s: not enough memory for %u groups (%ukB)\n",
	       sb->s_id, sbi->s_groups_count,
	       db_count * sizeof(struct buffer_head *) / 1024);
	goto failed_mount;
}
It should be possible to mount a 128TB filesystem, because it would only try to allocate 128kB of memory. It might be worthwhile to run a "partial" test of llverdev and llverfs at 128TB, so that it can run quickly. It should be possible for you to make a patch relatively quickly, so I don't think it is worthwhile to run the full testing at 128TB; instead, wait for the ldiskfs fix and run the full test at 192TB or larger. |
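A minimal sketch of the requested fallback, as it might look in ext4_fill_super() (the s_group_desc_vmalloc flag name and the exact placement are illustrative assumptions, not the patch that actually landed):
/* Allocate the s_group_desc pointer array, falling back to vmalloc()
 * when the kmalloc() is too large or memory is too fragmented.
 * sbi->s_group_desc_vmalloc is an illustrative flag added to
 * struct ext4_sb_info. */
size_t size = db_count * sizeof(struct buffer_head *);

sbi->s_group_desc = kmalloc(size, GFP_KERNEL);
if (sbi->s_group_desc == NULL) {
	sbi->s_group_desc = vmalloc(size);
	sbi->s_group_desc_vmalloc = 1;
}
if (sbi->s_group_desc == NULL) {
	printk(KERN_ERR "EXT4-fs: %s: not enough memory for %u groups (%zukB)\n",
	       sb->s_id, sbi->s_groups_count, size / 1024);
	goto failed_mount;
}

/* ...and in the teardown path, free with the matching call: */
if (sbi->s_group_desc_vmalloc)
	vfree(sbi->s_group_desc);
else
	kfree(sbi->s_group_desc);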
| Comment by Jian Yu [ 12/May/11 ] |
Yes, I could format and mount a 128TB LUN successfully. Let me start the "partial" test on it.
OK, will do this right away. |
| Comment by Jian Yu [ 13/May/11 ] |
Patch for b1_8 branch: http://review.whamcloud.com/545. The 24TB LUN testing has been running for 32 hours. It's still ongoing. |
| Comment by Jian Yu [ 17/May/11 ] |
|
After running for about 99 hours, the testing against the 24TB LUN on App Stack 04 was interrupted by the App Stack 01 reboot issue caused by mounting a 192TB LUN. The reboot issue is under investigation in http://review.whamcloud.com/545. Here is the test report for the 24TB LUN: The test had reached the "full" llverfs run on the Lustre filesystem. The write operations had finished, and the read operations were about half done (there was roughly 10TB of data left to read). Since the above 24TB LUN testing was performed on kernel 2.6.18-194.17.1.el5 (kernel 2.6.18-238.9.1.el5 was not ready on b1_8 at that time), I'll re-run it on the latest kernel after the reboot issue is fixed. |
| Comment by Andreas Dilger [ 17/May/11 ] |
|
It should be possible to just restart the ldiskfs full llverfs run after the reboot in read mode "-r", using the timestamp printed at the start of the run "-t 1305182040", at the last directory that was being checked "-o 179" after mounting the filesystem: llverfs -vl -r -t 1305182040 -o 179 /mnt/ost1 |
| Comment by Jian Yu [ 17/May/11 ] |
Ah, I forgot this. Thanks for the instructions. I'll run this right away. |
| Comment by Jian Yu [ 18/May/11 ] |
|
The 24TB LUN testing against Lustre b1_8 on CentOS5.6/x86_64 (kernel version: 2.6.18-194.17.1.el5) passed: |
| Comment by Jian Yu [ 18/May/11 ] |
|
Quoted the comments from http://review.whamcloud.com/545 :
I tried to format and mount the 129TB filesystem four times on the same node, and each time the node rebooted while reading a different group descriptor block in block group 0, at:
sbi->s_group_desc[i] = sb_bread(sb, block);
Time 1: i=12050 block=12051
Time 2: i=12045 block=12046
Time 3: i=12056 block=12057
Time 4: i=12044 block=12045
Here is the output of dumpe2fs:
Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0x27cf, unused inodes 117
  Primary superblock at 0, Group descriptors at 1-16512
  Block bitmap at 16513 (+16513), Inode bitmap at 16529 (+16529)
  Inode table at 16545-16552 (+16545)
  16090 free blocks, 117 free inodes, 2 directories, 117 unused inodes
  Free blocks: 16678-32767
  Free inodes: 12-128
So, as we can see, the block number is not large, so that is not the issue.
Right. For <=128TB filesystems, kmalloc() was used to allocate the sbi->s_group_desc array, and for >128TB filesystems, vmalloc() was used. Here are the virtual memory addresses allocated by vmalloc() while mounting the 129TB filesystem:
VMALLOC_START: 0xffffc20000000000
VMALLOC_END: 0xffffe1ffffffffff
&sbi->s_group_desc[12050]: 0xffffc2000063c890
&sbi->s_group_desc[12045]: 0xffffc2000063c868
&sbi->s_group_desc[12056]: 0xffffc2000063c8c0
&sbi->s_group_desc[12044]: 0xffffc2000063c860
For each of the four times I formatted and mounted the 129TB filesystem, the virtual addresses allocated by vmalloc() for the sbi->s_group_desc array were the same. I could not figure out what is wrong here.
I set up an RHEL6.0 App Stack on the DDN SFA10KE appliance with the latest master RHEL6.0/x86_64 server packages. However, I found that the SFA block driver does not support Linux kernels newer than 2.6.28. I've asked Jim Shankland whether there is an updated version of the driver that supports the 2.6.32 kernel. |
| Comment by Andreas Dilger [ 18/May/11 ] |
|
Some different things to check here:
printk(KERN_NOTICE "sbi->s_group_desc = %p-%p (%p)\n",
sbi->s_group_desc, &sbi->s_group_desc[db_count],
(char *)(sbi->s_group_desc) + size);
for (i = 0; i < db_count; i++) {
struct buffer_head *bh;
block = descriptor_loc(sb, logical_sb_block, i);
printk(KERN_NOTICE "i = %u/%u, block = %llu\n", i, db_count, block);
printk(KERN_NOTICE "&sbi->s_group_desc[%u]: %p\n", i, &sbi->s_group_desc[i]);
bh = sb_bread(sb, block);
printk(KERN_NOTICE "bh[%llu] = %p\n", block, bh);
sbi->s_group_desc[i] = bh;
if (!sbi->s_group_desc[i]) {
ext4_msg(sb, KERN_ERR,
"can't read group descriptor %d", i);
db_count = i;
goto failed_mount2;
}
schedule_timeout(HZ/20); /* 10 minutes until crash! */
}
printk(KERN_NOTICE "entering ext4_check_descriptors()\n");
schedule_timeout(5*HZ);
if (!ext4_check_descriptors(sb, &first_not_zeroed)) {
|
| Comment by Andreas Dilger [ 18/May/11 ] |
|
In further discussion with Peter Jones, we would like you to start the testing with 128TB LUNs for RHEL5. We can test 192TB LUNs at a later time, possibly once we get RHEL6 working on the DDN 10000E system.
Please first verify at least one ldiskfs mount with a kernel using only vmalloc() to allocate s_group_desc, to verify that it is not just the vmalloc() memory that is failing at ~96TB (as mentioned in point #3 above). We don't want to find out at some customer site that vmalloc() is causing problems even with smaller filesystems due to a problem with kmalloc() failing on a system with fragmented memory.
Next, please start a test against the 24TB LUN that creates inodes located beyond the 16TB limit. Looking at the previous Maloo test output it appears there are about 25M inodes created on the OST filesystem, and about 50M+ inodes on the MDT filesystem. This should be based on some existing test like mdsrate-create-{small,large}.sh, using 25 directories with 1M files each to ensure that the inodes are being allocated above 16TB.
While the 24TB LUN inode testing is running, can you please also make a new version of the ext4-force_over_16tb-rhel[56].patch, renamed to ext4-force_over_24tb-rhel[56].patch, that has a limit of 24TB ((6ULL << 30) blocks). This can be tested at the end simply by mounting a 24TB filesystem, without the need to re-run the full llverfs/llverdev tests. This should be used for 1.8.6.
Next, I would like you to modify llverfs.c::print_filename() to print out the current read/write performance as described in
The full 128TB testing should be done using the current master (2.1) at this point, with the kmalloc+vmalloc patch you wrote, using the new llverdev tool. This needs corresponding ext4-force_over_128tb-rhel[56].patch files to be created. I estimate that it may take as long as 40 days to complete. |
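For context, a minimal sketch of what such a mount-time size cap looks like (the FORCE_OVER_24TB option name and the message text are assumptions for illustration, not the contents of the actual ext4-force_over_24tb patch):
/* Illustrative only: refuse filesystems larger than 24TB
 * ((6ULL << 30) 4kB blocks) unless the force_over_24tb mount
 * option was given.  The option flag name is an assumption. */
if (ext4_blocks_count(es) > (6ULL << 30) &&
    !test_opt(sb, FORCE_OVER_24TB)) {
	printk(KERN_ERR "EXT4-fs: %s: filesystems larger than 24TB are not "
	       "supported without the force_over_24tb mount option\n",
	       sb->s_id);
	goto failed_mount;
}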
| Comment by Jian Yu [ 19/May/11 ] |
Thanks a lot for the suggestions! After making the VM's serial-console output redirected on a remote telnet connection, I got the exact Oops messages as follows: Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv ===================== LDISKFS-fs (dm-3): warning: maximal mount count reached, running e2fsck is recommended LDISKFS-fs: can't allocate buddy meta group LDISKFS-fs (dm-3): failed to initalize mballoc (-12) LDISKFS-fs (dm-3): mount failed Unable to handle kernel NULL pointer dereference at 00000000000001c8 RIP: [<ffffffff8876a741>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0 PGD 7bd06f067 PUD 7bd2af067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:00.0/irq CPU 0 Modules linked in: ldiskfs(U) jbd2(U) crc16(U) lnet(U) libcfs(U) raid0(U) mlx4_ib(U) ib_ipoib(U) ipoib_helper(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sun rpc(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxg b3i(U) cxgb3(U) 8021q(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hw mon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) floppy(U) 8139too(U) mlx4_en(U) ide_cd(U) tpm_tis(U) 8139cp(U) tpm(U) i2c_p iix4(U) mlx4_core(U) cdrom(U) sfablkdrvr(U) parport_pc(U) mii(U) tpm_bios(U) parport(U) i2c_core(U) pcspkr(U) serio_raw(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_c ache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 3301, comm: mkfs.lustre Tainted: G 2.6.18-238.9.1.el5_lustre.20110509050254 #1 RIP: 0010:[<ffffffff8876a741>] [<ffffffff8876a741>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0 RSP: 0018:ffff8107bd1a3ad8 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffff8106b9fcd558 RCX: ffff81063b37dcc0 RDX: ffff81063b37dcc0 RSI: ffff8106b9fcd770 RDI: ffff8106b9fcd558 RBP: ffff8106b9fcd458 R08: ffff810000032600 R09: 7fffffffffffffff R10: ffff8107bd1a38a8 R11: ffffffff80039e22 R12: ffff8107bd0640d8 R13: 0000000000000000 R14: ffff8107bd036000 R15: ffffffff8876af30 FS: 00002b9ada9816e0(0000) GS:ffffffff80426000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000001c8 CR3: 00000007bd010000 CR4: 00000000000006e0 Process mkfs.lustre (pid: 3301, threadinfo ffff8107bd1a2000, task ffff8107db4cc040) Stack: 7fffffffffffffff ffff8106b9fcd558 ffff81063b37dc00 ffffffff80023011 ffff8106b9fcd558 ffffffff80039f68 0000000000000000 ffff8107bd064078 0000000000000000 ffffffff800edf8f ffff81063b37dc00 ffffffff8878d9a0 Call Trace: [<ffffffff80023011>] clear_inode+0xd2/0x123 [<ffffffff80039f68>] generic_drop_inode+0x146/0x15a [<ffffffff800edf8f>] shrink_dcache_for_umount_subtree+0x1f2/0x21e [<ffffffff800ee3ff>] shrink_dcache_for_umount+0x35/0x43 [<ffffffff800e635b>] generic_shutdown_super+0x1b/0xfb [<ffffffff800e646c>] kill_block_super+0x31/0x45 [<ffffffff800e653a>] deactivate_super+0x6a/0x82 [<ffffffff800e6c5f>] get_sb_bdev+0x121/0x16c [<ffffffff800e65e5>] vfs_kern_mount+0x93/0x11a [<ffffffff800e66ae>] do_kern_mount+0x36/0x4d [<ffffffff800f0fba>] do_mount+0x6a9/0x719 [<ffffffff8002b4d6>] flush_tlb_page+0xac/0xda [<ffffffff8001125a>] do_wp_page+0x3f8/0x91e [<ffffffff88030d09>] :jbd:do_get_write_access+0x4f9/0x530 [<ffffffff80019de2>] 
__getblk+0x25/0x236 [<ffffffff800096d4>] __handle_mm_fault+0xf6b/0x1039 [<ffffffff88030804>] :jbd:journal_stop+0x249/0x255 [<ffffffff800ce751>] zone_statistics+0x3e/0x6d [<ffffffff8000f41d>] __alloc_pages+0x78/0x308 [<ffffffff800eada4>] sys_mkdirat+0xd1/0xe4 [<ffffffff8004c717>] sys_mount+0x8a/0xcd [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 48 8b b8 c8 01 00 00 48 85 ff 74 13 48 83 c4 08 48 8d b5 40 RIP [<ffffffff8876a741>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0 RSP <ffff8107bd1a3ad8> CR2: 00000000000001c8 <0>Kernel panic - not syncing: Fatal exception The panic was caused by the following kmalloc codes in fs/ext4/mballoc.c: static int ext4_mb_init_backend(struct super_block *sb)
{
//......
sbi->s_group_info = kmalloc(array_size, GFP_KERNEL);
if (sbi->s_group_info == NULL) {
printk(KERN_ERR "EXT4-fs: can't allocate buddy meta group\n");
return -ENOMEM;
}
//......
}
I'll make a patch for this and check whether any other code paths have the same issue. The patch will be uploaded to http://review.whamcloud.com/545. |
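The same kmalloc-then-vmalloc fallback used for s_group_desc can be applied to s_group_info as well, for example by factoring it into a small helper pair. A sketch only, with helper names that are assumptions rather than the actual patch:
/* Illustrative helpers -- the names are assumptions, not the actual patch. */
static void *ldiskfs_kvmalloc(size_t size)
{
	void *ptr = kmalloc(size, GFP_KERNEL);

	if (ptr == NULL)
		ptr = vmalloc(size);
	return ptr;
}

static void ldiskfs_kvfree(void *ptr)
{
	/* addresses returned by vmalloc() lie in the vmalloc range */
	if ((unsigned long)ptr >= VMALLOC_START &&
	    (unsigned long)ptr < VMALLOC_END)
		vfree(ptr);
	else
		kfree(ptr);
}

/* ext4_mb_init_backend() would then allocate s_group_info with: */
sbi->s_group_info = ldiskfs_kvmalloc(array_size);
if (sbi->s_group_info == NULL) {
	printk(KERN_ERR "EXT4-fs: can't allocate buddy meta group\n");
	return -ENOMEM;
}
/* ...and free it later with ldiskfs_kvfree(sbi->s_group_info) */
Keeping the fallback in one helper means the super.c and mballoc.c allocation sites cannot drift apart in how they allocate and free these large arrays.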
| Comment by Andreas Dilger [ 19/May/11 ] |
|
It is important to note that while the kmalloc() failure in ext4_mb_init_backend() caused an error, the actual oops was in ldiskfs_clear_inode(), so at some point that should be investigated as well. Also, while fixing up ext4_mb_init_backend(), it appears that the comment for the s_group_info kmalloc() call is incorrect. A 128TB filesystem has 16384 group descriptor blocks (== 128kB pointer array), because the group descriptors for > 16TB filesystems are twice as large. Please fix it up to read: /* A 16TB filesystem with 64-bit pointers requires an 8192 byte
* kmalloc(). Filesystems larger than 2^32 blocks (16TB normally)
* have group descriptors at least twice as large (64 bytes or
* more vs. 32 bytes for traditional ext3 filesystems), so a 128TB
* filesystem needs a 128kB allocation, which may need vmalloc(). */
Please ensure that starting the 24TB inode testing is your highest priority, since this is blocking our 1.8.6.wc release. We can continue to resolve these issues and test 128TB or larger LUNs for 2.1.x while the 24TB testing is running. |
| Comment by Jian Yu [ 19/May/11 ] |
OK, got it. |
| Comment by Jian Yu [ 20/May/11 ] |
|
The 24TB inode testing against Lustre b1_8 on CentOS5.6/x86_64 (kernel version: 2.6.18-238.9.1.el5_lustre.20110509050254) was started at Fri May 20 03:08:49 PDT 2011. The following builds were used:
The test passed at Fri May 20 06:33:34 PDT 2011:
Here is a short summary of the test result after running mdsrate with "--create" option:
# /opt/bin/mpirun -np 25 -machinefile /tmp/mdsrate-create.machines /usr/lib64/lustre/tests/mdsrate --create --verbose --ndirs 25 --dirfmt '/mnt/lustre/mdsrate/dir%d' --nfiles 1000000 --filefmt 'file%%d'
Rate: 2068.64 eff 2069.13 aggr 82.77 avg client creates/sec (total: 25 threads 25000000 creates 25 dirs 1 threads/dir 12085.21 secs)
# lfs df -h /mnt/lustre
UUID                  bytes   Used    Available  Use%  Mounted on
largefs-MDT0000_UUID  224.0G  1.2G    210.0G     1%    /mnt/lustre[MDT:0]
largefs-OST0000_UUID  24.0T   938.0M  22.8T      0%    /mnt/lustre[OST:0]
filesystem summary:   24.0T   938.0M  22.8T      0%    /mnt/lustre
# lfs df -i /mnt/lustre
UUID                  Inodes    IUsed     IFree     IUse%  Mounted on
largefs-MDT0000_UUID  67108864  25000052  42108812  37%    /mnt/lustre[MDT:0]
largefs-OST0000_UUID  25165824  25000087  165737    99%    /mnt/lustre[OST:0]
filesystem summary:   67108864  25000052  42108812  37%    /mnt/lustre |
| Comment by Jian Yu [ 23/May/11 ] |
Patch for b1_8 branch: http://review.whamcloud.com/589. |
| Comment by Build Master (Inactive) [ 24/May/11 ] |
|
Integrated in Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
|
| Comment by Andreas Dilger [ 24/May/11 ] |
|
Yu Jian, |
| Comment by Jian Yu [ 24/May/11 ] |
Done: https://maloo.whamcloud.com/test_sets/5a99a9da-869c-11e0-b4df-52540025f9af |
| Comment by Jian Yu [ 25/May/11 ] |
|
Status: I would continue working on this ticket after Lustre 1.8.6 pre-release/release testing. |
| Comment by Andreas Dilger [ 03/Jun/11 ] |
|
Yu Jian, given how long we expect the testing for this problem to take, would it be possible to start a 128TB test with the current master (2.1 pre) code? I expect the tests will take at least 30 days to complete, and if they are not started now they will likely delay the 2.1 release. Please make a script which includes all of the tests we ran for 24TB (partial tests first, then full tests, including the many-inodes test and a full e2fsck after each test). To keep the test logs consistent it probably makes sense to name the tests "large-LUN-partial", "large-LUN-full", and "large-LUN-inodes" or similar, instead of putting the LUN size in the test name. Once the tests are running they will hopefully not take much of your time, but the loss of elapsed time is hurting us here. |
| Comment by Andreas Dilger [ 03/Jun/11 ] |
|
NB - I believe the problems we saw are related to >128TB only, is that correct? |
| Comment by Jian Yu [ 04/Jun/11 ] |
Right, I could format and mount a 128TB LUN successfully. I'll start the testing against the latest master branch on CentOS5.6/x86_64 (kernel version: 2.6.18-238.9.1.el5) soon. |
| Comment by Jian Yu [ 07/Jun/11 ] |
|
The 128TB LUN partial testing against Lustre master branch on CentOS5.6/x86_64 (kernel version: 2.6.18-238.9.1.el5_lustre.gc66d831) was started at Tue Jun 7 03:13:08 PDT 2011. The following builds were used: Formatting the 128TB LUN failed: |
| Comment by Jian Yu [ 14/Jun/11 ] |
|
The 128TB LUN partial testing against Lustre master branch on CentOS5.6/x86_64 (kernel version: 2.6.18-238.12.1.el5_lustre.g57944e2) was started at Tue Jun 14 00:37:57 PDT 2011. The following builds were used: After running 6223s, the test passed: The 128TB LUN full testing was started at Tue Jun 14 02:58:30 PDT 2011. The patch for |
| Comment by Andreas Dilger [ 22/Jun/11 ] |
|
It is good news that the testing has worked so well (excluding the one unrelated bug). For testing on master, no extra mkfs.lustre options should be needed when formatting the filesystem. This was an oversight in the 1.8.6 testing, because the >16TB support appeared to work OK, but as soon as DDN used mkfs.lustre without specifying any options the format failed. Upon closer inspection, it does seem that mkfs_lustre.c needs to set the "64bit" flag for huge filesystems. I attached a patch to change 996 to fix this problem.
Are you planning to run the inode creation + e2fsck testing that was run previously for the 24TB LUNs? Also, please create a new ext4-force_over_128tb-rhel6.patch file with updated mount options. We also need to find an OSS node with 128TB+ of storage that we can use for RHEL6 kernel/ldiskfs testing, since this cannot be tested within the SFA10000E VM. |
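A rough sketch of the kind of change involved (the variable and buffer names below are illustrative assumptions, not the actual code in change 996): when the target device has more than 2^32 4kB blocks, the "64bit" feature has to be added to the -O list that mkfs.lustre passes to mke2fs.
/* Illustrative sketch only -- not the actual mkfs_lustre.c patch.
 * device_blocks is assumed to hold the 4kB block count of the target,
 * and features is assumed to be a fixed-size char array holding the
 * comma-separated "-O" feature list being built up for mke2fs. */
if (device_blocks > (1ULL << 32)) {
	/* more than 2^32 blocks (>16TB at 4kB) needs 64-bit block numbers */
	strncat(features, ",64bit", sizeof(features) - strlen(features) - 1);
}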
| Comment by Jian Yu [ 22/Jun/11 ] |
Yes, I'll.
OK, got it. |
| Comment by Jian Yu [ 28/Jun/11 ] |
|
After http://review.whamcloud.com/#change,996 was merged into the master branch, I proceeded with the remaining tests on 128TB LUN. However, formatting the 128TB OST caused kernel panic as follows: Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv ===================== LDISKFS-fs (dm-3): warning: maximal mount count reached, running e2fsck is recommended LDISKFS-fs: can't allocate buddy meta group LDISKFS-fs (dm-3): failed to initalize mballoc (-12) LDISKFS-fs (dm-3): mount failed Unable to handle kernel NULL pointer dereference at 00000000000001c8 RIP: [<ffffffff887801f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0 PGD 7c0755067 PUD 7cbdea067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:00.0/irq CPU 3 Modules linked in: ldiskfs(U) jbd2(U) crc16(U) raid0(U) mlx4_ib(U) ib_ipoib(U) ipoib_helper(U) lnet(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxgb3i(U) cxgb3(U) 8021q(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) floppy(U) 8139too(U) mlx4_en(U) tpm_tis(U) parport_pc(U) ide_cd(U) tpm(U) 8139cp(U) mlx4_core(U) i2c_piix4(U) parport(U) sfablkdrvr(U) cdrom(U) mii(U) tpm_bios(U) serio_raw(U) i2c_core(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 3406, comm: mkfs.lustre Tainted: G 2.6.18-238.12.1.el5_lustre.g6a3d997 #1 RIP: 0010:[<ffffffff887801f1>] [<ffffffff887801f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0 RSP: 0000:ffff810431595ad8 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffff8107d06b8a10 RCX: ffff8107d21c90c0 RDX: ffff8107d21c90c0 RSI: ffff8107d06b8c18 RDI: ffff8107d06b8a10 RBP: ffff8107d06b8910 R08: ffff810000032600 R09: 7fffffffffffffff R10: ffff8104315958a8 R11: ffffffff80039e56 R12: ffff8107cf3ec0d8 R13: 0000000000000000 R14: ffff8107c06e3000 R15: ffffffff88780600 FS: 00002b3dd812b6e0(0000) GS:ffff81011bbdb640(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000001c8 CR3: 00000007c0644000 CR4: 00000000000006e0 Process mkfs.lustre (pid: 3406, threadinfo ffff810431594000, task ffff8107dfeb9080) Stack: 7fffffffffffffff ffff8107d06b8a10 ffff8107d21c9000 ffffffff8002303b ffff8107d06b8a10 ffffffff80039f9c 0000000000000000 ffff8107cf3ec078 0000000000000000 ffffffff800ede72 ffff8107d21c9000 ffffffff887a2d00 Call Trace: [<ffffffff8002303b>] clear_inode+0xd2/0x123 [<ffffffff80039f9c>] generic_drop_inode+0x146/0x15a [<ffffffff800ede72>] shrink_dcache_for_umount_subtree+0x1f2/0x21e [<ffffffff800ee40c>] shrink_dcache_for_umount+0x35/0x43 [<ffffffff800e636b>] generic_shutdown_super+0x1b/0xfb [<ffffffff800e647c>] kill_block_super+0x31/0x45 [<ffffffff800e654a>] deactivate_super+0x6a/0x82 [<ffffffff800e6c6f>] get_sb_bdev+0x121/0x16c [<ffffffff800e65f5>] vfs_kern_mount+0x93/0x11a [<ffffffff800e66be>] do_kern_mount+0x36/0x4d [<ffffffff800f0fc6>] do_mount+0x6a9/0x719 [<ffffffff8002b502>] flush_tlb_page+0xac/0xda [<ffffffff8001125b>] do_wp_page+0x3f8/0x91e [<ffffffff88030d09>] :jbd:do_get_write_access+0x4f9/0x530 
[<ffffffff80019de3>] __getblk+0x25/0x236 [<ffffffff800096d4>] __handle_mm_fault+0xf6b/0x1039 [<ffffffff88030804>] :jbd:journal_stop+0x249/0x255 [<ffffffff800ce756>] zone_statistics+0x3e/0x6d [<ffffffff800efd44>] copy_mount_options+0xcc/0x127 [<ffffffff8004c74a>] sys_mount+0x8a/0xcd [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 48 8b b8 c8 01 00 00 48 85 ff 74 13 48 83 c4 08 48 8d b5 30 RIP [<ffffffff887801f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0 RSP <ffff810431595ad8> CR2: 00000000000001c8 <0>Kernel panic - not syncing: Fatal exception The mkfs.lustre command I run was: mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=192.168.77.1@o2ib --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_16tb' /dev/large_vg/ost_lv The panic was the same as what was described in #comment-14649 above. I'll look into ldiskfs_clear_inode() per the above comment #comment-14650. |
| Comment by Jian Yu [ 02/Jul/11 ] |
A new ticket |
| Comment by Jian Yu [ 08/Jul/11 ] |
Patch for master branch: http://review.whamcloud.com/1073. |
| Comment by Build Master (Inactive) [ 08/Jul/11 ] |
|
Integrated in Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
|
| Comment by Jian Yu [ 11/Jul/11 ] |
|
After http://review.whamcloud.com/1071 and http://review.whamcloud.com/1073 were merged into the master branch, I proceeded with the 128TB LUN full testing on CentOS5.6/x86_64 (kernel version: 2.6.18-238.12.1.el5_lustre.g5c1e9f9). The testing was started at Sun Jul 10 23:56:02 PDT 2011. The following builds were used: There were no extra mkfs.lustre options specified when formatting the 128TB OST.
===================== format the OST /dev/large_vg/ost_lv =====================
# time mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=192.168.77.1@o2ib /dev/large_vg/ost_lv
Permanent disk data:
Target: largefs-OSTffff
Index: unassigned
Lustre FS: largefs
Mount type: ldiskfs
Flags: 0x72
(OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.77.1@o2ib
device size = 134217728MB
formatting backing filesystem ldiskfs on /dev/large_vg/ost_lv
target name largefs-OSTffff
4k blocks 34359738368
options -J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init, -F
mkfs_cmd = mke2fs -j -b 4096 -L largefs-OSTffff -J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init, -F /dev/large_vg/ost_lv 34359738368
Writing CONFIGS/mountdata
real 0m44.489s
user 0m6.669s
sys 0m31.087s
|
| Comment by Jian Yu [ 26/Jul/11 ] |
|
After running for about 12385 minutes (206 hours, 8 days), the 128TB Lustre filesystem was successfully filled up by llverfs:
# lfs df -h /mnt/lustre
UUID                  bytes   Used    Available  Use%  Mounted on
largefs-MDT0000_UUID  1.5T    499.3M  1.4T       0%    /mnt/lustre[MDT:0]
largefs-OST0000_UUID  128.0T  121.4T  120.0G     100%  /mnt/lustre[OST:0]
filesystem summary:   128.0T  121.4T  120.0G     100%  /mnt/lustre
# lfs df -i /mnt/lustre
UUID                  Inodes      IUsed  IFree       IUse%  Mounted on
largefs-MDT0000_UUID  1073741824  32099  1073709725  0%     /mnt/lustre[MDT:0]
largefs-OST0000_UUID  134217728   31191  134186537   0%     /mnt/lustre[OST:0]
filesystem summary:   1073741824  32099  1073709725  0%     /mnt/lustre
Now, the read operation is ongoing... |
| Comment by Jian Yu [ 03/Aug/11 ] |
Done. After running for about 21 days in total, the 128TB LUN full testing on CentOS5.6/x86_64 (kernel version: 2.6.18-238.12.1.el5_lustre.g5c1e9f9) passed on Lustre master build v2_0_65_0: The "large-LUN-inodes" testing is going to be started on the latest master branch... |
| Comment by Jian Yu [ 09/Aug/11 ] |
The inode creation testing on the 128TB Lustre filesystem against the master branch on CentOS5.6/x86_64 (kernel version: 2.6.18-238.19.1.el5_lustre.gd4ea36c) was started at Mon Aug 8 22:51:49 PDT 2011. About 134M inodes would be created. The following builds were used:
After running for about 53 hours, the test passed at Thu Aug 11 04:41:09 PDT 2011:
Here is a short summary of the test result after running mdsrate with "--create" option:
# /opt/mpich/bin/mpirun -np 25 -machinefile /tmp/mdsrate-create.machines /usr/lib64/lustre/tests/mdsrate --create --verbose --ndirs 25 --dirfmt '/mnt/lustre/mdsrate/dir%d' --nfiles 5360000 --filefmt 'file%%d'
Rate: 694.17 eff 694.18 aggr 27.77 avg client creates/sec (total: 25 threads 134000000 creates 25 dirs 1 threads/dir 193035.50 secs)
# lfs df -h /mnt/lustre
UUID                  bytes   Used   Available  Use%  Mounted on
largefs-MDT0000_UUID  1.5T    13.6G  1.4T       1%    /mnt/lustre[MDT:0]
largefs-OST0000_UUID  128.0T  3.6G   121.6T     0%    /mnt/lustre[OST:0]
filesystem summary:   128.0T  3.6G   121.6T     0%    /mnt/lustre
# lfs df -i /mnt/lustre
UUID                  Inodes      IUsed      IFree      IUse%  Mounted on
largefs-MDT0000_UUID  1073741824  134000062  939741762  12%    /mnt/lustre[MDT:0]
largefs-OST0000_UUID  134217728   134006837  210891     100%   /mnt/lustre[OST:0]
filesystem summary:   1073741824  134000062  939741762  12%    /mnt/lustre |
| Comment by Jian Yu [ 15/Aug/11 ] |
The test log did not show up in the above Maloo report. Please find it in the attachment - large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.log. |
| Comment by Andreas Dilger [ 15/Aug/11 ] |
|
Yu Jian, I looked through the inodes run, but I didn't see it running e2fsck on the large LUN? That should be added as part of the test script if it isn't there today. If the LUN with the 135M files still exists, can you please start an e2fsck on both the MDS and the OST. |
| Comment by Jian Yu [ 15/Aug/11 ] |
Sorry for the confusion, Andreas. The e2fsck part is in the test script. While running e2fsck on the OST after creating the 134M files, the following errors occurred on the virtual disks presented to the virtual machine:
--------8<--------
kernel: janusdrvr: WARNING: cpCompleteIoReq(): Req Context ID 0x0 completed with error status 0x7
kernel: end_request: I/O error, dev sfa0066, sector 0
kernel: Buffer I/O error on device sfa0066, logical block 0
kernel: janusdrvr: WARNING: cpCompleteIoReq(): Req Context ID 0x1 completed with error status 0x7
kernel: end_request: I/O error, dev sfa0066, sector 0
kernel: Buffer I/O error on device sfa0066, logical block 0
--------8<--------
The same issue also occurred on disks presented to other virtual machines, and then all of the disks became invisible. I tried rebooting the virtual machine and re-loading the disk driver, but that did not work. I think it is a hardware issue, so I removed the incomplete e2fsck part from the test result and just uploaded the completed inode creation part. After the issue is resolved, I'll complete the e2fsck part. |
| Comment by Jian Yu [ 19/Aug/11 ] |
OK, the issue is now resolved. The testing was restarted on the following master build:
Lustre build: http://newbuild.whamcloud.com/job/lustre-master/263/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
After running for about 120 hours, the inode creation and e2fsck tests passed on the 128TB Lustre filesystem. |
| Comment by Andreas Dilger [ 01/Sep/11 ] |
|
For the 1.41.90.wc4 e2fsprogs I've cherry-picked a couple of recent 64-bit fixes from upstream:
commit bc526c65d2a4cf0c6c04e9ed4837d6dd7dbbf2b3
libext2fs: fix 64-bit support in ext2fs_bmap2()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
commit 24404aa340b274e077b2551fa7bdde5122d3eb43
libext2fs: fix 64-bit support in ext2fs_{read,write}_inode_full()
This fixes a problem where reading or writing inodes located after the
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The first one is unlikely to affect most uses, but may hit in rare cases. I don't think there is anything left to do for this bug, so it can be closed. |
| Comment by Andreas Dilger [ 09/Sep/11 ] |
Yu Jian, I'm looking at the log file, and found some strange results.
Firstly, do you know why none of the large-LUN-inodes test results in Maloo include the test logs? That makes it hard to look at the results in the future if there is reason to do so. I wanted to see the e2fsck times for the many-inodes runs, but only have the one test result above to look at. Could you please file a separate TT- bug to fix whatever problem is preventing the logs for this test from being sent to Maloo.
Looking at the above log, it seems that the MDT (with 25 dirs of 5M files each) took only 7 minutes to run e2fsck, while the OST (with 32 dirs of 4M files each) took 3500 minutes (58 hours) to run. That doesn't make sense, and I wanted to compare this to the most recent large-LUN-inodes test result, which took 20h less time to run. Are the MDT and OST e2fsck runs in the same VM on the SFA10k, or is the MDT on a separate MDS node? |
| Comment by Jian Yu [ 13/Sep/11 ] |
I have no idea about this issue. The syslog shows up, but not the suite log or the test log. I just created TT-180 to ask John for help.
The MDT and OST are in the same VM. Until TT-180 is fixed, please see the attached large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.build273.log file for the output of the inode creation + e2fsck test on the following builds: Lustre build: http://newbuild.whamcloud.com/job/lustre-master/273/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/ |
| Comment by Jian Yu [ 13/Sep/11 ] |
TT-180 was just fixed. |