[LU-136] test e2fsprogs-1.42.wc1 against 32TB+ ldiskfs filesystems Created: 17/Mar/11  Updated: 13/Sep/11  Resolved: 01/Sep/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: Lustre 2.1.0

Type: Task Priority: Major
Reporter: Andreas Dilger Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File 128TB_partial.log     Text File large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.build263.log     Text File large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.build273.log     Text File large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.log     File llverdev_b1_8_master.diff     File llverfs_b1_8_master.diff    
Bugzilla ID: 16038
Rank (Obsolete): 4966

 Description   

In order for Lustre to use OSTs larger than 16TB, the e2fsprogs "master" branch needs to be tested against such large LUNs. The "master" branch has unreleased modifications that should allow mke2fs, e2fsck, and other tools to use LUNs over 16TB, but it has not been heavily tested at this point.

Bruce, I believe we previously discussed a test plan for this work, using llverdev and llverfs. Please attach a document or comment here with details. The testing for 16TB LUNs is documented in https://bugzilla.lustre.org/show_bug.cgi?id=16038.

After the local ldiskfs filesystem testing is complete, obdfilter-survey and full Lustre client testing will be needed.



 Comments   
Comment by Bruce Cassidy (Inactive) [ 24/Mar/11 ]

Hardware for testing: DDN RAID5 disk arrays. Each array is made up of nine 2TB disks with a chunk size of 128KiB. Each array has a size of 14.5TB.

Making a large LUN: Since the DDN controller does not support RAID0 or RAID50, the individual RAID5 arrays will have to be combined using software RAID. A create command such as "mdadm --create /dev/md10 --level=0 --chunk=1024 ..." will combine the RAID5 arrays into one large RAID0 array. The chunk size of 1MiB will match the full stripe size of each RAID5 array (8 data disks x 128KiB chunk). The number of arrays to use will depend on availability.

Testing the large LUN: The LUN can be tested using Lustre's verify device tool: "llverdev -vpf {dev}".

Create the filesystem: "script mkfs.log mkfs.lustre --mkfsoptions='-T ext4' --ost --index=0 --mgsnode={mgsnode} {dev}"

Mount filesystem locally for testing: "mount -t ldiskfs {dev} {mnt}"

Quick verify locally mounted ldiskfs filesystem: "script llverfs-vp.ldiskfs.log llverfs -vp {mnt}"

Check filesystem: "script e2fsck.ldiskfs-vp.log time e2fsck -fn {dev}"

Full verify locally mounted ldiskfs filesystem: "script llverfs-vl.ldiskfs.log llverfs -vl {mnt}"

Check filesystem: "script e2fsck.ldiskfs-vl.log time e2fsck -fn {dev}"

Mount filesystem as OST: "mount -t lustre {dev} {mnt}"
Mount client filesystem: "mount -t lustre {mgsnode}:/lustre /mnt/lustre"

Quick read/write test on lustre filesystem: "script llverfs-vp.lustre.log llverfs -vp /mnt/lustre"

Check filesystem: "script e2fsck.lustre-vp.log time e2fsck -fn {dev}"

Full read/write test on lustre filesystem: "llverfs -vl /mnt/lustre"

Check filesystem: "script e2fsck.lustre-vl.log time e2fsck -fn {dev}"

Comment by Andreas Dilger [ 24/Mar/11 ]

Bruce, one thing to watch out for later when you are using llverfs is that it may have a bug that causes it to exit at the end of the write phase, before starting the read phase. There were a couple of problem reports in Bugzilla about this.

The workaround is to start a read-only test with the same parameters and an explicit "timestamp" to match the write test. Even better would be to determine why the write test is exiting and fix it. I'd suggest testing llverfs on a much smaller filesystem while you are waiting for llverdev on the huge device to pass.

Cheers, Andreas

Comment by Jian Yu [ 03/May/11 ]

Hello Andreas,
As per Peter's suggestion, I'm going to start the >16TB LUN testing on b1_8 first. After looking into the llverdev.c and llverfs.c files in the b1_8 branch, I found that both of them are somewhat out of date compared with the ones in the master branch. The diff files are attached. May I port the changes to the b1_8 branch?

Comment by Andreas Dilger [ 03/May/11 ]

Please submit patch inspection requests via Gerrit.

Cheers, Andreas

Comment by Jian Yu [ 04/May/11 ]

Please submit patch inspection requests via Gerrit.

The patches for b1_8 are in http://review.whamcloud.com/487.
They passed reviews and were pushed to b1_8 branch in fs/lustre-release repo.

Comment by Jian Yu [ 05/May/11 ]

Branch: master
Build: http://newbuild.whamcloud.com/job/lustre-reviews/317/
Distro/Arch: CentOS5.5/x86_64 (kernel 2.6.18-194.17.1.el5)
e2fsprogs version: 1.41.14.wc1
Network: IB (in-kernel OFED)
Test Node: fat-intel-4

Bug 24017 was not reproduced while running llverfs in full mode on a Lustre filesystem with one 400GB OST and a 40GB MGS/MDT:
https://maloo.whamcloud.com/test_sets/8429ad84-76c2-11e0-a1b3-52540025f9af

As per bug 24017 comment #32, the issue was also hit on a 1TB filesystem. However, the server nodes in the Toro cluster which have >1TB storage are all used by the autotest system. So, I'd set up a 1TB Lustre filesystem on the DDN SFA10KE storage system and try to reproduce the issue there. In addition, since the updates of llverdev and llverfs have been pushed to b1_8, I will start the testing on b1_8.

Comment by Jian Yu [ 10/May/11 ]

Quoted Andreas' comments from LU-297:

I noticed something incorrect in the test script - after the partial llverfs run on ldiskfs the test files were deleted before the e2fsck was run. This makes the e2fsck check much less valuable, because it hides any kernel or e2fsck bugs related to how the files are allocated on disk. The e2fsck should be run after each llverfs test is finished, but before the test files are deleted. There is currently no e2fsck check at all after the partial llverfs run on Lustre, which needs to be added.

Also, for proper problem isolation there should be a full run of llverfs on the ldiskfs-mounted OST filesystem after the partial ldiskfs test, because I suspect there may be some bugs in ext4 or the ldiskfs patches themselves, even before we start testing obdfilter running on large filesystems. Again, there should be a full e2fsck run after llverfs is run.

I adjusted the test script accordingly and ran it quickly on a small (8GB) OST to verify the correctness of the script. Andreas, could you please review the following report?
https://maloo.whamcloud.com/test_sets/8938513c-7aef-11e0-b5bf-52540025f9af

If the steps are correct, then I'll add the llverdev part before formatting the devices, and run the script on a 1TB LUN and then on >16TB LUNs against the latest Lustre b1_8 branch (after LU-302 is fixed).

In addition, I found the existing Storage Pools (RAID Groups) on the DDN SFA10KE were all configured as RAID5 with nine 1863GB SATA disks in each pool, and there were 16 pools exported as 16 Virtual Disks presented to the Virtual Machine. Each VD has a size of 14.5TB. Should I re-create the storage pools as RAID6 with 8+2 disks in each pool? Or just use the current RAID5 VDs and create software RAID0 across them to get separate 29TB and 203TB LUNs?

Comment by Andreas Dilger [ 10/May/11 ]

I looked at the updated test results, and at a minimum you need to run "sync" before running e2fsck in order to flush the dirty data to disk. It would be even better to unmount the filesystem before running e2fsck, so that we are sure to get a consistent state to run the check. Otherwise, there can be false errors reported by e2fsck, in particular the free blocks and inode counts are not reliable for a mounted filesystem.

For large filesystems it looks like it is faster, after the llverfs "full" test, to reformat the filesystem than to remount the filesystem and delete all of the test files. You may also want to consider limiting the number of inodes on the filesystem to speed up the mke2fs time. Using "-t ext4 -T largefile" for the OST is fine for this testing - it will create one inode per 1MB of space in the filesystem (the current default is one inode per 16kB of space on the OST). Once LU-255 is landed this will be the new default.
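
As a rough illustration of that inode ratio (simple arithmetic only, not code taken from mke2fs):

#include <stdio.h>

/* Illustrative arithmetic: one inode per 1MiB of space on a 24TiB OST.
 * The result (25165824) is in line with the OST inode count shown by
 * "lfs df -i" later in this ticket. */
int main(void)
{
        unsigned long long ost_bytes = 24ULL << 40;        /* 24TiB */
        unsigned long long bytes_per_inode = 1ULL << 20;   /* 1MiB */

        printf("inodes = %llu\n", ost_bytes / bytes_per_inode);
        return 0;
}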

As for the DDN LUN configuration, I wouldn't bother changing it to RAID-6, since that won't affect the outcome of this test, but can consume a lot of time to reconfigure. Instead, add the multiple LUNs to an LVM VG and then create an LV of the required size using DM RAID-0, or use MD RAID-0.

Comment by Jian Yu [ 11/May/11 ]

For large filesystems it looks like it is faster, after the llverfs "full" test, to reformat the filesystem than to remount the filesystem and delete all of the test files.

I modified the test script to reformat the OST after the llverfs "full" test on the "ldiskfs" filesystem, and kept the remount-and-delete way after the llverfs "partial" tests on both the "ldiskfs" and "lustre" filesystems. Here is the new test report:
https://maloo.whamcloud.com/test_sets/09405958-7bb6-11e0-b5bf-52540025f9af

Could you please review it?

Comment by Andreas Dilger [ 11/May/11 ]

The test script looks good. The only minor issue is that the "--mkfsoptions -T largefile" is not working as expected (bug in mkfs.lustre).

Instead, please use "--mkfsoptions -t ext4 -i 1058576". Also, after mkfs.lustre is run (before full llverfs) it would be good to run "dumpe2fs -h {ostdev}" just to record the exact parameters used for creating the filesystem (ext4 features, layout, etc).

I don't think it is necessary to re-run the small test just for this change, or at least you don't need to wait for my review before starting on the larger tests. Please start the full test runs ASAP. If you have 2 nodes that can access the DDN then it would be desirable to run the 24TB and 2xxTB tests in parallel.

Comment by Jian Yu [ 11/May/11 ]

If you have 2 nodes that can access the DDN then it would be desirable to run the 24TB and 2xxTB tests in parallel.

The 192TB and 24TB LUN tests against Lustre b1_8 on CentOS5.6/x86_64 are being run in parallel on DDN SFA10KE App Stacks 01 and 04.

Testing start time:
24TB LUN: Wed May 11 22:49:35 PDT 2011
192TB LUN: Thu May 12 01:07:17 PDT 2011

Comment by Jian Yu [ 12/May/11 ]

Formatting the 192TB OST failed as follows:

===================== format the OST /dev/large_vg/ost_lv =====================
# time mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=192.168.77.1@o2ib --mkfsoptions='-t ext4 -i 1058576' --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_16tb' /dev/large_vg/ost_lv

   Permanent disk data:
Target:     largefs-OSTffff
Index:      unassigned
Lustre FS:  largefs
Mount type: ldiskfs
Flags:      0x72
              (OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc,force_over_16tb
Parameters: mgsnode=192.168.77.1@o2ib

device size = 201326592MB
2 6 18
formatting backing filesystem ldiskfs on /dev/large_vg/ost_lv
        target name  largefs-OSTffff
        4k blocks     51539607552
        options       -t ext4 -i 1058576 -J size=400 -I 256 -q -O dir_index,extents,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L largefs-OSTffff -t ext4 -i 1058576 -J size=400 -I 256 -q -O dir_index,extents,uninit_groups -F /dev/large_vg/ost_lv 51539607552
mkfs.lustre: Unable to mount /dev/large_vg/ost_lv: Invalid argument

mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with 22 (Invalid argument)

Dmesg showed that:

Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv =====================
LDISKFS-fs (dm-3): not enough memory

Memory status in the system:

# free
             total       used       free     shared    buffers     cached
Mem:      30897304     205012   30692292          0        332      22564
-/+ buffers/cache:     182116   30715188
Swap:      9601016        152    9600864

The current virtual machine has 30GB of memory in total.
The /dev/large_vg/ost_lv device was formatted successfully, but mounting it failed due to the "not enough memory" issue. Here is the output of dumpe2fs:

# dumpe2fs -h /dev/large_vg/ost_lv
dumpe2fs 1.41.90.wc1 (18-Mar-2011)
Filesystem volume name:   largefs-OSTffff
Last mounted on:          <not available>
Filesystem UUID:          0a8d234a-94b8-4c61-be65-1ffc8a3b9d57
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype extent 64bit flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              201326592
Block count:              51539607552
Reserved block count:     2576980377
Free blocks:              51523063773
Free inodes:              201326581
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         128
Inode blocks per group:   8
Flex block group size:    16
Filesystem created:       Thu May 12 20:30:48 2011
Last mount time:          n/a
Last write time:          Thu May 12 20:33:12 2011
Mount count:              0
Maximum mount count:      0
Last checked:             Thu May 12 20:30:48 2011
Check interval:           0 (<none>)
Lifetime writes:          51 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      dd954dcc-5d06-4ca4-a06f-d32aa07f6d13
Journal backup:           inode blocks
Journal features:         (none)
Journal size:             400M
Journal length:           102400
Journal sequence:         0x00000001
Journal start:            0

Andreas, could you please tell me how to calculate the amount of memory needed to mount such a large device?

Comment by Andreas Dilger [ 12/May/11 ]

Yu Jian,
the ext4 mount code only prints the "not enough memory" error message (a very useless message indeed) in one location - when it is allocating the array of pointers for the group descriptor buffers (s_group_desc).

For filesystems larger than 16TB, each group descriptor is 64 bytes in size (struct ext4_group_desc), and for 4kB blocks there is a group descriptor for each 128MB of the filesystem (4096 byte block bitmap * 8 bits/byte * 4096 byte block/bit = 128MB).

For a 192TB filesystem there are 192TB / 128MB = 1.5M block groups * 64 bytes / 4096 bytes/block = 24576 blocks of group descriptors. 24576 blocks * 8 bytes per pointer = 192kB for the kmalloc. On 2.6.32 kernels it is possible to do kmalloc() up to 4MB, but on older kernels (e.g. RHEL5) it is only possible to kmalloc() up to 128kB. Also, large kmalloc() calls (larger than 16kB) may fail if memory is fragmented.
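
For reference, a small userspace sketch that reproduces the arithmetic above (an illustration only, not kernel code):

#include <stdio.h>

/* Reproduce the s_group_desc sizing arithmetic: one block group per
 * 128MB, 64-byte group descriptors (>16TB filesystems), 4kB blocks,
 * and one 8-byte buffer_head pointer per group descriptor block. */
int main(void)
{
        unsigned long long fs_bytes    = 192ULL << 40;  /* 192TB */
        unsigned long long group_bytes = 128ULL << 20;  /* 128MB per group */
        unsigned long long groups      = fs_bytes / group_bytes;
        unsigned long long desc_blocks = groups * 64 / 4096;
        unsigned long long ptr_array   = desc_blocks * 8;

        printf("groups=%llu desc_blocks=%llu s_group_desc=%llukB\n",
               groups, desc_blocks, ptr_array / 1024);
        return 0;
}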

Could you please make a patch which will try kmalloc() first, but if that fails will use vmalloc() to allocate the memory. It should set a flag in ext4_sb_info whether kmalloc() or vmalloc() was used, so that when s_group_desc is freed it knows whether to call kfree() or vfree() on the memory. Also, it makes sense to fix this error message to be better, like:

if (sbi->s_group_desc == NULL) {
        printk(KERN_ERR "EXT4-fs: %s: not enough memory for %u groups (%ukB)\n",
               sb->s_id, sbi->s_groups_count,
               db_count * sizeof(struct buffer_head *) / 1024);
        goto failed_mount;
}
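
For illustration, a minimal sketch of the requested fallback (an assumption of how such a patch could look, not the actual change; "s_group_desc_vmalloc" is a hypothetical flag added to ext4_sb_info):

/* Try kmalloc() first; fall back to vmalloc() for very large pointer
 * arrays, and remember which allocator was used so the unmount/error
 * paths free the array with the matching function. */
size = db_count * sizeof(struct buffer_head *);
sbi->s_group_desc_vmalloc = 0;
sbi->s_group_desc = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (sbi->s_group_desc == NULL) {
        sbi->s_group_desc = vmalloc(size);
        sbi->s_group_desc_vmalloc = 1;
}
if (sbi->s_group_desc == NULL)
        goto failed_mount;  /* print the "not enough memory" message suggested above */

/* ...and when the superblock is torn down: */
if (sbi->s_group_desc_vmalloc)
        vfree(sbi->s_group_desc);
else
        kfree(sbi->s_group_desc);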

It should be possible to mount a filesystem of 128TB, because it would only try to allocate 128kB of memory. It might be worthwhile to run a "partial" test of llverdev and llverfs at 128TB, so that it can run quickly.

It should be possible for you to make a patch relatively quickly, so I don't think it is worthwhile to run the full testing at 128TB, but instead wait for fixing this in ldiskfs and run the full test at 192TB or larger.

Comment by Jian Yu [ 12/May/11 ]

It should be possible to mount a filesystem of 128TB, because it would only try to allocate 128kB of memory. It might be worthwhile to run a "partial" test of llverdev and llverfs at 128TB, so that it can run quickly.

Yes, I could format and mount a 128TB LUN successfully. Let me start the "partial" test on it.

It should be possible for you to make a patch relatively quickly, so I don't think it is worthwhile to run the full testing at 128TB, but instead wait for fixing this in ldiskfs and run the full test at 192TB or larger.

OK, will do this right away.

Comment by Jian Yu [ 13/May/11 ]

OK, will do this right away.

Patch for b1_8 branch: http://review.whamcloud.com/545.

The 24TB LUN testing has been running for 32 hours. It's still ongoing.
The 128TB LUN "partial" testing has been completed (test duration: 8304s). The test output is attached. (I somehow failed to upload the result to Maloo.)

Comment by Jian Yu [ 17/May/11 ]

After running for about 99 hours, the testing against the 24TB LUN on App Stack 04 was interrupted by the App Stack 01 reboot issue caused by mounting a 192TB LUN.

The reboot issue is under investigation in http://review.whamcloud.com/545.

And here is the test report for 24TB LUN:
https://maloo.whamcloud.com/test_sets/c64483ce-7fb9-11e0-b5bf-52540025f9af

The test had reached the "full" llverfs run on the Lustre filesystem. The write operations had finished, and the read operations were about half done (roughly 10TB of data remained to be read).

Since the above 24TB LUN testing was performed on kernel 2.6.18-194.17.1.el5 (kernel 2.6.18-238.9.1.el5 was not ready on b1_8 at that time), I'd re-run it on the latest kernel after the reboot issue is fixed.

Comment by Andreas Dilger [ 17/May/11 ]

It should be possible to just restart the ldiskfs full llverfs run after the reboot in read mode "-r", using the timestamp printed at the start of the run "-t 1305182040", at the last directory that was being checked "-o 179" after mounting the filesystem:

llverfs -vl -r -t 1305182040 -o 179 /mnt/ost1

Comment by Jian Yu [ 17/May/11 ]

It should be possible to just restart the ldiskfs full llverfs run after the reboot in read mode "-r", using the timestamp printed at the start of the run "-t 1305182040", at the last directory that was being checked "-o 179" after mounting the filesystem:

llverfs -vl -r -t 1305182040 -o 179 /mnt/ost1

Ah, I forgot this. Thanks for the instructions. I'll run this right away.

Comment by Jian Yu [ 18/May/11 ]

The 24TB LUN testing against Lustre b1_8 on CentOS5.6/x86_64 (kernel version: 2.6.18-194.17.1.el5) passed:
https://maloo.whamcloud.com/test_sets/5faed404-816f-11e0-b4df-52540025f9af

Comment by Jian Yu [ 18/May/11 ]

Quoted the comments from http://review.whamcloud.com/545 :

Please look at the output of "dumpe2fs" for both the 128TB and 129TB filesystems, to check what the block offset is for the group numbers that are causing problems. The group number is likely (block_nr * (4096 / 64)). There are 1M groups in a 128TB filesystem.

There shouldn't be any problem from reading blocks over 16384 for the 129TB filesystem,

I tried to format and mount the 129TB filesystem four times on the same node, and each time the node rebooted while reading a different group descriptor block in block group 0:

sbi->s_group_desc[i] = sb_bread(sb, block);

Time 1: i=12050 block=12051
Time 2: i=12045 block=12046
Time 3: i=12056 block=12057
Time 4: i=12044 block=12045

Here is the output of dumpe2fs:

Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0x27cf, unused inodes 117
  Primary superblock at 0, Group descriptors at 1-16512
  Block bitmap at 16513 (+16513), Inode bitmap at 16529 (+16529)
  Inode table at 16545-16552 (+16545)
  16090 free blocks, 117 free inodes, 2 directories, 117 unused inodes
  Free blocks: 16678-32767
  Free inodes: 12-128

So, as we can see, the block numbers being read are not large, so that is not the issue.

so I guess the problem is probably in the allocation of the s_group_desc array, because it will likely be using kmalloc() for <= 128TB filesystems and vmalloc() for > 128TB filesystems.

I can't imagine why this code is actually failing, but it is probably related to the access of vmalloc() memory above 128TB.

Right. For <=128TB filesystems, kmalloc() was used to allocate memory for the sbi->s_group_desc array, and for >128TB filesystems, vmalloc() was used.

Here are the virtual memory addresses allocated by vmalloc() while mounting the 129TB filesystem:

VMALLOC_START: 0xffffc20000000000
VMALLOC_END:   0xffffe1ffffffffff

&sbi->s_group_desc[12050]: 0xffffc2000063c890
&sbi->s_group_desc[12045]: 0xffffc2000063c868
&sbi->s_group_desc[12056]: 0xffffc2000063c8c0
&sbi->s_group_desc[12044]: 0xffffc2000063c860

For all four times I formatted and mounted the 129TB filesystem, the virtual addresses allocated by vmalloc() for the sbi->s_group_desc array were the same.

I could not figure out what is wrong here.

Is it possible to test this on RHEL6? That kernel should allow larger kmalloc() allocations above 128kB.

I set up an RHEL6.0 App Stack on the DDN SFA10KE appliance with the latest master RHEL6.0/x86_64 server packages. However, I found that the SFA block driver does not support Linux kernels newer than 2.6.28. I've asked Jim Shankland for help to see whether there is an updated version of the driver that supports the 2.6.32 kernel.

Comment by Andreas Dilger [ 18/May/11 ]

Some different things to check here:

  • the console/serial log on the OSS node will hopefully provide an Oops
    message that tells us what went wrong. If you don't have a serial
    console configured for the VM then this needs to be done. The kernel
    command-line needs something like "console=tty0 console=ttyS0,115200"
    or similar. You may need to read the SFA10k documentation for how to
    configure the rest of the system to read/log this console data, or
    please contact Ihara or Atul if you need guidance
  • if all of this debug output is from a remote syslog instead of from a
    serial console, I'm wondering whether the problems are happening at some
    later offset, but are just being delayed/lost when sent to the syslog?
    It may be that the crash is actually happening in ext4_check_descriptors()?
  • it is possible to change the code to always use vmalloc() instead of
    kmalloc() and see at what filesystem size the kernel is crashing. It
    may be that the crash will happen at ~12050 blocks == 771200 groups ==
    94TB with vmalloc() even if the filesystem itself is not larger than
    128TB, but I don't think so.
  • we can add more debugging into this code to print exactly what is being
    done. We know from the earlier memset() that all of the vmalloced
    memory in that array was touched without problems, so we shouldn't be
    hitting a problem just to access this memory now. Something like what
    you likely have now, but with more details. I added a delay to the loop
    so that we are sure to see the latest output in case you can't get the
    console logging configured easily.
        printk(KERN_NOTICE "sbi->s_group_desc = %p-%p (%p)\n",
               sbi->s_group_desc, &sbi->s_group_desc[db_count],
               (char *)(sbi->s_group_desc) + size);
        for (i = 0; i < db_count; i++) {
                struct buffer_head *bh;
                block = descriptor_loc(sb, logical_sb_block, i);
                printk(KERN_NOTICE "i = %u/%u, block = %llu\n", i, db_count, block);
                printk(KERN_NOTICE "&sbi->s_group_desc[%u]: %p\n", i, &sbi->s_group_desc[i]);
                bh = sb_bread(sb, block);
                printk(KERN_NOTICE "bh[%llu] = %p\n", block, bh);
                sbi->s_group_desc[i] = bh;
                if (!sbi->s_group_desc[i]) {
                        ext4_msg(sb, KERN_ERR,
                               "can't read group descriptor %d", i);
                        db_count = i;
                        goto failed_mount2;
                }
                schedule_timeout(HZ/20); /* 10 minutes until crash! */ 
        }
        printk(KERN_NOTICE "entering ext4_check_descriptors()\n");
        schedule_timeout(5*HZ);
        if (!ext4_check_descriptors(sb, &first_not_zeroed)) {
Comment by Andreas Dilger [ 18/May/11 ]

In further discussion with Peter Jones, we would like you to start the testing with 128TB LUNs for RHEL5. We can test 192 TB LUNs at a later time, possibly once we get RHEL6 working on the DDN 10000E system.

Please first verify at least one ldiskfs mount with a kernel that uses only vmalloc() to allocate s_group_desc, to confirm that it is not just the vmalloc() memory that is failing at ~96TB (as mentioned in point #3 above). We don't want to find out at some customer site that vmalloc() is causing problems even with smaller filesystems, due to kmalloc() failing on a system with fragmented memory.

Next, please start a test against the 24TB LUN that is creating inodes that are located beyond the 16TB limit. Looking at the previous Maloo test output it appears there are about 25M inodes created on the OST filesystem, and about 50M+ inodes on the MDT filesystem. This should be based on some existing test like mdsrate-create-{small,large}.sh, using 25 directories with 1M files each to ensure that the inodes are being allocated above 16TB.

While the 24TB LUN inode testing is running, can you please also make a new version of ext4-force_over_16tb-rhel[56].patch, renamed to ext4-force_over_24tb-rhel[56].patch, that has a limit of 24TB ((6ULL << 30) blocks). This can be tested at the end simply by mounting a 24TB filesystem, without the need to re-run the full llverfs/llverdev tests. This should be used for 1.8.6.
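
For illustration, the guard added by such a patch in ext4_fill_super() might look roughly like the sketch below (an assumption only, not the actual patch; FORCE_OVER_24TB is a hypothetical mount-option flag):

/* Sketch: refuse to mount filesystems larger than 24TB
 * ((6ULL << 30) 4kB blocks) unless the force_over_24tb mount
 * option was explicitly given. */
if (ext4_blocks_count(es) > (6ULL << 30) &&
    !test_opt(sb, FORCE_OVER_24TB)) {
        printk(KERN_ERR "EXT4-fs: %s: filesystems larger than 24TB "
               "are not supported without the force_over_24tb "
               "mount option\n", sb->s_id);
        goto failed_mount;
}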

Next, I would like you to modify llverfs.c::print_filename() to print out the current read/write performance as described in LU-297. It appears from the Maloo logs of llverdev output that there are serious performance problems with reads on the SFA 10000E (i.e. only 80MB/s read vs. 300MB/s write), and I want to see whether this is also true with Lustre IO, or only llverdev reading/writing from userspace.

The full 128TB testing should be done at this point using the current master (2.1) code, with the kmalloc+vmalloc patch you wrote, and using the new llverdev tool. This needs corresponding ext4-force_over_128tb-rhel[56].patch files to be created. I estimate that it may take as long as 40 days to complete.

Comment by Jian Yu [ 19/May/11 ]

Some different things to check here:

  • the console/serial log on the OSS node will hopefully provide an Oops
    message that tells us what went wrong. If you don't have a serial
    console configured for the VM then this needs to be done. The kernel
    command-line needs something like "console=tty0 console=ttyS0,115200"
    or similar. You may need to read the SFA10k documentation for how to
    configure the rest of the system to read/log this console data, or
    please contact Ihara or Atul if you need guidance
  • if all of this debug output is from a remote syslog instead of from a
    serial console, I'm wondering whether the problems are happening at some
    later offset, but are just being delayed/lost when sent to the syslog?
    It may be that the crash is actually happening in ext4_check_descriptors()?

Thanks a lot for the suggestions! After redirecting the VM's serial-console output to a remote telnet connection, I got the exact Oops message, as follows:

Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv =====================
LDISKFS-fs (dm-3): warning: maximal mount count reached, running e2fsck is recommended
LDISKFS-fs: can't allocate buddy meta group
LDISKFS-fs (dm-3): failed to initalize mballoc (-12)
LDISKFS-fs (dm-3): mount failed
Unable to handle kernel NULL pointer dereference at 00000000000001c8 RIP:
 [<ffffffff8876a741>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
PGD 7bd06f067 PUD 7bd2af067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 0
Modules linked in: ldiskfs(U) jbd2(U) crc16(U) lnet(U) libcfs(U) raid0(U) mlx4_ib(U) ib_ipoib(U) ipoib_helper(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sun
rpc(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxg
b3i(U) cxgb3(U) 8021q(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hw
mon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) floppy(U) 8139too(U) mlx4_en(U) ide_cd(U) tpm_tis(U) 8139cp(U) tpm(U) i2c_p
iix4(U) mlx4_core(U) cdrom(U) sfablkdrvr(U) parport_pc(U) mii(U) tpm_bios(U) parport(U) i2c_core(U) pcspkr(U) serio_raw(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_c
ache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 3301, comm: mkfs.lustre Tainted: G      2.6.18-238.9.1.el5_lustre.20110509050254 #1
RIP: 0010:[<ffffffff8876a741>]  [<ffffffff8876a741>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
RSP: 0018:ffff8107bd1a3ad8  EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffff8106b9fcd558 RCX: ffff81063b37dcc0
RDX: ffff81063b37dcc0 RSI: ffff8106b9fcd770 RDI: ffff8106b9fcd558
RBP: ffff8106b9fcd458 R08: ffff810000032600 R09: 7fffffffffffffff
R10: ffff8107bd1a38a8 R11: ffffffff80039e22 R12: ffff8107bd0640d8
R13: 0000000000000000 R14: ffff8107bd036000 R15: ffffffff8876af30
FS:  00002b9ada9816e0(0000) GS:ffffffff80426000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000001c8 CR3: 00000007bd010000 CR4: 00000000000006e0
Process mkfs.lustre (pid: 3301, threadinfo ffff8107bd1a2000, task ffff8107db4cc040)
Stack:  7fffffffffffffff ffff8106b9fcd558 ffff81063b37dc00 ffffffff80023011
 ffff8106b9fcd558 ffffffff80039f68 0000000000000000 ffff8107bd064078
 0000000000000000 ffffffff800edf8f ffff81063b37dc00 ffffffff8878d9a0
Call Trace:
 [<ffffffff80023011>] clear_inode+0xd2/0x123
 [<ffffffff80039f68>] generic_drop_inode+0x146/0x15a
 [<ffffffff800edf8f>] shrink_dcache_for_umount_subtree+0x1f2/0x21e
 [<ffffffff800ee3ff>] shrink_dcache_for_umount+0x35/0x43
 [<ffffffff800e635b>] generic_shutdown_super+0x1b/0xfb
 [<ffffffff800e646c>] kill_block_super+0x31/0x45
 [<ffffffff800e653a>] deactivate_super+0x6a/0x82
 [<ffffffff800e6c5f>] get_sb_bdev+0x121/0x16c
 [<ffffffff800e65e5>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800e66ae>] do_kern_mount+0x36/0x4d
 [<ffffffff800f0fba>] do_mount+0x6a9/0x719
 [<ffffffff8002b4d6>] flush_tlb_page+0xac/0xda
 [<ffffffff8001125a>] do_wp_page+0x3f8/0x91e
 [<ffffffff88030d09>] :jbd:do_get_write_access+0x4f9/0x530
 [<ffffffff80019de2>] __getblk+0x25/0x236
 [<ffffffff800096d4>] __handle_mm_fault+0xf6b/0x1039
 [<ffffffff88030804>] :jbd:journal_stop+0x249/0x255
 [<ffffffff800ce751>] zone_statistics+0x3e/0x6d
 [<ffffffff8000f41d>] __alloc_pages+0x78/0x308
 [<ffffffff800eada4>] sys_mkdirat+0xd1/0xe4
 [<ffffffff8004c717>] sys_mount+0x8a/0xcd
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 48 8b b8 c8 01 00 00 48 85 ff 74 13 48 83 c4 08 48 8d b5 40
RIP  [<ffffffff8876a741>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
 RSP <ffff8107bd1a3ad8>
CR2: 00000000000001c8
 <0>Kernel panic - not syncing: Fatal exception

The panic was caused by the following kmalloc() code in fs/ext4/mballoc.c:

static int ext4_mb_init_backend(struct super_block *sb)
{
    //......
    sbi->s_group_info = kmalloc(array_size, GFP_KERNEL);
    if (sbi->s_group_info == NULL) {
        printk(KERN_ERR "EXT4-fs: can't allocate buddy meta group\n");
        return -ENOMEM;
    }
    //......
}

I'll make a patch for this and check whether there is any other code with the same issue. The patch will be uploaded to http://review.whamcloud.com/545.

Comment by Andreas Dilger [ 19/May/11 ]

It is important to note that while the kmalloc() failure in ext4_mb_init_backend() caused an error, the actual oops was in ldiskfs_clear_inode(), so at some point that should be investigated as well.

Also, while fixing up ext4_mb_init_backend(), it appears that the comment for the s_group_info kmalloc() call is incorrect. A 128TB filesystem has 16384 group descriptor blocks (== 128kB pointer array), because the group descriptors for > 16TB filesystems are twice as large. Please fix it up to read:

        /* A 16TB filesystem with 64-bit pointers requires an 8192 byte
         * kmalloc().  Filesystems larger than 2^32 blocks (16TB normally)
         * have group descriptors at least twice as large (64 bytes or
         * more vs. 32 bytes for traditional ext3 filesystems), so a 128TB
         * filesystem needs a 128kB allocation, which may need vmalloc(). */

Please ensure that starting the 24TB inode testing is your highest priority, since this is blocking our 1.8.6.wc release. We can continue to resolve these issues and test 128TB or larger LUNs for 2.1.x while the 24TB testing is running.

Comment by Jian Yu [ 19/May/11 ]

Please ensure that starting the 24TB inode testing is your highest priority, since this is blocking our 1.8.6.wc release. We can continue to resolve these issues and test 128TB or larger LUNs for 2.1.x while the 24TB testing is running.

OK, got it.

Comment by Jian Yu [ 20/May/11 ]

The 24TB inode testing against Lustre b1_8 on CentOS5.6/x86_64 (kernel version: 2.6.18-238.9.1.el5_lustre.20110509050254) was started at Fri May 20 03:08:49 PDT 2011.

The following builds were used:
Lustre build: http://newbuild.whamcloud.com/job/lustre-reviews/581/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/
e2fsprogs build: http://newbuild.whamcloud.com/job/e2fsprogs-master/arch=x86_64,distro=el5/26/

The test passed at Fri May 20 06:33:34 PDT 2011:
https://maloo.whamcloud.com/test_sets/9a51a09e-84eb-11e0-b4df-52540025f9af

Here is a short summary of the test result after running mdsrate with "--create" option:

# /opt/bin/mpirun -np 25 -machinefile /tmp/mdsrate-create.machines /usr/lib64/lustre/tests/mdsrate --create --verbose --ndirs 25 --dirfmt '/mnt/lustre/mdsrate/dir%d' --nfiles 1000000 --filefmt 'file%%d'

Rate: 2068.64 eff 2069.13 aggr 82.77 avg client creates/sec (total: 25 threads 25000000 creates 25 dirs 1 threads/dir 12085.21 secs)

# lfs df -h /mnt/lustre
UUID                       bytes        Used   Available Use% Mounted on
largefs-MDT0000_UUID      224.0G        1.2G      210.0G   1% /mnt/lustre[MDT:0]
largefs-OST0000_UUID       24.0T      938.0M       22.8T   0% /mnt/lustre[OST:0]

filesystem summary:        24.0T      938.0M       22.8T   0% /mnt/lustre


# lfs df -i /mnt/lustre
UUID                      Inodes       IUsed       IFree IUse% Mounted on
largefs-MDT0000_UUID    67108864    25000052    42108812  37% /mnt/lustre[MDT:0]
largefs-OST0000_UUID    25165824    25000087      165737  99% /mnt/lustre[OST:0]

filesystem summary:     67108864    25000052    42108812  37% /mnt/lustre
Comment by Jian Yu [ 23/May/11 ]

can you please also make a new version of the ext4-force_over_16tb-rhel[56].patch that is now renamed to ext4-force_over_24tb-rhel[56].patch that has a limit of 24TB ((6ULL << 30) blocks).

Patch for b1_8 branch: http://review.whamcloud.com/589.

Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
Comment by Andreas Dilger [ 24/May/11 ]

Yu Jian,
if you still have the test filesystem for 24TB inode testing, can you please run "time e2fsck -fn" on the OST filesystem? I would like to verify that e2fsck can properly handle inodes located beyond the 16TB offset limit.

Comment by Build Master (Inactive) [ 24/May/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #61
LU-136 change "force_over_16tb" mount option to "force_over_24tb"

Johann Lombardi : bd5a07010489666d7adf79c074f2dbd694f49f4a
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_24tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
Comment by Jian Yu [ 24/May/11 ]

if you still have the test filesystem for 24TB inode testing, can you please run "time e2fsck -fn" on the OST filesystem? I would like to verify that e2fsck can properly handle inodes located beyond the 16TB offset limit.

Done: https://maloo.whamcloud.com/test_sets/5a99a9da-869c-11e0-b4df-52540025f9af

Comment by Jian Yu [ 25/May/11 ]

Status: I will continue working on this ticket after the Lustre 1.8.6 pre-release/release testing.

Comment by Andreas Dilger [ 03/Jun/11 ]

Yu Jian, given how long we expect the testing for this problem to take, would it be possible to start a 128TB test with the current master (2.1 pre) code? I expect the tests will take at least 30 days to complete, and if these are not started now they will likely delay the 2.1 release.

Please make a script which includes all of the tests we ran for 24TB (partial tests first, then full tests, including the many-inodes test and a full e2fsck after each test).

To keep the test logs consistent it probably makes sense to just name the test as "large-LUN-partial" and "large-LUN-full" and "large-LUN-inodes" or similar instead of putting the LUN size in the test name.

Once the tests are running they will hopefully not take much of your time, but the loss of elapsed time is hurting us here.

Comment by Andreas Dilger [ 03/Jun/11 ]

NB - I believe the problems we saw are related to >128TB only, is that correct?

Comment by Jian Yu [ 04/Jun/11 ]

I believe the problems we saw are related to >128TB only, is that correct?

Right, I could format and mount a 128TB LUN successfully. I'll start the testing against the latest master branch on CentOS5.6/x86_64 (kernel version: 2.6.18-238.9.1.el5) soon.

Comment by Jian Yu [ 07/Jun/11 ]

The 128TB LUN partial testing against Lustre master branch on CentOS5.6/x86_64 (kernel version: 2.6.18-238.9.1.el5_lustre.gc66d831) was started at Tue Jun 7 03:13:08 PDT 2011.

The following builds were used:
Lustre build: http://newbuild.whamcloud.com/job/lustre-master/156/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
e2fsprogs build: http://newbuild.whamcloud.com/job/e2fsprogs-master/28/arch=x86_64,distro=el5/

Formatting the 128TB LUN failed: LU-399.

Comment by Jian Yu [ 14/Jun/11 ]

The 128TB LUN partial testing against Lustre master branch on CentOS5.6/x86_64 (kernel version: 2.6.18-238.12.1.el5_lustre.g57944e2) was started at Tue Jun 14 00:37:57 PDT 2011.

The following builds were used:
Lustre build: http://newbuild.whamcloud.com/job/lustre-master/168/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
e2fsprogs build: http://newbuild.whamcloud.com/job/e2fsprogs-master/40/arch=x86_64,distro=el5/

After running 6223s, the test passed:
https://maloo.whamcloud.com/test_sets/08e88644-966c-11e0-9a27-52540025f9af

The 128TB LUN full testing was started at Tue Jun 14 02:58:30 PDT 2011.
After running llverfs in partial and full modes on the OST ldiskfs filesystem and then in partial mode on the Lustre filesystem, unmounting the OST hung due to LU-395:
https://maloo.whamcloud.com/test_sets/28db3042-9be8-11e0-9a27-52540025f9af

The patch for LU-395 was landed on the master branch on 16 June. I'll set up the node with the latest master build and complete the remaining part of the testing (running llverfs in full mode on the Lustre filesystem).

Comment by Andreas Dilger [ 22/Jun/11 ]

It is good news that the testing has worked so well (excluding the one unrelated bug).

For testing on master, no extra mkfs.lustre options should be needed when formatting the filesystem. This was an oversight in the 1.8.6 testing, because the > 16TB support appeared to work OK, but as soon as DDN used mkfs.lustre without specifying any options the format failed.

Upon closer inspection, it does seem that mkfs_lustre.c needs to set the "64bit" flag for huge filesystems. I attached a patch to Gerrit change 996 to fix this problem.

Are you planning on testing the inode creation + e2fsck testing that was run previously for 24TB LUNs? Also, please create a new ext4-force_over_128tb-rhel6.patch file with updated mount options.

We also need to find an OSS node with 128TB+ of storage that we can use for RHEL6 kernel/ldiskfs testing, since this cannot be tested within the SFA10000E VM.

Comment by Jian Yu [ 22/Jun/11 ]

Are you planning on testing the inode creation + e2fsck testing that was run previously for 24TB LUNs?

Yes, I will.

Also, please create a new ext4-force_over_128tb-rhel6.patch file with updated mount options.

OK, got it.

Comment by Jian Yu [ 28/Jun/11 ]

After http://review.whamcloud.com/#change,996 was merged into the master branch, I proceeded with the remaining tests on the 128TB LUN. However, formatting the 128TB OST caused a kernel panic as follows:

Lustre: DEBUG MARKER: ===================== format the OST /dev/large_vg/ost_lv =====================
LDISKFS-fs (dm-3): warning: maximal mount count reached, running e2fsck is recommended
LDISKFS-fs: can't allocate buddy meta group
LDISKFS-fs (dm-3): failed to initalize mballoc (-12)
LDISKFS-fs (dm-3): mount failed
Unable to handle kernel NULL pointer dereference at 00000000000001c8 RIP:
 [<ffffffff887801f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
PGD 7c0755067 PUD 7cbdea067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 3
Modules linked in: ldiskfs(U) jbd2(U) crc16(U) raid0(U) mlx4_ib(U) ib_ipoib(U) ipoib_helper(U) lnet(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxgb3i(U) cxgb3(U) 8021q(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) floppy(U) 8139too(U) mlx4_en(U) tpm_tis(U) parport_pc(U) ide_cd(U) tpm(U) 8139cp(U) mlx4_core(U) i2c_piix4(U) parport(U) sfablkdrvr(U) cdrom(U) mii(U) tpm_bios(U) serio_raw(U) i2c_core(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 3406, comm: mkfs.lustre Tainted: G      2.6.18-238.12.1.el5_lustre.g6a3d997 #1
RIP: 0010:[<ffffffff887801f1>]  [<ffffffff887801f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
RSP: 0000:ffff810431595ad8  EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffff8107d06b8a10 RCX: ffff8107d21c90c0
RDX: ffff8107d21c90c0 RSI: ffff8107d06b8c18 RDI: ffff8107d06b8a10
RBP: ffff8107d06b8910 R08: ffff810000032600 R09: 7fffffffffffffff
R10: ffff8104315958a8 R11: ffffffff80039e56 R12: ffff8107cf3ec0d8
R13: 0000000000000000 R14: ffff8107c06e3000 R15: ffffffff88780600
FS:  00002b3dd812b6e0(0000) GS:ffff81011bbdb640(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000001c8 CR3: 00000007c0644000 CR4: 00000000000006e0
Process mkfs.lustre (pid: 3406, threadinfo ffff810431594000, task ffff8107dfeb9080)
Stack:  7fffffffffffffff ffff8107d06b8a10 ffff8107d21c9000 ffffffff8002303b
 ffff8107d06b8a10 ffffffff80039f9c 0000000000000000 ffff8107cf3ec078
 0000000000000000 ffffffff800ede72 ffff8107d21c9000 ffffffff887a2d00
Call Trace:
 [<ffffffff8002303b>] clear_inode+0xd2/0x123
 [<ffffffff80039f9c>] generic_drop_inode+0x146/0x15a
 [<ffffffff800ede72>] shrink_dcache_for_umount_subtree+0x1f2/0x21e
 [<ffffffff800ee40c>] shrink_dcache_for_umount+0x35/0x43
 [<ffffffff800e636b>] generic_shutdown_super+0x1b/0xfb
 [<ffffffff800e647c>] kill_block_super+0x31/0x45
 [<ffffffff800e654a>] deactivate_super+0x6a/0x82
 [<ffffffff800e6c6f>] get_sb_bdev+0x121/0x16c
 [<ffffffff800e65f5>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800e66be>] do_kern_mount+0x36/0x4d
 [<ffffffff800f0fc6>] do_mount+0x6a9/0x719
 [<ffffffff8002b502>] flush_tlb_page+0xac/0xda
 [<ffffffff8001125b>] do_wp_page+0x3f8/0x91e
 [<ffffffff88030d09>] :jbd:do_get_write_access+0x4f9/0x530
 [<ffffffff80019de3>] __getblk+0x25/0x236
 [<ffffffff800096d4>] __handle_mm_fault+0xf6b/0x1039
 [<ffffffff88030804>] :jbd:journal_stop+0x249/0x255
 [<ffffffff800ce756>] zone_statistics+0x3e/0x6d
 [<ffffffff800efd44>] copy_mount_options+0xcc/0x127
 [<ffffffff8004c74a>] sys_mount+0x8a/0xcd
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 48 8b b8 c8 01 00 00 48 85 ff 74 13 48 83 c4 08 48 8d b5 30
RIP  [<ffffffff887801f1>] :ldiskfs:ldiskfs_clear_inode+0x81/0xb0
 RSP <ffff810431595ad8>
CR2: 00000000000001c8
 <0>Kernel panic - not syncing: Fatal exception

The mkfs.lustre command I ran was:

mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=192.168.77.1@o2ib --mountfsoptions='errors=remount-ro,extents,mballoc,force_over_16tb' /dev/large_vg/ost_lv

The panic was the same as what was described in #comment-14649 above. I'll look into ldiskfs_clear_inode() per the above comment #comment-14650.

Comment by Jian Yu [ 02/Jul/11 ]

The panic was the same as what was described in #comment-14649 above. I'll look into ldiskfs_clear_inode() per the above comment #comment-14650.

A new ticket LU-477 was filed to track and fix the above issue.

Comment by Jian Yu [ 08/Jul/11 ]

Also, please create a new ext4-force_over_128tb-rhel6.patch file with updated mount options.

Patch for master branch: http://review.whamcloud.com/1073.

Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,client,el5,ofa #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el5,ofa #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
Comment by Build Master (Inactive) [ 08/Jul/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #199
LU-136 change "force_over_16tb" mount option to "force_over_128tb"

Oleg Drokin : 79ec0a1df07733183f19d71813f99306b31f3636
Files :

  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-extents-mount-option-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel5.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_16tb-rhel6.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-disable-mb-cache-rhel6.patch
  • ldiskfs/kernel_patches/patches/ext4-force_over_128tb-rhel6.patch
Comment by Jian Yu [ 11/Jul/11 ]

After http://review.whamcloud.com/1071 and http://review.whamcloud.com/1073 were merged into the master branch, I proceeded with the 128TB LUN full testing on CentOS5.6/x86_64 (kernel version: 2.6.18-238.12.1.el5_lustre.g5c1e9f9). The testing was started at Sun Jul 10 23:56:02 PDT 2011.

The following builds were used:
Lustre build: http://newbuild.whamcloud.com/job/lustre-master/199/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
e2fsprogs build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/arch=x86_64,distro=el5/

There were no extra mkfs.lustre options specified when formatting the 128TB OST.

===================== format the OST /dev/large_vg/ost_lv =====================
# time mkfs.lustre --reformat --fsname=largefs --ost --mgsnode=192.168.77.1@o2ib /dev/large_vg/ost_lv

   Permanent disk data:
Target:     largefs-OSTffff
Index:      unassigned
Lustre FS:  largefs
Mount type: ldiskfs
Flags:      0x72
              (OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.77.1@o2ib

device size = 134217728MB
formatting backing filesystem ldiskfs on /dev/large_vg/ost_lv
        target name  largefs-OSTffff
        4k blocks     34359738368
        options        -J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init, -F
mkfs_cmd = mke2fs -j -b 4096 -L largefs-OSTffff  -J size=400 -I 256 -i 1048576 -q -O extents,uninit_bg,dir_nlink,huge_file,64bit,flex_bg -G 256 -E lazy_journal_init, -F /dev/large_vg/ost_lv 34359738368
Writing CONFIGS/mountdata

real    0m44.489s
user    0m6.669s
sys     0m31.087s
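
For reference, a minimal sketch of the full verification pass that follows the format step, assuming the MGS/MDT are already mounted and that /mnt/ost0 is a placeholder mount point for the OST; the client mount point, fsname and mgsnode match the output above, and the llverfs flags are the usual verbose/full-pass options rather than the exact command line used for this run:

# start the OST and mount a client against the largefs filesystem
mount -t lustre /dev/large_vg/ost_lv /mnt/ost0
mount -t lustre 192.168.77.1@o2ib:/largefs /mnt/lustre

# full (long) verify: write until the OST is full, then read everything back
llverfs -v -l /mnt/lustre 2>&1 | tee llverfs-vl.lustre.log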
Comment by Jian Yu [ 26/Jul/11 ]

After running for about 12385 minutes (roughly 206 hours, or 8.6 days), the 128TB Lustre filesystem was successfully filled by llverfs:

# lfs df -h /mnt/lustre
UUID                       bytes        Used   Available Use% Mounted on
largefs-MDT0000_UUID        1.5T      499.3M        1.4T   0% /mnt/lustre[MDT:0]
largefs-OST0000_UUID      128.0T      121.4T      120.0G 100% /mnt/lustre[OST:0]

filesystem summary:       128.0T      121.4T      120.0G 100% /mnt/lustre

# lfs df -i /mnt/lustre
UUID                      Inodes       IUsed       IFree IUse% Mounted on
largefs-MDT0000_UUID  1073741824       32099  1073709725   0% /mnt/lustre[MDT:0]
largefs-OST0000_UUID   134217728       31191   134186537   0% /mnt/lustre[OST:0]

filesystem summary:   1073741824       32099  1073709725   0% /mnt/lustre

Now, the read operation is ongoing...

Comment by Jian Yu [ 03/Aug/11 ]

Now, the read operation is ongoing...

Done.

After running for about 21 days in total, the 128TB LUN full testing on CentOS5.6/x86_64 (kernel version: 2.6.18-238.12.1.el5_lustre.g5c1e9f9) passed on Lustre master build v2_0_65_0:
https://maloo.whamcloud.com/test_sets/69c35618-bdd3-11e0-8bdf-52540025f9af

The "large-LUN-inodes" testing is going to be started on the latest master branch...

Comment by Jian Yu [ 09/Aug/11 ]

The "large-LUN-inodes" testing is going to be started on the latest master branch...

The inode creation testing on the 128TB Lustre filesystem against the master branch on CentOS5.6/x86_64 (kernel version: 2.6.18-238.19.1.el5_lustre.gd4ea36c) was started at Mon Aug 8 22:51:49 PDT 2011. About 134M inodes were to be created.

The following builds were used:
Lustre build: http://newbuild.whamcloud.com/job/lustre-master/246/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
e2fsprogs build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/arch=x86_64,distro=el5/

After running for about 53 hours, the test passed at Thu Aug 11 04:41:09 PDT 2011:
https://maloo.whamcloud.com/test_sets/af225374-c72b-11e0-a7e2-52540025f9af

Here is a short summary of the test result after running mdsrate with "--create" option:

# /opt/mpich/bin/mpirun  -np 25 -machinefile /tmp/mdsrate-create.machines /usr/lib64/lustre/tests/mdsrate --create --verbose --ndirs 25 --dirfmt '/mnt/lustre/mdsrate/dir%d' --nfiles 5360000 --filefmt 'file%%d'

Rate: 694.17 eff 694.18 aggr 27.77 avg client creates/sec (total: 25 threads 134000000 creates 25 dirs 1 threads/dir 193035.50 secs)
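
As a quick sanity check on the reported numbers (a back-of-the-envelope calculation, not part of the test output), the aggregate and per-client rates follow directly from the totals:

# 134,000,000 creates over 193,035.50 seconds across 25 client threads
echo "scale=3; 134000000 / 193035.50" | bc        # ~694.18 aggregate creates/sec
echo "scale=3; 134000000 / 193035.50 / 25" | bc   # ~27.77 creates/sec per client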

# lfs df -h /mnt/lustre
UUID                       bytes        Used   Available Use% Mounted on
largefs-MDT0000_UUID        1.5T       13.6G        1.4T   1% /mnt/lustre[MDT:0]
largefs-OST0000_UUID      128.0T        3.6G      121.6T   0% /mnt/lustre[OST:0]

filesystem summary:       128.0T        3.6G      121.6T   0% /mnt/lustre


# lfs df -i /mnt/lustre
UUID                      Inodes       IUsed       IFree IUse% Mounted on
largefs-MDT0000_UUID  1073741824   134000062   939741762  12% /mnt/lustre[MDT:0]
largefs-OST0000_UUID   134217728   134006837      210891 100% /mnt/lustre[OST:0]

filesystem summary:   1073741824   134000062   939741762  12% /mnt/lustre
Comment by Jian Yu [ 15/Aug/11 ]

After running for about 53 hours, the test passed at Thu Aug 11 04:41:09 PDT 2011:
https://maloo.whamcloud.com/test_sets/af225374-c72b-11e0-a7e2-52540025f9af

The test log did not show up in the above Maloo report. Please find it in the attached file: large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.log.

Comment by Andreas Dilger [ 15/Aug/11 ]

Yu Jian, I looked through the inodes run, but I didn't see it running e2fsck on the large LUN. That should be added as part of the test script if it isn't there today. If the LUN with the 135M files still exists, can you please start an e2fsck on both the MDS and the OST?

Comment by Jian Yu [ 15/Aug/11 ]

Yu Jian, I looked through the inodes run, but I didn't see it running e2fsck on the large LUN. That should be added as part of the test script if it isn't there today. If the LUN with the 135M files still exists, can you please start an e2fsck on both the MDS and the OST?

Sorry for the confusion, Andreas. The e2fsck step is in the test script. While running e2fsck on the OST after creating the 134M files, the following errors occurred on the virtual disks presented to the virtual machine:

--------8<--------
kernel: janusdrvr: WARNING: cpCompleteIoReq(): Req Context ID 0x0 completed with error status 0x7
kernel: end_request: I/O error, dev sfa0066, sector 0
kernel: Buffer I/O error on device sfa0066, logical block 0
kernel: janusdrvr: WARNING: cpCompleteIoReq(): Req Context ID 0x1 completed with error status 0x7
kernel: end_request: I/O error, dev sfa0066, sector 0
kernel: Buffer I/O error on device sfa0066, logical block 0
--------8<-------- 

The same issue also occurred on disks presented to other virtual machines, after which all of the disks became invisible. I tried rebooting the virtual machine and reloading the disk driver, but that did not help. I believe it is a hardware issue, so I removed the incomplete e2fsck part from the test result and uploaded only the completed inode creation part.

After the issue is resolved, I'll complete the e2fsck part.
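
For reference, a minimal sketch of the pending e2fsck pass, assuming both targets are unmounted; the OST device is the one formatted earlier in this ticket, while /dev/mdt_dev is only a placeholder for the actual MDT device:

# read-only, forced full check of both targets
time e2fsck -fn /dev/mdt_dev 2>&1 | tee e2fsck.mdt.log          # MDT (placeholder device name)
time e2fsck -fn /dev/large_vg/ost_lv 2>&1 | tee e2fsck.ost.log  # 128TB OST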

Comment by Jian Yu [ 19/Aug/11 ]

After the issue is resolved, I'll complete the e2fsck part.

OK, the issue is now resolved. The testing has been restarted with the following master builds:

Lustre build: http://newbuild.whamcloud.com/job/lustre-master/263/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
e2fsprogs build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/arch=x86_64,distro=el5/

After running for about 120 hours, the inode creation and e2fsck tests passed on the 128TB Lustre filesystem.
Please refer to the attached test output file: large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.build263.log

Comment by Andreas Dilger [ 01/Sep/11 ]

For the 1.41.90.wc4 e2fsprogs I've cherry-picked a couple of recent 64-bit fixes from upstream:

commit bc526c65d2a4cf0c6c04e9ed4837d6dd7dbbf2b3
Author: Theodore Ts'o <tytso@mit.edu>
Date: Tue Jul 5 20:35:46 2011 -0400

libext2fs: fix 64-bit support in ext2fs_bmap2()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit 24404aa340b274e077b2551fa7bdde5122d3eb43
Author: Theodore Ts'o <tytso@mit.edu>
Date: Tue Jul 5 20:02:27 2011 -0400

libext2fs: fix 64-bit support in ext2fs_{read,write}_inode_full()

This fixes a problem where reading or writing inodes located after the
4GB boundary would fail.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

The first one is unlikely to affect most uses, but may hit in rare cases.
The second one is only a problem on 32-bit machines, so is unlikely to affect Lustre users.
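
For reference, a minimal sketch of how those two commits could be picked into the e2fsprogs branch, assuming a checkout of the branch with Ted Ts'o's upstream tree added as a remote (the remote name and URL are assumptions; the commit IDs are the ones quoted above):

# add the upstream e2fsprogs repository and cherry-pick the two 64-bit fixes
git remote add upstream git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
git fetch upstream
git cherry-pick 24404aa340b274e077b2551fa7bdde5122d3eb43   # ext2fs_{read,write}_inode_full() fix
git cherry-pick bc526c65d2a4cf0c6c04e9ed4837d6dd7dbbf2b3   # ext2fs_bmap2() fix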

I don't think there is anything left to do for this bug, so it can be closed.

Comment by Andreas Dilger [ 09/Sep/11 ]

After running for about 120 hours, the inode creation and e2fsck tests passed on the 128TB Lustre filesystem.
Please refer to the attached test output file: large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.build263.log

Yu Jian, I looked at the log file and found some strange results.

Firstly, do you know why none of the large-LUN-inodes test results in Maloo include the test logs? That makes it hard to look at the results in the future if there is reason to do so. I wanted to see the e2fsck times for the many-inodes runs, but only have the one test result above to look at. Could you please file a separate TT- bug to fix whatever problem is preventing the logs for this test from being sent to Maloo.

Looking at the above log, it seems that the MDT (with 25 dirs of 5M files each) took only 7 minutes to run e2fsck, while the OST (with 32 dirs of 4M files each) took 3500 minutes (58 hours) to run. That doesn't make sense, and I wanted to compare this to the most recent large-LUN-inodes test result, which took 20h less time to run.

Are the MDT and OST e2fsck runs in the same VM on the SFA10k, or is the MDT on a separate MDS node?

Comment by Jian Yu [ 13/Sep/11 ]

Firstly, do you know why none of the large-LUN-inodes test results in Maloo include the test logs? That makes it hard to look at the results in the future if there is reason to do so. I wanted to see the e2fsck times for the many-inodes runs, but only have the one test result above to look at. Could you please file a separate TT- bug to fix whatever problem is preventing the logs for this test from being sent to Maloo.

I have no idea about this issue. The syslog is displayed in Maloo, but not the suite log or the test log. I just created TT-180 to ask John for help.

Are the MDT and OST e2fsck runs in the same VM on the SFA10k, or is the MDT on a separate MDS node?

The MDT and OST are in the same VM.

Before TT-180 is fixed, please find the attached large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.build273.log file for the test output of the inode creation + e2fsck test on the following builds:

Lustre build: http://newbuild.whamcloud.com/job/lustre-master/273/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
e2fsprogs build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/arch=x86_64,distro=el5/

Comment by Jian Yu [ 13/Sep/11 ]

Before TT-180 is fixed, please find the attached large-LUN-inodes.suite_log.ddn-sfa10000e-stack01.build273.log file for the test output of the inode creation + e2fsck test on the following builds:

Lustre build: http://newbuild.whamcloud.com/job/lustre-master/273/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
e2fsprogs build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/arch=x86_64,distro=el5/

TT-180 was just fixed.
Here is the Maloo report for the above test result: https://maloo.whamcloud.com/test_sets/83e2174e-ddfb-11e0-9909-52540025f9af
