[LU-2963] fail to create large stripe count file with -ENOSPC error Created: 14/Mar/13 Updated: 06/May/14 Resolved: 06/May/14
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Jian Yu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 7227 |
| Comments |
| Comment by James A Simmons [ 14/Mar/13 ] |
|
During testing we did some runs creating directories of increasing stripe count. What we discovered was that at around 128 stripes the files being created would fail with -ENOSPC no matter what size the file was. This test was also done with 1.8 clients and we saw no such problems. Also, during the runs with the 2.4 clients, attempts to do an lfs getstripe on the large stripe count directory would lock up. |
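A minimal sketch of the kind of scaling test described above (the mount point, directory names, and file size are hypothetical examples, not the actual ORNL job):

# create directories with powers-of-two stripe counts and try a small write in each
for c in 4 8 16 32 64 128 256; do
    dir=/mnt/lustre/stripe_${c}
    mkdir -p ${dir}
    lfs setstripe -c ${c} ${dir}
    dd if=/dev/zero of=${dir}/testfile bs=1M count=4 || echo "create failed at ${c} stripes"
done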
| Comment by James A Simmons [ 14/Mar/13 ] |
|
Here is a log from the MDS when I attempted to use lfs getstripe and it hung.
Mar 8 21:12:54 widow-mds1 kernel: [27048.704079] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3167@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f748600/0x4453d516b35bd568 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3167@gni remote: 0x1f865dc509f72511 expref: 11 pid: 28552 timeout: 4321715043 lvb_type: 0 |
| Comment by Peter Jones [ 15/Mar/13 ] |
|
Minh, could you please see whether you are able to reproduce this? Thanks, Peter |
| Comment by Minh Diep [ 18/Mar/13 ] |
|
Hi James, you said you created directories with increasing stripe count, and then that at around 128 stripes "file" creation failed. Did you try to create files under those directories after the directories were created? |
| Comment by Minh Diep [ 19/Mar/13 ] |
|
Due to |
| Comment by Minh Diep [ 20/Mar/13 ] |
|
Hi James, I have set up 2 OSSes with 300 OSTs each but could not reproduce the problem you are seeing. Could you please try the latest build on lustre-master, since many fixes have gone in lately? Thanks |
| Comment by James A Simmons [ 20/Mar/13 ] |
|
Setting up a smaller scale test system to reproduce. |
| Comment by James A Simmons [ 21/Mar/13 ] |
|
While attempting to set up a system to reproduce this I ran into |
| Comment by James A Simmons [ 22/Mar/13 ] |
|
Managed to get a 224 stripe count system up. So far I haven't been able to reproduce the problem. Please don't close the ticket until our next test shot in the latter part of April when we can make sure that this is fixed. |
| Comment by Minh Diep [ 22/Mar/13 ] |
|
James, just curious, what is the memory size on your OSS? I am hitting OOM when trying to mount 300 OSTs, stopping at around 241. |
| Comment by James A Simmons [ 22/Mar/13 ] |
|
I originally tried 448 OSTs but hit an OOM as well. I reduced my OST count to 224 and that worked for me. The OSS I'm working with has 16GB of RAM. |
| Comment by Peter Jones [ 25/Mar/13 ] |
|
Dropping in priority as unable to reproduce this issue on the latest master. Will raise it in priority again if it reoccurs. |
| Comment by James A Simmons [ 02/Apr/13 ] |
|
Good news and bad news. I can now duplicate this problem with the latest 2.3.63. What are the best debug settings to track down this problem? |
| Comment by Minh Diep [ 02/Apr/13 ] |
|
James, could you provide a Lustre debug log with debug=-1? How long does it take to reproduce this? |
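For reference, a typical way to capture such a log on a client (the buffer size and output path here are just examples):

lctl set_param debug=-1          # enable all debug flags
lctl set_param debug_mb=1024     # enlarge the kernel debug buffer
lctl clear                       # clear any existing debug messages
# ... run the IOR job until the -ENOSPC is hit ...
lctl dk > /tmp/lustre-debug.log  # dump the debug buffer to a file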
| Comment by James A Simmons [ 02/Apr/13 ] |
|
With my IOR job it takes less than a minute to reproduce. I'm going to set up the debug script now. |
| Comment by James A Simmons [ 02/Apr/13 ] |
|
Uploaded one client's log to ftp.whamcloud.com/uploads/. I have more client logs if you need them. |
| Comment by Minh Diep [ 02/Apr/13 ] |
|
How many OSSes and OSTs? How many clients did you use for IOR? I'd like to try to reproduce this in the lab. |
| Comment by James A Simmons [ 02/Apr/13 ] |
|
Here are my scripts to create a file system with the config (testfs-barry-224.conf). new-build formats the file system and new-lustre-start mounts it. Hostlist is used to handle the pdsh-format listing of the devices and servers. I also attached the job scripts I used to run my IOR job. |
| Comment by James A Simmons [ 02/Apr/13 ] |
|
4 OSSes, each with 56 OSTs, for a total of 224 OSTs. This is an LVM setup where each OSS has 7 real OSTs. On the client side I ran IOR across 18 Cray compute nodes. I attached all my setup scripts and the job script I ran with. You will need to adapt them for your system. |
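For anyone trying to replicate a similar layout, a rough sketch of formatting many LVM-backed OSTs on a single OSS might look like the following (the fsname, MGS NID, and logical volume names are placeholders, not the actual ORNL configuration):

# format 56 LVM-backed OSTs on one OSS; offset the index range on the other OSSes
MGSNID=mgs@o2ib
FSNAME=testfs
for i in $(seq 0 55); do
    mkfs.lustre --ost --fsname=${FSNAME} --index=${i} \
        --mgsnode=${MGSNID} /dev/vg_ost/lv_ost${i}
done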
| Comment by James A Simmons [ 05/Apr/13 ] |
|
Any updates? |
| Comment by Minh Diep [ 05/Apr/13 ] |
|
Sorry, no. I hit |
| Comment by James A Simmons [ 08/Apr/13 ] |
|
I have a client log uploaded already at ftp.whamcloud.com/uploads/ |
| Comment by Minh Diep [ 08/Apr/13 ] |
|
Yes, please upload the server logs too. Thanks. |
| Comment by James A Simmons [ 08/Apr/13 ] |
|
I need to build a new large stripe count file system for other tests, so I will get you new logs. |
| Comment by James A Simmons [ 11/Apr/13 ] |
|
I rebased to the latest master and now I can't reproduce this bug. I have a feeling some of the layout patches that were merged on the 8th fixed this issue. Please leave this ticket open until after our test shot, which will take place tomorrow. |
| Comment by James A Simmons [ 22/Apr/13 ] |
|
During the last test shot we encountered this bug again. This time we got logs from the clients and servers. I uploaded all the logs to ftp.whamcloud.com/uploads/ |
| Comment by Peter Jones [ 22/Apr/13 ] |
|
Yu, Jian, could you please review this latest information from ORNL? Thanks, Peter |
| Comment by Andreas Dilger [ 26/Apr/13 ] |
|
Sorry, I haven't looked at the logs yet. My gut feeling is that the -ENOSPC is being returned from the server, either from the journal layer or from the xattrs, since those are the few places affected by a growing stripe count. James, first question: is the "large_xattr" feature enabled on your MDS? This is still not enabled by default (see http://review.whamcloud.com/4315), since the patch has not been accepted upstream yet, and it makes sense to limit the feature exposure to sites that actually need it. This can be set at format time (via mkfsoptions) or afterward (via tune2fs) with "-O large_xattr". |
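As a concrete illustration of the two options mentioned above (the device path, fsname, and MGS NID are placeholders):

# at format time, pass the ldiskfs feature through mkfs.lustre
mkfs.lustre --mdt --fsname=testfs --index=0 --mgsnode=mgs@o2ib \
    --mkfsoptions='-O large_xattr' /dev/mdt_device

# or enable it later on an existing, unmounted MDT
tune2fs -O large_xattr /dev/mdt_device

# verify that the feature is present in the superblock
dumpe2fs -h /dev/mdt_device | grep -i features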
| Comment by Di Wang [ 26/Apr/13 ] |
|
James: I just checked the debug logs, but I did not find an MDS log there. Just to confirm, is the bug you hit in the last test still -ENOSPC when you try to create a file with 224 stripes? |
| Comment by Jian Yu [ 27/Apr/13 ] |
|
Lustre Branch: master
MDSOPT="--mkfsoptions='-O large_xattr'"

The parallel-scale test iorssf passed with 224 OSTs. As per run_ior() in lustre/tests/functions.sh, "$LFS setstripe $testdir -c -1" was performed before running the IOR command.

Another test run with MDSOPT="--mkfsoptions='-O large_xattr -J size=1024'" also passed:

+ /usr/bin/lfs setstripe /mnt/lustre/d0.ior.ssf -c -1
+ /usr/bin/lfs getstripe -d /mnt/lustre/d0.ior.ssf
stripe_count: -1 stripe_size: 1048576 stripe_offset: -1
+ /usr/bin/IOR -a POSIX -C -g -b 1g -o /mnt/lustre/d0.ior.ssf/iorData -t 4m -v -e -w -r -i 5 -k

More tests passed:

# ls -l /mnt/lustre/
total 0
# lfs setstripe -c 224 /mnt/lustre/file
# lfs getstripe -i -c -s /mnt/lustre/file
lmm_stripe_count: 224
lmm_stripe_size: 1048576
lmm_stripe_offset: 133
# yes | dd bs=1024 count=1048576 of=/mnt/lustre/file
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 1288.4 s, 833 kB/s
# lfs getstripe -i -c -s /mnt/lustre/file
lmm_stripe_count: 224
lmm_stripe_size: 1048576
lmm_stripe_offset: 133
# mkdir /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
# lfs setstripe -c 224 /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count: 224 stripe_size: 1048576 stripe_offset: -1
# touch /mnt/lustre/dir/file
# lfs getstripe -i -c -s /mnt/lustre/dir/file
lmm_stripe_count: 224
lmm_stripe_size: 1048576
lmm_stripe_offset: 189
# yes | dd bs=1024 count=1048576 of=/mnt/lustre/dir/file
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 1359.48 s, 790 kB/s
# lfs getstripe -i -c -s /mnt/lustre/dir/file
lmm_stripe_count: 224
lmm_stripe_size: 1048576
lmm_stripe_offset: 189
# lfs getstripe -d /mnt/lustre/dir
stripe_count: 224 stripe_size: 1048576 stripe_offset: -1 |
| Comment by James A Simmons [ 29/Apr/13 ] |
|
Sorry about the confusion with this ticket. When I created this ticket for our first test shot, this problem was only observed during our hero wide stripe test with 367 OSTs at the time. After that test shot I opened this ticket and prepared a scaling job that creates directories with powers-of-two stripe counts. For the second test shot we ran this scaling job and discovered that the failure happened at around 128 stripes, which is below the old 160-stripe limit. For this last test shot we again saw this problem not only at larger stripe counts (128 stripes again) but also for a single shared file that was striped across 4 OSTs. This shared file was being written to by 18K nodes. So I don't think it is a general wide-stripe problem we are seeing, but some other issue. We thought it might have been a grant issue, since the OSTs are only 250 GB in size, but Oleg told me during LUG that this is unlikely to be the case. P.S |
| Comment by Andreas Dilger [ 29/Apr/13 ] |
|
James, is it possible that earlier in your testing some OSTs were filled up? |
| Comment by James A Simmons [ 01/May/13 ] |
|
For the last test shot we had to reformat the file system due to the changes in the fid format. After mounting the file system I always run the large stripe job first. |
| Comment by James Nunez (Inactive) [ 18/Feb/14 ] |
|
James, have you run this large stripe job recently and, if so, are you still seeing this problem? Thanks, |
| Comment by James A Simmons [ 19/Feb/14 ] |
|
The problem was that large_xattr was not set on the MDS. That has been resolved. What is not resolved is that when large stripe support is not enabled, the default maximum is not LOV_MAX_STRIPE (160) but something less, due to changes in the data being sent over the wire. |
| Comment by James Nunez (Inactive) [ 10/Apr/14 ] |
|
James, The patch for Thank you. |
| Comment by James A Simmons [ 06/May/14 ] |
|
Excellent news. The patch from |
| Comment by Peter Jones [ 06/May/14 ] |
|
That is excellent news - thanks James! |