[LU-2963] fail to create large stripe count file with -ENOSPC error Created: 14/Mar/13  Updated: 06/May/14  Resolved: 06/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Jian Yu
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File hostlist.sh     File ior-lsc.pbs     File new-build.sh     File new-lustre-start.sh     File testfs-barry-224.conf    
Issue Links:
Related
is related to LU-4791 lod_ah_init() ASSERTION( lc->ldo_stri... Resolved
Severity: 3
Rank (Obsolete): 7227

 Comments   
Comment by James A Simmons [ 14/Mar/13 ]

During testing we did some runs creating directories of increasing stripe count. What we discovered was that, at around 128 stripes, the files being created would fail with -ENOSPC no matter what size the file was. The same test was also done with 1.8 clients and we saw no such problems. Also, during the runs with the 2.4 clients, attempts to do an lfs getstripe on the large stripe count directory would lock up.
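
For anyone trying to reproduce this, a minimal sketch of the kind of sequence involved (the mount point and directory name are illustrative, not from our actual job):

mkdir /mnt/lustre/wide128
lfs setstripe -c 128 /mnt/lustre/wide128     # directory with a 128-stripe default layout
touch /mnt/lustre/wide128/testfile           # file creation here is what returned -ENOSPC
lfs getstripe /mnt/lustre/wide128            # this is the call that hung on the 2.4 clients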

Comment by James A Simmons [ 14/Mar/13 ]

Here is a log from the MDS from when I attempted to use lfs getstripe and it hung.

Mar 8 21:12:54 widow-mds1 kernel: [27048.704079] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3167@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f748600/0x4453d516b35bd568 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3167@gni remote: 0x1f865dc509f72511 expref: 11 pid: 28552 timeout: 4321715043 lvb_type: 0
Mar 8 21:12:54 widow-mds1 kernel: [27048.807514] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 3160@gni ns: mdt-ffff880e55b20000 lock: ffff880c91423800/0x4453d516b35bd481 lrc: 3/0,0 mode: CR/CR res: 8589939214/48084 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 3160@gni remote: 0x13f59fd0192bc693 expref: 9 pid: 28131 timeout: 4321715043 lvb_type: 0
Mar 8 21:19:15 widow-mds1 kernel: [27429.704085] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 4524@gni ns: mdt-ffff880e55b20000 lock: ffff880cecfe8800/0x4453d516b35c5e97 lrc: 3/0,0 mode: CR/CR res: 8589937079/12 bits 0x9 rrc: 36 type: IBT flags: 0x200000000020 nid: 4524@gni remote: 0x3f5d4900776bce4c expref: 16 pid: 28093 timeout: 4322096096 lvb_type: 0
Mar 8 21:19:15 widow-mds1 kernel: [27429.808017] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 11 previous similar messages
Mar 8 21:25:39 widow-mds1 kernel: [27813.704166] LustreError: 49:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 384s: evicting client at 3158@gni ns: mdt-ffff880e55b20000 lock: ffff880b58904000/0x4453d516b35ca546 lrc: 3/0,0 mode: CR/CR res: 8589936085/14 bits 0x9 rrc: 5 type: IBT flags: 0x200000000020 nid: 3158@gni remote: 0x72cd27e494baa553 expref: 16 pid: 28882 timeout: 4322480792 lvb_type: 0
Mar 8 21:25:39 widow-mds1 kernel: [27813.808394] LustreError: 49:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 34 previous similar messages
Mar 8 21:31:39 widow-mds1 kernel: [28173.189536] Lustre: routed1-MDT0000: haven't heard from client 61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib) in 1366 seconds. I think it's dead, and I am evicting it. exp ffff880d79545c00, cur 1362796299 expire 1362795399 last 1362794933
Mar 8 21:31:39 widow-mds1 kernel: [28173.251441] Lustre: Skipped 4 previous similar messages
Mar 8 21:47:52 widow-mds1 kernel: [29146.704092] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 1626@gni ns: mdt-ffff880e55b20000 lock: ffff880d2c019600/0x4453d516b365e418 lrc: 3/0,0 mode: CR/CR res: 8589939214/48110 bits 0x9 rrc: 14 type: IBT flags: 0x200000000020 nid: 1626@gni remote: 0xdbdb318b1cc3700e expref: 10 pid: 29051 timeout: 4323813395 lvb_type: 0
Mar 8 21:47:52 widow-mds1 kernel: [29146.809942] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 3 previous similar messages
Mar 8 21:54:13 widow-mds1 kernel: [29527.704103] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 379s: evicting client at 4582@gni ns: mdt-ffff880e55b20000 lock: ffff880b80052400/0x4453d516b3666b48 lrc: 3/0,0 mode: CR/CR res: 8589937079/16 bits 0x9 rrc: 37 type: IBT flags: 0x200000000020 nid: 4582@gni remote: 0xbe8a1154e07f5e4f expref: 14 pid: 28963 timeout: 4324194184 lvb_type: 0
Mar 8 21:54:13 widow-mds1 kernel: [29527.810409] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 12 previous similar messages
Mar 8 22:01:07 widow-mds1 kernel: [29941.704087] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 414s: evicting client at 4581@gni ns: mdt-ffff880e55b20000 lock: ffff880c3f697400/0x4453d516b366b0bc lrc: 3/0,0 mode: CR/CR res: 8589944660/2 bits 0x9 rrc: 11 type: IBT flags: 0x200000000020 nid: 4581@gni remote: 0x5cd35be174c032ea expref: 9 pid: 28346 timeout: 4324608334 lvb_type: 0
Mar 8 22:01:07 widow-mds1 kernel: [29941.810653] LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 35 previous similar messages

Comment by Peter Jones [ 15/Mar/13 ]

Minh

Could you please see whether you are able to reproduce this?

Thanks

Peter

Comment by Minh Diep [ 18/Mar/13 ]

Hi James,

You said you created directories with increasing stripe counts. Then you said that at around 128 stripes, "file" creation failed. Did you try to create files under those directories after the directories were created?

Comment by Minh Diep [ 19/Mar/13 ]

Due to LU-2845 I haven't been able to set up a wide-striping cluster with 200 OSTs from 2 OSSes.

Comment by Minh Diep [ 20/Mar/13 ]

Hi James,

I have set up 2 OSSes with 300 OSTs each but could not reproduce the problem you are seeing. Could you please try the latest build on lustre-master, since many fixes have gone in lately.

Thanks

Comment by James A Simmons [ 20/Mar/13 ]

Setting up a smaller scale test system to reproduce.

Comment by James A Simmons [ 21/Mar/13 ]

While attempting to set up a system to reproduce this I ran into LU-3004, which is blocking me.

Comment by James A Simmons [ 22/Mar/13 ]

Managed to get a 224 stripe count system up. So far I haven't been able to reproduce the problem. Please don't close the ticket until our next test shot in the latter part of April when we can make sure that this is fixed.

Comment by Minh Diep [ 22/Mar/13 ]

James, just curious, what is the memory size on your OSS? I am hitting OOM when I try to mount 300 OSTs, stopping at around 241.

Comment by James A Simmons [ 22/Mar/13 ]

I originally tried 448 OSTs but hit an OOM as well. I reduced mine to 224 and that worked for me. The OSS I'm working with has 16 GB of RAM.

Comment by Peter Jones [ 25/Mar/13 ]

Dropping the priority as we are unable to reproduce this issue on the latest master. Will raise the priority again if it recurs.

Comment by James A Simmons [ 02/Apr/13 ]

Good news and bad news. I can now duplicate this problem with the latest 2.3.63. What are the best debug settings to track down this problem?

Comment by Minh Diep [ 02/Apr/13 ]

James, could you provide a Lustre debug log with debug=-1? How long does it take to reproduce this?
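
For reference, a typical way to capture such a log on a node is something like the following (the output path is illustrative):

lctl clear                                   # empty the kernel debug buffer
lctl set_param debug=-1                      # enable all debug flags
# ... reproduce the failure ...
lctl dk /tmp/lustre-log.$(hostname).txt      # dump the debug buffer to a file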

Comment by James A Simmons [ 02/Apr/13 ]

With my IOR job it takes less than a minute to reproduce. I'm going to set up the debug script now.

Comment by James A Simmons [ 02/Apr/13 ]

Uploaded one client's log to ftp.whamcloud.com/uploads/LU-2963/lustre-log.c0-0c0s7n0.txt

I have more client logs if you need them.

Comment by Minh Diep [ 02/Apr/13 ]

How many OSSes and OSTs? How many clients did you use for IOR? I'd like to try to reproduce this in the lab.

Comment by James A Simmons [ 02/Apr/13 ]

Here are my scripts to create a file system with the config (testfs-barry-224.conf). new-build.sh formats the file system and new-lustre-start.sh mounts it. hostlist.sh is used to handle the pdsh-format listing of the devices and servers. I also attached the job script I used to run my IOR job.

Comment by James A Simmons [ 02/Apr/13 ]

4 OSSes, each with 56 OSTs, giving a total of 224 OSTs. This is an LVM setup where each OSS has 7 real OST devices. On the client side I ran IOR across 18 Cray compute nodes. I attached all my setup scripts and the job script I ran with. You will need to adapt them for your system.
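
As a rough illustration of that kind of LVM carve-up (the device, volume group, filesystem name and MGS NID below are made up, not the values from the attached scripts):

pvcreate /dev/sdb
vgcreate ostvg /dev/sdb
for i in $(seq 0 7); do                      # 8 logical OSTs from one real device
    lvcreate -L 250G -n ost$i ostvg          # size illustrative
    mkfs.lustre --ost --fsname=testfs --index=$i \
        --mgsnode=192.168.1.1@o2ib /dev/ostvg/ost$i
done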

Comment by James A Simmons [ 05/Apr/13 ]

Any updates?

Comment by Minh Diep [ 05/Apr/13 ]

Sorry, no. I hit LU-3102 when I tried to reproduce. If it's easy for you to reproduce, could you upload the server and client debug logs (with debug=-1)?

Comment by James A Simmons [ 08/Apr/13 ]

I have a client log uploaded already at ftp.whamcloud.com/uploads/LU-2963/lustre-log.c0-0c0s7n0.txt. I can get the server logs as well.

Comment by Minh Diep [ 08/Apr/13 ]

Yes, please upload the server logs too. Thanks.

Comment by James A Simmons [ 08/Apr/13 ]

I need to build a new large stripe count file system for other tests, so I will get you new logs.

Comment by James A Simmons [ 11/Apr/13 ]

I rebased to the latest master and now I can't reproduce this bug. I have a feeling some of the layout patches that were merged on the 8th fixed this issue. Please leave this ticket open until after our test shot, which will take place tomorrow.

Comment by James A Simmons [ 22/Apr/13 ]

During the last test shot we encountered this bug again. This time we got logs from the clients and servers. I uploaded all the logs to

ftp.whamcloud.com/uploads/LU-2963/testshot-4-12-2013

Comment by Peter Jones [ 22/Apr/13 ]

Yu, Jian

Could you please review this latest information from ORNL?

Thanks

Peter

Comment by Andreas Dilger [ 26/Apr/13 ]

Sorry, I haven't looked at the logs yet. My gut feeling is that the -ENOSPC is being returned from the server, either from the journal layer or from the xattrs, since those are the few places affected by a growing stripe count.

James, first question - is the "large_xattr" feature enabled on your MDS? This is still not being enabled by default (see http://review.whamcloud.com/4315), since the patch has not been accepted upstream yet, and it makes sense to limit the feature exposure to sites that actually need it. This can be set any time during formatting (via mkfsoptions) or after (via tune2fs) with "-O large_xattr".
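
For example, something along these lines (the device path is illustrative; either form enables the ldiskfs large_xattr feature on the MDT):

# at format time, passed through mkfs.lustre to ldiskfs:
mkfs.lustre --reformat --mgs --mdt --fsname=testfs --index=0 \
    --mkfsoptions='-O large_xattr' /dev/mdtdev
# or afterwards, on the unmounted MDT device:
tune2fs -O large_xattr /dev/mdtdev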

Comment by Di Wang [ 26/Apr/13 ]

James: I just checked the debug logs, but I did not find the MDS log there. Just to confirm: the bug you hit in the last test is still -ENOSPC when you try to create a file with 224 stripes?

Comment by Jian Yu [ 27/Apr/13 ]

Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1441/
Distro/Arch: RHEL6.4/x86_64 (kernel version: 2.6.32-358.2.1.el6)
Network: TCP (1GigE)
OSSCOUNT=4
OSTCOUNT=224 (with 56 OSTs per OSS)

MDSOPT="--mkfsoptions='-O large_xattr'"

The parallel-scale test iorssf passed with 224 OSTs:
https://maloo.whamcloud.com/test_sessions/ce2253de-af21-11e2-8f8e-52540035b04c

As per run_ior() in lustre/tests/functions.sh, "$LFS setstripe $testdir -c -1" was performed before running the IOR command.

Another test run with MDSOPT="--mkfsoptions='-O large_xattr -J size=1024'" also passed:
https://maloo.whamcloud.com/test_sessions/94998110-af57-11e2-8f8e-52540035b04c

+ /usr/bin/lfs setstripe /mnt/lustre/d0.ior.ssf -c -1
+ /usr/bin/lfs getstripe -d /mnt/lustre/d0.ior.ssf
stripe_count:   -1 stripe_size:    1048576 stripe_offset:  -1 
+ /usr/bin/IOR -a POSIX -C -g -b 1g -o /mnt/lustre/d0.ior.ssf/iorData -t 4m -v -e -w -r -i 5 -k

More tests passed:

# ls -l /mnt/lustre/
total 0
# lfs setstripe -c 224 /mnt/lustre/file
# lfs getstripe -i -c -s /mnt/lustre/file
lmm_stripe_count:   224
lmm_stripe_size:    1048576
lmm_stripe_offset:  133
# yes | dd bs=1024 count=1048576 of=/mnt/lustre/file
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 1288.4 s, 833 kB/s
# lfs getstripe -i -c -s /mnt/lustre/file
lmm_stripe_count:   224
lmm_stripe_size:    1048576
lmm_stripe_offset:  133

# mkdir /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1
# lfs setstripe -c 224 /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count:   224 stripe_size:    1048576 stripe_offset:  -1
# touch /mnt/lustre/dir/file
# lfs getstripe -i -c -s /mnt/lustre/dir/file
lmm_stripe_count:   224
lmm_stripe_size:    1048576
lmm_stripe_offset:  189
# yes | dd bs=1024 count=1048576 of=/mnt/lustre/dir/file
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 1359.48 s, 790 kB/s
# lfs getstripe -i -c -s /mnt/lustre/dir/file
lmm_stripe_count:   224
lmm_stripe_size:    1048576
lmm_stripe_offset:  189
# lfs getstripe -d /mnt/lustre/dir
stripe_count:   224 stripe_size:    1048576 stripe_offset:  -1

Comment by James A Simmons [ 29/Apr/13 ]

Sorry about the confusion with this ticket. When I created this ticket, after our first test shot, the problem had only been observed during our hero wide stripe test, which used 367 OSTs at the time. After that test shot I opened this ticket and prepared a scaling job that creates directories with powers-of-two stripe counts. For the second test shot we ran this scaling job and discovered that the failure happened at around 128 stripes, which is below the old 160-stripe limit. For this last test shot we again saw the problem, not only at larger stripe counts (128 stripes again) but also for a single shared file that was striped across only 4 OSTs. This shared file was being written to by 18K nodes. So I don't think it is a general wide stripe problem we are seeing but some other issue. We thought it might have been a grant issue, since the OSTs are only 250 GB in size, but Oleg told me during LUG that this is unlikely to be the case.

P.S. I can't seem to find the MDS ldump. Will talk to the admin tomorrow.

Comment by Andreas Dilger [ 29/Apr/13 ]

James,
this ENOSPC problem may only be related to your test configuration, if there are individual OSTs that are out of space for some reason. Creating a file with specific striping will fail if it can't allocate at least 3/4 of the requested stripes (some margin is allowed so that applications don't get failures when a small number of OSTs are offline).

Is it possible that some OSTs were filled up earlier in your testing?
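
A quick way to check this from a client is per-OST usage (the mount point is illustrative):

lfs df -h /mnt/lustre     # per-OST block usage
lfs df -i /mnt/lustre     # per-OST inode usage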

Comment by James A Simmons [ 01/May/13 ]

For the last test shot we had to reformat the file system due to the changes in the fid format. After mounting the file system I always run the large stripe job first.

Comment by James Nunez (Inactive) [ 18/Feb/14 ]

James,

Have you run this large stripe job recently and, if so, are you still seeing this problem?

Thanks,
James

Comment by James A Simmons [ 19/Feb/14 ]

The problem was that large_xattr was not set on the MDS. That has been resolved. What is not resolved is that, when large_xattr is not enabled, the default LOV_MAX_STRIPE limit is not 160 but something less, due to changes in the data being sent over the wire.
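
For rough context only, a back-of-the-envelope sketch (the sizes are my assumptions, not from this ticket: a 32-byte lov_mds_md_v1 header plus 24 bytes per stripe object in a 4 KB buffer; any additional per-reply data shrinks the count further):

echo $(( (4096 - 32) / 24 ))    # -> 169 stripe objects fit before any other overhead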

Comment by James Nunez (Inactive) [ 10/Apr/14 ]

James,

The patch for LU-4791 has landed to master and there is a b2_4 patch available that has not landed yet. If you are able, please test with the LU-4791 patch and see if it fixes this issue.

Thank you.

Comment by James A Simmons [ 06/May/14 ]

Excellent news. The patch from LU-4791 fixes this issue. You can close this ticket.

Comment by Peter Jones [ 06/May/14 ]

That is excellent news - thanks James!
