[LU-4340] conf-sanity test_69: error: File too large Created: 03/Dec/13 Updated: 28/Mar/17 Resolved: 28/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
client and server: lustre-master build #1783 RHEL6.4 ldiskfs |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11869 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/77c48f3e-59f3-11e3-98fc-52540035b04c. The sub-test test_69 failed with the following error:
test log shows: CMD: client-16vm3 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_last_id
- created 10000 (time 1385749407.25 total 47.32 last 47.32)
- created 20000 (time 1385749454.94 total 95.01 last 47.69)
- created 30000 (time 1385749503.05 total 143.12 last 48.11)
- created 40000 (time 1385749551.55 total 191.62 last 48.50)
open(/mnt/lustre/d0.conf-sanity/d69/f.conf-sanity.69-49787) error: File too large
total: 49787 creates in 240.39 seconds: 207.11 creates/second
stop ost1 service on client-16vm4
CMD: client-16vm4 grep -c /mnt/ost1' ' /proc/mounts
Stopping /mnt/ost1 (opts:-f) on client-16vm4
CMD: client-16vm4 umount -d -f /mnt/ost1
CMD: client-16vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: client-16vm4 grep -c /mnt/ost1' ' /proc/mounts
CMD: client-16vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: client-16vm4 mkfs.lustre --mgsnode=client-16vm3@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-E lazy_itable_init\" --reformat --replace /dev/lvm-Role_OSS/P1
Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x42
(OST update )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=10.10.4.122@tcp sys.timeout=20
device size = 2048MB
formatting backing filesystem ldiskfs on /dev/lvm-Role_OSS/P1
target name lustre-OST0000
4k blocks 50000
options -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-OST0000 -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F /dev/lvm-Role_OSS/P1 50000
Writing CONFIGS/mountdata
start ost1 service on client-16vm4
CMD: client-16vm4 mkdir -p /mnt/ost1
CMD: client-16vm4 test -b /dev/lvm-Role_OSS/P1
Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: client-16vm4 mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: client-16vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \"all -lnet -lnd -pinger\" 4
CMD: client-16vm4 e2label /dev/lvm-Role_OSS/P1 2>/dev/null
Started lustre-OST0000
CMD: client-16vm3 /usr/sbin/lctl get_param -n version
CMD: client-16vm3 /usr/sbin/lctl get_param -n version
CMD: client-16vm3 lctl list_param osc.lustre-OST*-osc > /dev/null 2>&1
CMD: client-16vm3 lctl get_param -n at_min
can't get osc.lustre-OST0000-osc-MDT0000.ost_server_uuid by list_param in 40 secs
Go with osc.lustre-OST0000-osc-MDT0000.ost_server_uuid directly
CMD: client-16vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state FULL osc.lustre-OST0000-osc-MDT0000.ost_server_uuid 40
client-16vm3: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
touch: cannot touch `/mnt/lustre/d0.conf-sanity/d69/f.conf-sanity.69-last': File too large
conf-sanity test_69: @@@@@@ FAIL: create file after reformat
|
| Comments |
| Comment by Oleg Drokin [ 09/Dec/13 ] |
|
This is probably like a similar issue. Is the test running out of space somehow? |
| Comment by Sarah Liu [ 11/Dec/13 ] |
|
Found the following errors in the MDS dmesg; rc = -28 means no space (ENOSPC):
LustreError: 17127:0:(osp_precreate.c:481:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
LustreError: 17109:0:(lod_qos.c:946:lod_alloc_specific()) can't lstripe objid [0x240000bd0:0xc27f:0x0]: have 0 want 1
LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.4.123@tcp, operation ost_destroy failed with -107.
LustreError: Skipped 3 previous similar messages |
| Comment by Jian Yu [ 21/Dec/13 ] |
|
Lustre Tag: 2.4.2 RC1
The same failure occurred: |
| Comment by Jian Yu [ 23/Dec/13 ] |
|
Lustre Tag: 2.4.2 RC2
The same failure occurred: |
| Comment by Jian Yu [ 06/Jan/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
The same failure occurred: |
| Comment by Sarah Liu [ 27/Jan/14 ] |
|
also seen in interop test between 2.5 client and 2.6 server: https://maloo.whamcloud.com/test_sets/fa9b2594-8333-11e3-a5fa-52540035b04c |
| Comment by Jian Yu [ 13/Feb/14 ] |
|
More failure instances on the master branch: |
| Comment by Jian Yu [ 19/Feb/14 ] |
|
This failure kept occurring on master branch in full group test sessions and in review group test sessions (with SLOW=yes). |
| Comment by James Nunez (Inactive) [ 23/May/14 ] |
|
I'm hitting this "File too large" error and there is enough space on OST0 and the MDS, but I've used up all my inodes:
# lfs df -i
UUID                   Inodes  IUsed  IFree IUse% Mounted on
lscratch-MDT0000_UUID  100000  49642  50358   50% /lustre/scratch[MDT:0]
lscratch-MDT0001_UUID  100000    201  99799    0% /lustre/scratch[MDT:1]
lscratch-MDT0002_UUID  100000    201  99799    0% /lustre/scratch[MDT:2]
lscratch-MDT0003_UUID  100000    201  99799    0% /lustre/scratch[MDT:3]
lscratch-OST0000_UUID   50016  50016      0  100% /lustre/scratch[OST:0]
filesystem summary:    400000  50245 349755   13% /lustre/scratch
We use the small OST and MDS size for all tests in conf-sanity. So, why doesn't this error always occur? From conf-sanity:
# use small MDS + OST size to speed formatting time
# do not use too small MDSSIZE/OSTSIZE, which affect the default jouranl size
# STORED_MDSSIZE is used in test_18
STORED_MDSSIZE=$MDSSIZE
STORED_OSTSIZE=$OSTSIZE
MDSSIZE=200000
OSTSIZE=200000
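For anyone triaging a similar run, the block vs. inode distinction can be confirmed directly from a client; a minimal sketch (the mount point is the one from the output above):
    lfs df /lustre/scratch      # per-target block usage: plenty of free space here
    lfs df -i /lustre/scratch   # per-target inode usage: OST0000 is at 100% IUse%
If the block side looks healthy but IFree on the single OST is zero, the create failures above are expected. |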
| Comment by Sarah Liu [ 15/Jul/14 ] |
|
Hit the error again in b2_6 build #2 (2.6.0-RC2) with MDSCOUNT=1:
https://testing.hpdd.intel.com/test_sets/181e5ee6-0c47-11e4-b749-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 17/Nov/14 ] |
|
Hit this bug on master, tag 2.6.90. Test results at: https://testing.hpdd.intel.com/test_sessions/7f927c9a-6ccf-11e4-a452-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 17/Dec/14 ] |
|
I experienced this bug with lustre-master tag 2.6.91 build # 2771. Results are at https://testing.hpdd.intel.com/test_sets/f326bd9e-8618-11e4-ac52-5254006e85c2 |
| Comment by Jian Yu [ 14/Jan/15 ] |
|
More failure instances on master branch: |
| Comment by James Nunez (Inactive) [ 06/Mar/15 ] |
|
Hit this issue with 2.7.0-RC4. Results are at: https://testing.hpdd.intel.com/test_sessions/193dce6a-c42f-11e4-a0ef-5254006e85c2 |
| Comment by Gerrit Updater [ 03/Jul/15 ] |
|
Ashish Purkar (ashish.purkar@seagate.com) uploaded a new patch: http://review.whamcloud.com/15487 |
| Comment by Gerrit Updater [ 19/Jul/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15487/ |
| Comment by Peter Jones [ 19/Jul/15 ] |
|
Landed for 2.8 |
| Comment by Jian Yu [ 28/Jul/15 ] |
|
The failure occurred consistently on master branch: |
| Comment by Gerrit Updater [ 12/Aug/15 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/15966 |
| Comment by James Nunez (Inactive) [ 14/Aug/15 ] |
|
We ran conf-sanity five times in a row to see if we could reproduce this error on master (pre-2.8). The logs are at https://testing.hpdd.intel.com/test_sessions/b3556da0-41e3-11e5-9e18-5254006e85c2.
Looking back at previous failures of conf-sanity test 69, the createmany routine fails with two different errors: 'File too large' and 'No space left on device' (error codes 27 and 28). From createmany, this is the part of the code that is failing:
        if (do_open) {
                int fd = open(filename, O_CREAT|O_RDWR, 0644);
                if (fd < 0) {
                        printf("open(%s) error: %s\n", filename, strerror(errno));
                        rc = errno;
                        break;
                }
In the recent debug tests I ran, I only hit the "No space left on device" error:
- created 10000 (time 1439485515.71 total 61.79 last 61.79)
- created 20000 (time 1439485577.47 total 123.55 last 61.76)
- created 30000 (time 1439485640.13 total 186.21 last 62.66)
- created 40000 (time 1439485702.88 total 248.96 last 62.76)
open(/mnt/lustre/d69.conf-sanity/f69.conf-sanity-49585) error: No space left on device
total: 49585 creates in 310.77 seconds: 159.56 creates/second
In test 69, before we start creating files, the file system capacity looks like ('lfs df' and 'lfs df -i' output):
UUID                 1K-blocks   Used Available Use% Mounted on
lustre-MDT0000_UUID     133560   1752    122084   1% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID     133560   1604    122232   1% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID     133560   1608    122228   1% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID     133560   1608    122228   1% /mnt/lustre[MDT:3]
lustre-OST0000_UUID     171080   9812    150848   6% /mnt/lustre[OST:0]
filesystem summary:     171080   9812    150848   6% /mnt/lustre
UUID                 Inodes IUsed  IFree IUse% Mounted on
lustre-MDT0000_UUID  100000   250  99750    0% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID  100000   230  99770    0% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID  100000   230  99770    0% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID  100000   230  99770    0% /mnt/lustre[MDT:3]
lustre-OST0000_UUID   50016   463  49553    1% /mnt/lustre[OST:0]
filesystem summary:   50493   940  49553    2% /mnt/lustre
So, there are no space issues before the test is run. NOTE: As previously pointed out, for conf-sanity, the sizes of the MDT and OSTs are reduced to 200000.
After creating files, the file system capacity looks like ('lfs df' and 'lfs df -i' output):
UUID                 1K-blocks   Used Available Use% Mounted on
lustre-MDT0000_UUID     133560   7536    116312   6% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID     133560   1604    122232   1% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID     133560   1608    122228   1% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID     133560   1608    122228   1% /mnt/lustre[MDT:3]
lustre-OST0000_UUID     171080  10964    149696   7% /mnt/lustre[OST:0]
filesystem summary:     171080  10964    149696   7% /mnt/lustre
UUID                 Inodes IUsed  IFree IUse% Mounted on
lustre-MDT0000_UUID  100000 49836  50164   50% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID  100000   230  99770    0% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID  100000   230  99770    0% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID  100000   230  99770    0% /mnt/lustre[MDT:3]
lustre-OST0000_UUID   50016 50016      0  100% /mnt/lustre[OST:0]
filesystem summary:   50526 50526      0  100% /mnt/lustre
Thus, we are running out of inodes on OST0. The test currently tries to write 3 * OST_MAX_PRECREATE files, where OST_MAX_PRECREATE = 20000, with the stated goal of making LAST_ID go over 1.5 * OST_MAX_PRECREATE. From the error messages of the failed tests, we know that the test fails after a little over 49,000 files. So, wouldn't changing the number of files to create to 2 * OST_MAX_PRECREATE solve this issue and still achieve the goal of making LAST_ID over 1.5 * OST_MAX_PRECREATE?
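A rough sketch of that change (not the landed patch; the createmany invocation and variable names follow the usual conf-sanity/test-framework conventions and are assumed here):
    OST_MAX_PRECREATE=20000                # value cited in this ticket
    num_create=$((OST_MAX_PRECREATE * 2))  # 40000 creates < ~50,016 inodes on OST0000
    # still drives LAST_ID past 1.5 * OST_MAX_PRECREATE without exhausting the OST
    createmany -o $DIR/$tdir/$tfile- $num_create ||
            error "createmany: failed to create $num_create files"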
If we limit the number of files to be created to 2*OST_MAX_PRECREATE, the test should pass and does pass. Logs are at: https://testing.hpdd.intel.com/test_sessions/d8222034-425a-11e5-af45-5254006e85c2 |
| Comment by Andreas Dilger [ 18/Aug/15 ] |
|
I also figured out where the -EFBIG (-27) vs -ENOSPC difference was coming from. That is returned by lod_alloc_specific() when a specific layout is requested but there are no OST objects available. Normally that makes sense because a specific layout specifies the stripe count, and if that cannot be satisfied then the file may grow too large for the available number of stripes. In the case of this test, there is only one stripe but the directory specifies it must be on OST0000, so it triggers this condition:
        /* If we were passed specific striping params, then a failure to
         * meet those requirements is an error, since we can't reallocate
         * that memory (it might be part of a larger array or something).
         */
        CERROR("can't lstripe objid "DFID": have %d want %u\n",
               PFID(lu_object_fid(lod2lu_obj(lo))), stripe_num,
               lo->ldo_stripenr);
        rc = stripe_num == 0 ? -ENOSPC : -EFBIG;
Previously it always returned -EFBIG, but I changed it in http://review.whamcloud.com/12937 |
| Comment by Gerrit Updater [ 28/Aug/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/15966/ |
| Comment by Andreas Dilger [ 28/Aug/15 ] |
|
Patch landed to master for 2.8.0. |