[LU-4340] conf-sanity test_69: error: File too large Created: 03/Dec/13  Updated: 28/Mar/17  Resolved: 28/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

client and server: lustre-master build #1783 RHEL6.4 ldiskfs


Issue Links:
Duplicate
is duplicated by LU-4490 Test failure conf-sanity test_69: cre... Resolved
is duplicated by LU-4338 Failure on test suite conf-sanity tes... Closed
is duplicated by LU-4339 Failure on test suite conf-sanity tes... Closed
Related
is related to LU-4204 typo in new conf-sanity subtest Resolved
is related to LU-5246 Failure on test suite sanity test_220... Resolved
is related to LU-8158 conf-sanity test_69: create file afte... Open
is related to LU-6123 conf-sanity test_72: FAIL: mount clie... Resolved
Severity: 3
Rank (Obsolete): 11869

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/77c48f3e-59f3-11e3-98fc-52540035b04c.

The sub-test test_69 failed with the following error:

create file after reformat

test log shows:

CMD: client-16vm3 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_last_id
 - created 10000 (time 1385749407.25 total 47.32 last 47.32)
 - created 20000 (time 1385749454.94 total 95.01 last 47.69)
 - created 30000 (time 1385749503.05 total 143.12 last 48.11)
 - created 40000 (time 1385749551.55 total 191.62 last 48.50)
open(/mnt/lustre/d0.conf-sanity/d69/f.conf-sanity.69-49787) error: File too large
total: 49787 creates in 240.39 seconds: 207.11 creates/second
stop ost1 service on client-16vm4
CMD: client-16vm4 grep -c /mnt/ost1' ' /proc/mounts
Stopping /mnt/ost1 (opts:-f) on client-16vm4
CMD: client-16vm4 umount -d -f /mnt/ost1
CMD: client-16vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: client-16vm4 grep -c /mnt/ost1' ' /proc/mounts
CMD: client-16vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: client-16vm4 mkfs.lustre --mgsnode=client-16vm3@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-E lazy_itable_init\" --reformat --replace /dev/lvm-Role_OSS/P1

   Permanent disk data:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x42
              (OST update )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=10.10.4.122@tcp sys.timeout=20

device size = 2048MB
formatting backing filesystem ldiskfs on /dev/lvm-Role_OSS/P1
	target name  lustre-OST0000
	4k blocks     50000
	options        -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-OST0000  -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F /dev/lvm-Role_OSS/P1 50000
Writing CONFIGS/mountdata
start ost1 service on client-16vm4
CMD: client-16vm4 mkdir -p /mnt/ost1
CMD: client-16vm4 test -b /dev/lvm-Role_OSS/P1
Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: client-16vm4 mkdir -p /mnt/ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: client-16vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \"all -lnet -lnd -pinger\" 4 
CMD: client-16vm4 e2label /dev/lvm-Role_OSS/P1 2>/dev/null
Started lustre-OST0000
CMD: client-16vm3 /usr/sbin/lctl get_param -n version
CMD: client-16vm3 /usr/sbin/lctl get_param -n version
CMD: client-16vm3 lctl list_param osc.lustre-OST*-osc             > /dev/null 2>&1
CMD: client-16vm3 lctl get_param -n at_min
can't get osc.lustre-OST0000-osc-MDT0000.ost_server_uuid by list_param in 40 secs
Go with osc.lustre-OST0000-osc-MDT0000.ost_server_uuid directly
CMD: client-16vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state FULL osc.lustre-OST0000-osc-MDT0000.ost_server_uuid 40 
client-16vm3: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
touch: cannot touch `/mnt/lustre/d0.conf-sanity/d69/f.conf-sanity.69-last': File too large
 conf-sanity test_69: @@@@@@ FAIL: create file after reformat 


 Comments   
Comment by Oleg Drokin [ 09/Dec/13 ]

This is probably like LU-2113: when we are out of space, we sometimes get 'File too large' as the error instead.

Is the test running out of space somehow?

Comment by Sarah Liu [ 11/Dec/13 ]

Found the following errors in the MDS dmesg; rc = -28 means no space (ENOSPC):

LustreError: 17127:0:(osp_precreate.c:481:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
LustreError: 17109:0:(lod_qos.c:946:lod_alloc_specific()) can't lstripe objid [0x240000bd0:0xc27f:0x0]: have 0 want 1
LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.4.123@tcp, operation ost_destroy failed with -107.
LustreError: Skipped 3 previous similar messages
Comment by Jian Yu [ 21/Dec/13 ]

Lustre Tag: 2.4.2 RC1
Lustre Client: CentOS 6.5/x86_64 (kernel version: 2.6.32-431.1.2.0.1.el6.x86_64)
Lustre Server: CentOS 6.4/x86_64 (kernel version: 2.6.32-358.23.2.el6_lustre.x86_64)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/5f44a050-6a2b-11e3-81c0-52540035b04c

Comment by Jian Yu [ 23/Dec/13 ]

Lustre Tag: 2.4.2 RC2
Lustre Client: CentOS 6.5/x86_64 (kernel version: 2.6.32-431.1.2.0.1.el6.x86_64)
Lustre Server: CentOS 6.4/x86_64 (kernel version: 2.6.32-358.23.2.el6_lustre.x86_64)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/c897cc3c-6b82-11e3-91a4-52540035b04c

Comment by Jian Yu [ 06/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
Lustre Client: CentOS 6.5/x86_64 (kernel version: 2.6.32-431.1.2.0.1.el6)
Lustre Server: CentOS 6.4/x86_64 (kernel version: 2.6.32-358.18.1.el6)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/7daa132e-768b-11e3-a081-52540035b04c

Comment by Sarah Liu [ 27/Jan/14 ]

Also seen in an interop test between a 2.5 client and a 2.6 server:

https://maloo.whamcloud.com/test_sets/fa9b2594-8333-11e3-a5fa-52540035b04c

Comment by Jian Yu [ 13/Feb/14 ]

More instances on the master branch:
https://maloo.whamcloud.com/test_sets/b9ee0148-9430-11e3-854a-52540035b04c
https://maloo.whamcloud.com/test_sets/86ec6c20-9666-11e3-a009-52540035b04c
https://maloo.whamcloud.com/test_sets/ce6c0350-98f8-11e3-968c-52540035b04c

Comment by Jian Yu [ 19/Feb/14 ]

This failure kept occurring on the master branch in full group test sessions and in review group test sessions (with SLOW=yes).

Comment by James Nunez (Inactive) [ 23/May/14 ]

I'm hitting this "File too large" error even though there is enough space on OST0 and the MDS; the problem is that all of the inodes on OST0 have been used up:

# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
lscratch-MDT0000_UUID      100000       49642       50358  50% /lustre/scratch[MDT:0]
lscratch-MDT0001_UUID      100000         201       99799   0% /lustre/scratch[MDT:1]
lscratch-MDT0002_UUID      100000         201       99799   0% /lustre/scratch[MDT:2]
lscratch-MDT0003_UUID      100000         201       99799   0% /lustre/scratch[MDT:3]
lscratch-OST0000_UUID       50016       50016           0 100% /lustre/scratch[OST:0]

filesystem summary:       400000       50245      349755  13% /lustre/scratch

We use small OST and MDS sizes for all tests in conf-sanity, so why doesn't this error always occur? From conf-sanity.sh:

# use small MDS + OST size to speed formatting time
# do not use too small MDSSIZE/OSTSIZE, which affect the default journal size
# STORED_MDSSIZE is used in test_18
STORED_MDSSIZE=$MDSSIZE
STORED_OSTSIZE=$OSTSIZE
MDSSIZE=200000
OSTSIZE=200000
Comment by Sarah Liu [ 15/Jul/14 ]

Hit the error again in b2_6 build # 2 (2.6.0-RC2)
server and client: RHEL6 ldiskfs

MDSCOUNT=1

https://testing.hpdd.intel.com/test_sets/181e5ee6-0c47-11e4-b749-5254006e85c2

Comment by James Nunez (Inactive) [ 17/Nov/14 ]

Hit this bug on master, tag 2.6.90. Test results at: https://testing.hpdd.intel.com/test_sessions/7f927c9a-6ccf-11e4-a452-5254006e85c2

Comment by James Nunez (Inactive) [ 17/Dec/14 ]

I experienced this bug with lustre-master tag 2.6.91 build # 2771. Results are at https://testing.hpdd.intel.com/test_sets/f326bd9e-8618-11e4-ac52-5254006e85c2

Comment by Jian Yu [ 14/Jan/15 ]

More failure instances on the master branch:
https://testing.hpdd.intel.com/test_sets/1261560a-9bdc-11e4-a352-5254006e85c2
https://testing.hpdd.intel.com/test_sets/3e12c382-9bec-11e4-afb8-5254006e85c2
https://testing.hpdd.intel.com/test_sets/79089440-9bbe-11e4-b679-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e379ea16-9b9d-11e4-9d4a-5254006e85c2

Comment by James Nunez (Inactive) [ 06/Mar/15 ]

Hit this issue with 2.7.0-RC4. Results are at: https://testing.hpdd.intel.com/test_sessions/193dce6a-c42f-11e4-a0ef-5254006e85c2

Comment by Gerrit Updater [ 03/Jul/15 ]

Ashish Purkar (ashish.purkar@seagate.com) uploaded a new patch: http://review.whamcloud.com/15487
Subject: LU-4340 tests: Fix test_69 of conf-sanity test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: abc7765b941fcebe7c489137af0e27d2212c71de

Comment by Gerrit Updater [ 19/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15487/
Subject: LU-4340 tests: Fix test_69 of conf-sanity test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 34f94efa4847ebd84b2fa42b7a0fc85bd7f6f8e3

Comment by Peter Jones [ 19/Jul/15 ]

Landed for 2.8

Comment by Jian Yu [ 28/Jul/15 ]

The failure occurred consistently on the master branch:
https://testing.hpdd.intel.com/test_sets/1f034d6c-34ea-11e5-be21-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7a5d164a-34ed-11e5-b875-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ba0dd0aa-34a6-11e5-a9b3-5254006e85c2
https://testing.hpdd.intel.com/test_sets/3aaaa0b4-3437-11e5-be70-5254006e85c2

Comment by Gerrit Updater [ 12/Aug/15 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/15966
Subject: LU-4340 tests: Adding debug to conf-sanity test 69
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cca19d41a6b39ae2020655923290939cbe21bade

Comment by James Nunez (Inactive) [ 14/Aug/15 ]

We ran conf-sanity five times in a row to see if we could reproduce this error on master (pre-2.8). The logs are at https://testing.hpdd.intel.com/test_sessions/b3556da0-41e3-11e5-9e18-5254006e85c2.

Looking back at previous failures of conf-sanity test 69, the createmany routine fails with two different errors: 'File too large' (EFBIG, errno 27) and 'No space left on device' (ENOSPC, errno 28).

This is the part of the createmany code that is failing:

		if (do_open) {
			int fd = open(filename, O_CREAT|O_RDWR, 0644);
			if (fd < 0) {
				printf("open(%s) error: %s\n", filename,
				       strerror(errno));
				rc = errno;
				break;
			}

In the recent debug tests I ran, I only hit the “No space left on device” error:

- created 10000 (time 1439485515.71 total 61.79 last 61.79)
- created 20000 (time 1439485577.47 total 123.55 last 61.76)
- created 30000 (time 1439485640.13 total 186.21 last 62.66)
- created 40000 (time 1439485702.88 total 248.96 last 62.76)
open(/mnt/lustre/d69.conf-sanity/f69.conf-sanity-49585) error: No space left on device
total: 49585 creates in 310.77 seconds: 159.56 creates/second

In test 69, before we start creating files, the file system capacity looks like (‘lfs df’ and ‘lfs df -i’ output):

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       133560        1752      122084   1% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       133560        1604      122232   1% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID       133560        1608      122228   1% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID       133560        1608      122228   1% /mnt/lustre[MDT:3]
lustre-OST0000_UUID       171080        9812      150848   6% /mnt/lustre[OST:0]

filesystem summary:       171080        9812      150848   6% /mnt/lustre

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       100000         250       99750   0% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       100000         230       99770   0% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID       100000         230       99770   0% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID       100000         230       99770   0% /mnt/lustre[MDT:3]
lustre-OST0000_UUID        50016         463       49553   1% /mnt/lustre[OST:0]

filesystem summary:        50493         940       49553   2% /mnt/lustre

So, there are no space issues before the test is run.

NOTE: As previously pointed out, for conf-sanity the sizes of the MDTs and OSTs are reduced to 200000 KB.

After creating files, the file system capacity looks like (‘lfs df’ and ‘lfs df -i’ output):

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       133560        7536      116312   6% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       133560        1604      122232   1% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID       133560        1608      122228   1% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID       133560        1608      122228   1% /mnt/lustre[MDT:3]
lustre-OST0000_UUID       171080       10964      149696   7% /mnt/lustre[OST:0]

filesystem summary:       171080       10964      149696   7% /mnt/lustre

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       100000       49836       50164  50% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       100000         230       99770   0% /mnt/lustre[MDT:1]
lustre-MDT0002_UUID       100000         230       99770   0% /mnt/lustre[MDT:2]
lustre-MDT0003_UUID       100000         230       99770   0% /mnt/lustre[MDT:3]
lustre-OST0000_UUID        50016       50016           0 100% /mnt/lustre[OST:0]

filesystem summary:        50526       50526           0 100% /mnt/lustre

Thus, we are running out of inodes on OST0.

The test currently tries to create 3 * OST_MAX_PRECREATE files, where OST_MAX_PRECREATE = 20000, with the stated goal of pushing LAST_ID over 1.5 * OST_MAX_PRECREATE. Since the OST is formatted with only 50,016 inodes, 60,000 creates will always exhaust them; from the error messages of the failed tests, we know the test fails after creating a little over 49,000 files. So, wouldn't changing the number of files to create to 2 * OST_MAX_PRECREATE (40,000) solve this issue and still achieve the goal of making LAST_ID over 1.5 * OST_MAX_PRECREATE?

If we limit the number of files created to 2 * OST_MAX_PRECREATE, the test should pass, and it does. Logs are at: https://testing.hpdd.intel.com/test_sessions/d8222034-425a-11e5-af45-5254006e85c2
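
For illustration, a minimal sketch of that kind of change inside test_69() (the variable names and exact createmany invocation below follow standard Lustre test-framework conventions and are my assumptions, not the text of the landed patch):

	# hypothetical sketch only: cap the create count at 2 * OST_MAX_PRECREATE
	local num_create=$((OST_MAX_PRECREATE * 2))	# previously 3 * OST_MAX_PRECREATE
	createmany -o $DIR/$tdir/$tfile- $num_create ||
		error "createmany of $num_create files failed"

With 40,000 creates against the 50,016 inodes on OST0, the OST is no longer exhausted, while LAST_ID still ends up well above 1.5 * OST_MAX_PRECREATE.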

Comment by Andreas Dilger [ 18/Aug/15 ]

I also figured out where the -EFBIG (-27) vs. -ENOSPC (-28) difference was coming from. The error is returned by lod_alloc_specific() when a specific layout is requested but there are no OST objects available. Normally that makes sense, because a specific layout specifies the stripe count, and if that cannot be satisfied then the file may grow too large for the available number of stripes. In the case of this test there is only one stripe, but the directory specifies that it must be on OST0000, so it triggers this condition:

        /* If we were passed specific striping params, then a failure to
         * meet those requirements is an error, since we can't reallocate
         * that memory (it might be part of a larger array or something).
         */
        CERROR("can't lstripe objid "DFID": have %d want %u\n",
               PFID(lu_object_fid(lod2lu_obj(lo))), stripe_num,
               lo->ldo_stripenr);
        rc = stripe_num == 0 ? -ENOSPC : -EFBIG;

Previously it always returned -EFBIG, but I changed it in http://review.whamcloud.com/12937 "LU-5246 tests: create OST objects on correct MDT" to return -ENOSPC in case no objects could be created at all. That patch was only landed on Aug 9, so I suspect the cases of -EFBIG being returned will decline and -ENOSPC will be returned instead (as it should be).

Comment by Gerrit Updater [ 28/Aug/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/15966/
Subject: LU-4340 tests: Correct conf-sanity test 69
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 22d0d85f5405859fb31104485ccfd55e4484d365

Comment by Andreas Dilger [ 28/Aug/15 ]

Patch landed to master for 2.8.0.
