
conf-sanity test_69: error: File too large

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.8.0
    • Environment: client and server: lustre-master build #1783 RHEL6.4 ldiskfs
    • Severity: 3
    • 11869

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/77c48f3e-59f3-11e3-98fc-52540035b04c.

      The sub-test test_69 failed with the following error:

      create file after reformat

      test log shows:

      CMD: client-16vm3 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_last_id
       - created 10000 (time 1385749407.25 total 47.32 last 47.32)
       - created 20000 (time 1385749454.94 total 95.01 last 47.69)
       - created 30000 (time 1385749503.05 total 143.12 last 48.11)
       - created 40000 (time 1385749551.55 total 191.62 last 48.50)
      open(/mnt/lustre/d0.conf-sanity/d69/f.conf-sanity.69-49787) error: File too large
      total: 49787 creates in 240.39 seconds: 207.11 creates/second
      stop ost1 service on client-16vm4
      CMD: client-16vm4 grep -c /mnt/ost1' ' /proc/mounts
      Stopping /mnt/ost1 (opts:-f) on client-16vm4
      CMD: client-16vm4 umount -d -f /mnt/ost1
      CMD: client-16vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      CMD: client-16vm4 grep -c /mnt/ost1' ' /proc/mounts
      CMD: client-16vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      CMD: client-16vm4 mkfs.lustre --mgsnode=client-16vm3@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-E lazy_itable_init\" --reformat --replace /dev/lvm-Role_OSS/P1
      
         Permanent disk data:
      Target:     lustre-OST0000
      Index:      0
      Lustre FS:  lustre
      Mount type: ldiskfs
      Flags:      0x42
                    (OST update )
      Persistent mount opts: errors=remount-ro
      Parameters: mgsnode=10.10.4.122@tcp sys.timeout=20
      
      device size = 2048MB
      formatting backing filesystem ldiskfs on /dev/lvm-Role_OSS/P1
      	target name  lustre-OST0000
      	4k blocks     50000
      	options        -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F
      mkfs_cmd = mke2fs -j -b 4096 -L lustre-OST0000  -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F /dev/lvm-Role_OSS/P1 50000
      Writing CONFIGS/mountdata
      start ost1 service on client-16vm4
      CMD: client-16vm4 mkdir -p /mnt/ost1
      CMD: client-16vm4 test -b /dev/lvm-Role_OSS/P1
      Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/ost1
      CMD: client-16vm4 mkdir -p /mnt/ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/ost1
      CMD: client-16vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \"all -lnet -lnd -pinger\" 4 
      CMD: client-16vm4 e2label /dev/lvm-Role_OSS/P1 2>/dev/null
      Started lustre-OST0000
      CMD: client-16vm3 /usr/sbin/lctl get_param -n version
      CMD: client-16vm3 /usr/sbin/lctl get_param -n version
      CMD: client-16vm3 lctl list_param osc.lustre-OST*-osc             > /dev/null 2>&1
      CMD: client-16vm3 lctl get_param -n at_min
      can't get osc.lustre-OST0000-osc-MDT0000.ost_server_uuid by list_param in 40 secs
      Go with osc.lustre-OST0000-osc-MDT0000.ost_server_uuid directly
      CMD: client-16vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state FULL osc.lustre-OST0000-osc-MDT0000.ost_server_uuid 40 
      client-16vm3: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
      touch: cannot touch `/mnt/lustre/d0.conf-sanity/d69/f.conf-sanity.69-last': File too large
       conf-sanity test_69: @@@@@@ FAIL: create file after reformat 
      

      Attachments

        Issue Links

          Activity

            [LU-4340] conf-sanity test_69: error: File too large

            adilger Andreas Dilger added a comment -

            Patch landed to master for 2.8.0.

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/15966/
            Subject: LU-4340 tests: Correct conf-sanity test 69
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 22d0d85f5405859fb31104485ccfd55e4484d365

            adilger Andreas Dilger added a comment -

            I also figured out where the -EFBIG (-27) vs -ENOSPC difference was coming from. It is returned by lod_alloc_specific() when a specific layout is requested but there are no OST objects available. Normally that makes sense, because a specific layout specifies the stripe count, and if that cannot be satisfied then the file may grow too large for the available number of stripes. In the case of this test there is only one stripe, but the directory specifies that it must be on OST0000, so it triggers this condition:

                    /* If we were passed specific striping params, then a failure to
                     * meet those requirements is an error, since we can't reallocate
                     * that memory (it might be part of a larger array or something).
                     */
                    CERROR("can't lstripe objid "DFID": have %d want %u\n",
                           PFID(lu_object_fid(lod2lu_obj(lo))), stripe_num,
                           lo->ldo_stripenr);
                    rc = stripe_num == 0 ? -ENOSPC : -EFBIG;
            

            Previously it always returned -EFBIG, but I changed it in http://review.whamcloud.com/12937 "LU-5246 tests: create OST objects on correct MDT" to return -ENOSPC in case no objects could be created at all. That patch was only landed on Aug 9, so I suspect the cases of -EFBIG being returned will decline and -ENOSPC will be returned instead (as it should be).
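            A quick way to confirm that this code path was hit on a given run (an illustrative sketch, not part of conf-sanity; it assumes CERROR messages are reaching the MDS console/debug log):

                # the CERROR above is printed on the MDS when allocation fails,
                # so it shows up in the console log:
                dmesg | grep "can't lstripe objid"
                # or in the Lustre debug log:
                lctl dk /tmp/lustre-debug.log && grep "can't lstripe objid" /tmp/lustre-debug.log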

            jamesanunez James Nunez (Inactive) added a comment (edited) -

            We ran conf-sanity five times in a row to see if we could reproduce this error on master (pre-2.8). The logs are at https://testing.hpdd.intel.com/test_sessions/b3556da0-41e3-11e5-9e18-5254006e85c2.

            Looking back at previous failures of conf-sanity test 69, the createmany routine fails with two different errors: 'File too large' (EFBIG, errno 27) and 'No space left on device' (ENOSPC, errno 28).

            From createmany, this is the part of the code that is failing:

                if (do_open) {
                        int fd = open(filename, O_CREAT|O_RDWR, 0644);
                        if (fd < 0) {
                                printf("open(%s) error: %s\n", filename,
                                       strerror(errno));
                                rc = errno;
                                break;
                        }
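
            For reference, that open() loop is driven by an invocation along these lines (an illustrative sketch; the exact arguments come from the test script, with the file prefix as seen in the log below and 60000 = 3 * OST_MAX_PRECREATE):

                # the snippet above sets rc = errno on failure, so a failing run
                # reports 27 (EFBIG, "File too large") or 28 (ENOSPC, "No space left on device")
                createmany -o /mnt/lustre/d69.conf-sanity/f69.conf-sanity- 60000
                echo "createmany rc=$?"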
            

            In the recent debug tests I ran, I only hit the “No space left on device” error:

            - created 10000 (time 1439485515.71 total 61.79 last 61.79)
            - created 20000 (time 1439485577.47 total 123.55 last 61.76)
            - created 30000 (time 1439485640.13 total 186.21 last 62.66)
            - created 40000 (time 1439485702.88 total 248.96 last 62.76)
            open(/mnt/lustre/d69.conf-sanity/f69.conf-sanity-49585) error: No space left on device
            total: 49585 creates in 310.77 seconds: 159.56 creates/second
            

            In test 69, before we start creating files, the file system capacity looks like (‘lfs df’ and ‘lfs df -i’ output):

            UUID                   1K-blocks        Used   Available Use% Mounted on
            lustre-MDT0000_UUID       133560        1752      122084   1% /mnt/lustre[MDT:0]
            lustre-MDT0001_UUID       133560        1604      122232   1% /mnt/lustre[MDT:1]
            lustre-MDT0002_UUID       133560        1608      122228   1% /mnt/lustre[MDT:2]
            lustre-MDT0003_UUID       133560        1608      122228   1% /mnt/lustre[MDT:3]
            lustre-OST0000_UUID       171080        9812      150848   6% /mnt/lustre[OST:0]
            
            filesystem summary:       171080        9812      150848   6% /mnt/lustre
            
            UUID                      Inodes       IUsed       IFree IUse% Mounted on
            lustre-MDT0000_UUID       100000         250       99750   0% /mnt/lustre[MDT:0]
            lustre-MDT0001_UUID       100000         230       99770   0% /mnt/lustre[MDT:1]
            lustre-MDT0002_UUID       100000         230       99770   0% /mnt/lustre[MDT:2]
            lustre-MDT0003_UUID       100000         230       99770   0% /mnt/lustre[MDT:3]
            lustre-OST0000_UUID        50016         463       49553   1% /mnt/lustre[OST:0]
            
            filesystem summary:        50493         940       49553   2% /mnt/lustre
            

            So, there are no space issues before the test is run.

            NOTE: As previously pointed out, for conf-sanity the MDT and OST sizes are reduced to 200000 KB (--device-size=200000, as in the mkfs.lustre command above).

            After creating files, the file system capacity looks like (‘lfs df’ and ‘lfs df -i’ output):

            UUID                   1K-blocks        Used   Available Use% Mounted on
            lustre-MDT0000_UUID       133560        7536      116312   6% /mnt/lustre[MDT:0]
            lustre-MDT0001_UUID       133560        1604      122232   1% /mnt/lustre[MDT:1]
            lustre-MDT0002_UUID       133560        1608      122228   1% /mnt/lustre[MDT:2]
            lustre-MDT0003_UUID       133560        1608      122228   1% /mnt/lustre[MDT:3]
            lustre-OST0000_UUID       171080       10964      149696   7% /mnt/lustre[OST:0]
            
            filesystem summary:       171080       10964      149696   7% /mnt/lustre
            
            UUID                      Inodes       IUsed       IFree IUse% Mounted on
            lustre-MDT0000_UUID       100000       49836       50164  50% /mnt/lustre[MDT:0]
            lustre-MDT0001_UUID       100000         230       99770   0% /mnt/lustre[MDT:1]
            lustre-MDT0002_UUID       100000         230       99770   0% /mnt/lustre[MDT:2]
            lustre-MDT0003_UUID       100000         230       99770   0% /mnt/lustre[MDT:3]
            lustre-OST0000_UUID        50016       50016           0 100% /mnt/lustre[OST:0]
            
            filesystem summary:        50526       50526           0 100% /mnt/lustre
            

            Thus, we are running out of inodes on OST0000: the OST is formatted with only 50016 inodes, while the test attempts 3 * OST_MAX_PRECREATE = 60000 creates, each of which needs an object on that single OST.
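
            A quick sanity check of that conclusion (illustrative only; it simply pulls the OST0000 inode counters out of the 'lfs df -i' output shown above):

                # IFree on OST0000 bounds how many more creates can succeed
                lfs df -i /mnt/lustre | awk '/\[OST:0\]/ { print $1, "IFree:", $4 }'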

            The test currently tries to create 3 * OST_MAX_PRECREATE files, where OST_MAX_PRECREATE = 20000, with the stated goal of pushing LAST_ID over 1.5 * OST_MAX_PRECREATE. From the error messages of the failed tests, we know that the test fails after a little over 49,000 creates. So, wouldn't changing the number of files created to 2 * OST_MAX_PRECREATE solve this issue and still achieve the goal of making LAST_ID over 1.5 * OST_MAX_PRECREATE?

            If we limit the number of files created to 2 * OST_MAX_PRECREATE, the test should pass, and in testing it does. Logs are at: https://testing.hpdd.intel.com/test_sessions/d8222034-425a-11e5-af45-5254006e85c2
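
            A minimal sketch of that change (illustrative only, not the patch that eventually landed; it assumes test_69's existing createmany loop, the OST_MAX_PRECREATE value quoted above, and the usual $DIR/$tdir/$tfile test-framework variables):

                # 2 * OST_MAX_PRECREATE still pushes LAST_ID past 1.5 * OST_MAX_PRECREATE
                # but stays under the ~50000 inodes available on the small conf-sanity OST
                local num_create=$((OST_MAX_PRECREATE * 2))
                createmany -o $DIR/$tdir/$tfile- $num_create ||
                        error "createmany failed to create $num_create files"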


            gerrit Gerrit Updater added a comment -

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/15966
            Subject: LU-4340 tests: Adding debug to conf-sanity test 69
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cca19d41a6b39ae2020655923290939cbe21bade

            yujian Jian Yu added a comment -

            The failure occurred consistently on master branch:
            https://testing.hpdd.intel.com/test_sets/1f034d6c-34ea-11e5-be21-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/7a5d164a-34ed-11e5-b875-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/ba0dd0aa-34a6-11e5-a9b3-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/3aaaa0b4-3437-11e5-be70-5254006e85c2

            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15487/
            Subject: LU-4340 tests: Fix test_69 of conf-sanity test
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 34f94efa4847ebd84b2fa42b7a0fc85bd7f6f8e3

            gerrit Gerrit Updater added a comment -

            Ashish Purkar (ashish.purkar@seagate.com) uploaded a new patch: http://review.whamcloud.com/15487
            Subject: LU-4340 tests: Fix test_69 of conf-sanity test
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: abc7765b941fcebe7c489137af0e27d2212c71de

            jamesanunez James Nunez (Inactive) added a comment -

            Hit this issue with 2.7.0-RC4. Results are at: https://testing.hpdd.intel.com/test_sessions/193dce6a-c42f-11e4-a0ef-5254006e85c2

            People

              Assignee: jamesanunez James Nunez (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: