
[LU-6747] Intermittent rc=-EROFS from lod_statfs_and_check

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0
    • Environment: lustre 2.7.54
      spl/zfs 0.6.4.1
      single MDS with one MDT and MGS
      two OSSs with one OST each
    • Severity: 3
    • 9223372036854775807

    Description

      Running mdtest I get failures like this:

      06/18/2015 16:22:12: Process 26(zwicky29): FAILED in create_remove_items_helper, unable to create file file.mdtest.26.0 (cwd=/p/lburn/faaland1/zfs-crada/mdtest/2/#test-dir.0/mdtest_tree.26.0): No space left on device
      06/18/2015 16:22:12: Process 19(zwicky22): FAILED in create_remove_items_helper, unable to create file file.mdtest.19.0 (cwd=/p/lburn/faaland1/zfs-crada/mdtest/2/#test-dir.0/mdtest_tree.19.0): No space left on device

      The three servers involved are using the zfs backend and their pools have lots of free space; all are <1% full.

      In the lustre debug log on the MDS, I'm seeing
      lod_qos.c:238:lod_statfs_and_check() return -30 (-EROFS)
      lod_qos.c:1016:lod_alloc_rr() return -28 (-ENOSPC)

      Many other functions report exiting with -28 as well:

      lod_object.c:2104:lod_declare_xattr_set()
      lod_object.c:3352:lod_declare_striped_object()
      lod_object.c:3384:lod_declare_striped_object()
      lod_object.c:3463:lod_declare_object_create()
      lod_qos.c:1913:lod_qos_prep_create()
      mdd_dir.c:1786:mdd_create_data()
      mdd_dir.c:1807:mdd_create_data()
      mdd_dir.c:1983:mdd_declare_object_create()
      mdd_dir.c:2054:mdd_declare_create()
      mdd_dir.c:2354:mdd_create()
      mdd_object.c:352:mdd_declare_object_create_internal()
      mdt_open.c:1105:mdt_open_by_fid_lock()
      mdt_open.c:1255:mdt_reint_open()
      mdt_open.c:1374:mdt_reint_open()
      mdt_open.c:138:mdt_create_data()
      mdt_open.c:347:mdt_mfd_open()
      mdt_open.c:607:mdt_finish_open()
      mdt_reint.c:1997:mdt_reint_rec()

      I've attached a few thousand lines of debug output from the MDS, captured with both debug and subsystem_debug set to -1. I can reproduce easily, so I can get debug output with specific subsystems turned on or off.

      Attachments

        Issue Links

          Activity

            [LU-6747] Intermittent rc=-EROFS from lod_statfs_and_check
            ofaaland Olaf Faaland added a comment -

            Nathaniel,

            The patch had an error in its usage of offsetof(), which I fixed, but the patch still fails to build on SLES12. It builds fine on RHEL. I looked at the SLES12 build console output but am not able to tell what the problem is. Can you take a look?

            thanks,
            Olaf
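
            For context, the eventual fix initializes obd_statfs in osd_statfs() (see the patch referenced below), and offsetof() is the natural tool if only the trailing, otherwise-untouched fields are to be cleared. The sketch below is purely illustrative (it is not taken from the patch, and the member chosen as the starting point is an assumption); it mainly shows the usage pattern offsetof() requires, a type name plus a member name rather than a variable:

                #include <stddef.h>     /* offsetof() */
                #include <string.h>     /* memset()   */

                /* struct obd_statfs comes from lustre/include/lustre/lustre_idl.h */

                /* Hypothetical helper: zero everything from os_state to the end
                 * of the structure.  Note that offsetof() wants the struct type:
                 *
                 *     offsetof(struct obd_statfs, os_state)    correct
                 *     offsetof(osfs, os_state)                 wrong (osfs is a pointer)
                 */
                static void obd_statfs_clear_tail(struct obd_statfs *osfs)
                {
                        size_t off = offsetof(struct obd_statfs, os_state);

                        memset((char *)osfs + off, 0, sizeof(*osfs) - off);
                }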

            ofaaland Olaf Faaland added a comment -

            Both OSTs are indeed reporting a nonzero state intermittently:

            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434584977.251447:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176813342 avail; 5658325578 objects: 5658303680 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585002.260232:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176821809 avail; 5658327093 objects: 5658303680 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:9.0:1434585002.890186:0:12554:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176805734 avail; 5658327093 objects: 5658303680 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:9.0:1434585007.890294:0:12554:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821988 free, 176812034 avail; 5658330734 objects: 5658303616 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585017.266203:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585017.891472:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585022.267691:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585022.912333:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585027.269289:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585027.913373:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585032.270956:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585032.913344:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585037.272373:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585037.914346:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585042.273878:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585042.914405:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434584984.055391:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822073 blocks: 176821934 free, 176821857 avail; 5658316107 objects: 5658301888 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434584989.057246:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821943 free, 176821866 avail; 5658315946 objects: 5658302176 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585014.066824:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821940 free, 176821863 avail; 5658319856 objects: 5658302080 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585019.068752:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821940 free, 176821863 avail; 5658319856 objects: 5658302080 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585029.072516:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821921 free, 176821844 avail; 5658339248 objects: 5658301472 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:8.0:1434585031.588445:0:12764:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821922 free, 176790984 avail; 5658338073 objects: 5658301504 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585034.074483:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821924 free, 176795098 avail; 5658334103 objects: 5658301568 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:4.0:1434585044.078171:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822073 blocks: 176821940 free, 176820539 avail; 4863994422 objects: 4863984771 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:4.0:1434585049.080189:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821945 free, 176821872 avail; 3106822606 objects: 3106816830 free; state c9c4f476

            I'll try the patch.
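
            For reference, both of the state words above do have the read-only bit set if the OS_STATE_* flag values from lustre_idl.h are assumed (OS_STATE_DEGRADED = 0x1, OS_STATE_READONLY = 0x2; treat these as assumptions and check the header for the running version). A minimal user-space sketch to decode them:

                #include <stdio.h>
                #include <stdint.h>

                /* Assumed flag values -- verify against lustre_idl.h. */
                #define OS_STATE_DEGRADED  0x00000001U
                #define OS_STATE_READONLY  0x00000002U

                int main(void)
                {
                        /* The two state values reported by ofd_statfs() above. */
                        uint32_t states[] = { 0x9125230e, 0xc9c4f476 };
                        size_t i;

                        for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
                                printf("state %08x: degraded=%d readonly=%d\n",
                                       (unsigned int)states[i],
                                       !!(states[i] & OS_STATE_DEGRADED),
                                       !!(states[i] & OS_STATE_READONLY));
                        return 0;
                }

            Both values come out with readonly=1 while the rest of the bits look like noise, which fits the uninitialized-os_state theory discussed below.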

            pjones Peter Jones added a comment -

            Assigning to Nathaniel for any follow-on questions.


            adilger Andreas Dilger added a comment -

            Totally untested patch, but I think it is pretty reasonable and will be tested out soon enough.


            adilger Andreas Dilger added a comment -

            The -EROFS return is because the OST returned os_state=OS_STATE_READONLY to the MDS, and if all of the OSTs are returning this then there are no available OST objects and the MDS needs to return -ENOSPC. The main point of investigation should be why the OST(s) are returning OS_STATE_READONLY.

            As far as I can see, the only place which sets OS_STATE_READONLY is in osd-ldiskfs, but you are testing ZFS OSTs so it isn't clear where that is coming from (there is only a commented-out section in osd-zfs/osd_handler.c which references it).

            If you run with D_CACHE enabled on the OSS (lctl set_param debug=+cache) it would print the os_state field on the OST during statfs processing.

            Looking at the osd-zfs osd_statfs() method, it seems possible that os_state is not being initialized properly and is just holding random data. I'd have expected that to trigger earlier if it were the case, or maybe it is initialized somewhere else that I'm missing, but in the osd-ldiskfs case it is initialized within its osd_statfs() call.
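
            A minimal version of that fix is simply to zero the whole obd_statfs buffer before filling it in. This is only a sketch of the idea, not the actual change (the patch below is the authoritative version); the signature follows the dt_statfs method of that era, and the elided body is the existing osd-zfs code:

                /* osd-zfs/osd_handler.c (sketch): make every field, os_state
                 * included, start from a known value instead of whatever the
                 * caller happened to have in the buffer. */
                static int osd_statfs(const struct lu_env *env, struct dt_device *d,
                                      struct obd_statfs *osfs)
                {
                        memset(osfs, 0, sizeof(*osfs));

                        /* ... existing code that queries the ZFS objset and fills
                         *     in os_blocks / os_bfree / os_bavail / os_files /
                         *     os_ffree remains unchanged ... */

                        return 0;
                }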


            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/15346
            Subject: LU-6747 osd-zfs: initialize obd_statfs in osd_statfs()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8d468ec53f908d8a2d067eabab817d04d334d737

            ofaaland Olaf Faaland added a comment -

            osp_statfs() reports many free blocks and many free files for OST0000 when called by lod_statfs_and_check:

            274 00020000:00000001:19.0:1434585022.784732:0:63759:0:(lod_qos.c:193:lod_statfs_and_check()) Process entered
            275 00000004:00000001:19.0:1434585022.784733:0:63759:0:(osp_dev.c:631:osp_statfs()) Process entered
            276 00000004:00001000:19.0:1434585022.784733:0:63759:0:(osp_dev.c:662:osp_statfs()) lburn-OST0000-osc-MDT0000: 5658306464 blocks, 5658303744 free, 5658295552 avail, 5658321151 files, 5658303744 free files
            277 00000004:00000001:19.0:1434585022.784735:0:63759:0:(osp_dev.c:663:osp_statfs()) Process leaving (rc=0 : 0 : 0)
            278 00020000:00000001:19.0:1434585022.784736:0:63759:0:(lod_qos.c:238:lod_statfs_and_check()) Process leaving (rc=18446744073709551586 : -30 : ffffffffffffffe2)
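
            That matches the os_state explanation above: osp_statfs() succeeds and reports plenty of space, but lod_statfs_and_check() keys off os_state rather than the space counters, so a stray OS_STATE_READONLY bit fails the OST anyway. Roughly (paraphrased, not copied from lod_qos.c, and with a simplified argument list):

                static int lod_statfs_and_check(const struct lu_env *env,
                                                struct dt_device *ost_dev,
                                                struct obd_statfs *sfs)
                {
                        int rc;

                        /* Ends up in osp_statfs(); rc=0 here, space looks fine. */
                        rc = dt_statfs(env, ost_dev, sfs);
                        if (rc)
                                return rc;

                        /* The space counters are irrelevant if the OST has
                         * flagged itself read-only: */
                        if (sfs->os_state & OS_STATE_READONLY)
                                return -EROFS;   /* the rc=-30 seen in the trace above */

                        return 0;
                }

            lod_alloc_rr() then skips every OST whose check failed, and with both OSTs failing it has nowhere to allocate objects, hence the -ENOSPC that the clients report.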


            ofaaland Olaf Faaland added a comment -

            mdtest is being run as follows:

            mdtest-1.8.3 was launched with 32 total task(s) on 32 nodes
            Command line used: /g/g0/faaland1/projects/zfs-crada/mdtest/mdtest -v -d /p/lburn/faaland1/zfs-crada/mdtest/2/ -u -t -w 2048 -i 100 -n 1000
            Path: /p/lburn/faaland1/zfs-crada/mdtest/2
            FS: 42.2 TiB Used FS: 0.0% Inodes: 7.2 Mi Used Inodes: 0.1%

            32 tasks, 32000 files/directories

            Operation Duration Rate
            --------- -------- ----

            • iteration 1 06/18/2015 16:22:01 *
              Tree creation : 0.010 sec, 100.583 ops/sec
              Directory creation: 3.964 sec, 8072.493 ops/sec
              Directory stat : 1.346 sec, 23773.545 ops/sec
              Directory removal : 5.736 sec, 5578.833 ops/sec
              06/18/2015 16:22:12: Process 26(zwicky29): FAILED in create_remove_items_helper, unable to create file file.mdtest.26.0 (cwd=/p/lburn/faaland1/zfs-crada/mdtest/2/#test-dir.0/mdtest_tree.26.0): No space left on device
              06/18/2015 16:22:12: Process 19(zwicky22): FAILED in create_remove_items_helper, unable to create file file.mdtest.19.0 (cwd=/p/lburn/faaland1/zfs-crada/mdtest/2/#test-dir.0/mdtest_tree.19.0): No space left on device

            This occurs intermittently; sometimes mdtest will create tens of thousands of files before this occurs, other times none or very few are created before it happens. When it does occur, all the clients in the mdtest job (32 in the example I gave above) report the same failure.

            There is no console log output on any of the servers, between the time I start running mdtest and after the error has occurred.

            The problem disappears on its own; re-running mdtest after it fails, with a new target directory on the same filesystem, succeeds with no problem. This is true whether or not mdtest runs on any of the same nodes as the previous invocation.

            I sampled prealloc_status, prealloc_next_id, and prealloc_last_id for each of the two OSTs on a 1-second interval starting when I ran mdtest and stopping after the job failed. prealloc_status was 0 the entire time for both OSTs, and $(cat prealloc_last_id) - $(cat prealloc_next_id) got as low as 5030 for one OST.

            I also noticed that files are not created evenly on the OSTs according to ltop; it shows the lock count increasing on one OST for several seconds, then that OST's lock count stops changing and the other OST's lock count starts increasing. In the case of one 32-node mdtest run, it created all 32000 files on OST0001 and none on OST0000. Other times it finishes with the same number of locks on each OST, but they see-saw up.


            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 5
