Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6747

Intermittent rc=-EROFS from lod_statfs_and_check

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.8.0
    • Lustre 2.7.0
    • lustre 2.7.54
      spl/zfs 0.6.4.1
      single MDS with one MDT and MGS
      two OSS's with one OST each
    • 3
    • 9223372036854775807

    Description

      Running mdtest I get failures like this:

      06/18/2015 16:22:12: Process 26(zwicky29): FAILED in create_remove_items_helper, unable to create file file.mdtest.26.0 (cwd=/p/lburn/faaland1/zfs-crada/mdtest/2/#test-dir.0/mdtest_tree.26.0): No space left on device
      06/18/2015 16:22:12: Process 19(zwicky22): FAILED in create_remove_items_helper, unable to create file file.mdtest.19.0 (cwd=/p/lburn/faaland1/zfs-crada/mdtest/2/#test-dir.0/mdtest_tree.19.0): No space left on device

      The three servers involved are using the zfs backend and their pools have lots of free space; all are <1% full.

      In the lustre debug log on the MDS, I'm seeing
      lod_qos.c:238:lod_statfs_and_check() return -30 (-LUSTRE_EROFS)
      lod_qos.c:1016:lod_alloc_rr() return -28 (-ENOSPC)

      Many other functions report exiting with -28 as well:

      lod_object.c:2104:lod_declare_xattr_set()
      lod_object.c:3352:lod_declare_striped_object()
      lod_object.c:3384:lod_declare_striped_object()
      lod_object.c:3463:lod_declare_object_create()
      lod_qos.c:1913:lod_qos_prep_create()
      mdd_dir.c:1786:mdd_create_data()
      mdd_dir.c:1807:mdd_create_data()
      mdd_dir.c:1983:mdd_declare_object_create()
      mdd_dir.c:2054:mdd_declare_create()
      mdd_dir.c:2354:mdd_create()
      mdd_object.c:352:mdd_declare_object_create_internal()
      mdt_open.c:1105:mdt_open_by_fid_lock()
      mdt_open.c:1255:mdt_reint_open()
      mdt_open.c:1374:mdt_reint_open()
      mdt_open.c:138:mdt_create_data()
      mdt_open.c:347:mdt_mfd_open()
      mdt_open.c:607:mdt_finish_open()
      mdt_reint.c:1997:mdt_reint_rec()

      I've attached a few thousand lines of debug output from the mds with both debug and debug_subsys set to -1. I can reproduce easily, so I can get debug output with specific subsystems turned off or on.

      Attachments

        Issue Links

          Activity

            [LU-6747] Intermittent rc=-EROFS from lod_statfs_and_check
            pjones Peter Jones added a comment -

            Landed for 2.8

            pjones Peter Jones added a comment - Landed for 2.8

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15346/
            Subject: LU-6747 osd-zfs: initialize obd_statfs in osd_statfs()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1ef0185e8c12aa11a4c87c4956b1ba408c0e3d08

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15346/ Subject: LU-6747 osd-zfs: initialize obd_statfs in osd_statfs() Project: fs/lustre-release Branch: master Current Patch Set: Commit: 1ef0185e8c12aa11a4c87c4956b1ba408c0e3d08

            The patch appears to have resolved the problem; mdtest has been running now for several hours without encountering the issue, and ran for several hours last night as well.

            I'm surprised I didn't see this problem earlier. I ran mdtest in the same manner, against the same SPL/ZFS/Lustre versions, on the same clients and servers, with SPL and ZFS built using the --enable-debug configure option. I didn't observe this problem. I then rebuilt SPL and ZFS without debug, and rebuilt lustre against it, which is when I began to see this.

            ofaaland Olaf Faaland added a comment - The patch appears to have resolved the problem; mdtest has been running now for several hours without encountering the issue, and ran for several hours last night as well. I'm surprised I didn't see this problem earlier. I ran mdtest in the same manner, against the same SPL/ZFS/Lustre versions, on the same clients and servers, with SPL and ZFS built using the --enable-debug configure option. I didn't observe this problem. I then rebuilt SPL and ZFS without debug, and rebuilt lustre against it, which is when I began to see this.
            ofaaland Olaf Faaland added a comment -

            Nathaniel,

            The patch had an error in its usage of offsetof(), which I fixed, but the patch still fails to build on SLES12. It builds fine on RHEL. I looked at the SLES12 build console output but am not able to tell what the problem is. Can you take a look?

            thanks,
            Olaf

            ofaaland Olaf Faaland added a comment - Nathaniel, The patch had an error in its usage of offsetof(), which I fixed, but the patch still fails to build on SLES12. It builds fine on RHEL. I looked at the SLES12 build console output but am not able to tell what the problem is. Can you take a look? thanks, Olaf
            ofaaland Olaf Faaland added a comment -

            Both OSTs are indeed reporting a nonzero state intermittently:

            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434584977.251447:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176813342 avail; 5658325578 objects: 5658303680 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585002.260232:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176821809 avail; 5658327093 objects: 5658303680 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:9.0:1434585002.890186:0:12554:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176805734 avail; 5658327093 objects: 5658303680 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:9.0:1434585007.890294:0:12554:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821988 free, 176812034 avail; 5658330734 objects: 5658303616 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585017.266203:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585017.891472:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585022.267691:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585022.912333:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585027.269289:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585027.913373:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585032.270956:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585032.913344:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585037.272373:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585037.914346:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585042.273878:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585042.914405:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434584984.055391:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822073 blocks: 176821934 free, 176821857 avail; 5658316107 objects: 5658301888 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434584989.057246:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821943 free, 176821866 avail; 5658315946 objects: 5658302176 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585014.066824:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821940 free, 176821863 avail; 5658319856 objects: 5658302080 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585019.068752:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821940 free, 176821863 avail; 5658319856 objects: 5658302080 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585029.072516:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821921 free, 176821844 avail; 5658339248 objects: 5658301472 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:8.0:1434585031.588445:0:12764:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821922 free, 176790984 avail; 5658338073 objects: 5658301504 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585034.074483:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821924 free, 176795098 avail; 5658334103 objects: 5658301568 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:4.0:1434585044.078171:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822073 blocks: 176821940 free, 176820539 avail; 4863994422 objects: 4863984771 free; state c9c4f476
            zwicky-lburn-oss2.out:00002000:00000020:4.0:1434585049.080189:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821945 free, 176821872 avail; 3106822606 objects: 3106816830 free; state c9c4f476

            I'll try the patch.

            ofaaland Olaf Faaland added a comment - Both OSTs are indeed reporting a nonzero state intermittently: zwicky-lburn-oss1.out:00002000:00000020:0.0:1434584977.251447:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176813342 avail; 5658325578 objects: 5658303680 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585002.260232:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176821809 avail; 5658327093 objects: 5658303680 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:9.0:1434585002.890186:0:12554:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821990 free, 176805734 avail; 5658327093 objects: 5658303680 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:9.0:1434585007.890294:0:12554:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821988 free, 176812034 avail; 5658330734 objects: 5658303616 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585017.266203:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585017.891472:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585022.267691:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585022.912333:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585027.269289:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585027.913373:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585032.270956:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585032.913344:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585037.272373:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585037.914346:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:0.0:1434585042.273878:0:12631:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821815 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss1.out:00002000:00000020:10.0:1434585042.914405:0:12649:0:(ofd_obd.c:841:ofd_statfs()) 176822077 blocks: 176821992 free, 176821736 avail; 5658321151 objects: 5658303744 free; state 9125230e zwicky-lburn-oss2.out:00002000:00000020:1.0:1434584984.055391:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822073 blocks: 176821934 free, 176821857 avail; 5658316107 objects: 5658301888 free; state c9c4f476 zwicky-lburn-oss2.out:00002000:00000020:1.0:1434584989.057246:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821943 free, 176821866 avail; 5658315946 objects: 5658302176 free; state c9c4f476 zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585014.066824:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821940 free, 176821863 avail; 5658319856 objects: 5658302080 free; state c9c4f476 zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585019.068752:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821940 free, 176821863 avail; 5658319856 objects: 5658302080 free; state c9c4f476 zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585029.072516:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821921 free, 176821844 avail; 5658339248 objects: 5658301472 free; state c9c4f476 zwicky-lburn-oss2.out:00002000:00000020:8.0:1434585031.588445:0:12764:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821922 free, 176790984 avail; 5658338073 objects: 5658301504 free; state c9c4f476 zwicky-lburn-oss2.out:00002000:00000020:1.0:1434585034.074483:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821924 free, 176795098 avail; 5658334103 objects: 5658301568 free; state c9c4f476 zwicky-lburn-oss2.out:00002000:00000020:4.0:1434585044.078171:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822073 blocks: 176821940 free, 176820539 avail; 4863994422 objects: 4863984771 free; state c9c4f476 zwicky-lburn-oss2.out:00002000:00000020:4.0:1434585049.080189:0:12541:0:(ofd_obd.c:841:ofd_statfs()) 176822076 blocks: 176821945 free, 176821872 avail; 3106822606 objects: 3106816830 free; state c9c4f476 I'll try the patch.
            pjones Peter Jones added a comment -

            Assigning to Nathaniel for any follow on questions

            pjones Peter Jones added a comment - Assigning to Nathaniel for any follow on questions

            Totally untested patch, but I think it is pretty reasonable and will be tested out soon enough.

            adilger Andreas Dilger added a comment - Totally untested patch, but I think it is pretty reasonable and will be tested out soon enough.

            The -EROFS return is because the OST returned os_state=OS_STATE_READONLY to the MDS, and if all of the OSTs are returning this then there are no available OST objects and the MDS needs to return -ENOSPC. The main point of investigation should be why the OST(s) are returning OS_STATE_READONLY.

            As far as I can see, the only place that I can see which sets OS_STATE_READONLY is in osd-ldiskfs, but you are testing ZFS OSTs so it isn't clear where that is coming from (there is only a commented-out section in osd-zfs/osd_handler.c which references it).

            If you run with D_CACHE enabled on the OSS (lctl set_param debug=+cache) it would print the os_state field on the OST during statfs processing.

            Looking at the osd-zfs osd_statfs() method it seems possible that os_state is not being initialized properly and is just holding random data? I'd think that this would have triggered earlier if that were the case, or maybe I'm mistaken and it is initialized somewhere else, but in the osd-ldiskfs case it is initialized within its osd_statfs() call.

            adilger Andreas Dilger added a comment - The -EROFS return is because the OST returned os_state=OS_STATE_READONLY to the MDS, and if all of the OSTs are returning this then there are no available OST objects and the MDS needs to return -ENOSPC . The main point of investigation should be why the OST(s) are returning OS_STATE_READONLY. As far as I can see, the only place that I can see which sets OS_STATE_READONLY is in osd-ldiskfs, but you are testing ZFS OSTs so it isn't clear where that is coming from (there is only a commented-out section in osd-zfs/osd_handler.c which references it). If you run with D_CACHE enabled on the OSS ( lctl set_param debug=+cache ) it would print the os_state field on the OST during statfs processing. Looking at the osd-zfs osd_statfs() method it seems possible that os_state is not being initialized properly and is just holding random data? I'd think that this would have triggered earlier if that were the case, or maybe I'm mistaken and it is initialized somewhere else, but in the osd-ldiskfs case it is initialized within its osd_statfs() call.

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/15346
            Subject: LU-6747 osd-zfs: initialize obd_statfs in osd_statfs()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8d468ec53f908d8a2d067eabab817d04d334d737

            gerrit Gerrit Updater added a comment - Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/15346 Subject: LU-6747 osd-zfs: initialize obd_statfs in osd_statfs() Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 8d468ec53f908d8a2d067eabab817d04d334d737
            ofaaland Olaf Faaland added a comment -

            osp_statfs() reports many free blocks and many free files for OST0000 when called by lod_statfs_and_check:

            274 00020000:00000001:19.0:1434585022.784732:0:63759:0:(lod_qos.c:193:lod_statfs_and_check()) Process entered
            275 00000004:00000001:19.0:1434585022.784733:0:63759:0:(osp_dev.c:631:osp_statfs()) Process entered
            276 00000004:00001000:19.0:1434585022.784733:0:63759:0:(osp_dev.c:662:osp_statfs()) lburn-OST0000-osc-MDT0000: 5658306464 blocks, 5658303744 free, 5658295552 avail, 5658321151 files, 5658303744 free files
            277 00000004:00000001:19.0:1434585022.784735:0:63759:0:(osp_dev.c:663:osp_statfs()) Process leaving (rc=0 : 0 : 0)
            278 00020000:00000001:19.0:1434585022.784736:0:63759:0:(lod_qos.c:238:lod_statfs_and_check()) Process leaving (rc=18446744073709551586 : -30 : ffffffffffffffe2)

            ofaaland Olaf Faaland added a comment - osp_statfs() reports many free blocks and many free files for OST0000 when called by lod_statfs_and_check: 274 00020000:00000001:19.0:1434585022.784732:0:63759:0:(lod_qos.c:193:lod_statfs_and_check()) Process entered 275 00000004:00000001:19.0:1434585022.784733:0:63759:0:(osp_dev.c:631:osp_statfs()) Process entered 276 00000004:00001000:19.0:1434585022.784733:0:63759:0:(osp_dev.c:662:osp_statfs()) lburn-OST0000-osc-MDT0000: 5658306464 blocks, 5658303744 free, 5658295552 avail, 5658321151 files, 5658303744 free files 277 00000004:00000001:19.0:1434585022.784735:0:63759:0:(osp_dev.c:663:osp_statfs()) Process leaving (rc=0 : 0 : 0) 278 00020000:00000001:19.0:1434585022.784736:0:63759:0:(lod_qos.c:238:lod_statfs_and_check()) Process leaving (rc=18446744073709551586 : -30 : ffffffffffffffe2)

            People

              utopiabound Nathaniel Clark
              ofaaland Olaf Faaland
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: