LU-6247

osd-zfs: sanity-sec test_16, test_22: 27919 != 26845 + 1M after write

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.7.0
    • Severity: 3
    • 17497
    Description

      This issue was created by maloo for Isaac Huang <he.huang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f61d2cb6-b2a2-11e4-a8f5-5254006e85c2.

      The sub-test test_22 failed with the following error:

      27919 != 26845 + 1M after write
      

      It seems that the test was using 4096 as the blocksize of the ZFS OST when estimating quota usage, which would cause this failure.

      On the client:

      [root@eagle-44vm2 ~]# cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88007a5bb800/import
      import:
          name: lustre-OST0000-osc-ffff88007a5bb800
          target: lustre-OST0000_UUID
          state: FULL
          connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type, lfsck ]
          connect_data:
             flags: 0x404af0e3440478
             instance: 1
             target_version: 2.6.94.0
             initial_grant: 1048576
             max_brw_size: 4194304
             grant_block_size: 0
             grant_inode_size: 0
             grant_extent_overhead: 0
             cksum_types: 0x2
             max_easize: 32768
             max_object_bytes: 9223372036854775807
          import_flags: [ replayable, pingable, connect_tried ]
          connection:
             failover_nids: [ 10.100.4.126@tcp ]
             current_connection: 10.100.4.126@tcp
             connection_attempts: 1
             generation: 1
             in-progress_invalidations: 0
          rpcs:
             inflight: 0
             unregistering: 0
             timeouts: 0
             avg_waittime: 1702 usec
          service_estimates:
             services: 1 sec
             network: 1 sec
          transactions:
             last_replay: 0
             peer_committed: 0
             last_checked: 0
      [root@eagle-44vm2 ~]# lctl get_param -n osc.lustre-OST0000-*.blocksize
      4096
      

      On the OSS:

      [root@eagle-44vm1 lustre]# cat /proc/fs/lustre/osd-zfs/lustre-OST0000/blocksize 
      131072
      

      This seems to be the code that resets the blocksize to 4K:

      int ofd_statfs(const struct lu_env *env,  struct obd_export *exp,
      ......
              if (obd->obd_self_export != exp && ofd_grant_compat(exp, ofd)) {
                      /* clients which don't support OBD_CONNECT_GRANT_PARAM
                       * should not see a block size > page size, otherwise
                       * cl_lost_grant goes mad. Therefore, we emulate a 4KB (=2^12)
                       * block size which is the biggest block size known to work
                       * with all client's page size. */
                      osfs->os_blocks <<= ofd->ofd_blockbits - COMPAT_BSIZE_SHIFT;
                      osfs->os_bfree  <<= ofd->ofd_blockbits - COMPAT_BSIZE_SHIFT;
                      osfs->os_bavail <<= ofd->ofd_blockbits - COMPAT_BSIZE_SHIFT;
                      osfs->os_bsize    = 1 << COMPAT_BSIZE_SHIFT;
              }
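
      For illustration (this arithmetic is not part of the original report, only derived from the code above): with the 128KB ZFS blocksize reported on the OSS, ofd_blockbits is 17 and COMPAT_BSIZE_SHIFT is 12, so the free/used block counts are scaled up by a factor of 2^5 = 32 while os_bsize is reported as 4096:

      # assumed values: 128KB backend blocksize, 4KB compat block size
      ofd_blockbits=17            # log2(131072)
      compat_bsize_shift=12       # log2(4096)
      echo $((1 << (ofd_blockbits - compat_bsize_shift)))   # scale factor: 32
      echo $((1 << compat_bsize_shift))                     # reported os_bsize: 4096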
      

      I'm not familiar with the code, but if the code is correct then the test should be fixed or skipped for ZFS OSTs.
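
      If skipping is the preferred route, a minimal sketch (assuming the standard facet_fstype and skip helpers from test-framework.sh) at the top of the affected sub-tests could be:

      [ "$(facet_fstype ost1)" = "zfs" ] &&
              skip "quota estimate based on client-visible blocksize is unreliable on ZFS OSTs" &&
              return 0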


        Activity


          utopiabound Nathaniel Clark added a comment:

          Patch landed to master

          gerrit Gerrit Updater added a comment:

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13844/
          Subject: LU-6247 tests: fix nodemap quota test to use correct blocksize
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 0233e76f3f9830210ba7012554ee7cfc41e3ab37

          isaac Isaac Huang (Inactive) added a comment:

          There's a fs_log_size() in test-framework.sh.

          adilger Andreas Dilger added a comment:

          There is definitely a need for fuzz in all space accounting tests. There are Lustre LLOG records written, object index entries, and other files that are not freed when a file is deleted. As long as the test file is large enough that we can be certain the remaining space is fuzz and not the file failing to be deleted (I think the current test is OK in this regard), the accounting doesn't need to be totally accurate.

          In fact, some tests use a fuzz much larger than a single block. It might make sense to have an "official fuzz" for each backend, or helper routines that check whether a value is within the limits, instead of reinventing this for every test that checks space limits. I'm not sure whether there is already something like this in sanity-quota.sh, but if so, it probably makes sense to use it.
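
          A minimal sketch of the kind of helper routine described above (check_within_fuzz is hypothetical, not an existing test-framework.sh function; the variable names are taken from do_fops_quota_test() quoted further below):

          # Hypothetical helper: succeed if $actual is within +/- $fuzz of $expected,
          # all values in KB.
          check_within_fuzz() {
                  local actual=$1 expected=$2 fuzz=$3

                  [ "$actual" -ge $((expected - fuzz)) ] &&
                          [ "$actual" -le $((expected + fuzz)) ]
          }

          # Example use in the quota check:
          check_within_fuzz $qused_new $((qused_orig + 1024)) $quota_fuzz ||
                  error "$qused_new != $qused_orig + 1M after write"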

          kit.westneat Kit Westneat (Inactive) added a comment:

          It looks like there was an issue with quota reclaiming in the last test - the root user was unable to reclaim 2k (reclaimed 1022k vs. the 1024k expected).

          For ldiskfs, it seems like this might happen if the directory record needed to be expanded due to the number of deleted dentries. Or do directory records ever reclaim deleted dentries? Do you have any thoughts as to why this might happen for ZFS?

          What do you think about having a 4k lower fuzz?

          gerrit Gerrit Updater added a comment:

          Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/13844
          Subject: LU-6247 tests: fix nodemap quota test to use correct blocksize
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 525c3d8143acd5d1cd2612a858ba9f872792e0d4

          adilger Andreas Dilger added a comment:

          The lower fuzz doesn't make sense for ldiskfs. In theory it might make sense for ZFS if block compression is enabled, but we don't do that now, so we may as well remove it. I thought I saw ldiskfs test failures when I was looking, but I can't check right now.

          I haven't checked yet what the ZFS OSTs return for the blocksize. It may be 128KB now, and up to 1MB in the future.

          As for the /proc filename, the OSC one should be OK on the client, but it may be that this was removed by John H. recently? The osd one will definitely work on the server.
          kit.westneat Kit Westneat (Inactive) added a comment (edited):

          I looked on Maloo for test failures on ldiskfs with this cause, but I couldn't find any. It looks like all the sanity-sec failures on ldiskfs are LU-5423 or LU-6106. Do you think it's worth trying to just modify the fuzz on ZFS? I think that's what we decided to do on one of the OpenSFS calls, but it looks like I used the wrong blocksize parameter (osc.$FSNAME-OST0000-*.blocksize vs. osd-*.$FSNAME-OST0000.blocksize).

          I can invert the check if you prefer it that way. The fuzz parameters were actually added to account for the indirect blocks and overhead. I have it fuzz both the low and the high side, but I suppose it really only makes sense to fuzz if the quota is higher than expected. If the quota used is lower than expected, it can't really be accounted for as filesystem overhead, right? If that's the case, I can just remove the qused_low parameter, as it seems to be confusing.
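
          A sketch of a corrected blocksize lookup (a fragment intended for inside do_fops_quota_test(); do_facet, $LCTL, and $FSNAME are the usual test-framework.sh names, the rest is illustrative):

          # Use the backend (OSD) blocksize reported by the OSS rather than the
          # client-side OSC value, which is clamped to 4096 for grant-compat clients.
          local quota_fuzz=$(($(do_facet ost1 $LCTL get_param -n \
                  osd-*.$FSNAME-OST0000.blocksize | head -1) / 512))   # 2x blocksize, in KB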

          adilger Andreas Dilger added a comment:

          It looks like test_22 is a new test added for nodemap in 2.7, and it has been failing intermittently in testing on both ldiskfs and zfs. The test checks that writing 1MB of data causes exactly 1MB of increase in quota usage for the specified (mapped) UID.

          It appears that 1074 * 1024 bytes of quota were consumed in this case, about 50KB more than expected (between 32KB and 70KB more in other failures), almost certainly due to indirect blocks. The test is trying to allow some margin for error, but I don't think the check is quite right, or the margin large enough:

          do_fops_quota_test() {
                  # define fuzz as 2x ost block size in K
                  local quota_fuzz=$(($(lctl get_param -n \
                          osc.$FSNAME-OST0000-*.blocksize | head -1) / 512))
                  local qused_orig=$(nodemap_check_quota "$run_u")
                  local qused_low=$((qused_orig - quota_fuzz))
                  local qused_high=$((qused_orig + quota_fuzz))
                  $run_u dd if=/dev/zero of=$testfile bs=1M count=1 >& /dev/null
                  sync; sync_all_data || true
          
                  local qused_new=$(nodemap_check_quota "$run_u")
                  [ $((qused_low + 1024)) -le $((qused_new)) \
                          -a $((qused_high + 1024)) -ge $((qused_new)) ] ||
                          error "$qused_new != $qused_orig + 1M after write"
          

          I think the check as written is a bit confusing (i.e. inverted logic from what I'd expect), and should also consider indirect blocks and other minor overhead:

                  # allow 128KB margin for indirect blocks and other overhead
                  [ $qused_new -lt $((qused_low + 1024))  \
                          -o $qused_new -gt $((qused_high + 1024 + 128)) ] &&
                          error "$qused_new != $qused_orig + 1M after write"
          

          The other related problem is that once this test fails, it doesn't clean up after itself properly, and the following tests up to test_23 will also fail with "adding fops nodemaps failed 1". The tests should add a trap for a cleanup helper so that the later tests can run properly.
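
          A minimal sketch of such a trap (nodemap_fops_cleanup is a hypothetical helper name; its body would call whatever teardown the fops tests already use):

          # Run cleanup even if the test errors out partway, so the fops nodemaps do
          # not leak into test_23 and later tests.
          nodemap_fops_cleanup() {
                  # ... undo the fops nodemap setup created for this test ...
                  trap - EXIT
          }
          trap nodemap_fops_cleanup EXIT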


          People

            Assignee: utopiabound Nathaniel Clark
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 10
