
Failure on test suite sanity test_151 test_156: roc_hit is not safe to use

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Affects Version/s: Lustre 2.6.0, Lustre 2.8.0
    • Environment: client and server: lustre-b2_6-rc2 ldiskfs;
      client is SLES11 SP3

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/97672104-0dca-11e4-b3f5-5254006e85c2.

      The sub-test test_151 failed with the following error:

      roc_hit is not safe to use

      == sanity test 151: test cache on oss and controls ================================= 19:31:03 (1405477863)
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
      CMD: onyx-40vm8 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=1 		osd-*.lustre-OST*.writethrough_cache_enable=1 2>&1
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.writethrough_cache_enable 		osd-*.lustre-OST*.writethrough_cache_enable 2>&1
      4+0 records in
      4+0 records out
      16384 bytes (16 kB) copied, 0.00947514 s, 1.7 MB/s
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.*OST*0000.stats 		osd-*.*OST*0000.stats 2>&1
      CMD: onyx-40vm8 /usr/sbin/lctl get_param -n obdfilter.*OST*0000.stats 		osd-*.*OST*0000.stats 2>&1
      BEFORE:11 AFTER:12
       sanity test_151: @@@@@@ FAIL: roc_hit is not safe to use 
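
      For context, the BEFORE/AFTER values above come from a roc_hit-style helper
      that reads the cache_hit counter out of the obdfilter stats shown in the log.
      An illustrative way to inspect that counter by hand on the OSS (not the
      test's exact code):

        # sum the cache_hit sample counts reported by the OST stats files
        lctl get_param -n obdfilter.*OST*0000.stats 2>&1 |
                awk '/^cache_hit/ { sum += $2 } END { print sum }'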
      

    Activity

            [LU-5375] Failure on test suite sanity test_151 test_156: roc_hit is not safe to use
            simmonsja James A Simmons added a comment - - edited

            I'm thinking the "grep -v 'Found no match'" test might not always work. I'm exploring testing the return value "$?" of the command instead; I'd like to test whether "$?" is less than zero. Would something like this work?

            do_nodes $nodes "$LCTL set_param -n obdfilter.$device.$name=$value \
            -              osd-*.$device.$name=$value 2>&1" | grep -v 'Found no match'
            +              osd-*.$device.$name=$value 2>&1" || return [ $? -lt 0 ]
            

            Sorry, I'm not the greatest bash scripter.
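
            For what it's worth, shell exit statuses are unsigned (0-255), so a
            "$? -lt 0" test can never be true, and in the original line the pipe
            into grep means "$?" would reflect grep rather than do_nodes. A minimal
            sketch of how the do_nodes failure status could be propagated instead
            (illustrative only, not the committed fix):

                local rc
                # keep the original filtering, but remember do_nodes's own status
                do_nodes $nodes "$LCTL set_param -n obdfilter.$device.$name=$value \
                                osd-*.$device.$name=$value 2>&1" |
                        grep -v 'Found no match'
                rc=${PIPESTATUS[0]}
                (( rc != 0 )) && return $rc
                return 0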


            simmonsja James A Simmons added a comment -

            The source of the failure is get_osd_param(). It should be reporting that those parameters don't exist. It's doing a grep -v 'Found no match', but that string is no longer what lctl get_param prints. Running the command manually gives:

            [root@ninja11 lustre-OST0000]# lctl get_param -n obdfilter.lustre-OST0000.read_cache_enable
            error: list_param: obdfilter/lustre-OST0000/read_cache_enable: No such file or directory

            Ah yes, I moved from the custom globerrstr() to the standard strerror(...). The failure in this case is due to the LU-5030 changes.
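
            For reference, a version of get_osd_param() that tolerates both error
            strings (the old custom "Found no match" text and the strerror()-based
            "No such file or directory") could look roughly like this. The
            parameter handling is paraphrased from the snippet quoted above; this
            is a sketch, not the committed fix:

                get_osd_param() {
                        local nodes=$1
                        local device=${2:-$FSNAME-OST*}
                        local name=$3

                        # filter out both "parameter does not exist" messages so
                        # callers see only real values
                        do_nodes $nodes "$LCTL get_param -n obdfilter.$device.$name \
                                osd-*.$device.$name 2>&1" |
                                grep -v -e 'Found no match' -e 'No such file or directory'
                }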


            adilger Andreas Dilger added a comment -

            James, wouldn't that cause review-zfs to fail all the time? It appears that test_151 has checks for read_cache_enable and writethrough_cache_enable, though it does "return 0" instead of "skip" as it probably should.
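
            As an illustration of the change Andreas describes, and assuming
            get_osd_param() filters out both error strings as in the sketch
            further above, the cache-capability check at the top of test_151
            could report a skip instead of silently passing. Helper names
            (comma_list, osts_nodes, skip) follow the test framework; the
            existing check in test_151 may differ in detail:

                local list=$(comma_list $(osts_nodes))
                # empty output means the OSDs expose no read_cache_enable tunable
                if [ -z "$(get_osd_param $list '' read_cache_enable | head -n1)" ]; then
                        skip "not cache-capable obdfilter"
                        return 0
                fi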


            simmonsja James A Simmons added a comment -

            I know why this test is failing. Symlinks are being created by the obdfilter into the osd-* layer for writethrough_cache_enable, readcache_max_filesize, read_cache_enable, and brw_stats. This works for ldiskfs but not for ZFS: ZFS only has brw_stats and lacks the rest. This is why sanity test 151 fails for ZFS.
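
            Illustrative only: one way to check which of these tunables the OSD
            layer actually exposes is to list them directly on the OSS (on ldiskfs
            all four should show up; on zfs, only brw_stats):

                lctl list_param osd-*.*OST*.writethrough_cache_enable \
                        osd-*.*OST*.readcache_max_filesize \
                        osd-*.*OST*.read_cache_enable \
                        osd-*.*OST*.brw_stats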


            standan Saurabh Tandan (Inactive) added a comment -

            Server: Master, Build# 3266, Tag 2.7.64, RHEL 7
            Client: 2.5.5, b2_5_fe/62
            https://testing.hpdd.intel.com/test_sets/e4f27f18-9fff-11e5-a33d-5254006e85c2


            standan Saurabh Tandan (Inactive) added a comment -

            Encountered another instance for Interop master<->2.5.5
            Server: Master, Build# 3266, Tag 2.7.64
            Client: 2.5.5, b2_5_fe/62
            https://testing.hpdd.intel.com/test_sets/ac332386-9fcc-11e5-a33d-5254006e85c2

            sarah Sarah Liu added a comment -

            Also hit this issue after a rolling downgrade from master/3264 RHEL6.7 to 2.5.5 RHEL6.6.

            adilger Andreas Dilger added a comment - - edited

            This is failing between 0-5 times per day, maybe twice per day on average. It looks like most of these recent failures (excluding those attributable to LU-5030 breaking /proc access completely) are of the form:

            BEFORE:18720 AFTER:18721
             sanity test_151: @@@@@@ FAIL: roc_hit is not safe to use 
            

            so the before/after values are only off by one. I suspect this is just a problem with the test script: the roc_hit_init() function uses cat $DIR/$tfile to read the file, and with proper readahead of files smaller than max_readahead_whole it should issue only a single read. So roc_hit_init() should be changed to use something like:

                            if (( AFTER - BEFORE == 0 || AFTER - BEFORE > 4)); then
                                    rm -rf $dir
                                    error "roc_hit is not safe to use: BEFORE=$BEFORE, AFTER=$AFTER"
                            fi
            

            The rm -rf $dir at the end of roc_hit_init() should also be changed to just use rmdir $dir since this directory should be empty at this point because $file is deleted for each loop.
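
            Putting those suggestions together, a revised roc_hit_init() might look
            roughly like the following. This is a sketch only: helper names such as
            roc_hit, cancel_lru_locks, test_mkdir, and error follow the existing
            test framework, but the body paraphrases rather than quotes the current
            function.

                roc_hit_init() {
                        local dir=$DIR/$tdir
                        local file=$dir/$tfile
                        local BEFORE AFTER idx

                        test_mkdir $dir

                        for idx in $(seq 0 $((OSTCOUNT - 1))); do
                                # one object per OST so each OST's cache is exercised
                                $LFS setstripe -c 1 -i $idx $file

                                # write 16KB, drop client locks, then re-read so the
                                # read comes back through the OSS read cache
                                dd if=/dev/urandom of=$file bs=4k count=4 2>/dev/null
                                cancel_lru_locks osc
                                BEFORE=$(roc_hit)
                                cat $file > /dev/null
                                AFTER=$(roc_hit)

                                # readahead of a file smaller than max_readahead_whole
                                # should give between 1 and 4 cache hits: 0 means the
                                # counter is not moving, more than 4 means something
                                # else is touching it
                                if (( AFTER - BEFORE == 0 || AFTER - BEFORE > 4 )); then
                                        rm -rf $dir
                                        error "roc_hit is not safe to use: BEFORE=$BEFORE, AFTER=$AFTER"
                                fi

                                rm -f $file
                        done

                        # the directory is empty here since $file is removed each loop
                        rmdir $dir
                }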

            sarah Sarah Liu added a comment -

            Hit this again in interop testing with a lustre-master server (EL7) and a 2.5.3 client:

            https://testing.hpdd.intel.com/test_sets/e006681c-1250-11e5-bec9-5254006e85c2


            adilger Andreas Dilger added a comment -

            Haven't seen this in a long time.

            sarah Sarah Liu added a comment - - edited

            Also seen this during a rolling upgrade from 2.5 ldiskfs to 2.6.
            After the MDS and OSS were upgraded to 2.6, both clients stayed on 2.5, and sanity test_151 then failed with the same error.

            before upgrade: 2.5.2
            after upgrade: b2_6-rc2


            People

              Assignee: WC Triage
              Reporter: Maloo
              Votes: 0
              Watchers: 9