[LU-9438] sanity-lfsck test_17: (1.2) f1 (wrong) size should be 1048576, but got

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Environment: trevis-35, full, SLES12 clients
        EL7, master branch, v2.9.56.11, b3565
    • Severity: 3

    Description

      https://testing.hpdd.intel.com/test_sessions/20ddc92f-b9fe-482d-ac1b-1602a513c824

      From test_log:

      CMD: trevis-35vm7 /usr/sbin/lctl set_param fail_val=0 fail_loc=0x1614
      fail_val=0
      fail_loc=0x1614
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00496945 s, 211 MB/s
      total: 1 open/close in 0.00 seconds: 479.18 ops/second
      error: set_param: setting /sys/fs/lustre/ldlm/namespaces/lustre-MDT0000-mdc-ffff88002c485000/lru_size=clear: Invalid argument
      ldlm.namespaces.lustre-MDT0000-mdc-ffff88002c485000.lock_unused_count=2
      CMD: trevis-35vm7 /usr/sbin/lctl set_param fail_loc=0 fail_val=0
      fail_loc=0
      fail_val=0
      /mnt/lustre/d17.sanity-lfsck/f0 and /mnt/lustre/d17.sanity-lfsck/guard use the same OST-objects
      /mnt/lustre/d17.sanity-lfsck/f1 and /mnt/lustre/d17.sanity-lfsck/guard use the same OST-objects
      ls: cannot access '/mnt/lustre/d17.sanity-lfsck/f1': Input/output error
      /usr/lib64/lustre/tests/sanity-lfsck.sh: line 1906: [: -eq: unary operator expected
       sanity-lfsck test_17: @@@@@@ FAIL: (1.2) f1 (wrong) size should be 1048576, but got  
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4931:error()
        = /usr/lib64/lustre/tests/sanity-lfsck.sh:1907:test_17()
        = /usr/lib64/lustre/tests/test-framework.sh:5207:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
        = /usr/lib64/lustre/tests/sanity-lfsck.sh:1940:main()
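
      The "[: -eq: unary operator expected" line in the log is what bash prints when the left operand of an unquoted numeric test expands to nothing, which would be consistent with the failed ls leaving the size empty. A minimal illustration, assuming a hypothetical variable named size; this is not the code at sanity-lfsck.sh line 1906:

```shell
#!/bin/bash
# Illustration only; the variable name "size" is hypothetical.
# Simulate a size lookup that failed and produced no output.
size=""

# Unquoted, the test expands to '[ -eq 1048576 ]', which is malformed.
# bash prints "[: -eq: unary operator expected" and the test fails
# with a non-zero status instead of comparing anything.
unquoted_rc=0
[ $size -eq 1048576 ] || unquoted_rc=$?

# Checking for an empty value first avoids the malformed test.
if [ -z "$size" ]; then
	echo "size lookup returned nothing"
fi
```

      The empty trailing value in "but got" in the FAIL line fits the same picture: the size was never obtained, so there was nothing to print.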
      

          Activity

            bogl Bob Glossman (Inactive) added a comment -

            No instances of this failure have been seen since June 2017.

            bogl Bob Glossman (Inactive) added a comment -

            No progress; low priority. This is most likely a test-only problem.

            I can make the 'clear' operation fail, but the test passes anyway, so I don't think those two effects are directly related. I'm not close to a root cause on either.

            adilger Andreas Dilger added a comment -

            Well, if f1 didn't exist, the ls would return ENOENT instead of EIO, so the problem isn't that the file is missing. John pointed out the earlier error message:

            error: set_param: setting /sys/fs/lustre/ldlm/namespaces/lustre-MDT0000-mdc-ffff88002c485000/lru_size=clear: Invalid argument
            ldlm.namespaces.lustre-MDT0000-mdc-ffff88002c485000.lock_unused_count=2

            So it isn't clear why the "clear" failed to cancel the locks. That could easily be attributed to a change in /proc handling for SLES12 and needs to be investigated.
            bogl Bob Glossman (Inactive) added a comment - edited

            Looking at the specific code in the test, it looks like a missing $DIR/$tdir/f1 would produce precisely the effects captured in the failure logs.

            I'm unable to determine why this file would exist when running on RHEL but not on SLES.

            The failure output might make more sense if the test checked for, and correctly reported, the absence of files like f0 and f1, rather than assuming that an ls command on such files will print a size on stdout that can be parsed.
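
            The kind of check suggested in the comment above could look roughly like the following. This is a sketch only: check_size is a hypothetical helper, not a function from sanity-lfsck.sh or test-framework.sh, and it assumes GNU stat (the -c %s format option) is available on the client.

```shell
#!/bin/bash
# Hypothetical helper, not part of the Lustre test suite: compare a
# file's size with an expected value and report lookup failures
# separately from size mismatches.
check_size() {
	local file=$1
	local expected=$2
	local size

	# stat exits non-zero on ENOENT, EIO and similar errors; its
	# message is captured so the caller sees why the lookup failed.
	if ! size=$(stat -c %s "$file" 2>&1); then
		echo "cannot stat $file: $size" >&2
		return 2
	fi

	# Quoted operands keep the test well-formed even if a value is empty.
	if [ "$size" -ne "$expected" ]; then
		echo "$file size should be $expected, but got $size" >&2
		return 1
	fi
	return 0
}
```

            A caller could then tell "size is wrong" (status 1) from "size could not be read at all" (status 2), for example with check_size $DIR/$tdir/f1 1048576, and the underlying error text such as "Input/output error" would appear in the failure message rather than an empty value.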
            pjones Peter Jones added a comment -

            Bob,

            Could you please look into this one?

            Thanks,
            Peter

            jamesanunez James Nunez (Inactive) added a comment -

            sanity-lfsck test 17 started failing with this error on April 13, 2017. In every occurrence of this failure, the client is SLES11SP* or SLES12SP*. There are no failures of this test with this error for RHEL/CentOS clients.

            The logs for the earliest failures are at:
            https://testing.hpdd.intel.com/test_sets/400cd9f8-2113-11e7-8920-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/c3b83dfe-212e-11e7-9073-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/c1bc4fd4-2130-11e7-9de9-5254006e85c2

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: jcasper James Casper (Inactive)