Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.14.0, Lustre 2.12.6
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/12f0d24e-a732-11e7-b786-5254006e85c2.

      The sub-test test_32a failed with the following error:

      wrong file size
      

      Please provide additional information about the failure here.

      Info required for matching: sanityn 32a

          Activity

            [LU-10059] sanityn test_32a: wrong file size
            adilger Andreas Dilger made changes -
            Labels Original: DNE always_except New: DNE
            paf0186 Patrick Farrell made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            paf0186 Patrick Farrell added a comment - Test removed in LU-14838
            xiaolinzang Xiaolin Zang added a comment -

            We see the failure occasionally.  Behind the error message

            Input/output error
             sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size    

            is the failure of lstat("/mnt/lustre2/f32a.sanityn"). However, the file's size, stat output, and layout looked normal when inspected after the test failure.

            In the debug logs on mds0 (the test driver), an ldlm_enqueue to oss0 fails with rc = -107 (ENOTCONN, not connected):

            00000100:02020000:11.0:1642177889.168635:0:9731:0:(client.c:1371:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107

            Slightly earlier by timestamp, the oss0 debug log has the following (op 101 is ldlm_enqueue):

            00000020:00080000:10.0:1642177889.168118:0:14028:0:(tgt_handler.c:770:tgt_request_handle()) operation 101 on unconnected OST from 12345-10.6.4.19@tcp
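The ordering of the two debug log entries above can be checked mechanically. The following is a minimal sketch, assuming the colon-separated prefix of each line carries a seconds.microseconds timestamp in the fourth field and a PID in the sixth; the regex group names and the helpers `parse_debug_line` and `extract_rc` are illustrative and do not come from the ticket or from any Lustre tool.

```python
import errno
import re
from typing import NamedTuple, Optional

# Assumed layout: subsys:mask:cpu:sec.usec:stack:pid:ext_pid:(file:line:func()) msg
# The group names are illustrative labels only.
DEBUG_LINE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<sec>\d+)\.(?P<usec>\d{6}):(?P<stack>\d+):(?P<pid>\d+):(?P<ext_pid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)


class DebugEntry(NamedTuple):
    usec_total: int  # timestamp as integer microseconds, avoids float rounding
    pid: int
    func: str
    msg: str


def parse_debug_line(text: str) -> DebugEntry:
    """Split one debug log line into timestamp, PID, function, and message."""
    m = DEBUG_LINE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised debug log line: {text!r}")
    usec_total = int(m["sec"]) * 1_000_000 + int(m["usec"])
    return DebugEntry(usec_total, int(m["pid"]), m["func"], m["msg"])


def extract_rc(msg: str) -> Optional[int]:
    """Return the integer after 'rc = ' in a message, or None if absent."""
    m = re.search(r"rc = (-?\d+)", msg)
    return int(m.group(1)) if m else None


# The two lines quoted in this comment (mds0 first, then oss0).
CLIENT_LINE = (
    "00000100:02020000:11.0:1642177889.168635:0:9731:0:"
    "(client.c:1371:ptlrpc_check_status()) 11-0: "
    "lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue "
    "to node 10.6.4.23@tcp failed: rc = -107"
)
SERVER_LINE = (
    "00000020:00080000:10.0:1642177889.168118:0:14028:0:"
    "(tgt_handler.c:770:tgt_request_handle()) operation 101 on "
    "unconnected OST from 12345-10.6.4.19@tcp"
)

client = parse_debug_line(CLIENT_LINE)
server = parse_debug_line(SERVER_LINE)
gap_usec = client.usec_total - server.usec_total
rc = extract_rc(client.msg)

print(f"{server.func} logged {gap_usec} us before {client.func}")
# errno names are platform-specific; on Linux, 107 maps to ENOTCONN.
print(f"rc = {rc} ({errno.errorcode.get(-rc, 'unknown')})")
```

With the two lines quoted here, the oss0 entry precedes the mds0 entry by 517 microseconds. This is consistent with the server rejecting the request before the client reports the failure, provided the clocks on the two nodes are closely synchronized.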

            Also from the dmesg on mds0:

            [ 1346.778759] LustreError: 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107
            [ 1346.785968] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection to lustre-OST0000 (at 10.6.4.23@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [ 1346.787088] LustreError: 167-0: lustre-OST0000-osc-ffff895ad1180800: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
            [ 1346.796872] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection restored to 10.6.4.23@tcp (at 10.6.4.23@tcp)
            [ 1346.951394] Lustre: DEBUG MARKER: sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size
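To see how tightly the dmesg events above cluster, the bracketed timestamps can be turned into offsets from the first error. This is only a sketch: the helper names `parse_dmesg` and `offsets_from_first` are hypothetical, and the regex assumes the `[ seconds.microseconds]` prefix shown in the excerpt.

```python
import re

# Matches the "[ 1346.778759] message" prefix used in the excerpt above.
DMESG_LINE = re.compile(r"^\[\s*(\d+)\.(\d{6})\]\s*(.*)$")

DMESG_EXCERPT = """\
[ 1346.778759] LustreError: 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107
[ 1346.785968] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection to lustre-OST0000 (at 10.6.4.23@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1346.787088] LustreError: 167-0: lustre-OST0000-osc-ffff895ad1180800: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[ 1346.796872] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection restored to 10.6.4.23@tcp (at 10.6.4.23@tcp)
[ 1346.951394] Lustre: DEBUG MARKER: sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size
"""


def parse_dmesg(text):
    """Return (timestamp_in_microseconds, message) for each matching line."""
    events = []
    for raw in text.splitlines():
        m = DMESG_LINE.match(raw.strip())
        if m:
            usec = int(m.group(1)) * 1_000_000 + int(m.group(2))
            events.append((usec, m.group(3)))
    return events


def offsets_from_first(events):
    """Re-express each timestamp as microseconds since the first event."""
    base = events[0][0]
    return [(t - base, msg) for t, msg in events]


timeline = offsets_from_first(parse_dmesg(DMESG_EXCERPT))
for offset_usec, msg in timeline:
    # Print offsets in milliseconds with a shortened message for readability.
    print(f"+{offset_usec / 1000:9.3f} ms  {msg[:60]}")
```

On this excerpt, the connection loss, eviction, and reconnection all fall within about 18 ms of the first error, and the test failure marker follows roughly 173 ms after it.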

            It seems that mds0 and oss0 hit a temporary connection error. It is unlikely to be a random network issue, because other tests pass when test_32a fails, as we have observed many times.

            Uploaded the following files.  The debug logs are denoted "xxxxx".

            sanityn.test_32a.debug_log.mds0.32a_only    sanityn.test_32a.dmesg.mds0
            sanityn.test_32a.debug_log.oss0.32a_only    sanityn.test_32a.test_log.mds0

             

            xiaolinzang Xiaolin Zang made changes -
            Attachment New: sanityn.test_32a.debug_log.mds0.32a_only [ 41926 ]
            Attachment New: sanityn.test_32a.debug_log.oss0.32a_only [ 41927 ]
            Attachment New: sanityn.test_32a.dmesg.mds0 [ 41928 ]
            Attachment New: sanityn.test_32a.test_log.mds0 [ 41929 ]
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 25704 ]

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40496/
            Subject: LU-10059 tests: sanityn 32a error messages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3274e573957e8b8a067ae28c3f7d7788d40f310e
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24791 ]

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40496
            Subject: LU-10059 tests: sanityn 32a error messages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1ec9dafe19fe31b5a19151a33cdb388f359fa7c1
            jamesanunez James Nunez (Inactive) made changes -
            Affects Version/s New: Lustre 2.14.0 [ 14490 ]
            Affects Version/s New: Lustre 2.12.6 [ 14707 ]

            People

              wc-triage WC Triage
              maloo Maloo
              Votes: 1
              Watchers: 12
