Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.14.0, Lustre 2.12.6
    • 3

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/12f0d24e-a732-11e7-b786-5254006e85c2.

      The sub-test test_32a failed with the following error:

      wrong file size
      

      Please provide additional information about the failure here.

      Info required for matching: sanityn 32a

          Activity

            [LU-10059] sanityn test_32a: wrong file size

            paf0186 Patrick Farrell added a comment - Test removed in  LU-14838
            xiaolinzang Xiaolin Zang added a comment -

            We see the failure occasionally. The error message

            Input/output error
             sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size

            comes from a failed lstat("/mnt/lustre2/f32a.sanityn"). However, the file's size, stat output, and layout all look normal when inspected after the test failure.

            From the debug logs on mds0 (the test driver), the ldlm_enqueue sent to oss0 fails with rc = -107 (ENOTCONN, not connected):

            00000100:02020000:11.0:1642177889.168635:0:9731:0:(client.c:1371:ptlrpc_check_status()) 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107
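For readers unfamiliar with the return codes in these logs, here is a minimal sketch that decodes them. It assumes Linux errno numbering (107 is ENOTCONN, 5 is EIO, whose message is the "Input/output error" reported by the test); the `LINUX_ERRNO` table and `decode_rc` helper are hypothetical names used only for illustration, not part of Lustre.

```python
# Hypothetical lookup table with Linux errno values (assumption: Linux nodes).
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}


def decode_rc(rc):
    """Turn a negative kernel-style return code into a readable string."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"rc = {rc}: {name} ({text})"


print(decode_rc(-107))
print(decode_rc(-5))
```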

            Slightly earlier by timestamp, the oss0 debug log has the following (op 101 is ldlm_enqueue):

            00000020:00080000:10.0:1642177889.168118:0:14028:0:(tgt_handler.c:770:tgt_request_handle()) operation 101 on unconnected OST from 12345-10.6.4.19@tcp
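The ordering of the two debug log entries can be checked with a short sketch. It assumes the fourth colon-separated field of a debug log line is a seconds.microseconds timestamp (as it appears to be in the two lines quoted here) and that the clocks on mds0 and oss0 are synchronized; the helper name is hypothetical and the messages are shortened with "...".

```python
from decimal import Decimal


def debug_log_timestamp(line):
    """Return the 4th colon-separated field as an exact decimal timestamp.

    Assumed layout: subsystem:mask:cpu:seconds.microseconds:...
    """
    return Decimal(line.split(":", 4)[3])


# Prefixes copied from the mds0 (client side) and oss0 (server side) logs.
client = "00000100:02020000:11.0:1642177889.168635:0:9731:0:(client.c:1371:ptlrpc_check_status()) ..."
server = "00000020:00080000:10.0:1642177889.168118:0:14028:0:(tgt_handler.c:770:tgt_request_handle()) ..."

delta_us = int((debug_log_timestamp(client) - debug_log_timestamp(server)) * 1_000_000)
print(f"oss0 logged the rejection {delta_us} us before mds0 logged the failure")
```

Decimal is used instead of float so that microsecond differences on epoch-sized timestamps are not affected by rounding.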

            Also from the dmesg on mds0:

            [ 1346.778759] LustreError: 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107
            [ 1346.785968] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection to lustre-OST0000 (at 10.6.4.23@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [ 1346.787088] LustreError: 167-0: lustre-OST0000-osc-ffff895ad1180800: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
            [ 1346.796872] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection restored to 10.6.4.23@tcp (at 10.6.4.23@tcp)
            [ 1346.951394] Lustre: DEBUG MARKER: sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size
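To make the sequence in the dmesg excerpt easier to read, the sketch below parses the five lines and measures two intervals: from the first enqueue error to "Connection restored", and from the eviction to the FAIL marker. The regular expression and helper names are assumptions chosen to fit these particular lines, not a general dmesg parser.

```python
import re
from decimal import Decimal

# Matches "[ <uptime>] Lustre|LustreError: [<code>: ]<message>".
LINE_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+(LustreError|Lustre):\s+(?:(\d+-\d+):\s+)?(.*)$"
)

DMESG = """
[ 1346.778759] LustreError: 11-0: lustre-OST0000-osc-ffff895ad1180800: operation ldlm_enqueue to node 10.6.4.23@tcp failed: rc = -107
[ 1346.785968] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection to lustre-OST0000 (at 10.6.4.23@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1346.787088] LustreError: 167-0: lustre-OST0000-osc-ffff895ad1180800: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[ 1346.796872] Lustre: lustre-OST0000-osc-ffff895ad1180800: Connection restored to 10.6.4.23@tcp (at 10.6.4.23@tcp)
[ 1346.951394] Lustre: DEBUG MARKER: sanityn test_32a: @@@@@@ FAIL: cached truncate - wrong file size
"""


def parse(text):
    """Return a list of (timestamp, level, code or None, message) tuples."""
    events = []
    for line in text.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ts, level, code, msg = m.groups()
            events.append((Decimal(ts), level, code, msg))
    return events


def first_ts(events, needle):
    """Timestamp of the first event whose message contains needle."""
    return next(ts for ts, _, _, msg in events if needle in msg)


events = parse(DMESG)
error_to_restored = first_ts(events, "Connection restored") - first_ts(events, "failed: rc = -107")
evicted_to_fail = first_ts(events, "FAIL:") - first_ts(events, "was evicted")
print(f"enqueue error -> connection restored: {error_to_restored} s")
print(f"eviction -> test FAIL marker:         {evicted_to_fail} s")
```

On these lines the connection is lost, the client is evicted, and the connection is restored within a few tens of milliseconds, and the test failure is logged shortly afterwards.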

            It seems that mds0 and oss0 have a temporary connection error. It is unlikely to be a random network issue, because other tests pass when test_32a fails, as we have observed many times.

            Uploaded the following files.  The debug logs are denoted "xxxxx".

            sanityn.test_32a.debug_log.mds0.32a_only    sanityn.test_32a.dmesg.mds0
            sanityn.test_32a.debug_log.oss0.32a_only    sanityn.test_32a.test_log.mds0

             


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40496/
            Subject: LU-10059 tests: sanityn 32a error messages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3274e573957e8b8a067ae28c3f7d7788d40f310e

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40496
            Subject: LU-10059 tests: sanityn 32a error messages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1ec9dafe19fe31b5a19151a33cdb388f359fa7c1

            jamesanunez James Nunez (Inactive) added a comment -

            It looks like we are still seeing this issue. We see two different failure modes in the suite log.

            In the first, lstat fails ("Can't lstat") with no complaint from truncate:

            == sanityn test 32a: lockless truncate =============================================================== 17:20:44 (1603992044)
            CMD: trevis-9vm6 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
            CMD: trevis-9vm3.trevis.whamcloud.com params=\$(/usr/sbin/lctl get_param osc.*.lockless_truncate);
            			 [[ -z \"\" ]] && param= ||
            			 param=\$(grep  <<< \"\$params\");
            			 [[ -z \$param ]] && param=\"\$params\";
            			 while read s; do echo client \$s;
            			 done <<< \"\$param\"
            checking cached lockless truncate
            Can't lstat /mnt/lustre2/f32a.sanityn: Input/output error
             sanityn test_32a: @@@@@@ FAIL: wrong file size

            We see this error for
            2.12.5.67 - https://testing.whamcloud.com/test_sets/9881eb9e-1130-4da5-9312-a4451d67c59c
            2.13.55.104 - https://testing.whamcloud.com/test_sets/7ad28649-b4a5-458a-8b3f-a08820a4b85c

            In the second, truncate itself fails and the test then reports a size mismatch:

            == sanityn test 32a: lockless truncate =============================================================== 18:43:28 (1601491408)
            CMD: trevis-65vm4 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
            CMD: trevis-65vm1.trevis.whamcloud.com params=\$(/usr/sbin/lctl get_param osc.*.lockless_truncate);
            			 [[ -z \"\" ]] && param= ||
            			 param=\$(grep  <<< \"\$params\");
            			 [[ -z \$param ]] && param=\"\$params\";
            			 while read s; do echo client \$s;
            			 done <<< \"\$param\"
            checking cached lockless truncate
            truncate: cannot truncate '/mnt/lustre/f32a.sanityn' to length 8000000: Input/output error
            /mnt/lustre2/f32a.sanityn has size 7340032, not 8000000
             sanityn test_32a: @@@@@@ FAIL: wrong file size

            We see this error for
            2.12.5.50 - https://testing.whamcloud.com/test_sets/7c82a5a3-67f9-4d9e-996b-e6584cbad2d3
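As a quick arithmetic check on the truncate failure reported above: the observed size 7340032 is exactly 7 MiB, which is consistent with the file keeping its pre-truncate size after the truncate to 8000000 bytes returned an I/O error. The sketch below only does the arithmetic; it does not assume anything about how the test script creates the file.

```python
MIB = 1 << 20  # 1048576 bytes

reported = 7340032   # size seen through /mnt/lustre2
requested = 8000000  # length passed to truncate on /mnt/lustre

whole_mib, remainder = divmod(reported, MIB)
print(f"reported size = {whole_mib} MiB + {remainder} bytes")
print(f"shortfall versus requested length = {requested - reported} bytes")
```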
            emoly.liu Emoly Liu added a comment - +1 on master: https://testing.whamcloud.com/test_sets/a345510a-c777-4bc1-8c30-2413be63a24a

            adilger Andreas Dilger added a comment - This hit 6x in the past week.
            adilger Andreas Dilger added a comment - +1 on master https://testing.whamcloud.com/test_sets/e9e439c2-eedd-11e9-add9-52540065bddc

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34070/
            Subject: LU-10059 tests: sanityn 32a restore parameters
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 62b57e34d9a0df1ce4b82650d7e328db5d048b39

            gerrit Gerrit Updater added a comment -

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34081
            Subject: LU-10059 tests: Disable lockless truncate test
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d7f4f322514b522d2a23ce3a698e56a768e4bfbb

            People

              wc-triage WC Triage
              maloo Maloo