
interop: recovery-small test_17b: FAIL: read failed

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • Lustre 2.16.0

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e51696de-5c00-48d7-a726-ba3b5a3cb0d6

      test_17b failed with the following error:

      CMD: onyx-147vm3 /usr/sbin/lctl set_param fail_loc=0xa0000520 fail_val=1
      fail_loc=0xa0000520
      fail_val=1
      1+0 records in
      1+0 records out
      2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.250987 s, 8.4 MB/s
      CMD: onyx-147vm2 dd if=/mnt/lustre/f17b.recovery-small of=/dev/null bs=1M count=1
      onyx-147vm2: dd: failed to open '/mnt/lustre/f17b.recovery-small': No such file or directory
      pdsh@onyx-147vm1: onyx-147vm2: ssh exited with exit code 1
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0239161 s, 43.8 MB/s
      service estimate dropped to 5
       recovery-small test_17b: @@@@@@ FAIL: read failed 
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4584 - 4.18.0-553.16.1.el8_10.x86_64
      servers: https://build.whamcloud.com/job/lustre-b2_15/94 - 4.18.0-553.5.1.el8_lustre.x86_64


      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-small test_17b - read failed


          Activity

            [LU-18368] interop: recovery-small test_17b: FAIL: read failed
            yujian Jian Yu added a comment -

            recovery-small passed in all of the Lustre 2.16.0 RC5 clients with 2.15.5 servers full-part-1 test sessions.

            Does this also fix the various "lfs migrate" and "lfs mirror" and sanity test_56x interop issues?

            These issues are also fixed in Lustre 2.16.0 RC5.


            adilger Andreas Dilger added a comment -

            Does this also fix the various "lfs migrate" and "lfs mirror" and sanity test_56x interop issues?

            The strange thing is that this patch should only be affecting interop testing with older servers that do not support unaligned DIO. There are some test script changes that look like they only affect sanity.sh, and the one code change to ll_direct_IO_impl() to remove the compatibility support for UDIO.

            The only thing I can think of, given the limited changes in this patch, is that there are some rare cases where ci_allow_unaligned_dio is not being set, and this was hidden because of the "compat" code so it was always ignored with ldiskfs servers until the patch was landed.

            Even so, I don't understand how the change in ll_direct_IO_impl() could cause the failure reported by this ticket.

            yujian Jian Yu added a comment -

            After reverting commit ff018bb77a37 (LU-18284 llite: disallow udio exceptions), the recovery-small regression failures disappeared:
            https://testing.whamcloud.com/test_sets/db87591c-6370-4bc6-bfa4-1e7cc4b68c27

            yujian Jian Yu added a comment -

            Lustre 2.16.0 RC1 client + 2.15.5 server:
            https://testing.whamcloud.com/test_sets/d326c3f2-38f1-4672-828a-e5f28f67238a
            recovery-small test 154b failed with LU-18322, which was fixed in RC3.

            Lustre 2.16.0 RC2 client + 2.15.5 server:
            https://testing.whamcloud.com/test_sets/e51696de-5c00-48d7-a726-ba3b5a3cb0d6

            recovery-small test_17b: @@@@@@ FAIL: read failed 
            recovery-small test_102: @@@@@@ FAIL: Cannot mount client 
            recovery-small test_105: @@@@@@ FAIL: mount failed 
            recovery-small test_134: @@@@@@ FAIL: mv failed
            recovery-small test_138: @@@@@@ FAIL: test_138 failed with 5 
            recovery-small test 154b hung
            
            $ git log --oneline 2.16.0-RC1..2.16.0-RC2 | grep -v 'tests:'
            69a079d51f93 New RC 2.16.0-RC2
            659bb1d70431 LU-18070 sec: clear ACL caches if ACL empty
            209607fd7957 LU-18096 enc: ll_get_symlink overlay function
            2a5e8e355498 LU-18247 nodemap: initialize unused fields on disk
            6fe522d3d4f9 LU-17906 pltrpc: don't use non-uptodate peer at connect
            ff018bb77a37 LU-18284 llite: disallow udio exceptions
            13fd5ebef3a7 LU-18101 sec: fix ACL handling on recent kernels again
            66d93ce3e4fc LU-17251 test: improve parallel-scale rr_alloc test
            cf2c5fe27e90 LU-4315 doc: remove usage of lgroff-macros
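
The failure lines listed above share a `<suite> <test>: @@@@@@ FAIL: <message>` layout, so they can be tallied mechanically when comparing one interop run against another. A minimal sketch, assuming every failure line follows that layout; the name `parse_failures` is hypothetical and not part of the Lustre test framework:

```python
import re

# Matches lines such as "recovery-small test_17b: @@@@@@ FAIL: read failed".
# The layout is assumed from the log excerpts quoted in this ticket.
FAIL_RE = re.compile(
    r"^\s*(?P<suite>\S+)\s+(?P<test>test_\w+):\s+@+\s+FAIL:\s+(?P<msg>.*?)\s*$"
)


def parse_failures(log_text):
    """Return (suite, test, message) tuples for each FAIL line in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = FAIL_RE.match(line)
        if match:
            failures.append(
                (match.group("suite"), match.group("test"), match.group("msg"))
            )
    return failures
```

Lines that do not carry the `FAIL:` marker, such as "recovery-small test 154b hung", are skipped by this pattern and would need separate handling.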
            

            tappro Mikhail Pershin added a comment -

            Full master vs. b2_15 testing only started after ATM-3308, so we have no data from before RC2. We can't say that it started failing with RC2; it may have been behaving this way for a long time. We need to start a series of runs of different master checkpoints/tags against a b2_15 server to see where this started.

            So far it looks like all of the failures are related to a second client mount issue.

            adilger Andreas Dilger added a comment -

            It looks like this only started failing this way on 2024-10-12 (RC2?), and failed again on 2024-10-22 (RC4), but not in between or before. There are other failures of this subtest, but they were not during interop, and have a different error message:
            https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=bb9e1202-4f23-11e6-bf87-5254006e85c2&start_date=2024-02-01&end_date=2024-10-31&source=sub_tests#redirect

            It isn't clear why it didn't fail with RC3, unless this depends on a specific test config that was only run manually by Jian for those specific tags?

            pjones Peter Jones added a comment -

            Seems to have occurred in RC4 also, so not related to LU-17906 I guess...

            pjones Peter Jones added a comment -

            Mike

            Another one related to LU-17906?

            Peter

            yujian Jian Yu added a comment -

            More recovery-small regression failures:

            recovery-small test_102: @@@@@@ FAIL: Cannot mount client
            recovery-small test_105: @@@@@@ FAIL: mount failed
            recovery-small test_138: @@@@@@ FAIL: test_138 failed with 5
            
            yujian Jian Yu added a comment -

            This is a regression failure in Lustre 2.16.0 RC2.


            People

              tappro Mikhail Pershin
              maloo Maloo