[LU-18368] interop: recovery-small test_17b: FAIL: read failed

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.16.0
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e51696de-5c00-48d7-a726-ba3b5a3cb0d6

      test_17b failed with the following error:

      CMD: onyx-147vm3 /usr/sbin/lctl set_param fail_loc=0xa0000520 fail_val=1
      fail_loc=0xa0000520
      fail_val=1
      1+0 records in
      1+0 records out
      2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.250987 s, 8.4 MB/s
      CMD: onyx-147vm2 dd if=/mnt/lustre/f17b.recovery-small of=/dev/null bs=1M count=1
      onyx-147vm2: dd: failed to open '/mnt/lustre/f17b.recovery-small': No such file or directory
      pdsh@onyx-147vm1: onyx-147vm2: ssh exited with exit code 1
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0239161 s, 43.8 MB/s
      service estimate dropped to 5
       recovery-small test_17b: @@@@@@ FAIL: read failed 
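
The `fail_loc=0xa0000520` value in the log above packs control flags into the high bits and a fault-site number into the low 16 bits. The sketch below splits it apart. The flag constants are written from memory of Lustre's libcfs fail-injection header and should be treated as assumptions to verify against the source tree in use; `decode_fail_loc` is a hypothetical helper, not part of any Lustre tool.

```python
# Assumed flag values (recalled from libcfs fail-injection definitions;
# verify against your Lustre tree before relying on them).
FLAGS = {
    "CFS_FAIL_ONCE": 0x80000000,  # trigger the fault only once
    "CFS_FAILED":    0x40000000,  # set after the fault has fired
    "CFS_FAIL_SKIP": 0x20000000,  # skip fail_val hits, then fail
    "CFS_FAIL_SOME": 0x10000000,  # fail only fail_val times
    "CFS_FAIL_RAND": 0x08000000,  # fail roughly 1/fail_val of the time
}
CFS_FAIL_MASK_LOC = 0x0000FFFF    # low 16 bits: fault-site identifier


def decode_fail_loc(value):
    """Return (site_id, [flag names set]) for a fail_loc value."""
    names = [name for name, bit in FLAGS.items() if value & bit]
    return value & CFS_FAIL_MASK_LOC, names


site, flags = decode_fail_loc(0xA0000520)
print(hex(site), flags)
```

Under these assumptions the value decodes to site `0x520` with the once and skip flags set, which together with `fail_val=1` would mean "skip one hit, then fail once".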
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4584 - 4.18.0-553.16.1.el8_10.x86_64
      servers: https://build.whamcloud.com/job/lustre-b2_15/94 - 4.18.0-553.5.1.el8_lustre.x86_64


      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-small test_17b - read failed


          Activity

            tappro Mikhail Pershin added a comment -

            Testing of full master against b2_15 only started after ATM-3308, so we have no data from before RC2. We can't say that it started failing with RC2; it may have been behaving this way for a long time. We need to start a series of runs of different master checkpoints/tags against a b2_15 server to see where this started.

            So far it looks like all of the failures are related to a second client mount issue.
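
The series of runs across master checkpoints/tags proposed above amounts to a search for the first failing tag. A minimal sketch of that search as a bisection follows; it assumes the tags are ordered oldest to newest and that once the failure appears it persists in every later tag. The tag names and the `fails` callback are hypothetical placeholders for whatever actually launches an interop run against a b2_15 server.

```python
from typing import Callable, Optional, Sequence


def first_failing(tags: Sequence[str],
                  fails: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest tag for which fails(tag) is True.

    Assumes every tag after the first failing one also fails.
    Returns None if no tag fails.
    """
    lo, hi = 0, len(tags)
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(tags[mid]):
            hi = mid          # failure is at mid or earlier
        else:
            lo = mid + 1      # failure, if any, is after mid
    return tags[lo] if lo < len(tags) else None


# Hypothetical tag list and a stub result in place of a real test run.
tags = ["tag-a", "tag-b", "tag-c", "tag-d", "tag-e"]
print(first_failing(tags, lambda tag: tag >= "tag-c"))
```

If the failure is intermittent, as the RC3 result discussed in this ticket may suggest, the persistence assumption does not hold for a single run, and `fails` would need to repeat the test several times per tag before reporting a pass.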

            adilger Andreas Dilger added a comment -

            It looks like this only started failing this way on 2024-10-12 (RC2?), and failed again on 2024-10-22 (RC4), but not in between or before. There are other failures of this subtest, but they were not during interop, and have a different error message:
            https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=bb9e1202-4f23-11e6-bf87-5254006e85c2&start_date=2024-02-01&end_date=2024-10-31&source=sub_tests#redirect

            It isn't clear why it didn't fail with RC3, unless this depends on a specific test config that was only run manually by Jian for those specific tags?

            pjones Peter Jones added a comment -

            Seems to have occurred in RC4 also, so not related to LU-17906 I guess...

            pjones Peter Jones added a comment -

            Mike

            Another one related to LU-17906?

            Peter

            yujian Jian Yu added a comment -

            More recovery-small regression failures:

            recovery-small test_102: @@@@@@ FAIL: Cannot mount client
            recovery-small test_105: @@@@@@ FAIL: mount failed
            recovery-small test_138: @@@@@@ FAIL: test_138 failed with 5
            
            yujian Jian Yu added a comment -

            This is a regression failure in Lustre 2.16.0 RC2.


            People

              Assignee: tappro Mikhail Pershin
              Reporter: maloo Maloo