  Lustre / LU-17777

runtests test_1: FAIL: old and new files are different: rc=22

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.17.0
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/72dfd1b1-7efa-4dcc-a366-043549333756
      https://testing.whamcloud.com/test_sets/1dde1cd4-24fd-4ae3-ab20-d7261788833f

      test_1 failed with the following error:

      CMD: onyx-106vm11 /usr/sbin/lctl set_param -P lod.*.mdt_hash=crush
      comparing 520 previously copied files
      Files /etc/yum.repos.d/redhat.repo and /mnt/lustre/d1.runtests//etc/yum.repos.d/redhat.repo differ
      Files /etc/pki/entitlement/2519028287967039457.pem and /mnt/lustre/d1.runtests//etc/pki/entitlement/2519028287967039457.pem differ
       runtests test_1: @@@@@@ FAIL: old and new files are different: rc=22 
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master-next/784 - 5.14.0-362.18.1.el9_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-master-next/784 - 5.14.0-362.18.1_lustre.el9.x86_64

      This failed for the first time with this error on 2024-04-24, on two separate test runs: one on an unlanded patch, and one a "full" test run on master. Strangely, both failures were reported on the same two files. There don't appear to be any Lustre console errors immediately before this failure (there are a few further back, when the filesystem is remounted in the test).

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      runtests test_1 - old and new files are different: rc=22

      Attachments

        Issue Links

          Activity

            [LU-17777] runtests test_1: FAIL: old and new files are different: rc=22

            adilger Andreas Dilger added a comment -

            +1 on b2_15:
            https://testing.whamcloud.com/test_sets/6ffef452-6051-4199-90f0-cec559c8aaf6

            colmstea, is there some RHEL license manager running that is randomly updating these files in /etc while the test is running? We can make the test more robust by having it re-check and exclude files in /etc that have been modified since the test was started, but that is not retroactive for all of the branches and interop tests, so it would be nice if whatever is triggering this were turned off, or done during provisioning instead of while the test is running...
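            The re-check idea above could be fairly small. A rough bash sketch (not the current runtests.sh code; check_file and the $TMP/runtests.start stamp file are made-up names for illustration): record the test start time, and on a miscompare only count it as a failure if the source file has not changed since then:

                # stamp file recording when the test started (hypothetical name)
                stampfile=$TMP/runtests.start
                touch "$stampfile"

                # hypothetical helper: a miscompare only fails the test if the
                # source file has NOT been modified since the test started
                check_file() {
                        local src=$1 dst=$2

                        cmp -s "$src" "$dst" && return 0
                        if [[ "$src" -nt "$stampfile" ]]; then
                                echo "ignoring $src: source modified during the test"
                                return 0
                        fi
                        echo "FAIL: $src and $dst differ"
                        return 22
                }

            As noted above, this only helps branches that pick up the change, not existing releases used in interop testing.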
            adilger Andreas Dilger added a comment - - edited

            Unfortunately this was hit again on another test run, with the same files causing issues:
            https://testing.whamcloud.com/test_sets/abfde0a8-37fa-4801-8f76-3d75d434d243

            Files /etc/pki/consumer/cert.pem and /mnt/lustre/d1.runtests//etc/pki/consumer/cert.pem differ
            diff: /etc/pki/entitlement/3741564328631791613.pem: No such file or directory
            diff: /etc/pki/entitlement/3741564328631791613-key.pem: No such file or directory
             runtests test_1: @@@@@@ FAIL: old and new files are different: rc=22 
            

            but it hit during interop testing with a 2.12.9 client, so that rules out any client-side changes on master. It appears the clients were running RHEL, so it seems like something RHEL is doing itself.

            We can partly work around this by excluding those files from the copy list, but that will only fix new clients and not the old ones used for interop testing. It would likely be better to simply stop whatever is causing this process to run inside the VM.

            pjones Peter Jones added a comment -

            If I understand correctly, this is no longer happening


            adilger Andreas Dilger added a comment -

            It might not have been an "rpm update" but rather something to do with enabling a "genuine RHEL" license on this newly-installed system so that it can run. If this starts failing with any frequency, we can exclude those files, and/or re-check after a failure whether the affected files have been modified since the start of the test, but I don't want to make the code more complex if this only ever happened the one time when we were messing with RHEL licenses...

            bzzz Alex Zhuravlev added a comment -

            > It is pure speculation on my part why those files were updated. From their filenames, it looks like they may be related to RHEL licensing (I don't have the "/etc/pki/entitlement/" directory on my Rocky/Alma VMs), and we've been having RHEL license issues recently, so yet one more reason to move over to Rocky.

            Yes, I understand that, but even if this was not the reason for the failure, it still makes sense to disable it - no need to waste CPU/memory/network updating a just-instantiated image?
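            If the writer does turn out to be RHEL's subscription management (an assumption based on the /etc/pki/entitlement and /etc/pki/consumer paths), disabling it on throwaway test VMs could be as simple as:

                # assumption: rhsmcertd (the RHSM certificate daemon) is what is
                # rewriting /etc/pki/entitlement/*.pem and /etc/pki/consumer/cert.pem
                systemctl status rhsmcertd           # confirm it is actually running
                systemctl disable --now rhsmcertd    # stop it on the test image

            Whether that is acceptable for licensed RHEL test nodes is a question for whoever manages provisioning.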

            adilger Andreas Dilger added a comment -

            > should we ask DCO to disable this as it doesn't make any sense for our purpose

            It is pure speculation on my part why those files were updated. From their filenames, it looks like they may be related to RHEL licensing (I don't have the "/etc/pki/entitlement/" directory on my Rocky/Alma VMs), and we've been having RHEL license issues recently, so yet one more reason to move over to Rocky.

            I think any work to diagnose this issue should stop until it starts being an issue again. The more I look at it, the more I think this is unrelated to Lustre, but that wasn't clear originally when it hit 2/20 test runs right after a branch landing.

            bzzz Alex Zhuravlev added a comment -

            > It could be that the RPMs in the VMs were being updated while this test was running, but I'd think that would be done on a source image and not the live VM?

            Should we ask DCO to disable this, as it doesn't make any sense for our purpose - just a waste of resources?

            adilger Andreas Dilger added a comment -

            The reason runtests uses existing files from /etc is that they have a variety of sizes, directory layouts, symlinks, etc.

            A couple options if we think the files are legitimately changing:

            • show the diff of the file content so we can compare why it changed
            • generate checksums of the files during copy and compare at the end
            • check the source file timestamps if there was a comparison failure
            • generate a synthetic source dataset each time

            It looks like one of the failures was on master-next, and because of unrelated test issues no tests have completed since yesterday.

            It could be that the RPMs in the VMs were being updated while this test was running, but I'd think that would be done on a source image and not the live VM?

            Let's wait for a day or two if this repeats, and how often.

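            Of the options listed above, the checksum-at-copy-time one removes the dependency on /etc staying stable entirely. A rough bash sketch (not actual runtests.sh code; $DST and $FILES stand in for whatever target directory and file list the script really uses):

                # copy each file and checksum the exact bytes written, so later
                # changes to the source in /etc cannot cause a false failure
                sumfile=$TMP/runtests.md5
                : > "$sumfile"
                while read -r f; do
                        mkdir -p "$DST$(dirname "$f")"
                        sum=$(tee "$DST$f" < "$f" | md5sum)
                        printf '%s  %s\n' "${sum%% *}" "$DST$f" >> "$sumfile"
                done < "$FILES"

                # ... rest of the test runs here ...

                # verify the copies on Lustre against the checksums taken at copy
                # time; any mismatch now points at the filesystem, not at /etc
                md5sum -c --quiet "$sumfile" || echo "FAIL: copied files changed"

            Unlike cp -p, tee does not preserve ownership or timestamps, so the real change would still need to handle attributes and symlinks the way the current copy does.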

            "        # let's do a little more, with files that haven't changed in the
                    # last day (hopefully they don't change during test)
                    local findfiles=$TMP/runtests.files"

            So it was the same file both times...  I can't seem to reproduce this locally.  I wonder if the files are changing during the test?  (It feels like there should be a better way to do this than copy a bunch of files that haven't changed lately and hope they don't change soon...)

            By the way, I'd like to look at more, but that search only finds 1 failure...  I can't see anything wrong with the search, so I'm not sure why.

            paf Patrick Farrell (Inactive) added a comment - "        # let's do a little more, with files that haven't changed in the         # last day (hopefully they don't change during test)         local findfiles=$TMP/runtests.files" So it was the same file both times...  I can't seem to reproduce this locally.  I wonder if the files are changing during the test?  (It feels like there should be a better way to do this than copy a bunch of files that haven't changed lately and hope they don't change soon...) By the way, I'd like to look at more, but that search only finds 1 failure...  I can't see anything wrong with the search, so I'm not sure why.
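            For reference, the selection the snippet above describes presumably boils down to something like the following (a sketch, not the literal script), which is exactly the window a daemon rewriting /etc mid-test can slip through:

                # take only files that have not been modified in the last day
                # (-mtime +1) and hope they stay unchanged for the whole run
                findfiles=$TMP/runtests.files
                find /etc -type f -mtime +1 > "$findfiles"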

            paf Patrick Farrell (Inactive) added a comment -

            Yeah, I'm also trying to reproduce locally - we'll see if I can. If I can, I'll just try to bisect.

            Otherwise the set is small enough we could conceivably bisect even with the slow Maloo turnaround times.
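            If it does reproduce locally, bisecting the suspect range from the landing list in the comment below is mechanical enough; a sketch, with <last-good> standing in for the commit just before the 2024-04-23 landing:

                git bisect start
                git bisect bad 69eb7b89c7     # tip of the suspect batch
                git bisect good <last-good>   # placeholder: commit before the landing
                # rebuild, re-run runtests test_1, then mark each step with
                # "git bisect good" or "git bisect bad" until it converges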
            adilger Andreas Dilger added a comment - - edited

            paf, I just wanted to bring this issue to your attention, since this relatively simple test is now showing signs of data corruption when doing a simple copy/comparison of 500 files.

            It just started failing since the most recent patch landing on the 23rd, so very likely related to one of the patches in that batch, which include several CLIO patches from you (I've taken out a few patches that only changed unrelated test scripts, and even many of the remainder could be eliminated with 99% certainty):

            # git log --oneline --after 2024-04-23
            69eb7b89c7 LU-17630 osc: add cond_resched() to osc_lru_shrink()
            985f49cf2e LU-17422 clio: use page pools for UDIO/hybrid
            e804112e34 LU-17422 osc: Clear PageChecked on bounce pages
            a5db30ad97 LU-17422 obdclass: rename sptlrpc pool and move init
            77dde8d4c5 LU-17422 obdclass: rename ptlrpc_page_pool
            f61a8c9f5f LU-17422 obdclass: move page pools to obdclass
            81611d7a7c LU-17550 lov: check for empty layout on DIO
            43afdc0ca2 LU-17533 llite: call merge attr on all writes
            49730821c4 LU-16736 quota: set revoke time to avoid endless wait
            e8d3180540 LU-17269 obdclass: fix locking for class_register/deregister
            bcb954c140 LU-13374 mdd: fix close time update race with set-in-past
            7e01787863 LU-12452 lnet: allow to set IP ToS value per-NI
            2b210f3905 LU-17440 lnet: prevent errorneous decref for asym route
            28c366cee6 LU-17218 ofd: improve filter_fid upgrade compatibility
            0661102dea LU-13802 ptlrpc: correctly remove inflight request
            c633a42f11 LU-13802 llite: add fail loc to force bio-dio switch
            4529e5939b LU-13814 osc: add osc transient page ops
            1430119769 LU-13814 clio: Remove owner for transient pages
            807b5650dc LU-13814 clio: add io to cio_submit
            

            It's a bit too early to have good statistics on the failure rate (so far 2/20 failures, but most patches will not have been rebased onto this tree yet). The related test runs will appear in the following Maloo search:
            https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&test_set_script_id=f946ba8e-32bc-11e0-aaee-52540025f9ae&sub_test_script_id=9ae144d2-6181-11e2-be04-52540035b04c&start_date=2024-04-24&end_date=2024-04-30&source=sub_tests#redirect


            People

              Assignee: arshad512 Arshad Hussain
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 11