  Lustre / LU-17777

runtests test_1: FAIL: old and new files are different: rc=22

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.17.0
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/72dfd1b1-7efa-4dcc-a366-043549333756
      https://testing.whamcloud.com/test_sets/1dde1cd4-24fd-4ae3-ab20-d7261788833f

      test_1 failed with the following error:

      CMD: onyx-106vm11 /usr/sbin/lctl set_param -P lod.*.mdt_hash=crush
      comparing 520 previously copied files
      Files /etc/yum.repos.d/redhat.repo and /mnt/lustre/d1.runtests//etc/yum.repos.d/redhat.repo differ
      Files /etc/pki/entitlement/2519028287967039457.pem and /mnt/lustre/d1.runtests//etc/pki/entitlement/2519028287967039457.pem differ
       runtests test_1: @@@@@@ FAIL: old and new files are different: rc=22 
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master-next/784 - 5.14.0-362.18.1.el9_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-master-next/784 - 5.14.0-362.18.1_lustre.el9.x86_64

      This failed for the first time with this error on 2024-04-24, on two separate test runs: one on an unlanded patch, and one a "full" test run on master. Strangely, both failures were reported on the same two files. There don't appear to be any Lustre console errors immediately before this failure (there are a few further back, when the filesystem is remounted in the test).

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      runtests test_1 - old and new files are different: rc=22

      Attachments

        Issue Links

          Activity

            [LU-17777] runtests test_1: FAIL: old and new files are different: rc=22

            adilger Andreas Dilger added a comment -

            +1 on b2_15:
            https://testing.whamcloud.com/test_sets/6ffef452-6051-4199-90f0-cec559c8aaf6

            colmstea, is there some RHEL license manager running that is randomly updating these files in /etc while the test is running? We can make the test more robust by having it re-check and exclude files in /etc that have been modified since the test was started, but that is not retroactive for all of the branches and interop tests, so it would be nice if whatever is triggering this were turned off, or done during provisioning instead of while the test is running...
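            The re-check idea above could be fairly small. A rough bash sketch (not the current runtests.sh code; check_file and the $TMP/runtests.start stamp file are made-up names for illustration): record the test start time, and on a miscompare only count it as a failure if the source file has not changed since then:

                # stamp file recording when the test started (hypothetical name)
                stampfile=$TMP/runtests.start
                touch "$stampfile"

                # hypothetical helper: a miscompare only fails the test if the
                # source file has NOT been modified since the test started
                check_file() {
                        local src=$1 dst=$2

                        cmp -s "$src" "$dst" && return 0
                        if [[ "$src" -nt "$stampfile" ]]; then
                                echo "ignoring $src: source modified during the test"
                                return 0
                        fi
                        echo "FAIL: $src and $dst differ"
                        return 22
                }

            As noted above, this only helps branches that pick up the change, not existing releases used in interop testing.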
            adilger Andreas Dilger added a comment - - edited

            Unfortunately this was hit again on another test run, with the same files causing issues:
            https://testing.whamcloud.com/test_sets/abfde0a8-37fa-4801-8f76-3d75d434d243

            Files /etc/pki/consumer/cert.pem and /mnt/lustre/d1.runtests//etc/pki/consumer/cert.pem differ
            diff: /etc/pki/entitlement/3741564328631791613.pem: No such file or directory
            diff: /etc/pki/entitlement/3741564328631791613-key.pem: No such file or directory
             runtests test_1: @@@@@@ FAIL: old and new files are different: rc=22 
            

            but it hit during interop testing with a 2.12.9 client, so that rules out any client-side changes on master. It appears the clients were running RHEL, so it seems like something RHEL is doing itself.

            We can partly work around this by excluding those files from the copy list, but that will only fix new clients and not the old ones used for interop testing. It would likely be better to simply stop whatever is causing this process to run inside the VM.

            pjones Peter Jones added a comment -

            If I understand correctly, this is no longer happening


            adilger Andreas Dilger added a comment -

            It might not have been an "rpm update" but rather something to do with enabling a "genuine RHEL" license on this newly-installed system so that it can run. If this starts failing with any frequency, we can exclude those files, and/or re-check after a failure whether the affected files have been modified since the start of the test, but I don't want to make the code more complex if this only ever happened the one time when we were messing with RHEL licenses...

            bzzz Alex Zhuravlev added a comment -

            > It is pure speculation on my part why those files were updated. From their filenames, it looks like they may be related to RHEL licensing (I don't have the "/etc/pki/entitlement/" directory on my Rocky/Alma VMs), and we've been having RHEL license issues recently, so yet one more reason to move over to Rocky.

            Yes, I understand that, but even if this was not the reason for the failure, it still makes sense to disable it - no need to waste CPU/memory/network updating a just-instantiated image?
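            If the writer does turn out to be RHEL's subscription management (an assumption based on the /etc/pki/entitlement and /etc/pki/consumer paths), disabling it on throwaway test VMs could be as simple as:

                # assumption: rhsmcertd (the RHSM certificate daemon) is what is
                # rewriting /etc/pki/entitlement/*.pem and /etc/pki/consumer/cert.pem
                systemctl status rhsmcertd           # confirm it is actually running
                systemctl disable --now rhsmcertd    # stop it on the test image

            Whether that is acceptable for licensed RHEL test nodes is a question for whoever manages provisioning.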

            adilger Andreas Dilger added a comment -

            > should we ask DCO to disable this as it doesn't make any sense for our purpose

            It is pure speculation on my part why those files were updated. From their filenames, it looks like they may be related to RHEL licensing (I don't have the "/etc/pki/entitlement/" directory on my Rocky/Alma VMs), and we've been having RHEL license issues recently, so yet one more reason to move over to Rocky.

            I think any work to diagnose this issue should stop until it starts being an issue again. The more I look at it, the more I think this is unrelated to Lustre, but that wasn't clear originally when it hit 2/20 test runs right after a branch landing.

            bzzz Alex Zhuravlev added a comment -

            > It could be that the RPMs in the VMs were being updated while this test was running, but I'd think that would be done on a source image and not the live VM?

            Should we ask DCO to disable this, as it doesn't make any sense for our purpose - just a waste of resources?

            adilger Andreas Dilger added a comment -

            The reason runtests uses existing files from /etc is that they have a variety of sizes, directory layouts, symlinks, etc.

            A couple options if we think the files are legitimately changing:

            • show the diff of the file content so we can compare why it changed
            • generate checksums of the files during copy and compare at the end
            • check the source file timestamps if there was a comparison failure
            • generate a synthetic source dataset each time

            It looks like one of the failures was on master-next, and because of unrelated test issues no tests have completed since yesterday.

            It could be that the RPMs in the VMs were being updated while this test was running, but I'd think that would be done on a source image and not the live VM?

            Let's wait for a day or two if this repeats, and how often.

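            Of the options listed above, the checksum-at-copy-time one removes the dependency on /etc staying stable entirely. A rough bash sketch (not actual runtests.sh code; $DST and $FILES stand in for whatever target directory and file list the script really uses):

                # copy each file and checksum the exact bytes written, so later
                # changes to the source in /etc cannot cause a false failure
                sumfile=$TMP/runtests.md5
                : > "$sumfile"
                while read -r f; do
                        mkdir -p "$DST$(dirname "$f")"
                        sum=$(tee "$DST$f" < "$f" | md5sum)
                        printf '%s  %s\n' "${sum%% *}" "$DST$f" >> "$sumfile"
                done < "$FILES"

                # ... rest of the test runs here ...

                # verify the copies on Lustre against the checksums taken at copy
                # time; any mismatch now points at the filesystem, not at /etc
                md5sum -c --quiet "$sumfile" || echo "FAIL: copied files changed"

            Unlike cp -p, tee does not preserve ownership or timestamps, so the real change would still need to handle attributes and symlinks the way the current copy does.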

            "        # let's do a little more, with files that haven't changed in the
                    # last day (hopefully they don't change during test)
                    local findfiles=$TMP/runtests.files"

            So it was the same file both times...  I can't seem to reproduce this locally.  I wonder if the files are changing during the test?  (It feels like there should be a better way to do this than copy a bunch of files that haven't changed lately and hope they don't change soon...)

            By the way, I'd like to look at more, but that search only finds 1 failure...  I can't see anything wrong with the search, so I'm not sure why.

            paf Patrick Farrell (Inactive) added a comment - "        # let's do a little more, with files that haven't changed in the         # last day (hopefully they don't change during test)         local findfiles=$TMP/runtests.files" So it was the same file both times...  I can't seem to reproduce this locally.  I wonder if the files are changing during the test?  (It feels like there should be a better way to do this than copy a bunch of files that haven't changed lately and hope they don't change soon...) By the way, I'd like to look at more, but that search only finds 1 failure...  I can't see anything wrong with the search, so I'm not sure why.
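            For reference, the selection the snippet above describes presumably boils down to something like the following (a sketch, not the literal script), which is exactly the window a daemon rewriting /etc mid-test can slip through:

                # take only files that have not been modified in the last day
                # (-mtime +1) and hope they stay unchanged for the whole run
                findfiles=$TMP/runtests.files
                find /etc -type f -mtime +1 > "$findfiles"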

            paf Patrick Farrell (Inactive) added a comment -

            Yeah, I'm also trying to reproduce locally - we'll see if I can. If I can, I'll just try to bisect.

            Otherwise the set is small enough we could conceivably bisect even with the slow Maloo turnaround times.
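            If it does reproduce locally, bisecting the suspect range from the landing list in the comment below is mechanical enough; a sketch, with <last-good> standing in for the commit just before the 2024-04-23 landing:

                git bisect start
                git bisect bad 69eb7b89c7     # tip of the suspect batch
                git bisect good <last-good>   # placeholder: commit before the landing
                # rebuild, re-run runtests test_1, then mark each step with
                # "git bisect good" or "git bisect bad" until it converges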
            adilger Andreas Dilger added a comment - - edited

            paf, I just wanted to bring this issue to your attention, since this relatively simple test is now showing signs of data corruption when doing a simple copy/comparison of 500 files.

            It just started failing since the most recent patch landing on the 23rd, so very likely related to one of the patches in that batch, which include several CLIO patches from you (I've taken out a few patches that only changed unrelated test scripts, and even many of the remainder could be eliminated with 99% certainty):

            # git log --oneline --after 2024-04-23
            69eb7b89c7 LU-17630 osc: add cond_resched() to osc_lru_shrink()
            985f49cf2e LU-17422 clio: use page pools for UDIO/hybrid
            e804112e34 LU-17422 osc: Clear PageChecked on bounce pages
            a5db30ad97 LU-17422 obdclass: rename sptlrpc pool and move init
            77dde8d4c5 LU-17422 obdclass: rename ptlrpc_page_pool
            f61a8c9f5f LU-17422 obdclass: move page pools to obdclass
            81611d7a7c LU-17550 lov: check for empty layout on DIO
            43afdc0ca2 LU-17533 llite: call merge attr on all writes
            49730821c4 LU-16736 quota: set revoke time to avoid endless wait
            e8d3180540 LU-17269 obdclass: fix locking for class_register/deregister
            bcb954c140 LU-13374 mdd: fix close time update race with set-in-past
            7e01787863 LU-12452 lnet: allow to set IP ToS value per-NI
            2b210f3905 LU-17440 lnet: prevent errorneous decref for asym route
            28c366cee6 LU-17218 ofd: improve filter_fid upgrade compatibility
            0661102dea LU-13802 ptlrpc: correctly remove inflight request
            c633a42f11 LU-13802 llite: add fail loc to force bio-dio switch
            4529e5939b LU-13814 osc: add osc transient page ops
            1430119769 LU-13814 clio: Remove owner for transient pages
            807b5650dc LU-13814 clio: add io to cio_submit
            

            It's a bit too early to have good statistics on the failure rate (so far 2/20 failures, but most patches will not have been rebased onto this tree yet). The related test runs will appear in the following Maloo search:
            https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&test_set_script_id=f946ba8e-32bc-11e0-aaee-52540025f9ae&sub_test_script_id=9ae144d2-6181-11e2-be04-52540035b04c&start_date=2024-04-24&end_date=2024-04-30&source=sub_tests#redirect


            People

              Assignee: arshad512 Arshad Hussain
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 11