[LU-11058] sanity test_77k: checksum(1) != 0 Created: 25/May/18 Updated: 02/Aug/19 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | always_except | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/4155343e-5de1-11e8-abc3-52540065bddc

test_77k failed with the following error:

checksum(1) != 0

This subtest was added by the recently landed change https://review.whamcloud.com/32095 |
| Comments |
| Comment by Andreas Dilger [ 27/May/18 ] |
|
Emoly, it looks like your previous patch broke sanity.sh test_77k. |
| Comment by Andreas Dilger [ 27/May/18 ] |
|
This test failed about 30 times so far. |
| Comment by Emoly Liu [ 28/May/18 ] |
|
I will look into it. |
| Comment by Emoly Liu [ 28/May/18 ] |
|
I suspect the failure may be related to the processing speed of the llog thread. Although it is called before client_common_fill_super(), is it possible that the checksum is set by set_param later than by obd_set_info_async? |
| Comment by Bob Glossman (Inactive) [ 01/Jun/18 ] |
|
more on master: test 77k is failing a lot & it's only been there a few days |
| Comment by Alexander Boyko [ 08/Jun/18 ] |
|
Error: 'checksum(1) != 0' |
| Comment by Andreas Dilger [ 08/Jun/18 ] |
|
Please submit a patch to move this test into ALWAYS_EXCEPT for now. |
| Comment by Gerrit Updater [ 08/Jun/18 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/32685 |
| Comment by Gerrit Updater [ 09/Jun/18 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch https://review.whamcloud.com/32685/ |
| Comment by Andreas Dilger [ 09/Jun/18 ] |
|
This bug is not fixed with this landing, do not close it. |
| Comment by John Hammond [ 12/Jun/18 ] |
|
I looked at this a bit. When it works, the params log is processed normally. When it doesn't work, mgc_process_log() skips the params log since cld->cld_stopping is set. So maybe this is due to finding an old CLD of the params log. |
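The hypothesis above can be pictured with a toy model. This is a minimal sketch, not Lustre code: struct toy_cld, toy_find() and toy_process_log() are invented names, and only the stopping flag is meant to mirror cld->cld_stopping. It assumes a lookup by log name can return a stale descriptor that is already being torn down, in which case processing is skipped and the setting is never applied.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, heavily simplified stand-in for a config log descriptor.
 * Only "stopping" mirrors cld->cld_stopping; all other names are invented.
 */
struct toy_cld {
	const char *logname;
	bool stopping;	/* descriptor is being torn down */
	int applied;	/* how many times the log was processed */
};

/* Process the log unless the descriptor is stopping. Returns 1 if applied. */
static int toy_process_log(struct toy_cld *cld)
{
	if (cld->stopping)
		return 0;	/* skipped: the settings are never applied */
	cld->applied++;
	return 1;
}

/* Return the first descriptor with a matching name, even a stale one. */
static struct toy_cld *toy_find(struct toy_cld *tbl, size_t n,
				const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(tbl[i].logname, name) == 0)
			return &tbl[i];
	return NULL;
}
```

In this model, if an old "params" entry with stopping set sits ahead of the fresh one, toy_find() returns the old entry and toy_process_log() does nothing, which would match a test that still sees checksum(1) after the parameter was set to 0.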
| Comment by Emoly Liu [ 31/Aug/18 ] |
|
I can easily reproduce this issue on ldiskfs locally. The log shows it's related to the following code in process_param2_config():

	if (kobj) {
		char *value = param;
		char *envp[3];
		int i;

		/* split "name=value" in place: param -> name, value -> value */
		param = strsep(&value, "=");
		envp[0] = kasprintf(GFP_KERNEL, "PARAM=%s", param);
		envp[1] = kasprintf(GFP_KERNEL, "SETTING=%s", value);
		envp[2] = NULL;

		/* send a KOBJ_CHANGE uevent carrying PARAM and SETTING */
		rc = kobject_uevent_env(kobj, KOBJ_CHANGE, envp);
		/* envp[2] is NULL; kfree(NULL) is a no-op */
		for (i = 0; i < ARRAY_SIZE(envp); i++)
			kfree(envp[i]);

		kobject_put(kobj);

		RETURN(rc);
	}
|
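For readers who want to see what the quoted snippet puts into the uevent environment, here is a userspace sketch. It is an illustration only: build_uevent_env() and make_env() are hypothetical helpers, malloc()/snprintf() stand in for kasprintf(), strchr() stands in for strsep() with a single delimiter, and the input string is just an example value.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocate "key=val"; the caller frees it. Returns NULL on failure. */
static char *make_env(const char *key, const char *val)
{
	size_t len = strlen(key) + 1 + strlen(val) + 1;
	char *s = malloc(len);

	if (s != NULL)
		snprintf(s, len, "%s=%s", key, val);
	return s;
}

/*
 * Hypothetical helper, not a Lustre function: split "name=value" and
 * build the PARAM=/SETTING= strings the kernel snippet passes as envp.
 * Returns 0 on success, -1 on a missing '=' or allocation failure.
 */
static int build_uevent_env(const char *setting, char *envp[3])
{
	size_t len = strlen(setting) + 1;
	char *copy = malloc(len);
	char *value;

	envp[0] = envp[1] = envp[2] = NULL;
	if (copy == NULL)
		return -1;
	memcpy(copy, setting, len);

	/* same effect as strsep(&value, "=") for a single delimiter */
	value = strchr(copy, '=');
	if (value == NULL) {
		free(copy);
		return -1;
	}
	*value++ = '\0';

	envp[0] = make_env("PARAM", copy);
	envp[1] = make_env("SETTING", value);
	free(copy);
	if (envp[0] == NULL || envp[1] == NULL) {
		free(envp[0]);
		free(envp[1]);
		envp[0] = envp[1] = NULL;
		return -1;
	}
	return 0;
}
```

Unlike the kernel snippet, this sketch checks both allocations before using the strings. For an input such as "osc.*.checksums=0" it yields "PARAM=osc.*.checksums" and "SETTING=0", followed by the NULL terminator.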
| Comment by James A Simmons [ 31/Aug/18 ] |
|
What shows up in the log exactly? It is just sending a uevent. Is the memory allocation failing? This code also only gets executed with lctl set_param -P. Are you running with that command? Alrighty let me take a look. |
| Comment by Gerrit Updater [ 31/Aug/18 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33099 |
| Comment by James A Simmons [ 31/Aug/18 ] |
|
I can't seem to reproduce the issue in my test bed. I pushed a patch so I can collect debug logs. Assuming it still fails, if you run: udevadm monitor --subsystem-match=lustre --property you should see the uevents being sent and with what values. |
| Comment by James A Simmons [ 31/Aug/18 ] |
|
I bet you are testing from the lustre source tree. In that case your lctl is missing in /usr/sbin and the uevent will be missed. If you make install into your image or copy 99-lustre.rules to /etc/udev/rules.d/ then it will go further. |
| Comment by Emoly Liu [ 03/Sep/18 ] |
|
Yes, I'm using lustre source tree for test. I will try per your comment and see. Thanks. |
| Comment by John Hammond [ 05/Sep/18 ] |
|
Just to repeat my comment, I think this is due to an llog issue. (Which may be exposed in some way by recent changes.) I looked at this a bit. When it works, the params log is processed normally. When it doesn't work, mgc_process_log() skips the params log since cld->cld_stopping is set. So I think this is due to finding an old CLD of the params log. |
| Comment by James A Simmons [ 04/Oct/18 ] |
|
The patch https://review.whamcloud.com/#/c/33190 seems to have resolved this bug |
| Comment by Peter Jones [ 06/Oct/18 ] |
|
ok, let's mark it a duplicate of |
| Comment by Andreas Dilger [ 08/Oct/18 ] |
|
Reopen until test is removed from ALWAYS_EXCEPT. Please remove always_except label when that patch is landed. |
| Comment by Gerrit Updater [ 10/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33099/ |
| Comment by Sebastien Buisson [ 10/Oct/18 ] |
|
Multiple failures of test_77k very recently, for instance: |
| Comment by James Nunez (Inactive) [ 10/Oct/18 ] |
|
Since removing this test from the ALWAYS_EXCEPT list and since landing the patch for So far, the three failures are in review-ldiskfs. |
| Comment by James A Simmons [ 10/Oct/18 ] |
|
I think the test is wrong in how it's using lctl set_param '-P'. Now that the Working on patch |
| Comment by James A Simmons [ 10/Oct/18 ] |
|
Okay, I moved the test to the new set_persistent_param_and_check(), which allows us to test both lctl set_param -P and lctl conf_param. I'm seeing failures with the lctl conf_param case. It will take a few hours to sort this out. |
| Comment by James A Simmons [ 11/Oct/18 ] |
|
So from my understanding the mount option [no]checksum overrides all other checksum tuning. For example, mount --nochecksum should prevent lctl set_param -P / lctl conf_param from taking effect. Should this be true of the direct sysfs tunings as well? Is this thinking correct? |
| Comment by Gerrit Updater [ 11/Oct/18 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33353 |
| Comment by James Nunez (Inactive) [ 11/Oct/18 ] |
|
sanity test 77k is failing about 5% of the time for review-ldiskfs, review-dne and review-dne-zfs. I prepared a patch to stop running sanity test 77k in case this is too disruptive to patch landings. |
| Comment by James A Simmons [ 11/Oct/18 ] |
|
I think I have a proper fix. I'm currently testing it. |
| Comment by Gerrit Updater [ 11/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33354 |
| Comment by Gerrit Updater [ 11/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33354/ |
| Comment by Peter Jones [ 11/Oct/18 ] |
|
James, we're reverting the test being on for now, so please turn it back on along with your fix when it is ready. Thanks, Peter |
| Comment by Gerrit Updater [ 13/Oct/18 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33363 |
| Comment by Gerrit Updater [ 15/Jan/19 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34035 |
| Comment by James A Simmons [ 15/Jan/19 ] |
|
Note the test fix is only a workaround. The proper fix is still to be worked out. |
| Comment by Gerrit Updater [ 11/Feb/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34035/ |