[LU-11058] sanity test_77k: checksum(1) != 0 Created: 25/May/18  Updated: 02/Aug/19

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: always_except

Issue Links:
Related
is related to LU-5020 OST can be all mounted successfully i... Resolved
is related to LU-10595 Use after free in mgc_process_cfg_log Resolved
is related to LU-10906 checksums parameter not persistent af... Resolved
is related to LU-11263 sanity-hsm test 300 fails with 'hsm_... Resolved
is related to LU-9398 Mounting two file systems from one MG... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/4155343e-5de1-11e8-abc3-52540065bddc

test_77k failed with the following error:

checksum(1) != 0

This subtest was added by the recently landed change https://review.whamcloud.com/32095 LU-10906 checksum: enable/disable checksum correctly.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_77k - checksum(1) != 0



 Comments   
Comment by Andreas Dilger [ 27/May/18 ]

Emoly, it looks like your previous patch broke sanity.sh test_77k.

Comment by Andreas Dilger [ 27/May/18 ]

This test failed about 30 times so far.

Comment by Emoly Liu [ 28/May/18 ]

I will look into it.

Comment by Emoly Liu [ 28/May/18 ]

I suspect the failure may be related to the processing speed of the llog thread. Although it is called before client_common_fill_super(), is it possible for set_param to set the checksum later than obd_set_info_async does?

Comment by Bob Glossman (Inactive) [ 01/Jun/18 ]

more on master:
https://testing.hpdd.intel.com/test_sets/b1aa5138-65b6-11e8-907b-52540065bddc
https://testing.hpdd.intel.com/test_sets/e3174aec-65ba-11e8-907b-52540065bddc
https://testing.hpdd.intel.com/test_sets/a7dc29fe-65c7-11e8-aa24-52540065bddc
https://testing.hpdd.intel.com/test_sets/36eefcd2-65cf-11e8-a55d-52540065bddc

test 77k is failing a lot & it's only been there a few days

Comment by Alexander Boyko [ 08/Jun/18 ]

Error: 'checksum(1) != 0'
Failure Rate: 11.00% of most recent 100 runs, 0 skipped (all branches)

Comment by Andreas Dilger [ 08/Jun/18 ]

Please submit a patch to move this test into ALWAYS_EXCEPT for now.

Comment by Gerrit Updater [ 08/Jun/18 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/32685
Subject: LU-11058 tests: stop running sanity test 77k
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b9a2bd475acdeecc3cb372c2e5b5c530979f8f2b

Comment by Gerrit Updater [ 09/Jun/18 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch https://review.whamcloud.com/32685/
Subject: LU-11058 tests: stop running sanity test 77k
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3ddee05e6106d3336081cc2d6d7b681116263829

Comment by Andreas Dilger [ 09/Jun/18 ]

This bug is not fixed with this landing, do not close it.

Comment by John Hammond [ 12/Jun/18 ]

I looked at this a bit. When it works the params log is processed normally. When it doesn't work mgc_process_log() skips the params log since cld->cld_stopping is set. So maybe this is due to finding an old CLD of the params log.

Comment by Emoly Liu [ 31/Aug/18 ]

I can easily produce this issue on ldiskfs locally. The log shows it's related to the following code in process_param2_config():

        if (kobj) {
                char *value = param;
                char *envp[3];
                int i;

                param = strsep(&value, "=");
                envp[0] = kasprintf(GFP_KERNEL, "PARAM=%s", param);
                envp[1] = kasprintf(GFP_KERNEL, "SETTING=%s", value);
                envp[2] = NULL;

                rc = kobject_uevent_env(kobj, KOBJ_CHANGE, envp);
                for (i = 0; i < ARRAY_SIZE(envp); i++)
                        kfree(envp[i]);

                kobject_put(kobj);

                RETURN(rc);
        }

Comment by James A Simmons [ 31/Aug/18 ]

What shows up in the log exactly? It is just sending a uevent. Is the memory allocation failing? This code also only gets executed with lctl set_param -P. Are you running with that command? Alrighty let me take a look.

Comment by Gerrit Updater [ 31/Aug/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33099
Subject: LU-11058 test: reenable sanity 77k test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 38041bef21baaec1486c3e31f14f7aaeee1ba6bf

Comment by James A Simmons [ 31/Aug/18 ]

I can't seem to reproduce the issue in my test bed. I pushed a patch so I can collect debug logs, assuming it still fails.

If you run:

udevadm monitor --subsystem-match=lustre --property

You should see the uevents being sent and their values.

Comment by James A Simmons [ 31/Aug/18 ]

I bet you are testing from the Lustre source tree. In that case your lctl is missing from /usr/sbin and the uevent will be missed. If you run make install into your image or copy 99-lustre.rules to /etc/udev/rules.d/ then it will go further.

Comment by Emoly Liu [ 03/Sep/18 ]

Yes, I'm using lustre source tree for test. I will try per your comment and see. Thanks.

Comment by John Hammond [ 05/Sep/18 ]

Just to repeat my comment, I think this is due to an llog issue. (Which may be exposed in some way by recent changes.)

I looked at this a bit. When it works the params log is processed normally. When it doesn't work mgc_process_log() skips the params log since cld->cld_stopping is set. So I think this is due to finding an old CLD of the params log.

Comment by James A Simmons [ 04/Oct/18 ]

The patch https://review.whamcloud.com/#/c/33190 seems to have resolved this bug

Comment by Peter Jones [ 06/Oct/18 ]

ok, let's mark it a duplicate of LU-10595 then

Comment by Andreas Dilger [ 08/Oct/18 ]

Reopen until test is removed from ALWAYS_EXCEPT.

Please remove always_except label when that patch is landed.

Comment by Gerrit Updater [ 10/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33099/
Subject: LU-11058 test: reenable sanity 77k test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fb1875c314cf3650aa7ba6b2cdb72ef18e812d5f

Comment by Sebastien Buisson [ 10/Oct/18 ]

Multiple failures of test_77k very recently, for instance:
https://testing.whamcloud.com/test_sets/376fee56-cc76-11e8-ad90-52540065bddc
https://testing.whamcloud.com/test_sets/e992b980-cc92-11e8-ad90-52540065bddc

Comment by James Nunez (Inactive) [ 10/Oct/18 ]

Since removing this test from the ALWAYS_EXCEPT list and since landing the patch for LU-10595 https://review.whamcloud.com/#/c/33190, sanity test 77k has failed with this error three times; the two failures Sebastien listed above and https://testing.whamcloud.com/test_sets/c1ba7a5a-cc7f-11e8-ad90-52540065bddc .

So far, the three failures are in review-ldiskfs.

Comment by James A Simmons [ 10/Oct/18 ]

I think the test is wrong in how it's using lctl set_param '-P'. Now that the LU-7004 test suite patch has landed I think we can move this test to the new function and it will work.

Working on patch

Comment by James A Simmons [ 10/Oct/18 ]

Okay, I moved the test to the new set_persistent_param_and_check(), which allows us to test both lctl set_param -P and lctl conf_param. I'm seeing failures with the lctl conf_param case. It will take a few hours to sort this out.

Comment by James A Simmons [ 11/Oct/18 ]

So from my understanding the mount option [no]checksum overrides all other checksum tuning. For example mount --nochecksum should prevent lctl set_param -P / lctl conf_param from taking effect. This should be true of the direct sysfs tunings as well? Is this thinking correct?

Comment by Gerrit Updater [ 11/Oct/18 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33353
Subject: LU-11058 tests: stop running sanity test 77k
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f78e8ff46acb649ecb76607fe13a227a70c28c62

Comment by James Nunez (Inactive) [ 11/Oct/18 ]

sanity test 77k is failing about 5% of the time for review-ldiskfs, review-dne and review-dne-zfs.

I prepared a patch to stop running sanity test 77k in case this is too disruptive to patch landings.

Comment by James A Simmons [ 11/Oct/18 ]

I think I have a proper fix. I'm currently testing it.

Comment by Gerrit Updater [ 11/Oct/18 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33354
Subject: Revert "LU-11058 test: reenable sanity 77k test"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 179b1a83e6ce4eb2310fef67b23dd16b7c32a9b1

Comment by Gerrit Updater [ 11/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33354/
Subject: Revert "LU-11058 test: reenable sanity 77k test"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4839674385f2f2cf25206ea24dd3999096c29d59

Comment by Peter Jones [ 11/Oct/18 ]

James 

we're reverting the test being on for now so please turn it back on along with your fix when it is ready

Thanks

Peter

Comment by Gerrit Updater [ 13/Oct/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33363
Subject: LU-11058 obd: manage checksum state
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0bb53a3db196d82df918f75f653859ed78a903f8

Comment by Gerrit Updater [ 15/Jan/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34035
Subject: LU-11058 tests: cleanup persistent checksum= in 77k
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5a85e3040d7da834b17e311b834b3aed202d6cd4

Comment by James A Simmons [ 15/Jan/19 ]

Note the test fix is only a workaround. The proper fix is still to be worked out.

Comment by Gerrit Updater [ 11/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34035/
Subject: LU-11058 tests: cleanup persistent checksum= in 77k
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 65a7a268251634acd81dbf14dc78f1b4f7cf5157
