[LU-4341] Failure on test suite sanity test_170: expected 31 bad lines, but got 34
Created: 03/Dec/13  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Jian Yu
Resolution: Cannot Reproduce
Votes: 0
Labels: always_except
Environment:

server and client: lustre-master build # 1784
client is running SLES11 SP3


Issue Links:
Related
Severity: 3
Rank (Obsolete): 11880

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/7756e5f2-5bb9-11e3-8d79-52540035b04c.

The sub-test test_170 failed with the following error:

expected 31 bad lines, but got 34

== sanity test 170: test lctl df to handle corrupted log ============================================= 00:50:22 (1385974222)
 sanity test_170: @@@@@@ FAIL: expected 31 bad lines, but got 34 


 Comments   
Comment by Jian Yu [ 06/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
Distro/Arch: RHEL6.4/x86_64(server), SLES11SP3/x86_64(client)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/0c73c65e-763c-11e3-b3c0-52540035b04c

Comment by Jian Yu [ 17/Jan/14 ]

More instances on Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/ba50cbe6-7ecf-11e3-925a-52540035b04c
https://maloo.whamcloud.com/test_sets/c2ca6dea-908b-11e3-a134-52540035b04c

Comment by Jian Yu [ 20/Feb/14 ]

This failure kept occurring on Lustre b2_5 branch in SLES11SP3/x86_64 client test session:
https://maloo.whamcloud.com/test_sets/ee128066-990d-11e3-968c-52540035b04c

Comment by Jian Yu [ 09/Mar/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/40/ (2.5.1 RC2)
Distro/Arch: RHEL6.5/x86_64(server), SLES11SP3/x86_64(client)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/97d92b6a-a663-11e3-aac5-52540035b04c

Comment by Bob Glossman (Inactive) [ 25/Apr/14 ]

another in b2_5
https://maloo.whamcloud.com/test_sessions/bb218674-cc67-11e3-bda1-52540035b04c

Comment by Peter Jones [ 25/Apr/14 ]

Yu, Jian

This seems to be occurring sporadically and when it does it causes review failures. Could you please look into what kind of circumstances trigger these failures?

Thanks

Peter

Comment by Jian Yu [ 27/Apr/14 ]

The failure occurred on the following test sessions on Lustre b2_5 and master branches:

SLES11SP2 client + RHEL6.5 server
SLES11SP3 client + RHEL6.5 server
SLES11SP3 client + SLES11SP3 server (only on master branch)

I'll look into the failure.

Comment by Bob Glossman (Inactive) [ 28/Apr/14 ]

I think this is another instance, but it reports 'expected 24 bad lines, but got 27' instead of 'expected 31 bad lines, but got 34'.

sles11sp3 client in b2_5:
https://maloo.whamcloud.com/test_sets/63de2e74-cf07-11e3-a250-52540035b04c

Comment by Bob Glossman (Inactive) [ 28/Apr/14 ]

Starting to wonder if this is a high-rate failure, maybe even 100%, on any SLES client.

Comment by Bob Glossman (Inactive) [ 30/Apr/14 ]

another sles11sp3 client in master:
https://maloo.whamcloud.com/test_sets/a9add412-d0ac-11e3-b9d4-52540035b04c

Comment by Bob Glossman (Inactive) [ 01/May/14 ]

another sles11sp3 client in master:
https://maloo.whamcloud.com/test_sets/fd386500-d167-11e3-91ff-52540035b04c

Comment by Jian Yu [ 12/May/14 ]

There is a defect in sanity test_170(), and here is a patch for master branch to fix it: http://review.whamcloud.com/10296
Since the failure cannot be reproduced by only running sanity test 170, I'm checking the previous sub-tests to see which one is the culprit.

Comment by Jian Yu [ 13/May/14 ]

Finally, I found that it was sanity test 150 that caused test 170 to fail on the SLES11SP3 client:

run_test 150 "truncate/append tests"

I've tried several ways to fix the issue but failed. Still digging.

Comment by Jian Yu [ 15/May/14 ]

Just narrowed it down: the following operation in sanity test 150 caused test 170 to fail:

remount_client $MOUNT -> zconf_mount `hostname` $1 -> set_default_debug_nodes $client

After commenting out "set_default_debug_nodes $client", the failure disappeared.

Comment by Jian Yu [ 16/May/14 ]

It turns out that sanity test 170 is affected by the lctl debug value. With debug=-1, it passed, and with debug="rpctrace", the test failed.

In sanity.sh, debug=-1 is set before running sub-tests, which is why running only test 170 passed. In test 150, "set_default_debug_nodes $client" changed the debug value to debug="vfstrace rpctrace dlmtrace neterror ha config ioctl super", which caused test 170 to fail.
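
One way to make a test independent of whatever debug mask an earlier sub-test left behind is to pin the mask for the duration of the test and restore it afterwards. The following is only a sketch under that assumption, not the fix that was applied: with_debug_mask is a hypothetical helper name, and LCTL is an override hook added here for illustration. It relies only on "lctl get_param -n debug" and "lctl set_param debug=...".

```shell
# Sketch: run a command with the Lustre debug mask pinned, then restore it.
# LCTL is an illustrative override hook; by default it is the real lctl.
LCTL=${LCTL:-lctl}

# with_debug_mask is a hypothetical helper, not part of test-framework.sh.
# Usage: with_debug_mask <mask> <command> [args...]
with_debug_mask() {
    local mask="$1"; shift
    local saved rc

    # Remember the mask that is active before the call.
    saved=$($LCTL get_param -n debug) || return 1
    $LCTL set_param debug="$mask" >/dev/null || return 1

    "$@"
    rc=$?

    # Put the previous mask back, whatever the command returned.
    $LCTL set_param debug="$saved" >/dev/null
    return $rc
}
```

With a helper like this, the body of a test could be wrapped as "with_debug_mask -1 some_test_body" (some_test_body being a placeholder), so it sees debug=-1 regardless of what test 150 set.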

Comment by Jian Yu [ 19/May/14 ]

Hi Di,

I saw that sanity test_170() was added by you in commit d9bf86ae95a599bf10bbb05818317b48eb71db1b. Could you please give me some hints about why debug="rpctrace" affects the test results of sanity test 170 on SLES client? Thanks a lot.

Comment by Di Wang [ 27/May/14 ]

Hi, Yujian

test_170 is supposed to verify that "lctl df can identify and skip corrupted debug records" instead of abandoning the whole debug log file. I guess there is some debug format problem with "rpctrace", though I'm not sure what the real cause is. I think you need to get "$TMP/${tfile}_log_good" and have a look, or follow the test steps to repeat the test locally. Thanks.

Comment by Jian Yu [ 14/Jul/14 ]

Thanks, Di. I can reproduce the failure every time with debug="rpctrace". I'll look into "$TMP/${tfile}_log_good".

Comment by Jian Yu [ 31/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)

The same failure occurred:
https://testing.hpdd.intel.com/test_sets/b2a57a6e-30a3-11e4-9f57-5254006e85c2

Comment by Bob Glossman (Inactive) [ 26/Dec/14 ]

seen in master with sles11sp3 client/server:
https://testing.hpdd.intel.com/test_sets/14d6386e-8c7e-11e4-b81b-5254006e85c2

Comment by Sarah Liu [ 17/Feb/15 ]

hit this error in tag-2.6.94 test:

https://testing.hpdd.intel.com/test_sets/c53f0196-b22a-11e4-af8e-5254006e85c2

Comment by Bob Glossman (Inactive) [ 26/Mar/15 ]

another seen in master:
https://testing.hpdd.intel.com/test_sets/83277a58-d3cd-11e4-8c98-5254006e85c2

Comment by Sarah Liu [ 01/Apr/15 ]

another instance
https://testing.hpdd.intel.com/test_sets/834ace12-d75c-11e4-a678-5254006e85c2

Comment by Sarah Liu [ 20/May/15 ]

another instance:
https://testing.hpdd.intel.com/test_sets/02e36236-fe29-11e4-be9d-5254006e85c2

Comment by Andreas Dilger [ 31/Aug/15 ]

Can someone please definitively understand why this test is failing for SLES, and either fix it or add it to the ALWAYS_EXCEPT list for SLES. It doesn't make sense to exclude this via envdefinitions for SLES patches, when it will still fail when someone forgets to except it.

Comment by Gerrit Updater [ 31/Aug/15 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/16146
Subject: LU-4341 test: skip failing sanity test 170
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7f5668027d0b1393640c8185e8084c2957c8bdbe

Comment by Bob Glossman (Inactive) [ 31/Aug/15 ]

best I can do for now is push a mod to ALWAYS_EXCEPT on sles11

Comment by Gerrit Updater [ 16/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16146/
Subject: LU-4341 test: skip failing sanity test 170
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ef63c034b437d47cd10fe7ee94ed614ac1359f44

Comment by Andreas Dilger [ 16/Oct/15 ]

This appears to be failing on master on SLES11.3 tests in the past week:
https://testing.hpdd.intel.com/sub_tests/05b5baa6-73f7-11e5-ada9-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/07fb912c-73df-11e5-ab44-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/9c5611ca-73ea-11e5-ab44-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/bae5eb2c-722f-11e5-b344-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/49587898-7073-11e5-b705-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/8bfb6e00-7071-11e5-b705-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/31313890-6f24-11e5-83a9-5254006e85c2

The patch http://review.whamcloud.com/16146 has landed, but that doesn't solve the problem itself. An improved patch would skip only the "expected N bad lines, but got M" check in test_170 on SLES11.3+ (and on RHEL7 as well) rather than skipping the whole of test_170. While in there, it should also remove the "-rf" from rm -rf $DIR/$tfile, since that is a file and not a directory and the removal shouldn't fail in any case.
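
The narrower skip could be sketched roughly as below. This is an illustration only: enforce_bad_line_check is a hypothetical helper, and passing the distro name and version as plain arguments is an assumption made here, not how the test framework identifies the client. "RHEL7" is read as major version 7 only.

```shell
# Hypothetical helper: return 0 if the bad-line count should be enforced,
# 1 if only that check should be skipped on this distro.
# $1 = distro name ("sles", "rhel", ...), $2 = version ("11.3", "7.1", ...)
enforce_bad_line_check() {
    local distro="$1"
    local major="${2%%.*}"
    local minor=0

    # Pick up the minor version when one is given (e.g. "11.3" -> 3).
    case "$2" in
    *.*) minor="${2#*.}"; minor="${minor%%.*}" ;;
    esac

    case "$distro" in
    sles)
        # Skip the count check on SLES 11.3 and later.
        if [ "$major" -gt 11 ] ||
           { [ "$major" -eq 11 ] && [ "$minor" -ge 3 ]; }; then
            return 1
        fi
        ;;
    rhel)
        # Skip the count check on RHEL 7.
        [ "$major" -eq 7 ] && return 1
        ;;
    esac
    return 0
}
```

The rest of the test would still run everywhere; only the final comparison would be guarded, for example "if enforce_bad_line_check sles 11.3; then ...compare counts...; fi".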

One possible source of the bug is that the first cat $TMP/${tfile}_log_good >> $TMP/${tfile}_logs_corrupt appends to the file (>>) instead of first truncating it (>), so if the ${tfile}_logs_corrupt file is lingering around from a previous test run for some reason, it might cause problems. The number of bad lines does always seem to be higher than the expected number, so this seems like a candidate.
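
The effect of a lingering file can be seen in isolation with a small experiment. The file names below are stand-ins created in a scratch directory, not the real $TMP/${tfile}_* files, and the contents are placeholder text rather than debug log records.

```shell
# Stand-ins for the good log and the combined "corrupt" log.
workdir=$(mktemp -d)
good="$workdir/log_good"
corrupt="$workdir/logs_corrupt"

printf 'line1\nline2\nline3\n' > "$good"
printf 'stale1\nstale2\n' > "$corrupt"    # leftover from an earlier run

# Append: the stale lines survive and inflate the count.
cat "$good" >> "$corrupt"
appended=$(( $(wc -l < "$corrupt") ))

# Truncate first: only the fresh lines remain.
cat "$good" > "$corrupt"
truncated=$(( $(wc -l < "$corrupt") ))

echo "append: $appended lines, truncate: $truncated lines"
rm -r "$workdir"
```

With a stale file present, the appended copy holds 5 lines while the truncated one holds 3, which is the kind of "more lines than expected" result described above.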

In any case, this bug cannot be closed until the actual test failure is understood and fixed.

Generated at Sat Feb 10 01:41:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.