[LU-4341] Failure on test suite sanity test_170: expected 31 bad lines, but got 34 Created: 03/Dec/13 Updated: 14/Dec/21 Resolved: 14/Dec/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Jian Yu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | always_except | ||
| Environment: |
server and client: lustre-master build # 1784 |
||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 11880 | ||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. It relates to the following test suite run: http://maloo.whamcloud.com/test_sets/7756e5f2-5bb9-11e3-8d79-52540035b04c. The sub-test test_170 failed with the following error:
== sanity test 170: test lctl df to handle corrupted log ============================================= 00:50:22 (1385974222) sanity test_170: @@@@@@ FAIL: expected 31 bad lines, but got 34 |
| Comments |
| Comment by Jian Yu [ 06/Jan/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/ The same failure occurred: |
| Comment by Jian Yu [ 17/Jan/14 ] |
|
More instances on Lustre b2_5 branch: |
| Comment by Jian Yu [ 20/Feb/14 ] |
|
This failure kept occurring on Lustre b2_5 branch in SLES11SP3/x86_64 client test session: |
| Comment by Jian Yu [ 09/Mar/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/40/ (2.5.1 RC2) The same failure occurred: |
| Comment by Bob Glossman (Inactive) [ 25/Apr/14 ] |
|
another in b2_5 |
| Comment by Peter Jones [ 25/Apr/14 ] |
|
Yu, Jian, this seems to be occurring sporadically, and when it does it causes review failures. Could you please look into what kind of circumstances trigger these failures? Thanks, Peter |
| Comment by Jian Yu [ 27/Apr/14 ] |
|
The failure occurred in the following test sessions on the Lustre b2_5 and master branches: SLES11SP2 client + RHEL6.5 server; SLES11SP3 client + RHEL6.5 server; SLES11SP3 client + SLES11SP3 server (only on master branch). I'll look into the failure. |
| Comment by Bob Glossman (Inactive) [ 28/Apr/14 ] |
|
I think this is another instance, but it reports 'expected 24 bad lines, but got 27' instead of 'expected 31 bad lines, but got 34'. SLES11SP3 client in b2_5: |
| Comment by Bob Glossman (Inactive) [ 28/Apr/14 ] |
|
starting to wonder if this is a high rate failure, maybe even 100%, in any sles client. |
| Comment by Bob Glossman (Inactive) [ 30/Apr/14 ] |
|
another sles11sp3 client in master: |
| Comment by Bob Glossman (Inactive) [ 01/May/14 ] |
|
another sles11sp3 client in master: |
| Comment by Jian Yu [ 12/May/14 ] |
|
There is a defect in sanity test_170(), and here is a patch for master branch to fix it: http://review.whamcloud.com/10296 |
| Comment by Jian Yu [ 13/May/14 ] |
|
Finally, I found that it was sanity test 150 that caused test 170 to fail on the SLES11SP3 client: run_test 150 "truncate/append tests". I've tried several ways to fix the issue, but none worked. Still digging. |
| Comment by Jian Yu [ 15/May/14 ] |
|
I have narrowed it down to the following operation in sanity test 150, which caused test 170 to fail: remount_client $MOUNT -> zconf_mount `hostname` $1 -> set_default_debug_nodes $client. After commenting out "set_default_debug_nodes $client", the failure disappeared. |
| Comment by Jian Yu [ 16/May/14 ] |
|
It turns out that sanity test 170 is affected by the lctl debug value. With debug=-1 the test passed, and with debug="rpctrace" it failed. In sanity.sh, debug=-1 is set before the sub-tests run, which is why running test 170 on its own passed. In test 150, "set_default_debug_nodes $client" changed the debug value to debug="vfstrace rpctrace dlmtrace neterror ha config ioctl super", which caused test 170 to fail. |
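To illustrate the dependency described in the comment above, here is a minimal sketch of how a command could be run under a chosen debug mask, with the previous mask restored afterwards. The helper name run_with_debug_mask is hypothetical and is not part of the Lustre test framework; the sketch assumes only that "lctl get_param -n debug" and "lctl set_param debug=..." are available on the client, and it skips the command on machines without lctl.

```shell
#!/bin/sh
# Sketch only: run_with_debug_mask is a hypothetical helper, not part of
# the Lustre test framework.
run_with_debug_mask() {
    mask=$1
    shift
    # Without lctl there is no debug mask to change, so skip the command.
    if ! command -v lctl >/dev/null 2>&1; then
        echo "lctl not found; skipping: $*" >&2
        return 0
    fi
    saved=$(lctl get_param -n debug)   # remember the current mask
    lctl set_param debug="$mask"       # apply the mask under test
    "$@"
    rc=$?
    lctl set_param debug="$saved"      # restore the original mask
    return "$rc"
}
```

Assuming the usual ONLY=170 invocation from the lustre/tests directory, a comparison of the two cases might look like "run_with_debug_mask -1 env ONLY=170 bash sanity.sh" followed by "run_with_debug_mask rpctrace env ONLY=170 bash sanity.sh".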
| Comment by Jian Yu [ 19/May/14 ] |
|
Hi Di, I saw that sanity test_170() was added by you in commit d9bf86ae95a599bf10bbb05818317b48eb71db1b. Could you please give me some hints about why debug="rpctrace" affects the test results of sanity test 170 on SLES client? Thanks a lot. |
| Comment by Di Wang [ 27/May/14 ] |
|
Hi Yujian, test_170 is supposed to verify that "lctl df can identify and skip corrupted debug records" instead of abandoning the whole debug log file. I guess there is some debug format problem with "rpctrace", though I'm not sure what the real reason is. I think you need to get "$TMP/${file}_log_good" and have a look, or follow the test steps to repeat the test locally. Thanks. |
| Comment by Jian Yu [ 14/Jul/14 ] |
|
Thanks, Di. I can reproduce the failure every time with debug="rpctrace". I'll look into "$TMP/${tfile}_log_good". |
| Comment by Jian Yu [ 31/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1) The same failure occurred: |
| Comment by Bob Glossman (Inactive) [ 26/Dec/14 ] |
|
seen in master with sles11sp3 client/server: |
| Comment by Sarah Liu [ 17/Feb/15 ] |
|
hit this error in tag-2.6.94 test: https://testing.hpdd.intel.com/test_sets/c53f0196-b22a-11e4-af8e-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 26/Mar/15 ] |
|
another seen in master: |
| Comment by Sarah Liu [ 01/Apr/15 ] |
|
another instance |
| Comment by Sarah Liu [ 20/May/15 ] |
|
another instance: |
| Comment by Andreas Dilger [ 31/Aug/15 ] |
|
Can someone please definitively understand why this test is failing for SLES, and either fix it or add it to the ALWAYS_EXCEPT list for SLES. It doesn't make sense to exclude this via envdefinitions for SLES patches, when it will still fail when someone forgets to except it. |
| Comment by Gerrit Updater [ 31/Aug/15 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/16146 |
| Comment by Bob Glossman (Inactive) [ 31/Aug/15 ] |
|
best I can do for now is push a mod to ALWAYS_EXCEPT on sles11 |
| Comment by Gerrit Updater [ 16/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16146/ |
| Comment by Andreas Dilger [ 16/Oct/15 ] |
|
This appears to be failing on master on SLES11.3 tests in the past week: The patch http://review.whamcloud.com/16146 has landed, but that doesn't solve the problem itself. An improved patch would only skip the "expected N bad lines, but got M" check in test_170 for SLES11.3+ and RHEL7 as well rather than the whole test_170. While in there, it should also remove the "-rf" from rm -rf $DIR/$tfile since that is a file and not a directory and shouldn't fail in any case. One possible source of the bug is that the first cat $TMP/${tfile}_log_good >> $TMP/${tfile}_logs_corrupt is appending to a file (>>) instead of first truncating it (>) so if the ${tfile}_logs_corrupt file is lingering around from a previous test run for some reason it might cause problems. It does seem like the number of bad lines is always higher than the number of expected lines, so this seems like a candidate. In any case, this bug cannot be closed until the actual test failure is understood and fixed. |
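The append-versus-truncate hypothesis in the comment above can be illustrated with a small, self-contained sketch. It does not use Lustre at all: the file names and line counts are made up for the demonstration, and it only shows how a lingering file makes ">>" produce more lines than ">" would. It is not a confirmed explanation of the test_170 failure.

```shell
#!/bin/sh
# Demonstration only: file names and line counts are arbitrary stand-ins
# for ${tfile}_log_good and ${tfile}_logs_corrupt.
workdir=$(mktemp -d)
good="$workdir/log_good"
corrupt="$workdir/logs_corrupt"

# Five "good" records stand in for the decoded debug log.
printf 'record %s\n' 1 2 3 4 5 > "$good"

# Simulate a stale file left behind by an earlier run (two lines).
printf 'stale %s\n' 1 2 > "$corrupt"

# Appending (>>) keeps the stale lines in front of the new ones.
cat "$good" >> "$corrupt"
appended=$(wc -l < "$corrupt" | tr -d ' ')

# Recreate the stale file, then truncate (>) before writing.
printf 'stale %s\n' 1 2 > "$corrupt"
cat "$good" > "$corrupt"
truncated=$(wc -l < "$corrupt" | tr -d ' ')

echo "append: $appended lines, truncate: $truncated lines"
rm -r "$workdir"
```

If a leftover file were the cause, the surplus of bad lines would be expected to track whatever the stale file contained, which could be checked against "$TMP/${tfile}_logs_corrupt" on a failing client.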