[LU-7428] conf-sanity test_84, replay-dual 0a: /dev/lvm-Role_MDS/P1 failed to initialize! Created: 15/Nov/15 Updated: 29/May/17 Resolved: 26/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4hc |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>. This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5d42a610-8187-11e5-a41e-5254006e85c2. The sub-test test_84 failed with the following error:
CMD: shadow-10vm4 e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: shadow-10vm4 e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: shadow-10vm4 e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Update not seen after 90s: wanted '' got 'lustre:MDT0000'
conf-sanity test_84: @@@@@@ FAIL: /dev/lvm-Role_MDS/P1 failed to initialize!
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4843:error()
= /usr/lib64/lustre/tests/test-framework.sh:1270:mount_facet()
= /usr/lib64/lustre/tests/test-framework.sh:1188:mount_facets()
= /usr/lib64/lustre/tests/test-framework.sh:2513:facet_failover()
= /usr/lib64/lustre/tests/conf-sanity.sh:5594:test_84()
= /usr/lib64/lustre/tests/test-framework.sh:5090:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5127:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4992:run_test()
Please provide additional information about the failure here. Info required for matching: conf-sanity 84 |
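The failure above is a timeout in the label-update poll: after the MDT's first mount, the on-disk label should flip from the colon form (lustre:MDT0000, set at format time) to the dash form (lustre-MDT0000). A minimal standalone sketch of that poll, under the assumption that it can be run directly against the device; the real loop is driven from mount_facet() in test-framework.sh, per the trace above:
# Minimal sketch of the poll that produced "Update not seen after 90s".
DEV=/dev/lvm-Role_MDS/P1
DEADLINE=$((SECONDS + 90))
while ((SECONDS < DEADLINE)); do
    # The grep matches the unregistered colon form, e.g. 'lustre:MDT0000';
    # an empty result means the label was rewritten to 'lustre-MDT0000'.
    LABEL=$(e2label "$DEV" 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}')
    [ -z "$LABEL" ] && exit 0    # label updated; device initialized
    sleep 5
done
echo "Update not seen after 90s: wanted '' got '$LABEL'"
exit 1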
| Comments |
| Comment by Andreas Dilger [ 15/Nov/15 ] |
|
This is failing about twice per day on master. |
| Comment by Gerrit Updater [ 27/Nov/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17371 |
| Comment by nasf (Inactive) [ 28/Nov/15 ] |
|
Here is another failure instance. |
| Comment by Andreas Dilger [ 28/Nov/15 ] |
|
Please don't close this bug, as I've got a patch tracked here that may fix the problem. |
| Comment by Gerrit Updater [ 30/Nov/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17371/ |
| Comment by Andreas Dilger [ 04/Dec/15 ] |
|
It looks like conf-sanity test_84 is still failing in some cases, even with this patch applied. It still fails consistently with the new e2fsprogs patches, based on master build d059b3c01. |
| Comment by Andreas Dilger [ 04/Dec/15 ] |
|
Bob, this test is failing too often (see the many different linked bugs) and doesn't provide much value for testing in comparison. Could you please add it to the ALWAYS_EXCEPT list until it can be fixed? My last attempt didn't seem to resolve the problem. |
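For reference, disabling a test this way is a one-line edit to the exception list near the top of conf-sanity.sh; a sketch, assuming the list follows the usual test-framework convention (the neighboring test numbers already in the list will differ):
# Tests named in ALWAYS_EXCEPT are skipped on every run
# until removed from the list again.
ALWAYS_EXCEPT="$CONF_SANITY_EXCEPT 84"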
| Comment by Gerrit Updater [ 04/Dec/15 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/17482 |
| Comment by Gerrit Updater [ 06/Dec/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/17482/ |
| Comment by Andreas Dilger [ 07/Dec/15 ] |
|
It looks like all of the e2fsprogs test failures are on CentOS 7. I don't know if that means the failures are only on CentOS 7, or if it is because the new e2fsprogs patches are not tested on any other distro. |
| Comment by Hongchao Zhang [ 10/Dec/15 ] |
|
Status update: |
| Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ] |
|
master, build# 3264, tag 2.7.64, Hard Failover: EL7 Server/SLES11 SP3 Client |
| Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ] |
|
master, build# 3264, 2.7.64 tag |
| Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ] |
|
master, build# 3264, 2.7.64 tag |
| Comment by Peter Jones [ 15/Dec/15 ] |
|
Hongchao, have the tests ever run successfully on autotest? Perhaps there is something missing from the TEI environment for running these tests that you can identify so the TEI team can correct it? Peter |
| Comment by Hongchao Zhang [ 17/Dec/15 ] |
|
Yes, there are successful tests on autotest: https://testing.hpdd.intel.com/sub_tests/query?commit=Update+results&page=3&sub_test%5Bquery_bugs%5D=&sub_test%5Bstatus%5D=&sub_test%5Bsub_test_script_id%5D=5e30cae0-7f56-11e4-b7e8-5254006e85c2&test_node%5Barchitecture_type_id%5D=&test_node%5Bdistribution_type_id%5D=0dcf0e82-e30f-11e4-9cb2-5254006e85c2&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node%5Bos_type_id%5D=&test_node_network%5Bnetwork_type_id%5D=&test_session%5Bend_date%5D=&test_session%5Bquery_recent_period%5D=&test_session%5Bstart_date%5D=&test_session%5Btest_group%5D=&test_session%5Btest_host%5D=&test_session%5Buser_id%5D=&test_set%5Btest_set_script_id%5D=7f66aa20-3db2-11e0-80c0-52540025f9af&utf8=✓&warn%5Bnotice%5D=true I'm not sure whether the problem is related to TEI or not. |
| Comment by Saurabh Tandan (Inactive) [ 19/Dec/15 ] |
|
Another instance for EL7.1 Server/EL7.1 Client - DNE |
| Comment by Andreas Dilger [ 23/Dec/15 ] |
|
Hongchao, it looks like the main problem is that the e2label update run as part of mkfs.lustre, changing the filesystem label from fsname:MDT0000 to fsname-MDT0000, is failing to be written to disk. It isn't clear whether that is a problem with the way the dev_rdonly code works, preventing the superblock update from being persistent, or whether something in the VM is dropping the writes from the guest.

What would make sense is to add some debugging to conf-sanity.sh test_84 (since this one seems to hit this failure most often) that runs e2label $MDSDEV to print the filesystem label after the initial mount + sync + sleep but before replay_barrier, and then prints it again in mount_facet if the mount has failed, so that we can see what the current label is. |
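A hedged sketch of that debugging, using helpers that do exist in test-framework.sh (do_facet, mdsdevname); the exact insertion points and the variable names inside mount_facet() are approximations:
# In conf-sanity.sh test_84: after the initial mount + sync + sleep,
# before replay_barrier, record the on-disk label.
echo "MDS label before barrier: $(do_facet mds1 e2label $(mdsdevname 1))"

# In test-framework.sh mount_facet(): on the failure path, print the
# label again so the log shows whether the ':' -> '-' rewrite ever
# reached disk ($device here stands for the facet's device path).
echo "label after failed mount: $(do_facet ${facet} e2label ${device})"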
| Comment by parinay v kondekar (Inactive) [ 28/Dec/15 ] |
Can somebody confirm if it's the same issue? Thanks |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for hardfailover: EL7 Server/Client |
| Comment by Gerrit Updater [ 27/Jan/16 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/18178 |
| Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ] |
|
Encountered the same issue for tag 2.7.66 for FULL - EL7.1 Server/EL6.7 Client, master, build# 3314 for replay-vbr.
Update not seen after 90s: wanted '' got 'lustre:MDT0000'
replay-vbr test_1b: @@@@@@ FAIL: /dev/lvm-Role_MDS/P1 failed to initialize! |
| Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ] |
|
Another failure for master: tag 2.7.66, FULL - EL7.1 Server/SLES11 SP3 Client, build# 3314 for replay-single.
Another instance for FULL - EL7.1 Server/EL7.1 Client - DNE, master, build# 3314 |
| Comment by Saurabh Tandan (Inactive) [ 09/Feb/16 ] |
|
Another instance found for hardfailover: EL7 Server/Client, tag 2.7.66, master build 3314
Another instance found for hardfailover: EL7 Server/SLES11 SP3 Client, tag 2.7.66, master build 3316
Another instance found for Full tag 2.7.66 - EL7.1 Server/EL6.7 Client, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/SLES11 SP3 Client, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - DNE, build# 3314 |
| Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ] |
|
Another instance found on b2_8 for failover testing, build# 6. |
| Comment by Gerrit Updater [ 11/Mar/16 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/18871 |
| Comment by Gerrit Updater [ 06/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18871/ |
| Comment by Joseph Gmitter (Inactive) [ 11/Apr/16 ] |
|
An additional fix has landed to master for 2.9.0. |
| Comment by Andreas Dilger [ 09/May/16 ] |
|
The patch landed to try the new workaround, but test 84 is still in the ALWAYS_EXCEPT list in conf-sanity.sh, so until that is removed there is no way to know whether this problem was actually fixed. |
| Comment by Gerrit Updater [ 13/May/16 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/20194 |
| Comment by Andreas Dilger [ 20/May/16 ] |
|
It might be worthwhile to test the patch from https://github.com/Xyratex/lustre-stable/commit/6197a27f174e683d3c66137db8976bddc7ef179b to see if that fixes the problem. I think that patch could be simplified to just call sb->s_op->s_freeze() before marking the device read-only. |
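The ordering that patch enforces can be sketched from userspace with standard util-linux tools; a rough analogy, assuming the target filesystem is mounted at $MNT on device $DEV (the Xyratex patch does the equivalent in-kernel before the read-only transition):
# Freeze the filesystem so the journal and superblock are quiescent,
# then mark the backing device read-only.
fsfreeze --freeze "$MNT"
blockdev --setro "$DEV"
# ... failover/replay testing happens against the frozen image ...
blockdev --setrw "$DEV"
fsfreeze --unfreeze "$MNT"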
| Comment by Gerrit Updater [ 01/Jun/16 ] |
|
Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/20535 |
| Comment by Gerrit Updater [ 02/Jun/16 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/20586 |
| Comment by Hongchao Zhang [ 02/Jun/16 ] |
|
The patch is ported from MRP-2135 (https://github.com/Xyratex/lustre-stable/commit/6197a27f174e683d3c66137db8976bddc7ef179b). |
| Comment by James A Simmons [ 07/Jun/16 ] |
|
Does this patch mean we don't need |
| Comment by Andreas Dilger [ 07/Jun/16 ] |
|
No, this won't replace it. This patch is to (hopefully) fix a problem where the device is sync'd and set read-only but loses some recent writes, for an unknown reason. This shows up with a variety of different symptoms, and may be a result of bad interactions with LVM and VM virtual block devices, or it may be caused by the dev read-only patches. |
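A hedged sketch of how one might check for that lost-write behavior by hand, with an example device path; this mimics the label rewrite the test depends on, not the exact in-kernel path:
DEV=/dev/lvm-Role_MDS/P1
e2label "$DEV" lustre-MDT0000   # rewrite the label directly
sync                            # flush before the read-only transition
blockdev --setro "$DEV"
blockdev --setrw "$DEV"
# Drop caches so e2label reads from the media, not the page cache.
echo 3 > /proc/sys/vm/drop_caches
e2label "$DEV"   # 'lustre-MDT0000' expected; seeing 'lustre:MDT0000'
                 # would mean the write was lost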
| Comment by Gerrit Updater [ 14/Jun/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20586/ |
| Comment by Peter Jones [ 13/Jul/16 ] |
|
Landed for 2.9 |
| Comment by John Hammond [ 09/Sep/16 ] |
|
Did the landing of http://review.whamcloud.com/20586/ resolve this issue? I see that we still have 84 in ALWAYS_EXCEPT. |
| Comment by Hongchao Zhang [ 13/Sep/16 ] |
|
The patch http://review.whamcloud.com/#/c/20194/ to remove test 84 from the ALWAYS_EXCEPT list has been refreshed, and it has passed the tests in Maloo now. |
| Comment by Andreas Dilger [ 15/Sep/16 ] |
|
Reopen until patch enabling test_84 actually lands. |
| Comment by Gerrit Updater [ 26/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20194/ |
| Comment by Peter Jones [ 26/Sep/16 ] |
|
Test re-enabled for 2.9 |
| Comment by Jian Yu [ 10/Oct/16 ] |
|
Hi Hongchao,
With patch http://review.whamcloud.com/7200 on the master branch, conf-sanity test 84 failed as follows:
CMD: onyx-31vm7 e2label /dev/mapper/mds1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Update not seen after 90s: wanted '' got 'lustre:MDT0000'
conf-sanity test_84: @@@@@@ FAIL: /dev/mapper/mds1_flakey failed to initialize!
https://testing.hpdd.intel.com/test_sets/e88a61c2-89bf-11e6-a8b7-5254006e85c2
Could you please advise? Thank you. |
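For background on the device name above: with patch 7200, the test framework wraps the MDS device in a device-mapper 'flakey' target (mds1_flakey) instead of relying on the old in-kernel dev_rdonly patch. A rough sketch of how such a device is built, with example paths and intervals (not the exact commands the framework runs):
# Wrap $DEV in a dm-flakey target named mds1_flakey.
DEV=/dev/lvm-Role_MDS/P1
SIZE=$(blockdev --getsz "$DEV")   # device size in 512-byte sectors
# Table format: 0 <size> flakey <dev> <offset> <up> <down> [features].
# up=0, down=255 with drop_writes makes the device acknowledge but
# silently discard all writes -- emulating the old dev_rdonly behavior.
dmsetup create mds1_flakey --table "0 $SIZE flakey $DEV 0 0 255 1 drop_writes"
# Restore normal pass-through behavior with a linear table.
dmsetup suspend mds1_flakey
dmsetup load mds1_flakey --table "0 $SIZE linear $DEV 0"
dmsetup resume mds1_flakey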
| Comment by Jian Yu [ 19/Oct/16 ] |
|
Hi Hongchao, |