[LU-7428] conf-sanity test_84, replay-dual 0a: /dev/lvm-Role_MDS/P1 failed to initialize! Created: 15/Nov/15  Updated: 29/May/17  Resolved: 26/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: p4hc

Attachments: Zip Archive reaply-dual-all-log.zip    
Issue Links:
Duplicate
is duplicated by LU-7481 Failover: recovery-mds-scale test_fai... Resolved
Related
is related to LU-7364 conf-sanity test_84 fails with llog_p... Open
is related to LU-6029 conf-sanity test_84: recovery_duratio... Resolved
is related to LU-7169 conf-sanity 84 restart mds1 failed Resolved
is related to LU-7097 conf-sanity test_84 (check recovery_t... Resolved
is related to LU-6992 recovery-random-scale test_fail_clien... Resolved
is related to LU-7222 conf-sanity test_84: invalid llog tai... Resolved
is related to LU-7368 e2fsck unsafe to interrupt with quota... Resolved
is related to LU-6789 Interop 2.5.3<->master conf-sanity te... Resolved
is related to LU-7361 conf-sanity test_84: Error: 'recovery... Resolved
is related to LU-7492 conf-sanity_87 test failed conf-sanit... Resolved
is related to LU-7100 conf-sanity test_84 LBUGS with “(llog... Closed
is related to LU-7509 Failover: recovery-random-scale test_... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5d42a610-8187-11e5-a41e-5254006e85c2.

The sub-test test_84 failed with the following error:

CMD: shadow-10vm4 e2label /dev/lvm-Role_MDS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: shadow-10vm4 e2label /dev/lvm-Role_MDS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: shadow-10vm4 e2label /dev/lvm-Role_MDS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Update not seen after 90s: wanted '' got 'lustre:MDT0000'
 conf-sanity test_84: @@@@@@ FAIL: /dev/lvm-Role_MDS/P1 failed to initialize! 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4843:error()
  = /usr/lib64/lustre/tests/test-framework.sh:1270:mount_facet()
  = /usr/lib64/lustre/tests/test-framework.sh:1188:mount_facets()
  = /usr/lib64/lustre/tests/test-framework.sh:2513:facet_failover()
  = /usr/lib64/lustre/tests/conf-sanity.sh:5594:test_84()
  = /usr/lib64/lustre/tests/test-framework.sh:5090:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5127:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4992:run_test()

Please provide additional information about the failure here.

Info required for matching: conf-sanity 84
Info required for matching: replay-dual 0a



 Comments   
Comment by Andreas Dilger [ 15/Nov/15 ]

This is failing about twice per day on master.

Comment by Gerrit Updater [ 27/Nov/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17371
Subject: LU-7428 tests: write superblock in conf-sanity test_84
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d3e516e59697f7c48e5cb97054ee04cf97dc7132

Comment by nasf (Inactive) [ 28/Nov/15 ]

It is another failure instance of LU-7169.

Comment by Andreas Dilger [ 28/Nov/15 ]

Please don't close this bug, as I've got a patch tracked here that may fix the problem.

Comment by Gerrit Updater [ 30/Nov/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17371/
Subject: LU-7428 tests: write superblock in conf-sanity test_84
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5fda01f3002e7e742a206ce149652c6b78356828

Comment by Andreas Dilger [ 04/Dec/15 ]

It looks like conf-sanity test_84 is still failing in some cases, even with this patch applied:

It still fails consistently with the new e2fsprogs patches based on master build d059b3c01 "LU-6693 out: not return NULL in object_update_param_get", though I can't really see why except that the superblock label never gets updated:
http://review.whamcloud.com/17151
http://review.whamcloud.com/17150
http://review.whamcloud.com/17152
http://review.whamcloud.com/17153

Comment by Andreas Dilger [ 04/Dec/15 ]

Bob, this test is failing too often (see the many linked bugs) and doesn't provide much testing value in return. Could you please add it to the ALWAYS_EXCEPT list until it can be fixed? My last attempt didn't seem to resolve the problem.
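For reference, disabling a sub-test only means appending its number to the ALWAYS_EXCEPT list near the top of conf-sanity.sh, which build_test_filter in test-framework.sh then honours. A minimal sketch of such a change (the surrounding list contents here are an assumption, not the actual patch):

    # conf-sanity.sh (sketch): skip test_84 until the label-update failure is fixed
    ALWAYS_EXCEPT="$CONF_SANITY_EXCEPT 84"
    build_test_filter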

Comment by Gerrit Updater [ 04/Dec/15 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/17482
Subject: LU-7428 test: disable conf-sanity, test_84
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a26001d568432d1fa646c6fc850e0ab66e41f97f

Comment by Gerrit Updater [ 06/Dec/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/17482/
Subject: LU-7428 test: disable conf-sanity, test_84
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 74d95a078c6725884f67a2737ea5b7e55fab1087

Comment by Andreas Dilger [ 07/Dec/15 ]

It looks like all of the e2fsprogs test failures are on CentOS 7. I don't know if that means the failures are only on CentOS 7, or if it is because the new e2fsprogs patches are not tested on any other distro.

Comment by Hongchao Zhang [ 10/Dec/15 ]

Status update:
I have tried to reproduce the issue on CentOS 7 with the latest e2fsprogs (branch remotes/origin/master-lustre), but I can't reproduce it.

Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ]

master, build# 3264, 2.7.64 tag
recovery-mds-scale test_failover_ost failed with the same issue.
Hard Failover: EL7 Server/Client
https://testing.hpdd.intel.com/test_sets/cf43bf1c-9e9a-11e5-b163-5254006e85c2

Hard Failover: EL7 Server/SLES11 SP3 Client
https://testing.hpdd.intel.com/test_sets/a39034e8-a077-11e5-8d69-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build# 3264, 2.7.64 tag
Regression:EL7.1 Server/EL7.1 Client
https://testing.hpdd.intel.com/test_sets/7704baac-9f37-11e5-ba94-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build# 3264, 2.7.64 tag
Regression:EL7.1 Server/SLES11 SP3 Client
https://testing.hpdd.intel.com/test_sets/18ce6c0a-9f2b-11e5-bf9b-5254006e85c2

Comment by Peter Jones [ 15/Dec/15 ]

Hongchao

Have the tests ever run successfully on autotest? Perhaps there is something missing from the TEI environment for running these tests that you can identify so the TEI team can correct it?

Peter

Comment by Hongchao Zhang [ 17/Dec/15 ]

Yes, there are successful runs of this test on autotest:

https://testing.hpdd.intel.com/sub_tests/query?commit=Update+results&page=3&sub_test%5Bquery_bugs%5D=&sub_test%5Bstatus%5D=&sub_test%5Bsub_test_script_id%5D=5e30cae0-7f56-11e4-b7e8-5254006e85c2&test_node%5Barchitecture_type_id%5D=&test_node%5Bdistribution_type_id%5D=0dcf0e82-e30f-11e4-9cb2-5254006e85c2&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node%5Bos_type_id%5D=&test_node_network%5Bnetwork_type_id%5D=&test_session%5Bend_date%5D=&test_session%5Bquery_recent_period%5D=&test_session%5Bstart_date%5D=&test_session%5Btest_group%5D=&test_session%5Btest_host%5D=&test_session%5Buser_id%5D=&test_set%5Btest_set_script_id%5D=7f66aa20-3db2-11e0-80c0-52540025f9af&utf8=✓&warn%5Bnotice%5D=true

I'm not sure whether the problem is related to TEI or not.

Comment by Saurabh Tandan (Inactive) [ 19/Dec/15 ]

Another instance for EL7.1 Server/EL7.1 Client - DNE
Master, Build# 3270
https://testing.hpdd.intel.com/test_sets/6cae7cf8-a26d-11e5-bdef-5254006e85c2

Comment by Andreas Dilger [ 23/Dec/15 ]

Hongchao, it looks like the main problem is that the label change made by the e2label command run as part of mkfs.lustre, which updates the filesystem label from fsname:MDT0000 to fsname-MDT0000, is failing to be written to disk. It isn't clear whether that is a problem with the way the dev_rdonly code works, preventing the superblock update from being persistent, or whether something in the VM is dropping the writes from the guest.

What would make sense is to add some debugging to conf-sanity.sh test_84 (since this one seems to hit the failure most often) that runs e2label $MDSDEV to print the filesystem label after the initial mount + sync + sleep but before replay_barrier, and then prints it again in mount_facet if the mount has failed, so that we can see what the current label is.
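A minimal sketch of that debugging (the exact hook points, and the use of do_facet/$MDSDEV, are assumptions based on the usual test-framework conventions):

    # 1) in test_84, after the initial mount + sync + sleep but before replay_barrier:
    echo "MDS label before replay_barrier: $(do_facet mds1 "e2label $MDSDEV" 2>/dev/null)"

    # 2) in mount_facet(), when the label check times out, just before calling error():
    echo "MDS label after failed mount: $(do_facet mds1 "e2label $MDSDEV" 2>/dev/null)"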

Comment by parinay v kondekar (Inactive) [ 28/Dec/15 ]
  • We see a similar failure during replay-dual tests on
    Configuration: 4-node cluster, 1 MDS / 1 OSS / 2 Clients.
    Release
    3.10.0_229.20.1.el7_lustremaster_master__81.x86_64_g70bb27b
    Server 2.7.64
    Client 2.7.64
    kernel - 3.10.0_229.20.1.el7
    git hash - 70bb27b
    
  • stdout.log
    == replay-dual test 0a: expired recovery with lost client ============================================ 05:44:27 (1450935867)
    Check file is LU482_FAILED=/tmp/replay-dual.lu482.wNwVYw
    Filesystem                 1K-blocks  Used Available Use% Mounted on
    192.168.113.21@tcp:/lustre   1345184 35144   1209424   3% /mnt/lustre
    total: 50 creates in 0.14 seconds: 355.12 creates/second
    fail_loc=0x80000514
    Failing mds1 on fre1321
    Stopping /mnt/mds1 (opts:) on fre1321
    pdsh@fre1323: fre1321: ssh exited with exit code 1
    reboot facets: mds1
    Failover mds1 to fre1321
    05:44:45 (1450935885) waiting for fre1321 network 900 secs ...
    05:44:45 (1450935885) network interface is UP
    mount facets: mds1
    Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/mds1
    Waiting 90 secs for update
    Waiting 80 secs for update
    Waiting 70 secs for update
    Waiting 60 secs for update
    Waiting 50 secs for update
    Waiting 40 secs for update
    Waiting 30 secs for update
    Waiting 20 secs for update
    Waiting 10 secs for update
    Update not seen after 90s: wanted '' got 'lustre:MDT0000'
     replay-dual test_0a: @@@@@@ FAIL: /dev/vdb failed to initialize! 
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:4822:error_noexit()
      = /usr/lib64/lustre/tests/test-framework.sh:4853:error()
      = /usr/lib64/lustre/tests/test-framework.sh:1270:mount_facet()
      = /usr/lib64/lustre/tests/test-framework.sh:1188:mount_facets()
      = /usr/lib64/lustre/tests/test-framework.sh:2523:facet_failover()
      = /usr/lib64/lustre/tests/replay-dual.sh:66:test_0a()
      = /usr/lib64/lustre/tests/test-framework.sh:5100:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5137:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:4954:run_test()
      = /usr/lib64/lustre/tests/replay-dual.sh:76:main()
    Dumping lctl log to /tmp/test_logs/1450935862/replay-dual.test_0a.*.1450935987.log
    fre1324: Warning: Permanently added 'fre1323,192.168.113.23' (ECDSA) to the list of known hosts.
    
    fre1322: Warning: Permanently added 'fre1323,192.168.113.23' (ECDSA) to the list of known hosts.
    
    fre1321: Warning: Permanently added 'fre1323,192.168.113.23' (ECDSA) to the list of known hosts.
    
    FAIL 0a (121s)
    
  • It's reproducible. Please note the kernel version is 3.10.x.
  • Attaching the logs; let me know if more information is required.

Can somebody confirm if it's the same issue?

Thanks

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hardfailover: EL7 Server/Client
https://testing.hpdd.intel.com/test_sets/285d11ca-bc00-11e5-a592-5254006e85c2

Comment by Gerrit Updater [ 27/Jan/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/18178
Subject: LU-7428 test: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4474b31c11f7b9034a25371abbaafe61371e9b85

Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ]

Encountered the same issue in replay-vbr for tag 2.7.66, FULL - EL7.1 Server/EL6.7 Client, master, build# 3314.
https://testing.hpdd.intel.com/test_sets/85c15246-ca91-11e5-9609-5254006e85c2
Encountered another instance for tag 2.7.66, FULL - EL7.1 Server/EL7.1 Client, master, build# 3314.
https://testing.hpdd.intel.com/test_sets/b9fde76c-ca88-11e5-84d3-5254006e85c2

Update not seen after 90s: wanted '' got 'lustre:MDT0000'
 replay-vbr test_1b: @@@@@@ FAIL: /dev/lvm-Role_MDS/P1 failed to initialize! 
Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ]

Another failure for master: tag 2.7.66, FULL - EL7.1 Server/SLES11 SP3 Client, build# 3314, in replay-single.
https://testing.hpdd.intel.com/test_sets/9fad2e16-ca7b-11e5-9609-5254006e85c2

Another instance for FULL - EL7.1 Server/EL7.1 Client - DNE, master, build# 3314
https://testing.hpdd.intel.com/test_sets/a973fa52-cac5-11e5-9609-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9ece523c-cac5-11e5-9609-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 09/Feb/16 ]

Another instance found for hardfailover : EL7 Server/Client, tag 2.7.66, master build 3314
https://testing.hpdd.intel.com/test_sessions/8d13249a-ca8f-11e5-9609-5254006e85c2

Another instance found for hardfailover : EL7 Server/SLES11 SP3 Client, tag 2.7.66, master build 3316
https://testing.hpdd.intel.com/test_sessions/2fbf67e4-cd4c-11e5-b1fa-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/EL6.7 Client, build# 3314
https://testing.hpdd.intel.com/test_sets/85c15246-ca91-11e5-9609-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client, build# 3314
https://testing.hpdd.intel.com/test_sets/b9fde76c-ca88-11e5-84d3-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/SLES11 SP3 Client, build# 3314
https://testing.hpdd.intel.com/test_sets/a63f3418-ca7b-11e5-9609-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9fad2e16-ca7b-11e5-9609-5254006e85c2

Another instance found for Full tag 2.7.66 -EL7.1 Server/EL7.1 Client - DNE, build# 3314
https://testing.hpdd.intel.com/test_sets/a973fa52-cac5-11e5-9609-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9ece523c-cac5-11e5-9609-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

Another instance found on b2_8 for failover testing, build# 6.
https://testing.hpdd.intel.com/test_sessions/eb9f29ec-d8da-11e5-83e2-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/2f0aa9f6-d5a5-11e5-9cc2-5254006e85c2

Comment by Gerrit Updater [ 11/Mar/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/18871
Subject: LU-7428 test: commit the label change to disk
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9c6eb1c7f48e205168bb9f3a011af8c957b19616

Comment by Gerrit Updater [ 06/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18871/
Subject: LU-7428 test: commit the label change to disk
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4635d6235f8c8f4bb212aa59710c4f68db6acd7a

Comment by Joseph Gmitter (Inactive) [ 11/Apr/16 ]

An additional fix has landed to master for 2.9.0.

Comment by Andreas Dilger [ 09/May/16 ]

The patch landed to try the new workaround, but test 84 is still in the ALWAYS_EXCEPT list in conf-sanity.sh, so until that is removed there is no way to know whether this problem was actually fixed.

Comment by Gerrit Updater [ 13/May/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/20194
Subject: LU-7428 test: remove test 84 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 589c22a31431c94fcd981c95847d295f841fb1e8

Comment by Andreas Dilger [ 20/May/16 ]

It might be worthwhile to test the patch from https://github.com/Xyratex/lustre-stable/commit/6197a27f174e683d3c66137db8976bddc7ef179b to see whether it fixes the problem. I think that patch could be simplified to just call sb->s_op->s_freeze() before marking the device read-only.
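As a userspace analogue (sketch only, not the kernel patch itself): freezing the filesystem flushes all dirty state, including the superblock label, to disk and blocks new writes until it is thawed, which is what the suggested call would do from inside the osd code before the device is marked read-only. The /mnt/mds1 mount point follows the test logs above:

    # flush everything to disk and block writes, mark the device read-only, then thaw
    fsfreeze --freeze /mnt/mds1
    # ... set the underlying block device read-only here ...
    fsfreeze --unfreeze /mnt/mds1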

Comment by Gerrit Updater [ 01/Jun/16 ]

Gu Zheng (gzheng@ddn.com) uploaded a new patch: http://review.whamcloud.com/20535
Subject: LU-7428 osd: freeze fs before set device readonly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cb2a762f99cef9ff86dc76445248f91b40f0199b

Comment by Gerrit Updater [ 02/Jun/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/20586
Subject: LU-7428 osd: set rdonly correctly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 521b8290fbf0b47d4ad03272a206d038f648db2d

Comment by Hongchao Zhang [ 02/Jun/16 ]

The patch ported from MRP-2135 (https://github.com/Xyratex/lustre-stable/commit/6197a27f174e683d3c66137db8976bddc7ef179b)
is tracked at http://review.whamcloud.com/20586

Comment by James A Simmons [ 07/Jun/16 ]

Does this patch mean we don't need LU-684 anymore?

Comment by Andreas Dilger [ 07/Jun/16 ]

No, this won't replace LU-684.

This patch is to (hopefully) fix a problem where the device is sync'd and set read-only, but loses some recent writes, for an unknown reason. This shows up with a variety of different symptoms, and may be a result of bad interactions with LVM and VM virtual block devices, or it may be caused by the dev readonly patches.

Comment by Gerrit Updater [ 14/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20586/
Subject: LU-7428 osd: set device read-only correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a079ade7913b923b795ea5c01df4e69bf1a87691

Comment by Peter Jones [ 13/Jul/16 ]

Landed for 2.9

Comment by John Hammond [ 09/Sep/16 ]

Did the landing of http://review.whamcloud.com/20586/ resolve this issue? I see that we still have 84 in ALWAYS_EXCEPT.

Comment by Hongchao Zhang [ 13/Sep/16 ]

The patch http://review.whamcloud.com/#/c/20194/ to remove test 84 has been refreshed and has now passed the tests in Maloo.

Comment by Andreas Dilger [ 15/Sep/16 ]

Reopen until patch enabling test_84 actually lands.

Comment by Gerrit Updater [ 26/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20194/
Subject: LU-7428 test: remove test 84 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e40a3fd8a531ed60528ca82e02ce41918b1be6ba

Comment by Peter Jones [ 26/Sep/16 ]

Test re-enabled for 2.9

Comment by Jian Yu [ 10/Oct/16 ]

Hi Hongchao,

With patch http://review.whamcloud.com/7200 on master branch, conf-sanity test 84 failed as follows:

CMD: onyx-31vm7 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
Update not seen after 90s: wanted '' got 'lustre:MDT0000'
 conf-sanity test_84: @@@@@@ FAIL: /dev/mapper/mds1_flakey failed to initialize! 

https://testing.hpdd.intel.com/test_sets/e88a61c2-89bf-11e6-a8b7-5254006e85c2

Could you please advise? Thank you.

Comment by Jian Yu [ 19/Oct/16 ]

Hi Hongchao,
Since the above failure is specific to the dm-flakey patch, I filed LU-8729 to track it.
