[LU-7653] replay-single/test_110f test failed Lustre: DEBUG MARKER: replay-single test_110f: FAIL: 1 != 2 after recovery Created: 12/Jan/16 Updated: 14/Apr/20 Resolved: 14/Apr/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | parinay v kondekar (Inactive) | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | Configuration: 4 Node - (1 MDS / 1 OSS / 2 Clients)_dne_singlemds |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
== replay-single test 110f: DNE: create striped dir, fail MDT1/MDT2 == 04:57:43 (1452229063)
Filesystem           1K-blocks  Used Available Use% Mounted on
fre1225@tcp:/lustre    1377952 68464   1236816   6% /mnt/lustre
Filesystem           1K-blocks  Used Available Use% Mounted on
fre1225@tcp:/lustre    1377952 68464   1236816   6% /mnt/lustre
Failing mds1 on fre1225
Stopping /mnt/mds1 (opts:) on fre1225
pdsh@fre1227: fre1225: ssh exited with exit code 1
Failing mds2 on fre1225
Stopping /mnt/mds2 (opts:) on fre1225
pdsh@fre1227: fre1225: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre1225
04:57:56 (1452229076) waiting for fre1225 network 900 secs ...
04:57:56 (1452229076) network interface is UP
mount facets: mds1
Starting mds1: -o rw,user_xattr /dev/vdb /mnt/mds1
pdsh@fre1227: fre1225: ssh exited with exit code 1
pdsh@fre1227: fre1225: ssh exited with exit code 1
Started lustre-MDT0000
reboot facets: mds2
Failover mds2 to fre1225
04:58:07 (1452229087) waiting for fre1225 network 900 secs ...
04:58:07 (1452229087) network interface is UP
mount facets: mds2
Starting mds2: -o rw,user_xattr /dev/vdc /mnt/mds2
pdsh@fre1227: fre1225: ssh exited with exit code 1
pdsh@fre1227: fre1225: ssh exited with exit code 1
Started lustre-MDT0001
fre1228: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
fre1227: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
fre1228: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
fre1227: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
/mnt/lustre/d110f.replay-single/striped_dir has type dir OK
 replay-single test_110f: @@@@@@ FAIL: 1 != 2 after recovery
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
  = /usr/lib64/lustre/tests/replay-single.sh:3600:check_striped_dir_110()
  = /usr/lib64/lustre/tests/replay-single.sh:3725:test_110f()
  = /usr/lib64/lustre/tests/test-framework.sh:5045:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5082:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4899:run_test()
  = /usr/lib64/lustre/tests/replay-single.sh:3731:main()
Dumping lctl log to /tmp/test_logs/1452229057/replay-single.test_110f.*.1452229095.log
fre1228: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
fre1225: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
fre1226: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
FAIL 110f (33s) |
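For reference, a minimal sketch of rerunning only this subtest on a DNE configuration is shown below. It assumes a standard lustre/tests installation (as in the trace above) and uses the usual test-framework ONLY/MDSCOUNT variables; the cluster-specific configuration from this report is not reproduced here.

# Sketch only: assumes the test framework is already configured for the cluster
# (MDS/OSS/client host settings). ONLY= restricts replay-single.sh to one subtest.
cd /usr/lib64/lustre/tests
MDSCOUNT=2 ONLY=110f bash replay-single.sh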
| Comments |
| Comment by James Nunez (Inactive) [ 12/Jan/16 ] |
|
Parinay - We don't see this test fail in our testing. How often do you see this failure and does it fail consistently for you? |
| Comment by parinay v kondekar (Inactive) [ 13/Jan/16 ] |
|
James, Thanks. |
| Comment by Yang Sheng [ 23/Jun/16 ] |
|
This issue is very easy to reproduce when mds1 & mds2 are set up on the same node. It looks like it can be fixed just by changing the failover order. Thanks, |
| Comment by ZhangWei [ 06/Jul/17 ] |
|
We reproduced this issue and, as Yang Sheng said, it can be fixed just by changing the failover order in the script as follows:
test_110f() {
...
mkdir -p $DIR/$tdir
replay_barrier mds1
replay_barrier mds2
$LFS mkdir -i1 -c$MDSCOUNT $DIR/$tdir/striped_dir
fail mds2,mds1    # changed from "fail mds1,mds2"
check_striped_dir_110 || error "check striped_dir failed"
rm -rf $DIR/$tdir || error "rmdir failed"
return 0
}
Our Lustre version is 2.9. Can someone help with this issue? |
| Comment by Gerrit Updater [ 06/Jul/17 ] |
|
Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: https://review.whamcloud.com/27940 |
| Comment by ZhangWei [ 06/Jul/17 ] |
|
But why can this issue be fixed just by changing the failover order of mds1 and mds2? Can someone explain this? |
| Comment by Yang Sheng [ 07/Jul/17 ] |
|
Hi ZhangWei, to my understanding, getdirstripe will try to get the status from mds1 in DNE. If mds1 has finished failover but mds2 has not, then we may get a stripe count less than MDSCOUNT. So making mds1 the last one to finish failover can fix this issue, especially when mds1 & mds2 are set up on the same node. Thanks, |
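To make that concrete, the post-recovery check essentially compares the recovered directory's stripe count against MDSCOUNT. The sketch below is modeled on check_striped_dir_110() in replay-single.sh but simplified; treat it as an illustration of the race described above, not the exact upstream code.

# Sketch of the stripe-count check; CHECKSTAT/LFS/DIR/tdir are the usual
# test-framework variables.
check_striped_dir_110() {
    local stripe_count

    $CHECKSTAT -t dir $DIR/$tdir/striped_dir ||
        error "striped dir does not exist"
    stripe_count=$($LFS getdirstripe -c $DIR/$tdir/striped_dir)
    # If mds1 answers before mds2 has replayed the striped mkdir, the
    # reported count can still be 1, producing "1 != 2 after recovery".
    [ $stripe_count -eq $MDSCOUNT ] ||
        error "$stripe_count != $MDSCOUNT after recovery"
    return 0
}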
| Comment by Gerrit Updater [ 19/Jul/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27940/ |
| Comment by Yang Sheng [ 19/Jul/17 ] |
|
Landed to 2.10. |
| Comment by Joseph Gmitter (Inactive) [ 27/Sep/17 ] |
|
This is actually landed to master for 2.11.0. |
| Comment by Andreas Dilger [ 07/Oct/19 ] |
|
Still seeing this failure very frequently on Oleg's test system: Test session: |
| Comment by Gerrit Updater [ 06/Apr/20 ] |
|
Alexander Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/38137 |
| Comment by Gerrit Updater [ 14/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38137/ |
| Comment by Peter Jones [ 14/Apr/20 ] |
|
Landed for 2.14 |