[LU-7653] replay-single/test_110f test failed Lustre: DEBUG MARKER: replay-single test_110f: FAIL: 1 != 2 after recovery Created: 12/Jan/16  Updated: 14/Apr/20  Resolved: 14/Apr/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: parinay v kondekar (Inactive) Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: None
Environment:

Configuration: 4 nodes (1 MDS / 1 OSS / 2 clients), dne_singlemds
Release: 2.6.32_431.29.2.el6_lustremaster_9267_2_g959f8f7 (Build Date: Sat 02 Jan 2016 05:21:40 PM UTC)
Server: 2.7.64
Client: 2.7.64


Attachments: 110f.lctl.tgz, 110f__0.console.MDS.log, 110f__0.messages.MDS.log, 110f__0.stdout.log, 110f__PTLDEBUG.lctl.tgz
Severity: 3
Rank (Obsolete): 9223372036854775807

Description
== replay-single test 110f: DNE: create striped dir, fail MDT1/MDT2 == 04:57:43 (1452229063)
Filesystem          1K-blocks  Used Available Use% Mounted on
fre1225@tcp:/lustre   1377952 68464   1236816   6% /mnt/lustre
Filesystem          1K-blocks  Used Available Use% Mounted on
fre1225@tcp:/lustre   1377952 68464   1236816   6% /mnt/lustre
Failing mds1 on fre1225
Stopping /mnt/mds1 (opts:) on fre1225
pdsh@fre1227: fre1225: ssh exited with exit code 1
Failing mds2 on fre1225
Stopping /mnt/mds2 (opts:) on fre1225
pdsh@fre1227: fre1225: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre1225
04:57:56 (1452229076) waiting for fre1225 network 900 secs ...
04:57:56 (1452229076) network interface is UP
mount facets: mds1
Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/mds1
pdsh@fre1227: fre1225: ssh exited with exit code 1
pdsh@fre1227: fre1225: ssh exited with exit code 1
Started lustre-MDT0000
reboot facets: mds2
Failover mds2 to fre1225
04:58:07 (1452229087) waiting for fre1225 network 900 secs ...
04:58:07 (1452229087) network interface is UP
mount facets: mds2
Starting mds2: -o rw,user_xattr  /dev/vdc /mnt/mds2
pdsh@fre1227: fre1225: ssh exited with exit code 1
pdsh@fre1227: fre1225: ssh exited with exit code 1
Started lustre-MDT0001
fre1228: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
fre1227: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
fre1228: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
fre1227: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
/mnt/lustre/d110f.replay-single/striped_dir has type dir OK
 replay-single test_110f: @@@@@@ FAIL: 1 != 2 after recovery 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
  = /usr/lib64/lustre/tests/replay-single.sh:3600:check_striped_dir_110()
  = /usr/lib64/lustre/tests/replay-single.sh:3725:test_110f()
  = /usr/lib64/lustre/tests/test-framework.sh:5045:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5082:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4899:run_test()
  = /usr/lib64/lustre/tests/replay-single.sh:3731:main()
Dumping lctl log to /tmp/test_logs/1452229057/replay-single.test_110f.*.1452229095.log
fre1228: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.

fre1225: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.

fre1226: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.

FAIL 110f (33s)
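
For reference, the subtest can be re-run in isolation through the test framework's ONLY filter. A minimal sketch, assuming a standard lustre-release test install under /usr/lib64/lustre/tests and a DNE configuration with at least 2 MDTs:

    # run only subtest 110f of replay-single; MDSCOUNT=2 matches the
    # "1 != 2" assertion above (expected stripe count == MDS count)
    cd /usr/lib64/lustre/tests
    MDSCOUNT=2 ONLY=110f bash replay-single.sh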


Comments
Comment by James Nunez (Inactive) [ 12/Jan/16 ]

Parinay - We don't see this test fail in our testing. How often do you see this failure and does it fail consistently for you?

Comment by parinay v kondekar (Inactive) [ 13/Jan/16 ]

James,
It's very consistent. I am attaching PTLDEBUG=-1 logs here.

Thanks.

Comment by Yang Sheng [ 23/Jun/16 ]

This issue is very easy to reproduce when mds1 & mds2 are set up on the same node. It looks like it can be fixed just by changing the failover order.

Thanks,
YangSheng

Comment by ZhangWei [ 06/Jul/17 ]

We reproduced this issue, and as Mr. Yang said, it can be fixed just by changing the script as follows:

test_110f() {
        ...
        mkdir -p $DIR/$tdir
        replay_barrier mds1
        replay_barrier mds2
        $LFS mkdir -i1 -c$MDSCOUNT $DIR/$tdir/striped_dir
        fail mds2,mds1    # the change: fail mds2 first, then mds1

        check_striped_dir_110 || error "check striped_dir failed"

        rm -rf $DIR/$tdir || error "rmdir failed"

        return 0
}
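
Relative to the stock test, the only change is the failover order (reconstructed from the comments here; the merged patch at https://review.whamcloud.com/27940 is the authoritative version):

    -       fail mds1,mds2
    +       fail mds2,mds1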

Our Lustre version is 2.9. Can someone help with this issue?

Comment by Gerrit Updater [ 06/Jul/17 ]

Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: https://review.whamcloud.com/27940
Subject: LU-7653 tests: replay-single/110f fails for mdts on same MDS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 10e2e6ab2a1c449ccd2cd5cacac1efcbefed29fb

Comment by ZhangWei [ 06/Jul/17 ]

But why can this issue be fixed just by changing the fail order of mds1 and mds2? Can someone explain this?

Comment by Yang Sheng [ 07/Jul/17 ]

Hi, ZhangWei,

In my understanding, getdirstripe will try to get status from mds1 in DNE. If mds1 has finished failover but mds2 has not, then we may get a stripe count less than the MDS count. So ensuring mds1 is the last to finish failover fixes this issue, especially when mds1 & mds2 are set up on the same node.
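
For context, the "1 != 2 after recovery" message comes from the stripe-count check in check_striped_dir_110() (replay-single.sh:3600 in the trace above). A simplified sketch of that check, reconstructed from the failure output rather than quoted verbatim:

    check_striped_dir_110() {
            # the striped dir must survive recovery ...
            $CHECKSTAT -t dir $DIR/$tdir/striped_dir ||
                    error "create striped dir failed"
            # ... and must still be striped across every MDT; if mds2
            # has not finished recovery when this runs, getdirstripe
            # can report 1 while MDSCOUNT is 2, producing "1 != 2"
            local stripe_count=$($LFS getdirstripe -c $DIR/$tdir/striped_dir)
            [ $stripe_count -eq $MDSCOUNT ] ||
                    error "$stripe_count != $MDSCOUNT after recovery"
    }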

Thanks,
YangSheng

Comment by Gerrit Updater [ 19/Jul/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27940/
Subject: LU-7653 tests: replay-single/110f fails for mdts on same MDS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c245e87fad249622b1974dd64f3e497653269ee6

Comment by Yang Sheng [ 19/Jul/17 ]

Landed to 2.10.

Comment by Joseph Gmitter (Inactive) [ 27/Sep/17 ]

This actually landed to master for 2.11.0.
(fixing the fixVersion)

Comment by Andreas Dilger [ 07/Oct/19 ]

Still seeing this failure very frequently on Oleg's test system:

Test session:
http://testing.linuxhacker.ru:3333/lustre-reports/3524/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

Subtest log:
http://testing.linuxhacker.ru:3333/lustre-reports/3524/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/replay-single.test_110f.test_log.oleg264-client.log

Comment by Gerrit Updater [ 06/Apr/20 ]

Alexander Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/38137
Subject: LU-7653 lod: fix stripe allocation during recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 990d9d8cf607b035d2b341588212b77faf99f309

Comment by Gerrit Updater [ 14/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38137/
Subject: LU-7653 lod: fix stripe allocation during recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 858c0c6959c1319e83a18be5ef6cb50251542052

Comment by Peter Jones [ 14/Apr/20 ]

Landed for 2.14
