[LU-7653] replay-single/test_110f test failed Lustre: DEBUG MARKER: replay-single test_110f: FAIL: 1 != 2 after recovery - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.14.0
Affects Version/s: Lustre 2.8.0
Labels:
None
Environment:
Configuration : 4 Node - ( 1 MDS/1 OSS/2 Clients)_dne_singlemds
Release
2.6.32_431.29.2.el6_lustremaster_9267_2_g959f8f7 Build Date: Sat 02 Jan 2016 05:21:40 PM UTC
Server 2.7.64
Client 2.7.64

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

== replay-single test 110f: DNE: create striped dir, fail MDT1/MDT2 == 04:57:43 (1452229063)
Filesystem          1K-blocks  Used Available Use% Mounted on
fre1225@tcp:/lustre   1377952 68464   1236816   6% /mnt/lustre
Filesystem          1K-blocks  Used Available Use% Mounted on
fre1225@tcp:/lustre   1377952 68464   1236816   6% /mnt/lustre
Failing mds1 on fre1225
Stopping /mnt/mds1 (opts:) on fre1225
pdsh@fre1227: fre1225: ssh exited with exit code 1
Failing mds2 on fre1225
Stopping /mnt/mds2 (opts:) on fre1225
pdsh@fre1227: fre1225: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre1225
04:57:56 (1452229076) waiting for fre1225 network 900 secs ...
04:57:56 (1452229076) network interface is UP
mount facets: mds1
Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/mds1
pdsh@fre1227: fre1225: ssh exited with exit code 1
pdsh@fre1227: fre1225: ssh exited with exit code 1
Started lustre-MDT0000
reboot facets: mds2
Failover mds2 to fre1225
04:58:07 (1452229087) waiting for fre1225 network 900 secs ...
04:58:07 (1452229087) network interface is UP
mount facets: mds2
Starting mds2: -o rw,user_xattr  /dev/vdc /mnt/mds2
pdsh@fre1227: fre1225: ssh exited with exit code 1
pdsh@fre1227: fre1225: ssh exited with exit code 1
Started lustre-MDT0001
fre1228: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
fre1227: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
fre1228: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
fre1227: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
/mnt/lustre/d110f.replay-single/striped_dir has type dir OK
 replay-single test_110f: @@@@@@ FAIL: 1 != 2 after recovery 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
  = /usr/lib64/lustre/tests/replay-single.sh:3600:check_striped_dir_110()
  = /usr/lib64/lustre/tests/replay-single.sh:3725:test_110f()
  = /usr/lib64/lustre/tests/test-framework.sh:5045:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5082:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4899:run_test()
  = /usr/lib64/lustre/tests/replay-single.sh:3731:main()
Dumping lctl log to /tmp/test_logs/1452229057/replay-single.test_110f.*.1452229095.log
fre1228: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.

fre1225: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.

fre1226: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.

FAIL 110f (33s)

Attachments

- 110f__0.console.MDS.log (149 kB, 13/Jan/16 3:17 AM)
- 110f__0.messages.MDS.log (234 kB, 13/Jan/16 3:17 AM)
- 110f__0.stdout.log (7 kB, 13/Jan/16 3:17 AM)
- 110f__PTLDEBUG.lctl.tgz (1.69 MB, 13/Jan/16 3:17 AM)
- 110f.lctl.tgz (958 kB, 12/Jan/16 4:34 AM)

Activity

Joseph Gmitter (Inactive) added a comment - 27/Sep/17 2:21 PM

This actually landed on master for 2.11.0.
(Fixing the fixVersion.)

Yang Sheng added a comment - 19/Jul/17 4:44 AM

Landed to 2.10.

Gerrit Updater added a comment - 19/Jul/17 3:32 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27940/
Subject: LU-7653 tests: replay-single/110f fails for mdts on same MDS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c245e87fad249622b1974dd64f3e497653269ee6

Yang Sheng added a comment - 07/Jul/17 6:30 AM

Hi, ZhangWei,

As I understand it, getdirstripe tries to get the status from mds1 in DNE. If mds1 has finished failover but mds2 has not, we may get a stripe count that is less than the MDS count. So making mds1 finish failover last can fix this issue, especially when mds1 and mds2 are set up on the same node.

Thanks,
YangSheng
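
The ordering argument above can be illustrated with a toy model. This is only a sketch of the explanation as stated in this comment, not Lustre code: it runs no Lustre commands, the function name `stripes_seen_when_mds1_ready` is hypothetical, and it assumes that the stripe-count query is answered as soon as mds1 finishes failover and reports only the MDTs that have finished by then.

```shell
#!/bin/bash
# Toy model only: no Lustre commands are invoked here.
#
# stripes_seen_when_mds1_ready ORDER
#   ORDER is a comma-separated failover order, e.g. "mds1,mds2".
#   Prints how many facets have finished failover at the moment mds1
#   finishes, which is (by assumption) when the stripe count is reported.
stripes_seen_when_mds1_ready() {
	local order=$1
	local recovered=0
	local facet
	local IFS=,	# split ORDER on commas, only inside this function

	for facet in $order; do
		recovered=$((recovered + 1))
		if [ "$facet" = "mds1" ]; then
			echo "$recovered"
			return 0
		fi
	done

	# mds1 never appeared in ORDER
	echo 0
	return 1
}

stripes_seen_when_mds1_ready "mds1,mds2"	# prints 1
stripes_seen_when_mds1_ready "mds2,mds1"	# prints 2
```

Under this assumption, the original order yields 1 where 2 is expected, which matches the "1 != 2 after recovery" message, while failing mds2 first yields the full count.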

ZhangWei (Inactive) added a comment - 06/Jul/17 12:21 PM

But why can this issue be fixed just by changing the fail order of mds1 and mds2? Can someone explain this?

Gerrit Updater added a comment - 06/Jul/17 3:20 AM

Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: https://review.whamcloud.com/27940
Subject: LU-7653 tests: replay-single/110f fails for mdts on same MDS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 10e2e6ab2a1c449ccd2cd5cacac1efcbefed29fb

ZhangWei (Inactive) added a comment - 06/Jul/17 2:17 AM - edited

We reproduced this issue and, as Mr. Yang said, it can be fixed just by changing the script as follows:

test_110f() {
        ...
        mkdir -p $DIR/$tdir
        replay_barrier mds1
        replay_barrier mds2
        $LFS mkdir -i1 -c$MDSCOUNT $DIR/$tdir/striped_dir
        # the changed line: mds2 is listed before mds1
        fail mds2,mds1

        check_striped_dir_110 || error "check striped_dir failed"

        rm -rf $DIR/$tdir || error "rmdir failed"

        return 0
}

Our Lustre version is 2.9. Can someone help with this issue?

Yang Sheng added a comment - 23/Jun/16 3:06 PM

This issue is very easy to reproduce when mds1 and mds2 are set up on the same node. It looks like it can be fixed just by changing the failover order.

Thanks,
YangSheng

parinay v kondekar (Inactive) added a comment - 13/Jan/16 3:15 AM

James,
It's very consistent. I am attaching PTLDEBUG=-1 logs here.

Thanks.

James Nunez (Inactive) added a comment - 12/Jan/16 6:17 PM

Parinay - We don't see this test fail in our testing. How often do you see this failure and does it fail consistently for you?

People

Assignee: Alexander Boyko
Reporter: parinay v kondekar (Inactive)
Votes: 0
Watchers: 8

Dates

Created: 12/Jan/16 4:34 AM
Updated: 14/Apr/20 2:28 PM
Resolved: 14/Apr/20 2:28 PM