[LU-8544] recovery-double-scale test_pairwise_fail: start client on trevis-54vm5 failed Created: 25/Aug/16  Updated: 05/Aug/20  Resolved: 29/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5788 recovery-double-scale test_pairwise_f... Resolved
is related to LU-6655 MDS LBUG: (ldlm_lib.c:2277:target_que... Resolved
is related to LU-7759 umount hanging in modern distros when... Resolved
is related to LU-6670 Hard Failover recovery-small test_28:... Open
is related to LU-4039 Failure on test suite replay-single t... Resolved
is related to LU-8526 replay-single test_90: @@@@@@ FAIL: w... Resolved
is related to LU-8731 lfs df exits with status 0 on failures Resolved
is related to LU-6610 lfs df -h query hangs when OST1 is u... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/72a0fd32-6033-11e6-aa74-5254006e85c2.

The sub-test test_pairwise_fail failed with the following error:

start client on trevis-54vm5 failed

test logs:

CMD: trevis-54vm5 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
+ pm -h powerman --off trevis-54vm5
Command completed successfully
+ pm -h powerman --on trevis-54vm5
Command completed successfully
14:43:16 (1470926596) waiting for trevis-54vm5 network 900 secs ...
waiting ping -c 1 -w 3 trevis-54vm5, 895 secs left ...
waiting ping -c 1 -w 3 trevis-54vm5, 890 secs left ...
waiting ping -c 1 -w 3 trevis-54vm5, 885 secs left ...
waiting ping -c 1 -w 3 trevis-54vm5, 880 secs left ...
14:43:48 (1470926628) network interface is UP
CMD: trevis-54vm5 hostname
pdsh@trevis-54vm1: trevis-54vm5: mcmd: connect failed: Connection refused
CMD: trevis-54vm5 hostname
Reintegrating trevis-54vm5
Starting client: trevis-54vm5:  -o user_xattr,flock trevis-54vm7:trevis-54vm3:/lustre /mnt/lustre
CMD: trevis-54vm5 mkdir -p /mnt/lustre
CMD: trevis-54vm5 mount -t lustre -o user_xattr,flock trevis-54vm7:trevis-54vm3:/lustre /mnt/lustre
CMD: trevis-54vm5 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all -lnet -lnd -pinger\" 4 
trevis-54vm5: stat: cannot read file system information for ‘/mnt/lustre’: Input/output error
 recovery-double-scale test_pairwise_fail: @@@@@@ FAIL: start client on trevis-54vm5 failed
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4804:error()
  = /usr/lib64/lustre/tests/recovery-double-scale.sh:72:reboot_recover_node()
  = /usr/lib64/lustre/tests/recovery-double-scale.sh:160:failover_pair()
  = /usr/lib64/lustre/tests/recovery-double-scale.sh:251:test_pairwise_fail()
  = /usr/lib64/lustre/tests/test-framework.sh:5068:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5107:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4954:run_test()
  = /usr/lib64/lustre/tests/recovery-double-scale.sh:303:main()


 Comments   
Comment by Saurabh Tandan (Inactive) [ 06/Sep/16 ]

This issue has been seen around 40 times in the past 30 days overall.

Comment by Saurabh Tandan (Inactive) [ 07/Sep/16 ]

This issue was first seen on master on 2016-07-08 for build #3405, tag 2.8.55, Lustre version 2.8.55.27.geb2657a:
https://testing.hpdd.intel.com/test_sets/e5a03c8e-4568-11e6-80b9-5254006e85c2

Comment by Peter Jones [ 08/Sep/16 ]

Hongchao

Could you please advise on this one?

Thanks

Peter

Comment by James Nunez (Inactive) [ 08/Sep/16 ]

Looking at test failures in Maloo, I see that this test started failing with this error message on 2016-07-01. I see failures on both onyx and trevis.

Comment by Hongchao Zhang [ 09/Sep/16 ]

The problem could be related to LU-7759: with LL_SBI_LAZYSTATFS set by default, the "df" check fails immediately instead of waiting for recovery to finish.

00000080:00000004:0.0:1472256895.043480:0:3882:0:(obd_class.h:1166:obd_statfs_async()) lustre-clilov-ffff880037e6c000: osfs ffff88007aa25210 age 4294647497, max_age 4294916319
00020000:00080000:0.0:1472256895.043483:0:3882:0:(lov_request.c:648:lov_prep_statfs_set()) lov idx 0 inactive
00020000:00080000:0.0:1472256895.043484:0:3882:0:(lov_request.c:648:lov_prep_statfs_set()) lov idx 1 inactive
00020000:00080000:0.0:1472256895.043485:0:3882:0:(lov_request.c:648:lov_prep_statfs_set()) lov idx 2 inactive
00020000:00080000:0.0:1472256895.043485:0:3882:0:(lov_request.c:648:lov_prep_statfs_set()) lov idx 3 inactive
00020000:00080000:0.0:1472256895.043485:0:3882:0:(lov_request.c:648:lov_prep_statfs_set()) lov idx 4 inactive
00020000:00080000:0.0:1472256895.043486:0:3882:0:(lov_request.c:648:lov_prep_statfs_set()) lov idx 5 inactive
00020000:00080000:0.0:1472256895.043486:0:3882:0:(lov_request.c:648:lov_prep_statfs_set()) lov idx 6 inactive
00000080:00020000:0.0:1472256895.043488:0:3882:0:(llite_lib.c:1890:ll_statfs_internal()) obd_statfs fails: rc = -5
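
The patch below switches the readiness check to "lfs df". A minimal sketch of that approach, not the actual test-framework.sh code (the helper name, mount point and timeout here are assumptions):

# Hypothetical helper: poll "lfs df", which queries each Lustre target
# directly, instead of plain df/stat -f, which with lazystatfs enabled
# can return -EIO while OSTs are still inactive.
wait_client_mount_usable() {
        local mnt=${1:-/mnt/lustre}   # mount point under test (assumed)
        local timeout=${2:-300}       # seconds to wait (assumed)
        local elapsed=0

        while (( elapsed < timeout )); do
                if lfs df "$mnt" > /dev/null 2>&1; then
                        return 0      # lfs df succeeded; treat the mount as usable
                fi
                sleep 5
                (( elapsed += 5 ))
        done
        return 1                      # recovery did not complete in time
}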
Comment by Gerrit Updater [ 13/Sep/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/22459
Subject: LU-8544 test: using lfs df in client_up
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7438b526bb79b5138fb51cd1ed58eadc1bbeab26

Comment by Gerrit Updater [ 29/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22459/
Subject: LU-8544 test: using lfs df in client_up
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 72ec6eb3c74c85f54277aadfd9b83167ea8e81ec

Comment by Peter Jones [ 29/Sep/16 ]

Landed for 2.9

Comment by Hongchao Zhang [ 14/Oct/16 ]

Hi Bruno,
I will check it and update its status once I find something.
