[LU-8222] replay-single test_0a: test failed to respond and timed out Created: 31/May/16  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Server/client : RHEL 7, master branch, build# 39282


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ee2b9da2-25b6-11e6-a3be-5254006e85c2.

The sub-test test_0a failed with the following error:

test failed to respond and timed out

Client dmesg:

[ 4291.067685] Lustre: DEBUG MARKER: trevis-8vm1.trevis.hpdd.intel.com: == replay-single test 0a: empty replay =============================================================== 08:54:47 (1464512087)
[ 4291.460662] Lustre: DEBUG MARKER: trevis-8vm1.trevis.hpdd.intel.com: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
[ 4291.727658] Lustre: DEBUG MARKER: trevis-8vm1.trevis.hpdd.intel.com: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
[ 4292.566872] Lustre: DEBUG MARKER: trevis-8vm1.trevis.hpdd.intel.com: local REPLAY BARRIER on lustre-MDT0000
[ 4298.096880] LustreError: 11-0: lustre-MDT0000-mdc-ffff8800558b5800: operation obd_ping to node 10.9.4.84@tcp failed: rc = -107
[ 4298.104735] Lustre: lustre-MDT0000-mdc-ffff8800558b5800: Connection to lustre-MDT0000 (at 10.9.4.84@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 4300.110135] Lustre: 22656:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1464512089/real 1464512089]  req@ffff88007912e100 x1535651801466544/t0(0) o400->MGC10.9.4.84@tcp@10.9.4.84@tcp:26/25 lens 224/224 e 0 to 1 dl 1464512096 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 4300.125416] LustreError: 166-1: MGC10.9.4.84@tcp: Connection to MGS (at 10.9.4.84@tcp) was lost; in progress operations using this service will fail
[ 4306.131160] Lustre: 22655:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1464512096/real 1464512096]  req@ffff88007912f000 x1535651801466784/t0(0) o250->MGC10.9.4.84@tcp@10.9.4.84@tcp:26/25 lens 520/544 e 0 to 1 dl 1464512102 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 4309.041466] Lustre: DEBUG MARKER: trevis-8vm1.trevis.hpdd.intel.com: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib
[ 4309.311583] Lustre: DEBUG MARKER: trevis-8vm1.trevis.hpdd.intel.com: lctl get_param -n at_max
[ 4310.133413] Lustre: Evicted from MGS (at 10.9.4.84@tcp) after server handle changed from 0x3cefc44a406cb1f5 to 0x3cefc44a406cb671
[ 4310.144216] Lustre: MGC10.9.4.84@tcp: Connection restored to 10.9.4.84@tcp (at 10.9.4.84@tcp)
[ 4312.245146] Lustre: lustre-MDT0000-mdc-ffff8800558b5800: Connection restored to 10.9.4.84@tcp (at 10.9.4.84@tcp)
[ 4312.528473] Lustre: DEBUG MARKER: trevis-8vm1.trevis.hpdd.intel.com: /usr/sbin/lctl mark trevis-8vm1.trevis.hpdd.intel.com: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 3 sec
[ 4312.670970] Lustre: DEBUG MARKER: trevis-8vm1.trevis.hpdd.intel.com: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 3 sec
[ 7896.917424] SysRq : Show State

This issue first appeared on 05/11/2016 and has recurred roughly every other day (about once every two days) since.
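The DEBUG MARKER lines above show the client-side sequence up to the hang. A rough reconstruction of those steps as a shell fragment follows; note that `replay_barrier` and `fail` are helpers from the Lustre test framework (test-framework.sh), and the failover step after the barrier is an assumption inferred from the client eviction and reconnection in the log, not shown directly in the markers. The fragment only prints the plan rather than invoking Lustre tools:

```shell
#!/bin/sh
# Reconstruction of the replay-single test_0a ("empty replay") client steps,
# as seen in the DEBUG MARKER lines. MOUNT=/mnt/lustre matches the log.
MOUNT=${MOUNT:-/mnt/lustre}

# step: print a planned command instead of executing it, since the real
# commands (mcreate, lctl, replay_barrier) require a live Lustre setup.
step() { echo "STEP: $*"; }

# 1) Sanity create/remove on the primary mount point.
step "mcreate $MOUNT/fsa-\$(hostname); rm $MOUNT/fsa-\$(hostname)"

# 2) Repeat on the second mount point, if present.
step "if [ -d ${MOUNT}2 ]; then mcreate ${MOUNT}2/fsa-\$(hostname); rm ${MOUNT}2/fsa-\$(hostname); fi"

# 3) Set a replay barrier on the MDT (suspends commits so subsequent
#    operations must be replayed), then fail/restart the MDS. With no
#    uncommitted operations, recovery should replay an empty set.
step "replay_barrier mds1   # log: 'local REPLAY BARRIER on lustre-MDT0000'"
step "fail mds1             # assumed: MDS restart; client is evicted, reconnects"
```

In the failed run, the connection was restored within 3 seconds ("mds_server_uuid in FULL state after 3 sec"), yet the test never completed and the harness eventually triggered SysRq task-state dumps, which is why the failure mode is a timeout rather than an explicit error.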



 Comments   
Comment by Bob Glossman (Inactive) [ 03/Jun/16 ]

Another failure on master:
https://testing.hpdd.intel.com/test_sets/b4e0cc72-2988-11e6-acf3-5254006e85c2

Comment by nasf (Inactive) [ 06/Jun/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/bd176ace-2b1b-11e6-80b9-5254006e85c2

Comment by Jian Yu [ 07/Jun/16 ]

Another failure on master:
https://testing.hpdd.intel.com/test_sets/4b912dee-2ca7-11e6-80b9-5254006e85c2

Generated at Sat Feb 10 02:15:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.