[LU-3168] replay-ost-single test_3: rpc : FAIL: can't put import for osc.lustre-OST0000-osc-*.ost_server_uuid into FULL state after 662 sec, have REPLAY_LOCKS Created: 15/Apr/13  Updated: 22/Oct/15  Resolved: 22/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Low Priority Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-4230 Test failure on test suite replay-ost... Resolved
Severity: 3
Rank (Obsolete): 7728

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/cb34d710-a15b-11e2-b429-52540035b04c.

The sub-test test_3 failed with the following error:

test failed to respond and timed out

Info required for matching: replay-ost-single 3

From the test log:

== replay-ost-single test 3: Fail OST during write, with verification == 10:40:41 (1365529241)
Failing ost1 on wtm-13vm8
CMD: wtm-13vm8 grep -c /mnt/ost1' ' /proc/mounts
Stopping /mnt/ost1 (opts:) on wtm-13vm8
CMD: wtm-13vm8 umount -d /mnt/ost1
CMD: wtm-13vm8 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
reboot facets: ost1
Failover ost1 to wtm-13vm8
10:41:01 (1365529261) waiting for wtm-13vm8 network 900 secs ...
10:41:01 (1365529261) network interface is UP
CMD: wtm-13vm8 hostname
mount facets: ost1
CMD: wtm-13vm8 test -b /dev/lvm-OSS/P1
Starting ost1: /dev/lvm-OSS/P1 /mnt/ost1
CMD: wtm-13vm8 mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-OSS/P1 /mnt/ost1
CMD: wtm-13vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh set_default_debug \"0x33f0404\" \" 0xffb7e3ff\" 32
CMD: wtm-13vm8 e2label /dev/lvm-OSS/P1 2>/dev/null
Started lustre-OST0000
CMD: wtm-13vm1,wtm-13vm2.rosso.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh wait_import_state_mount FULL osc.lustre-OST0000-osc-*.ost_server_uuid
wtm-13vm2: CMD: wtm-13vm2.rosso.whamcloud.com lctl get_param -n at_max
wtm-13vm1: CMD: wtm-13vm1.rosso.whamcloud.com lctl get_param -n at_max
wtm-13vm1: rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0000-osc-*.ost_server_uuid into FULL state after 662 sec, have REPLAY_WAIT
wtm-13vm1: Trace dump:
wtm-13vm1: = /usr/lib64/lustre/tests/test-framework.sh:4024:error_noexit()
wtm-13vm1: = /usr/lib64/lustre/tests/test-framework.sh:4047:error()
wtm-13vm1: = /usr/lib64/lustre/tests/test-framework.sh:5070:_wait_import_state()
wtm-13vm1: = /usr/lib64/lustre/tests/test-framework.sh:5089:wait_import_state()
wtm-13vm1: = /usr/lib64/lustre/tests/test-framework.sh:5098:wait_import_state_mount()
wtm-13vm1: = rpc.sh:20:main()
[...]
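
For reference, the failing check comes from wait_import_state_mount() in test-framework.sh, which keeps re-reading the client import state until it reaches the expected value or the computed timeout (662 seconds here) runs out. A minimal sketch of that polling pattern, simplified rather than the exact test-framework code and assuming the parameter path shown in the log, is:

# Simplified sketch of the import-state wait (not the real helper, which
# derives its timeout from at_max and runs via rpc.sh on each client).
param="osc.lustre-OST0000-osc-*.ost_server_uuid"
expected="FULL"
deadline=$((SECONDS + 662))
while [ $SECONDS -lt $deadline ]; do
    # Output is "<target UUID> <import state>", e.g. "lustre-OST0000_UUID FULL"
    state=$(lctl get_param -n "$param" 2>/dev/null | awk '{print $2}' | uniq)
    [ "$state" = "$expected" ] && exit 0
    sleep 1
done
echo "FAIL: can't put import for $param into $expected state after 662 sec, have $state"
exit 1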

From the OSS console log:

[...]
10:52:20:Lustre: lustre-OST0000: Client e2129d7f-62b9-b16f-bf66-e1b35c1765b7 (at 10.10.16.142@tcp) reconnecting, waiting for 3 clients in recovery for 0:41
10:52:20:Lustre: Skipped 78214 previous similar messages
10:52:20:LustreError: 5789:0:(ldlm_resource.c:1171:ldlm_resource_get()) lvbo_init failed for resource 2820: rc -2
10:52:20:LustreError: 5789:0:(ldlm_resource.c:1171:ldlm_resource_get()) Skipped 544404 previous similar messages
10:52:20:Lustre: lustre-OST0000: Client e2129d7f-62b9-b16f-bf66-e1b35c1765b7 (at 10.10.16.142@tcp) reconnecting, waiting for 3 clients in recovery for 0:42
10:52:20:Lustre: Skipped 160009 previous similar messages
10:52:20:Lustre: DEBUG MARKER: /usr/sbin/lctl mark rpc : @@@@@@ FAIL: can\'t put import for osc.lustre-OST0000-osc-*.ost_server_uuid into FULL state after 662 sec, have REPLAY_WAIT
[...]
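
The console log shows the OST still waiting for 3 clients to finish recovery when the wait expired, so the import could not leave REPLAY_WAIT. When reproducing, the recovery state can be inspected from both sides (commands shown for illustration, using the target name from this run):

# On the OSS: recovery status of the restarted target (status, connected
# vs. completed clients, time remaining).
lctl get_param obdfilter.lustre-OST0000.recovery_status

# On a client: current import state for the same target, which is what
# wait_import_state_mount polls.
lctl get_param osc.lustre-OST0000-osc-*.ost_server_uuid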



 Comments   
Comment by Andreas Dilger [ 22/Apr/13 ]

It looks like this test has only started failing recently. Is it possible that the behaviour is different because hard failover mode has been enabled, or is it more likely to be a regression in the code?

Comment by Li Wei (Inactive) [ 05/Aug/13 ]

https://maloo.whamcloud.com/test_sets/6604a7f6-f9b9-11e2-aee1-52540035b04c

Comment by Bob Glossman (Inactive) [ 19/Sep/13 ]

https://maloo.whamcloud.com/test_sets/286162ce-2108-11e3-a2f9-52540035b04c

Maybe this bug should have its priority raised. Maloo says it's being hit a lot:

Failure Rate: 31.00% of last 100 executions [all branches]

Comment by Andreas Dilger [ 20/Sep/13 ]

It only shows at most a 1% hit rate on master.

Comment by Dmitry Eremin (Inactive) [ 25/Feb/14 ]

new failure: https://maloo.whamcloud.com/test_sets/2ed3f2a0-9df1-11e3-87da-52540035b04c

Comment by James Nunez (Inactive) [ 28/Oct/14 ]

Looks like the same problem for a b2_5 patch at: https://testing.hpdd.intel.com/test_sets/220dc3a4-5e9c-11e4-badb-5254006e85c2

The state/status is different:

rpc : @@@@@@ FAIL: can't put import for osc.lustre-OST0000-osc-*.ost_server_uuid into FULL state after 662 sec, have DISCONN

Comment by Jian Yu [ 15/Dec/14 ]

It looks like this is a duplicate of LU-4230.

Comment by nasf (Inactive) [ 04/May/15 ]

Another failure instance on b2_5:
https://testing.hpdd.intel.com/test_sets/e6c46776-f1f3-11e4-98d4-5254006e85c2

Comment by Andreas Dilger [ 22/Oct/15 ]

Closing this old bug as basically unreproducible.
