[LU-4621] recovery-mds-scale: test_failover_ost Created: 12/Feb/14  Updated: 04/Oct/17  Resolved: 27/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.3
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: 22pl
Environment:

client and server: lustre-master build # 1877


Issue Links:
Duplicate
is duplicated by LU-5785 recovery-mds-scale test_failover_ost:... Resolved
Related
is related to LU-3325 recovery-mds-scale test_failover_mds:... Resolved
is related to LU-5157 Failure on test suite recovery-mds-sc... Resolved
Severity: 3
Rank (Obsolete): 12647

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/2a24816e-9088-11e3-91ee-52540035b04c.

The sub-test test_failover_ost failed with the following error:

test_failover_ost returned 1

Client 3 console shows a process stuck in the D (uninterruptible sleep) state:

11:05:02:Lustre: DEBUG MARKER: == recovery-mds-scale test failover_ost: failover OST == 11:04:36 (1391713476)
11:05:03:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on client-32vm5
11:05:03:Lustre: DEBUG MARKER: Started client load: dd on client-32vm5
11:05:04:Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:
11:05:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on client-32vm6
11:05:05:Lustre: DEBUG MARKER: Started client load: tar on client-32vm6
11:05:05:Lustre: DEBUG MARKER: cat /tmp/client-load.pid
11:05:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=1200
11:05:06:Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
11:05:07:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] &&
11:05:07:		echo $(< /proc/sys/lnet/catastrophe) || echo 0);
11:05:08:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
11:05:08:		exit $rc
11:05:09:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_tar.sh
11:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost7 recovery complete before doing next failover...
11:05:09:Lustre: DEBUG MARKER: Wait ost7 recovery complete before doing next failover...
11:05:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff880028c28400 x1459227849618364/t309237768099(309237768099) o101->lustre-MDT0000-mdc-ffff88007a375c00@10.10.4.198@tcp:12/10 lens 568/544 e 0 to 0 dl 1391713537 ref 2 fl Interpret:R/4/0 rc 301/301
11:05:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 53 previous similar messages
11:05:11:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88002fdea800 x1459227849618852/t309237768160(309237768160) o101->lustre-MDT0000-mdc-ffff88007a375c00@10.10.4.198@tcp:12/10 lens 568/544 e 0 to 0 dl 1391713537 ref 2 fl Interpret:R/4/0 rc 301/301
11:05:11:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 13 previous similar messages
11:05:11:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
11:05:12:Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
11:05:12:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
11:05:13:Lustre: lustre-MDT0000-mdc-ffff88007a375c00: Connection restored to lustre-MDT0000 (at 10.10.4.198@tcp)
11:05:13:Lustre: DEBUG MARKER: lctl get_param -n at_max
11:05:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:15:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:38:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:38:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:39:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:40:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:40:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:42:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:42:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:43:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:43:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:43:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:44:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:45:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:45:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:48:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:48:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:49:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:50:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:50:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:50:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:52:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:52:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:53:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:53:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
11:05:53:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on ost7
11:05:54:Lustre: DEBUG MARKER: Starting failover on ost7
11:10:01:Lustre: 1984:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391713512/real 1391713512]  req@ffff880024f1fc00 x1459227849666748/t0(0) o2->lustre-OST0004-osc-ffff88007a375c00@10.10.4.199@tcp:28/4 lens 440/432 e 0 to 1 dl 1391713530 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
11:10:01:Lustre: 1984:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
11:10:02:Lustre: lustre-OST0004-osc-ffff88007a375c00: Connection to lustre-OST0004 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
11:10:02:Lustre: lustre-OST0002-osc-ffff88007a375c00: Connection to lustre-OST0002 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
11:10:02:Lustre: lustre-OST0005-osc-ffff88007a375c00: Connection to lustre-OST0005 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
11:10:02:Lustre: Skipped 1 previous similar message
11:10:02:LNet: Host 10.10.4.199 reset our connection while we were sending data; it may have rebooted.
11:10:02:Lustre: lustre-OST0000-osc-ffff88007a375c00: Connection to lustre-OST0000 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
11:10:02:Lustre: Skipped 2 previous similar messages
11:10:02:Lustre: lustre-OST0000-osc-ffff88007a375c00: Connection restored to lustre-OST0000 (at 10.10.4.203@tcp)
11:10:02:Lustre: lustre-OST0001-osc-ffff88007a375c00: Connection restored to lustre-OST0001 (at 10.10.4.203@tcp)
11:10:03:Lustre: 1982:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391713662/real 1391713662]  req@ffff88002438c400 x1459227849671944/t0(0) o8->lustre-OST0002-osc-ffff88007a375c00@10.10.4.199@tcp:28/4 lens 400/544 e 0 to 1 dl 1391713687 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
11:10:03:Lustre: 1982:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 128 previous similar messages
11:10:04:Lustre: lustre-OST0002-osc-ffff88007a375c00: Connection restored to lustre-OST0002 (at 10.10.4.203@tcp)
11:10:04:INFO: task tar:24176 blocked for more than 120 seconds.
11:10:04:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
11:10:05:tar           D 0000000000000001     0 24176  24163 0x00000080
11:10:05: ffff880062a21a18 0000000000000086 ffff880062a219e0 ffff880062a219dc
11:10:05: ffff88007ae95538 ffff88007f823240 ffff880002216700 0000000000000400
11:10:05: ffff88002fcc9058 ffff880062a21fd8 000000000000fb88 ffff88002fcc9058
11:10:05:Call Trace:
11:10:05: [<ffffffff8150f035>] schedule_timeout+0x215/0x2e0
11:10:05: [<ffffffffa0743d90>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
11:10:05: [<ffffffff8150ecb3>] wait_for_common+0x123/0x180
11:10:06: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
11:10:06: [<ffffffff8150edcd>] wait_for_completion+0x1d/0x20
11:10:06: [<ffffffffa0b5d88c>] osc_io_setattr_end+0xbc/0x190 [osc]
11:10:06: [<ffffffffa093b3a0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
11:10:07: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
11:10:07: [<ffffffffa055dc80>] ? cl_io_start+0x0/0x140 [obdclass]
11:10:07: [<ffffffffa093b491>] lov_io_end_wrapper+0xf1/0x100 [lov]
11:10:07: [<ffffffffa093b07e>] lov_io_call+0x8e/0x130 [lov]
11:10:07: [<ffffffffa093ce0c>] lov_io_end+0x4c/0xf0 [lov]
11:10:07: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
11:10:08: [<ffffffffa0561e52>] cl_io_loop+0xc2/0x1b0 [obdclass]
11:10:09: [<ffffffffa0a1bb58>] cl_setattr_ost+0x218/0x2f0 [lustre]
11:10:09: [<ffffffffa09e67f5>] ll_setattr_raw+0xa45/0x10c0 [lustre]
11:10:09: [<ffffffffa09e6ecd>] ll_setattr+0x5d/0xf0 [lustre]
11:10:09: [<ffffffff8119ea78>] notify_change+0x168/0x340
11:10:10: [<ffffffff811b2b1c>] utimes_common+0xdc/0x1b0
11:10:10: [<ffffffff81182bf1>] ? __fput+0x1a1/0x210
11:10:10: [<ffffffff811b2cce>] do_utimes+0xde/0xf0
11:10:10: [<ffffffff811b2de2>] sys_utimensat+0x32/0x90
11:10:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
11:10:10:Lustre: lustre-OST0003-osc-ffff88007a375c00: Connection restored to lustre-OST0003 (at 10.10.4.203@tcp)
11:10:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff880023a03800 x1459227849663432/t4295715092(4295715092) o2->lustre-OST0004-osc-ffff88007a375c00@10.10.4.203@tcp:28/4 lens 440/400 e 0 to 0 dl 1391713884 ref 2 fl Interpret:R/4/0 rc -2/-2
11:10:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 117 previous similar messages
11:12:36:LustreError: 11-0: lustre-OST0004-osc-ffff88007a375c00: Communicating with 10.10.4.203@tcp, operation ldlm_enqueue failed with -12.
11:12:36:LustreError: 11-0: lustre-OST0004-osc-ffff88007a375c00: Communicating with 10.10.4.203@tcp, operation ldlm_enqueue failed with -12.
11:12:36:LustreError: 1982:0:(import.c:631:ptlrpc_connect_import()) already connecting
11:12:36:LustreError: 1982:0:(import.c:631:ptlrpc_connect_import()) already connecting
11:12:37:Lustre: lustre-OST0005-osc-ffff88007a375c00: Connection restored to lustre-OST0005 (at 10.10.4.203@tcp)
11:12:37:INFO: task tar:24176 blocked for more than 120 seconds.
11:12:37:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
11:12:37:tar           D 0000000000000001     0 24176  24163 0x00000080
11:12:39: ffff880062a21a18 0000000000000086 ffff880062a219e0 ffff880062a219dc
11:12:39: ffff88007ae95538 ffff88007f823240 ffff880002216700 0000000000000400
11:12:40: ffff88002fcc9058 ffff880062a21fd8 000000000000fb88 ffff88002fcc9058
11:12:40:Call Trace:
11:12:40: [<ffffffff8150f035>] schedule_timeout+0x215/0x2e0
11:12:40: [<ffffffffa0743d90>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
11:12:40: [<ffffffff8150ecb3>] wait_for_common+0x123/0x180
11:12:40: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
11:12:41: [<ffffffff8150edcd>] wait_for_completion+0x1d/0x20
11:12:41: [<ffffffffa0b5d88c>] osc_io_setattr_end+0xbc/0x190 [osc]
11:12:41: [<ffffffffa093b3a0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
11:12:42: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
11:12:42: [<ffffffffa055dc80>] ? cl_io_start+0x0/0x140 [obdclass]
11:12:42: [<ffffffffa093b491>] lov_io_end_wrapper+0xf1/0x100 [lov]
11:12:42: [<ffffffffa093b07e>] lov_io_call+0x8e/0x130 [lov]
11:12:42: [<ffffffffa093ce0c>] lov_io_end+0x4c/0xf0 [lov]
11:12:43: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
11:12:44: [<ffffffffa0561e52>] cl_io_loop+0xc2/0x1b0 [obdclass]
11:12:44: [<ffffffffa0a1bb58>] cl_setattr_ost+0x218/0x2f0 [lustre]
11:12:44: [<ffffffffa09e67f5>] ll_setattr_raw+0xa45/0x10c0 [lustre]
11:12:44: [<ffffffffa09e6ecd>] ll_setattr+0x5d/0xf0 [lustre]
11:12:44: [<ffffffff8119ea78>] notify_change+0x168/0x340
11:12:45: [<ffffffff811b2b1c>] utimes_common+0xdc/0x1b0
11:12:45: [<ffffffff81182bf1>] ? __fput+0x1a1/0x210
11:12:45: [<ffffffff811b2cce>] do_utimes+0xde/0xf0
11:12:46: [<ffffffff811b2de2>] sys_utimensat+0x32/0x90
11:12:46: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
11:12:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
11:12:47:Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
11:12:48:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] &&
11:12:48:		echo $(< /proc/sys/lnet/catastrophe) || echo 0);
11:12:48:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi


 Comments   
Comment by Andreas Dilger [ 28/Apr/14 ]

Sarah, when running the failover testing, how many failovers do the servers complete before they hit a problem? I'm trying to figure out whether failover works most of the time and occasionally fails, or whether it mostly fails and rarely or never works.

Comment by Sarah Liu [ 05/May/14 ]

Hello Andreas,

In this case, the OST hit the error after only one failover:

2014-02-06 11:24:44 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           384 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 1 times
Comment by Peter Jones [ 09/May/14 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 02/Jun/14 ]

Hi Hongchao, could you please take a look at this? If needed, please run this test manually a few times to see if it can handle failovers. I'm mostly interested in whether failover is working in general, and not necessarily in this particular failure unless it is causing recovery to break often.

Comment by Hongchao Zhang [ 03/Jun/14 ]

In the recent failure reports of this issue in Maloo, there are some suspicious logs related to the bug:
the clients that need to recover have connected and queued their replay requests, but the "transno" order is still broken:

Lustre: lustre-OST0004: Will be in recovery for at least 1:00, or until 3 clients reconnect
LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715089, ql: 2, comp: 1, conn: 3, next: 4295715092, last_committed: 4295715015)
LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715093, ql: 2, comp: 1, conn: 3, next: 4295715095, last_committed: 4295715015)
LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715121, ql: 2, comp: 1, conn: 3, next: 4295715130, last_committed: 4295715015)
LustreError: 9129:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0004: lvbo_init failed for resource 0x2f90d:0x0: rc = -2
LustreError: 9129:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0004: lvbo_init failed for resource 0x2f90c:0x0: rc = -2
Lustre: lustre-OST0004: Client 8710edc6-b33b-5d92-cf10-e2fffadf506e (at 10.10.4.201@tcp) reconnecting, waiting for 3 clients in recovery for 1:45
Lustre: lustre-OST0004: Client 8710edc6-b33b-5d92-cf10-e2fffadf506e (at 10.10.4.201@tcp) reconnecting, waiting for 3 clients in recovery for 1:45

This is caused by the multiple create operations (ofd_precreate_objects) in "ofd_create_hdl": each create operation starts and stops its own transaction, which increments the transno, and the final transno of the request is the one produced by the last create operation.

This issue could also be related to the replayed create request producing a different result than the original one: the replay could create fewer objects, for instance if there is less free space left for the "dd" running on the other node.
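
As a rough illustration only (a hypothetical sketch, not the actual ofd code; the function names and batch sizes below are made up), the following shows how a single request serviced by several transactions reports only the last transno, leaving gaps in the sequence seen at replay:

/*
 * Hypothetical sketch, not the real ofd_create_hdl(): one precreate RPC
 * serviced by several start/stop transactions.  Each transaction consumes
 * its own transno, but only the last one is returned in the reply, so the
 * intermediate numbers show up as gaps during replay (the "waking for gap
 * in transno" messages above).
 */
#include <stdio.h>

static unsigned long long transno;	/* per-target transaction counter */

/* each batch of objects is created in its own transaction */
static unsigned long long precreate_batch(int nobjs)
{
	++transno;	/* transno is assigned when the transaction stops */
	printf("created %d objects in transno %llu\n", nobjs, transno);
	return transno;
}

int main(void)
{
	unsigned long long last = 0;
	int left = 10;			/* objects asked for in one RPC */

	while (left > 0) {
		int batch = left > 4 ? 4 : left;

		last = precreate_batch(batch);
		left -= batch;
	}
	/* only 'last' is reported back as the request's transno */
	printf("reply carries transno %llu; the earlier transnos of this request form the gap\n",
	       last);
	return 0;
}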

Comment by Hongchao Zhang [ 04/Jun/14 ]

The debug patch is tracked at http://review.whamcloud.com/#/c/10626/; it checks the result of the replayed creation request from the OSP during recovery.

Comment by Mikhail Pershin [ 12/Jun/14 ]

Well, you are right, the precreate is a 'multi-transaction' request, like OUT batches, and during replay we will see not a continuous transaction sequence but one with gaps. Technically that is not a problem. Meanwhile, I don't see how the replay can differ from the original request, but let's see what your patch shows. Is it possible to create a test case for that?

Comment by Andreas Dilger [ 20/Jun/14 ]

Looking at this test, it seems unlikely that you will hit the same problem in just a single autotest run. I think the patch needs to be fixed up and landed on master in order to get enough runs to hit the failure.

Comment by Andreas Dilger [ 04/Jul/14 ]

Note that the "debug" patch http://review.whamcloud.com/10626 has now turned into a "fix" patch.

Comment by Jodi Levi (Inactive) [ 10/Jul/14 ]

Patch has landed to Master.

Comment by Jian Yu [ 12/Aug/14 ]

Hi Hongchao,

I'm trying to back-port the patch http://review.whamcloud.com/10626 to the Lustre b2_5 branch but found that the function ofd_create_hdl() does not exist on b2_5. The function was introduced by patch http://review.whamcloud.com/7130 on the master branch.

Could you please take a look to see how to fix the issue (originally reported in LU-3325) on the Lustre b2_5 branch?

Comment by Hongchao Zhang [ 21/Aug/14 ]

The patch against b2_5 is tracked at http://review.whamcloud.com/#/c/11541/.

Comment by Jian Yu [ 02/Sep/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)
TEST_GROUP=failover

The same failure occurred:
https://testing.hpdd.intel.com/test_sets/cd75fb30-3269-11e4-8c3a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/fd86748a-3269-11e4-bbc4-5254006e85c2

Comment by Jian Yu [ 30/Oct/14 ]

More instances on Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/6143ccd0-5ea6-11e4-badb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/567427ca-80b9-11e4-9ec8-5254006e85c2
