[LU-4621] recovery-mds-scale: test_failover_ost Created: 12/Feb/14 Updated: 04/Oct/17 Resolved: 27/Apr/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 22pl |
| Environment: |
client and server: lustre-master build # 1877 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12647 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/2a24816e-9088-11e3-91ee-52540035b04c.

The sub-test test_failover_ost failed with the following error:
Client 3 console shows D process: 11:05:02:Lustre: DEBUG MARKER: == recovery-mds-scale test failover_ost: failover OST == 11:04:36 (1391713476) 11:05:03:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on client-32vm5 11:05:03:Lustre: DEBUG MARKER: Started client load: dd on client-32vm5 11:05:04:Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin: 11:05:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on client-32vm6 11:05:05:Lustre: DEBUG MARKER: Started client load: tar on client-32vm6 11:05:05:Lustre: DEBUG MARKER: cat /tmp/client-load.pid 11:05:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200 11:05:06:Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200 11:05:07:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] && 11:05:07: echo $(< /proc/sys/lnet/catastrophe) || echo 0); 11:05:08: if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi 11:05:08: exit $rc 11:05:09:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_tar.sh 11:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost7 recovery complete before doing next failover... 11:05:09:Lustre: DEBUG MARKER: Wait ost7 recovery complete before doing next failover... 11:05:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff880028c28400 x1459227849618364/t309237768099(309237768099) o101->lustre-MDT0000-mdc-ffff88007a375c00@10.10.4.198@tcp:12/10 lens 568/544 e 0 to 0 dl 1391713537 ref 2 fl Interpret:R/4/0 rc 301/301 11:05:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 53 previous similar messages 11:05:11:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88002fdea800 x1459227849618852/t309237768160(309237768160) o101->lustre-MDT0000-mdc-ffff88007a375c00@10.10.4.198@tcp:12/10 lens 568/544 e 0 to 0 dl 1391713537 ref 2 fl Interpret:R/4/0 rc 301/301 11:05:11:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 13 previous similar messages 11:05:11:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover... 11:05:12:Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover... 
11:05:12:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u 11:05:13:Lustre: lustre-MDT0000-mdc-ffff88007a375c00: Connection restored to lustre-MDT0000 (at 10.10.4.198@tcp) 11:05:13:Lustre: DEBUG MARKER: lctl get_param -n at_max 11:05:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:15:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:38:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:38:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:39:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:40:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:40:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:42:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:42:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:43:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:43:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:43:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:44:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:45:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:45:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:48:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:48:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:49:Lustre: DEBUG MARKER: 
osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:49:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:50:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:50:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:50:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:52:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:52:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:53:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:53:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec 11:05:53:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on ost7 11:05:54:Lustre: DEBUG MARKER: Starting failover on ost7 11:10:01:Lustre: 1984:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391713512/real 1391713512] req@ffff880024f1fc00 x1459227849666748/t0(0) o2->lustre-OST0004-osc-ffff88007a375c00@10.10.4.199@tcp:28/4 lens 440/432 e 0 to 1 dl 1391713530 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 11:10:01:Lustre: 1984:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 2 previous similar messages 11:10:02:Lustre: lustre-OST0004-osc-ffff88007a375c00: Connection to lustre-OST0004 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete 11:10:02:Lustre: lustre-OST0002-osc-ffff88007a375c00: Connection to lustre-OST0002 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete 11:10:02:Lustre: lustre-OST0005-osc-ffff88007a375c00: Connection to lustre-OST0005 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete 11:10:02:Lustre: Skipped 1 previous similar message 11:10:02:LNet: Host 10.10.4.199 reset our connection while we were sending data; it may have rebooted. 
11:10:02:Lustre: lustre-OST0000-osc-ffff88007a375c00: Connection to lustre-OST0000 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete 11:10:02:Lustre: Skipped 2 previous similar messages 11:10:02:Lustre: lustre-OST0000-osc-ffff88007a375c00: Connection restored to lustre-OST0000 (at 10.10.4.203@tcp) 11:10:02:Lustre: lustre-OST0001-osc-ffff88007a375c00: Connection restored to lustre-OST0001 (at 10.10.4.203@tcp) 11:10:03:Lustre: 1982:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391713662/real 1391713662] req@ffff88002438c400 x1459227849671944/t0(0) o8->lustre-OST0002-osc-ffff88007a375c00@10.10.4.199@tcp:28/4 lens 400/544 e 0 to 1 dl 1391713687 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 11:10:03:Lustre: 1982:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 128 previous similar messages 11:10:04:Lustre: lustre-OST0002-osc-ffff88007a375c00: Connection restored to lustre-OST0002 (at 10.10.4.203@tcp) 11:10:04:INFO: task tar:24176 blocked for more than 120 seconds. 11:10:04:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 11:10:05:tar D 0000000000000001 0 24176 24163 0x00000080 11:10:05: ffff880062a21a18 0000000000000086 ffff880062a219e0 ffff880062a219dc 11:10:05: ffff88007ae95538 ffff88007f823240 ffff880002216700 0000000000000400 11:10:05: ffff88002fcc9058 ffff880062a21fd8 000000000000fb88 ffff88002fcc9058 11:10:05:Call Trace: 11:10:05: [<ffffffff8150f035>] schedule_timeout+0x215/0x2e0 11:10:05: [<ffffffffa0743d90>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc] 11:10:05: [<ffffffff8150ecb3>] wait_for_common+0x123/0x180 11:10:06: [<ffffffff81063990>] ? default_wake_function+0x0/0x20 11:10:06: [<ffffffff8150edcd>] wait_for_completion+0x1d/0x20 11:10:06: [<ffffffffa0b5d88c>] osc_io_setattr_end+0xbc/0x190 [osc] 11:10:06: [<ffffffffa093b3a0>] ? lov_io_end_wrapper+0x0/0x100 [lov] 11:10:07: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass] 11:10:07: [<ffffffffa055dc80>] ? cl_io_start+0x0/0x140 [obdclass] 11:10:07: [<ffffffffa093b491>] lov_io_end_wrapper+0xf1/0x100 [lov] 11:10:07: [<ffffffffa093b07e>] lov_io_call+0x8e/0x130 [lov] 11:10:07: [<ffffffffa093ce0c>] lov_io_end+0x4c/0xf0 [lov] 11:10:07: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass] 11:10:08: [<ffffffffa0561e52>] cl_io_loop+0xc2/0x1b0 [obdclass] 11:10:09: [<ffffffffa0a1bb58>] cl_setattr_ost+0x218/0x2f0 [lustre] 11:10:09: [<ffffffffa09e67f5>] ll_setattr_raw+0xa45/0x10c0 [lustre] 11:10:09: [<ffffffffa09e6ecd>] ll_setattr+0x5d/0xf0 [lustre] 11:10:09: [<ffffffff8119ea78>] notify_change+0x168/0x340 11:10:10: [<ffffffff811b2b1c>] utimes_common+0xdc/0x1b0 11:10:10: [<ffffffff81182bf1>] ? __fput+0x1a1/0x210 11:10:10: [<ffffffff811b2cce>] do_utimes+0xde/0xf0 11:10:10: [<ffffffff811b2de2>] sys_utimensat+0x32/0x90 11:10:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b 11:10:10:Lustre: lustre-OST0003-osc-ffff88007a375c00: Connection restored to lustre-OST0003 (at 10.10.4.203@tcp) 11:10:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff880023a03800 x1459227849663432/t4295715092(4295715092) o2->lustre-OST0004-osc-ffff88007a375c00@10.10.4.203@tcp:28/4 lens 440/400 e 0 to 0 dl 1391713884 ref 2 fl Interpret:R/4/0 rc -2/-2 11:10:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 117 previous similar messages 11:12:36:LustreError: 11-0: lustre-OST0004-osc-ffff88007a375c00: Communicating with 10.10.4.203@tcp, operation ldlm_enqueue failed with -12. 
11:12:36:LustreError: 11-0: lustre-OST0004-osc-ffff88007a375c00: Communicating with 10.10.4.203@tcp, operation ldlm_enqueue failed with -12. 11:12:36:LustreError: 1982:0:(import.c:631:ptlrpc_connect_import()) already connecting 11:12:36:LustreError: 1982:0:(import.c:631:ptlrpc_connect_import()) already connecting 11:12:37:Lustre: lustre-OST0005-osc-ffff88007a375c00: Connection restored to lustre-OST0005 (at 10.10.4.203@tcp) 11:12:37:INFO: task tar:24176 blocked for more than 120 seconds. 11:12:37:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 11:12:37:tar D 0000000000000001 0 24176 24163 0x00000080 11:12:39: ffff880062a21a18 0000000000000086 ffff880062a219e0 ffff880062a219dc 11:12:39: ffff88007ae95538 ffff88007f823240 ffff880002216700 0000000000000400 11:12:40: ffff88002fcc9058 ffff880062a21fd8 000000000000fb88 ffff88002fcc9058 11:12:40:Call Trace: 11:12:40: [<ffffffff8150f035>] schedule_timeout+0x215/0x2e0 11:12:40: [<ffffffffa0743d90>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc] 11:12:40: [<ffffffff8150ecb3>] wait_for_common+0x123/0x180 11:12:40: [<ffffffff81063990>] ? default_wake_function+0x0/0x20 11:12:41: [<ffffffff8150edcd>] wait_for_completion+0x1d/0x20 11:12:41: [<ffffffffa0b5d88c>] osc_io_setattr_end+0xbc/0x190 [osc] 11:12:41: [<ffffffffa093b3a0>] ? lov_io_end_wrapper+0x0/0x100 [lov] 11:12:42: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass] 11:12:42: [<ffffffffa055dc80>] ? cl_io_start+0x0/0x140 [obdclass] 11:12:42: [<ffffffffa093b491>] lov_io_end_wrapper+0xf1/0x100 [lov] 11:12:42: [<ffffffffa093b07e>] lov_io_call+0x8e/0x130 [lov] 11:12:42: [<ffffffffa093ce0c>] lov_io_end+0x4c/0xf0 [lov] 11:12:43: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass] 11:12:44: [<ffffffffa0561e52>] cl_io_loop+0xc2/0x1b0 [obdclass] 11:12:44: [<ffffffffa0a1bb58>] cl_setattr_ost+0x218/0x2f0 [lustre] 11:12:44: [<ffffffffa09e67f5>] ll_setattr_raw+0xa45/0x10c0 [lustre] 11:12:44: [<ffffffffa09e6ecd>] ll_setattr+0x5d/0xf0 [lustre] 11:12:44: [<ffffffff8119ea78>] notify_change+0x168/0x340 11:12:45: [<ffffffff811b2b1c>] utimes_common+0xdc/0x1b0 11:12:45: [<ffffffff81182bf1>] ? __fput+0x1a1/0x210 11:12:45: [<ffffffff811b2cce>] do_utimes+0xde/0xf0 11:12:46: [<ffffffff811b2de2>] sys_utimensat+0x32/0x90 11:12:46: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b 11:12:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK 11:12:47:Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK 11:12:48:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] && 11:12:48: echo $(< /proc/sys/lnet/catastrophe) || echo 0); 11:12:48: if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi |
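For reference, the client health check that the harness runs repeatedly (visible in the DEBUG MARKER lines above) boils down to the shell snippet below. This is a sketch reconstructed from the console log, not taken from the test scripts themselves; it simply reports a non-zero LNet "catastrophe" flag on the client:

```sh
#!/bin/bash
# Report the LNet catastrophe flag if it is set (reconstructed from the log above).
rc=$([ -f /proc/sys/lnet/catastrophe ] && echo $(< /proc/sys/lnet/catastrophe) || echo 0)
if [ $rc -ne 0 ]; then
    echo $(hostname): $rc
fi
exit $rc
```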
| Comments |
| Comment by Andreas Dilger [ 28/Apr/14 ] |
|
Sarah, when running the failover testing, how many times can the servers properly fail over before they hit a problem? I'm trying to figure out whether failover works most of the time and occasionally fails, or whether it mostly fails and rarely or never works. |
| Comment by Sarah Liu [ 05/May/14 ] |
|
Hello Andreas, in this case the OST hit the error after only 1 failover:

2014-02-06 11:24:44 Terminating clients loads ...
Duration: 86400
Server failover period: 1200 seconds
Exited after: 384 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 1 times |
| Comment by Peter Jones [ 09/May/14 ] |
|
Hongchao, could you please look into this one? Thanks, Peter |
| Comment by Andreas Dilger [ 02/Jun/14 ] |
|
Hi Hongchao, could you please take a look at this? If needed, please run this test manually a few times to see if it can handle failovers. I'm mostly interested in whether failover is working in general, not necessarily in this particular failure, unless it is causing recovery to break often. |
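A minimal sketch of a manual run of just the failover_ost sub-test is shown below. The config name (NAME=ncli) and the shortened DURATION/SERVER_FAILOVER_PERIOD values are assumptions for illustration only; the console log above shows the autotest run used DURATION=86400 and PERIOD=1200, and the cluster must have failover partners configured for the OSTs:

```sh
# A sketch only, assuming a standard lustre/tests multi-node setup with
# OST failover pairs defined; adjust NAME and the timings for the local cluster.
cd /usr/lib64/lustre/tests
NAME=ncli DURATION=3600 SERVER_FAILOVER_PERIOD=600 \
    ONLY=failover_ost bash recovery-mds-scale.sh
```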
| Comment by Hongchao Zhang [ 03/Jun/14 ] |
|
In the recent failed reports of this issue in Maloo, there are some suspect logs related to the bug:

Lustre: lustre-OST0004: Will be in recovery for at least 1:00, or until 3 clients reconnect
LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715089, ql: 2, comp: 1, conn: 3, next: 4295715092, last_committed: 4295715015)
LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715093, ql: 2, comp: 1, conn: 3, next: 4295715095, last_committed: 4295715015)
LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715121, ql: 2, comp: 1, conn: 3, next: 4295715130, last_committed: 4295715015)
LustreError: 9129:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0004: lvbo_init failed for resource 0x2f90d:0x0: rc = -2
LustreError: 9129:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0004: lvbo_init failed for resource 0x2f90c:0x0: rc = -2
Lustre: lustre-OST0004: Client 8710edc6-b33b-5d92-cf10-e2fffadf506e (at 10.10.4.201@tcp) reconnecting, waiting for 3 clients in recovery for 1:45
Lustre: lustre-OST0004: Client 8710edc6-b33b-5d92-cf10-e2fffadf506e (at 10.10.4.201@tcp) reconnecting, waiting for 3 clients in recovery for 1:45

The gaps in the transno sequence are caused by the multiple create operations (ofd_precreate_objects) inside "ofd_create_hdl": each create operation starts/stops its own transaction, and each transaction increases the transno. The issue could be related to the replayed create request producing a different result than the original one; for instance, the replay could create fewer objects because there is less free space left by the "dd" running on the other node. |
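A sketch of commands that could be used to watch the recovery state described above while reproducing; the OST index is just an example taken from the log, and the parameter names follow the usual lctl procfs layout:

```sh
# On the failed-over OSS: recovery status, including the last committed transno.
lctl get_param obdfilter.lustre-OST0004.recovery_status

# On a client: the import state the test polls (as in the DEBUG MARKER lines above).
lctl get_param osc.lustre-OST0004-osc-*.ost_server_uuid

# The "waking for gap in transno" and lvbo_init messages land in the OSS kernel log.
dmesg | grep -E 'gap in transno|lvbo_init'
```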
| Comment by Hongchao Zhang [ 04/Jun/14 ] |
|
The debug patch is tracked at http://review.whamcloud.com/#/c/10626/; it checks the result of the replayed creation request from the OSP during recovery. |
| Comment by Mikhail Pershin [ 12/Jun/14 ] |
|
Well, you are right: precreate is a 'multi-transaction' request, like OUT batches, so during replay we will see a transaction sequence with gaps rather than a continuous one. Technically that is not a problem. Meanwhile, I don't see how a replay can differ from the original request, but let's see what your patch shows. Is it possible to create a test case for that? |
| Comment by Andreas Dilger [ 20/Jun/14 ] |
|
Looking at this test, it seems unlikely that you will hit the same problem just in a single autotest run. I think the patch needs to be fixed up and landed to master in order to be able to get enough runs to hit the failure. |
| Comment by Andreas Dilger [ 04/Jul/14 ] |
|
Note that the "debug" patch http://review.whamcloud.com/10626 has now turned into a "fix" patch. |
| Comment by Jodi Levi (Inactive) [ 10/Jul/14 ] |
|
Patch has landed to Master. |
| Comment by Jian Yu [ 12/Aug/14 ] |
|
Hi Hongchao, I'm trying to back-port the patch http://review.whamcloud.com/10626 to the Lustre b2_5 branch, but found that the function ofd_create_hdl() does not exist on b2_5. The function was introduced on the master branch by patch http://review.whamcloud.com/7130. Could you please take a look to see how to fix the issue (originally reported in |
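A generic sketch of the back-port attempt being discussed; the Gerrit project path and patchset number below follow the usual refs/changes layout and are placeholders, not values taken from this ticket:

```sh
# Illustrative only: fetch change 10626 and try it on b2_5, expecting conflicts
# because ofd_create_hdl() does not exist there.
git checkout -b lu4621-backport origin/b2_5
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/26/10626/1
git cherry-pick FETCH_HEAD
```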
| Comment by Hongchao Zhang [ 21/Aug/14 ] |
|
The patch against b2_5 is tracked at http://review.whamcloud.com/#/c/11541/ |
| Comment by Jian Yu [ 02/Sep/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)
The same failure occurred: |
| Comment by Jian Yu [ 30/Oct/14 ] |
|
More instances on Lustre b2_5 branch: |