LU-4621: recovery-mds-scale: test_failover_ost

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.6.0
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Environment: client and server: lustre-master build # 1877
    • Severity: 3
    • Rank (Obsolete): 12647

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/2a24816e-9088-11e3-91ee-52540035b04c.

      The sub-test test_failover_ost failed with the following error:

      test_failover_ost returned 1

      The client 3 console log shows a process stuck in the D (uninterruptible sleep) state:

      11:05:02:Lustre: DEBUG MARKER: == recovery-mds-scale test failover_ost: failover OST == 11:04:36 (1391713476)
      11:05:03:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on client-32vm5
      11:05:03:Lustre: DEBUG MARKER: Started client load: dd on client-32vm5
      11:05:04:Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:
      11:05:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on client-32vm6
      11:05:05:Lustre: DEBUG MARKER: Started client load: tar on client-32vm6
      11:05:05:Lustre: DEBUG MARKER: cat /tmp/client-load.pid
      11:05:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=1200
      11:05:06:Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
      11:05:07:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] &&
      11:05:07:		echo $(< /proc/sys/lnet/catastrophe) || echo 0);
      11:05:08:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
      11:05:08:		exit $rc
      11:05:09:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_tar.sh
      11:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost7 recovery complete before doing next failover...
      11:05:09:Lustre: DEBUG MARKER: Wait ost7 recovery complete before doing next failover...
      11:05:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff880028c28400 x1459227849618364/t309237768099(309237768099) o101->lustre-MDT0000-mdc-ffff88007a375c00@10.10.4.198@tcp:12/10 lens 568/544 e 0 to 0 dl 1391713537 ref 2 fl Interpret:R/4/0 rc 301/301
      11:05:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 53 previous similar messages
      11:05:11:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88002fdea800 x1459227849618852/t309237768160(309237768160) o101->lustre-MDT0000-mdc-ffff88007a375c00@10.10.4.198@tcp:12/10 lens 568/544 e 0 to 0 dl 1391713537 ref 2 fl Interpret:R/4/0 rc 301/301
      11:05:11:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 13 previous similar messages
      11:05:11:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
      11:05:12:Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
      11:05:12:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      11:05:13:Lustre: lustre-MDT0000-mdc-ffff88007a375c00: Connection restored to lustre-MDT0000 (at 10.10.4.198@tcp)
      11:05:13:Lustre: DEBUG MARKER: lctl get_param -n at_max
      11:05:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:15:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:38:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:38:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:39:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:40:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:40:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:42:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:42:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:43:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:43:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:43:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:44:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:45:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:45:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:48:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:48:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:49:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:50:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:50:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:50:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:52:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:52:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:53:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:53:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:53:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on ost7
      11:05:54:Lustre: DEBUG MARKER: Starting failover on ost7
      11:10:01:Lustre: 1984:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391713512/real 1391713512]  req@ffff880024f1fc00 x1459227849666748/t0(0) o2->lustre-OST0004-osc-ffff88007a375c00@10.10.4.199@tcp:28/4 lens 440/432 e 0 to 1 dl 1391713530 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      11:10:01:Lustre: 1984:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      11:10:02:Lustre: lustre-OST0004-osc-ffff88007a375c00: Connection to lustre-OST0004 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      11:10:02:Lustre: lustre-OST0002-osc-ffff88007a375c00: Connection to lustre-OST0002 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      11:10:02:Lustre: lustre-OST0005-osc-ffff88007a375c00: Connection to lustre-OST0005 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      11:10:02:Lustre: Skipped 1 previous similar message
      11:10:02:LNet: Host 10.10.4.199 reset our connection while we were sending data; it may have rebooted.
      11:10:02:Lustre: lustre-OST0000-osc-ffff88007a375c00: Connection to lustre-OST0000 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      11:10:02:Lustre: Skipped 2 previous similar messages
      11:10:02:Lustre: lustre-OST0000-osc-ffff88007a375c00: Connection restored to lustre-OST0000 (at 10.10.4.203@tcp)
      11:10:02:Lustre: lustre-OST0001-osc-ffff88007a375c00: Connection restored to lustre-OST0001 (at 10.10.4.203@tcp)
      11:10:03:Lustre: 1982:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391713662/real 1391713662]  req@ffff88002438c400 x1459227849671944/t0(0) o8->lustre-OST0002-osc-ffff88007a375c00@10.10.4.199@tcp:28/4 lens 400/544 e 0 to 1 dl 1391713687 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      11:10:03:Lustre: 1982:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 128 previous similar messages
      11:10:04:Lustre: lustre-OST0002-osc-ffff88007a375c00: Connection restored to lustre-OST0002 (at 10.10.4.203@tcp)
      11:10:04:INFO: task tar:24176 blocked for more than 120 seconds.
      11:10:04:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      11:10:05:tar           D 0000000000000001     0 24176  24163 0x00000080
      11:10:05: ffff880062a21a18 0000000000000086 ffff880062a219e0 ffff880062a219dc
      11:10:05: ffff88007ae95538 ffff88007f823240 ffff880002216700 0000000000000400
      11:10:05: ffff88002fcc9058 ffff880062a21fd8 000000000000fb88 ffff88002fcc9058
      11:10:05:Call Trace:
      11:10:05: [<ffffffff8150f035>] schedule_timeout+0x215/0x2e0
      11:10:05: [<ffffffffa0743d90>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
      11:10:05: [<ffffffff8150ecb3>] wait_for_common+0x123/0x180
      11:10:06: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
      11:10:06: [<ffffffff8150edcd>] wait_for_completion+0x1d/0x20
      11:10:06: [<ffffffffa0b5d88c>] osc_io_setattr_end+0xbc/0x190 [osc]
      11:10:06: [<ffffffffa093b3a0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
      11:10:07: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
      11:10:07: [<ffffffffa055dc80>] ? cl_io_start+0x0/0x140 [obdclass]
      11:10:07: [<ffffffffa093b491>] lov_io_end_wrapper+0xf1/0x100 [lov]
      11:10:07: [<ffffffffa093b07e>] lov_io_call+0x8e/0x130 [lov]
      11:10:07: [<ffffffffa093ce0c>] lov_io_end+0x4c/0xf0 [lov]
      11:10:07: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
      11:10:08: [<ffffffffa0561e52>] cl_io_loop+0xc2/0x1b0 [obdclass]
      11:10:09: [<ffffffffa0a1bb58>] cl_setattr_ost+0x218/0x2f0 [lustre]
      11:10:09: [<ffffffffa09e67f5>] ll_setattr_raw+0xa45/0x10c0 [lustre]
      11:10:09: [<ffffffffa09e6ecd>] ll_setattr+0x5d/0xf0 [lustre]
      11:10:09: [<ffffffff8119ea78>] notify_change+0x168/0x340
      11:10:10: [<ffffffff811b2b1c>] utimes_common+0xdc/0x1b0
      11:10:10: [<ffffffff81182bf1>] ? __fput+0x1a1/0x210
      11:10:10: [<ffffffff811b2cce>] do_utimes+0xde/0xf0
      11:10:10: [<ffffffff811b2de2>] sys_utimensat+0x32/0x90
      11:10:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      11:10:10:Lustre: lustre-OST0003-osc-ffff88007a375c00: Connection restored to lustre-OST0003 (at 10.10.4.203@tcp)
      11:10:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff880023a03800 x1459227849663432/t4295715092(4295715092) o2->lustre-OST0004-osc-ffff88007a375c00@10.10.4.203@tcp:28/4 lens 440/400 e 0 to 0 dl 1391713884 ref 2 fl Interpret:R/4/0 rc -2/-2
      11:10:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 117 previous similar messages
      11:12:36:LustreError: 11-0: lustre-OST0004-osc-ffff88007a375c00: Communicating with 10.10.4.203@tcp, operation ldlm_enqueue failed with -12.
      11:12:36:LustreError: 11-0: lustre-OST0004-osc-ffff88007a375c00: Communicating with 10.10.4.203@tcp, operation ldlm_enqueue failed with -12.
      11:12:36:LustreError: 1982:0:(import.c:631:ptlrpc_connect_import()) already connecting
      11:12:36:LustreError: 1982:0:(import.c:631:ptlrpc_connect_import()) already connecting
      11:12:37:Lustre: lustre-OST0005-osc-ffff88007a375c00: Connection restored to lustre-OST0005 (at 10.10.4.203@tcp)
      11:12:37:INFO: task tar:24176 blocked for more than 120 seconds.
      11:12:37:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      11:12:37:tar           D 0000000000000001     0 24176  24163 0x00000080
      11:12:39: ffff880062a21a18 0000000000000086 ffff880062a219e0 ffff880062a219dc
      11:12:39: ffff88007ae95538 ffff88007f823240 ffff880002216700 0000000000000400
      11:12:40: ffff88002fcc9058 ffff880062a21fd8 000000000000fb88 ffff88002fcc9058
      11:12:40:Call Trace:
      11:12:40: [<ffffffff8150f035>] schedule_timeout+0x215/0x2e0
      11:12:40: [<ffffffffa0743d90>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
      11:12:40: [<ffffffff8150ecb3>] wait_for_common+0x123/0x180
      11:12:40: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
      11:12:41: [<ffffffff8150edcd>] wait_for_completion+0x1d/0x20
      11:12:41: [<ffffffffa0b5d88c>] osc_io_setattr_end+0xbc/0x190 [osc]
      11:12:41: [<ffffffffa093b3a0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
      11:12:42: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
      11:12:42: [<ffffffffa055dc80>] ? cl_io_start+0x0/0x140 [obdclass]
      11:12:42: [<ffffffffa093b491>] lov_io_end_wrapper+0xf1/0x100 [lov]
      11:12:42: [<ffffffffa093b07e>] lov_io_call+0x8e/0x130 [lov]
      11:12:42: [<ffffffffa093ce0c>] lov_io_end+0x4c/0xf0 [lov]
      11:12:43: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
      11:12:44: [<ffffffffa0561e52>] cl_io_loop+0xc2/0x1b0 [obdclass]
      11:12:44: [<ffffffffa0a1bb58>] cl_setattr_ost+0x218/0x2f0 [lustre]
      11:12:44: [<ffffffffa09e67f5>] ll_setattr_raw+0xa45/0x10c0 [lustre]
      11:12:44: [<ffffffffa09e6ecd>] ll_setattr+0x5d/0xf0 [lustre]
      11:12:44: [<ffffffff8119ea78>] notify_change+0x168/0x340
      11:12:45: [<ffffffff811b2b1c>] utimes_common+0xdc/0x1b0
      11:12:45: [<ffffffff81182bf1>] ? __fput+0x1a1/0x210
      11:12:45: [<ffffffff811b2cce>] do_utimes+0xde/0xf0
      11:12:46: [<ffffffff811b2de2>] sys_utimensat+0x32/0x90
      11:12:46: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      11:12:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
      11:12:47:Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      11:12:48:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] &&
      11:12:48:		echo $(< /proc/sys/lnet/catastrophe) || echo 0);
      11:12:48:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
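
      For reference, a task stuck in the D state like the tar process above can be spotted from a client shell with standard Linux tools (a generic sketch, not part of the recovery-mds-scale test itself):

      ```shell
      # List tasks currently in uninterruptible sleep (process state "D"),
      # the state the hung_task watchdog flagged for tar:24176 above.
      # $2 is the STAT column; a leading "D" marks uninterruptible sleep.
      ps -eo pid,stat,comm | awk '$2 ~ /^D/'
      ```

      On a node where sysrq is enabled, `echo w > /proc/sysrq-trigger` (as root) additionally dumps the kernel stacks of all blocked tasks to dmesg, which is essentially what the hung-task watchdog printed in the console log above.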
      

      Activity

            yujian Jian Yu added a comment (edited) -

            More instances on the Lustre b2_5 branch:
            https://testing.hpdd.intel.com/test_sets/6143ccd0-5ea6-11e4-badb-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/567427ca-80b9-11e4-9ec8-5254006e85c2
            yujian Jian Yu added a comment -

            Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1), TEST_GROUP=failover. The same failure occurred:
            https://testing.hpdd.intel.com/test_sets/cd75fb30-3269-11e4-8c3a-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/fd86748a-3269-11e4-bbc4-5254006e85c2

            hongchao.zhang Hongchao Zhang added a comment - The patch against b2_5 is tracked at http://review.whamcloud.com/#/c/11541/
            yujian Jian Yu added a comment -

            Hi Hongchao,

            I'm trying to back-port the patch http://review.whamcloud.com/10626 to the Lustre b2_5 branch, but found that the function ofd_create_hdl() does not exist on b2_5. The function was introduced on the master branch by patch http://review.whamcloud.com/7130.

            Could you please take a look to see how to fix the issue (originally reported in LU-3325) on the Lustre b2_5 branch?

            jlevi Jodi Levi (Inactive) added a comment - Patch has landed to master.

            adilger Andreas Dilger added a comment - Note that the "debug" patch http://review.whamcloud.com/10626 has now turned into a "fix" patch.

            adilger Andreas Dilger added a comment - Looking at this test, it seems unlikely that you will hit the same problem in just a single autotest run. I think the patch needs to be fixed up and landed to master in order to get enough runs to hit the failure.

            tappro Mikhail Pershin added a comment - Well, you are right: a precreate is a 'multi-transaction' request, like OUT batches, so during replay we will see a transaction sequence that is not contiguous but has gaps. Technically that is not a problem. Meanwhile, I don't see how the replay can differ from the original request, but let's see what your patch shows. Would it be possible to create a test case for that?
            hongchao.zhang Hongchao Zhang added a comment (edited) - The debug patch is tracked at http://review.whamcloud.com/#/c/10626/; it will check the result of the replayed creation request from the OSP during recovery.

            hongchao.zhang Hongchao Zhang added a comment -

            In the recent failed reports of this issue in Maloo, there are some suspicious log messages related to the bug:
            the clients that needed to recover had connected and queued their replay requests, but the "transno" order was still broken.

            Lustre: lustre-OST0004: Will be in recovery for at least 1:00, or until 3 clients reconnect
            LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715089, ql: 2, comp: 1, conn: 3, next: 4295715092, last_committed: 4295715015)
            LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715093, ql: 2, comp: 1, conn: 3, next: 4295715095, last_committed: 4295715015)
            LustreError: 9129:0:(ldlm_lib.c:1744:check_for_next_transno()) lustre-OST0004: waking for gap in transno, VBR is OFF (skip: 4295715121, ql: 2, comp: 1, conn: 3, next: 4295715130, last_committed: 4295715015)
            LustreError: 9129:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0004: lvbo_init failed for resource 0x2f90d:0x0: rc = -2
            LustreError: 9129:0:(ldlm_resource.c:1154:ldlm_resource_get()) lustre-OST0004: lvbo_init failed for resource 0x2f90c:0x0: rc = -2
            Lustre: lustre-OST0004: Client 8710edc6-b33b-5d92-cf10-e2fffadf506e (at 10.10.4.201@tcp) reconnecting, waiting for 3 clients in recovery for 1:45
            Lustre: lustre-OST0004: Client 8710edc6-b33b-5d92-cf10-e2fffadf506e (at 10.10.4.201@tcp) reconnecting, waiting for 3 clients in recovery for 1:45

            This is caused by the multiple create operations (ofd_precreate_objects) in ofd_create_hdl: each create operation starts and stops its own transaction, which increases the transno, so the final transno of the request is the one produced by the last create operation.

            This issue could also be related to the replayed create request producing a different result than the original one; for instance, the replayed request could create fewer objects if there is less free space because of the "dd" load on the other node.
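
            The transno-gap mechanics described in the comment above can be illustrated with a toy model (plain Python, not Lustre code; the names precreate and find_gaps are made up for illustration): each precreate request commits several transactions, but the client only sees, and can replay, the last transno of the batch, so the recovery queue observes jumps in the replay transno sequence — the "waking for gap in transno" situation in the log above.

            ```python
            # Toy model of a multi-transaction precreate request and the
            # gaps it leaves in the replay transno sequence (illustrative
            # only; not the actual ofd_create_hdl / recovery code).

            class ToyServer:
                def __init__(self):
                    self.transno = 0

                def precreate(self, nobjs):
                    """Create each object in its own transaction; each
                    transaction bumps the transno, but only the final
                    transno is returned to (and replayed by) the client."""
                    for _ in range(nobjs):
                        self.transno += 1
                    return self.transno

            def find_gaps(replay_transnos):
                """Recovery expects consecutive transnos; any jump larger
                than 1 between queued replay requests is a gap."""
                gaps = []
                prev = None
                for t in sorted(replay_transnos):
                    if prev is not None and t != prev + 1:
                        gaps.append((prev, t))
                    prev = t
                return gaps

            srv = ToyServer()
            # Three precreate requests batching 3, 2 and 4 objects: the
            # client only ever holds transnos 3, 5 and 9 for replay.
            queued = [srv.precreate(3), srv.precreate(2), srv.precreate(4)]
            print(queued)            # [3, 5, 9]
            print(find_gaps(queued)) # [(3, 5), (5, 9)]
            ```

            In this model the gaps are harmless, which matches Mikhail's observation that a non-contiguous replay sequence is technically not a problem by itself; the suspect case is when the replayed request creates a different number of objects than the original did.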

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9