Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4621

recovery-mds-scale: test_failover_ost

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0
    • Lustre 2.6.0, Lustre 2.5.3
    • client and server: lustre-master build # 1877
    • 3
    • 12647

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/2a24816e-9088-11e3-91ee-52540035b04c.

      The sub-test test_failover_ost failed with the following error:

      test_failover_ost returned 1

      Client 3 console shows D process:

      11:05:02:Lustre: DEBUG MARKER: == recovery-mds-scale test failover_ost: failover OST == 11:04:36 (1391713476)
      11:05:03:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on client-32vm5
      11:05:03:Lustre: DEBUG MARKER: Started client load: dd on client-32vm5
      11:05:04:Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:
      11:05:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on client-32vm6
      11:05:05:Lustre: DEBUG MARKER: Started client load: tar on client-32vm6
      11:05:05:Lustre: DEBUG MARKER: cat /tmp/client-load.pid
      11:05:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=1200
      11:05:06:Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
      11:05:07:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] &&
      11:05:07:		echo $(< /proc/sys/lnet/catastrophe) || echo 0);
      11:05:08:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
      11:05:08:		exit $rc
      11:05:09:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_tar.sh
      11:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait ost7 recovery complete before doing next failover...
      11:05:09:Lustre: DEBUG MARKER: Wait ost7 recovery complete before doing next failover...
      11:05:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff880028c28400 x1459227849618364/t309237768099(309237768099) o101->lustre-MDT0000-mdc-ffff88007a375c00@10.10.4.198@tcp:12/10 lens 568/544 e 0 to 0 dl 1391713537 ref 2 fl Interpret:R/4/0 rc 301/301
      11:05:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 53 previous similar messages
      11:05:11:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88002fdea800 x1459227849618852/t309237768160(309237768160) o101->lustre-MDT0000-mdc-ffff88007a375c00@10.10.4.198@tcp:12/10 lens 568/544 e 0 to 0 dl 1391713537 ref 2 fl Interpret:R/4/0 rc 301/301
      11:05:11:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 13 previous similar messages
      11:05:11:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
      11:05:12:Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
      11:05:12:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      11:05:13:Lustre: lustre-MDT0000-mdc-ffff88007a375c00: Connection restored to lustre-MDT0000 (at 10.10.4.198@tcp)
      11:05:13:Lustre: DEBUG MARKER: lctl get_param -n at_max
      11:05:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:15:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:38:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:38:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:39:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:39:Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:40:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:40:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:42:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:42:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:43:Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:43:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:43:Lustre: DEBUG MARKER: osc.lustre-OST0002-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:44:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:45:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:45:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:48:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:48:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:49:Lustre: DEBUG MARKER: osc.lustre-OST0003-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:49:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:50:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:50:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:50:Lustre: DEBUG MARKER: osc.lustre-OST0004-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:52:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:52:Lustre: DEBUG MARKER: osc.lustre-OST0005-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:53:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:53:Lustre: DEBUG MARKER: osc.lustre-OST0006-osc-*.ost_server_uuid in FULL state after 0 sec
      11:05:53:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on ost7
      11:05:54:Lustre: DEBUG MARKER: Starting failover on ost7
      11:10:01:Lustre: 1984:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391713512/real 1391713512]  req@ffff880024f1fc00 x1459227849666748/t0(0) o2->lustre-OST0004-osc-ffff88007a375c00@10.10.4.199@tcp:28/4 lens 440/432 e 0 to 1 dl 1391713530 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      11:10:01:Lustre: 1984:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      11:10:02:Lustre: lustre-OST0004-osc-ffff88007a375c00: Connection to lustre-OST0004 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      11:10:02:Lustre: lustre-OST0002-osc-ffff88007a375c00: Connection to lustre-OST0002 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      11:10:02:Lustre: lustre-OST0005-osc-ffff88007a375c00: Connection to lustre-OST0005 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      11:10:02:Lustre: Skipped 1 previous similar message
      11:10:02:LNet: Host 10.10.4.199 reset our connection while we were sending data; it may have rebooted.
      11:10:02:Lustre: lustre-OST0000-osc-ffff88007a375c00: Connection to lustre-OST0000 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      11:10:02:Lustre: Skipped 2 previous similar messages
      11:10:02:Lustre: lustre-OST0000-osc-ffff88007a375c00: Connection restored to lustre-OST0000 (at 10.10.4.203@tcp)
      11:10:02:Lustre: lustre-OST0001-osc-ffff88007a375c00: Connection restored to lustre-OST0001 (at 10.10.4.203@tcp)
      11:10:03:Lustre: 1982:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391713662/real 1391713662]  req@ffff88002438c400 x1459227849671944/t0(0) o8->lustre-OST0002-osc-ffff88007a375c00@10.10.4.199@tcp:28/4 lens 400/544 e 0 to 1 dl 1391713687 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      11:10:03:Lustre: 1982:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 128 previous similar messages
      11:10:04:Lustre: lustre-OST0002-osc-ffff88007a375c00: Connection restored to lustre-OST0002 (at 10.10.4.203@tcp)
      11:10:04:INFO: task tar:24176 blocked for more than 120 seconds.
      11:10:04:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      11:10:05:tar           D 0000000000000001     0 24176  24163 0x00000080
      11:10:05: ffff880062a21a18 0000000000000086 ffff880062a219e0 ffff880062a219dc
      11:10:05: ffff88007ae95538 ffff88007f823240 ffff880002216700 0000000000000400
      11:10:05: ffff88002fcc9058 ffff880062a21fd8 000000000000fb88 ffff88002fcc9058
      11:10:05:Call Trace:
      11:10:05: [<ffffffff8150f035>] schedule_timeout+0x215/0x2e0
      11:10:05: [<ffffffffa0743d90>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
      11:10:05: [<ffffffff8150ecb3>] wait_for_common+0x123/0x180
      11:10:06: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
      11:10:06: [<ffffffff8150edcd>] wait_for_completion+0x1d/0x20
      11:10:06: [<ffffffffa0b5d88c>] osc_io_setattr_end+0xbc/0x190 [osc]
      11:10:06: [<ffffffffa093b3a0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
      11:10:07: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
      11:10:07: [<ffffffffa055dc80>] ? cl_io_start+0x0/0x140 [obdclass]
      11:10:07: [<ffffffffa093b491>] lov_io_end_wrapper+0xf1/0x100 [lov]
      11:10:07: [<ffffffffa093b07e>] lov_io_call+0x8e/0x130 [lov]
      11:10:07: [<ffffffffa093ce0c>] lov_io_end+0x4c/0xf0 [lov]
      11:10:07: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
      11:10:08: [<ffffffffa0561e52>] cl_io_loop+0xc2/0x1b0 [obdclass]
      11:10:09: [<ffffffffa0a1bb58>] cl_setattr_ost+0x218/0x2f0 [lustre]
      11:10:09: [<ffffffffa09e67f5>] ll_setattr_raw+0xa45/0x10c0 [lustre]
      11:10:09: [<ffffffffa09e6ecd>] ll_setattr+0x5d/0xf0 [lustre]
      11:10:09: [<ffffffff8119ea78>] notify_change+0x168/0x340
      11:10:10: [<ffffffff811b2b1c>] utimes_common+0xdc/0x1b0
      11:10:10: [<ffffffff81182bf1>] ? __fput+0x1a1/0x210
      11:10:10: [<ffffffff811b2cce>] do_utimes+0xde/0xf0
      11:10:10: [<ffffffff811b2de2>] sys_utimensat+0x32/0x90
      11:10:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      11:10:10:Lustre: lustre-OST0003-osc-ffff88007a375c00: Connection restored to lustre-OST0003 (at 10.10.4.203@tcp)
      11:10:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff880023a03800 x1459227849663432/t4295715092(4295715092) o2->lustre-OST0004-osc-ffff88007a375c00@10.10.4.203@tcp:28/4 lens 440/400 e 0 to 0 dl 1391713884 ref 2 fl Interpret:R/4/0 rc -2/-2
      11:10:10:LustreError: 1982:0:(client.c:2777:ptlrpc_replay_interpret()) Skipped 117 previous similar messages
      11:12:36:LustreError: 11-0: lustre-OST0004-osc-ffff88007a375c00: Communicating with 10.10.4.203@tcp, operation ldlm_enqueue failed with -12.
      11:12:36:LustreError: 11-0: lustre-OST0004-osc-ffff88007a375c00: Communicating with 10.10.4.203@tcp, operation ldlm_enqueue failed with -12.
      11:12:36:LustreError: 1982:0:(import.c:631:ptlrpc_connect_import()) already connecting
      11:12:36:LustreError: 1982:0:(import.c:631:ptlrpc_connect_import()) already connecting
      11:12:37:Lustre: lustre-OST0005-osc-ffff88007a375c00: Connection restored to lustre-OST0005 (at 10.10.4.203@tcp)
      11:12:37:INFO: task tar:24176 blocked for more than 120 seconds.
      11:12:37:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      11:12:37:tar           D 0000000000000001     0 24176  24163 0x00000080
      11:12:39: ffff880062a21a18 0000000000000086 ffff880062a219e0 ffff880062a219dc
      11:12:39: ffff88007ae95538 ffff88007f823240 ffff880002216700 0000000000000400
      11:12:40: ffff88002fcc9058 ffff880062a21fd8 000000000000fb88 ffff88002fcc9058
      11:12:40:Call Trace:
      11:12:40: [<ffffffff8150f035>] schedule_timeout+0x215/0x2e0
      11:12:40: [<ffffffffa0743d90>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
      11:12:40: [<ffffffff8150ecb3>] wait_for_common+0x123/0x180
      11:12:40: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
      11:12:41: [<ffffffff8150edcd>] wait_for_completion+0x1d/0x20
      11:12:41: [<ffffffffa0b5d88c>] osc_io_setattr_end+0xbc/0x190 [osc]
      11:12:41: [<ffffffffa093b3a0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
      11:12:42: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
      11:12:42: [<ffffffffa055dc80>] ? cl_io_start+0x0/0x140 [obdclass]
      11:12:42: [<ffffffffa093b491>] lov_io_end_wrapper+0xf1/0x100 [lov]
      11:12:42: [<ffffffffa093b07e>] lov_io_call+0x8e/0x130 [lov]
      11:12:42: [<ffffffffa093ce0c>] lov_io_end+0x4c/0xf0 [lov]
      11:12:43: [<ffffffffa055d100>] cl_io_end+0x60/0x150 [obdclass]
      11:12:44: [<ffffffffa0561e52>] cl_io_loop+0xc2/0x1b0 [obdclass]
      11:12:44: [<ffffffffa0a1bb58>] cl_setattr_ost+0x218/0x2f0 [lustre]
      11:12:44: [<ffffffffa09e67f5>] ll_setattr_raw+0xa45/0x10c0 [lustre]
      11:12:44: [<ffffffffa09e6ecd>] ll_setattr+0x5d/0xf0 [lustre]
      11:12:44: [<ffffffff8119ea78>] notify_change+0x168/0x340
      11:12:45: [<ffffffff811b2b1c>] utimes_common+0xdc/0x1b0
      11:12:45: [<ffffffff81182bf1>] ? __fput+0x1a1/0x210
      11:12:45: [<ffffffff811b2cce>] do_utimes+0xde/0xf0
      11:12:46: [<ffffffff811b2de2>] sys_utimensat+0x32/0x90
      11:12:46: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      11:12:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
      11:12:47:Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      11:12:48:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] &&
      11:12:48:		echo $(< /proc/sys/lnet/catastrophe) || echo 0);
      11:12:48:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
      

      Attachments

        Issue Links

          Activity

            People

              hongchao.zhang Hongchao Zhang
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: