[LU-3325] recovery-mds-scale test_failover_mds: tar: Cannot write: Input/output error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.6.0, Lustre 2.5.1
    • Environment:
      Lustre Branch: master
      Lustre Build: http://build.whamcloud.com/job/lustre-master/1486
      Distro/Arch: RHEL6.4/x86_64
      Test Group: failover

    Description

      After running recovery-mds-scale test_failover_mds for about 9 hours (the MDS failed over 38 times), the client load on one of the clients failed as follows:

      tar: etc/ConsoleKit/seats.d/00-primary.seat: Cannot write: Input/output error
      tar: etc/ConsoleKit/seats.d/00-primary.seat: Cannot utime: Cannot send after transport endpoint shutdown
      tar: Exiting with failure status due to previous errors
      
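      For context, the failing client load is a tar loop of roughly the following shape (a minimal sketch of a run_tar.sh-style load from the test framework; the directory name and loop body are illustrative, not the real script):

      #!/bin/bash
      # Illustrative sketch only -- not the actual lustre/tests/run_tar.sh;
      # TESTDIR and the loop body are simplified for illustration.
      TESTDIR=${TESTDIR:-/mnt/lustre/d0.tar-$(hostname)}
      mkdir -p "$TESTDIR" && cd "$TESTDIR" || exit 1

      while true; do
          # Stream /etc through tar onto the Lustre mount. If the client is
          # evicted during a failover, the extracting tar fails with
          # EIO/ENOTCONN -- the "Cannot write: Input/output error" seen above.
          tar cf - /etc 2>/dev/null | tar xf -
          rc=${PIPESTATUS[1]}
          if [ "$rc" -ne 0 ]; then
              echo "$(date): tar exited with rc=$rc"
              exit "$rc"
          fi
      done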

      Console log on the client (client-32vm6) showed the following:

      17:52:45:Lustre: DEBUG MARKER: mds1 has failed over 38 times, and counting...
      17:52:46:Lustre: Evicted from MGS (at 10.10.4.198@tcp) after server handle changed from 0x9510c59bdf7b69c8 to 0xa78bdb1c4b1a185f
      17:52:46:Lustre: MGC10.10.4.198@tcp: Connection restored to MGS (at 10.10.4.198@tcp)
      18:00:41:Lustre: lustre-OST0005-osc-ffff880037c72c00: Connection to lustre-OST0005 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      18:00:41:LustreError: 11-0: lustre-OST0004-osc-ffff880037c72c00: Communicating with 10.10.4.199@tcp, operation ost_write failed with -107.
      18:00:41:LustreError: Skipped 3 previous similar messages
      18:00:41:LustreError: 11-0: lustre-OST0000-osc-ffff880037c72c00: Communicating with 10.10.4.199@tcp, operation ost_write failed with -107.
      18:00:41:LustreError: Skipped 1 previous similar message
      18:00:41:Lustre: lustre-OST0000-osc-ffff880037c72c00: Connection to lustre-OST0000 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      18:00:41:Lustre: Skipped 1 previous similar message
      18:00:41:Lustre: lustre-OST0006-osc-ffff880037c72c00: Connection to lustre-OST0006 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      18:00:41:Lustre: 7109:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1368234025/real 1368234025]  req@ffff88006dbdbc00 x1434661777629520/t0(0) o103->lustre-OST0005-osc-ffff880037c72c00@10.10.4.199@tcp:17/18 lens 328/224 e 0 to 1 dl 1368234036 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      18:00:41:Lustre: 7109:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      18:00:41:LustreError: 11-0: lustre-OST0001-osc-ffff880037c72c00: Communicating with 10.10.4.199@tcp, operation ost_write failed with -107.
      18:00:41:LustreError: Skipped 3 previous similar messages
      18:00:41:LustreError: 167-0: lustre-OST0005-osc-ffff880037c72c00: This client was evicted by lustre-OST0005; in progress operations using this service will fail.
      18:00:41:LustreError: Skipped 1 previous similar message
      18:00:41:Lustre: 7109:0:(llite_lib.c:2503:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.10.4.198@tcp:10.10.4.202@tcp:/lustre/fid: [0x200000401:0x1ce3c:0x0]/ may get corrupted (rc -108)
      18:00:41:Lustre: 7109:0:(llite_lib.c:2503:ll_dirty_page_discard_warn()) Skipped 3 previous similar messages
      18:00:41:Lustre: lustre-OST0001-osc-ffff880037c72c00: Connection restored to lustre-OST0001 (at 10.10.4.199@tcp)
      

      Console log on the OSS (client-32vm4) showed the following:

      17:52:50:Lustre: DEBUG MARKER: mds1 has failed over 38 times, and counting...
      17:53:12:LustreError: 167-0: lustre-MDT0000-lwp-OST0000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      17:53:12:LustreError: Skipped 3 previous similar messages
      17:53:12:Lustre: lustre-MDT0000-lwp-OST0002: Connection restored to lustre-MDT0000 (at 10.10.4.198@tcp)
      18:00:36:Lustre: 5788:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1368234022/real 1368234024]  req@ffff880068743800 x1434661750825740/t0(0) o104->lustre-OST0005@10.10.4.201@tcp:15/16 lens 296/224 e 0 to 1 dl 1368234033 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      18:00:36:Lustre: 5788:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 105 previous similar messages
      18:00:36:LustreError: 138-a: lustre-OST0005: A client on nid 10.10.4.201@tcp was evicted due to a lock blocking callback time out: rc -107
      18:00:36:LustreError: Skipped 1 previous similar message
      18:00:36:LustreError: 9765:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 10.10.4.201@tcp) returned 0 from blocking AST ns: filter-lustre-OST0005_UUID lock: ffff880069496c00/0x15300103a13ef60b lrc: 4/0,0 mode: PR/PR res: 13012/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a8246 expref: 156 pid: 9910 timeout: 4329187148 lvb_type: 1
      18:00:36:LustreError: 9765:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) Skipped 17 previous similar messages
      18:00:36:LustreError: 9867:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ Eviction on bulk GET  req@ffff880048f78c00 x1434661777625888/t0(0) o4->c3d0e929-541a-b945-bf7d-9ade70c3a652@10.10.4.201@tcp:0/0 lens 488/448 e 1 to 0 dl 1368234065 ref 1 fl Interpret:/0/0 rc 0/0
      18:00:36:Lustre: lustre-OST0005: Bulk IO write error with c3d0e929-541a-b945-bf7d-9ade70c3a652 (at 10.10.4.201@tcp), client will retry: rc -107
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 10.10.4.201@tcp arrived at 1368234033 with bad export cookie 1526721388779448426
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) Skipped 3 previous similar messages
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 10.10.4.201@tcp arrived at 1368234034 with bad export cookie 1526721388779448426
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) Skipped 1 previous similar message
      18:00:36:LustreError: 9427:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ Eviction on bulk GET  req@ffff880048f78400 x1434661777625880/t0(0) o4->c3d0e929-541a-b945-bf7d-9ade70c3a652@10.10.4.201@tcp:0/0 lens 488/448 e 1 to 0 dl 1368234065 ref 1 fl Interpret:/0/0 rc 0/0
      18:00:36:Lustre: lustre-OST0004: Bulk IO write error with c3d0e929-541a-b945-bf7d-9ade70c3a652 (at 10.10.4.201@tcp), client will retry: rc -107
      18:00:36:LustreError: 138-a: lustre-OST0000: A client on nid 10.10.4.201@tcp was evicted due to a lock blocking callback time out: rc -107
      18:00:36:LustreError: Skipped 1 previous similar message
      18:00:36:LustreError: 5063:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 10.10.4.201@tcp) returned 0 from blocking AST ns: filter-lustre-OST0000_UUID lock: ffff880069496800/0x15300103a13ef619 lrc: 4/0,0 mode: PR/PR res: 13204/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a8270 expref: 178 pid: 9910 timeout: 4329190079 lvb_type: 1
      18:00:36:LustreError: 5063:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) Skipped 13 previous similar messages
      18:00:36:LustreError: 9425:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ Eviction on bulk GET  req@ffff88002d861800 x1434661777625848/t0(0) o4->c3d0e929-541a-b945-bf7d-9ade70c3a652@10.10.4.201@tcp:0/0 lens 488/448 e 1 to 0 dl 1368234065 ref 1 fl Interpret:/0/0 rc 0/0
      18:00:36:Lustre: lustre-OST0000: Bulk IO write error with c3d0e929-541a-b945-bf7d-9ade70c3a652 (at 10.10.4.201@tcp), client will retry: rc -107
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 10.10.4.201@tcp arrived at 1368234035 with bad export cookie 1526721388779166123
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) Skipped 1 previous similar message
      18:00:36:LustreError: 138-a: lustre-OST0006: A client on nid 10.10.4.201@tcp was evicted due to a lock blocking callback time out: rc -107
      18:00:36:LustreError: Skipped 1 previous similar message
      18:00:36:LustreError: 9930:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 10.10.4.201@tcp) returned 0 from blocking AST ns: filter-lustre-OST0006_UUID lock: ffff88005eee0800/0x15300103a13ef5e1 lrc: 4/0,0 mode: PR/PR res: 12947/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a81c8 expref: 168 pid: 9910 timeout: 4329191220 lvb_type: 1
      18:00:36:LustreError: 9930:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) Skipped 13 previous similar messages
      18:00:36:LustreError: 7431:0:(ldlm_lock.c:2433:ldlm_lock_dump_handle()) ### ### ns: filter-lustre-OST0006_UUID lock: ffff880069496a00/0x15300103a13ef612 lrc: 4/0,0 mode: PR/PR res: 12948/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a825b expref: 161 pid: 9910 timeout: 4329188939 lvb_type: 1
      18:00:37:LustreError: 9429:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ Eviction on bulk GET  req@ffff880058037800 x1434661777628724/t0(0) o4->c3d0e929-541a-b945-bf7d-9ade70c3a652@10.10.4.201@tcp:0/0 lens 488/448 e 0 to 0 dl 1368234045 ref 1 fl Interpret:/0/0 rc 0/0
      18:00:37:LustreError: 9429:0:(ldlm_lib.c:2706:target_bulk_io()) Skipped 3 previous similar messages
      18:00:37:Lustre: lustre-OST0006: Bulk IO write error with c3d0e929-541a-b945-bf7d-9ade70c3a652 (at 10.10.4.201@tcp), client will retry: rc -107
      18:00:37:Lustre: Skipped 3 previous similar messages
      18:00:37:LustreError: 7431:0:(ldlm_lock.c:2433:ldlm_lock_dump_handle()) ### ### ns: filter-lustre-OST0002_UUID lock: ffff88005eee0200/0x15300103a13ef5f6 lrc: 4/0,0 mode: PR/PR res: 13043/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a8207 expref: 164 pid: 9910 timeout: 4329189806 lvb_type: 1
      18:00:38:LustreError: 5064:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 10.10.4.201@tcp) returned 0 from blocking AST ns: filter-lustre-OST0001_UUID lock: ffff88005eee0400/0x15300103a13ef5ef lrc: 4/0,0 mode: PR/PR res: 13011/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a81f2 expref: 188 pid: 9942 timeout: 4329191830 lvb_type: 1
      18:00:38:LustreError: 5064:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) Skipped 13 previous similar messages
      18:00:38:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 10.10.4.201@tcp arrived at 1368234037 with bad export cookie 1526721388779166137
      18:00:38:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) Skipped 6 previous similar messages
      

      Maloo report: https://maloo.whamcloud.com/test_sets/053120d2-bb19-11e2-8824-52540035b04c


          Activity

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/40/ (2.5.1 RC2)
            Distro/Arch: RHEL6.5/x86_64
            Test Group: failover

            The same failure occurred:
            https://maloo.whamcloud.com/test_sets/4815f934-a71f-11e3-aad0-52540035b04c

            yujian Jian Yu added a comment - edited

            More instances on the Lustre b2_5 branch:
            https://maloo.whamcloud.com/test_sets/c06adfbc-8f32-11e3-b8e1-52540035b04c
            https://maloo.whamcloud.com/test_sets/7d4fc910-956b-11e3-936f-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/13/
            Distro/Arch: RHEL6.4/x86_64

            The issue occurred on recovery-mds-scale test_failover_mds:
            https://maloo.whamcloud.com/test_sets/b5aaac10-7f3d-11e3-94f3-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/
            Distro/Arch: RHEL6.4/x86_64

            The issue occurred on recovery-mds-scale test_failover_ost:
            https://maloo.whamcloud.com/test_sets/5592fa00-5ddf-11e3-aed2-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_4_1_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/45/
            Distro/Arch: SLES11SP2/x86_64 (client), RHEL6.4/x86_64 (server)

            The issue occurred on recovery-mds-scale test_failover_ost:
            https://maloo.whamcloud.com/test_sets/1f5265dc-1dfe-11e3-a184-52540035b04c


            hongchao.zhang Hongchao Zhang added a comment -

            The debug patch is tracked at http://review.whamcloud.com/#/c/6747/

            hongchao.zhang Hongchao Zhang added a comment -

            According to these occurrences, there is no I/O RPC during the interval between the blocking AST being sent and the eviction.
            I'll create a debug patch to collect more information to help trace the issue.
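
            For reference, the kind of information such a debug patch targets can also be approximated from userspace by widening the Lustre debug mask on the OSS around a failover window (a rough sketch using standard lctl commands, not the actual debug patch):

            # Rough sketch, assuming the standard lctl debug interface; this is
            # not the actual debug patch from the review above.
            lctl set_param debug=+dlmtrace+rpctrace   # trace ldlm ASTs and RPC traffic
            lctl set_param debug_mb=256               # enlarge the debug buffer

            # ... trigger the MDS failover and wait for the eviction ...

            lctl dk /tmp/oss-ldlm-debug.log           # dump (and clear) the kernel debug log
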
            yujian Jian Yu added a comment -

            Has the issue occurred again recently?

            Yes, it still occurs on the master branch:
            https://maloo.whamcloud.com/test_sets/22470fc8-d578-11e2-9a7f-52540035b04c
            https://maloo.whamcloud.com/test_sets/c9e1b6ce-d065-11e2-a7e3-52540035b04c
            https://maloo.whamcloud.com/test_sets/d855b29a-cbca-11e2-b831-52540035b04c
            https://maloo.whamcloud.com/test_sets/2a780870-c39c-11e2-a2df-52540035b04c

            How about running the test with a larger timeout value (currently it is 10), since the load of the test is rather heavy?

            The default value of the TIMEOUT variable in cfg/local.sh is 20. Since the issue can only be reproduced in autotest runs, the TIMEOUT value in the test configuration file used by autotest would need to be changed; however, this would affect all of the test sessions performed by autotest.
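
            For illustration, a manual reproduction attempt with a larger timeout could look roughly like this (a sketch assuming the standard test-framework ONLY= subtest selection; the configuration file that autotest actually sources may differ from cfg/local.sh):

            # Sketch only: raise the obd timeout for a manual recovery-mds-scale run;
            # the configuration file sourced by autotest may differ from cfg/local.sh.
            export TIMEOUT=60                                 # cfg/local.sh defaults this to 20
            ONLY=failover_mds bash lustre/tests/recovery-mds-scale.sh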


            hongchao.zhang Hongchao Zhang added a comment -

            This issue is similar to https://jira.hpdd.intel.com/browse/LU-1499, which is caused by a lock AST callback timing out.

            Has the issue occurred again recently? How about running the test with a larger timeout value (currently it is 10), since the load of the test is rather heavy?
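
            For reference, the timeout-related tunables discussed here can be inspected on the client and server nodes roughly as follows (illustrative; parameter availability varies between Lustre versions):

            # Illustrative check of the timeout-related tunables on a node; the
            # exact parameter set varies between Lustre versions.
            lctl get_param timeout            # obd timeout (what TIMEOUT in the test config sets)
            lctl get_param at_max at_min      # adaptive timeout bounds
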
            yujian Jian Yu added a comment -

            Lustre Tag: v2_4_0_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/12/
            Distro/Arch: RHEL6.4/x86_64

            The issue occurred on recovery-mds-scale test_failover_mds:
            https://maloo.whamcloud.com/test_sets/2e35d81c-c6c5-11e2-ae4e-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_4_0_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-master/1501/
            Distro/Arch: SLES11SP2/x86_64 (client), RHEL6.4/x86_64 (server)

            The issue occurred on recovery-mds-scale test_failover_mds:
            https://maloo.whamcloud.com/test_sets/a43ff822-bf9e-11e2-8398-52540035b04c


            People

              Assignee: Hongchao Zhang (hongchao.zhang)
              Reporter: Jian Yu (yujian)
              Votes: 0
              Watchers: 9
