Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3325

recovery-mds-scale test_failover_mds: tar: Cannot write: Input/output error

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.4.0, Lustre 2.4.1, Lustre 2.6.0, Lustre 2.5.1

    • Lustre Branch: master
      Lustre Build: http://build.whamcloud.com/job/lustre-master/1486
      Distro/Arch: RHEL6.4/x86_64
      Test Group: failover
    • 3
    • 8209

    Description

      After running recovery-mds-scale test_failover_mds for about 9 hours (MDS failed over 38 times), client load on one of the clients failed as follows:

      tar: etc/ConsoleKit/seats.d/00-primary.seat: Cannot write: Input/output error
      tar: etc/ConsoleKit/seats.d/00-primary.seat: Cannot utime: Cannot send after transport endpoint shutdown
      tar: Exiting with failure status due to previous errors
      

      Console log on the client (client-32vm6) showed that:

      17:52:45:Lustre: DEBUG MARKER: mds1 has failed over 38 times, and counting...
      17:52:46:Lustre: Evicted from MGS (at 10.10.4.198@tcp) after server handle changed from 0x9510c59bdf7b69c8 to 0xa78bdb1c4b1a185f
      17:52:46:Lustre: MGC10.10.4.198@tcp: Connection restored to MGS (at 10.10.4.198@tcp)
      18:00:41:Lustre: lustre-OST0005-osc-ffff880037c72c00: Connection to lustre-OST0005 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      18:00:41:LustreError: 11-0: lustre-OST0004-osc-ffff880037c72c00: Communicating with 10.10.4.199@tcp, operation ost_write failed with -107.
      18:00:41:LustreError: Skipped 3 previous similar messages
      18:00:41:LustreError: 11-0: lustre-OST0000-osc-ffff880037c72c00: Communicating with 10.10.4.199@tcp, operation ost_write failed with -107.
      18:00:41:LustreError: Skipped 1 previous similar message
      18:00:41:Lustre: lustre-OST0000-osc-ffff880037c72c00: Connection to lustre-OST0000 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      18:00:41:Lustre: Skipped 1 previous similar message
      18:00:41:Lustre: lustre-OST0006-osc-ffff880037c72c00: Connection to lustre-OST0006 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      18:00:41:Lustre: 7109:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1368234025/real 1368234025]  req@ffff88006dbdbc00 x1434661777629520/t0(0) o103->lustre-OST0005-osc-ffff880037c72c00@10.10.4.199@tcp:17/18 lens 328/224 e 0 to 1 dl 1368234036 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      18:00:41:Lustre: 7109:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      18:00:41:LustreError: 11-0: lustre-OST0001-osc-ffff880037c72c00: Communicating with 10.10.4.199@tcp, operation ost_write failed with -107.
      18:00:41:LustreError: Skipped 3 previous similar messages
      18:00:41:LustreError: 167-0: lustre-OST0005-osc-ffff880037c72c00: This client was evicted by lustre-OST0005; in progress operations using this service will fail.
      18:00:41:LustreError: Skipped 1 previous similar message
      18:00:41:Lustre: 7109:0:(llite_lib.c:2503:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.10.4.198@tcp:10.10.4.202@tcp:/lustre/fid: [0x200000401:0x1ce3c:0x0]/ may get corrupted (rc -108)
      18:00:41:Lustre: 7109:0:(llite_lib.c:2503:ll_dirty_page_discard_warn()) Skipped 3 previous similar messages
      18:00:41:Lustre: lustre-OST0001-osc-ffff880037c72c00: Connection restored to lustre-OST0001 (at 10.10.4.199@tcp)
      

      Console log on OSS (client-32vm4) showed that:

      17:52:50:Lustre: DEBUG MARKER: mds1 has failed over 38 times, and counting...
      17:53:12:LustreError: 167-0: lustre-MDT0000-lwp-OST0000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      17:53:12:LustreError: Skipped 3 previous similar messages
      17:53:12:Lustre: lustre-MDT0000-lwp-OST0002: Connection restored to lustre-MDT0000 (at 10.10.4.198@tcp)
      18:00:36:Lustre: 5788:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1368234022/real 1368234024]  req@ffff880068743800 x1434661750825740/t0(0) o104->lustre-OST0005@10.10.4.201@tcp:15/16 lens 296/224 e 0 to 1 dl 1368234033 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      18:00:36:Lustre: 5788:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 105 previous similar messages
      18:00:36:LustreError: 138-a: lustre-OST0005: A client on nid 10.10.4.201@tcp was evicted due to a lock blocking callback time out: rc -107
      18:00:36:LustreError: Skipped 1 previous similar message
      18:00:36:LustreError: 9765:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 10.10.4.201@tcp) returned 0 from blocking AST ns: filter-lustre-OST0005_UUID lock: ffff880069496c00/0x15300103a13ef60b lrc: 4/0,0 mode: PR/PR res: 13012/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a8246 expref: 156 pid: 9910 timeout: 4329187148 lvb_type: 1
      18:00:36:LustreError: 9765:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) Skipped 17 previous similar messages
      18:00:36:LustreError: 9867:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ Eviction on bulk GET  req@ffff880048f78c00 x1434661777625888/t0(0) o4->c3d0e929-541a-b945-bf7d-9ade70c3a652@10.10.4.201@tcp:0/0 lens 488/448 e 1 to 0 dl 1368234065 ref 1 fl Interpret:/0/0 rc 0/0
      18:00:36:Lustre: lustre-OST0005: Bulk IO write error with c3d0e929-541a-b945-bf7d-9ade70c3a652 (at 10.10.4.201@tcp), client will retry: rc -107
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 10.10.4.201@tcp arrived at 1368234033 with bad export cookie 1526721388779448426
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) Skipped 3 previous similar messages
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 10.10.4.201@tcp arrived at 1368234034 with bad export cookie 1526721388779448426
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) Skipped 1 previous similar message
      18:00:36:LustreError: 9427:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ Eviction on bulk GET  req@ffff880048f78400 x1434661777625880/t0(0) o4->c3d0e929-541a-b945-bf7d-9ade70c3a652@10.10.4.201@tcp:0/0 lens 488/448 e 1 to 0 dl 1368234065 ref 1 fl Interpret:/0/0 rc 0/0
      18:00:36:Lustre: lustre-OST0004: Bulk IO write error with c3d0e929-541a-b945-bf7d-9ade70c3a652 (at 10.10.4.201@tcp), client will retry: rc -107
      18:00:36:LustreError: 138-a: lustre-OST0000: A client on nid 10.10.4.201@tcp was evicted due to a lock blocking callback time out: rc -107
      18:00:36:LustreError: Skipped 1 previous similar message
      18:00:36:LustreError: 5063:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 10.10.4.201@tcp) returned 0 from blocking AST ns: filter-lustre-OST0000_UUID lock: ffff880069496800/0x15300103a13ef619 lrc: 4/0,0 mode: PR/PR res: 13204/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a8270 expref: 178 pid: 9910 timeout: 4329190079 lvb_type: 1
      18:00:36:LustreError: 5063:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) Skipped 13 previous similar messages
      18:00:36:LustreError: 9425:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ Eviction on bulk GET  req@ffff88002d861800 x1434661777625848/t0(0) o4->c3d0e929-541a-b945-bf7d-9ade70c3a652@10.10.4.201@tcp:0/0 lens 488/448 e 1 to 0 dl 1368234065 ref 1 fl Interpret:/0/0 rc 0/0
      18:00:36:Lustre: lustre-OST0000: Bulk IO write error with c3d0e929-541a-b945-bf7d-9ade70c3a652 (at 10.10.4.201@tcp), client will retry: rc -107
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 10.10.4.201@tcp arrived at 1368234035 with bad export cookie 1526721388779166123
      18:00:36:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) Skipped 1 previous similar message
      18:00:36:LustreError: 138-a: lustre-OST0006: A client on nid 10.10.4.201@tcp was evicted due to a lock blocking callback time out: rc -107
      18:00:36:LustreError: Skipped 1 previous similar message
      18:00:36:LustreError: 9930:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 10.10.4.201@tcp) returned 0 from blocking AST ns: filter-lustre-OST0006_UUID lock: ffff88005eee0800/0x15300103a13ef5e1 lrc: 4/0,0 mode: PR/PR res: 12947/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a81c8 expref: 168 pid: 9910 timeout: 4329191220 lvb_type: 1
      18:00:36:LustreError: 9930:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) Skipped 13 previous similar messages
      18:00:36:LustreError: 7431:0:(ldlm_lock.c:2433:ldlm_lock_dump_handle()) ### ### ns: filter-lustre-OST0006_UUID lock: ffff880069496a00/0x15300103a13ef612 lrc: 4/0,0 mode: PR/PR res: 12948/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a825b expref: 161 pid: 9910 timeout: 4329188939 lvb_type: 1
      18:00:37:LustreError: 9429:0:(ldlm_lib.c:2706:target_bulk_io()) @@@ Eviction on bulk GET  req@ffff880058037800 x1434661777628724/t0(0) o4->c3d0e929-541a-b945-bf7d-9ade70c3a652@10.10.4.201@tcp:0/0 lens 488/448 e 0 to 0 dl 1368234045 ref 1 fl Interpret:/0/0 rc 0/0
      18:00:37:LustreError: 9429:0:(ldlm_lib.c:2706:target_bulk_io()) Skipped 3 previous similar messages
      18:00:37:Lustre: lustre-OST0006: Bulk IO write error with c3d0e929-541a-b945-bf7d-9ade70c3a652 (at 10.10.4.201@tcp), client will retry: rc -107
      18:00:37:Lustre: Skipped 3 previous similar messages
      18:00:37:LustreError: 7431:0:(ldlm_lock.c:2433:ldlm_lock_dump_handle()) ### ### ns: filter-lustre-OST0002_UUID lock: ffff88005eee0200/0x15300103a13ef5f6 lrc: 4/0,0 mode: PR/PR res: 13043/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a8207 expref: 164 pid: 9910 timeout: 4329189806 lvb_type: 1
      18:00:38:LustreError: 5064:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 10.10.4.201@tcp) returned 0 from blocking AST ns: filter-lustre-OST0001_UUID lock: ffff88005eee0400/0x15300103a13ef5ef lrc: 4/0,0 mode: PR/PR res: 13011/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 10.10.4.201@tcp remote: 0x413d78011f8a81f2 expref: 188 pid: 9942 timeout: 4329191830 lvb_type: 1
      18:00:38:LustreError: 5064:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) Skipped 13 previous similar messages
      18:00:38:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 10.10.4.201@tcp arrived at 1368234037 with bad export cookie 1526721388779166137
      18:00:38:LustreError: 7431:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) Skipped 6 previous similar messages
      

      Maloo report: https://maloo.whamcloud.com/test_sets/053120d2-bb19-11e2-8824-52540035b04c

      Attachments

        Issue Links

          Activity

            People

              hongchao.zhang Hongchao Zhang
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: