Lustre / LU-11886

OSTs and MDTs become unreachable under load, 2.11


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.11.0
    • None
    • Environment: Lustre 2.11.0, ZFS 0.7.9, CentOS 7.5, Omni-Path
    • Severity: 2

    Description

      Hi everyone,
       
      In a v2.11.0 installation, there is an ongoing file transfer into Lustre running on one of the client nodes. Every now and then the process stops with I/O errors and the file system stops responding. After a while access to the FS resumes, but by then the file transfer has already aborted with an error.
       
      `lctl get_param *.*.state` shows a history of evictions of the targets:
       

      mdc.scratch-MDT0000-mdc-ffff8ecec4bf2800.state=
      current_state: FULL
      state_history:
       - [ 1546444758, CONNECTING ]
       - [ 1546444758, FULL ]
       <...>
       - [ 1546594365, DISCONN ]
       - [ 1546594388, CONNECTING ]
       - [ 1546594388, RECOVER ]
       - [ 1546594388, FULL ]
      
      mgc.MGC10.149.0.183@o2ib.state=
      current_state: FULL
      state_history:
       - [ 1546444756, CONNECTING ]
       - [ 1546444756, FULL ]
       <...>
       - [ 1546594354, DISCONN ]
       - [ 1546594363, CONNECTING ]
       - [ 1546594363, EVICTED ]
 - [ 1546594363, RECOVER ]
 - [ 1546594363, FULL ]
      
      osc.scratch-OST0000-osc-ffff8ecec4bf2800.state=
      current_state: CONNECTING
      state_history:
       - [ 1548316546, DISCONN ]
       - [ 1548316551, CONNECTING ]
       <...>
       - [ 1548316851, CONNECTING ]
       - [ 1548316921, DISCONN ]
       - [ 1548316926, CONNECTING ]
       - [ 1548316996, DISCONN ]
       - [ 1548317001, CONNECTING ]
       - [ 1548317071, DISCONN ]
       - [ 1548317076, CONNECTING ] 
      
      

      and so on for all the OSTs affected by the file transfer.
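
      A quick way to keep an eye on these states while the transfer is running is to poll them in a loop (a minimal sketch, assuming the standard lctl client tools are installed; the 10-second interval and the grep filter are arbitrary choices):

      # Poll the import state of every MDC and OSC device every 10 seconds
      # and print only the device names, the current state, and any
      # disconnect/eviction entries from the history.
      while sleep 10; do
          date
          lctl get_param 'mdc.*.state' 'osc.*.state' \
              | grep -E '\.state=|current_state|DISCONN|EVICTED'
      done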

       

      In the OSS logs, all sorts of connectivity problems are reported, both by Lustre:

      ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from 10.149.255.238@o2ib arrived at 1548314992 with bad export cookie 13579970295878280357
      ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) Skipped 919 previous similar messages
      ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 10.149.255.238@o2ib  ns: filter-scratch-OST0003_UUID lock: ffff888fa2924800/0xbc75bf08ca486964 lrc: 4/0,0 mode: PW/PW res: [0x19b6e36:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 716800->720895) flags: 0x60000400020020 nid: 10.149.255.238@o2ib remote: 0xe347df981d48a71c expref: 55885 pid: 38849 timeout: 3689597 lvb_type: 0
      ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) Skipped 6 previous similar messages
      ods01: Jan 24 08:33:15 ods01 kernel: LustreError: dumping log to /tmp/lustre-log.1548315195.186010
      ods01: Jan 24 08:44:41 ods01 kernel: Lustre: 186010:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548314995/real 1548315881]  req@ffff88975cc9d100 x1619663992901664/t0(0) o104->scratch-OST0002@10.149.255.238@o2ib:15/16 lens 296/224 e 0 to 1 dl 1548315006 ref 1 fl Rpc:eXSN/0/ffffffff rc -11/-1

       

       and LNet:

      ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
      ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Skipped 13 previous similar messages
      ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Timed out RDMA with 10.149.255.254@o2ib (52): c: 0, oc: 0, rc: 63
      ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Skipped 13 previous similar messages

      while `opareport` doesn't detect any Omni-Path errors at all.
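
      To dig a bit deeper on the server side, the binary debug dump referenced in the OSS log can be converted to text, and LNet reachability of the evicted client can be checked directly from the OSS (a minimal sketch; the NID and the dump file name are the ones that appear in the logs above):

      # Convert the binary Lustre debug dump into readable text.
      lctl debug_file /tmp/lustre-log.1548315195.186010 /tmp/lustre-log.1548315195.txt

      # Ping the client NID at the LNet level and dump the LNet counters
      # to look for drops and timeouts.
      lctl ping 10.149.255.238@o2ib
      lnetctl stats show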

       

      From the client side:

      Jan 24 05:43:03 admin kernel: LustreError: 133-1: scratch-OST0001-osc-ffff8ecec4bf2800: BAD READ CHECKSUM: from 10.149.0.185@o2ib via 0@<0:0> inode [0x200002366:0x160f1:0x0] object 0x0:4243598 extent [0-28671], client 0, server ac9ffb9, cksum_type 4
      Jan 24 05:43:03 admin kernel: LustreError: 3867:0:(osc_request.c:1681:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff8ecb19cd0000 x1621600181397872/t0(0) o3->scratch-OST0001-osc-ffff8ecec4bf2800@10.149.0.185@o2ib:6/4 lens 608/408 e 1 to 1 dl 1548304253 ref 2 fl Interpret:ReXM/0/0 rc 26860/26860

       
      and on another client:

      Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1548316966/real 1548316966]  req@ffff995a6dab4e00 x1623278997809600/t0(0) o101->scratch-OST0024-osc-ffff995226fca000@10.149.0.191@o2ib:28/4 lens 3584/400 e 0 to 1 dl 1548317011 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 2324 previous similar messages
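
      For completeness, the client-side settings related to the checksum and timeout messages above can be read out like this (a minimal sketch; it only dumps the current values, no tuning is implied):

      # Whether wire checksums are enabled and which algorithm each OSC uses.
      lctl get_param osc.*.checksums osc.*.checksum_type

      # Obd timeout and adaptive-timeout settings on the client.
      lctl get_param timeout at_min at_max at_history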
      

       

       

      Kind regards,

      Konstantin

          People

            Assignee: Peter Jones (pjones)
            Reporter: cv_eng (Inactive)
            Votes: 0
            Watchers: 2
