  1. Lustre
  2. LU-11815

MDT-MDT connection stuck and never restored

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • None
    • None
    • 2.12.0-RC3
    • 2

    Description

      While running mdtest on a DNE2 configuration (two MDSs with one MDT per MDS), the MDT-MDT connection was disconnected several times; reconnection then failed and was aborted.
      As far as I can see in the log, there are some indications of network errors between the MDSs.

      Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63
      Dec 20 07:02:18 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63  to allow for qp creation
      Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID
      Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0000: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10)
      Dec 20 07:02:19 mds13 kernel: Lustre: Skipped 29 previous similar messages
      Dec 20 07:02:19 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1545257025/real 0]  req@ffff91e9f7715700 x1620318877676672/t0(0) o1000->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 368/4320 e 0 to 1 dl 1545257036 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
      Dec 20 07:02:19 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      
      Dec 20 07:17:42 mds13 kernel: LustreError: Skipped 3 previous similar messages
      Dec 20 07:17:47 mds13 kernel: LNet: 71968:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
      Dec 20 07:18:07 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63  to allow for qp creation
      Dec 20 07:18:07 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 51 previous similar messages
      Dec 20 07:20:01 mds13 systemd[1]: Started Session 332 of user root.
      Dec 20 07:20:01 mds13 systemd[1]: Starting Session 332 of user root.
      Dec 20 07:20:40 mds13 systemd-logind[2938]: Removed session 331.
      Dec 20 07:19:40 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      Dec 20 07:19:40 mds13 kernel: Lustre: Skipped 1 previous similar message
      Dec 20 07:20:06 mds13 kernel: LNet: 71965:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.11.226@o2ib10: connected
      Dec 20 07:20:12 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
      Dec 20 07:21:52 mds13 systemd-logind[2938]: New session 333 of user root.
      

      This is a simple network configuration (just a single switch), and I didn't see any network errors between the MDSs and the clients/OSSs. Could there still be network problems between the MDSs?
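
To get a quick overview of a syslog excerpt like the one above, the kernel messages can be tallied by log level, reporting function, and peer NID. The following is only a minimal sketch: the function name `tally` and the two regular expressions are illustrative assumptions based on the message layout visible in this log, not part of any Lustre tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines in this report:
#   "<date> <host> kernel: <Level>: <pid>:<n>:(<file>:<line>:<func>()) <message>"
# The "<pid>:<n>:(...)" part is optional; some Lustre messages omit it.
KERNEL_LINE = re.compile(
    r"kernel: (?P<level>LNetError|LNet|LustreError|Lustre): "
    r"(?:\d+:\d+:\((?P<src>[\w.]+):\d+:(?P<func>\w+)\(\)\) )?"
    r"(?P<msg>.*)"
)
# An IPv4-style LNet NID such as 10.0.11.226@o2ib10
NID = re.compile(r"\d+\.\d+\.\d+\.\d+@\w+")


def tally(lines):
    """Count (level, function, peer NID) triples in syslog lines."""
    counts = Counter()
    for line in lines:
        m = KERNEL_LINE.search(line)
        if not m:
            continue  # systemd and other non-Lustre lines are skipped
        nid = NID.search(m.group("msg"))
        key = (m.group("level"), m.group("func"), nid.group(0) if nid else None)
        counts[key] += 1
    return counts


# A few lines copied from the excerpt in this report
SAMPLE = [
    "Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63",
    "Dec 20 07:17:47 mds13 kernel: LNet: 71968:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10",
    "Dec 20 07:20:01 mds13 systemd[1]: Started Session 332 of user root.",
    "Dec 20 07:20:12 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10",
]

if __name__ == "__main__":
    for (level, func, nid), n in tally(SAMPLE).most_common():
        print(f"{n:3d}  {level:<12} {func or '-':<24} {nid or '-'}")
```

Run against the full syslog on both MDSs, a tally like this would show whether the timeouts and PUT_NACK messages involve only the MDS-to-MDS peer or other NIDs as well. Note that "Skipped N previous similar messages" lines mean the raw counts understate the real number of events.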

          People

            wc-triage WC Triage
            sihara Shuichi Ihara
