Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4531

frequent evictions and timeouts on routed lnet

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.6
    • None
    • 3
    • 12392

    Description

      IU's dc2 filesystem is primarily served out via IB, but there is also a tcp connected cluster routed through 4 LNET routers (dc2xfer[01-04]).

      We've been getting a lot of client evictions recently and aren't sure the best way to troubleshoot them.

      Here's an example of one of the evictions that happened on Jan 19:
      Client [149.165.226.203@tcp] logs from Jan 19:

      Jan 19 03:33:40 c3 kernel: Lustre: dc2-OST0038-osc-ffff8860122a1000: Connection to service dc2-OST0038 via nid 10.10.0.6@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Jan 19 03:33:40 c3 kernel: LustreError: 18821:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      Jan 19 03:33:43 c3 kernel: LustreError: 167-0: This client was evicted by dc2-OST0038; in progress operations using this service will fail.
      Jan 19 03:33:43 c3 kernel: Lustre: Server dc2-OST0038_UUID version (2.1.6.0) is much newer than client version (1.8.9)
      Jan 19 03:33:43 c3 kernel: LustreError: 40645:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0038-osc-ffff8860122a1000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
      Jan 19 03:33:43 c3 kernel: LustreError: 40645:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff887793d31200 (12440460/0/0/0) (rc: 2)
      Jan 19 03:33:43 c3 kernel: LustreError: 38780:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
      Jan 19 03:33:43 c3 kernel: Lustre: dc2-OST0038-osc-ffff8860122a1000: Connection restored to service dc2-OST0038 using nid 10.10.0.6@o2ib.
      Jan 19 03:33:43 c3 kernel: LustreError: 18821:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      Jan 19 04:12:18 c3 kernel: Lustre: dc2-OST0054-osc-ffff8860122a1000: Connection to service dc2-OST0054 via nid 10.10.0.9@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Jan 19 04:12:21 c3 kernel: LustreError: 18678:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      Jan 19 04:12:21 c3 kernel: LustreError: 167-0: This client was evicted by dc2-OST0054; in progress operations using this service will fail.
      Jan 19 04:12:21 c3 kernel: Lustre: Server dc2-OST0054_UUID version (2.1.6.0) is much newer than client version (1.8.9)
      Jan 19 04:12:21 c3 kernel: LustreError: 46325:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0054-osc-ffff8860122a1000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
      Jan 19 04:12:21 c3 kernel: LustreError: 46325:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff887fd34beb40 (12159828/0/0/0) (rc: 2)
      Jan 19 04:12:21 c3 kernel: Lustre: dc2-OST0054-osc-ffff8860122a1000: Connection restored to service dc2-OST0054 using nid 10.10.0.9@o2ib.
      Jan 19 04:12:21 c3 kernel: LustreError: 46324:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
      Jan 19 04:12:22 c3 kernel: LustreError: 18678:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      Jan 19 08:30:58 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786163853766 sent from dc2-OST0095-osc-ffff8860122a1000 to NID 10.10.0.15@o2ib 22s ago has timed out (22s prior to deadline).
      Jan 19 08:30:58 c3 kernel: Lustre: dc2-OST0095-osc-ffff8860122a1000: Connection to service dc2-OST0095 via nid 10.10.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Jan 19 08:31:00 c3 kernel: Lustre: dc2-OST0095-osc-ffff8860122a1000: Connection restored to service dc2-OST0095 using nid 10.10.0.15@o2ib.
      Jan 19 08:31:00 c3 kernel: Lustre: Server dc2-OST0095_UUID version (2.1.6.0) is much newer than client version (1.8.9)
      Jan 19 15:10:06 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984600 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 22s ago has timed out (22s prior to deadline).
      Jan 19 15:10:06 c3 kernel: Lustre: dc2-OST003f-osc-ffff8860122a1000: Connection to service dc2-OST003f via nid 10.10.0.7@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Jan 19 15:10:07 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984602 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 24s ago has timed out (22s prior to deadline).
      Jan 19 15:10:09 c3 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.7@o2ib. The ost_connect operation failed with -16
      Jan 19 15:10:09 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984604 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 26s ago has timed out (22s prior to deadline).
      Jan 19 15:10:11 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984607 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 28s ago has timed out (22s prior to deadline).
      Jan 19 15:10:15 c3 kernel: Lustre: dc2-OST003f-osc-ffff8860122a1000: Connection restored to service dc2-OST003f using nid 10.10.0.7@o2ib.
      Jan 19 15:10:15 c3 kernel: Lustre: Server dc2-OST003f_UUID version (2.1.6.0) is much newer than client version (1.8.9)
      

      Router logs

      dc2xfer01: Jan 19 02:19:47 dc2xfer01 kernel: : LustreError: 3529:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY for GET from 12345-149.165.226.207@tcp: -113
      dc2xfer01: Jan 19 02:20:46 dc2xfer01 kernel: : LustreError: 3536:0:(socklnd_cb.c:2520:ksocknal_check_peer_timeouts()) Total 8 stale ZC_REQs for peer 149.165.226.205@tcp detected; the oldest(ffff880ffee94000) timed out 3 secs ago, resid: 0, wmem: 8020736
      dc2xfer01: Jan 19 02:21:12 dc2xfer01 kernel: : LustreError: 3530:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY for GET from 12345-149.165.226.205@tcp: -113
      dc2xfer01: Jan 19 08:02:42 dc2xfer01 kernel: : LustreError: 3536:0:(socklnd_cb.c:2520:ksocknal_check_peer_timeouts()) Total 8 stale ZC_REQs for peer 149.165.226.205@tcp detected; the oldest(ffff880fcd326000) timed out 8 secs ago, resid: 0, wmem: 7883680
      

      I'll attach the OSS logs. It looks like there are some Bulk IO errors, but does that indicate an issue between the OSS and the router, or the router and the client? In the past when I've seen those errors, there hasn't been a router in between to confuse the issue.

      Attachments

        Activity

          People

            isaac Isaac Huang (Inactive)
            manish Manish Patel (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: