LU-12096: ldlm_run_ast_work call traces and network errors on overloaded OSS

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      Our OSS servers running 2.12.0 on Fir had been running fine until this morning. We are now seeing network errors and call traces, and all servers seem overloaded. The filesystem is still responsive. I wanted to share the following logs with you just in case you see anything wrong. This really looks like a network issue, but we spent some time investigating and didn't find any problems on our different IB fabrics; lnet does show dropped packets, though.

      Mar 21 09:44:43 fir-io1-s1 kernel: LNet: Skipped 2 previous similar messages
      Mar 21 09:44:43 fir-io1-s1 kernel: Pid: 96368, comm: ll_ost01_045 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      Mar 21 09:44:44 fir-io1-s1 kernel: Call Trace:
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dcd890>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0d8b185>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dac86b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc166b10b>] ofd_intent_policy+0x69b/0x920 [ofd]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0d8bec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0db48a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0e3b302>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0e4235a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0de692b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dea25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffff850c1c31>] kthread+0xd1/0xe0
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffff85774c24>] ret_from_fork_nospec_begin+0xe/0x21
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      Mar 21 09:44:44 fir-io1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1553186684.96368
      
      [root@fir-io1-s1 ~]# lnetctl stats show
      statistics:
          msgs_alloc: 0
          msgs_max: 16371
          rst_alloc: 283239
          errors: 0
          send_count: 2387805172
          resend_count: 0
          response_timeout_count: 203807
          local_interrupt_count: 0
          local_dropped_count: 33
          local_aborted_count: 0
          local_no_route_count: 0
          local_timeout_count: 961
          local_error_count: 13
          remote_dropped_count: 3
          remote_error_count: 0
          remote_timeout_count: 12
          network_timeout_count: 0
          recv_count: 2387455644
          route_count: 0
          drop_count: 2971
          send_length: 871207195166809
          recv_length: 477340920770381
          route_length: 0
          drop_length: 1291240
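
      In case it helps, this is roughly how we're keeping an eye on whether the LNet drop/timeout counters above keep climbing (just a rough sketch; field names as printed by lnetctl on 2.12):

      # Re-sample the LNet counters every 30 seconds and watch the drop/timeout/error fields
      watch -n 30 "lnetctl stats show | egrep 'drop_count|timeout_count|error_count'"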
      

      I was able to dump the kernel tasks using sysrq, attaching that as fir-io1-s1-sysrq-t.log
      Also attaching full kernel logs as fir-io1-s1-kern.log
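
      For reference, a minimal sketch of the sysrq sequence for that kind of task dump (assuming sysrq is enabled and the output is collected from the kernel ring buffer):

      echo 1 > /proc/sys/kernel/sysrq        # make sure all sysrq functions are enabled
      echo t > /proc/sysrq-trigger           # dump the state of all tasks to the kernel log
      journalctl -k > fir-io1-s1-sysrq-t.log # collect the resulting kernel messages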

      We use DNE, PFL and DOM. The OST backend is ldiskfs on mdraid.

      Meanwhile, we'll keep investigating a possible network issue.

      Thanks!
      Stephane

      Attachments

        Activity


          Thanks!

          sthiell Stephane Thiell added a comment

          Yes, absolutely.  Any Lustre node using an IB connection.

          pfarrell Patrick Farrell (Inactive) added a comment - edited

          Are CQ entries also used on LNet routers? I assume they are? All of our routers (FDR/EDR and EDR/EDR) are running 2.12.0, so maybe I need to update these too.
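
          For what it's worth, this is roughly how I'd compare the o2iblnd settings that feed into queue/CQ sizing across servers and routers (just a sketch; assumes the stock ko2iblnd module parameter names, which can vary between versions):

          # Dump the ko2iblnd parameters relevant to CQ sizing on this node
          for p in peer_credits peer_credits_hiw concurrent_sends map_on_demand; do
              printf '%s: ' "$p"
              cat /sys/module/ko2iblnd/parameters/$p 2>/dev/null || echo "n/a"
          done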

          sthiell Stephane Thiell added a comment
          sthiell Stephane Thiell added a comment - edited

          Hi Amir – You're absolutely right about the MGS; this is very likely another issue after all. We restarted this MGS today and I think things are better now. It's on our old 2.8 systems anyway, so we don't want to spend too much time on that right now.

          But OK on the recommendation re: the second patch, thanks much!


          The RDMA timeouts are logged at error level. Regarding:

           Mar 21 13:50:57 sh-25-08.int kernel: LustreError: 128406:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb1980) refcount nonzer
          

          If you don't see RDMA timeouts, this could be an unrelated problem.

          Either way, I think it'll be good to try out both of those patches. The CQ entries one has landed, but the other one I think makes sense to apply as well.

          ashehata Amir Shehata (Inactive) added a comment

          We have started an emergency rolling update of our 2.12 Lustre clients on Sherlock with the patch https://review.whamcloud.com/34474/ "LU-12065 lnd: increase CQ entries". I hope this will fix the bulk READ timeouts that we see on both 2.8 and 2.12 servers.

          sthiell Stephane Thiell added a comment

          Hey, I just noticed that Andreas landed https://review.whamcloud.com/34474/ to b2_12, so I guess I can add it to our 2.12 clients to try to mitigate this issue. Let me know if you think it's a bad idea.
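
          Roughly what I have in mind for building the patched clients (just a sketch; the patchset number and repository URLs are my guesses and need to be double-checked on the review page):

          # Fetch the b2_12 tree and cherry-pick Gerrit change 34474 (patchset number assumed)
          git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
          git checkout b2_12
          git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/74/34474/2
          git cherry-pick FETCH_HEAD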

          The issue with the MGS is not as critical as it doesn't seem to have an impact on production.

          sthiell Stephane Thiell added a comment

          Thanks Amir!! I guess we're affected by both issues at the moment.

          Looks like this ticket could be a duplicate of LU-12065 then (our errors are READ errors rather than WRITE, but I think that's because of our overall read-oriented workload):

          [root@fir-hn01 sthiell.root]# clush -w@oss 'journalctl -n 10000 | grep "network error"'
          fir-io1-s1: Mar 21 10:30:45 fir-io1-s1 kernel: LustreError: 96626:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff9838134df050 x1628556138350512/t0(0) o3->a8709ec9-480f-5069-1ed2-459
          fir-io4-s2: Mar 21 09:41:10 fir-io4-s2 kernel: LustreError: 11649:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff9f624be54450 x1628592064536240/t0(0) o3->f9709d24-98d8-0cac-f2ff-a99
          fir-io3-s2: Mar 21 09:42:53 fir-io3-s2 kernel: LustreError: 89904:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff8f64b9d83050 x1628590311939072/t0(0) o3->48668d7e-41a9-a375-0254-f12
          fir-io3-s2: Mar 21 10:31:13 fir-io3-s2 kernel: LustreError: 61500:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff8f45aaeab050 x1628590295062864/t0(0) o3->539dad34-3729-b804-7e3e-a23
          fir-io2-s2: Mar 21 09:42:42 fir-io2-s2 kernel: LustreError: 41983:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff939c4e0b4450 x1628574482967280/t0(0) o3->0528e199-3990-4d89-45f8-08e
          fir-io2-s2: Mar 21 10:31:03 fir-io2-s2 kernel: LustreError: 47646:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff937f2c7fec50 x1628593644009328/t0(0) o3->df3d4e84-b52f-56d8-5336-c34
          fir-io1-s2: Mar 21 09:41:53 fir-io1-s2 kernel: LustreError: 125651:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff900db6796850 x1628587142580752/t0(0) o3->fdfab198-010f-1b84-313f-f3
          fir-io1-s2: Mar 21 10:30:14 fir-io1-s2 kernel: LustreError: 125778:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff901c24c78c50 x1628593670375840/t0(0) o3->b10270ab-2dff-3a94-2c78-6a
          fir-io1-s2: Mar 21 10:30:39 fir-io1-s2 kernel: LustreError: 105343:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff900db6796850 x1628592114787056/t0(0) o3->1699114c-d7be-fde3-c1c0-12
          fir-io2-s1: Mar 21 09:41:43 fir-io2-s1 kernel: LustreError: 69998:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff986ba1925050 x1628592081804192/t0(0) o3->71bd1389-41d9-d380-693e-d73
          fir-io4-s1: Mar 21 09:42:51 fir-io4-s1 kernel: LustreError: 31413:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffffa0facc9ccc50 x1628593652265184/t0(0) o3->b10270ab-2dff-3a94-2c78-6a8
          fir-io3-s1: Mar 21 09:41:33 fir-io3-s1 kernel: LustreError: 30747:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff898e09130050 x1628587142580656/t0(0) o3->fdfab198-010f-1b84-313f-f38
          fir-io3-s1: Mar 21 10:28:55 fir-io3-s1 kernel: Lustre: 130578:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1553189329/real 1553189335]  req@ffff896e1c4da1
          fir-io3-s1: Mar 21 10:29:53 fir-io3-s1 kernel: LustreError: 24124:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff898404cd6c50 x1628556112999824/t0(0) o3->e93665fa-c928-d716-64a5-338
          

          Re: LU-11931, we have 2.12.0 clients that constantly log the following messages (MGS is 2.8 in that case):

          Mar 21 13:50:57 sh-25-08.int kernel: LustreError: 128406:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb1980) refcount nonzer
          Mar 21 13:50:57 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
          Mar 21 14:01:14 sh-25-08.int kernel: LustreError: 166-1: MGC10.210.34.201@o2ib1: Connection to MGS (at 10.210.34.201@o2ib1) was lost; in progress operations using this service will fail
          Mar 21 14:01:14 sh-25-08.int kernel: LustreError: 91224:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1553201774, 300s ago), entering recovery for MGS@MGC10.210.34.201@o21
          Mar 21 14:01:14 sh-25-08.int kernel: LustreError: 129134:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb0840) refcount nonzer
          Mar 21 14:01:14 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
          Mar 21 14:11:31 sh-25-08.int kernel: LustreError: 166-1: MGC10.210.34.201@o2ib1: Connection to MGS (at 10.210.34.201@o2ib1) was lost; in progress operations using this service will fail
          Mar 21 14:11:31 sh-25-08.int kernel: LustreError: 91224:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1553202391, 300s ago), entering recovery for MGS@MGC10.210.34.201@o21
          Mar 21 14:11:31 sh-25-08.int kernel: LustreError: 129843:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb0600) refcount nonzer
          Mar 21 14:11:31 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
          Mar 21 14:21:45 sh-25-08.int kernel: LustreError: 166-1: MGC10.210.34.201@o2ib1: Connection to MGS (at 10.210.34.201@o2ib1) was lost; in progress operations using this service will fail
          Mar 21 14:21:45 sh-25-08.int kernel: LustreError: 91224:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1553203005, 300s ago), entering recovery for MGS@MGC10.210.34.201@o21
          Mar 21 14:21:45 sh-25-08.int kernel: LustreError: 130529:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb1800) refcount nonzer
          Mar 21 14:21:45 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
          Mar 21 14:32:01 sh-25-08.int kernel: LustreError: 166-1: MGC10.210.34.201@o2ib1: Connection to MGS (at 10.210.34.201@o2ib1) was lost; in progress operations using this service will fail
          Mar 21 14:32:01 sh-25-08.int kernel: LustreError: 91224:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1553203621, 300s ago), entering recovery for MGS@MGC10.210.34.201@o21
          Mar 21 14:32:01 sh-25-08.int kernel: LustreError: 131236:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb0f00) refcount nonzer
          Mar 21 14:32:01 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
          ...
          
          sthiell Stephane Thiell added a comment

          We have a couple of in-flight patches that are supposed to address RDMA timeouts:

          https://review.whamcloud.com/34396

          https://review.whamcloud.com/34474/

          Details are in https://jira.whamcloud.com/browse/LU-11931 and https://jira.whamcloud.com/browse/LU-12065, respectively.

          ashehata Amir Shehata (Inactive) added a comment

          I'll let Amir weigh in on possible issues and patches.  There may be something, but Amir is much better qualified to evaluate if it's a possible match.

          Looking at your lock tables, there's nothing unusual currently, though it could be that it was cleaned up by the evictions.

          The fact that you're just now also seeing timeouts on your 2.8 system is interesting.  I'm still not enough of an o2ib expert to do anything with the knowledge, but it might be relevant.

          pfarrell Patrick Farrell (Inactive) added a comment

          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated: