LU-12476

ldlm_bl_ processes running at 100% causing client issues


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.7
    • Severity: 3

    Description

      The symptom is that clients cannot access Lustre filesystem data. We are seeing timeouts in the logs, e.g.:

      Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561643289/real 1561643289]  req@ffff9bffcbd3b900 x1637338661347344/t0(0) o36->echo-MDT0000-mdc-ffff9c2b2b775000@10.23.22.104@tcp:12/10 lens 880/856 e 24 to 1 dl 1561643890 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
      
      Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      
      Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection to echo-MDT0000 (at 10.23.22.104@tcp) was lost; in progress operations using this service will wait for recovery to complete
      
      Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection restored to 10.23.22.104@tcp (at 10.23.22.104@tcp)
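
      While this is happening, the client's view of the MDT connection can be inspected with standard lctl parameters (a sketch; the echo-MDT0000 device name is taken from the log above):

      # On an affected client: import status and connection state for the MDC
      lctl get_param 'mdc.echo-MDT0000-mdc-*.import'
      lctl get_param 'mdc.echo-MDT0000-mdc-*.state'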
      

      On the MDS we see:

      Jun 27 06:55:08 emds1 kernel: LustreError: 27539:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1561643408, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-echo-MDT0000_UUID lock: ffff88522ed35800/0x7d046634332b9f1e lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 8 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 27539 timeout: 0 lvb_type: 0
      
      Jun 27 07:00:04 emds1 kernel: Lustre: 27723:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (4/4), not sending early reply#012  req@ffff88231ac4f500 x1637338661348352/t0(0) o36->e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:498/0 lens 928/3128 e 1 to 0 dl 1561644008 ref 2 fl Interpret:/0/0 rc 0/0
      
      Jun 27 07:00:37 emds1 kernel: Lustre: 49916:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (118:3300s); client may timeout.  req@ffff8823586ce900 x1637332991328208/t910888684941(0) o36->e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:247/0 lens 680/424 e 3 to 0 dl 1561640737 ref 1 fl Complete:/0/0 rc 0/0
      
      Jun 27 07:00:37 emds1 kernel: LNet: Service thread pid 49916 completed after 3417.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
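
      If useful, the lock load on the MDS can be sampled from the ldlm namespace counters (standard lctl parameters; the namespace name mdt-echo-MDT0000_UUID is from the lock-timeout message above):

      # On the MDS: number of locks and resources held in the MDT namespace
      lctl get_param ldlm.namespaces.mdt-echo-MDT0000_UUID.lock_count
      lctl get_param ldlm.namespaces.mdt-echo-MDT0000_UUID.resource_count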
      

      We see an ldlm_bl_01 (or ldlm_bl_02) thread pegged at 100% on a CPU core for extended periods (over an hour). It will recover for a few minutes, then max out a CPU core again.
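
      When one of these threads is pinned, its kernel stack should show where it is spinning (a sketch using standard procfs; run as root, and the pgrep pattern assumes the thread name matches exactly):

      # Dump the kernel stack of the busy blocking thread, e.g. ldlm_bl_01
      pid=$(pgrep -x ldlm_bl_01)
      cat /proc/$pid/stack
      # Or dump backtraces for all tasks to the kernel log via sysrq
      echo t > /proc/sysrq-trigger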

      What might be causing this?


          People

            pjones Peter Jones
            cmcl Campbell Mcleay (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue
