Lustre / LU-13356

lctl conf_param hung on the MGS node


Details


    Description

      From the vmcore: the lctl thread was sleeping, waiting for its lock to be granted.

      PID: 26309  TASK: ffff9730a36b1040  CPU: 6   COMMAND: "lctl"
       #0 [ffff973084edfa08] __schedule at ffffffffaf369b97
       #1 [ffff973084edfa98] schedule at ffffffffaf36a099
       #2 [ffff973084edfaa8] ldlm_completion_ast at ffffffffc1582d45 [ptlrpc]
       #3 [ffff973084edfb50] mgs_completion_ast_generic at ffffffffc141f76c [mgs]
       #4 [ffff973084edfb98] mgs_completion_ast_config at ffffffffc141f983 [mgs]
       #5 [ffff973084edfba8] ldlm_cli_enqueue_local at ffffffffc1583ecc [ptlrpc]
       #6 [ffff973084edfc48] mgs_revoke_lock at ffffffffc14243b4 [mgs]
       #7 [ffff973084edfcf0] mgs_set_param at ffffffffc1441826 [mgs]
       #8 [ffff973084edfd50] mgs_iocontrol at ffffffffc14271ca [mgs]
       #9 [ffff973084edfdd0] class_handle_ioctl at ffffffffc10a40cd [obdclass]
      #10 [ffff973084edfe60] obd_class_ioctl at ffffffffc10a46d2 [obdclass]
      #11 [ffff973084edfe80] do_vfs_ioctl at ffffffffaee56490
      #12 [ffff973084edff00] sys_ioctl at ffffffffaee56731
      #13 [ffff973084edff50] system_call_fastpath at ffffffffaf376ddb
       
      crash-7.2.5_new> ldlm_resource 0xffff9730f7a6ab40
      struct ldlm_resource {
        lr_ns_bucket = 0xffff9730a7f8cb18,
        lr_hash = {
          next = 0x0,
          pprev = 0xffff9730f8d22608
        },
        lr_refcount = {
          counter = 295
        },
        lr_lock = {
          {
            rlock = {
              raw_lock = {
                val = {
                  counter = 0
                }
              }
            }
          }
        },
        lr_granted = {
          next = 0xffff97308638c720,
          prev = 0xffff973086a542a0
        },
        lr_waiting = {
          next = 0xffff973085da84e0,
          prev = 0xffff973082f21920
        },
        lr_enqueueing = {
          next = 0xffff9730f7a6ab80,
          prev = 0xffff9730f7a6ab80
        },
        lr_name = {
          name = {3546639893419028083, 0, 0, 0}
        },
        {
          lr_itree = 0x0,
          lr_ibits_queues = 0x0
        },
        {
          lr_contention_time = 0,
          lr_lvb_inode = 0x0
        },
        lr_type = LDLM_PLAIN,
        lr_lvb_len = 0,
      

      The wait is infinite.
      Here are the four clients holding locks that conflict with the enqueued lock:

      crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63000
        exp_client_uuid = {
          uuid = "bb972e10-11bc-7387-336f-6a82a0e0dd52\000\000\000"
        }
      crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63400
        exp_client_uuid = {
          uuid = "5915d2ba-94aa-bb2e-5b88-144f699f7fa1\000\000\000"
        }
      crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d61800
        exp_client_uuid = {
          uuid = "9eeccfff-2a06-f62b-6132-9799e0bcd8aa\000\000\000"
        }
      crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d66400
        exp_client_uuid = {
          uuid = "b5a63e29-e36a-ea42-6e59-5387ded252b0\000\000\000"
        }
      

      It looks like the problem started with network errors:

      [ 2031.028363] LNet: 30781:0:(lib-msg.c:703:lnet_attempt_msg_resend()) msg 0@<0:0>->10.10.100.6@o2ib3 exceeded retry count 3
      [ 2039.570481] LustreError: 166-1: MGC10.10.100.3@o2ib3: Connection to MGS (at 10.10.100.3@o2ib3) was lost; in progress operations using this service will fail
      [ 2039.586364] LustreError: 19280:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1581103875, 780s ago), entering recovery for MGS@10.10.100.3@o2ib3 ns: MGC10.10.100.3@o2ib3 lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 4/1,0 mode: --/CR res: [0x3138323131786e73:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
      [ 2039.628629] Lustre: MGS: Received new LWP connection from 10.10.100.3@o2ib3, removing former export from same NID
      [ 2039.639855] Lustre: Skipped 8 previous similar messages
      [ 2039.646053] Lustre: MGS: Connection restored to 56d1a214-8cb4-e698-a1b5-8ec5fd85505f (at 10.10.100.3@o2ib3)
      [ 2039.656802] Lustre: Skipped 7 previous similar messages
      [ 2039.663325] LustreError: 31823:0:(ldlm_resource.c:1159:ldlm_resource_complain()) MGC10.10.100.3@o2ib3: namespace resource [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 2039.663336] LustreError: 19280:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
      [ 2039.695497] Lustre: 31823:0:(ldlm_resource.c:1772:ldlm_resource_dump()) --- Resource: [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount = 2
      [ 2039.711112] Lustre: 31823:0:(ldlm_resource.c:1789:ldlm_resource_dump()) Waiting locks:
      [ 2039.720115] Lustre: 31823:0:(ldlm_resource.c:1791:ldlm_resource_dump()) ### ### ns: ?? lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 2/0,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
      
      ....
      |00010000:02000400:6.0:1581107912.144637:0:19276:0:(ldlm_lib.c:1162:target_handle_connect()) MGS: Received new LWP connection from 162@gni99, removing former export from same NID|
      |00010000:00080000:6.0:1581107912.144640:0:19276:0:(ldlm_lib.c:1242:target_handle_connect()) MGS: connection from b5a63e29-e36a-ea42-6e59-5387ded252b0@162@gni99 t0 exp ffff973085d66400 cur 1581107912 last 1581107912|
      

      The MGC connects with the OBD_CONNECT_MNE_SWAB flag set.
      The root cause of this problem is that OBD_CONNECT_MNE_SWAB has the same value as OBD_CONNECT_MDS_MDS. OBD_CONNECT_MNE_SWAB was used by 2.2 clients to request MNE swabbing, while the OBD_CONNECT_MDS_MDS flag is used to skip export failure during reconnect for MDS-MDS interactions. Locks from MDS-MDS connections are not added to the waiting_locks_list, because such peers are never evicted. As a result, the MGS cannot cancel a lock held by a client if that client does not receive, or does not respond to, the blocking AST.
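      The collision described above can be sketched in a few lines of C. This is a minimal illustration, not the real Lustre code: the flag value below is a placeholder (the actual constants live in the Lustre headers), and skip_waiting_list() is a hypothetical stand-in for the server-side check that exempts MDS-MDS locks from the waiting list. The point is only that two distinct flag names sharing one bit make the server misclassify the connection.

      ```c
      #include <stdio.h>
      #include <stdint.h>

      /* Placeholder bit value (assumption): what matters is that both
       * names resolve to the SAME bit, so the server cannot tell an
       * MGC requesting MNE swabbing from an MDS-MDS peer. */
      #define OBD_CONNECT_MNE_SWAB 0x4000000000000000ULL
      #define OBD_CONNECT_MDS_MDS  0x4000000000000000ULL  /* same bit! */

      /* Hypothetical sketch of the server-side decision: locks from
       * "MDS-MDS" connections are not put on the waiting_locks_list,
       * so an unresponsive peer is never timed out or evicted. */
      static int skip_waiting_list(uint64_t connect_flags)
      {
              return (connect_flags & OBD_CONNECT_MDS_MDS) != 0;
      }

      int main(void)
      {
              /* The MGC only means to advertise MNE-swabbing support... */
              uint64_t mgc_flags = OBD_CONNECT_MNE_SWAB;

              /* ...but the MDS_MDS bit tests true, so the client's lock
               * is exempt from cancellation: a client that never answers
               * the blocking AST pins the MGS config lock forever. */
              printf("treated as MDS-MDS: %d\n", skip_waiting_list(mgc_flags));
              return 0;
      }
      ```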

People

  aboyko Alexander Boyko