lctl conf_param hung on the MGS node

    Description

      From vmcore

      The lctl thread was sleeping, waiting for the lock to be granted:

      PID: 26309  TASK: ffff9730a36b1040  CPU: 6   COMMAND: "lctl"
       #0 [ffff973084edfa08] __schedule at ffffffffaf369b97
       #1 [ffff973084edfa98] schedule at ffffffffaf36a099
       #2 [ffff973084edfaa8] ldlm_completion_ast at ffffffffc1582d45 [ptlrpc]
       #3 [ffff973084edfb50] mgs_completion_ast_generic at ffffffffc141f76c [mgs]
       #4 [ffff973084edfb98] mgs_completion_ast_config at ffffffffc141f983 [mgs]
       #5 [ffff973084edfba8] ldlm_cli_enqueue_local at ffffffffc1583ecc [ptlrpc]
       #6 [ffff973084edfc48] mgs_revoke_lock at ffffffffc14243b4 [mgs]
       #7 [ffff973084edfcf0] mgs_set_param at ffffffffc1441826 [mgs]
       #8 [ffff973084edfd50] mgs_iocontrol at ffffffffc14271ca [mgs]
       #9 [ffff973084edfdd0] class_handle_ioctl at ffffffffc10a40cd [obdclass]
      #10 [ffff973084edfe60] obd_class_ioctl at ffffffffc10a46d2 [obdclass]
      #11 [ffff973084edfe80] do_vfs_ioctl at ffffffffaee56490
      #12 [ffff973084edff00] sys_ioctl at ffffffffaee56731
      #13 [ffff973084edff50] system_call_fastpath at ffffffffaf376ddb
       
      crash-7.2.5_new> ldlm_resource 0xffff9730f7a6ab40
      struct ldlm_resource {
        lr_ns_bucket = 0xffff9730a7f8cb18,
        lr_hash = {
          next = 0x0,
          pprev = 0xffff9730f8d22608
        },
        lr_refcount = {
          counter = 295
        },
        lr_lock = {
          {
            rlock = {
              raw_lock = {
                val = {
                  counter = 0
                }
              }
            }
          }
        },
        lr_granted = {
          next = 0xffff97308638c720,
          prev = 0xffff973086a542a0
        },
        lr_waiting = {
          next = 0xffff973085da84e0,
          prev = 0xffff973082f21920
        },
        lr_enqueueing = {
          next = 0xffff9730f7a6ab80,
          prev = 0xffff9730f7a6ab80
        },
        lr_name = {
          name = {3546639893419028083, 0, 0, 0}
        },
        {
          lr_itree = 0x0,
          lr_ibits_queues = 0x0
        },
        {
          lr_contention_time = 0,
          lr_lvb_inode = 0x0
        },
        lr_type = LDLM_PLAIN,
        lr_lvb_len = 0,
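
The first word of lr_name in the dump can be turned back into something readable. The following is a minimal sketch, assuming the MGS config lock resource name packs the ASCII file system name into the first 64-bit word in little-endian byte order; the helper name decode_res_name is hypothetical and not part of Lustre or crash.

```python
def decode_res_name(word: int) -> str:
    """Interpret a 64-bit resource-name word as little-endian ASCII bytes.

    Assumption: the name was packed byte by byte into a little-endian u64,
    padded with NUL bytes when shorter than 8 characters.
    """
    raw = word.to_bytes(8, byteorder="little")
    return raw.rstrip(b"\x00").decode("ascii")


# First element of lr_name.name from the ldlm_resource dump above.
vmcore_word = 3546639893419028083

# Same word in the hexadecimal form used by the "res:" field of kernel logs.
print(hex(vmcore_word))

# File system name recovered from the word.
print(decode_res_name(vmcore_word))
```

For the value in this dump the hexadecimal form is 0x3138323131786e73, which is the same resource that appears in the res: field of the kernel log lines in this description, and the decoded name is snx11281.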
      

      The wait is infinite.
      Here are the four clients that conflict with the lock:

      crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63000
        exp_client_uuid = {
          uuid = "bb972e10-11bc-7387-336f-6a82a0e0dd52\000\000\000"
        }
      crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63400
        exp_client_uuid = {
          uuid = "5915d2ba-94aa-bb2e-5b88-144f699f7fa1\000\000\000"
        }
      crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d61800
        exp_client_uuid = {
          uuid = "9eeccfff-2a06-f62b-6132-9799e0bcd8aa\000\000\000"
        }
      crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d66400
        exp_client_uuid = {
          uuid = "b5a63e29-e36a-ea42-6e59-5387ded252b0\000\000\000"
        }
      

      It looks like the problem started with network errors:

      [ 2031.028363] LNet: 30781:0:(lib-msg.c:703:lnet_attempt_msg_resend()) msg 0@<0:0>->10.10.100.6@o2ib3 exceeded retry count 3
      [ 2039.570481] LustreError: 166-1: MGC10.10.100.3@o2ib3: Connection to MGS (at 10.10.100.3@o2ib3) was lost; in progress operations using this service will fail
      [ 2039.586364] LustreError: 19280:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1581103875, 780s ago), entering recovery for MGS@10.10.100.3@o2ib3 ns: MGC10.10.100.3@o2ib3 lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 4/1,0 mode: --/CR res: [0x3138323131786e73:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
      [ 2039.628629] Lustre: MGS: Received new LWP connection from 10.10.100.3@o2ib3, removing former export from same NID
      [ 2039.639855] Lustre: Skipped 8 previous similar messages
      [ 2039.646053] Lustre: MGS: Connection restored to 56d1a214-8cb4-e698-a1b5-8ec5fd85505f (at 10.10.100.3@o2ib3)
      [ 2039.656802] Lustre: Skipped 7 previous similar messages
      [ 2039.663325] LustreError: 31823:0:(ldlm_resource.c:1159:ldlm_resource_complain()) MGC10.10.100.3@o2ib3: namespace resource [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 2039.663336] LustreError: 19280:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
      [ 2039.695497] Lustre: 31823:0:(ldlm_resource.c:1772:ldlm_resource_dump()) --- Resource: [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount = 2
      [ 2039.711112] Lustre: 31823:0:(ldlm_resource.c:1789:ldlm_resource_dump()) Waiting locks:
      [ 2039.720115] Lustre: 31823:0:(ldlm_resource.c:1791:ldlm_resource_dump()) ### ### ns: ?? lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 2/0,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
      
      ....
      00010000:02000400:6.0:1581107912.144637:0:19276:0:(ldlm_lib.c:1162:target_handle_connect()) MGS: Received new LWP connection from 162@gni99, removing former export from same NID
      00010000:00080000:6.0:1581107912.144640:0:19276:0:(ldlm_lib.c:1242:target_handle_connect()) MGS: connection from b5a63e29-e36a-ea42-6e59-5387ded252b0@162@gni99 t0 exp ffff973085d66400 cur 1581107912 last 1581107912
      

      The MGC uses the OBD_CONNECT_MNE_SWAB flag.
      The root cause of this problem is that the OBD_CONNECT_MNE_SWAB flag has the same value as OBD_CONNECT_MDS_MDS. OBD_CONNECT_MNE_SWAB was used for 2.2 clients for MNE swabbing. The OBD_CONNECT_MDS_MDS flag is used to skip failing the export during reconnect for MDS-MDS interaction. Locks for MDS-MDS connections are not added to waiting_locks_list, because such exports are not subject to eviction. As a result, the MGS cannot cancel a client's locks if that client does not receive or respond to a blocking AST.
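
The flag collision described above can be modelled in a few lines. This is only a sketch under stated assumptions: the bit value is illustrative, and tracks_lock_timeout is a hypothetical stand-in for the server-side decision, not an actual Lustre function.

```python
# Illustrative bit value only; what matters is that both names share one bit.
OBD_CONNECT_MDS_MDS = 0x400000000
# The old MNE-swab flag is an alias for the same bit with a different meaning.
OBD_CONNECT_MNE_SWAB = OBD_CONNECT_MDS_MDS


def tracks_lock_timeout(export_connect_flags: int) -> bool:
    """Hypothetical server-side check.

    Locks held by exports that look like MDS-MDS connections are not put
    on the waiting-locks list, so they never time out and the export is
    never evicted.
    """
    return not (export_connect_flags & OBD_CONNECT_MDS_MDS)


# An MGC that still sets the MNE-swab bit is mistaken for an MDS-MDS peer.
mgc_flags_with_swab = OBD_CONNECT_MNE_SWAB
# An MGC that no longer sends the flag is treated like a normal client.
mgc_flags_without_swab = 0

print(tracks_lock_timeout(mgc_flags_with_swab))
print(tracks_lock_timeout(mgc_flags_without_swab))
```

In this model the first call reports that the lock is not tracked, which corresponds to the endless wait in mgs_revoke_lock; once the client stops sending the flag, the lock is tracked again and an unresponsive client can be timed out.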

          Activity

            [LU-13356] lctl conf_param hung on the MGS node

            Gerrit Updater (gerrit) added a comment:

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41309/
            Subject: LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 337b1d1bb301725b91380326985af52a5bede3a1

            Etienne Aujames (eaujames) added a comment:

            Hello,

            We hit this issue in production, so I backported the patch https://review.whamcloud.com/37880/ to b2_12.

            I am aware that this patch removes support for OBD_CONNECT_MNE_SWAB, so I don't expect it to land on b2_12.
            But it seems important enough to be integrated into our version of Lustre.

            Gerrit Updater (gerrit) added a comment:

            Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41309
            Subject: LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: bf855c643c4c89ac57841e80705ef617cf65e02b

            Nathan Dauchy (dauchy, Inactive) added a comment:

            We also hit this on 2.12, and would benefit from a backport. Chances are good there is at least one client not responding somewhere on a large system.

            Aurelien Degremont (degremoa, Inactive) added a comment (edited):

            If I understand correctly, that means that you could not do an IR if one client is not responding.

            It seems this problem is important enough to be backported to 2.12 LTS, no?

            Peter Jones (pjones) added a comment:

            Landed for 2.14

            Gerrit Updater (gerrit) added a comment:

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37880/
            Subject: LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3fe77a129e131014ff654bde616a62a1e243e322

            Gerrit Updater (gerrit) added a comment:

            Alexander Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/37880
            Subject: LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0b965f10ebc350fb9c083f415178e787c7996bbe

            People

              aboyko Alexander Boyko
              aboyko Alexander Boyko