[LU-13356] lctl conf_param hung on the MGS node Created: 11/Mar/20  Updated: 22/Sep/23  Resolved: 14/Apr/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.5
Fix Version/s: Lustre 2.14.0, Lustre 2.12.9

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
Related
is related to LU-10674 MGS very unstable in 2.10.x Open
is related to LU-15453 MDT shutdown hangs on mutex_lock, po... Open
is related to LU-11990 conf-sanity test_66: replace nids fai... Reopened
is related to LU-12735 MGS misbehaving in 2.12.2+ Resolved
is related to LU-15539 clients report mds_mds_connection in ... Resolved
Epic/Theme: LTS12, mgs
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

From the vmcore:

The lctl thread was sleeping, waiting for a lock to be granted:

PID: 26309  TASK: ffff9730a36b1040  CPU: 6   COMMAND: "lctl"
 #0 [ffff973084edfa08] __schedule at ffffffffaf369b97
 #1 [ffff973084edfa98] schedule at ffffffffaf36a099
 #2 [ffff973084edfaa8] ldlm_completion_ast at ffffffffc1582d45 [ptlrpc]
 #3 [ffff973084edfb50] mgs_completion_ast_generic at ffffffffc141f76c [mgs]
 #4 [ffff973084edfb98] mgs_completion_ast_config at ffffffffc141f983 [mgs]
 #5 [ffff973084edfba8] ldlm_cli_enqueue_local at ffffffffc1583ecc [ptlrpc]
 #6 [ffff973084edfc48] mgs_revoke_lock at ffffffffc14243b4 [mgs]
 #7 [ffff973084edfcf0] mgs_set_param at ffffffffc1441826 [mgs]
 #8 [ffff973084edfd50] mgs_iocontrol at ffffffffc14271ca [mgs]
 #9 [ffff973084edfdd0] class_handle_ioctl at ffffffffc10a40cd [obdclass]
#10 [ffff973084edfe60] obd_class_ioctl at ffffffffc10a46d2 [obdclass]
#11 [ffff973084edfe80] do_vfs_ioctl at ffffffffaee56490
#12 [ffff973084edff00] sys_ioctl at ffffffffaee56731
#13 [ffff973084edff50] system_call_fastpath at ffffffffaf376ddb
 
crash-7.2.5_new> ldlm_resource 0xffff9730f7a6ab40
struct ldlm_resource {
  lr_ns_bucket = 0xffff9730a7f8cb18,
  lr_hash = {
    next = 0x0,
    pprev = 0xffff9730f8d22608
  },
  lr_refcount = {
    counter = 295
  },
  lr_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  lr_granted = {
    next = 0xffff97308638c720,
    prev = 0xffff973086a542a0
  },
  lr_waiting = {
    next = 0xffff973085da84e0,
    prev = 0xffff973082f21920
  },
  lr_enqueueing = {
    next = 0xffff9730f7a6ab80,
    prev = 0xffff9730f7a6ab80
  },
  lr_name = {
    name = {3546639893419028083, 0, 0, 0}
  },
  {
    lr_itree = 0x0,
    lr_ibits_queues = 0x0
  },
  {
    lr_contention_time = 0,
    lr_lvb_inode = 0x0
  },
  lr_type = LDLM_PLAIN,
  lr_lvb_len = 0,

The wait is infinite.
Four clients hold locks that conflict with the requested lock:

crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63000
  exp_client_uuid = {
    uuid = "bb972e10-11bc-7387-336f-6a82a0e0dd52\000\000\000"
  }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63400
  exp_client_uuid = {
    uuid = "5915d2ba-94aa-bb2e-5b88-144f699f7fa1\000\000\000"
  }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d61800
  exp_client_uuid = {
    uuid = "9eeccfff-2a06-f62b-6132-9799e0bcd8aa\000\000\000"
  }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d66400
  exp_client_uuid = {
    uuid = "b5a63e29-e36a-ea42-6e59-5387ded252b0\000\000\000"
  }

It looks like the problem started with network errors:

[ 2031.028363] LNet: 30781:0:(lib-msg.c:703:lnet_attempt_msg_resend()) msg 0@<0:0>->10.10.100.6@o2ib3 exceeded retry count 3
[ 2039.570481] LustreError: 166-1: MGC10.10.100.3@o2ib3: Connection to MGS (at 10.10.100.3@o2ib3) was lost; in progress operations using this service will fail
[ 2039.586364] LustreError: 19280:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1581103875, 780s ago), entering recovery for MGS@10.10.100.3@o2ib3 ns: MGC10.10.100.3@o2ib3 lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 4/1,0 mode: --/CR res: [0x3138323131786e73:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
[ 2039.628629] Lustre: MGS: Received new LWP connection from 10.10.100.3@o2ib3, removing former export from same NID
[ 2039.639855] Lustre: Skipped 8 previous similar messages
[ 2039.646053] Lustre: MGS: Connection restored to 56d1a214-8cb4-e698-a1b5-8ec5fd85505f (at 10.10.100.3@o2ib3)
[ 2039.656802] Lustre: Skipped 7 previous similar messages
[ 2039.663325] LustreError: 31823:0:(ldlm_resource.c:1159:ldlm_resource_complain()) MGC10.10.100.3@o2ib3: namespace resource [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 2039.663336] LustreError: 19280:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
[ 2039.695497] Lustre: 31823:0:(ldlm_resource.c:1772:ldlm_resource_dump()) --- Resource: [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount = 2
[ 2039.711112] Lustre: 31823:0:(ldlm_resource.c:1789:ldlm_resource_dump()) Waiting locks:
[ 2039.720115] Lustre: 31823:0:(ldlm_resource.c:1791:ldlm_resource_dump()) ### ### ns: ?? lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 2/0,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0

....
00010000:02000400:6.0:1581107912.144637:0:19276:0:(ldlm_lib.c:1162:target_handle_connect()) MGS: Received new LWP connection from 162@gni99, removing former export from same NID
00010000:00080000:6.0:1581107912.144640:0:19276:0:(ldlm_lib.c:1242:target_handle_connect()) MGS: connection from b5a63e29-e36a-ea42-6e59-5387ded252b0@162@gni99 t0 exp ffff973085d66400 cur 1581107912 last 1581107912

The MGC uses the OBD_CONNECT_MNE_SWAB flag.
The root cause of this problem is that OBD_CONNECT_MNE_SWAB has the same value as OBD_CONNECT_MDS_MDS. OBD_CONNECT_MNE_SWAB was used by 2.2 clients for MNE swabbing. OBD_CONNECT_MDS_MDS is used to skip failing the export during reconnect for MDS-MDS interaction. Locks for MDS-MDS connections are not added to waiting_locks_list, because such connections are not evicted. As a result, the MGS cannot cancel a client's locks if that client does not receive or respond to a blocking AST.



 Comments   
Comment by Gerrit Updater [ 11/Mar/20 ]

Alexander Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/37880
Subject: LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0b965f10ebc350fb9c083f415178e787c7996bbe

Comment by Gerrit Updater [ 14/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37880/
Subject: LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3fe77a129e131014ff654bde616a62a1e243e322

Comment by Peter Jones [ 14/Apr/20 ]

Landed for 2.14

Comment by Aurelien Degremont (Inactive) [ 01/Oct/20 ]

If I understand correctly, that means that you could not do an IR if one client is not responding.

It seems this problem is important enough to be backported to 2.12 LTS, no?

Comment by Nathan Dauchy (Inactive) [ 16/Oct/20 ]

We also hit this on 2.12, and would benefit from a backport.  Chances are good there is at least one client not responding somewhere on a large system.

Comment by Gerrit Updater [ 25/Jan/21 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41309
Subject: LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: bf855c643c4c89ac57841e80705ef617cf65e02b

Comment by Etienne Aujames [ 25/Jan/21 ]

Hello,

We hit this issue in production, so I backported the patch https://review.whamcloud.com/37880/ to b2_12.

I am aware this patch removes support for OBD_CONNECT_MNE_SWAB, so I don't expect it to land on b2_12.
But it seems important enough to be integrated into our version of Lustre.

Comment by Gerrit Updater [ 05/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41309/
Subject: LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 337b1d1bb301725b91380326985af52a5bede3a1

Generated at Sat Feb 10 03:00:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.