Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.14.0, Lustre 2.12.9
Affects Version/s: Lustre 2.14.0, Lustre 2.12.5
Labels:
- patch

Epic/Theme:
- LTS12
- mgs
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

From the vmcore:

The lctl thread was sleeping, waiting for a lock to be granted:

PID: 26309  TASK: ffff9730a36b1040  CPU: 6   COMMAND: "lctl"
 #0 [ffff973084edfa08] __schedule at ffffffffaf369b97
 #1 [ffff973084edfa98] schedule at ffffffffaf36a099
 #2 [ffff973084edfaa8] ldlm_completion_ast at ffffffffc1582d45 [ptlrpc]
 #3 [ffff973084edfb50] mgs_completion_ast_generic at ffffffffc141f76c [mgs]
 #4 [ffff973084edfb98] mgs_completion_ast_config at ffffffffc141f983 [mgs]
 #5 [ffff973084edfba8] ldlm_cli_enqueue_local at ffffffffc1583ecc [ptlrpc]
 #6 [ffff973084edfc48] mgs_revoke_lock at ffffffffc14243b4 [mgs]
 #7 [ffff973084edfcf0] mgs_set_param at ffffffffc1441826 [mgs]
 #8 [ffff973084edfd50] mgs_iocontrol at ffffffffc14271ca [mgs]
 #9 [ffff973084edfdd0] class_handle_ioctl at ffffffffc10a40cd [obdclass]
#10 [ffff973084edfe60] obd_class_ioctl at ffffffffc10a46d2 [obdclass]
#11 [ffff973084edfe80] do_vfs_ioctl at ffffffffaee56490
#12 [ffff973084edff00] sys_ioctl at ffffffffaee56731
#13 [ffff973084edff50] system_call_fastpath at ffffffffaf376ddb

crash-7.2.5_new> ldlm_resource 0xffff9730f7a6ab40
struct ldlm_resource {
  lr_ns_bucket = 0xffff9730a7f8cb18,
  lr_hash = {
    next = 0x0,
    pprev = 0xffff9730f8d22608
  },
  lr_refcount = {
    counter = 295
  },
  lr_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  lr_granted = {
    next = 0xffff97308638c720,
    prev = 0xffff973086a542a0
  },
  lr_waiting = {
    next = 0xffff973085da84e0,
    prev = 0xffff973082f21920
  },
  lr_enqueueing = {
    next = 0xffff9730f7a6ab80,
    prev = 0xffff9730f7a6ab80
  },
  lr_name = {
    name = {3546639893419028083, 0, 0, 0}
  },
  {
    lr_itree = 0x0,
    lr_ibits_queues = 0x0
  },
  {
    lr_contention_time = 0,
    lr_lvb_inode = 0x0
  },
  lr_type = LDLM_PLAIN,
  lr_lvb_len = 0,

The wait never ends.
These are the four clients holding locks that conflict with the requested lock:

crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63000
  exp_client_uuid = {
    uuid = "bb972e10-11bc-7387-336f-6a82a0e0dd52\000\000\000"
  }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63400
  exp_client_uuid = {
    uuid = "5915d2ba-94aa-bb2e-5b88-144f699f7fa1\000\000\000"
  }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d61800
  exp_client_uuid = {
    uuid = "9eeccfff-2a06-f62b-6132-9799e0bcd8aa\000\000\000"
  }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d66400
  exp_client_uuid = {
    uuid = "b5a63e29-e36a-ea42-6e59-5387ded252b0\000\000\000"
  }

It looks like the problem started with network errors:

[ 2031.028363] LNet: 30781:0:(lib-msg.c:703:lnet_attempt_msg_resend()) msg 0@<0:0>->10.10.100.6@o2ib3 exceeded retry count 3
[ 2039.570481] LustreError: 166-1: MGC10.10.100.3@o2ib3: Connection to MGS (at 10.10.100.3@o2ib3) was lost; in progress operations using this service will fail
[ 2039.586364] LustreError: 19280:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1581103875, 780s ago), entering recovery for MGS@10.10.100.3@o2ib3 ns: MGC10.10.100.3@o2ib3 lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 4/1,0 mode: --/CR res: [0x3138323131786e73:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
[ 2039.628629] Lustre: MGS: Received new LWP connection from 10.10.100.3@o2ib3, removing former export from same NID
[ 2039.639855] Lustre: Skipped 8 previous similar messages
[ 2039.646053] Lustre: MGS: Connection restored to 56d1a214-8cb4-e698-a1b5-8ec5fd85505f (at 10.10.100.3@o2ib3)
[ 2039.656802] Lustre: Skipped 7 previous similar messages
[ 2039.663325] LustreError: 31823:0:(ldlm_resource.c:1159:ldlm_resource_complain()) MGC10.10.100.3@o2ib3: namespace resource [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 2039.663336] LustreError: 19280:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
[ 2039.695497] Lustre: 31823:0:(ldlm_resource.c:1772:ldlm_resource_dump()) --- Resource: [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount = 2
[ 2039.711112] Lustre: 31823:0:(ldlm_resource.c:1789:ldlm_resource_dump()) Waiting locks:
[ 2039.720115] Lustre: 31823:0:(ldlm_resource.c:1791:ldlm_resource_dump()) ### ### ns: ?? lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 2/0,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0

....
00010000:02000400:6.0:1581107912.144637:0:19276:0:(ldlm_lib.c:1162:target_handle_connect()) MGS: Received new LWP connection from 162@gni99, removing former export from same NID
00010000:00080000:6.0:1581107912.144640:0:19276:0:(ldlm_lib.c:1242:target_handle_connect()) MGS: connection from b5a63e29-e36a-ea42-6e59-5387ded252b0@162@gni99 t0 exp ffff973085d66400 cur 1581107912 last 1581107912

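The resource seen in the crash dump and the resource named in the client log can be matched by hand: lr_name in the ldlm_resource dump is printed in decimal, while the log prints the same field in hex. The following is a minimal sketch of that check. It assumes the first name word simply packs ASCII bytes in little-endian order, which is what the decoded value suggests; the helper name decode_res_name is made up for illustration.

```python
def decode_res_name(value: int) -> str:
    """Interpret a 64-bit resource name word as little-endian ASCII.

    Assumption: the name word packs up to 8 ASCII characters,
    lowest byte first, padded with NUL bytes.
    """
    raw = value.to_bytes(8, "little")
    return raw.rstrip(b"\x00").decode("ascii")


# lr_name.name[0] from the crash dump (decimal).
LR_NAME_FROM_CRASH = 3546639893419028083
# Resource id from the "lock timed out" log line (hex).
RES_FROM_LOG = 0x3138323131786E73

print(LR_NAME_FROM_CRASH == RES_FROM_LOG)   # True
print(decode_res_name(LR_NAME_FROM_CRASH))  # snx11281
```

Both values are the same number, so the lock that lctl is waiting on and the lock that timed out on the client side refer to the same resource. The decoded string reads like a file system name.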
The MGC uses the OBD_CONNECT_MNE_SWAB flag.
The root cause of this problem is that the OBD_CONNECT_MNE_SWAB flag has the same value as OBD_CONNECT_MDS_MDS. OBD_CONNECT_MNE_SWAB was used by 2.2 clients for MNE swabbing. The OBD_CONNECT_MDS_MDS flag is used to skip failing the export during reconnect for MDS-MDS interaction. Locks for MDS-MDS connections are not added to waiting_locks_list, because such connections are never evicted. As a result, the MGS cannot cancel a client's locks if that client does not receive or does not respond to a blocking AST.
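The aliasing can be illustrated with a small model. This is a sketch only, not Lustre code: the bit position is an arbitrary placeholder rather than the value from the Lustre headers, and the two helper functions are hypothetical stand-ins for the server-side checks described in this issue.

```python
# Placeholder bit value, chosen for illustration only.
OBD_CONNECT_MDS_MDS = 1 << 38
# The problem described in this issue: a second name for the same bit.
OBD_CONNECT_MNE_SWAB = OBD_CONNECT_MDS_MDS


def looks_like_mds_mds(connect_flags: int) -> bool:
    """Hypothetical server-side check: is this an MDS-MDS export?"""
    return bool(connect_flags & OBD_CONNECT_MDS_MDS)


def tracks_lock_timeout(connect_flags: int) -> bool:
    """Hypothetical model: locks of MDS-MDS exports are not put on
    the waiting list, so no timeout or eviction applies to them."""
    return not looks_like_mds_mds(connect_flags)


# An MGC connecting with MNE_SWAB set is indistinguishable from an
# MDS-MDS connection, so its locks are never timed out.
mgc_flags = OBD_CONNECT_MNE_SWAB
print(looks_like_mds_mds(mgc_flags))   # True
print(tracks_lock_timeout(mgc_flags))  # False
```

In this model, an unresponsive client whose export carries the aliased bit holds its lock indefinitely, which is consistent with the lctl thread never returning from ldlm_completion_ast.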

Issue Links

is related to:
- LU-10674 MGS very unstable in 2.10.x (Open)
- LU-15453 MDT shutdown hangs on mutex_lock, possibly cld_lock (Open)
- LU-11990 conf-sanity test_66: replace nids fail alone MGS (Resolved)
- LU-12735 MGS misbehaving in 2.12.2+ (Resolved)
- LU-15539 clients report mds_mds_connection in connect_flags after lustre update on servers (Resolved)

lctl conf_param hung on the MGS node
