Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Versions: Lustre 2.14.0, Lustre 2.12.5
Description
From the vmcore:
The lctl thread was sleeping, waiting for its lock to be granted:
PID: 26309  TASK: ffff9730a36b1040  CPU: 6  COMMAND: "lctl"
 #0 [ffff973084edfa08] __schedule at ffffffffaf369b97
 #1 [ffff973084edfa98] schedule at ffffffffaf36a099
 #2 [ffff973084edfaa8] ldlm_completion_ast at ffffffffc1582d45 [ptlrpc]
 #3 [ffff973084edfb50] mgs_completion_ast_generic at ffffffffc141f76c [mgs]
 #4 [ffff973084edfb98] mgs_completion_ast_config at ffffffffc141f983 [mgs]
 #5 [ffff973084edfba8] ldlm_cli_enqueue_local at ffffffffc1583ecc [ptlrpc]
 #6 [ffff973084edfc48] mgs_revoke_lock at ffffffffc14243b4 [mgs]
 #7 [ffff973084edfcf0] mgs_set_param at ffffffffc1441826 [mgs]
 #8 [ffff973084edfd50] mgs_iocontrol at ffffffffc14271ca [mgs]
 #9 [ffff973084edfdd0] class_handle_ioctl at ffffffffc10a40cd [obdclass]
#10 [ffff973084edfe60] obd_class_ioctl at ffffffffc10a46d2 [obdclass]
#11 [ffff973084edfe80] do_vfs_ioctl at ffffffffaee56490
#12 [ffff973084edff00] sys_ioctl at ffffffffaee56731
#13 [ffff973084edff50] system_call_fastpath at ffffffffaf376ddb
crash-7.2.5_new> ldlm_resource 0xffff9730f7a6ab40
struct ldlm_resource {
  lr_ns_bucket = 0xffff9730a7f8cb18,
  lr_hash = { next = 0x0, pprev = 0xffff9730f8d22608 },
  lr_refcount = { counter = 295 },
  lr_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
  lr_granted = { next = 0xffff97308638c720, prev = 0xffff973086a542a0 },
  lr_waiting = { next = 0xffff973085da84e0, prev = 0xffff973082f21920 },
  lr_enqueueing = { next = 0xffff9730f7a6ab80, prev = 0xffff9730f7a6ab80 },
  lr_name = { name = {3546639893419028083, 0, 0, 0} },
  { lr_itree = 0x0, lr_ibits_queues = 0x0 },
  { lr_contention_time = 0, lr_lvb_inode = 0x0 },
  lr_type = LDLM_PLAIN,
  lr_lvb_len = 0,
The wait is infinite.
Here are the four clients whose granted locks conflict with the requested lock:
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63000
  exp_client_uuid = { uuid = "bb972e10-11bc-7387-336f-6a82a0e0dd52\000\000\000" }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63400
  exp_client_uuid = { uuid = "5915d2ba-94aa-bb2e-5b88-144f699f7fa1\000\000\000" }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d61800
  exp_client_uuid = { uuid = "9eeccfff-2a06-f62b-6132-9799e0bcd8aa\000\000\000" }
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d66400
  exp_client_uuid = { uuid = "b5a63e29-e36a-ea42-6e59-5387ded252b0\000\000\000" }
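For reference, a minimal userspace sketch (not Lustre source; the struct layouts below are trimmed stand-ins) of the pointer chain that was followed in crash to get from the stuck ldlm_resource to these UUIDs, i.e. lr_granted -> ldlm_lock.l_export -> exp_client_uuid:

/* Sketch only: simplified stand-ins for the Lustre structures, to show the
 * resource -> granted lock -> export -> client UUID chain used above. */
#include <stdio.h>

struct obd_uuid { char uuid[40]; };
struct obd_export { struct obd_uuid exp_client_uuid; };

struct ldlm_lock {
        struct obd_export *l_export;   /* export of the client holding the lock */
        struct ldlm_lock  *next;       /* stand-in for the l_res_link list linkage */
};

struct ldlm_resource {
        struct ldlm_lock *lr_granted;  /* stand-in for the granted-locks list head */
};

/* Print the client UUID behind every granted lock on the resource. */
static void dump_granted_clients(const struct ldlm_resource *res)
{
        for (const struct ldlm_lock *lck = res->lr_granted; lck; lck = lck->next)
                printf("exp_client_uuid = %s\n", lck->l_export->exp_client_uuid.uuid);
}

int main(void)
{
        struct obd_export e1 = { .exp_client_uuid = { "bb972e10-11bc-7387-336f-6a82a0e0dd52" } };
        struct obd_export e2 = { .exp_client_uuid = { "5915d2ba-94aa-bb2e-5b88-144f699f7fa1" } };
        struct ldlm_lock  l2 = { .l_export = &e2, .next = NULL };
        struct ldlm_lock  l1 = { .l_export = &e1, .next = &l2 };
        struct ldlm_resource res = { .lr_granted = &l1 };

        dump_granted_clients(&res);
        return 0;
}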
It looks like the problem started with network errors:
[ 2031.028363] LNet: 30781:0:(lib-msg.c:703:lnet_attempt_msg_resend()) msg 0@<0:0>->10.10.100.6@o2ib3 exceeded retry count 3
[ 2039.570481] LustreError: 166-1: MGC10.10.100.3@o2ib3: Connection to MGS (at 10.10.100.3@o2ib3) was lost; in progress operations using this service will fail
[ 2039.586364] LustreError: 19280:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1581103875, 780s ago), entering recovery for MGS@10.10.100.3@o2ib3 ns: MGC10.10.100.3@o2ib3 lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 4/1,0 mode: --/CR res: [0x3138323131786e73:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
[ 2039.628629] Lustre: MGS: Received new LWP connection from 10.10.100.3@o2ib3, removing former export from same NID
[ 2039.639855] Lustre: Skipped 8 previous similar messages
[ 2039.646053] Lustre: MGS: Connection restored to 56d1a214-8cb4-e698-a1b5-8ec5fd85505f (at 10.10.100.3@o2ib3)
[ 2039.656802] Lustre: Skipped 7 previous similar messages
[ 2039.663325] LustreError: 31823:0:(ldlm_resource.c:1159:ldlm_resource_complain()) MGC10.10.100.3@o2ib3: namespace resource [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 2039.663336] LustreError: 19280:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
[ 2039.695497] Lustre: 31823:0:(ldlm_resource.c:1772:ldlm_resource_dump()) --- Resource: [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount = 2
[ 2039.711112] Lustre: 31823:0:(ldlm_resource.c:1789:ldlm_resource_dump()) Waiting locks:
[ 2039.720115] Lustre: 31823:0:(ldlm_resource.c:1791:ldlm_resource_dump()) ### ### ns: ?? lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 2/0,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
....
00010000:02000400:6.0:1581107912.144637:0:19276:0:(ldlm_lib.c:1162:target_handle_connect()) MGS: Received new LWP connection from 162@gni99, removing former export from same NID
00010000:00080000:6.0:1581107912.144640:0:19276:0:(ldlm_lib.c:1242:target_handle_connect()) MGS: connection from b5a63e29-e36a-ea42-6e59-5387ded252b0@162@gni99 t0 exp ffff973085d66400 cur 1581107912 last 1581107912
The MGC uses the OBD_CONNECT_MNE_SWAB flag.
The root cause of this problem is that the flag OBD_CONNECT_MNE_SWAB has the same value as OBD_CONNECT_MDS_MDS. OBD_CONNECT_MNE_SWAB was used for 2.2 clients for MNE swabbing, while OBD_CONNECT_MDS_MDS is used to skip the export failure during reconnect for MDS-MDS interactions. Locks taken over MDS-MDS connections are not added to the waiting_locks_list, so there is no eviction on a lock-callback timeout, and so on. This leads to a situation where the MGS cannot cancel a client's locks if that client does not receive or respond to a blocking AST.
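A minimal sketch of that collision, using a placeholder bit value (the real constants live in lustre_idl.h) rather than quoting Lustre source:

/* Conceptual sketch of the flag collision, not Lustre code.  The bit value is
 * a placeholder; what matters is that OBD_CONNECT_MNE_SWAB is the same bit as
 * OBD_CONNECT_MDS_MDS, so an MGC advertising MNE swabbing is indistinguishable
 * from an MDS-MDS peer on the server side. */
#include <stdio.h>
#include <stdint.h>

#define OBD_CONNECT_MDS_MDS   (1ULL << 50)          /* placeholder bit value */
#define OBD_CONNECT_MNE_SWAB  OBD_CONNECT_MDS_MDS   /* the collision: same bit */

int main(void)
{
        /* An MGC connect request advertising MNE swab support. */
        uint64_t mgc_connect_flags = OBD_CONNECT_MNE_SWAB;

        /* Server-side view: the MDS-MDS test also matches the MGC, so its
         * locks are never put on the waiting_locks_list and a client that
         * drops the blocking AST is never evicted; mgs_revoke_lock() then
         * waits forever, as in the backtrace above. */
        if (mgc_connect_flags & OBD_CONNECT_MDS_MDS)
                printf("connection treated as MDS-MDS: no AST timeout, no eviction\n");

        return 0;
}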
Attachments
Issue Links
- is related to:
  - LU-10674 MGS very unstable in 2.10.x (Open)
  - LU-15453 MDT shutdown hangs on mutex_lock, possibly cld_lock (Open)
  - LU-11990 conf-sanity test_66: replace nids fail alone MGS (Resolved)
  - LU-12735 MGS misbehaving in 2.12.2+ (Resolved)
  - LU-15539 clients report mds_mds_connection in connect_flags after lustre update on servers (Resolved)