[LU-13356] lctl conf_param hung on the MGS node Created: 11/Mar/20 Updated: 22/Sep/23 Resolved: 14/Apr/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.9 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Epic/Theme: | LTS12, mgs |
| Severity: | 3 |
| Description |
|
From the vmcore, the lctl thread was sleeping, waiting for a lock to be granted:

PID: 26309  TASK: ffff9730a36b1040  CPU: 6  COMMAND: "lctl"
 #0 [ffff973084edfa08] __schedule at ffffffffaf369b97
 #1 [ffff973084edfa98] schedule at ffffffffaf36a099
 #2 [ffff973084edfaa8] ldlm_completion_ast at ffffffffc1582d45 [ptlrpc]
 #3 [ffff973084edfb50] mgs_completion_ast_generic at ffffffffc141f76c [mgs]
 #4 [ffff973084edfb98] mgs_completion_ast_config at ffffffffc141f983 [mgs]
 #5 [ffff973084edfba8] ldlm_cli_enqueue_local at ffffffffc1583ecc [ptlrpc]
 #6 [ffff973084edfc48] mgs_revoke_lock at ffffffffc14243b4 [mgs]
 #7 [ffff973084edfcf0] mgs_set_param at ffffffffc1441826 [mgs]
 #8 [ffff973084edfd50] mgs_iocontrol at ffffffffc14271ca [mgs]
 #9 [ffff973084edfdd0] class_handle_ioctl at ffffffffc10a40cd [obdclass]
#10 [ffff973084edfe60] obd_class_ioctl at ffffffffc10a46d2 [obdclass]
#11 [ffff973084edfe80] do_vfs_ioctl at ffffffffaee56490
#12 [ffff973084edff00] sys_ioctl at ffffffffaee56731
#13 [ffff973084edff50] system_call_fastpath at ffffffffaf376ddb

crash-7.2.5_new> ldlm_resource 0xffff9730f7a6ab40
struct ldlm_resource {
lr_ns_bucket = 0xffff9730a7f8cb18,
lr_hash = {
next = 0x0,
pprev = 0xffff9730f8d22608
},
lr_refcount = {
counter = 295
},
lr_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
lr_granted = {
next = 0xffff97308638c720,
prev = 0xffff973086a542a0
},
lr_waiting = {
next = 0xffff973085da84e0,
prev = 0xffff973082f21920
},
lr_enqueueing = {
next = 0xffff9730f7a6ab80,
prev = 0xffff9730f7a6ab80
},
lr_name = {
name = {3546639893419028083, 0, 0, 0}
},
{
lr_itree = 0x0,
lr_ibits_queues = 0x0
},
{
lr_contention_time = 0,
lr_lvb_inode = 0x0
},
lr_type = LDLM_PLAIN,
lr_lvb_len = 0,
The wait is infinite. crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63000
exp_client_uuid = {
uuid = "bb972e10-11bc-7387-336f-6a82a0e0dd52\000\000\000"
}
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d63400
exp_client_uuid = {
uuid = "5915d2ba-94aa-bb2e-5b88-144f699f7fa1\000\000\000"
}
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d61800
exp_client_uuid = {
uuid = "9eeccfff-2a06-f62b-6132-9799e0bcd8aa\000\000\000"
}
crash-7.2.5_new> obd_export.exp_client_uuid 0xffff973085d66400
exp_client_uuid = {
uuid = "b5a63e29-e36a-ea42-6e59-5387ded252b0\000\000\000"
}
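
The backtrace explains the hang: mgs_set_param() -> mgs_revoke_lock() -> ldlm_cli_enqueue_local() ends up in ldlm_completion_ast(), which waits without any upper bound for the configuration lock to be granted, and granting it requires every export still holding the lock (the lr_granted list above) to cancel. A single unresponsive client therefore parks lctl conf_param forever. Below is a minimal user-space sketch of that pattern, as an analogy only (not Lustre code; the thread count, ids and sleep times are invented for illustration):

/*
 * Analogy for the MGS-side wait: the main thread waits with no timeout
 * until the holder count reaches zero -- like ldlm_completion_ast()
 * waiting for the config lock grant.  Holder 2 never answers, so the
 * waiter sleeps forever, exactly as the lctl thread in the vmcore.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  granted = PTHREAD_COND_INITIALIZER;
static int holders = 3;                 /* "clients" holding the lock */

static void *client(void *arg)
{
        long id = (long)arg;

        if (id == 2)                    /* unreachable client: never cancels */
                pause();
        sleep(1);                       /* responsive clients cancel promptly */
        pthread_mutex_lock(&lock);
        holders--;
        pthread_cond_broadcast(&granted);
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t[3];

        for (long i = 0; i < 3; i++)
                pthread_create(&t[i], NULL, client, (void *)i);

        pthread_mutex_lock(&lock);
        while (holders > 0)             /* no timeout, as in the vmcore */
                pthread_cond_wait(&granted, &lock);
        pthread_mutex_unlock(&lock);
        printf("lock granted\n");       /* never reached */
        return 0;
}

Built with cc -pthread, this hangs in pthread_cond_wait() the same way the lctl thread hangs in ldlm_completion_ast(); only a bounded wait (pthread_cond_timedwait() here, or a timed-out enqueue on the Lustre side) would let the caller notice the dead holder and recover.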
It looks like the problem started with network errors:

[ 2031.028363] LNet: 30781:0:(lib-msg.c:703:lnet_attempt_msg_resend()) msg 0@<0:0>->10.10.100.6@o2ib3 exceeded retry count 3
[ 2039.570481] LustreError: 166-1: MGC10.10.100.3@o2ib3: Connection to MGS (at 10.10.100.3@o2ib3) was lost; in progress operations using this service will fail
[ 2039.586364] LustreError: 19280:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1581103875, 780s ago), entering recovery for MGS@10.10.100.3@o2ib3 ns: MGC10.10.100.3@o2ib3 lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 4/1,0 mode: --/CR res: [0x3138323131786e73:0x0:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
[ 2039.628629] Lustre: MGS: Received new LWP connection from 10.10.100.3@o2ib3, removing former export from same NID
[ 2039.639855] Lustre: Skipped 8 previous similar messages
[ 2039.646053] Lustre: MGS: Connection restored to 56d1a214-8cb4-e698-a1b5-8ec5fd85505f (at 10.10.100.3@o2ib3)
[ 2039.656802] Lustre: Skipped 7 previous similar messages
[ 2039.663325] LustreError: 31823:0:(ldlm_resource.c:1159:ldlm_resource_complain()) MGC10.10.100.3@o2ib3: namespace resource [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 2039.663336] LustreError: 19280:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
[ 2039.695497] Lustre: 31823:0:(ldlm_resource.c:1772:ldlm_resource_dump()) --- Resource: [0x3138323131786e73:0x0:0x0].0x0 (ffff9731734a8d80) refcount = 2
[ 2039.711112] Lustre: 31823:0:(ldlm_resource.c:1789:ldlm_resource_dump()) Waiting locks:
[ 2039.720115] Lustre: 31823:0:(ldlm_resource.c:1791:ldlm_resource_dump()) ### ### ns: ?? lock: ffff9711df178000/0x5c3c0316dd62b431 lrc: 2/0,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x5c3c0316dd62b438 expref: -99 pid: 19280 timeout: 0 lvb_type: 0
....

In the Lustre debug log, the MGS keeps accepting new LWP connections that replace the former exports from the same NIDs:

00010000:02000400:6.0:1581107912.144637:0:19276:0:(ldlm_lib.c:1162:target_handle_connect()) MGS: Received new LWP connection from 162@gni99, removing former export from same NID
00010000:00080000:6.0:1581107912.144640:0:19276:0:(ldlm_lib.c:1242:target_handle_connect()) MGS: connection from b5a63e29-e36a-ea42-6e59-5387ded252b0@162@gni99 t0 exp ffff973085d66400 cur 1581107912 last 1581107912

The MGC uses the OBD_CONNECT_MNE_SWAB flag.
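
As a side note (a generic check, not taken from this ticket's logs): whether a client's MGC negotiated OBD_CONNECT_MNE_SWAB is visible in the connect_flags of its import state, e.g.

lctl get_param mgc.*.import

should list mne_swab among the connect_flags when the flag is in effect.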
| Comments |
| Comment by Gerrit Updater [ 11/Mar/20 ] |
|
Alexander Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/37880 |
| Comment by Gerrit Updater [ 14/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37880/ |
| Comment by Peter Jones [ 14/Apr/20 ] |
|
Landed for 2.14 |
| Comment by Aurelien Degremont (Inactive) [ 01/Oct/20 ] |
|
If I understand correctly, that means you cannot do an IR if one client is not responding. This problem seems important enough to be backported to 2.12 LTS, no?
|
| Comment by Nathan Dauchy (Inactive) [ 16/Oct/20 ] |
|
We also hit this on 2.12, and would benefit from a backport. Chances are good there is at least one client not responding somewhere on a large system. |
| Comment by Gerrit Updater [ 25/Jan/21 ] |
|
Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41309 |
| Comment by Etienne Aujames [ 25/Jan/21 ] |
|
Hello. We hit this issue in production, so I backported the patch https://review.whamcloud.com/37880/ to b2_12. I am aware this patch removes support for OBD_CONNECT_MNE_SWAB, so I don't expect it to land on b2_12.
| Comment by Gerrit Updater [ 05/May/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41309/ |