[LU-12735] MGS misbehaving in 2.12.2+ Created: 09/Sep/19  Updated: 06/Jul/21  Resolved: 11/May/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Stephane Thiell Assignee: Mikhail Pershin
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS 7.6, 2.12.2_116 + LU-11285,LU-12017,LU-11761


Attachments: fir-md1-s1-foreachbt-2019-09-08-22-38-08.log, vmcore-dmesg-oak-md1-s1-lu12735.txt
Issue Links:
Related
is related to LU-13356 lctl conf_param hung on the MGS node Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Our first issue on Fir after upgrading to 2.12.2_119 (Lustre b2_12 2.12.2_116 + LU-11285, LU-12017, LU-11761): the MGS seems to have gone crazy and brought down the server along with 2 MDTs. Quite annoying if this is actually an MGS problem.

Since the upgrade of the filesystem, I've noticed these messages:

[Sun Sep  8 14:12:26 2019][291910.295291] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[Sun Sep  8 14:12:26 2019][291910.308156] LustreError: Skipped 1 previous similar message
[Sun Sep  8 14:12:26 2019][291910.313838] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
[Sun Sep  8 14:12:26 2019][291910.351487] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[Sun Sep  8 14:12:27 2019][291911.201631] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[Sun Sep  8 14:12:27 2019][291911.221161] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
[Sun Sep  8 14:12:53 2019][291936.837519] Lustre: MGS: Connection restored to eb318ae2-201e-a222-0b8e-3d4d1220bc21 (at 10.9.106.8@o2ib4)
[Sun Sep  8 14:12:53 2019][291936.837577] LNetError: 22274:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
[Sun Sep  8 14:12:53 2019][291936.861716] Lustre: Skipped 3411 previous similar messages

Because everything seemed to work OK, we didn't do anything. I think I've seen this occasionally even with 2.10, and the solution is to remount the MGS.
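For context, "remounting the MGS" on a combined MGS/MDS node just means unmounting and remounting the MGT; a minimal sketch, with a hypothetical device path and mount point (ours differ):

# unmount only the MGS/MGT, leaving the MDTs mounted
umount /mnt/mgt
# remount it from its backing device
mount -t lustre /dev/mapper/mgt /mnt/mgt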

So last night, Fir was inaccessible and we got alerts. There wasn't any crash, but the primary MGS/MDS fir-md1-s1 was under heavy load:
 

top - 22:35:21 up 3 days, 17:27,  1 user,  load average: 499.00, 487.24, 455.13
Tasks: 2201 total, 101 running, 2100 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us, 30.2 sy,  0.0 ni, 69.5 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 26356524+total, 99668392 free,  4215400 used, 15968145+buff/cache
KiB Swap:  4194300 total,  4194300 free,        0 used. 23785484+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 22485 root      20   0       0      0      0 R  37.8  0.0 181:28.81 ll_mgs_0029
 22394 root      20   0       0      0      0 R  35.9  0.0 175:39.61 ll_mgs_0001
 22476 root      20   0       0      0      0 R  33.9  0.0 171:57.25 ll_mgs_0020
 22467 root      20   0       0      0      0 R  31.9  0.0 171:41.76 ll_mgs_0011
 22479 root      20   0       0      0      0 R  31.6  0.0 171:59.09 ll_mgs_0023
 22483 root      20   0       0      0      0 R  27.6  0.0 166:36.77 ll_mgs_0027
 22471 root      20   0       0      0      0 R  21.4  0.0 172:23.91 ll_mgs_0015
 22451 root      20   0       0      0      0 R  19.7  0.0 170:26.72 ll_mgs_0003
 22455 root      20   0       0      0      0 R  19.7  0.0 160:29.77 ll_mgs_0007
 22478 root      20   0       0      0      0 R  15.5  0.0 174:40.18 ll_mgs_0022
 22457 root      20   0       0      0      0 R  13.8  0.0 172:22.42 ll_mgs_0008
 22484 root      20   0       0      0      0 R  11.5  0.0 186:13.69 ll_mgs_0028
 22473 root      20   0       0      0      0 R   9.9  0.0 159:36.69 ll_mgs_0017
 16487 root      20   0  164268   4572   1604 R   8.2  0.0   0:00.29 top
 22459 root      20   0       0      0      0 R   7.2  0.0 171:38.46 ll_mgs_0010
 22475 root      20   0       0      0      0 R   6.6  0.0 167:11.56 ll_mgs_0019
 22487 root      20   0       0      0      0 R   6.2  0.0 162:49.27 ll_mgs_0031
 24401 root      20   0       0      0      0 S   4.6  0.0  15:56.76 mdt00_098
 22472 root      20   0       0      0      0 R   3.9  0.0 170:10.49 ll_mgs_0016
 22482 root      20   0       0      0      0 R   3.6  0.0 168:09.82 ll_mgs_0026
 22269 root      20   0       0      0      0 S   3.3  0.0 101:06.47 kiblnd_sd_00_00
 22452 root      20   0       0      0      0 R   3.3  0.0 170:29.48 ll_mgs_0004
 22453 root      20   0       0      0      0 R   3.0  0.0 155:18.83 ll_mgs_0005
 22486 root      20   0       0      0      0 R   3.0  0.0 172:00.85 ll_mgs_0030
 24415 root      20   0       0      0      0 S   3.0  0.0   3:12.69 mdt00_109
 

I should have taken a crash dump there, but instead I tried to umount the MGS, which led to many soft lockups, and then I took a crash dump.

I'm attaching the foreach bt output from the crash dump as fir-md1-s1-foreachbt-2019-09-08-22-38-08.log

The crash dump (taken while the MGS was unmounting) is available on the WC FTP site as vmcore-fir-md1-s1-2019-09-08-22-38-08
Kernel version is 3.10.0-957.27.2.el7_lustre.pl1.x86_64
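(For reference, the foreach bt output can be reproduced from the vmcore with the crash utility, roughly as below, assuming the matching kernel-debuginfo vmlinux is installed; paths are illustrative:)

crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl1.x86_64/vmlinux vmcore-fir-md1-s1-2019-09-08-22-38-08
crash> foreach bt > fir-md1-s1-foreachbt-2019-09-08-22-38-08.log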

Thanks!
Stephane



 Comments   
Comment by Peter Jones [ 09/Sep/19 ]

Mike

Could you please advise?

Peter

Comment by Mikhail Pershin [ 23/Sep/19 ]

Is this occurring often, or is this the only such situation so far?

Comment by Stephane Thiell [ 17/Feb/21 ]

Hi Mike,
This issue is still around with 2.12.6 servers. Sometimes, after an OST failover, the MGS gets into a weird state and keeps disconnecting from clients. From a 2.12.6 client, I can see this when the issue occurs:

Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613594272, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 5624:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e44d608f000) refcount nonzero (1) after lock cleanup; forcing
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 5624:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 12:47:58 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 12:47:58 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message
Feb 17 12:47:58 oak-h01v12.sunet kernel: Lustre: MGC10.0.2.51@o2ib5: Connection restored to 10.0.2.51@o2ib5 (at 10.0.2.51@o2ib5)
Feb 17 12:47:58 oak-h01v12.sunet kernel: Lustre: Skipped 1 previous similar message

Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613594886, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 13169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e4292446900) refcount nonzero (1) after lock cleanup; forcin
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 13169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 12:58:12 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 12:58:12 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message
Feb 17 12:58:12 oak-h01v12.sunet kernel: Lustre: MGC10.0.2.51@o2ib5: Connection restored to 10.0.2.51@o2ib5 (at 10.0.2.51@o2ib5)
Feb 17 12:58:12 oak-h01v12.sunet kernel: Lustre: Skipped 1 previous similar message

Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613595498, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 20180:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e461f1dac00) refcount nonzero (1) after lock cleanup; forcin
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 20180:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 13:08:25 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 13:08:25 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message

etc.
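While a client is stuck in this loop, its view of the MGS connection can be checked directly; a quick sketch, assuming the usual mgc import parameter is exposed on 2.12 clients:

# on the affected client: show the MGC import (connection state, failover NIDs, retry counters)
lctl get_param mgc.*.import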

Then, when unmounting the MGS, the server gets stuck with some kind of ldlm_bl deadlock (see vmcore-dmesg-oak-md1-s1-lu12735.txt). When I take a crash dump, I see this:

crash> ps | grep ">"
>  3840      2   7  ffff8a6569feb180  RU   0.0       0      0  [ldlm_bl_01]
>  3841      2   5  ffff8a6569fec200  RU   0.0       0      0  [ldlm_bl_02]
> 22779      2   8  ffff8a65e97a9080  RU   0.0       0      0  [ldlm_bl_03]
> 22908      2   6  ffff8a5644bca100  RU   0.0       0      0  [ldlm_bl_04]
> 27290      2  19  ffff8a5642ba0000  RU   0.0       0      0  [ldlm_bl_05]
> 27291      2  21  ffff8a5642ba5280  RU   0.0       0      0  [ldlm_bl_06]
> 27292      2   2  ffff8a5642ba2100  RU   0.0       0      0  [ldlm_bl_07]
> 27293      2  12  ffff8a5642ba3180  RU   0.0       0      0  [ldlm_bl_08]
> 27294      2  11  ffff8a5642ba4200  RU   0.0       0      0  [ldlm_bl_09]
> 27295      2  15  ffff8a5664fa6300  RU   0.0       0      0  [ldlm_bl_10]
> 28505      2  17  ffff8a56bdb60000  RU   0.0       0      0  [ldlm_bl_11]
> 28506      2  10  ffff8a56bdb63180  RU   0.0       0      0  [ldlm_bl_12]
> 28507      2   4  ffff8a65c47a2100  RU   0.0       0      0  [ldlm_bl_13]
> 28508      2  16  ffff8a65c37bb180  RU   0.0       0      0  [ldlm_bl_14]
> 28509      2   1  ffff8a65c37ba100  RU   0.0       0      0  [ldlm_bl_15]
> 28510      2  13  ffff8a56517ad280  RU   0.0       0      0  [ldlm_bl_16]
> 28511      2   3  ffff8a56517ae300  RU   0.0       0      0  [ldlm_bl_17]
> 28512      2   0  ffff8a65cab9b180  RU   0.0       0      0  [ldlm_bl_18]
> 28513      2  14  ffff8a65cab9e300  RU   0.0       0      0  [ldlm_bl_19]
> 28514      2  18  ffff8a65cab9d280  RU   0.0       0      0  [ldlm_bl_20]
> 28515      2   9  ffff8a65cab9c200  RU   0.0       0      0  [ldlm_bl_21]
> 28516      2  23  ffff8a65cab9a100  RU   0.0       0      0  [ldlm_bl_22]
> 28517      2  20  ffff8a65cab99080  RU   0.0       0      0  [ldlm_bl_23]
> 28518      2  22  ffff8a65cab98000  RU   0.0       0      0  [ldlm_bl_24]
crash> bt 3840
PID: 3840   TASK: ffff8a6569feb180  CPU: 7   COMMAND: "ldlm_bl_01"
 #0 [ffff8a757dac8e48] crash_nmi_callback at ffffffffaa458537
 #1 [ffff8a757dac8e58] nmi_handle at ffffffffaab8c93c
 #2 [ffff8a757dac8eb0] do_nmi at ffffffffaab8cc18
 #3 [ffff8a757dac8ef0] end_repeat_nmi at ffffffffaab8bd9c
    [exception RIP: native_queued_spin_lock_slowpath+294]
    RIP: ffffffffaa517626  RSP: ffff8a6569ffbc10  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff8a6559155e80  RCX: 0000000000390000
    RDX: ffff8a657e29b8c0  RSI: 0000000000210001  RDI: ffff8a56673df99c
    RBP: ffff8a6569ffbc10   R8: ffff8a757dadb8c0   R9: 0000000000000000
    R10: ffff8a75692b7658  R11: ffff8a75692b7648  R12: ffff8a6569ffbdb8
    R13: ffff8a5652b91c00  R14: ffff8a6559155e80  R15: ffff8a756fc07800
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff8a6569ffbc10] native_queued_spin_lock_slowpath at ffffffffaa517626
 #5 [ffff8a6569ffbc18] queued_spin_lock_slowpath at ffffffffaab7b43b
 #6 [ffff8a6569ffbc28] _raw_spin_lock at ffffffffaab89b50
 #7 [ffff8a6569ffbc38] lock_res_and_lock at ffffffffc123a02c [ptlrpc]
 #8 [ffff8a6569ffbc50] ldlm_lock_cancel at ffffffffc1242f8d [ptlrpc]
 #9 [ffff8a6569ffbc70] ldlm_cancel_lock_for_export at ffffffffc12431c4 [ptlrpc]
#10 [ffff8a6569ffbcb8] ldlm_cancel_locks_for_export_cb at ffffffffc12434ec [ptlrpc]
#11 [ffff8a6569ffbce0] cfs_hash_for_each_relax at ffffffffc0d2b1a0 [libcfs]
#12 [ffff8a6569ffbd60] cfs_hash_for_each_empty at ffffffffc0d2e700 [libcfs]
#13 [ffff8a6569ffbda0] ldlm_export_cancel_locks at ffffffffc12437fa [ptlrpc]
#14 [ffff8a6569ffbe10] ldlm_bl_thread_main at ffffffffc126ccf8 [ptlrpc]
#15 [ffff8a6569ffbec8] kthread at ffffffffaa4c5c21
#16 [ffff8a6569ffbf50] ret_from_fork_nospec_begin at ffffffffaab94ddd
Comment by DELBARY Gael [ 17/Mar/21 ]

Hi,

Unfortunately we hit the same issue. Our Lustre version is a bit different: 2.12.4 for servers and 2.12.5/2.12.6 for clients.

The context is a bit different, because we have found a way to trigger the MGS's weird state.

If a client that holds a lock on the MGS disappears from the LNet network, you then make a modification on the MGS before the LDLM timeout (e.g. an OST add, set_param -P, ...), and you then unmount the MGS, we fall into this issue.
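A rough sketch of the sequence (the commands, values and mount point below are illustrative assumptions, not our exact procedure):

# 1. a client holding an MGS lock disappears abruptly from the LNet network
#    (node powered off, fabric link dropped, ...)
# 2. before the LDLM lock times out, make a configuration change on the MGS, e.g.:
lctl set_param -P osc.*.max_dirty_mb=512
# 3. then unmount the MGS:
umount /mnt/mgt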

In the crash dump we have the same backtrace as Stephane, but this time the crash was immediate on unmount. We think https://jira.whamcloud.com/browse/LU-13356 could help. We will confirm this with a deeper analysis of the crash dump.

One way to avoid this is to make sure all exports are connected... @Stephane, before the unmount, did you have any other weird clients besides the failed OST?
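As a hedged illustration of that check, something like the following on the MGS node should list which NIDs currently hold MGS exports (the exact parameter tree is an assumption and may differ between builds):

# on the MGS: list the exports known to the MGS, one uuid per NID
lctl get_param mgs.MGS.exports.*.uuid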

 

Comment by Stephane Thiell [ 17/Mar/21 ]

Glad to know that I'm not alone, Gaël. I don't think we had any weird clients apart from the OST failover. At the moment, our workaround with Lustre 2.12.6 is to unmount the MGS before stopping any MDT or OST (for failover), just to avoid an MGS crash. Remounting the MGS before starting the OSTs/MDTs seems to be fine. Next time, I will check the exports. It's an annoying problem, especially if you have MDTs co-located with the MGS.
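To make that workaround concrete, the ordering looks roughly like this (mount points and device paths are placeholders, not our real configuration):

# before failing over a target: stop the MGS first to avoid the crash
umount /mnt/mgt
umount /mnt/mdt0              # or the OST being failed over
# ... perform the failover ...
# when bringing things back: remount the MGS before any MDT/OST
mount -t lustre /dev/mapper/mgt /mnt/mgt
mount -t lustre /dev/mapper/mdt0 /mnt/mdt0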

Comment by Etienne Aujames [ 11/May/21 ]

Hi,

We have applied the patch https://review.whamcloud.com/41309 ("LU-13356 client: don't use OBD_CONNECT_MNE_SWAB") in production for several weeks now and the issue has not appeared again.

The issue used to occur after an OST failover.
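For anyone wanting to try the same thing, the change can be fetched from Gerrit and cherry-picked onto a b2_12 tree; a sketch (the patchset number "3" is a guess, check the review page for the current one):

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_12
# fetch change 41309 (refs/changes/<last two digits>/<change>/<patchset>)
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/09/41309/3
git cherry-pick FETCH_HEAD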

Comment by Stephane Thiell [ 11/May/21 ]

Etienne – Great news, thanks! Hope this patch can make it to 2.12.7.

Comment by Peter Jones [ 11/May/21 ]

Good news - thanks

Comment by Matt Rásó-Barnett (Inactive) [ 30/Jun/21 ]

Hello, sorry to comment on a resolved issue, but I believe I'm hitting this issue as well at the moment.

Can I ask if using the patch mentioned in https://jira.whamcloud.com/browse/LU-13356 is the full solution here?

@Stephane - are you using this patch on your 2.12 servers at the moment? I don't see it applied to b2_12 upstream at the moment, which is a bit concerning. This has been quite debilitating for us, so I'm curious why it hasn't made it into 2.12.7.

Thanks,

Matt

Comment by DELBARY Gael [ 30/Jun/21 ]

Hi Matt,

I can't answer in place of Stéphane, but on our side it fully fixes this issue. We have the https://jira.whamcloud.com/browse/LU-13356 backport applied on all our Lustre systems in production.

Gael

Comment by Matt Rásó-Barnett (Inactive) [ 30/Jun/21 ]

Thanks Gael, that's all I wanted to hear! Super - I'm going to test this in our environment now and hopefully fix this issue.

Cheers!
Matt

Comment by Matt Rásó-Barnett (Inactive) [ 05/Jul/21 ]

Hi Gael, sorry I missed this before, but is this a client and server patch? I initially tried deploying it just to our MGS server, before looking more closely, but it didn't have any effect. Have you deployed this across both clients and servers in your environment?

Thanks again,
Matt

Comment by DELBARY Gael [ 06/Jul/21 ]

Hi Matt,

Unfortunately, that's a server-side patch only. In our case, the effect was immediate. Perhaps you have a similar issue but not exactly the same one. Have you got an open Jira ticket for your issue?

Gael

Comment by Matt Rásó-Barnett (Inactive) [ 06/Jul/21 ]

Thanks for confirming. Yes, perhaps I have something else going on too; I've opened my own issue at https://jira.whamcloud.com/browse/LU-14802

Thanks again,
Matt

Comment by Stephane Thiell [ 06/Jul/21 ]

Hey Matt, sorry to hear about your current issues. We have not applied the patch https://review.whamcloud.com/#/c/41309/ (backport for LU-13356) in production yet; I was waiting for it to land in 2.12.7. I also thought it was a server-only patch.
As you have seen in my ticket LU-14695, we have been having issues with the MGS propagating the configuration to the other targets when new OSTs are added, and we also have some kind of llog corruption that we're still trying to understand. But in our case, new clients can still mount properly (could be some luck). I hope Whamcloud will be able to help us understand these config llog / MGS-related problems. I'm going to watch your ticket! Fingers crossed.
