[LU-12735] MGS misbehaving in 2.12.2+ Created: 09/Sep/19 Updated: 06/Jul/21 Resolved: 11/May/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | Mikhail Pershin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Our first issue on Fir after upgrading to 2.12.2_119 (Lustre b2_12 2.12.2_116 +

Since the upgrade of the filesystem, I've noticed these messages:

[Sun Sep 8 14:12:26 2019][291910.295291] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[Sun Sep 8 14:12:26 2019][291910.308156] LustreError: Skipped 1 previous similar message
[Sun Sep 8 14:12:26 2019][291910.313838] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
[Sun Sep 8 14:12:26 2019][291910.351487] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[Sun Sep 8 14:12:27 2019][291911.201631] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[Sun Sep 8 14:12:27 2019][291911.221161] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
[Sun Sep 8 14:12:53 2019][291936.837519] Lustre: MGS: Connection restored to eb318ae2-201e-a222-0b8e-3d4d1220bc21 (at 10.9.106.8@o2ib4)
[Sun Sep 8 14:12:53 2019][291936.837577] LNetError: 22274:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
[Sun Sep 8 14:12:53 2019][291936.861716] Lustre: Skipped 3411 previous similar messages

Because everything seemed to work OK, we didn't do anything. I think I've seen this occasionally even with 2.10, and a solution is to remount the MGS.

So last night, Fir was inaccessible and we got alerts. There wasn't any crash, but the primary MGS/MDS fir-md1-s1 was under heavy load:

top - 22:35:21 up 3 days, 17:27, 1 user, load average: 499.00, 487.24, 455.13
Tasks: 2201 total, 101 running, 2100 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 30.2 sy, 0.0 ni, 69.5 id, 0.2 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 26356524+total, 99668392 free, 4215400 used, 15968145+buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 23785484+avail Mem

  PID USER  PR  NI    VIRT   RES  SHR S %CPU %MEM     TIME+ COMMAND
22485 root  20   0       0     0    0 R 37.8  0.0 181:28.81 ll_mgs_0029
22394 root  20   0       0     0    0 R 35.9  0.0 175:39.61 ll_mgs_0001
22476 root  20   0       0     0    0 R 33.9  0.0 171:57.25 ll_mgs_0020
22467 root  20   0       0     0    0 R 31.9  0.0 171:41.76 ll_mgs_0011
22479 root  20   0       0     0    0 R 31.6  0.0 171:59.09 ll_mgs_0023
22483 root  20   0       0     0    0 R 27.6  0.0 166:36.77 ll_mgs_0027
22471 root  20   0       0     0    0 R 21.4  0.0 172:23.91 ll_mgs_0015
22451 root  20   0       0     0    0 R 19.7  0.0 170:26.72 ll_mgs_0003
22455 root  20   0       0     0    0 R 19.7  0.0 160:29.77 ll_mgs_0007
22478 root  20   0       0     0    0 R 15.5  0.0 174:40.18 ll_mgs_0022
22457 root  20   0       0     0    0 R 13.8  0.0 172:22.42 ll_mgs_0008
22484 root  20   0       0     0    0 R 11.5  0.0 186:13.69 ll_mgs_0028
22473 root  20   0       0     0    0 R  9.9  0.0 159:36.69 ll_mgs_0017
16487 root  20   0  164268  4572 1604 R  8.2  0.0   0:00.29 top
22459 root  20   0       0     0    0 R  7.2  0.0 171:38.46 ll_mgs_0010
22475 root  20   0       0     0    0 R  6.6  0.0 167:11.56 ll_mgs_0019
22487 root  20   0       0     0    0 R  6.2  0.0 162:49.27 ll_mgs_0031
24401 root  20   0       0     0    0 S  4.6  0.0  15:56.76 mdt00_098
22472 root  20   0       0     0    0 R  3.9  0.0 170:10.49 ll_mgs_0016
22482 root  20   0       0     0    0 R  3.6  0.0 168:09.82 ll_mgs_0026
22269 root  20   0       0     0    0 S  3.3  0.0 101:06.47 kiblnd_sd_00_00
22452 root  20   0       0     0    0 R  3.3  0.0 170:29.48 ll_mgs_0004
22453 root  20   0       0     0    0 R  3.0  0.0 155:18.83 ll_mgs_0005
22486 root  20   0       0     0    0 R  3.0  0.0 172:00.85 ll_mgs_0030
24415 root  20   0       0     0    0 S  3.0  0.0   3:12.69 mdt00_109

I should have taken a crash dump there, but instead I tried to unmount the MGS, which led to many soft lockups, and then I took a crash dump. I'm attaching the foreach bt output of the crash dump as fir-md1-s1-foreachbt-2019-09-08-22-38-08.log. The crash dump (taken while the MGS was unmounting) is available in the WC ftp as vmcore-fir-md1-s1-2019-09-08-22-38-08.

Thanks!
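For reference, a minimal sketch of how a dump like the attached one can be collected from a node in this state; the kdump setup, debuginfo path and file names below are assumptions, not details from this system:

# force a panic + kdump vmcore while the node is still wedged (requires sysrq and kdump to be configured)
echo c > /proc/sysrq-trigger

# afterwards, open the vmcore with the matching debuginfo kernel and save all backtraces
crash /usr/lib/debug/lib/modules/<crashed-kernel>/vmlinux /var/crash/<timestamp>/vmcore
crash> foreach bt > foreachbt.log
|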
| Comments |
| Comment by Peter Jones [ 09/Sep/19 ] |
|
Mike, could you please advise? Peter |
| Comment by Mikhail Pershin [ 23/Sep/19 ] |
|
Is this occurring often, or is this the only such situation so far? |
| Comment by Stephane Thiell [ 17/Feb/21 ] |
|
Hi Mike,

Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613594272, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 5624:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e44d608f000) refcount nonzero (1) after lock cleanup; forcing
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 5624:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 12:47:58 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 12:47:58 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message
Feb 17 12:47:58 oak-h01v12.sunet kernel: Lustre: MGC10.0.2.51@o2ib5: Connection restored to 10.0.2.51@o2ib5 (at 10.0.2.51@o2ib5)
Feb 17 12:47:58 oak-h01v12.sunet kernel: Lustre: Skipped 1 previous similar message
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613594886, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 13169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e4292446900) refcount nonzero (1) after lock cleanup; forcin
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 13169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 12:58:12 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 12:58:12 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message
Feb 17 12:58:12 oak-h01v12.sunet kernel: Lustre: MGC10.0.2.51@o2ib5: Connection restored to 10.0.2.51@o2ib5 (at 10.0.2.51@o2ib5)
Feb 17 12:58:12 oak-h01v12.sunet kernel: Lustre: Skipped 1 previous similar message
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613595498, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 20180:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e461f1dac00) refcount nonzero (1) after lock cleanup; forcin
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 20180:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 13:08:25 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 13:08:25 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message
etc.

Then, when unmounting the MGS, the server gets stuck with some kind of ldlm_bl deadlock (see vmcore-dmesg-oak-md1-s1-lu12735.txt):
crash> ps | grep ">"
> 3840 2 7 ffff8a6569feb180 RU 0.0 0 0 [ldlm_bl_01]
> 3841 2 5 ffff8a6569fec200 RU 0.0 0 0 [ldlm_bl_02]
> 22779 2 8 ffff8a65e97a9080 RU 0.0 0 0 [ldlm_bl_03]
> 22908 2 6 ffff8a5644bca100 RU 0.0 0 0 [ldlm_bl_04]
> 27290 2 19 ffff8a5642ba0000 RU 0.0 0 0 [ldlm_bl_05]
> 27291 2 21 ffff8a5642ba5280 RU 0.0 0 0 [ldlm_bl_06]
> 27292 2 2 ffff8a5642ba2100 RU 0.0 0 0 [ldlm_bl_07]
> 27293 2 12 ffff8a5642ba3180 RU 0.0 0 0 [ldlm_bl_08]
> 27294 2 11 ffff8a5642ba4200 RU 0.0 0 0 [ldlm_bl_09]
> 27295 2 15 ffff8a5664fa6300 RU 0.0 0 0 [ldlm_bl_10]
> 28505 2 17 ffff8a56bdb60000 RU 0.0 0 0 [ldlm_bl_11]
> 28506 2 10 ffff8a56bdb63180 RU 0.0 0 0 [ldlm_bl_12]
> 28507 2 4 ffff8a65c47a2100 RU 0.0 0 0 [ldlm_bl_13]
> 28508 2 16 ffff8a65c37bb180 RU 0.0 0 0 [ldlm_bl_14]
> 28509 2 1 ffff8a65c37ba100 RU 0.0 0 0 [ldlm_bl_15]
> 28510 2 13 ffff8a56517ad280 RU 0.0 0 0 [ldlm_bl_16]
> 28511 2 3 ffff8a56517ae300 RU 0.0 0 0 [ldlm_bl_17]
> 28512 2 0 ffff8a65cab9b180 RU 0.0 0 0 [ldlm_bl_18]
> 28513 2 14 ffff8a65cab9e300 RU 0.0 0 0 [ldlm_bl_19]
> 28514 2 18 ffff8a65cab9d280 RU 0.0 0 0 [ldlm_bl_20]
> 28515 2 9 ffff8a65cab9c200 RU 0.0 0 0 [ldlm_bl_21]
> 28516 2 23 ffff8a65cab9a100 RU 0.0 0 0 [ldlm_bl_22]
> 28517 2 20 ffff8a65cab99080 RU 0.0 0 0 [ldlm_bl_23]
> 28518 2 22 ffff8a65cab98000 RU 0.0 0 0 [ldlm_bl_24]
crash> bt 3840
PID: 3840 TASK: ffff8a6569feb180 CPU: 7 COMMAND: "ldlm_bl_01"
#0 [ffff8a757dac8e48] crash_nmi_callback at ffffffffaa458537
#1 [ffff8a757dac8e58] nmi_handle at ffffffffaab8c93c
#2 [ffff8a757dac8eb0] do_nmi at ffffffffaab8cc18
#3 [ffff8a757dac8ef0] end_repeat_nmi at ffffffffaab8bd9c
[exception RIP: native_queued_spin_lock_slowpath+294]
RIP: ffffffffaa517626 RSP: ffff8a6569ffbc10 RFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff8a6559155e80 RCX: 0000000000390000
RDX: ffff8a657e29b8c0 RSI: 0000000000210001 RDI: ffff8a56673df99c
RBP: ffff8a6569ffbc10 R8: ffff8a757dadb8c0 R9: 0000000000000000
R10: ffff8a75692b7658 R11: ffff8a75692b7648 R12: ffff8a6569ffbdb8
R13: ffff8a5652b91c00 R14: ffff8a6559155e80 R15: ffff8a756fc07800
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff8a6569ffbc10] native_queued_spin_lock_slowpath at ffffffffaa517626
#5 [ffff8a6569ffbc18] queued_spin_lock_slowpath at ffffffffaab7b43b
#6 [ffff8a6569ffbc28] _raw_spin_lock at ffffffffaab89b50
#7 [ffff8a6569ffbc38] lock_res_and_lock at ffffffffc123a02c [ptlrpc]
#8 [ffff8a6569ffbc50] ldlm_lock_cancel at ffffffffc1242f8d [ptlrpc]
#9 [ffff8a6569ffbc70] ldlm_cancel_lock_for_export at ffffffffc12431c4 [ptlrpc]
#10 [ffff8a6569ffbcb8] ldlm_cancel_locks_for_export_cb at ffffffffc12434ec [ptlrpc]
#11 [ffff8a6569ffbce0] cfs_hash_for_each_relax at ffffffffc0d2b1a0 [libcfs]
#12 [ffff8a6569ffbd60] cfs_hash_for_each_empty at ffffffffc0d2e700 [libcfs]
#13 [ffff8a6569ffbda0] ldlm_export_cancel_locks at ffffffffc12437fa [ptlrpc]
#14 [ffff8a6569ffbe10] ldlm_bl_thread_main at ffffffffc126ccf8 [ptlrpc]
#15 [ffff8a6569ffbec8] kthread at ffffffffaa4c5c21
#16 [ffff8a6569ffbf50] ret_from_fork_nospec_begin at ffffffffaab94ddd
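As a side note, a quick way to confirm that all of the ldlm_bl threads above are stuck in the same code path is to dump every backtrace from the crash session and grep for the hot functions; a minimal sketch (the output file name is arbitrary):

crash> foreach bt > foreach-bt.txt
crash> !grep -c native_queued_spin_lock_slowpath foreach-bt.txt   # threads spinning on a spinlock
crash> !grep -c ldlm_export_cancel_locks foreach-bt.txt           # ldlm_bl threads cancelling export locks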
|
| Comment by DELBARY Gael [ 17/Mar/21 ] |
|
Hi, Unfortunately we hit the same issue. Our Lustre version is a bit different: 2.12.4 for the servers (and 2.12.5/6 for the clients). The context is also a bit different, because we have found some elements that trigger the MGS's weird state. If a client that holds a lock on the MGS disappears from the LNet network, you then make a modification on the MGS before the ldlm timeout (e.g. adding an OST, set_param -P, ...), and you then unmount the MGS, we fall into this issue (see the sketch below). In the crash we have the same backtrace as Stephane, but this time the crash was immediate on unmount. We think https://jira.whamcloud.com/browse/LU-13356 could help; we will confirm this with a deeper analysis of the crash. One way to avoid this is to be sure all exports are connected... @Stephane, before the unmount, did you have any other misbehaving clients besides the failed OST?
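To make the trigger sequence concrete, here is a rough sketch of the three steps; the MGS NID is taken from the logs above, while the host names, filesystem name, mount points and the parameter used are hypothetical examples:

# 1. A client that holds an MGS lock (e.g. it simply has the filesystem mounted) drops off LNet
client1# mount -t lustre 10.0.2.51@o2ib5:/fsname /mnt/fsname
client1# poweroff -f                               # the client vanishes without cleanly disconnecting

# 2. Before the ldlm timeout (~300 s) expires, make a configuration change on the MGS
mgs# lctl set_param -P mdt.*.identity_upcall=NONE  # any "set_param -P", conf_param or OST addition will do

# 3. Unmount the MGS while the stale export still holds its lock
mgs# umount /mnt/mgs                               # this is where the ldlm_bl threads get stuck and the node crashes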
|
| Comment by Stephane Thiell [ 17/Mar/21 ] |
|
Glad to know that I'm not alone, Gaël. I don't think we had any weird clients apart from the OST failover. At the moment, our workaround with Lustre 2.12.6 is to unmount the MGS before stopping any MDT or OST (for failover), just to avoid an MGS crash; remounting the MGS before starting the OSTs/MDTs seems to be fine (sketched below). Next time, I will check the exports. It's an annoying problem, especially if you have MDTs co-located with the MGS.
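For clarity, the failover ordering we use as a workaround looks roughly like this; device and mount-point names are placeholders, not our actual configuration:

# planned failover: stop the MGS first to avoid the crash on a busy MGS
mgs# umount /mnt/mgs
# then stop / fail over the MDTs and OSTs as needed
mds# umount /mnt/mdt0
oss# umount /mnt/ost0
# when bringing things back, remount the MGS before starting the MDTs/OSTs
mgs# mount -t lustre /dev/mapper/mgs /mnt/mgs
mds# mount -t lustre /dev/mapper/mdt0 /mnt/mdt0
oss# mount -t lustre /dev/mapper/ost0 /mnt/ost0
|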
| Comment by Etienne Aujames [ 11/May/21 ] |
|
Hi, We have applied the patch https://review.whamcloud.com/41309 (the LU-13356 backport). The issue used to occur after an OST failover.
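For anyone who wants to try the same change on a 2.12 tree before it lands, a possible way to fetch and apply it is sketched below; the Gerrit fetch ref, patch-set number and build steps are assumptions, so check the review page for the exact values:

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_12
# Gerrit change 41309: the ref is refs/changes/<last two digits>/<change>/<patch-set>
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/09/41309/1
git cherry-pick FETCH_HEAD
# rebuild the server packages as usual, e.g.:
sh autogen.sh && ./configure && make rpms
|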
| Comment by Stephane Thiell [ 11/May/21 ] |
|
Etienne – Great news, thanks! Hope this patch can make it to 2.12.7. |
| Comment by Peter Jones [ 11/May/21 ] |
|
Good news - thanks |
| Comment by Matt Rásó-Barnett (Inactive) [ 30/Jun/21 ] |
|
Hello, sorry to comment on a resolved issue, but I believe I'm hitting this issue as well at the moment. Can I ask if using the patch mentioned in https://jira.whamcloud.com/browse/LU-13356 is the full solution here? @Stephane - are you using this patch in your 2.12 servers at the moment? I don't see it applied to b2_12 upstream, which is a bit concerning - it's been quite debilitating for us, so I'm curious why it hasn't made it into 2.12.7. Thanks, Matt |
| Comment by DELBARY Gael [ 30/Jun/21 ] |
|
Hi Matt, I can't answer in place of Stéphane, but on our side it fully fixes this issue. We have the https://jira.whamcloud.com/browse/LU-13356 backport on all of our production Lustre systems. Gael |
| Comment by Matt Rásó-Barnett (Inactive) [ 30/Jun/21 ] |
|
Thanks Gael, that's all I wanted to hear! Super - I'm going to test this in our environment now and hopefully fix this issue. Cheers! |
| Comment by Matt Rásó-Barnett (Inactive) [ 05/Jul/21 ] |
|
Hi Gael, sorry I missed this before, but is this a client and server patch? I tried deploying it just to our MGS server initially, before looking more closely, but it didn't have any effect. Have you deployed this across both clients & servers in your environment? Thanks again, |
| Comment by DELBARY Gael [ 06/Jul/21 ] |
|
Hi Matt, Unfortunately that's a server-only patch. In our case the effect was immediate. Perhaps you have a similar issue but not exactly the same one. Have you opened a Jira ticket for your issue? Gael |
| Comment by Matt Rásó-Barnett (Inactive) [ 06/Jul/21 ] |
|
Thanks for confirming. Yes, perhaps I have something else going on too; I've opened my own issue at https://jira.whamcloud.com/browse/LU-14802. Thanks again, |
| Comment by Stephane Thiell [ 06/Jul/21 ] |
|
Hey Matt, sorry to hear about your current issues. We have not applied the patch https://review.whamcloud.com/#/c/41309/ (backport for LU-13356). |