Details
Description
Our first issue on Fir after upgrading to 2.12.2_119 (Lustre b2_12 2.12.2_116 + LU-11285, LU-12017, LU-11761): the MGS seems to have gone crazy and brought down the server along with 2 MDTs. Quite annoying if that is actually an MGS problem.
Since the filesystem upgrade, I've noticed these messages:
[Sun Sep 8 14:12:26 2019][291910.295291] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[Sun Sep 8 14:12:26 2019][291910.308156] LustreError: Skipped 1 previous similar message
[Sun Sep 8 14:12:26 2019][291910.313838] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
[Sun Sep 8 14:12:26 2019][291910.351487] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[Sun Sep 8 14:12:27 2019][291911.201631] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[Sun Sep 8 14:12:27 2019][291911.221161] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
[Sun Sep 8 14:12:53 2019][291936.837519] Lustre: MGS: Connection restored to eb318ae2-201e-a222-0b8e-3d4d1220bc21 (at 10.9.106.8@o2ib4)
[Sun Sep 8 14:12:53 2019][291936.837577] LNetError: 22274:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
[Sun Sep 8 14:12:53 2019][291936.861716] Lustre: Skipped 3411 previous similar messages
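For what it's worth, when we see these we usually just check whether the MGC reconnected afterwards, roughly like this (standard lctl import parameter; the exact MGC name will differ per setup):

    # run on the node carrying the MGC (here the MGS/MDS itself)
    lctl get_param mgc.*.import    # shows import state, e.g. FULL once reconnected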
Because everything seemed to work OK, we didn't do anything. I think I've seen this occasionally even with 2.10, and a workaround is to remount the MGS.
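For reference, that remount workaround looks roughly like this; the device path and mount point below are placeholders, not our actual configuration (and on an HA pair the failover stack should be used instead of doing it by hand):

    umount /mnt/mgs                              # stop the MGS target
    mount -t lustre /dev/mapper/mgs /mnt/mgs     # start it again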
So last night, Fir was inaccessible and we got alerts. There wasn't any crash, but the primary MGS/MDS fir-md1-s1 was under heavy load:
top - 22:35:21 up 3 days, 17:27, 1 user, load average: 499.00, 487.24, 455.13
Tasks: 2201 total, 101 running, 2100 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 30.2 sy, 0.0 ni, 69.5 id, 0.2 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 26356524+total, 99668392 free, 4215400 used, 15968145+buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 23785484+avail Mem

  PID USER PR NI   VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
22485 root 20  0      0    0    0 R 37.8  0.0 181:28.81 ll_mgs_0029
22394 root 20  0      0    0    0 R 35.9  0.0 175:39.61 ll_mgs_0001
22476 root 20  0      0    0    0 R 33.9  0.0 171:57.25 ll_mgs_0020
22467 root 20  0      0    0    0 R 31.9  0.0 171:41.76 ll_mgs_0011
22479 root 20  0      0    0    0 R 31.6  0.0 171:59.09 ll_mgs_0023
22483 root 20  0      0    0    0 R 27.6  0.0 166:36.77 ll_mgs_0027
22471 root 20  0      0    0    0 R 21.4  0.0 172:23.91 ll_mgs_0015
22451 root 20  0      0    0    0 R 19.7  0.0 170:26.72 ll_mgs_0003
22455 root 20  0      0    0    0 R 19.7  0.0 160:29.77 ll_mgs_0007
22478 root 20  0      0    0    0 R 15.5  0.0 174:40.18 ll_mgs_0022
22457 root 20  0      0    0    0 R 13.8  0.0 172:22.42 ll_mgs_0008
22484 root 20  0      0    0    0 R 11.5  0.0 186:13.69 ll_mgs_0028
22473 root 20  0      0    0    0 R  9.9  0.0 159:36.69 ll_mgs_0017
16487 root 20  0 164268 4572 1604 R  8.2  0.0   0:00.29 top
22459 root 20  0      0    0    0 R  7.2  0.0 171:38.46 ll_mgs_0010
22475 root 20  0      0    0    0 R  6.6  0.0 167:11.56 ll_mgs_0019
22487 root 20  0      0    0    0 R  6.2  0.0 162:49.27 ll_mgs_0031
24401 root 20  0      0    0    0 S  4.6  0.0  15:56.76 mdt00_098
22472 root 20  0      0    0    0 R  3.9  0.0 170:10.49 ll_mgs_0016
22482 root 20  0      0    0    0 R  3.6  0.0 168:09.82 ll_mgs_0026
22269 root 20  0      0    0    0 S  3.3  0.0 101:06.47 kiblnd_sd_00_00
22452 root 20  0      0    0    0 R  3.3  0.0 170:29.48 ll_mgs_0004
22453 root 20  0      0    0    0 R  3.0  0.0 155:18.83 ll_mgs_0005
22486 root 20  0      0    0    0 R  3.0  0.0 172:00.85 ll_mgs_0030
24415 root 20  0      0    0    0 S  3.0  0.0   3:12.69 mdt00_109
I should have taken a crash dump there, but instead I tried to umount the MGS, which led to many soft lockups, and then I took a crash dump.
I'm attaching the foreach bt output from the crash dump as fir-md1-s1-foreachbt-2019-09-08-22-38-08.log
The crash dump (taken while the MGS was unmounting) is available on the WC ftp as vmcore-fir-md1-s1-2019-09-08-22-38-08
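For reference, the dump and the attached backtraces were produced the usual way, assuming kdump is configured (paths below are illustrative, not our exact layout):

    echo c > /proc/sysrq-trigger                       # panic the node so kdump writes the vmcore
    crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
          /var/crash/<dump-dir>/vmcore                 # open the dump with the crash utility
    crash> foreach bt                                  # backtrace of every task (contents of the attached .log)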
Kernel version is 3.10.0-957.27.2.el7_lustre.pl1.x86_64
Thanks!
Stephane
Attachments
Issue Links
- is related to: LU-13356 lctl conf_param hung on the MGS node (Resolved)
Hey Matt, sorry to hear about your current issues. We have not applied the patch https://review.whamcloud.com/#/c/41309/ (backport for LU-13356) in production yet; I was waiting for it to land in 2.12.7. I also thought it was a server-only patch. As you have seen in my ticket LU-14695, we have been having issues with the MGS propagating the configuration to the other targets when new OSTs are added, and we also have some kind of llog corruption that we're still trying to understand. But in our case, new clients can still mount properly (could be some luck). I hope Whamcloud will be able to help understand these config llog / MGS-related problems. I'm going to watch your ticket! Fingers crossed.
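In case it helps with comparing notes, this is roughly how we have been looking at the config llogs on our MGS (standard lctl llog commands as I understand them; "fir" is our fsname, adjust for yours):

    # run on the MGS node
    lctl --device MGS llog_catlist             # list the config logs held by the MGS
    lctl --device MGS llog_print fir-client    # dump one log, e.g. the client config log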