Details
Description
Our first issue on Fir after upgrading to 2.12.2_119 (Lustre b2_12 2.12.2_116 + LU-11285, LU-12017, LU-11761): the MGS seems to have gone crazy and brought down the server along with 2 MDTs. Quite annoying if that is actually an MGS problem.
Since the upgrade of the filesystem, I've noticed these messages:
[Sun Sep 8 14:12:26 2019][291910.295291] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[Sun Sep 8 14:12:26 2019][291910.308156] LustreError: Skipped 1 previous similar message
[Sun Sep 8 14:12:26 2019][291910.313838] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
[Sun Sep 8 14:12:26 2019][291910.351487] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[Sun Sep 8 14:12:27 2019][291911.201631] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[Sun Sep 8 14:12:27 2019][291911.221161] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
[Sun Sep 8 14:12:53 2019][291936.837519] Lustre: MGS: Connection restored to eb318ae2-201e-a222-0b8e-3d4d1220bc21 (at 10.9.106.8@o2ib4)
[Sun Sep 8 14:12:53 2019][291936.837577] LNetError: 22274:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
[Sun Sep 8 14:12:53 2019][291936.861716] Lustre: Skipped 3411 previous similar messages
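For what it's worth, the state the MGC thinks it is in can be checked directly on the server with lctl; a minimal sketch (the mgc.*.import parameter name is an assumption about what this 2.12 build exposes):

# list local obd devices, including the MGC and the MGS itself
lctl dl
# dump the MGC import state (connection status, failover NIDs); parameter assumed present
lctl get_param mgc.*.import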
Because everything seemed to work OK, we didn't do anything. I think I've seen this occasionally even with 2.10, and a workaround is to remount the MGS (see the sketch below).
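A minimal sketch of that remount workaround, assuming the MGS is a separate target (the device and mountpoint names below are hypothetical, not our actual layout):

# stop the MGS target; MGCs on clients and servers reconnect once it is back
umount /mnt/fir-mgs
# start it again
mount -t lustre /dev/mapper/mgs /mnt/fir-mgs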
So last night, Fir was inaccessible and we got alerts. There wasn't any crash, but the primary MGS/MDS fir-md1-s1 was under heavy load:
top - 22:35:21 up 3 days, 17:27, 1 user, load average: 499.00, 487.24, 455.13
Tasks: 2201 total, 101 running, 2100 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 30.2 sy, 0.0 ni, 69.5 id, 0.2 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 26356524+total, 99668392 free, 4215400 used, 15968145+buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 23785484+avail Mem

  PID USER PR NI   VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
22485 root 20  0      0    0    0 R 37.8  0.0 181:28.81 ll_mgs_0029
22394 root 20  0      0    0    0 R 35.9  0.0 175:39.61 ll_mgs_0001
22476 root 20  0      0    0    0 R 33.9  0.0 171:57.25 ll_mgs_0020
22467 root 20  0      0    0    0 R 31.9  0.0 171:41.76 ll_mgs_0011
22479 root 20  0      0    0    0 R 31.6  0.0 171:59.09 ll_mgs_0023
22483 root 20  0      0    0    0 R 27.6  0.0 166:36.77 ll_mgs_0027
22471 root 20  0      0    0    0 R 21.4  0.0 172:23.91 ll_mgs_0015
22451 root 20  0      0    0    0 R 19.7  0.0 170:26.72 ll_mgs_0003
22455 root 20  0      0    0    0 R 19.7  0.0 160:29.77 ll_mgs_0007
22478 root 20  0      0    0    0 R 15.5  0.0 174:40.18 ll_mgs_0022
22457 root 20  0      0    0    0 R 13.8  0.0 172:22.42 ll_mgs_0008
22484 root 20  0      0    0    0 R 11.5  0.0 186:13.69 ll_mgs_0028
22473 root 20  0      0    0    0 R  9.9  0.0 159:36.69 ll_mgs_0017
16487 root 20  0 164268 4572 1604 R  8.2  0.0   0:00.29 top
22459 root 20  0      0    0    0 R  7.2  0.0 171:38.46 ll_mgs_0010
22475 root 20  0      0    0    0 R  6.6  0.0 167:11.56 ll_mgs_0019
22487 root 20  0      0    0    0 R  6.2  0.0 162:49.27 ll_mgs_0031
24401 root 20  0      0    0    0 S  4.6  0.0  15:56.76 mdt00_098
22472 root 20  0      0    0    0 R  3.9  0.0 170:10.49 ll_mgs_0016
22482 root 20  0      0    0    0 R  3.6  0.0 168:09.82 ll_mgs_0026
22269 root 20  0      0    0    0 S  3.3  0.0 101:06.47 kiblnd_sd_00_00
22452 root 20  0      0    0    0 R  3.3  0.0 170:29.48 ll_mgs_0004
22453 root 20  0      0    0    0 R  3.0  0.0 155:18.83 ll_mgs_0005
22486 root 20  0      0    0    0 R  3.0  0.0 172:00.85 ll_mgs_0030
24415 root 20  0      0    0    0 S  3.0  0.0   3:12.69 mdt00_109
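Before doing anything disruptive, it can help to grab the kernel stacks of the spinning ll_mgs threads; a minimal sketch using the PIDs from the top output above:

# stack of the busiest MGS service thread (PID 22485 from top)
cat /proc/22485/stack
# or loop over all ll_mgs threads
for pid in $(pgrep ll_mgs); do echo "== $pid =="; cat /proc/$pid/stack; done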
I should have taken a crash dump right there, but instead I tried to unmount the MGS, which led to many soft lockups, and only then did I take a crash dump.
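For reference, this is how a crash dump can be forced on a node in this state; a minimal sketch, assuming kdump is configured:

# enable sysrq, then panic the node so kdump writes the vmcore
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger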
I'm attaching the output of foreach bt from the crash dump as fir-md1-s1-foreachbt-2019-09-08-22-38-08.log
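A log like that is typically produced with the crash utility against the vmcore; a minimal sketch, assuming the matching kernel debuginfo is installed (the vmlinux path below is an example):

crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl1.x86_64/vmlinux vmcore-fir-md1-s1-2019-09-08-22-38-08
# inside crash: back-trace every task, redirected to a file
crash> foreach bt > fir-md1-s1-foreachbt-2019-09-08-22-38-08.log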
The crash dump (taken while the MGS was unmounting) is available on the WC ftp as vmcore-fir-md1-s1-2019-09-08-22-38-08
Kernel version is 3.10.0-957.27.2.el7_lustre.pl1.x86_64
Thanks!
Stephane
Issue Links
- is related to: LU-13356 lctl conf_param hung on the MGS node (Resolved)