Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.2
    • Severity: 3

    Description

      Our first issue on Fir after upgrading to 2.12.2_119 (Lustre b2_12 2.12.2_116 + LU-11285,LU-12017,LU-11761): the MGS seems to have gone crazy and taken down the server along with 2 MDTs. Quite annoying if that is actually an MGS problem.

      Since the upgrade of the filesystem, I've noticed those messages:

      [Sun Sep  8 14:12:26 2019][291910.295291] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      [Sun Sep  8 14:12:26 2019][291910.308156] LustreError: Skipped 1 previous similar message
      [Sun Sep  8 14:12:26 2019][291910.313838] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
      [Sun Sep  8 14:12:26 2019][291910.351487] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
      [Sun Sep  8 14:12:27 2019][291911.201631] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [Sun Sep  8 14:12:27 2019][291911.221161] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
      [Sun Sep  8 14:12:53 2019][291936.837519] Lustre: MGS: Connection restored to eb318ae2-201e-a222-0b8e-3d4d1220bc21 (at 10.9.106.8@o2ib4)
      [Sun Sep  8 14:12:53 2019][291936.837577] LNetError: 22274:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
      [Sun Sep  8 14:12:53 2019][291936.861716] Lustre: Skipped 3411 previous similar messages
      

      Because everything seemed to work OK otherwise, we didn't do anything. I think I've seen this sometimes even with 2.10, and a workaround is to remount the MGS (see the sketch below).
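
      For reference, remounting the MGS on a setup like this usually just means unmounting and remounting the MGT; the device path and mount point below are placeholders, not the actual fir-md1-s1 layout:

      # hypothetical MGT device and mount point - adjust to the real configuration
      umount /mnt/mgt
      mount -t lustre /dev/mapper/mgt /mnt/mgt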

      So last night, Fir was inaccessible and we got alerts. There wasn't any crash, but the primary MGS/MDS fir-md1-s1 was under heavy load:
       

      top - 22:35:21 up 3 days, 17:27,  1 user,  load average: 499.00, 487.24, 455.13
      Tasks: 2201 total, 101 running, 2100 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.1 us, 30.2 sy,  0.0 ni, 69.5 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
      KiB Mem : 26356524+total, 99668392 free,  4215400 used, 15968145+buff/cache
      KiB Swap:  4194300 total,  4194300 free,        0 used. 23785484+avail Mem
      
         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
       22485 root      20   0       0      0      0 R  37.8  0.0 181:28.81 ll_mgs_0029
       22394 root      20   0       0      0      0 R  35.9  0.0 175:39.61 ll_mgs_0001
       22476 root      20   0       0      0      0 R  33.9  0.0 171:57.25 ll_mgs_0020
       22467 root      20   0       0      0      0 R  31.9  0.0 171:41.76 ll_mgs_0011
       22479 root      20   0       0      0      0 R  31.6  0.0 171:59.09 ll_mgs_0023
       22483 root      20   0       0      0      0 R  27.6  0.0 166:36.77 ll_mgs_0027
       22471 root      20   0       0      0      0 R  21.4  0.0 172:23.91 ll_mgs_0015
       22451 root      20   0       0      0      0 R  19.7  0.0 170:26.72 ll_mgs_0003
       22455 root      20   0       0      0      0 R  19.7  0.0 160:29.77 ll_mgs_0007
       22478 root      20   0       0      0      0 R  15.5  0.0 174:40.18 ll_mgs_0022
       22457 root      20   0       0      0      0 R  13.8  0.0 172:22.42 ll_mgs_0008
       22484 root      20   0       0      0      0 R  11.5  0.0 186:13.69 ll_mgs_0028
       22473 root      20   0       0      0      0 R   9.9  0.0 159:36.69 ll_mgs_0017
       16487 root      20   0  164268   4572   1604 R   8.2  0.0   0:00.29 top
       22459 root      20   0       0      0      0 R   7.2  0.0 171:38.46 ll_mgs_0010
       22475 root      20   0       0      0      0 R   6.6  0.0 167:11.56 ll_mgs_0019
       22487 root      20   0       0      0      0 R   6.2  0.0 162:49.27 ll_mgs_0031
       24401 root      20   0       0      0      0 S   4.6  0.0  15:56.76 mdt00_098
       22472 root      20   0       0      0      0 R   3.9  0.0 170:10.49 ll_mgs_0016
       22482 root      20   0       0      0      0 R   3.6  0.0 168:09.82 ll_mgs_0026
       22269 root      20   0       0      0      0 S   3.3  0.0 101:06.47 kiblnd_sd_00_00
       22452 root      20   0       0      0      0 R   3.3  0.0 170:29.48 ll_mgs_0004
       22453 root      20   0       0      0      0 R   3.0  0.0 155:18.83 ll_mgs_0005
       22486 root      20   0       0      0      0 R   3.0  0.0 172:00.85 ll_mgs_0030
       24415 root      20   0       0      0      0 S   3.0  0.0   3:12.69 mdt00_109
       

      I should have taken a crash dump at that point, but instead I tried to umount the MGS, which led to many soft lockups, and then I took a crash dump.

      I'm attaching the foreach bt output from the crash dump as fir-md1-s1-foreachbt-2019-09-08-22-38-08.log

      The crash dump (taken while the MGS was unmounting) is available on the WC ftp as vmcore-fir-md1-s1-2019-09-08-22-38-08
      Kernel version is 3.10.0-957.27.2.el7_lustre.pl1.x86_64
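
      For context, a foreach bt log like the one attached is typically produced with the crash(8) utility against the vmcore and the matching debuginfo vmlinux; the paths below are illustrative only:

      # hypothetical paths - use the vmlinux matching the running kernel
      crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl1.x86_64/vmlinux vmcore-fir-md1-s1-2019-09-08-22-38-08
      # inside the crash session, redirect all backtraces to a file
      crash> foreach bt > fir-md1-s1-foreachbt-2019-09-08-22-38-08.log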

      Thanks!
      Stephane

          Activity

            [LU-12735] MGS misbehaving in 2.12.2+

            sthiell Stephane Thiell added a comment -

            Hey Matt, sorry to hear about your current issues. We have not applied the patch https://review.whamcloud.com/#/c/41309/ (backport for LU-13356) in production yet; I was waiting for it to land in 2.12.7. I also thought it was a server-only patch.
            As you have seen in my ticket LU-14695, we have been having issues with the MGS propagating the configuration to the other targets when new OSTs are added, and we also have some kind of llog corruption that we're still trying to understand. But in our case, new clients can still mount properly (could be some luck). I hope Whamcloud will be able to help understand these config llog / MGS-related problems. I'm going to watch your ticket! Fingers crossed.

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Thanks for confirming. Yes, perhaps I have something else going on too; I've opened my own issue under https://jira.whamcloud.com/browse/LU-14802

            Thanks again,
            Matt

            delbaryg DELBARY Gael (Inactive) added a comment -

            Hi Matt,

            Unfortunately, that's a server-only patch. In our case, the effect was immediate. Perhaps you have a similar issue, but not exactly the same one. Have you opened a Jira ticket for your issue?

            Gael

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hi Gael, sorry I missed this before, but is this a client and server patch? I initially tried deploying it just to our MGS server, before looking more closely, but it didn't have any effect. Have you deployed this across both clients and servers in your environment?

            Thanks again,
            Matt

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Thanks Gael, that's all I wanted to hear! Super - I'm going to test this in our environment now and hopefully fix this issue.

            Cheers!
            Matt

            delbaryg DELBARY Gael (Inactive) added a comment -

            Hi Matt,

            I'm not answering in place of Stéphane, but on our side it fully fixes this issue. We have the https://jira.whamcloud.com/browse/LU-13356 backport on all our Lustre systems in production.

            Gael

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hello, sorry to comment on a resolved issue, but I believe I'm hitting this issue as well at the moment.

            Can I ask if using the patch mentioned in https://jira.whamcloud.com/browse/LU-13356 is the full solution here?

            @Stephane - are you using this patch on your 2.12 servers at the moment? I don't see it applied to b2_12 upstream at the moment, which is a bit concerning - it's been quite debilitating for us, so I'm curious why it hasn't made it into 2.12.7.

            Thanks,

            Matt
            pjones Peter Jones added a comment -

            Good news - thanks


            sthiell Stephane Thiell added a comment -

            Etienne – Great news, thanks! Hope this patch can make it to 2.12.7.
            eaujames Etienne Aujames added a comment - edited

            Hi,

            We have applied the patch https://review.whamcloud.com/41309 ("LU-13356 client: don't use OBD_CONNECT_MNE_SWAB") in production for several weeks now and the issue doesn't appear again.

            The issue used to occur after an OST failover.
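
            For anyone wanting to try the same fix, the Gerrit change referenced above can be pulled onto a local b2_12 tree roughly as follows (the patchset number below is a placeholder; check the review page for the latest one):

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git checkout b2_12
            # change 41309 is published under refs/changes/09/41309/<patchset>
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/09/41309/1
            git cherry-pick FETCH_HEAD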

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8
