LU-12735: MGS misbehaving in 2.12.2+


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.2
    • Severity: 3

    Description

      Our first issue on Fir after upgrading to 2.12.2_119 (Lustre b2_12 2.12.2_116 + LU-11285, LU-12017, LU-11761): the MGS seems to have gone crazy and took down the server along with 2 MDTs. Quite annoying if this is actually an MGS problem.

      Since the upgrade of the filesystem, I've noticed these messages:

      [Sun Sep  8 14:12:26 2019][291910.295291] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      [Sun Sep  8 14:12:26 2019][291910.308156] LustreError: Skipped 1 previous similar message
      [Sun Sep  8 14:12:26 2019][291910.313838] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
      [Sun Sep  8 14:12:26 2019][291910.351487] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
      [Sun Sep  8 14:12:27 2019][291911.201631] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [Sun Sep  8 14:12:27 2019][291911.221161] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
      [Sun Sep  8 14:12:53 2019][291936.837519] Lustre: MGS: Connection restored to eb318ae2-201e-a222-0b8e-3d4d1220bc21 (at 10.9.106.8@o2ib4)
      [Sun Sep  8 14:12:53 2019][291936.837577] LNetError: 22274:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
      [Sun Sep  8 14:12:53 2019][291936.861716] Lustre: Skipped 3411 previous similar messages
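
      For reference, the MGC import state can be checked on the MGS/MDS node with something like this (a rough sketch; parameter names may differ slightly between versions):

      # list MGC devices and dump the import state
      lctl dl | grep -i mgc
      lctl get_param mgc.*.import
      # the lock timeouts above also end up in the console ring buffer
      dmesg | grep -iE 'MGC|MGS' | tail -n 50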
      

      Because everything seemed to work OK, we didn't do anything. I think I've seen this occasionally even with 2.10, and a workaround is to remount the MGS.
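
      For reference, that remount workaround is just something like this (the MGT device and mount point below are made up; the real ones are site-specific):

      # hypothetical MGT device and mount point -- adjust to the actual setup
      umount /mnt/mgt
      mount -t lustre /dev/mapper/mgt /mnt/mgt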

      So last night, Fir was inaccessible and we got alerts. There wasn't any crash, but the primary MGS/MDS fir-md1-s1 was under heavy load:
       

      top - 22:35:21 up 3 days, 17:27,  1 user,  load average: 499.00, 487.24, 455.13
      Tasks: 2201 total, 101 running, 2100 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.1 us, 30.2 sy,  0.0 ni, 69.5 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
      KiB Mem : 26356524+total, 99668392 free,  4215400 used, 15968145+buff/cache
      KiB Swap:  4194300 total,  4194300 free,        0 used. 23785484+avail Mem
      
         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
       22485 root      20   0       0      0      0 R  37.8  0.0 181:28.81 ll_mgs_0029
       22394 root      20   0       0      0      0 R  35.9  0.0 175:39.61 ll_mgs_0001
       22476 root      20   0       0      0      0 R  33.9  0.0 171:57.25 ll_mgs_0020
       22467 root      20   0       0      0      0 R  31.9  0.0 171:41.76 ll_mgs_0011
       22479 root      20   0       0      0      0 R  31.6  0.0 171:59.09 ll_mgs_0023
       22483 root      20   0       0      0      0 R  27.6  0.0 166:36.77 ll_mgs_0027
       22471 root      20   0       0      0      0 R  21.4  0.0 172:23.91 ll_mgs_0015
       22451 root      20   0       0      0      0 R  19.7  0.0 170:26.72 ll_mgs_0003
       22455 root      20   0       0      0      0 R  19.7  0.0 160:29.77 ll_mgs_0007
       22478 root      20   0       0      0      0 R  15.5  0.0 174:40.18 ll_mgs_0022
       22457 root      20   0       0      0      0 R  13.8  0.0 172:22.42 ll_mgs_0008
       22484 root      20   0       0      0      0 R  11.5  0.0 186:13.69 ll_mgs_0028
       22473 root      20   0       0      0      0 R   9.9  0.0 159:36.69 ll_mgs_0017
       16487 root      20   0  164268   4572   1604 R   8.2  0.0   0:00.29 top
       22459 root      20   0       0      0      0 R   7.2  0.0 171:38.46 ll_mgs_0010
       22475 root      20   0       0      0      0 R   6.6  0.0 167:11.56 ll_mgs_0019
       22487 root      20   0       0      0      0 R   6.2  0.0 162:49.27 ll_mgs_0031
       24401 root      20   0       0      0      0 S   4.6  0.0  15:56.76 mdt00_098
       22472 root      20   0       0      0      0 R   3.9  0.0 170:10.49 ll_mgs_0016
       22482 root      20   0       0      0      0 R   3.6  0.0 168:09.82 ll_mgs_0026
       22269 root      20   0       0      0      0 S   3.3  0.0 101:06.47 kiblnd_sd_00_00
       22452 root      20   0       0      0      0 R   3.3  0.0 170:29.48 ll_mgs_0004
       22453 root      20   0       0      0      0 R   3.0  0.0 155:18.83 ll_mgs_0005
       22486 root      20   0       0      0      0 R   3.0  0.0 172:00.85 ll_mgs_0030
       24415 root      20   0       0      0      0 S   3.0  0.0   3:12.69 mdt00_109
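
      The kernel stacks of those spinning ll_mgs threads can be sampled without taking the node down, along these lines (PID taken from the top output above):

      # dump the kernel stack of one spinning MGS service thread
      cat /proc/22485/stack
      # or dump all task states to the console ring buffer via sysrq (needs kernel.sysrq enabled)
      echo t > /proc/sysrq-trigger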
       

      I should have taken a crash dump right then, but instead I tried to umount the MGS, which led to many soft lockups, and then I took a crash dump.

      I'm attaching the output of "foreach bt" from the crash dump as fir-md1-s1-foreachbt-2019-09-08-22-38-08.log

      The crash dump (taken while the MGS was unmounting) is available on the WC ftp as vmcore-fir-md1-s1-2019-09-08-22-38-08
      The kernel version is 3.10.0-957.27.2.el7_lustre.pl1.x86_64
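
      If it helps, the foreach bt output can be regenerated from that vmcore with crash, roughly like this (the debuginfo vmlinux path is just illustrative and has to match the kernel above):

      crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl1.x86_64/vmlinux \
            vmcore-fir-md1-s1-2019-09-08-22-38-08
      crash> foreach bt > fir-md1-s1-foreachbt-2019-09-08-22-38-08.log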

      Thanks!
      Stephane

            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: Stephane Thiell (sthiell)
              Votes: 0
              Watchers: 8
