Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.2
    • Severity: 3

    Description

      Our first issue on Fir after upgrading to 2.12.2_119 (Lustre b2_12 2.12.2_116 + LU-11285,LU-12017,LU-11761): the MGS seems to have gone crazy and taken down the server along with 2 MDTs. Quite annoying if that is actually an MGS problem.

      Since the upgrade of the filesystem, I've noticed those messages:

      [Sun Sep  8 14:12:26 2019][291910.295291] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      [Sun Sep  8 14:12:26 2019][291910.308156] LustreError: Skipped 1 previous similar message
      [Sun Sep  8 14:12:26 2019][291910.313838] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
      [Sun Sep  8 14:12:26 2019][291910.351487] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
      [Sun Sep  8 14:12:27 2019][291911.201631] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [Sun Sep  8 14:12:27 2019][291911.221161] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
      [Sun Sep  8 14:12:53 2019][291936.837519] Lustre: MGS: Connection restored to eb318ae2-201e-a222-0b8e-3d4d1220bc21 (at 10.9.106.8@o2ib4)
      [Sun Sep  8 14:12:53 2019][291936.837577] LNetError: 22274:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
      [Sun Sep  8 14:12:53 2019][291936.861716] Lustre: Skipped 3411 previous similar messages
      

      Because everything seemed to work OK otherwise, we didn't do anything. I think I've seen this sometimes even with 2.10, and a workaround is to remount the MGS (see the sketch below).
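
      For reference, remounting the MGS on a setup like this usually just means unmounting and remounting the MGT; the device path and mount point below are placeholders, not the actual fir-md1-s1 layout:

      # hypothetical MGT device and mount point - adjust to the real configuration
      umount /mnt/mgt
      mount -t lustre /dev/mapper/mgt /mnt/mgt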

      So last night, Fir was inaccessible and we got alerts. There wasn't any crash, but the primary MGS/MDS fir-md1-s1 was under heavy load:
       

      top - 22:35:21 up 3 days, 17:27,  1 user,  load average: 499.00, 487.24, 455.13
      Tasks: 2201 total, 101 running, 2100 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.1 us, 30.2 sy,  0.0 ni, 69.5 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
      KiB Mem : 26356524+total, 99668392 free,  4215400 used, 15968145+buff/cache
      KiB Swap:  4194300 total,  4194300 free,        0 used. 23785484+avail Mem
      
         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
       22485 root      20   0       0      0      0 R  37.8  0.0 181:28.81 ll_mgs_0029
       22394 root      20   0       0      0      0 R  35.9  0.0 175:39.61 ll_mgs_0001
       22476 root      20   0       0      0      0 R  33.9  0.0 171:57.25 ll_mgs_0020
       22467 root      20   0       0      0      0 R  31.9  0.0 171:41.76 ll_mgs_0011
       22479 root      20   0       0      0      0 R  31.6  0.0 171:59.09 ll_mgs_0023
       22483 root      20   0       0      0      0 R  27.6  0.0 166:36.77 ll_mgs_0027
       22471 root      20   0       0      0      0 R  21.4  0.0 172:23.91 ll_mgs_0015
       22451 root      20   0       0      0      0 R  19.7  0.0 170:26.72 ll_mgs_0003
       22455 root      20   0       0      0      0 R  19.7  0.0 160:29.77 ll_mgs_0007
       22478 root      20   0       0      0      0 R  15.5  0.0 174:40.18 ll_mgs_0022
       22457 root      20   0       0      0      0 R  13.8  0.0 172:22.42 ll_mgs_0008
       22484 root      20   0       0      0      0 R  11.5  0.0 186:13.69 ll_mgs_0028
       22473 root      20   0       0      0      0 R   9.9  0.0 159:36.69 ll_mgs_0017
       16487 root      20   0  164268   4572   1604 R   8.2  0.0   0:00.29 top
       22459 root      20   0       0      0      0 R   7.2  0.0 171:38.46 ll_mgs_0010
       22475 root      20   0       0      0      0 R   6.6  0.0 167:11.56 ll_mgs_0019
       22487 root      20   0       0      0      0 R   6.2  0.0 162:49.27 ll_mgs_0031
       24401 root      20   0       0      0      0 S   4.6  0.0  15:56.76 mdt00_098
       22472 root      20   0       0      0      0 R   3.9  0.0 170:10.49 ll_mgs_0016
       22482 root      20   0       0      0      0 R   3.6  0.0 168:09.82 ll_mgs_0026
       22269 root      20   0       0      0      0 S   3.3  0.0 101:06.47 kiblnd_sd_00_00
       22452 root      20   0       0      0      0 R   3.3  0.0 170:29.48 ll_mgs_0004
       22453 root      20   0       0      0      0 R   3.0  0.0 155:18.83 ll_mgs_0005
       22486 root      20   0       0      0      0 R   3.0  0.0 172:00.85 ll_mgs_0030
       24415 root      20   0       0      0      0 S   3.0  0.0   3:12.69 mdt00_109
       

      I should have taken a crash dump at that point, but instead I tried to umount the MGS, which led to many soft lockups, and then I took a crash dump.

      I'm attaching the foreach bt output from the crash dump as fir-md1-s1-foreachbt-2019-09-08-22-38-08.log

      The crash dump (taken while the MGS was unmounting) is available on the WC ftp as vmcore-fir-md1-s1-2019-09-08-22-38-08
      Kernel version is 3.10.0-957.27.2.el7_lustre.pl1.x86_64
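
      For context, a foreach bt log like the one attached is typically produced with the crash(8) utility against the vmcore and the matching debuginfo vmlinux; the paths below are illustrative only:

      # hypothetical paths - use the vmlinux matching the running kernel
      crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl1.x86_64/vmlinux vmcore-fir-md1-s1-2019-09-08-22-38-08
      # inside the crash session, redirect all backtraces to a file
      crash> foreach bt > fir-md1-s1-foreachbt-2019-09-08-22-38-08.log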

      Thanks!
      Stephane

          Activity

            [LU-12735] MGS misbehaving in 2.12.2+

            sthiell Stephane Thiell added a comment -

            Hey Matt, sorry to hear about your current issues. We have not applied the patch https://review.whamcloud.com/#/c/41309/ (backport for LU-13356) in production yet; I was waiting for it to land in 2.12.7. I also thought it was a server-only patch.
            As you have seen in my ticket LU-14695, we have been having issues with the MGS propagating the configuration to the other targets when new OSTs are added, and we also have some kind of llog corruption that we're still trying to understand. But in our case, new clients can still mount properly (could be some luck). I hope Whamcloud will be able to help understand these config llog / MGS-related problems. I'm going to watch your ticket! Fingers crossed.

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Thanks for confirming. Yes, perhaps I have something else going on too; I've opened my own issue under https://jira.whamcloud.com/browse/LU-14802

            Thanks again,
            Matt

            delbaryg DELBARY Gael (Inactive) added a comment -

            Hi Matt,

            Unfortunately, that's a server-only patch. In our case, the effect was immediate. Perhaps you have a similar issue, but not exactly the same one. Have you opened a Jira ticket for your issue?

            Gael

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hi Gael, sorry I missed this before, but is this a client and server patch? I initially tried deploying it just to our MGS server, before looking more closely, but it didn't have any effect. Have you deployed this across both clients and servers in your environment?

            Thanks again,
            Matt

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Thanks Gael, that's all I wanted to hear! Super - I'm going to test this in our environment now and hopefully fix this issue.

            Cheers!
            Matt

            delbaryg DELBARY Gael (Inactive) added a comment -

            Hi Matt,

            I'm not answering in place of Stéphane, but on our side it fully fixes this issue. We have the https://jira.whamcloud.com/browse/LU-13356 backport on all our Lustre systems in production.

            Gael

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hello, sorry to comment on a resolved issue, but I believe I'm hitting this issue as well at the moment.

            Can I ask if using the patch mentioned in https://jira.whamcloud.com/browse/LU-13356 is the full solution here?

            @Stephane - are you using this patch on your 2.12 servers at the moment? I don't see it applied to b2_12 upstream at the moment, which is a bit concerning - it's been quite debilitating for us, so I'm curious why it hasn't made it into 2.12.7.

            Thanks,

            Matt
            pjones Peter Jones added a comment -

            Good news - thanks


            sthiell Stephane Thiell added a comment -

            Etienne – Great news, thanks! Hope this patch can make it to 2.12.7.
            eaujames Etienne Aujames added a comment - edited

            Hi,

            We have applied the patch https://review.whamcloud.com/41309 ("LU-13356 client: don't use OBD_CONNECT_MNE_SWAB") in production for several weeks now and the issue doesn't appear again.

            The issue used to occur after an OST failover.
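
            For anyone wanting to try the same fix, the Gerrit change referenced above can be pulled onto a local b2_12 tree roughly as follows (the patchset number below is a placeholder; check the review page for the latest one):

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git checkout b2_12
            # change 41309 is published under refs/changes/09/41309/<patchset>
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/09/41309/1
            git cherry-pick FETCH_HEAD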

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8
