MDS evicted OST after attempting recovery.

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.4.0
    • 3
    • 7230

    Attachments

      1. c4-0c0s5n0.log
        33 kB
      2. client.log
        4 kB
      3. mds-test-shot.log
        8 kB
      4. mds-test-shot-whole.log
        1.27 MB
      5. oss-test-shot.log
        8 kB
      6. spare.log
        869 kB

      Activity

        [LU-2965] MDS evicted OST after attempting recovery.
        pjones Peter Jones added a comment -

        Thanks James - that is good news!


        simmonsja James A Simmons added a comment -

        For our last test shot IR worked flawlessly. I think we can close this ticket. If we encounter the bug again this ticket can be reopened.

        jlevi Jodi Levi (Inactive) added a comment -

        Reducing priority until more information is available to understand the issue.

        liwei Li Wei (Inactive) added a comment -

        From spare.log (why is it called "spare", by the way?):

        Mar 8 21:06:01 widow-spare06 kernel: [ 7478.823574] Lustre: routed1-OST00eb-osc-ffff8804319c7000: Connection restored to routed1-OST00eb (at 10.36.227.92@o2ib)
        Mar 8 21:06:01 widow-spare06 kernel: [ 7478.845592] Lustre: routed1-OST01ab-osc-ffff8804319c7000: Connection restored to routed1-OST01ab (at 10.36.227.92@o2ib)
        Mar 8 21:12:33 widow-spare06 kernel: imklog 5.8.10, log source = /proc/kmsg started.
        Mar 8 21:12:33 widow-spare06 kernel: [ 0.000000] Initializing cgroup subsys cpuset
        [...]
        Mar 8 21:13:34 widow-spare06 kernel: [ 120.446824] ipmi device interface
        [EOF]

        The client completed recovery with at least OST00eb and OST01ab, but then restarted for some reason. OST00eb last heard from the client at around 21:08:53, two to three minutes after the recovery finished. Unless it was something else, that timing is close to the 150s ping interval, which suggests the request was probably a ping. The next ping should then arrive 150s to 300s later, that is, between 21:11:23 and 21:13:53. Could the eviction have happened merely because the client was not remounted after the restart? Are any logs available after 21:13:34?
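The ping-window arithmetic in the comment above can be checked with a short sketch. It assumes only the figures quoted there (a 150s ping interval, a last contact at 21:08:53, and a client log ending at 21:13:34); the helper name `next_ping_window` is hypothetical and not part of Lustre.

```python
from datetime import datetime, timedelta

# 150s ping interval, as quoted in the comment (an assumption about this setup).
PING_INTERVAL = timedelta(seconds=150)


def next_ping_window(last_heard, interval=PING_INTERVAL):
    """Return (earliest, latest) expected arrival of the next ping.

    Hypothetical helper: the window spans one to two intervals after
    the last request the server heard, matching "150s to 300s later".
    """
    return last_heard + interval, last_heard + 2 * interval


def parse_clock(text):
    # The syslog lines carry no year, so only the time of day is parsed.
    return datetime.strptime(text, "%H:%M:%S")


last_heard = parse_clock("21:08:53")  # OST00eb last heard from the client
log_end = parse_clock("21:13:34")     # last line of spare.log

earliest, latest = next_ping_window(last_heard)
print("next ping expected between",
      earliest.strftime("%H:%M:%S"), "and", latest.strftime("%H:%M:%S"))

# True means the window closes after the log ends, so the available log
# cannot show whether the rebooted client ever sent that ping.
print("window extends past end of log:", latest > log_end)
```

The window closes a few seconds after the last line of spare.log, which is why logs from after 21:13:34 are needed to tell whether the client was remounted.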
        green Oleg Drokin added a comment -

        I guess the assumption that clients stopped pinging comes from the fact that they were evicted for not pinging.


        simmonsja James A Simmons added a comment -

        10.36.227.198 was our 2.4 non-Cray client. I attached its log (spare.log) here.

        liwei Li Wei (Inactive) added a comment -

        James,

        According to the oss-test-shot.log, only one client was evicted by the OSS around 21:30: "61445347-9977-82cd-59dd-430903b6625f (at 10.36.227.198@o2ib)". It seems c4-0c0s5n0 was "1546@gni", not the o2ib client. Is the log on the o2ib client still available?

        Also, how did you infer that some clients had stopped pinging the servers, please?

        simmonsja James A Simmons added a comment -

        Looking at the logs, about 175 seconds after every client recovered, some of them stopped pinging the servers.

        simmonsja James A Simmons added a comment -

        Nope. I wasn't testing suppress pings at that time. I started the suppress pings test 2 hours after all the recovery problems happened.
        green Oleg Drokin added a comment -

        I just confirmed with James that the actual problem is "even though recovery completed just fine after a server failure, clients were evicted later for some reason".

        Given the timing of 1354 seconds, which is exactly how much time had passed since recovery, it appears the clients were not pinging?
        Were you already playing with suppress pinging at the time, I wonder?


        People

          bobijam Zhenyu Xu
          simmonsja James A Simmons
          Votes: 0
          Watchers: 9
