MDS evicted OST after attempting recovery.

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.4.0
    • 3
    • 7230

    Attachments

      1. c4-0c0s5n0.log
        33 kB
      2. client.log
        4 kB
      3. mds-test-shot.log
        8 kB
      4. mds-test-shot-whole.log
        1.27 MB
      5. oss-test-shot.log
        8 kB
      6. spare.log
        869 kB

      Activity

        [LU-2965] MDS evicted OST after attempting recovery.
        pjones Peter Jones added a comment -

        Thanks James - that is good news!


        simmonsja James A Simmons added a comment -

        For our last test shot IR worked flawlessly. I think we can close this ticket. If we encounter the bug again, this ticket can be reopened.

        jlevi Jodi Levi (Inactive) added a comment -

        Reducing priority until more information is available to understand the issue.

        liwei Li Wei (Inactive) added a comment -

        From spare.log (why is it called "spare", by the way?):

        Mar 8 21:06:01 widow-spare06 kernel: [ 7478.823574] Lustre: routed1-OST00eb-osc-ffff8804319c7000: Connection restored to routed1-OST00eb (at 10.36.227.92@o2ib)
        Mar 8 21:06:01 widow-spare06 kernel: [ 7478.845592] Lustre: routed1-OST01ab-osc-ffff8804319c7000: Connection restored to routed1-OST01ab (at 10.36.227.92@o2ib)
        Mar 8 21:12:33 widow-spare06 kernel: imklog 5.8.10, log source = /proc/kmsg started.
        Mar 8 21:12:33 widow-spare06 kernel: [ 0.000000] Initializing cgroup subsys cpuset
        [...]
        Mar 8 21:13:34 widow-spare06 kernel: [ 120.446824] ipmi device interface
        [EOF]

        The client completed recovery with at least OST00eb and OST01ab, but then restarted for some reason. OST00eb last heard from the client at around 21:08:53, two to three minutes after the recovery finished. Unless that request was something else, the timing is close to the 150s ping interval, which suggests the request heard was probably a ping. The next ping should have arrived 150s to 300s later, that is, between 21:11:23 and 21:13:53. Could this eviction have happened merely because the client was not remounted after the restart? Is any log available after 21:13:34?
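The timing argument in the comment above can be checked with a few lines of arithmetic. This is only a sketch of that reasoning, not Lustre code: it assumes the 150s ping interval quoted in the comment and that the next ping falls between one and two intervals after the last contact. The names `next_ping_window` and `seconds_between` are hypothetical helpers introduced for illustration.

```python
from datetime import datetime, timedelta
from typing import Tuple

PING_INTERVAL_S = 150  # ping interval quoted in the comment (assumed here)
FMT = "%H:%M:%S"       # time of day only; the log excerpt carries no year


def next_ping_window(last_heard: str,
                     interval_s: int = PING_INTERVAL_S) -> Tuple[str, str]:
    """Return (earliest, latest) expected time of the next ping.

    Assumes the next ping arrives between one and two ping
    intervals after the last request the server heard.
    """
    t = datetime.strptime(last_heard, FMT)
    earliest = t + timedelta(seconds=interval_s)
    latest = t + timedelta(seconds=2 * interval_s)
    return earliest.strftime(FMT), latest.strftime(FMT)


def seconds_between(start: str, end: str) -> int:
    """Whole seconds from start to end (same day assumed)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# OST00eb last heard from the client at about 21:08:53.
window = next_ping_window("21:08:53")
print(window)

# spare.log ends at 21:13:34, shortly before the window closes.
print(seconds_between("21:13:34", window[1]))
```

Under these assumptions the log ends just before the expected ping window closes, which is why a log covering the period after 21:13:34 would help decide whether the client simply never remounted.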
        green Oleg Drokin added a comment -

        I guess the assumption that the clients stopped pinging comes from the fact that they were evicted for not pinging.


        simmonsja James A Simmons added a comment -

        10.36.227.198 was our 2.4 non-Cray client. I attached the log (spare.log) here.

        People

          bobijam Zhenyu Xu
          simmonsja James A Simmons
          Votes: 0
          Watchers: 9
