Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.4.0
    • 3
    • 5230

    Description

      According to the analysis of LU-1717, we are frequently losing Lustre messages on Sequoia's IB network. We have no LNet routers, and IB is a reliable network. We see no timeouts or LNet errors that would suggest IB transmission problems.

      Why are messages being lost that can result in LU-1717 error messages? I'm worried that we're papering over a larger problem by silencing those errors.

          Activity

            [LU-2187] Why are we losing messages?

            morrone Christopher Morrone (Inactive) added a comment - We will never get back to this one.

            isaac Isaac Huang (Inactive) added a comment - Chris, D_NETERROR messages only go to the Lustre debug log without an explicit 'echo +neterror > /proc/sys/lnet/printk'.
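
A minimal shell sketch of how one might check whether D_NETERROR messages would reach the console on a given node. It assumes that reading the printk file returns the enabled flags as space-separated names (for example "warning error emerg console"); the variable PRINTK_MASK and the helper mask_has_flag are hypothetical names introduced here for illustration, not part of Lustre.

```shell
#!/bin/sh
# Path quoted in the comment above; overridable so the sketch can be
# tried against a scratch file on a machine without LNet loaded.
PRINTK_MASK="${PRINTK_MASK:-/proc/sys/lnet/printk}"

# Hypothetical helper: succeeds if the space-separated mask in $1
# contains the exact flag name in $2.
mask_has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if [ -r "$PRINTK_MASK" ]; then
    # Assumption: the file lists enabled flags as space-separated names.
    current=$(cat "$PRINTK_MASK")
    if mask_has_flag "$current" neterror; then
        echo "neterror is in the console mask"
    else
        echo "neterror goes to the debug log only; to enable: echo +neterror > $PRINTK_MASK"
    fi
else
    echo "$PRINTK_MASK is not readable (is the lnet module loaded?)"
fi
```

The script only reads the mask and prints the command from the comment rather than running it, since changing the console mask on a production node is a decision for the administrator.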

            morrone Christopher Morrone (Inactive) added a comment (edited) - We do change it from D_WARNING to D_NETERROR. Our code is in git on github.com/chaos/lustre. Our most recent branch is 2.3.54-llnl.

            johann Johann Lombardi (Inactive) added a comment - Chris, could you please advise whether you quiet the message in ptlrpc_expire_one_request() displayed when a timeout happens?

            doug Doug Oucharek (Inactive) added a comment - Chris: Have you been able to "turn up" the RPC logging to see if there are resend logs?

            green Oleg Drokin added a comment - I remember a set of patches from LLNL that disabled all that "noise" like printing about resent RPCs. I wonder if this is applied by default on their deployments?

            doug Doug Oucharek (Inactive) added a comment - When this happens, are there any ptlrpc timeout logs in the console? Theoretically, if there is a resend, there should be a timeout log.

            People

              doug Doug Oucharek (Inactive)
              morrone Christopher Morrone (Inactive)