Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.4.0
    • 3
    • 5230

    Description

      According to the analysis of LU-1717, we are frequently losing Lustre messages on Sequoia's IB network. We have no LNet routers, and IB is a reliable network. We are not seeing any timeouts or LNet errors that would suggest IB transmission problems.

      Why are messages being lost that can result in LU-1717 error messages? I'm worried that we're papering over a larger problem by silencing those errors.

          Activity

            [LU-2187] Why are we losing messages?

            morrone Christopher Morrone (Inactive) added a comment - We will never get back to this one.

            isaac Isaac Huang (Inactive) added a comment - Chris, D_NETERROR messages only go to the Lustre debug log without an explicit 'echo +neterror > /proc/sys/lnet/printk'.

            morrone Christopher Morrone (Inactive) added a comment (edited) - We do change it from D_WARNING to D_NETERROR. Our code is in git on github.com/chaos/lustre. Our most recent branch is 2.3.54-llnl.

            johann Johann Lombardi (Inactive) added a comment - Chris, could you please advise whether you quiet the message in ptlrpc_expire_one_request() displayed when a timeout happens?

            doug Doug Oucharek (Inactive) added a comment - Chris: Have you been able to "turn up" the RPC logging to see if there are resend logs?

            green Oleg Drokin added a comment - I remember a set of patches from llnl that disabled all that "noise" like printing about resent RPCs. I wonder if this is applied by default on their deployments?

            doug Doug Oucharek (Inactive) added a comment - When this happens, are there any ptlrpc timeout logs in the console? Theoretically, if there is a resend, there should be a timeout log.

            morrone Christopher Morrone (Inactive) added a comment - Isaac, I think it is very unlikely that we'll see anything useful in there. I'll certainly keep an eye on it, but I do not believe that we're seeing real network problems here. We've investigated the network pretty thoroughly and it all checks out.

            I think we need to look higher in the stack. I don't think we've really lost the message. My suspicion is that the server sent the reply too slowly, and the client sent the retry before the server had replied even once.

            So how do we investigate that problem?
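
One way to test the "slow reply rather than lost message" suspicion is to line up, per request, when the client first sent it, when it resent it, and when the server first sent a reply. The following is only a minimal sketch of that comparison. The `RpcTrace` record, its field names, and the category labels are hypothetical and do not correspond to any Lustre data structure or log format. It also assumes the client and server timestamps come from clocks that are close enough to compare.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RpcTrace:
    """Hypothetical per-request record; not a Lustre data structure."""
    xid: int                                   # request identifier
    sent_at: float                             # client: first transmission
    resent_at: List[float] = field(default_factory=list)  # client: retries
    replied_at: Optional[float] = None         # server: first reply sent


def classify(trace: RpcTrace) -> str:
    """Label one request by comparing resend times with the reply time."""
    if trace.replied_at is None:
        # The server never answered: consistent with a genuinely lost request.
        return "no-reply"
    if any(t < trace.replied_at for t in trace.resent_at):
        # The client retried before the server had replied even once.
        return "slow-reply"
    if trace.resent_at:
        # The server replied first, yet the client still resent afterwards.
        return "reply-not-seen"
    return "ok"
```

If most of the affected requests fall into the "slow-reply" bucket, that would point at timeout handling higher in the stack rather than at the IB fabric.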

            isaac Isaac Huang (Inactive) added a comment - There are a few things to do to diagnose it:

            • The contents of /proc/sys/lnet/* would be very useful. Please gather them the next time such errors happen.
            • "lctl --net o2ib0 conn_list" (or "list_conn") gives useful data at the o2iblnd layer. Please run it the next time such errors happen.
            • Many LNet/LND errors don't go to dmesg by default. On a node where such errors have occurred, please run 'echo +neterror > /proc/sys/lnet/printk' after each reboot.
            • Please also check for IB errors by running 'ibcheckerrors'.
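
The collection steps above could be scripted so that they are ready to run the moment the errors reappear. This is a rough sketch, not a supported tool. The function names and output file names are made up for illustration, and the only external commands used are the ones quoted in the comment (`lctl --net o2ib0 conn_list` and `ibcheckerrors`). Each command is skipped if it is not installed on the node.

```python
import shutil
import subprocess
from pathlib import Path


def snapshot_proc_dir(proc_dir, out_dir):
    """Copy every readable regular file directly under proc_dir into out_dir."""
    proc_dir, out_dir = Path(proc_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    if not proc_dir.is_dir():
        return saved  # not a Lustre node, or LNet is not loaded
    for entry in sorted(proc_dir.iterdir()):
        if not entry.is_file():
            continue
        try:
            data = entry.read_text(errors="replace")
        except OSError:
            continue  # some entries are write-only or need root
        (out_dir / f"proc_{entry.name}.txt").write_text(data)
        saved.append(entry.name)
    return saved


def run_if_available(argv, out_file):
    """Run argv and save stdout+stderr; return False if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return False
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    Path(out_file).write_text(result.stdout + result.stderr)
    return True


def collect(out_dir, proc_dir="/proc/sys/lnet"):
    """Gather the LNet proc files plus the o2iblnd and IB error reports."""
    out_dir = Path(out_dir)
    saved = snapshot_proc_dir(proc_dir, out_dir)
    ran = {
        "conn_list": run_if_available(
            ["lctl", "--net", "o2ib0", "conn_list"], out_dir / "conn_list.txt"),
        "ibcheckerrors": run_if_available(
            ["ibcheckerrors"], out_dir / "ibcheckerrors.txt"),
    }
    return saved, ran
```

Something like `collect("/tmp/lnet-diag")` would be run as root on the affected node. The 'echo +neterror > /proc/sys/lnet/printk' step is deliberately left manual here, since it changes node state rather than just reading it.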

            morrone Christopher Morrone (Inactive) added a comment - I can't say definitively, but I strongly suspect that the messages aren't really lost, and that this is more bad Lustre behaviour: making poor AT (adaptive timeout) assumptions, delaying the sending of messages, or something similar.

            People

              doug Doug Oucharek (Inactive)
              morrone Christopher Morrone (Inactive)