Lustre / LU-436

client refused reconnection, still busy with 1 active RPCs


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: None
    • Fix Version/s: Lustre 1.8.6
    • Labels: None
    • Environment: Lustre 1.8.3 on LLNL chaos release 1.3.4 (2.6.18-93.2redsky_chaos). Redsky 2QoS torus IB network cluster, software RAID on Oracle J4400 JBODs, RAID6 8+2 with external journal and bitmap
    • Severity: 3
    • Rank: 10072

Description

We are intermittently seeing problems with our scratch file system where any of our 24 OSS nodes becomes congested and essentially makes the file system unusable. We are trying to nail down the rogue user code or codes that seem to trigger it, but we believe the cause is a large number of small reads or writes to an OST.

Looking at dmesg we see a lot of "...refused reconnection, still busy with 1 active RPCs" messages, and the load on the system goes through the roof, typically with load averages greater than 400-500. Trying to do some forensics, we parsed out the client nodes reported in dmesg, went to those nodes, and tried running lsof on the file system, which basically hung. Thinking these were good candidates, we powered them down, but it did not change any of the server conditions. As in the past, our only resolution was to power cycle the OSS, which cleared everything.
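
As an illustration of the kind of triage we attempted, here is a minimal sketch (hypothetical, not a tested tool) that tallies the client NIDs named in the refusal messages. It assumes the usual console format, where the client NID appears in parentheses after "at"; the exact wording may differ between releases.

    #!/usr/bin/env python
    # Hypothetical sketch: tally client NIDs named in "refused reconnection"
    # console messages.  Assumes lines resembling:
    #   ... Client <uuid> (at 10.1.2.3@o2ib) refused reconnection, still busy with 1 active RPCs
    import re
    import sys
    from collections import defaultdict

    PATTERN = re.compile(
        r'\(at ([^)]+)\) refused reconnection, '
        r'still busy with (\d+) active RPCs')

    refusals = defaultdict(int)
    for line in sys.stdin:
        m = PATTERN.search(line)
        if m:
            refusals[m.group(1)] += 1

    # Print the most frequently refused clients first.
    for nid, count in sorted(refusals.items(), key=lambda kv: -kv[1]):
        print('%6d  %s' % (count, nid))

Fed the console log (e.g. dmesg | python refused_clients.py, with a file name of your choosing), this gives a quick ranking of which clients keep getting refused, which is roughly what we did by hand.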

Looking through existing bugs, this seems very similar to LU-7, which Chris reported in November, with the exception that in this case we are not going through a router.

What I am looking for, or hoping for, is some type of diagnostic that can help determine the source of the congestion, versus the sledgehammer approach of rebooting the server and causing a wider disruption.
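
For example, if the 1.8-style per-export statistics are available on the OSS (the sketch below assumes stats files under /proc/fs/lustre/obdfilter/<OST>/exports/<NID>/stats; the exact layout may vary between releases), something like the following could rank clients by RPC activity without taking the server down. Again, this is a hypothetical sketch, not a tested tool.

    #!/usr/bin/env python
    # Hypothetical sketch: rank client NIDs by read/write RPC counts using
    # per-export stats files on an OSS.  Assumes Lustre 1.8-style paths:
    #   /proc/fs/lustre/obdfilter/<OST>/exports/<NID>/stats
    # where each stats line looks like: "<name> <samples> samples [unit] ..."
    import glob
    import os

    rpcs = {}
    for path in glob.glob('/proc/fs/lustre/obdfilter/*/exports/*/stats'):
        nid = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            for line in f:
                fields = line.split()
                # The second field is the sample count, i.e. the number of RPCs.
                if fields and fields[0] in ('read_bytes', 'write_bytes'):
                    rpcs[nid] = rpcs.get(nid, 0) + int(fields[1])

    # Busiest clients first.
    for nid, count in sorted(rpcs.items(), key=lambda kv: -kv[1]):
        print('%10d RPCs  %s' % (count, nid))

Ranking by the sample counts of read_bytes/write_bytes would point directly at clients issuing large numbers of small RPCs, which is the pattern we suspect is triggering the congestion.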

I realize this issue is very general, but I just wanted to get a dialog going. I have plenty of log data and Lustre dumps if that would be helpful.

Attachments

Issue Links

Activity

People

Assignee: laisiyao Lai Siyao
Reporter: jamervi Joe Mervini
Votes: 0
Watchers: 2

Dates

Created:
Updated:
Resolved: