Lustre / LU-12439

Convert "Response timed out..." in lnet_finalize_expired_responses to CDEBUG

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Lustre 2.13.0

    Description

      I've noticed that this error message causes a lot of noise when a router node goes down. For example, I rebooted two routers on a system with just 40 compute nodes. The first of these error messages appeared about one minute after I initiated the reboot:

      Reboot started at - Fri Jun 14 14:34:35 CDT 2019

      saturn-smw:/var/opt/cray/log/p2-current # grep -m 1 lnet_finalize_expired_responses console-20190614
      2019-06-14T14:35:33.704182-05:00 c0-1c1s9n3 LNet: 10316:0:(lib-move.c:2888:lnet_finalize_expired_responses()) Response timed out: md = ffff8810119a32a8: nid = 485@gni4
      

      In the roughly 8 minutes it took the routers to reboot, lnet_finalize_expired_responses logged 797 entries to the console log:

      saturn-smw:/var/opt/cray/log/p2-current # grep -c lnet_finalize_expired_responses console-20190614
      797
      saturn-smw:/var/opt/cray/log/p2-current #
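
A per-node breakdown could show whether the messages are spread evenly across the compute nodes or concentrated on a few. The following is only a minimal sketch: it assumes the node name is the second whitespace-separated field of each console line, as in the sample line above, and `count_by_node` is a hypothetical helper name, not part of Lustre or the Cray tooling.

```shell
# Hypothetical helper: count matching console-log lines per reporting node.
# Assumption: the node name (e.g. c0-1c1s9n3) is the second
# whitespace-separated field of each log line.
count_by_node() {
    grep lnet_finalize_expired_responses "$1" \
        | awk '{ print $2 }' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Running `count_by_node console-20190614` would then list each node with its message count, busiest node first.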

      I don't see much value in this message for system administrators, so I think it should be converted to a CDEBUG.

          People

            hornc Chris Horn