Details
Improvement
Resolution: Fixed
Minor
Description
I've noticed that the "Response timed out" message from lnet_finalize_expired_responses() causes a lot of noise when a router node goes down. For example:
I rebooted two routers on a system with just 40 compute nodes. The first of these messages appeared about a minute after I initiated the reboot of the routers:
Reboot started at - Fri Jun 14 14:34:35 CDT 2019
saturn-smw:/var/opt/cray/log/p2-current # grep -m 1 lnet_finalize_expired_responses console-20190614
2019-06-14T14:35:33.704182-05:00 c0-1c1s9n3 LNet: 10316:0:(lib-move.c:2888:lnet_finalize_expired_responses()) Response timed out: md = ffff8810119a32a8: nid = 485@gni4
In the time it took the routers to reboot, about 8 minutes, there were 797 entries from lnet_finalize_expired_responses in the console log:
saturn-smw:/var/opt/cray/log/p2-current # grep -c lnet_finalize_expired_responses console-20190614
797
saturn-smw:/var/opt/cray/log/p2-current #
I don't see much value in this message for system administrators, so I think it should be converted to a CDEBUG.
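To see where the noise comes from, the grep counts above can be broken down by minute, by reporting node, and by target NID. The following is only a rough sketch: the regular expression is inferred from the single console line quoted in this ticket, and the function name summarize and the SAMPLE list are made up for illustration, so other console log formats may need a different pattern. For scale, 797 entries in about 8 minutes works out to roughly 100 messages per minute.

```python
import re
from collections import Counter

# Pattern inferred from the one console line quoted in the ticket
# (an assumption; adjust it if your console log format differs).
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+[+-]\d{2}:\d{2}\s+"
    r"(?P<node>\S+)\s+LNet:.*lnet_finalize_expired_responses\(\)\).*"
    r"nid = (?P<nid>\S+)\s*$"
)


def summarize(lines):
    """Count matching messages per minute, per reporting node, and per NID."""
    per_minute, per_node, per_nid = Counter(), Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a lnet_finalize_expired_responses message
        per_minute[m["minute"]] += 1
        per_node[m["node"]] += 1
        per_nid[m["nid"]] += 1
    return per_minute, per_node, per_nid


# The sample line is copied from the ticket; the second line is unrelated
# filler to show that non-matching lines are skipped.
SAMPLE = [
    "2019-06-14T14:35:33.704182-05:00 c0-1c1s9n3 LNet: 10316:0:"
    "(lib-move.c:2888:lnet_finalize_expired_responses()) "
    "Response timed out: md = ffff8810119a32a8: nid = 485@gni4",
    "some unrelated console line",
]

per_minute, per_node, per_nid = summarize(SAMPLE)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
```

To run it against a real log, pass an open file instead of SAMPLE, for example summarize(open("console-20190614")).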
Activity
Fix Version/s | New: Lustre 2.13.0 [ 14290 ] | |
Resolution | New: Fixed [ 1 ] | |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Landed for 2.13