Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 1.8.9, Lustre 2.4.1
-
RHEL5 server, SLES11 SP1 router/client as well as RHEL6 server w/ SLES 11 SP1 or SP2 client
-
3
-
10722
Description
We'll need some more information on data to gather server side, but when the Titan compute platform is shut down the value of the queued messages in /proc/sys/lnet/stats on the server remains constant until the target platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large scale test shot with 2.4.0 Servers and 2.4.0 clients using SLES11 SP2.
We have a home-grown monitor for the backlog of messages for a particular server (and LNET RTR, but at the time of reporting the LNET RTR's are all down from a hardware perspective) – We can attach that script if it may be useful.
Please provide the data gathering techniques we should employ to make problem diagnosis more informative. We will likely have a shot at data gathering every Tuesday.
While there are a large number of LNET messages queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers (either directly connected or through other routers); which is why I marked this as Minor.
January 21 we preformed a test shot with our Luste 2.4 file system with 2.4 clients. Before we were on 1.8 clients. During startup we ran into this issue. We have another test Feburary 4th and will perform the upgrade the 28th. I plan to run with this patch server side to see if resolves the issues we are seeing.