Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 1.8.9, Lustre 2.4.1
    • Environment: RHEL5 server, SLES11 SP1 router/client as well as RHEL6 server w/ SLES 11 SP1 or SP2 client
    • Severity: 3
    • 10722

    Description

      We'll need some more information on what data to gather server side, but when the Titan compute platform is shut down, the value of the queued messages in /proc/sys/lnet/stats on the server remains constant until the target platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large-scale test shot with 2.4.0 servers and 2.4.0 clients using SLES11 SP2.

      We have a home-grown monitor for the backlog of messages for a particular server (and LNET RTR, though at the time of reporting the LNET RTRs are all down from a hardware perspective). We can attach that script if it would be useful.

      Please provide the data gathering techniques we should employ to make problem diagnosis more informative. We will likely have a shot at data gathering every Tuesday.

      While there are a large number of LNET messages queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers (either directly connected or through other routers), which is why I marked this as Minor.
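
      For reference, a minimal sketch of how that queued-message counter could be sampled outside the attached script. It assumes the first two fields of /proc/sys/lnet/stats are msgs_alloc and msgs_max (the lnet_counters_t layout); that layout should be verified against the running release.

        /* Minimal sketch: sample /proc/sys/lnet/stats and report the
         * message backlog. ASSUMPTION: the first two fields are
         * msgs_alloc and msgs_max, as in lnet_counters_t. */
        #include <stdio.h>

        int main(void)
        {
                FILE *fp = fopen("/proc/sys/lnet/stats", "r");
                long msgs_alloc, msgs_max;

                if (fp == NULL) {
                        perror("open /proc/sys/lnet/stats");
                        return 1;
                }
                if (fscanf(fp, "%ld %ld", &msgs_alloc, &msgs_max) != 2) {
                        fprintf(stderr, "unexpected stats format\n");
                        fclose(fp);
                        return 1;
                }
                fclose(fp);
                printf("active/queued messages: %ld (peak %ld)\n",
                       msgs_alloc, msgs_max);
                return 0;
        }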

      Attachments

        1. lnet_stats.sh
          2 kB
        2. LU-4006.tgz
          0.2 kB

        Activity

          [LU-4006] LNET Messages staying in Queue

          Patch http://review.whamcloud.com/#/c/8041/ landed to master.

          I'll look into plans for landing in b2_4 and b2_5.

          jamesanunez James Nunez (Inactive) added a comment

          Will this patch be cherry picked to b2_4 and b2_5?

          simmonsja James A Simmons added a comment

          Hi, we've tested the patch on our internal test systems, and there will be some additional testing at Hyperion. Meanwhile, if you want to test it at ORNL, I'd suggest:

          • Put it on a few clients first. Although it's supposed to solve a server-side problem, it changes the code path for every outgoing message.
          • Then put it on a couple of servers, avoiding the MDS/MGS though.
          isaac Isaac Huang (Inactive) added a comment

          Patch updated: we can't use LNET_MD_FLAG_ZOMBIE, which can be set as a result of MD auto unlink, where it's not an abort and active messages should not be canceled as a result (e.g. a REPLY message for an MD exhausted by the corresponding GET). The only way is to add a new flag, LNET_MD_FLAG_ABORTED, set by LNetM[DE]Unlink.

          isaac Isaac Huang (Inactive) added a comment
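
          A rough sketch of the distinction, with stand-in definitions rather than the actual patch; it assumes md_flags on lnet_libmd_t and LNET_MD_FLAG_ZOMBIE being set by auto unlink, and the LNET_MD_FLAG_ABORTED bit value and helper names are only illustrative.

            /* Stand-ins for the lnet/lib-types.h definitions; values and
             * helper names are illustrative, not the landed patch. */
            #define LNET_MD_FLAG_ZOMBIE   (1 << 0)
            #define LNET_MD_FLAG_ABORTED  (1 << 2)   /* proposed new bit */

            struct lnet_libmd {
                    unsigned int md_flags;
                    /* other fields elided */
            };

            /* MD exhausted (auto unlink), e.g. threshold used up by a GET:
             * mark it zombie but do NOT abort, since an active message such
             * as the corresponding REPLY must still be allowed to complete. */
            static void md_auto_unlink(struct lnet_libmd *md)
            {
                    md->md_flags |= LNET_MD_FLAG_ZOMBIE;
            }

            /* Explicit LNetMDUnlink/LNetMEUnlink: the caller is giving up,
             * so queued messages referencing this MD may be canceled. */
            static void md_explicit_unlink(struct lnet_libmd *md)
            {
                    md->md_flags |= LNET_MD_FLAG_ZOMBIE | LNET_MD_FLAG_ABORTED;
            }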

          Liang, yes, it's a change in the case you described. But I'd regard unlinking an MD while there are still active messages you don't want to give up as an abuse of an obscure part of the API semantics, rather than as a valid use case we'd support. I've pushed an update that adds a comment above LNetMDUnlink: "As a result, active messages associated with the MD may get aborted.".

          isaac Isaac Huang (Inactive) added a comment
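
          Roughly how that added sentence would sit over the prototype; only the quoted sentence comes from the update above, while the placement and surrounding comment text are assumed.

            /* (surrounding doc comment text assumed)
             * ...
             * As a result, active messages associated with the MD may get
             * aborted. */
            int LNetMDUnlink(lnet_handle_md_t mdh);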

          There is a patch ready for testing

          simmonsja James A Simmons added a comment

          Can I get an idea of the time to solution? Are we looking at something that will be tested and ready to install in the next month or something longer?

          Thx.

          -Jason

          hilljjornl Jason Hill (Inactive) added a comment

          Yes, I think this way is better. The other reason I proposed using a new flag on the MD is that, even if we only track the last message associated with the MD, we would still need some complex locking because different messages are protected by different locks; this way completely avoids those complex lock operations.
          Still, I think that having unlink imply abort is a slight semantic change. For example, if a user creates an MD with threshold == LNET_MD_THRESH_INF, pings an arbitrary number of peers, unlinks immediately, and relies on the callback to count results, it will just work with current LNet, but some pings might fail if unlink implied aborting queued messages. But I think it is probably fine if nobody relies on this semantic (at least Lustre doesn't), so I don't insist on this.

          liang Liang Zhen (Inactive) added a comment

          Patch posted: http://review.whamcloud.com/#/c/8041/

          Though the idea was straightforward, the original approach I suggested turned out to be too messy to implement: the MD would need to keep a pointer to a message that is not reference counted, it is very tricky to remove a message from its peer/NI queue, and so on.

          Instead I chose to abort messages in lnet_post_send_locked(). The drawback is that it would take an additional LND timeout for messages to be aborted. But the advantages are much, much simpler code, and that all queued messages on an unlinked MD will be aborted rather than just one. I think this is a good trade-off between instantaneous unlink and code complexity.

          isaac Isaac Huang (Inactive) added a comment
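
          A simplified sketch of where such a check could sit; the stand-in types, the msg_md back-pointer, the flag value, and the helper name are assumptions for illustration, not the landed change.

            #include <errno.h>
            #include <stddef.h>

            /* Stand-ins for the lnet/lib-types.h definitions. */
            #define LNET_MD_FLAG_ABORTED  (1 << 2)   /* assumed bit value */

            struct lnet_libmd { unsigned int md_flags; };
            struct lnet_msg   { struct lnet_libmd *msg_md; };

            /* Early-abort check that lnet_post_send_locked() could make
             * before the message sits waiting for peer/NI credits: if the
             * MD behind the message has been explicitly unlinked, fail the
             * send with a cancellation status instead of leaving it queued
             * until the remote side comes back. */
            static int msg_abort_check(struct lnet_msg *msg)
            {
                    if (msg->msg_md != NULL &&
                        (msg->msg_md->md_flags & LNET_MD_FLAG_ABORTED))
                            return -ECANCELED; /* seen as SENT, unlinked=1 */
                    return 0;
            }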

          If I haven't missed something, I don't see any semantic change. Currently callers would eventually get a SENT event with the unlinked flag set and likely an error status code. With the proposed change, it's still the same thing, i.e. a SENT event with unlinked=1 and status=-ECANCELED; the only difference is that now the event could come much sooner. Callers are supposed to handle that event anyway, sooner or later. I don't think this is a semantic change. The API semantics never say (and can't say) how soon such a piggybacked unlink event will happen. If for some reason Lustre can't handle an instantaneous piggybacked unlink, then it's a bug in Lustre that should be fixed.

          isaac Isaac Huang (Inactive) added a comment
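
          From the caller's point of view the handling is identical either way; a small sketch of an event-queue callback, using stand-in definitions for the LNet event type and its fields (type, status, unlinked), which are assumptions here rather than code from Lustre's handlers.

            #include <stdio.h>

            enum { LNET_EVENT_SEND = 1 };        /* stand-in value */

            struct lnet_event {                  /* subset of lnet_event_t */
                    int type;
                    int status;    /* 0 on success, negative errno on failure */
                    int unlinked;  /* 1 when this event also unlinked the MD  */
            };

            /* A canceled send is just the normal SENT completion with a
             * non-zero status and unlinked set; whether it arrives after an
             * LND timeout or right at unlink time, the handler is the same. */
            static void eq_callback(struct lnet_event *ev)
            {
                    if (ev->type == LNET_EVENT_SEND && ev->status != 0)
                            fprintf(stderr, "send aborted: status %d unlinked %d\n",
                                    ev->status, ev->unlinked);
            }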

          One concern of mine is that if we don't add any new flag (or an explicit API for abort), then we are changing the semantics of the current API: right now LNetMDUnlink will not implicitly abort any in-flight PUT/GET (although Lustre doesn't rely on this). So would it be reasonable to add a new flag to allow Unlink to abort in-flight messages?

          liang Liang Zhen (Inactive) added a comment

          People

            Assignee: isaac Isaac Huang (Inactive)
            Reporter: hilljjornl Jason Hill (Inactive)
            Votes: 0
            Watchers: 10

            Dates

              Created:
              Updated:
              Resolved: