[LU-961] lfs df -h hangs Created: 04/Jan/12  Updated: 05/Jan/12  Resolved: 05/Jan/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Brian Murrell (Inactive) Assignee: Brian Murrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

TCP network


Attachments: Text File client1.txt     Text File mds1_messages.txt     Text File mds2_messages.txt     Text File oss1_messages.txt     Text File oss1_messages.txt    
Severity: 3
Rank (Obsolete): 6498

 Description   

When I try to use lfs df -h, the command hangs:

# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID       767.8M       35.1M      681.5M   5% /mnt/lustre[MDT:0]

Any ideas why this is hanging?



 Comments   
Comment by Brian Padfield (Inactive) [ 04/Jan/12 ]

Can we get message logs from the MDS and OSS(s)?

Comment by Brian Murrell (Inactive) [ 04/Jan/12 ]

I have way too many OSSes to add the logs from all of them. Can you be more specific about which OSSes you want logs from?

Comment by Brian Padfield (Inactive) [ 04/Jan/12 ]

How about the MDS and the OSS that has OST0000?

Comment by Brian Murrell (Inactive) [ 04/Jan/12 ]

Find them attached.

I am just in the process of rebooting my whole cluster to see if the problem is resolved by doing that.

Comment by Brian Murrell (Inactive) [ 04/Jan/12 ]

After rebooting, things still don't look good:

# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID       767.8M       35.1M      681.5M   5% /mnt/lustre[MDT:0]
OST0000             : inactive device
lustre-OST0001_UUID      1007.9M       52.2M      904.5M   5% /mnt/lustre[OST:1]

filesystem summary:      1007.9M       52.2M      904.5M   5% /mnt/lustre
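
For reference, a minimal way to check from a client whether it still considers OST0000 active (the parameter path is an assumption based on a standard Lustre 2.x client; the target name is taken from the output above):

# lctl dl | grep OST0000
# lctl get_param osc.*OST0000*.active

A value of 0, or no matching OSC device at all, would be consistent with the "inactive device" line above.
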
Comment by Michael Helmer (Inactive) [ 04/Jan/12 ]

I am seeing the following network-related errors in both the MDS and the OSS logs. Can you confirm that your LNET network is up and functioning as expected? In particular, I would like to see whether the OSS that you provided logs from can ping the MDS over LNET (i.e. lctl ping <ipofMDS>), and conversely, whether the MDS can ping the OSS.

Jan  4 10:18:35 mds1 kernel: LustreError: 4047:0:(socklnd.c:2420:ksocknal_base_startup()) Can't spawn socknal scheduler[0]: -513
Jan  4 10:18:35 mds1 kernel: LustreError: 105-4: Error -100 starting up LNI tcp
Jan  4 10:18:35 mds1 kernel: LustreError: 4047:0:(events.c:728:ptlrpc_init_portals()) network initialisation failed
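
The two-way check asked for above would look roughly like this (hostnames here are placeholders, not taken from this cluster):

[root@oss1 ~]# lctl list_nids
[root@oss1 ~]# lctl ping <MDS hostname or NID>
[root@mds ~]# lctl ping <OSS hostname or NID>

A successful ping prints the peer's NIDs; a hang or an error from either direction would point to an LNET problem.
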
Comment by Brian Murrell (Inactive) [ 04/Jan/12 ]

Hi,

It seems that communication between oss1 and mds2 is working:

[root@oss1 ~]# lctl ping mds2
12345-0@lo
12345-192.168.122.155@tcp
[root@mds2 ~]# lctl ping oss1
12345-0@lo
12345-192.168.122.147@tcp

I notice the timestamps on the messages you are referring to. I must apologize for not mentioning, when I attached the logs, that the cluster was down for maintenance and testing at that point. It wasn't back in production until a few hours later, so looking further along in the logs will probably be more helpful.

Comment by Michael Helmer (Inactive) [ 04/Jan/12 ]

Can you please upload the full MDS and OSS logs following the reboot? It appears the MDS was still coming up from the reboot when the log was captured.

Thanks!

Comment by Brian Murrell (Inactive) [ 05/Jan/12 ]

I've attached the full message log from mds2, which is actually the MDS on this filesystem; I erroneously gave you mds1's log yesterday. mds1 is the current MGS.

I've also attached the newer oss1 logs from the point where the previous oss1 log attachment ended.

Comment by Brent VanDyke (Inactive) [ 05/Jan/12 ]

Could we get logs from one of the clients experiencing the hang? Also, is the hang experienced from all clients, only one, or only a certain subset of clients within a particular portion of the network?

Comment by Michael Helmer (Inactive) [ 05/Jan/12 ]

Here are the client logs.

Comment by Brian Murrell (Inactive) [ 05/Jan/12 ]

Faulty network discovered.
