[LU-961] lfs df -h hangs Created: 04/Jan/12 Updated: 05/Jan/12 Resolved: 05/Jan/12 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Brian Murrell (Inactive) | Assignee: | Brian Murrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | TCP network |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 6498 |
| Description |
|
When I try to use lfs df -h, the command hangs:

# lfs df -h
UUID                   bytes        Used   Available  Use%  Mounted on
lustre-MDT0000_UUID   767.8M       35.1M      681.5M    5%  /mnt/lustre[MDT:0]

Any ideas why this is hanging? |
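For what it's worth, a quick way to narrow down which target is blocking the statfs is to query every server from the hung client. A minimal sketch, assuming the standard lfs/lctl utilities and the client mount shown above:

# lfs check servers     (reports the status of every configured MDT/OST from this client; assumes the lustre mount is active)
# lctl dl               (lists the local Lustre devices and their state, e.g. UP/ST)

Whichever OSC entry fails or is not UP points at the OST whose statfs is stalling the listing.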
| Comments |
| Comment by Brian Padfield (Inactive) [ 04/Jan/12 ] |
|
Can we get message logs from the MDS and OSS(s)? |
| Comment by Brian Murrell (Inactive) [ 04/Jan/12 ] |
|
I have way too many OSSes to add the logs from all of them. Can you be more specific about which OSSes you want logs from? |
| Comment by Brian Padfield (Inactive) [ 04/Jan/12 ] |
|
How about the MDS and the OSS that has OST0000? |
| Comment by Brian Murrell (Inactive) [ 04/Jan/12 ] |
|
Find them attached. I am just in the process of rebooting my whole cluster to see if the problem is resolved by doing that. |
| Comment by Brian Murrell (Inactive) [ 04/Jan/12 ] |
|
After rebooting, things still don't look good:

# lfs df -h
UUID                   bytes        Used   Available  Use%  Mounted on
lustre-MDT0000_UUID   767.8M       35.1M      681.5M    5%  /mnt/lustre[MDT:0]
OST0000             : inactive device
lustre-OST0001_UUID  1007.9M       52.2M      904.5M    5%  /mnt/lustre[OST:1]

filesystem summary:  1007.9M       52.2M      904.5M    5%  /mnt/lustre |
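The "inactive device" line usually means the client's OSC import for OST0000 is not connected. A rough way to confirm from the client, assuming the filesystem name lustre as above (exact parameter names may differ between Lustre versions):

# lctl dl | grep OST0000                           (is the osc device for OST0000 present and UP?)
# lctl get_param osc.lustre-OST0000-osc-*.active   (assumed parameter name; 1 = active, 0 = administratively deactivated)

If the device is missing or stuck disconnected, the next step would be to check the OSS hosting OST0000 (e.g. that the OST is actually mounted there).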
| Comment by Michael Helmer (Inactive) [ 04/Jan/12 ] |
|
I am seeing the following network-related errors in both the MDS and the OSS logs. Can you confirm that your LNET network is up and functioning as expected? In particular, I would like to know whether the OSS that you provided logs from can ping the MDS over LNET (i.e. lctl ping <ipofMDS>), and conversely whether the MDS can ping the OSS.

Jan  4 10:18:35 mds1 kernel: LustreError: 4047:0:(socklnd.c:2420:ksocknal_base_startup()) Can't spawn socknal scheduler[0]: -513
Jan  4 10:18:35 mds1 kernel: LustreError: 105-4: Error -100 starting up LNI tcp
Jan  4 10:18:35 mds1 kernel: LustreError: 4047:0:(events.c:728:ptlrpc_init_portals()) network initialisation failed |
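For reference, a minimal two-way LNET check, assuming TCP NIDs like the ones in these logs (the addresses below are placeholders; substitute the NIDs reported by lctl list_nids on each node):

On the OSS:
# lctl list_nids                  (show this node's own NIDs)
# lctl ping <MDS-NID>@tcp         (placeholder; use the MDS NID from its list_nids output)

On the MDS:
# lctl list_nids
# lctl ping <OSS-NID>@tcp         (placeholder; use the OSS NID from its list_nids output)

Both pings should return the remote node's NIDs; a hang or "Input/output error" would point at an LNET/TCP problem rather than a Lustre target problem.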
| Comment by Brian Murrell (Inactive) [ 04/Jan/12 ] |
|
Hi, it seems that communication between oss1 and mds2 is working:

[root@oss1 ~]# lctl ping mds2
12345-0@lo
12345-192.168.122.155@tcp

[root@mds2 ~]# lctl ping oss1
12345-0@lo
12345-192.168.122.147@tcp

I'm noticing the timestamps on the messages you are referring to. I must apologize for not mentioning, when I attached the logs, that the cluster was down for maintenance and testing at that point. It wasn't back in production until a few hours later, so looking further along in the log might be helpful. |
| Comment by Michael Helmer (Inactive) [ 04/Jan/12 ] |
|
Can you please upload the full MDS and OSS logs following the reboot? It appears the MDS was still coming up from the reboot when the log was captured. Thanks! |
| Comment by Brian Murrell (Inactive) [ 05/Jan/12 ] |
|
I've attached the full message log from mds2, which is actually the MDS on this filesystem; I erroneously gave you mds1's log yesterday (mds1 is the current MGS). I've also attached the newer oss1 log, starting from the point where the previous oss1 attachment ended. |
| Comment by Brent VanDyke (Inactive) [ 05/Jan/12 ] |
|
Could we get logs from one of the clients experiencing the hang? Also, is the hang seen from all clients, only one, or only a subset of clients within a particular portion of the network? |
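If it helps, one way to capture a Lustre debug log on a client while the hang is reproduced; a sketch, assuming default debug settings (the output path is just an example):

# lctl clear                              (clear the kernel debug buffer before reproducing)
# lfs df -h &                             (reproduce the hang in the background)
# lctl dk /tmp/lustre-client-debug.log    (dump the debug buffer to a file once the command hangs; path is arbitrary)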
| Comment by Michael Helmer (Inactive) [ 05/Jan/12 ] |
|
Here are the client logs. |
| Comment by Brian Murrell (Inactive) [ 05/Jan/12 ] |
|
Faulty network discovered. |