[LU-1307] Clients having issues mounting Lustre Created: 11/Apr/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Dennis Nelson | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | Servers: CentOS 5.5 |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10136 |
| Description |
|
Customer reports that some clients have difficulties mounting Lustre filesystems. Running lustre_rmmod and then mount -at lustre seems to clear up the problem. This happens right after a reboot of the system.

[root@dtn1 ~]# mount -at lustre

/etc/fstab:

[root@dtn1 ~]# cat /etc/modprobe.d/lustre.conf

Also, I have attached /var/log/messages showing the recent boot and the Lustre errors reported. You can see in the log that I ran mount -at lustre at Apr 11 13:14:20. The customer is asking why this is happening and I do not have an explanation. |
| Comments |
| Comment by Cliff White (Inactive) [ 11/Apr/12 ] |
|
Are you certain the servers have finished recovery after the reboot? Please examine the client system log; there should be Lustre errors there which may provide more information. |
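For reference, a quick way to confirm both points is sketched below; the /proc paths assume 1.8-era servers and differ between versions and target types, so treat them as illustrative rather than exact.

# On the MDS and each OSS: recovery is finished when the status reads COMPLETE
cat /proc/fs/lustre/mds/*/recovery_status
cat /proc/fs/lustre/obdfilter/*/recovery_status

# On the client: look for LustreError lines around the failed mount attempt
grep LustreError /var/log/messages | tail -n 50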
| Comment by Dennis Nelson [ 11/Apr/12 ] |
|
Yes, I am sure that recovery was complete. The servers were booted yesterday and have been back in production for over 12 hours. I included the messages file with the original post; it has Lustre errors in it. I believe what might be happening is that the system attempts to mount the filesystems before the IB network is functioning, and this puts the system in an error state that it cannot recover from without unloading the modules. Is that possible? Shouldn't a new mount request attempt to re-establish communication with the servers instead of just erroring out because there was an error previously? |
| Comment by Cliff White (Inactive) [ 16/May/12 ] |
|
You should use the _netdev mount option to avoid Lustre client mount attempts prior to network startup. The explanation is simple: you are trying to mount the filesystems before the network is up. |
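A minimal sketch of such an fstab entry follows; the MGS NID, filesystem name, and mount point are placeholders, not values taken from this system.

# /etc/fstab -- _netdev defers the mount until networking has been started
# (the NID, fsname, and mount point below are hypothetical)
10.0.0.1@o2ib:/lfs   /mnt/lfs   lustre   defaults,_netdev   0 0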
| Comment by Peter Jones [ 04/Jun/12 ] |
|
Dennis, any further questions, or can we close this ticket? Thanks, Peter |
| Comment by Nathan Dauchy (Inactive) [ 04/Jun/12 ] |
|
IMHO this is still a bug. Yes, the _netdev option can help. However, the Lustre client should gracefully handle problems when it tries to mount prior to the IB network being fully up, and a remount should be sufficient. The need for lustre_rmmod is not intuitive to system administrators. It can even be problematic if the client has another (active) Lustre mount and it is therefore impossible to unload the Lustre modules. Thanks, Nathan |
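For context, the workaround described in this ticket amounts to the two commands below; the first is the step that becomes impossible when the client has another active Lustre mount, since the modules are then in use.

lustre_rmmod       # unload the Lustre/LNet modules; refuses while any Lustre fs is still mounted
mount -at lustre   # then retry every fstab entry of type lustre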
| Comment by Andreas Dilger [ 04/Jun/12 ] |
|
Some notes here:
I believe that the root of the problem is with the ptlrpc module, since it starts the network connections when loaded, and may not retry establishing those connections if the network device was originally unavailable when it started. |
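One way to check for this condition on an affected client is to list the NIDs LNet actually configured when the modules loaded. lctl list_nids is standard usage; the o2ib NID mentioned in the comments below is an assumption about this site's configuration.

# Show the NIDs LNet configured at module load time
lctl list_nids
# On an IB client one would expect something like 10.0.0.1@o2ib here;
# seeing only a @tcp NID (or nothing) suggests LNet started before ib0 was up
# and, per the note above, never retried.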
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Close old ticket. |