[LU-1307] Clients having issues mounting Lustre Created: 11/Apr/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Dennis Nelson Assignee: Doug Oucharek (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Servers: CentOS 5.5
Clients: RHEL 6.0


Attachments: File messages    
Severity: 3
Rank (Obsolete): 10136

 Description   

Customer reports that some clients have difficulty mounting Lustre filesystems right after a reboot of the system. Running lustre_rmmod and then mount -at lustre seems to clear up the problem.

[root@dtn1 ~]# mount -at lustre
mount.lustre: mount 10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 at /scratch1 failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
Note 'alias lustre llite' should be removed from modprobe.conf
mount.lustre: mount 10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 at /scratch2 failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
Note 'alias lustre llite' should be removed from modprobe.conf
[root@dtn1 ~]# lustre_rmmod
[root@dtn1 ~]# mount -at lustre
[root@dtn1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_dtn1-lv_root
50G 17G 31G 36% /
tmpfs 24G 0 24G 0% /dev/shm
/dev/sda1 485M 52M 408M 12% /boot
10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1
2.5P 288T 2.2P 12% /scratch1
10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2
3.1P 427T 2.7P 14% /scratch2

/etc/fstab:
...
10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /scratch1 lustre defaults,flock 0 0
10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 /scratch2 lustre defaults,flock 0 0

[root@dtn1 ~]# cat /etc/modprobe.d/lustre.conf

# Lustre module configuration file
options lnet networks="o2ib2(ib0)"

Also, I have attached /var/log/messages showing the recent boot and the lustre errors reported.

You can see in the log that I ran mount -at lustre at Apr 11 13:14:20.
Then I ran lustre_rmmod and mount -at lustre and it worked.

The customer is asking why this is happening and I do not have an explanation.
I encountered similar issues on other clients after a reboot of the entire system.



 Comments   
Comment by Cliff White (Inactive) [ 11/Apr/12 ]

Are you certain the servers have finished recovery after the reboot? Please examine the client system log; there should be LustreError messages there which may provide more information.

Comment by Dennis Nelson [ 11/Apr/12 ]

Yes, I am sure that recovery was complete. The servers were booted yesterday and have been back in production for over 12 hours. I included the messages file with the original post; it contains the Lustre errors.

I believe what might be happening is that the system attempts to mount the filesystems before the IB network is functioning, which puts the client in an error state that it cannot recover from without unloading the modules. Is that possible? Shouldn't a new mount request attempt to re-establish communication with the servers instead of erroring out because of a previous failure?
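
As a rough check (ib0 is taken from the modprobe.conf above; ibstat assumes the infiniband-diags package is installed), the IB link state at mount time could be verified with something like:

[root@dtn1 ~]# ip link show ib0        # interface should report state UP
[root@dtn1 ~]# ibstat | grep -i state  # port State should be Active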

Comment by Cliff White (Inactive) [ 16/May/12 ]

You should use the _netdev mount option, which avoids Lustre client mount attempts prior to network startup. The explanation is simple: you are trying to mount a network file system before you have a live network. I am not sure why you would need a module unload; that should not be necessary. Simply waiting for the network to be up should be enough.
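
For example, the fstab entries from the description would become something like the following (a sketch; only _netdev is added, and option order is not significant):

10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /scratch1 lustre defaults,flock,_netdev 0 0
10.174.80.42@o2ib2:10.174.80.43@o2ib2:/scratch2 /scratch2 lustre defaults,flock,_netdev 0 0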

Comment by Peter Jones [ 04/Jun/12 ]

Dennis

Any further questions or can we close this ticket?

Thanks

Peter

Comment by Nathan Dauchy (Inactive) [ 04/Jun/12 ]

IMHO this is still a bug. Yes, the _netdev option can help. However, the Lustre client should gracefully handle problems when it tries to mount before the IB network is fully up, and a remount should be sufficient. The need for lustre_rmmod is not intuitive to system administrators, and it can even be problematic if the client has another (active) Lustre mount that makes it impossible to unload the Lustre modules.

Thanks,
Nathan
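
(As a side note, whether the modules are currently held by an active mount can be seen from the module use counts, e.g.:

[root@dtn1 ~]# lsmod | grep -E '^(lustre|lnet|ptlrpc)'

Non-zero "Used by" counts there would typically indicate that lustre_rmmod cannot fully unload the stack.)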

Comment by Andreas Dilger [ 04/Jun/12 ]

Some notes here:

  • I agree with Cliff that using _netdev can avoid this problem in most cases, and it is what I use at home, but it isn't totally clear whether IB network startup is treated the same as Ethernet (i.e. whether the "mount _netdev filesystems" step will be appropriately delayed until after IB is up).
  • I agree with Nathan that this is still a problem, since there can be other network problems that result in this "sticky" error (even with TCP), and it should be addressed.
  • the "mount.lustre" command accepts a "retry=N" mount option that allows the client to repeat the mount up to N times on failure, but this may not be enough in this case, and it can potentially hang the rest of the startup process (see the example after this list).
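
For illustration, an fstab entry combining _netdev with a retry count might look like the following (the retry value of 10 is purely illustrative, not a recommendation):

10.174.80.40@o2ib2:10.174.80.41@o2ib2:/scratch1 /scratch1 lustre defaults,flock,_netdev,retry=10 0 0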

I believe that the root of the problem is with the ptlrpc module, since it starts the network connections when loaded, and may not retry establishing those connections if the network device was originally unavailable when it started.
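
As a diagnostic sketch (lctl and lustre_rmmod are the standard tools already referenced above; the NID shown is hypothetical), one could check whether LNet actually registered the expected o2ib NID after the modules loaded, and only reload the modules if it did not:

[root@dtn1 ~]# lctl list_nids
10.174.80.50@o2ib2
# if no o2ib2 NID is listed, the network was likely not up when the modules
# loaded; unload them once ib0 is up and retry the mount, as in the workaround above:
[root@dtn1 ~]# lustre_rmmod
[root@dtn1 ~]# mount -at lustre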

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
