Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version: Lustre 1.8.6
- Components: None
- Environment: RHEL 5.5 and Lustre 1.8.0.1 on J4400's
- Severity: 3
- Bugzilla ID: 10266
Description
OST 10 /dev/md30 resident on OSS3
From /var/log/messages:
LDISKFS-fs warning (device md30): ldiskfs_multi_mount_protect: fsck is running on the filesystem
LDISKFS-fs warning (device md30): ldiskfs_multi_mount_protect: MMP failure info: last update time: <time in Unix seconds>, last update node: OSS3, last update device: /dev/md30
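(For reference, the on-disk MMP state can also be inspected directly. This is a sketch assuming an MMP-aware e2fsprogs build, such as the Lustre-patched one, and the device name above:)
# print the superblock header, including the MMP block number and update interval
dumpe2fs -h /dev/md30 | grep -i mmp
# dump the contents of the MMP block itself (last updating node, device, sequence)
debugfs -c -R dump_mmp /dev/md30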
This is a scenario that keeps sending the customer in circles. They know for certain that no fsck is running, so they try to turn the MMP feature off via the following commands:
To manually disable MMP, run:
tune2fs -O ^mmp <device>
To manually enable MMP, run:
tune2fs -O mmp <device>
These commands fail, reporting that a valid superblock does not exist, yet the customer can see a valid superblock (with the mmp feature set) by running the following command:
tune2fs -l /dev/md30
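If tune2fs refuses the feature change even though the superblock is readable, one possible workaround is to force it. This is a sketch, not verified on this system, using the standard e2fsprogs -f (force) flag:
# force the feature change past tune2fs's error checks, then confirm mmp is gone
tune2fs -f -O ^mmp /dev/md30
tune2fs -l /dev/md30 | grep -i mmp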
It is their understanding that a fix for this issue was released in a later version of Lustre, but short of upgrading, is there a way to disable MMP here?
Customer contact is tyler.s.wiegers@lmco.com
Tyler, I left a voicemail for you at the number you provided in email.
For the oops message, the easiest way to handle that would be to take a photo of the screen and attach it. Otherwise, having the actual error message (e.g. NULL pointer dereference at ...), the process name, and the list of function names from the top of the stack (i.e. those functions most recently called) would help debug that problem.
Normally, if e2fsck completes successfully for the OST, then Lustre should be able to mount the filesystem and run with it, regardless of what corruption there was in the past, but of course I can't know what other kinds of corruption might be causing strange problems.
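For completeness, the kind of full check referred to above would be run with the OST unmounted, along these lines (a sketch; -f forces a check even if the filesystem looks clean, -y answers yes to all repairs):
e2fsck -fy /dev/md30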
I definitely would not classify such problems as something that happens often, so while understanding what is going wrong and fixing it is useful to us, you need to weigh the value of the data in the filesystem to the users against the downtime it is taking to debug this problem. Of course it would be easier and faster to debug with direct access to the logs, but many sites that are disconnected from the internet run Lustre, so this is nothing new.
Depending on the site's tolerance for letting data out, there are a number of ways we've worked with such sites in the past. One way is to print the logs and then scan them on an internet-connected system and attach them to the bug. This maintains an "air gap" for the system while still being relatively high bandwidth, if there is nothing sensitive in the log files themselves.
If you are not already in a production situation, I would strongly recommend upgrading to Lustre 1.8.5. It is running stably on many systems, and given the difficulty of diagnosing some of the problems you have already seen, it would be unfortunate to have to diagnose problems that were already fixed, under even more difficult circumstances. In fact, I know of very few 1.8.x sites that are still running 1.8.0.1.