As an update on this bug, Tyler and I spoke at length on the phone this morning. After restarting the OSTs and clients, the filesystem mounted without problems, and at least "lfs df" worked for all OSTs while we were on the phone.
However, the corruption on some of the OSTs, combined with the fact that all files are striped over all OSTs, means that some fraction of all files in the filesystem will have missing data. Since the filesystem is used only as a staging area, it is recommended that the filesystem simply be reformatted to get it back into a known state, instead of spending more time isolating which files were corrupted and then having to restore them into the filesystem anyway. This will also avoid any potential bugs or data corruption that may not be evident with limited testing.
We also discussed the current default configuration of striping all files across all 16 OSTs. I recommended that Tyler use the "lfs setstripe -c {stripes} {new file}" command to create some test files with different numbers of stripes and measure the performance, to determine the minimum stripe count that will hit the peak single-client performance, since the clients are largely doing independent IO to different files. At that point, running multiple parallel read/write jobs on files with the smaller stripe count should be compared with running the same workload on all wide-striped files.
Based on our discussion of the workload, it seems likely that the IO performance of a small number of OSTs (2-4) would match the current peak performance seen by the clients, while reducing contention on the OSTs when multiple clients are doing IO. Reducing the stripe count may increase the aggregate performance seen by multiple clients doing concurrent IO, because there is less chance of contention (seeking) on OSTs being used by multiple clients at once.
Reducing the stripe count would also help isolate the clients from any problems or slowdowns caused by individual OSTs. If an OST is unavailable, then any file that is striped over that OST will also be unavailable.
If an OST is slow for some reason (e.g. RAID rebuild, marginal disk hardware, etc.) then IO to any file striped over it will be limited by that slowest OST, so the more OSTs a file is striped over, the more likely such a problem is to hit a particular file. That said, if there is a minimum bandwidth requirement for a single file, instead of a desire to maximize the aggregate performance of multiple clients doing independent IO, then the file needs enough stripes that N * {slow OST} is still fast enough to meet that minimum bandwidth.
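As a worked example with assumed numbers (neither figure is from this ticket): if a degraded OST can still deliver roughly 100 MB/s and a single file must sustain 350 MB/s, the file needs at least ceil(350/100) = 4 stripes.

```shell
MIN_BW=350      # required single-file bandwidth, MB/s (assumption)
SLOW_OST=100    # worst-case per-OST bandwidth, MB/s (assumption)
# ceiling division: smallest N such that N * SLOW_OST >= MIN_BW
STRIPES=$(( (MIN_BW + SLOW_OST - 1) / SLOW_OST ))
echo "minimum stripe count: $STRIPES"
```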
Rob Baker of LMCO has confirmed that the critical situation is over and production is stable. Residual issues will be tracked under a new ticket.