[LU-270] LDisk-fs warning (device md30): ldisk_multi_mount_protect: fsck is running on filesystem Created: 03/May/11  Updated: 26/Oct/11  Resolved: 09/May/11

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 1.8.6

Type: Bug Priority: Major
Reporter: Dan Ferber (Inactive) Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 5.5 and Lustre 1.8.0.1 on J4400's


Severity: 3
Rank (Obsolete): 10266

 Description   

OST 10 /dev/md30 resident on OSS3
From /var/log/messages
LDISKFS-fs warning (device md30): ldiskfs_multi_mount_protect: fsck is running on filesystem
LDISKFS-fs warning (device md30): ldiskfs_multi_mount_protect: MMP failure info: <time in unix seconds>, last update node: OSS3, last update device: /dev/md30

This is a scenario that keeps sending the customer in circles. They know for certain that an fsck is not running. Given that, they tried to turn the MMP bit off via the following commands:

To manually disable MMP, run:
tune2fs -O ^mmp <device>
To manually enable MMP, run:
tune2fs -O mmp <device>

These commands fail, saying that a valid superblock does not exist, yet they can see their valid superblock (with MMP set) by running the following command:

tune2fs -l /dev/md30
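
A quick way to confirm that the superblock is readable and that the MMP feature is in fact set is shown below (a sketch; whether the MMP-related lines appear in the output depends on the installed e2fsprogs version):

# tune2fs -l /dev/md30 | egrep -i 'features|mmp'
# dumpe2fs -h /dev/md30 | egrep -i 'mmp'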

It is their understanding that a fix for this issue was released with a later version of Lustre, but aside from that, is there a way to do this?

Customer contact is tyler.s.wiegers@lmco.com



 Comments   
Comment by Andreas Dilger [ 03/May/11 ]

What version of e2fsprogs is being used here?

Comment by Tyler Wiegers (Inactive) [ 03/May/11 ]
# rpm -qa | grep e2fsprogs
e2fsprogs-libs-1.39.20.el5
e2fsprogs-1.40.11.sun1-0redhat
e2fsprogs-1.39-20.el5
Comment by Tyler Wiegers (Inactive) [ 03/May/11 ]

The specific error output that we see while mounting this OST is the following:

# mount -t lustre /dev/md30 /mnt/lustre_ost10
mount.lustre: mount /dev/md30 at /mnt/lustre_ost10 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.

From messages:

LDISKFS-fs warning (device md30): ldiskfs_multi_mount_protect: fsck is running on the filesystem
LDISKFS-fs warning (device md30): ldiskfs_multi_mount_protect: MMP failure info: last update time: 1304099783, last update node: oss3, last update device: /dev/md30
LustreError: 14496:0:(obd_mount.c:1278:server_kernel_mount()) premount /dev/md30:0x0 ldiskfs failed: -22, ldiskfs2 failed: -19. Is the ldiskfs module available?
LustreError: 14496:0:(obd_mount.c:1278:server_kernel_mount()) Skipped 3 previous similar messages
LustreError: 14496:0:(obd_mount.c:1590:server_fill_super()) Unable to mount device /dev/md30: -22
LustreError: 14496:0:(obd_mount.c:1993:lustre_fill_super()) Unable to mount (-22)

We believe this OST may have been put in this state because we attempted to run e2fsck (the messages output recommended running e2fsck against the OST). The fsck crashed when we ran it, which we think caused the OST to enter this state.

We tried turning off the MMP feature in order to mount the OST; however, when attempting to turn off MMP we got the same "fsck is running" error. We understand that there is a command in a newer version of e2fsprogs that would clear the MMP flag, which might help, but we don't have the luxury of blindly updating our systems hoping that it may help.

We appreciate your support!

Comment by Dan Ferber (Inactive) [ 03/May/11 ]

The customer is not currently undertaking any rebuild activity other than last night's CAM and firmware upgrades for the HW, hoping to first have a recommendation from Whamcloud on the way forward from a Lustre perspective. It has been discussed internally whether they should run 1.8.0.1 with any identified/recommended patches, upgrade completely to 1.8.5, or something else. What are Whamcloud's recommendations there?

From the customer's perspective, the HW has come back clean (CAM and firmware upgrades succeeded), so now they need some help in looking at their configuration, implementation, or Lustre itself.

Comment by Johann Lombardi (Inactive) [ 03/May/11 ]

> We understand that there is a command in a newer version of the e2fsprogs
> which would clear the MMP flag which might help

Right, that's "tune2fs -f -E clear-mmp $dev". However, it seems that 1.40.11-sun1 does not support this option:
http://lists.lustre.org/pipermail/lustre-discuss/2010-August/013818.html

> but we don't have the luxury of blindly updating our systems hoping that it may help.

I really think you should upgrade e2fsprogs since many MMP bugs have been fixed since.
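
For reference, a sketch of the recovery sequence once a newer e2fsprogs is installed (the device and mountpoint names are the ones used earlier in this ticket; adjust as needed):

OSS# tune2fs -f -E clear-mmp /dev/md30      # force-clear the stale MMP block
OSS# e2fsck -f /dev/md30                    # verify the filesystem is clean before remounting
OSS# mount -t lustre /dev/md30 /mnt/lustre_ost10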

Comment by Tyler Wiegers (Inactive) [ 03/May/11 ]

Is there any way to manually recover an OST when the MMP flag is set other than using the tune2fs command?

Comment by Cliff White (Inactive) [ 03/May/11 ]

The tune2fs command is the only way to reset that flag. e2fsprogs is very safe to upgrade; there is always complete backward compatibility. This is not a 'blind upgrade which might help' - it is a necessary upgrade of the utility that exists to fix exactly your issues. This relates to both problems you have reported.

Comment by Tyler Wiegers (Inactive) [ 03/May/11 ]

We're doing an emergency review board to approve installation of this package. We will have that installed tonight.

Pending that, we started having a third issue mounting an OST this morning (after updating disk firmware and CAM software). The error logs are below; if this needs a new bug report then that's fine, otherwise any comments would be appreciated.

oss4# mount -t lustre /dev/md11 /mnt/lustre_ost03
mount.lustre: mount /dev/md11 at /mnt/lustre_ost03 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.

From messages:

Kjournald starting. Commit interval 5 seconds
LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
LDISKFS FS on md11, external journal on md13
LDISKFS-fs: mounted filesystem with ordered data mode.
LustreError: 25721:0:(obdmount.c:272:ldd_parse()) disk data size does not match: see 0 expect 12288
LustreError: 25721:0:(obd_mount.c:1292:server_kernel_mount()) premount parse options failed: rc = -22
LustreError: 25721:0:(obd_mount.c:1590:server_fill_super()) Unable to mount device -22
LustreError: 25721:0:(obd_mount.c:1993:server_fill_super()) Unable to mount (-22)

We've tried running e2fsck with no success; e2fsck doesn't report any errors.

Comment by Tyler Wiegers (Inactive) [ 03/May/11 ]

We upgraded the e2fsprogs package, ran tune2fs with the clear-mmp option, ran e2fsck on that device, and were able to mount the OST. Good news there.

For the previous comment, we are running an e2fsck and checking out what new tune2fs options there are. I'll post back when we have some new information, but the indication at the moment is that ost03 still won't mount.

Comment by Cliff White (Inactive) [ 03/May/11 ]

It would be best to open a new bug; it is not good that you are having all these errors after your firmware upgrade.
It would be a good idea to run fsck -fn on all your disks to see if you have any other issues.
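
A minimal sketch of such a pass on one OSS; the device list is only an example and should be replaced with the actual OST arrays on each server. The -f forces a check even if the filesystem looks clean, and -n answers "no" to every question so nothing on disk is modified:

OSS# for dev in /dev/md11 /dev/md30; do echo "== $dev =="; e2fsck -fn $dev; done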

Comment by Sam Bigger (Inactive) [ 03/May/11 ]

Regarding the CAMs and drive upgrades, we have seen corrupted OSTs before on the Riverwalks (J4400's) when disk firmware was upgraded without both Lustre and the md software RAID being shut down cleanly first. Is there any chance that this particular OST10 was not cleanly shut down? We saw many cases of software RAID corruption on the J4400's a couple of years ago, which was about the time early versions of 1.8 started to be used. There were several software RAID corruption bugs that have since been fixed. We have also fixed many problems in Lustre since the early 1.8 releases, so we would encourage an upgrade to 1.8.5 at your earliest convenience.

If both Lustre and the MD device were shut down cleanly, then there should have been no problems like this; in that case, this would likely be a new bug that potentially still exists in the latest releases of Lustre.

Comment by Andreas Dilger [ 03/May/11 ]

> LustreError: 25721:0:(obdmount.c:272:ldd_parse()) disk data size does not match: see 0 expect 12288

This indicates that the CONFIGS/mountdata file is also corrupted (zero length file). It is possible to reconstruct this file by copying it from another OST and (unfortunately) binary editing the file. There are two fields that are unique to each OST that need to be modified.

First, on an OSS node make a copy of this file from a working OST, say OST0001:

OSS# debugfs -c -R "dump CONFIGS/mountdata /tmp/mountdata.ost01" {OST0001_dev}

Now the mountdata.ost01 file needs to be edited to reflect that it is being used for OST0003. If you have a favorite binary editor, that could be used. I use "xxd" from the "vim-common" package to convert the file into ASCII for editing, and then convert it back to binary.

The important parts of the file are all at the beginning; the rest of the file is common to all OSTs:

OSS# xxd /tmp/mountdata.ost01 /tmp/mountdata.ost01.asc
OSS# vi /tmp/mountdata.ost01.asc

0000000: 0100 d01d 0000 0000 0000 0000 0000 0000 ................
0000010: 0200 0000 0200 0000 0100 0000 0100 0000 ................
0000020: 6c75 7374 7265 0000 0000 0000 0000 0000 lustre..........
0000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000060: 6c75 7374 7265 2d4f 5354 3030 3031 0000 lustre-OST0001..
0000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................
[snip]

This is the "xxd" output showing a struct lustre_disk_data. The two fields that need to be edited are 0x0018 (ldd_svindex) and 0x0060 (ldd_svname).

Edit the "0100" in the second row, fifth column to be "0300".
Edit the "OST0001" line to be "OST0003":

0000000: 0100 d01d 0000 0000 0000 0000 0000 0000 ................
0000010: 0200 0000 0200 0000 0300 0000 0100 0000 ................
0000020: 6c75 7374 7265 0000 0000 0000 0000 0000 lustre..........
0000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000060: 6c75 7374 7265 2d4f 5354 3030 3033 0000 lustre-OST0003..
0000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................

Save the file, and convert it back to binary:

OSS# xxd -r /tmp/mountdata.ost01.asc /tmp/mountdata.ost03
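
Before copying the edited file back, it may be worth a quick sanity check that the two fields noted above actually changed (0x18 is ldd_svindex, 0x60 is ldd_svname):

OSS# xxd -s 0x18 -l 4 /tmp/mountdata.ost03     # should show 0300 0000
OSS# xxd -s 0x60 -l 16 /tmp/mountdata.ost03    # should show lustre-OST0003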

Mount the OST0003 filesystem locally and copy this new file in place:

OSS# mount -t ldiskfs {OST0003_dev} /mnt/lustre_ost03
OSS# mv /mnt/lustre_ost03/CONFIGS/mountdata /mnt/lustre_ost03/CONFIGS/mountdata.broken
OSS# cp /tmp/mountdata.ost03 /mnt/lustre_ost03/CONFIGS/mountdata
OSS# umount /mnt/lustre_ost03

The OST should now mount normally and identify itself as OST0003.

Comment by Peter Jones [ 03/May/11 ]

Thanks Sam. It is interesting to hear a PS perspective. I know that you were involved in a number of similar deployments. It will be interesting to hear the assessment from engineering about whether a Lustre issue is indeed involved here. Andreas, what do you think?

Comment by Johann Lombardi (Inactive) [ 04/May/11 ]

> Regarding the CAMs and drive upgrades, we have seen the corrupted OSTs before on the Riverwalks
> (J4400's) when disk firmware was upgraded without both Lustre and the md software raid shutdown
> cleanly first. Is there any chance that this particular OST10 was not cleanly shutdown? We saw
> many cases of software RAID corruption on the J4400's a couple of years ago,

Beyond the HW/firmware issues, there was also a corruption problem due to the mptsas driver, which could redirect I/Os to the wrong drive.

The following comment from Sven explains how this bug was discovered:
https://bugzilla.lustre.org/show_bug.cgi?id=21819#c27

And the problem was fixed in the following bugzilla ticket:
https://bugzilla.lustre.org/show_bug.cgi?id=22632

However, it requires installing an extra package containing the fixed mptsas driver.

Are you sure you are using a mptsas driver which does not suffer from the same issue?

> which was about the time early versions of 1.8 started to be used. There were several software
> RAID corruption bugs that have since been fixed. Also, we have fixed many problems since the
> early 1.8 releases in Lustre, so would encourage an upgrade to 1.8.5 at your earliest convenience.

We did indeed integrate several software RAID fixes in 1.8 (e.g. bugzilla 19990, 22509 & 20533). Although I don't think any of them fixed real software RAID corruption, it would still make sense to upgrade to 1.8.5 to benefit from those bug fixes, which address real deadlocks and oopses.

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

Sam,

When we did the firmware upgrades we had taken down Lustre and rebooted every box to make sure it was all in a clean/unmounted state. We had 2 OSTs not mounting at that point, with this most recent problem popping up after the firmware upgrades. I'm not entirely convinced that the firmware upgrades actually caused this particular problem; we've been doing a lot to try to recover these OSTs.

Andreas,

I will get our guys looking at the mountdata file right now. Hopefully we'll have an indication of whether this action helps in an hour or so.

Thank you all so much for your support!

Comment by Peter Jones [ 04/May/11 ]

Update from site - e2fsck completed on all OSTs and now running a full e2fsck before bringing filesystem back online

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

Thanks Peter, I was actually in the process of updating the bugs with our most up to date status and actions taken (the site was down earlier this morning when I tried).

Again, we appreciate your support with all this!

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

Andreas, your procedure worked flawlessly and our OST is back up and running. We verified that the mountdata file was indeed zero length.

One clarification that I would like to make, though: we copied from ost7, and the line to edit was different from what you had provided (for the entry to edit):

0000010: 0200 0000 0200 0000 0700 0000 0100 0000

In this line you had indicated modifying the 7th entry, but when we copied from ost07 it looked like the 5th entry should be modified instead.

Comment by Andreas Dilger [ 04/May/11 ]

You are correct - my sincere apologies. I was counting 2-byte fields starting in the second row instead of 4-byte fields starting in the first row. I've corrected the instructions in this bug in case they are re-used for similar problems in the future. We've discussed in the past having a tool to repair this file automatically in case of corruption, and the need for that is underscored by this issue.

It looks like you (correctly) modified the 5th column, so all is well and no further action is needed.

You couldn't have modified the 7th column anyway, or the OST would have failed to mount. I did an audit of the code to see what is using these fields (the correct ldd_svindex field and the incorrect ldd_mount_type field). I found that the ldd_svindex field is only used in case the configuration database on the MGS is rewritten (due to --writeconf) and the OST is reconnecting to the MGS to recreate the configuration record. The ldd_mount_type field is used to determine the backing filesystem type (usually "ldiskfs" for type = 0x0001, but it would have been "reiserfs" with type = 0x0003).

If you want to be a bit safer in the future, you could use the "debugfs" command posted earlier to dump this file from all of the OSTs (it can safely be done while the OST is mounted) and save them to a safe location.
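
A sketch of such a backup pass on one OSS; the device list and backup directory below are examples only:

OSS# mkdir -p /root/mountdata-backup
OSS# for dev in /dev/md11 /dev/md30; do debugfs -c -R "dump CONFIGS/mountdata /root/mountdata-backup/mountdata.$(basename $dev)" $dev; done

(-c opens the device read-only in catastrophic mode, which is why this is safe while the OST is mounted.)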

Again, apologies for the mixup.

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

Thanks Andreas

Where we are right now is that all the OSTs can be mounted; however, Lustre itself cannot be successfully mounted.

After having issues initially, we shut down all of our Lustre clients and cleanly rebooted all of our OSSs and MDSs. After bringing all the OSTs up, we had 2 OSTs (11 and 15) stuck in a "recovering" state that never finished (about 15 minutes after bringing up the client). We used lctl to abort recovery and attempted mounting, which appeared to be successful. Running df on /lustre after that causes a segmentation fault.

Additionally, running lfs df throws the following error when it gets to ost11:
error: llapi_obd_statfs failed: Bad address (-14)

Doing an lctl dl on a client shows all the OSTs as "UP", but the last number on each line is different for OST11 and OST15 (it's 5 for all other OSTs, 4 for OST11/15).

The MDSs were showing that all the OSTs were "UP" as well, but there the last numbers show all OSTs as 5.

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

Some additional data points.

After unmounting and resetting, ost11 and 15 complete recovery ok, but we still aren't able to mount lustre on a client.

OSTs 11 and 15 are showing very different %-used values than all of our other OSTs (they should all be even because of the striping we use).

In /var/log/messages on our MDS server (mds2) we get messages stating that ost11 is "INACTIVE" by administrator request.

We also see eviction messages when trying to mount a client for ost11 and 15:
This client was evicted by lustre-OST000b; in progress operations using this service will fail

Comment by Andreas Dilger [ 04/May/11 ]

Did ost11 and ost15 have any filesystem corruption when you ran e2fsck on them?

When you report that the %used is different, is that from "lfs df" or "lfs df -i", or from "df" on the OSS node for the local OST mountpoints?

You can check the recovery state of all OSTs on an OSS via "lctl get_param obdfilter.*.recovery_status". They should all report "status: COMPLETE" (or "INACTIVE" if recovery was never done since the OST was mounted).

As for the OSTs being marked inactive, you can check the status of the connections on the MDS and clients via "lctl get_param osc.*.state". All of the connections should report "current_state: FULL", meaning that the OSCs are connected to the OSTs. Even so, if the OSTs are not started for some reason, that shouldn't prevent the clients from mounting.
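
A quick way to spot the problem imports (a sketch; the exact formatting of these proc files varies a little between 1.8.x versions):

client# lctl get_param osc.*.state | egrep 'osc\.|current_state'
OSS# lctl get_param obdfilter.*.recovery_status | egrep 'obdfilter\.|status:'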

Can you please attach an excerpt from the syslog for a client trying to mount, and also from OST11 and OST15.

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

We're getting those logs for you now; we have to re-type them since they are on a segregated system. We are strapped for time, so the sooner you can respond the better; if we don't have this back up by tomorrow morning we will have to rebuild Lustre to get the system up.

If you are available for a phone call that would be great as well, we are available all night if necessary.

Thanks!

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

ost15 had a fairly large amount of filesystem corruption when running the e2fsck. We used a Lustre restore-from-lost+found command to attempt to restore that data. I don't believe ost11 had corruption.

The recovery status using lctl get_param obdfilter.*.recovery_status on the oss shows everything as COMPLETE, which is good.

Using lctl get_param osc.*.import (not state):

The mds shows state as FULL for all OSTs, which is good

The client shows the state as NEW for OSTs 11 and 15, but FULL for all others. There are also 3 entries for OST11 and OST15 in this listing.

We're working on the log output for attempting to mount

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

There were no logs on the OSS while attempting to mount.

The client messages file has the following (minus date stamps to save typing time):

lustre-clilov-ffff81036703fc00.lov: set parameter stripesize=1048576
Skipped 4 previous similar messages
setting import lustre-OST000b_UUID INACTIVE by administrator request
Skipped 1 previous similar message
LustreError: 7116:0:(lov_obd.c:325:lov_connect_obd()) not connecting OSC lustre-OST000b_UUID; administratively disabled
Skipped 1 previous similar message
Client lustre-client has started
general protection fault: 0000 [4] SMP
last sysfs file: /class/infiniband/mlx4_1/node_desc
CPU 0
Modules linked in: ~~~~ lots of modules

After this we ran a df command and it segfaulted.

Also, we see different sizes for the OSTs using a normal df command on the OSS. Doing an lfs df on the clients shows different percentages for the good OSTs, but it comes back with the "Bad address (-14)" error when it gets to ost11, so I can't tell what that would say. lfs df -i shows 0%, but still fails at ost11.

Comment by Tyler Wiegers (Inactive) [ 04/May/11 ]

Also, there is no data on this system that we absolutely need to recover; it is purely a high-speed data store for temporary data. Do you believe there is any value in continuing this troubleshooting, or would rebuilding the Lustre filesystem at this point be a good idea?

We will be delivering this system into operations as a new technology within the next couple of weeks, so our concern is that we have an opportunity to learn something that may help in future operations. Is this situation something that can happen often and that we need to plan for, or is this a huge fluke that we shouldn't ever expect?

Thanks!

Comment by Andreas Dilger [ 04/May/11 ]

Tyler, I left a VM for you on the number you provided in email.

For the OOPS message, the easiest way to handle that would be to take a photo of the screen and attach it. Otherwise, having the actual error message (e.g. NULL pointer dereference at ...), the process name, and the list of function names from the top of the stack (i.e. those functions most recently called) would help debug that problem.

Normally, if e2fsck is successful for the OST, then Lustre should generally be able to mount the filesystem and run with it, regardless of what corruption there was in the past, but of course I can't know what other kinds of corruption there might be that are causing strange problems.

I definitely would not classify such problems as something that happens often, so while understanding what is going wrong and fixing it is useful to us, you need to make a decision on the value of the data in the filesystem to the users vs. the downtime it is taking to debug this problem. Of course it would be easier and faster to debug with direct access to the logs, but there are many such sites disconnected from the internet that are running Lustre, so this is nothing new.

Depending on the site's tolerance for letting data out, there are a number of ways we've worked with such sites in the past. One way is to print the logs and then scan them on an internet-connected system and attach them to the bug. This maintains an "air gap" for the system while still being relatively high bandwidth, if there is nothing sensitive in the log files themselves.

If you are not already in a production situation, I would strongly recommend upgrading to Lustre 1.8.5. This is running stably on many systems, and given the difficulty in diagnosing some of the problems you have already seen, it would be unfortunate to have to diagnose, under more difficult circumstances, problems that were already fixed. For what it's worth, I know of very few 1.8.x sites that are still running 1.8.0.1.

Comment by Andreas Dilger [ 05/May/11 ]

Just as an update to the bug, Tyler and I spoke at length on the phone this morning. After a restart of the OSTs and clients, the filesystem was able to mount without problems and at least "lfs df" worked for all OSTs while we were on the phone.

However, the corruption on some of the OSTs, and the fact that all files are striped over all OSTs, means that some fraction of all files in the filesystem will have missing data. Since the filesystem is used only as a staging area, it is recommended that the filesystem simply be reformatted to get it back into a known state, instead of spending more time isolating which files were corrupted and then having to restore them into the filesystem anyway. This will also avoid any lurking bugs or data corruption that may not be evident with limited testing.
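
For reference, a minimal sketch of what reformatting one OST target would look like with mkfs.lustre; the MGS NID and index are placeholders, not values taken from this site's configuration, and the target must be unmounted first:

OSS# umount /mnt/lustre_ost10
OSS# mkfs.lustre --reformat --ost --fsname=lustre --mgsnode=<MGS NID> --index=<OST index> /dev/md30
OSS# mount -t lustre /dev/md30 /mnt/lustre_ost10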

We also discussed the current default configuration of striping all files across all 16 OSTs. I recommended to Tyler to use the "lfs setstripe -c {stripes} {new file}" command to create some test files with different numbers of stripes and measure the performance to determine the minimum stripe count that will hit the peak single-client performance, since the clients are largely doing independent IO to different files. At that point, running multiple parallel read/write jobs on files with the smaller stripe count should be compared with running the same workload on all wide-striped files.

Based on our discussion of the workload, it seems likely that the IO performance of a small number of OSTs (2-4) would be as fast as the current peak performance seen by the clients, while reducing contention on the OSTs when multiple clients are doing IO. Reducing the stripe count may potentially increase the aggregate performance seen by multiple clients doing concurrent IO, because there is less chance of contention (seeking) on the OSTs being used by multiple clients.

Reducing the stripe count would also help isolate the clients from any problems or slowdowns caused by individual OSTs. If an OST is unavailable, then any file that is striped over that OST will also be unavailable.
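
Should it ever be necessary to find which files touch a particular OST (for example the lustre-OST000b_UUID seen in the eviction message above), a sketch from a client:

client# lfs find --obd lustre-OST000b_UUID /lustre
client# lfs getstripe /lustre/path/to/file     # path is an example; shows the objects/OSTs backing one file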

If an OST is slow for some reason (e.g. RAID rebuild, marginal disk hardware, etc.) then the IO to that file will be limited by the slowest OST, so the more OSTs a file is striped over, the more likely such a problem is to hit a particular file. That said, if there is a minimum bandwidth requirement for a single file, instead of a desire to maximize the aggregate performance of multiple clients doing independent IO, then there needs to be enough stripes on the file so that N * {slow OST} is still fast enough to meet that minimum bandwidth.

Comment by Johann Lombardi (Inactive) [ 06/May/11 ]

Tyler, BTW, I think it still makes sense to check that you are not using a mptsas driver suffering from the bug in bugzilla ticket 22632.

Comment by Peter Jones [ 09/May/11 ]

Rob Baker of LMCO has confirmed that the critical situation is over and production is stable. Residual issues will be tracked under a new ticket in the future.
