[LU-270] LDISKFS-fs warning (device md30): ldiskfs_multi_mount_protect: fsck is running on filesystem Created: 03/May/11 Updated: 26/Oct/11 Resolved: 09/May/11 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | Lustre 1.8.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Dan Ferber (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | RHEL 5.5 and Lustre 1.8.0.1 on J4400's |
| Severity: | 3 |
| Rank (Obsolete): | 10266 |
| Description |
|
OST 10, /dev/md30, resident on OSS3.

This is a scenario that keeps sending the customer in circles. They know for certain that an fsck is not running. Since they know that, they have tried to turn the MMP bit off via the following commands:

To manually disable MMP, run:

These commands fail saying that a valid superblock does not exist, but they can see their valid superblock (with MMP set) by running the following command:

tune2fs -l /dev/md30

It is their understanding that a fix for this issue was released with a later version of Lustre, but aside from that, is there a way to do this? Customer contact is tyler.s.wiegers@lmco.com |
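The specific manual-MMP-disable commands referenced in the description were lost from this export. As a hedged sketch only, the usual procedure (assuming an e2fsprogs release that understands the MMP feature, and the /dev/md30 device named above, unmounted; the oss3# prompt is illustrative) looks roughly like this:

oss3# tune2fs -O ^mmp /dev/md30              # clear the MMP feature flag
oss3# tune2fs -l /dev/md30 | grep -i mmp     # confirm the mmp feature is no longer listed
oss3# tune2fs -O mmp /dev/md30               # re-enable MMP once the OST mounts cleanly again

On this system's 1.40.11-sun1 e2fsprogs, the description reports that the equivalent commands fail with a "valid superblock does not exist" error; the comments below discuss the e2fsprogs upgrade that resolves this.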
| Comments |
| Comment by Andreas Dilger [ 03/May/11 ] |
|
What version of e2fsprogs is being used here? |
| Comment by Tyler Wiegers (Inactive) [ 03/May/11 ] |
|
The specific error output that we see while mounting this OST is the following:
From messages:
LDISKFS-fs warning (device md30): ldiskfs_multi_mount_protect: fsck is running on the filesystem

We believe that this OST may have been put in this state because we attempted to run e2fsck (the messages output recommended running e2fsck against the OST). The fsck crashed out when we tried running it, which we think caused the OST to enter this state. We tried turning off the MMP feature in order to mount the OST; however, when attempting to turn off MMP we got the same "fsck is running" error. We understand that there is a command in a newer version of e2fsprogs which would clear the MMP flag and might help, but we don't have the luxury of blindly updating our systems hoping that it may help. We appreciate your support! |
| Comment by Dan Ferber (Inactive) [ 03/May/11 ] |
|
The customer currently is not undertaking any rebuild activity other than last night's CAM and firmware upgrades for the HW, hoping to first have a recommendation on the way forward from Whamcloud from a Lustre perspective. It has definitely been discussed internally whether they should run 1.8.0.1 with any identified/recommended patches, upgrade completely to 1.8.5, or something else. What are Whamcloud's recommendations there? From the customer's perspective, the HW has come back clean (CAM and firmware upgrades succeeded), so now they need some help in looking at their configuration, implementation, or Lustre itself. |
| Comment by Johann Lombardi (Inactive) [ 03/May/11 ] |
|
> We understand that there is a command in a newer version of the e2fsprogs

Right, that's "tune2fs -f -E clear-mmp $dev". However, it seems that 1.40.11-sun1 does not support this option.

> but we don't have the luxury of blindly updating our systems hoping that it may help.

I really think you should upgrade e2fsprogs, since many MMP bugs have been fixed since then. |
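For reference, a minimal sketch of the recovery sequence once an e2fsprogs release supporting the option is installed (the /dev/md30 device and the oss3# prompt are taken as illustrative from earlier in the ticket, and the /mnt/lustre_ost10 mount point is an assumption; the OST must be unmounted):

oss3# rpm -q e2fsprogs                      # confirm the upgraded package is installed
oss3# tune2fs -f -E clear-mmp /dev/md30     # force-clear the stale MMP block
oss3# e2fsck -f /dev/md30                   # re-check the filesystem before mounting
oss3# mount -t lustre /dev/md30 /mnt/lustre_ost10

This matches the sequence the site reports running successfully in a later comment.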
| Comment by Tyler Wiegers (Inactive) [ 03/May/11 ] |
|
Is there any way to manually recover an OST when the MMP flag is set other than using the tune2fs command? |
| Comment by Cliff White (Inactive) [ 03/May/11 ] |
|
The tune2fs command is the only way to reset that flag. E2fsprogs is very safe to upgrade. |
| Comment by Tyler Wiegers (Inactive) [ 03/May/11 ] |
|
We're holding an emergency review board to approve installation of this package; we will have it installed tonight.

In the meantime, we started having a third issue mounting an OST this morning (after updating disk firmware and CAM software). The error logs are below; if this needs a new bug report then that's fine, otherwise any comments would be appreciated.

oss4# mount -t lustre /dev/md11 /mnt/lustre_ost03

From messages:
kjournald starting. Commit interval 5 seconds

We've tried running e2fsck with no success; e2fsck doesn't report any errors. |
| Comment by Tyler Wiegers (Inactive) [ 03/May/11 ] |
|
We upgraded the e2fsprogs package, ran tune2fs with the clear-mmp option, ran e2fsck on that device, and were able to mount the OST. Good news there. Regarding the previous comment, we are running an e2fsck and checking out what new tune2fs options there are. I'll post back when we have some new information, but the indication at the moment is that ost03 still won't mount. |
| Comment by Cliff White (Inactive) [ 03/May/11 ] |
|
It would be best to open up a new bug. It is not good that you are having all these errors after your firmware upgrade. |
| Comment by Sam Bigger (Inactive) [ 03/May/11 ] |
|
Regarding the CAM and drive upgrades, we have seen corrupted OSTs before on the Riverwalks (J4400's) when disk firmware was upgraded without both Lustre and the md software RAID being shut down cleanly first. Is there any chance that this particular OST10 was not cleanly shut down?

We saw many cases of software RAID corruption on the J4400's a couple of years ago, which was about the time early versions of 1.8 started to be used. There were several software RAID corruption bugs that have since been fixed. Also, we have fixed many problems in Lustre since the early 1.8 releases, so we would encourage an upgrade to 1.8.5 at your earliest convenience.

If both Lustre and the MD device were shut down cleanly, then there should have been no problems like this. So, in that case, this would likely be a new bug that potentially still exists in the latest releases of Lustre. |
| Comment by Andreas Dilger [ 03/May/11 ] |
|
> LustreError: 25721:0:(obdmount.c:272:ldd_parse()) disk data size does not match: see 0 expect 12288

This indicates that the CONFIGS/mountdata file is also corrupted (zero-length file). It is possible to reconstruct this file by copying it from another OST and (unfortunately) binary editing the file. There are two fields that are unique to each OST that need to be modified.

First, on an OSS node make a copy of this file from a working OST, say OST0001:

OSS# debugfs -c -R "dump CONFIGS/mountdata /tmp/mountdata.ost01" {OST0001_dev}

Now the mountdata.ost01 file needs to be edited to reflect that it is being used for OST0003. If you have a favorite binary editor, that could be used. I use "xxd" from the "vim-common" package to convert it into ASCII to be edited, and then convert it back to binary. The important parts of the file are all at the beginning; the rest of the file is common to all OSTs:

OSS# xxd /tmp/mountdata.ost01 /tmp/mountdata.ost01.asc
0000000: 0100 d01d 0000 0000 0000 0000 0000 0000 ................

This is the "xxd" output showing a struct lustre_disk_data. The two fields that need to be edited are 0x0018 (ldd_svindex) and 0x0060 (ldd_svname). Edit the "0100" in the second row, fifth column to be "0300":

0000000: 0100 d01d 0000 0000 0000 0000 0000 0000 ................

Save the file, and convert it back to binary:

OSS# xxd -r /tmp/mountdata.ost01.asc /tmp/mountdata.ost03

Mount the OST0003 filesystem locally and copy this new file in place:

OSS# mount -t ldiskfs {OST0003_dev} /mnt/lustre_ost03

The OST should now mount normally and identify itself as OST0003. |
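The copy-back step at the end of the procedure above appears to have been truncated in this export. A sketch of how the edited file would typically be put in place, reusing the names from the procedure (the cp and umount steps are assumptions, not part of the original text):

OSS# mount -t ldiskfs {OST0003_dev} /mnt/lustre_ost03
OSS# cp /tmp/mountdata.ost03 /mnt/lustre_ost03/CONFIGS/mountdata    # replace the zero-length file
OSS# umount /mnt/lustre_ost03
OSS# mount -t lustre {OST0003_dev} /mnt/lustre_ost03                # remount as a Lustre OST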
| Comment by Peter Jones [ 03/May/11 ] |
|
Thanks Sam. It is interesting to hear a PS perspective. I know that you were involved in a number of similar deployments. It will be interesting to hear the assessment from engineering about whether a Lustre issue is indeed involved here. Andreas, what do you think? |
| Comment by Johann Lombardi (Inactive) [ 04/May/11 ] |
|
> Regarding the CAMs and drive upgrades, we have seen the corrupted OSTs before on the Riverwalks

Beyond the HW/firmware issues, there was also a corruption problem due to the mptsas driver. The following comment from Sven explains how this bug was discovered:

And the problem was fixed in the following bugzilla ticket:

However, it requires installing an extra package including the mptsas driver. Are you sure the mptsas driver you are using does not suffer from the same issue?

> which was about the time early versions of 1.8 started to be used. There were several software

We indeed integrated several software RAID fixes in 1.8 (e.g. bugzilla 19990, 22509 & 20533). |
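A simple way to check which mptsas driver build is installed and loaded on the OSS nodes (the module name comes from the comment above; the known-good and known-bad versions are not stated in this ticket and would need to be compared against the bugzilla fix):

oss3# modinfo mptsas | egrep '^(filename|version|srcversion)'    # installed module path and version
oss3# grep mptsas /proc/modules                                  # confirm the module is currently loaded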
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
Sam, when we did the firmware upgrades we had taken down Lustre and rebooted every box to make sure it was all in a clean/unmounted state. We had 2 OSTs not mounting at that point, with this most recent problem popping up after the firmware upgrades. I'm not entirely convinced that the firmware upgrades actually caused this particular problem; we've been doing a lot to try to recover these OSTs.

Andreas, I will get our guys looking at the mountdata file right now. Hopefully we'll have an indication of whether this action helps in an hour or so. Thank you all so much for your support! |
| Comment by Peter Jones [ 04/May/11 ] |
|
Update from site - e2fsck completed on all OSTs and now running a full e2fsck before bringing filesystem back online |
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
Thanks Peter. I was actually in the process of updating the bugs with our most up-to-date status and actions taken (the site was down earlier this morning when I tried). Again, we appreciate your support with all this! |
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
Andreas, your procedure worked flawlessly and our OST is back up and running. We verified that the mountdata file was indeed zero length. One clarification I would like to make, though: we copied from ost7, and the line to edit was different than what you had provided (for the entry to edit):

0000010: 0200 0000 0200 0000 0700 0000 0100 0000

In this line you had indicated that the 7th entry should be modified; when we copied from ost07, it looked like the 5th entry should be modified instead. |
| Comment by Andreas Dilger [ 04/May/11 ] |
|
You are correct - my sincere apologies. I was counting 2-byte fields starting in the second row instead of 4-byte fields starting in the first row. I've corrected the instructions in this bug in case they are re-used for similar problems in the future. We've discussed in the past having a tool to repair this file automatically in case of corruption, and the need for one is underscored by this issue.

It looks like you (correctly) modified the 5th column, so all is well and no further action is needed. You couldn't have modified the 7th column, or the OST would have failed to mount. I did an audit of the code to see what is using these fields (the correct ldd_svindex field and the incorrect ldd_mount_type field). I found that the ldd_svindex field is only used in case the configuration database on the MGS is rewritten (due to --writeconf) and the OST is reconnecting to the MGS to recreate the configuration record. The ldd_mount_type field is used to determine the backing filesystem type (usually "ldiskfs" for type = 0x0001, but it would have been "reiserfs" with type = 0x0003).

If you want to be a bit safer in the future, you could use the "debugfs" command posted earlier to dump this file from all of the OSTs (it can safely be done while the OST is mounted) and save the copies to a safe location. Again, apologies for the mixup. |
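A hedged sketch of the backup suggestion above, dumping CONFIGS/mountdata from every local OST device into a safe location (the device list and backup directory are placeholders):

OSS# mkdir -p /root/mountdata-backup
OSS# for dev in /dev/md11 /dev/md30; do    # substitute the OST devices local to this OSS
         debugfs -c -R "dump CONFIGS/mountdata /root/mountdata-backup/mountdata.$(basename $dev)" $dev
     done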
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
Thanks Andreas.

Where we are at right now is that all the OSTs can be mounted, however Lustre cannot be successfully mounted on the clients. After having issues initially, we shut down all of our Lustre clients and cleanly rebooted all of our OSSs and MDSs. After bringing all the OSTs up, we had 2 OSTs (11 and 15) stuck in a "recovering" state that never finished (about 15 minutes after bringing up the client). We used lctl to abort recovery and attempted mounting, which appeared to be successful. Running a df on /lustre after that segmentation faults. Additionally, running lfs df throws the following error when it gets to ost11:

Doing an lctl dl on a client shows all the OSTs as "UP", but the last number on each line is different for OST11 and OST15 (it's 5 for all OSTs, 4 for OST11/15). The MDSs were showing all the OSTs as "UP" as well, but the last numbers show all OSTs as 5. |
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
Some additional data points. After unmounting and resetting, ost11 and 15 complete recovery OK, but we still aren't able to mount Lustre on a client. OST 11 and 15 are showing very different % used values than all of our other OSTs (they should all be even because of the stripes we use). In messages on our MDT server (mds2) we see messages stating that ost11 is "INACTIVE" by administrator request. We also see eviction messages when trying to mount a client for ost11 and 15: |
| Comment by Andreas Dilger [ 04/May/11 ] |
|
Did ost11 and ost15 have any filesystem corruption when you ran e2fsck on them? When you report that the %used is different, is that from "lfs df" or "lfs df -i", or from "df" on the OSS node for the local OST mountpoints?

You can check the recovery state of all OSTs on an OSS via "lctl get_param obdfilter.*.recovery_status". They should all report "status: COMPLETE" (or "INACTIVE" if recovery was never done since the OST was mounted). As for the OSTs being marked inactive, you can check the status of the connections on the MDS and clients via "lctl get_param osc.*.state". All of the connections should report "current_state: FULL", meaning that the OSCs are connected to the OSTs. Even so, if the OSTs are not started for some reason, it shouldn't prevent the clients from mounting.

Can you please attach an excerpt from the syslog for a client trying to mount, and also from OST11 and OST15? |
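A compact way to run the checks suggested above, using the parameter names given in the comment (node prompts are illustrative):

oss3# lctl get_param obdfilter.*.recovery_status | grep status:    # expect "status: COMPLETE"
mds1# lctl get_param osc.*.state | grep current_state              # expect "current_state: FULL"
client# lctl get_param osc.*.state | grep current_state            # same check from a client

Note that a later comment reports using "lctl get_param osc.*.import" rather than "osc.*.state" on this 1.8.0.1 system; the import output carries the same connection-state information.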
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
We're getting those logs for you now; we have to re-type them since they are on a segregated system. We are strapped for time, so the sooner you can respond the better - if we don't have this back up tomorrow morning we get to rebuild Lustre to get the system up. If you are available for a phone call that would be great as well; we are available all night if necessary. Thanks! |
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
ost15 had a fairly large amount of filesystem corruption when running the e2fsck. We used a Lustre restore-from-lost+found command to attempt to restore that data. ost11 did not have corruption, I don't believe.

The recovery status using "lctl get_param obdfilter.*.recovery_status" on the OSS shows everything as COMPLETE, which is good.

Using "lctl get_param osc.*.import" (not state): the MDS shows the state as FULL for all OSTs, which is good. The client shows the state as NEW for OST 11 and 15, but FULL for all others. There are also 3 entries for OST11 and 15 in this listing.

We're working on the log output for attempting to mount. |
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
There were no logs on the OSS while attempting to mount. The client messages file has the following (minus date stamps, to save typing time):

lustre-clilov-ffff81036703fc00.lov: set parameter stripesize=1048576

After this we did a df command and it segmentation faults. Also, we see different sizes for the OSTs using a normal df command on the OSS. Doing an lfs df on the clients shows different %'s for the good OSTs, but it comes back with the "Bad address (-14)" error when it gets to ost11, so I can't tell what that would say. lfs df -i shows 0%, but still fails at ost11. |
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
Also, there is no data on this system that we absolutely need to recover; it is purely a high-speed data store for temporary data. Do you believe there is any value in continuing this troubleshooting, or would rebuilding the Lustre filesystems at this point be a good idea? We will be delivering this system into operations as a new technology within the next couple of weeks, so our concern is that we have an opportunity to learn something that may help in future operations. Is this situation something that can happen often and that we need to plan for, or is this a huge fluke that we shouldn't ever expect? Thanks! |
| Comment by Andreas Dilger [ 04/May/11 ] |
|
Tyler, I left a VM for you on the number you provided in email.

For the OOPS message, the easiest way to handle that would be to take a photo of the screen and attach it. Otherwise, having the actual error message (e.g. NULL pointer dereference at ...), the process name, and the list of function names from the top of the stack (i.e. those functions most recently called) would help debug that problem.

Normally, if e2fsck is successful for the OST, then Lustre should generally be able to mount the filesystem and run with it, regardless of what corruptions there were in the past, but of course I can't know what other kinds of corruptions there might be that are causing strange problems. I definitely would not classify such problems as something that happens often, so while understanding what is going wrong and fixing it is useful to us, you need to make a decision on the value of the data in the filesystem to the users vs. the downtime it is taking to debug this problem.

Of course it would be easier and faster to debug with direct access to the logs, but there are many such sites disconnected from the internet that are running Lustre, so this is nothing new. Depending on the site's tolerance for letting data out, there are a number of ways we've worked with such sites in the past. One way is to print the logs and then scan them on an internet-connected system and attach them to the bug. This maintains an "air gap" for the system while still being relatively high bandwidth, if there is nothing sensitive in the log files themselves.

If you are not already in a production situation, I would strongly recommend upgrading to Lustre 1.8.5. This is running stably on many systems, and given the difficulty in diagnosing some of the problems you have already seen, it would be unfortunate to have to diagnose problems that were already fixed, under more difficult circumstances. Conversely, I know of very few 1.8.x sites that are still running 1.8.0.1 anymore. |
| Comment by Andreas Dilger [ 05/May/11 ] |
|
Just as an update to the bug, Tyler and I spoke at length on the phone this morning. After a restart of the OSTs and clients, the filesystem was able to mount without problems, and at least "lfs df" worked for all OSTs while we were on the phone. However, the corruption on some of the OSTs, and the fact that all files are striped over all OSTs, means that some fraction of all files in the filesystem will have missing data. Since the filesystem is used only as a staging area, it is recommended that the filesystem simply be reformatted to get it back into a known state, instead of spending more time isolating which files were corrupted and then having to restore them into the filesystem anyway. This will also avoid any potential bugs or data corruption that may not be evident with limited testing.

We also discussed the current default configuration of striping all files across all 16 OSTs. I recommended to Tyler to use the "lfs setstripe -c {stripes} {new_file}" command to create some test files with different numbers of stripes and measure the performance to determine the minimum stripe count that will hit the peak single-client performance, since the clients are largely doing independent IO to different files. At that point, running multiple parallel read/write jobs on files with the smaller stripe count should be compared with running the same workload on all wide-striped files. Based on our discussion of the workload, it seems likely that the IO performance of a small number of OSTs (2-4) would be as fast as the current peak performance seen by the clients, while reducing contention on the OSTs when multiple clients are doing IO.

Reducing the stripe count may potentially increase the aggregate performance seen by multiple clients doing concurrent IO, because there is less chance of contention (seeking) on the OSTs being used by multiple clients. Reducing the stripe count would also help isolate the clients from any problems or slowdowns caused by individual OSTs. If an OST is unavailable, then any file that is striped over that OST will also be unavailable. If an OST is slow for some reason (e.g. RAID rebuild, marginal disk hardware, etc.) then the IO to that file will be limited by the slowest OST, so the more OSTs a file is striped over, the more likely such a problem is to hit a particular file. That said, if there is a minimum bandwidth requirement for a single file, instead of a desire to maximize the aggregate performance of multiple clients doing independent IO, then there needs to be enough stripes on the file so that N * {slow OST} is still fast enough to meet that minimum bandwidth. |
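A sketch of the stripe-count comparison described above (the file names, /lustre mount point, and dd transfer size are illustrative assumptions; truncating an existing file does not change its stripe layout, so dd can overwrite the test files directly):

client# lfs setstripe -c 2 /lustre/stripe_test_c2      # empty file striped over 2 OSTs
client# lfs setstripe -c 4 /lustre/stripe_test_c4      # empty file striped over 4 OSTs
client# dd if=/dev/zero of=/lustre/stripe_test_c2 bs=1M count=4096 oflag=direct
client# dd if=/dev/zero of=/lustre/stripe_test_c4 bs=1M count=4096 oflag=direct
client# lfs getstripe /lustre/stripe_test_c2           # confirm the layout actually used

Repeating the dd runs from several clients concurrently, and comparing against files striped across all 16 OSTs, gives the aggregate-versus-single-file comparison discussed above.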
| Comment by Johann Lombardi (Inactive) [ 06/May/11 ] |
|
Tyler, BTW, I think it still makes sense to check that you are not using an mptsas driver suffering from bugzilla ticket 22632. |
| Comment by Peter Jones [ 09/May/11 ] |
|
Rob Baker of LMCO has confirmed that the critical situation is over and production is stable. Residual issues will be tracked under a new ticket in the future. |