
LDisk-fs warning (device md30): ldisk_multi_mount_protect: fsck is running on filesystem

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 1.8.6
    • Fix Version: Lustre 1.8.6
    • Components: None
    • Environment: RHEL 5.5 and Lustre 1.8.0.1 on J4400's
    • Severity: 3
    • Bugzilla ID: 10266

    Description

      OST 10 /dev/md30 resident on OSS3
      From /var/log/messages
      LDisk-fs warning (device md30): ldisk_multi_mount_protect: fsck is running on filesystem
      LDisk-fs warning (device md30): ldisk_multi_mount_protect: MMP failure info: <time in unix seconds>, last update node: OSS3, last update device /dev/md30

      This is a scenario that keeps sending the customer in circles. They know for certain that an fsck is not running. Since that is the case, they can try to turn the MMP feature off via the following commands:

      To manually disable MMP, run:
      tune2fs -O ^mmp <device>
      To manually enable MMP, run:
      tune2fs -O mmp <device>

      These commands fail, reporting that a valid superblock does not exist, but they can see their valid superblock (with the mmp feature set) by running the following command:

      tune2fs -l /dev/md30
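Since `tune2fs -l` does print the superblock contents, one way to sanity-check the state before attempting to clear MMP is to look for the `mmp` flag in the feature list programmatically. A minimal sketch, assuming output of the usual `tune2fs -l` form (the sample below is abbreviated and hypothetical, not taken from this system):

```python
def has_mmp(tune2fs_output: str) -> bool:
    """Return True if the 'mmp' feature flag appears in the
    'Filesystem features:' line of `tune2fs -l` output."""
    for line in tune2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            features = line.split(":", 1)[1].split()
            return "mmp" in features
    return False

# Abbreviated, hypothetical sample of `tune2fs -l /dev/md30` output:
sample = """\
Filesystem volume name:   lustre-OST000a
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent mmp sparse_super large_file uninit_bg
Filesystem state:         clean
"""

print(has_mmp(sample))  # mmp is present in this sample
```

If this reports True while `tune2fs -O ^mmp` still complains about the superblock, that suggests the two commands are not actually reading the same superblock (wrong device path, or a backup superblock being consulted).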

      It is their understanding that a fix for this issue was released with a later version of Lustre, but aside from that, is there a way to do this?

      Customer contact is tyler.s.wiegers@lmco.com

      Attachments

        Activity

          [LU-270] LDisk-fs warning (device md30): ldisk_multi_mount_protect: fsck is running on filesystem
          pjones Peter Jones added a comment -

          Rob Baker of LMCO has confirmed that the critical situation is over and production is stable. Residual issues will be tracked under a new ticket in the future.


          Tyler, BTW, I think it still makes sense to check that you are not using an mptsas driver affected by Bugzilla ticket 22632.

          johann Johann Lombardi (Inactive) added a comment -

          Just as an update to the bug, Tyler and I spoke at length on the phone this morning. After a restart of the OSTs and clients, the filesystem was able to mount without problems and at least "lfs df" worked for all OSTs while we were on the phone.

          However, the corruption on some of the OSTs, and the fact that all files are striped over all OSTs, mean that some fraction of all files in the filesystem will have missing data. Since the filesystem is used only as a staging area, it is recommended that the filesystem simply be reformatted to get it back into a known state instead of spending more time isolating which files were corrupted and then having to restore them into the filesystem anyway. This will also avoid any potential bugs or data corruption that may not be evident with limited testing.

          We also discussed the current default configuration of striping all files across all 16 OSTs. I recommended that Tyler use the "lfs setstripe -c {stripes} {new file}" command to create some test files with different numbers of stripes and measure the performance to determine the minimum stripe count that will hit the peak single-client performance, since the clients are largely doing independent IO to different files. At that point, running multiple parallel read/write jobs on files with the smaller stripe count should be compared with running the same workload on all wide-striped files.

          Based on our discussion of the workload, it seems likely that the IO performance of a small number of OSTs (2-4) would be as fast as the current peak performance seen by the clients, while reducing contention on the OSTs when multiple clients are doing IO. Reducing the stripe count may potentially increase the aggregate performance seen by multiple clients doing concurrent IO, because there is less chance of contention (seeking) on the OSTs being used by multiple clients.

          Reducing the stripe count would also help isolate the clients from any problems or slowdowns caused by individual OSTs. If an OST is unavailable, then any file that is striped over that OST will also be unavailable.

          If an OST is slow for some reason (e.g. RAID rebuild, marginal disk hardware, etc.) then the IO to that file will be limited by the slowest OST, so the more OSTs a file is striped over, the more likely such a problem is to hit a particular file. That said, if there is a minimum bandwidth requirement for a single file, instead of a desire to maximize the aggregate performance of multiple clients doing independent IO, then there need to be enough stripes on the file so that N * {slow OST} is still fast enough to meet that minimum bandwidth.
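The sizing rule above can be sketched numerically: pick the smallest stripe count N such that N times the slowest OST's bandwidth still meets the single-file requirement. The throughput figures below are made-up placeholders, not measurements from this system:

```python
import math

def min_stripe_count(required_mb_s: float, slow_ost_mb_s: float) -> int:
    """Smallest stripe count N such that N * (slowest OST bandwidth)
    still meets the required single-file bandwidth."""
    return max(1, math.ceil(required_mb_s / slow_ost_mb_s))

# Hypothetical numbers: a degraded OST delivering 100 MB/s during a
# RAID rebuild, and a 350 MB/s minimum single-file requirement.
print(min_stripe_count(350.0, 100.0))  # -> 4
```

The same arithmetic run against measured per-OST throughput would show how far below the 16-wide default the stripe count can drop while still meeting the requirement.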

          adilger Andreas Dilger added a comment -

          Tyler, I left a VM for you on the number you provided in email.

          For the OOPS message, the easiest way to handle that would be to take a photo of the screen and attach it. Otherwise, having the actual error message (e.g. NULL pointer dereference at ...), the process name, and the list of function names from the top of the stack (i.e. those functions most recently called) would help debug that problem.

          Normally, if e2fsck is successful for the OST, then Lustre should generally be able to mount the filesystem and run with it, regardless of what corruptions there were in the past, but of course I can't know what other kinds of corruption there might be that are causing strange problems.

          I definitely would not classify such problems as something that happens often, so while understanding what is going wrong and fixing it is useful to us, you need to make a decision on the value of the data in the filesystem to the users vs. the downtime it is taking to debug this problem. Of course it would be easier and faster to debug with direct access to the logs, but there are many such sites disconnected from the internet that are running Lustre, so this is nothing new.

          Depending on the site's tolerance for letting data out, there are a number of ways we've worked with such sites in the past. One way is to print the logs and then scan them on an internet-connected system and attach them to the bug. This maintains an "air gap" for the system while still being relatively high bandwidth, if there is nothing sensitive in the log files themselves.

          If you are not already in a production situation, I would strongly recommend upgrading to Lustre 1.8.5. This is running stably on many systems, and given the difficulty in diagnosing some of the problems you have already seen, it would be unfortunate to have to diagnose problems that were already fixed, under even more difficult circumstances. Conversely, I know of very few 1.8.x sites that are still running 1.8.0.1 anymore.

          adilger Andreas Dilger added a comment -

          Also, there is no data on this system that we absolutely need to recover; it is purely a high-speed data store for temporary data. Do you believe there is any value to continuing this troubleshooting, or is rebuilding the Lustre filesystems at this point a good idea?

          We will be delivering this system into operations as a new technology within the next couple of weeks, so our concern is that we have an opportunity to learn something that may help in future operations. Is this situation something that can happen often and that we need to plan for, or is this a huge fluke that we shouldn't ever expect?

          Thanks!

          tyler.s.wiegers@lmco.com Tyler Wiegers (Inactive) added a comment -

          There were no logs on the OSS while attempting to mount.

          The client messages file has the following (minus date stamps to save typing time):

          lustre-clilov-ffff81036703fc00.lov: set parameter stripesize=1048576
          Skipped 4 previous similar messages
          setting import lustre-OST000b_UUID INACTIVE by administrator request
          Skipped 1 previous similar message
          LustreError: 7116:0:(lov_obd.c:325:lov_connect_obd()) not connecting OSC lustre-OST000b_UUID; administratively disabled
          Skipped 1 previous similar message
          Client lustre-client has started
          general protection fault: 0000 [4] SMP
          last sysfs file: /class/infiniband/mlx4_1/node_desc
          CPU 0
          Modules linked in: ~~~~ lots of modules

          After this we ran a df command and it segfaulted.

          Also, we see different sizes for the OSTs using a normal df command on the OSS. Doing an lfs df on the clients shows different percentages used for the good OSTs, but it comes back with a Bad address (-14) error when it gets to ost11, so I can't tell what that would say. lfs df -i shows 0%, but still fails at ost11.

          tyler.s.wiegers@lmco.com Tyler Wiegers (Inactive) added a comment -
          tyler.s.wiegers@lmco.com Tyler Wiegers (Inactive) added a comment - - edited

          ost15 had a fairly large amount of filesystem corruption when running the e2fsck. We used a Lustre restore-from-lost+found command to attempt to restore that data. I don't believe ost11 had corruption.

          The recovery status using lctl get_param obdfilter.*.recovery_status on the oss shows everything as COMPLETE, which is good.

          Using lctl get_param osc.*.import (not state):

          The mds shows state as FULL for all OSTs, which is good

          The client shows state as NEW for OST 11 and 15, but FULL for all others. There are also 3 entries for OST 11 and 15 in this listing.

          We're working on the log output for attempting to mount.


          We're getting those logs for you now; we have to re-type them since they are on a segregated system. We are strapped for time, so the sooner you can respond the better: if we don't have this back up tomorrow morning, we will have to rebuild Lustre to get the system up.

          If you are available for a phone call that would be great as well, we are available all night if necessary.

          Thanks!

          tyler.s.wiegers@lmco.com Tyler Wiegers (Inactive) added a comment -

          Did ost11 and ost15 have any filesystem corruption when you ran e2fsck on them?

          When you report that the %used is different, is that from "lfs df" or "lfs df -i", or from "df" on the OSS node for the local OST mountpoints?

          You can check the recovery state of all OSTs on an OSS via "lctl get_param obdfilter.*.recovery_status". They should all report "status: COMPLETE" (or "INACTIVE" if recovery was never done since the OST was mounted).

          As for the OSTs being marked inactive, you can check the status of the connections on the MDS and clients via "lctl get_param osc.*.state". All of the connections should report "current_state: FULL" meaning that the OSCs are connected to the OSTs. Even so, if the OSTs are not started for some reason, it shouldn't prevent the clients from mounting.
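The checks above boil down to scanning the `lctl get_param` output for any connection that is not in the FULL state. A minimal sketch of that scan (the output format and device names in the sample are assumed, not captured from this system):

```python
def non_full_connections(lctl_output: str) -> list:
    """Given `lctl get_param osc.*.state`-style output, return the
    parameter names whose current_state is not FULL."""
    bad, current = [], None
    for line in lctl_output.splitlines():
        line = line.strip().rstrip("=")  # some lctl versions append '='
        if line.startswith("osc.") and line.endswith(".state"):
            current = line
        elif line.startswith("current_state:") and current:
            state = line.split(":", 1)[1].strip()
            if state != "FULL":
                bad.append(current)
            current = None
    return bad

# Hypothetical sample output (format assumed, device names invented):
sample = """\
osc.lustre-OST000a-osc.state
current_state: FULL
osc.lustre-OST000b-osc.state
current_state: NEW
osc.lustre-OST000f-osc.state
current_state: NEW
"""

print(non_full_connections(sample))
```

Connections reported as NEW (as seen later for OST 11 and 15 on the clients) mean the OSC has never completed a connection to that OST.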

          Can you please attach an excerpt from the syslog for a client trying to mount, and also from OST11 and OST15?

          adilger Andreas Dilger added a comment -

          Some additional data points.

          After unmounting and resetting, ost11 and 15 complete recovery OK, but we still aren't able to mount Lustre on a client.

          OST 11 and 15 are showing very different % used values than all of our other OSTs (they should all be even because of the stripes we use).

          In messages on our MDT server (mds2) we get messages stating that ost11 is "INACTIVE" by administrator request.

          We also see eviction messages when trying to mount a client for ost11 and 15:
          This client was evicted by lustre-OST000b; in progress operations using this service will fail

          tyler.s.wiegers@lmco.com Tyler Wiegers (Inactive) added a comment -

          People

            adilger Andreas Dilger
            dferber Dan Ferber (Inactive)
            Votes: 0
            Watchers: 6
