Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description

      Some sites have encountered issues with messages like:

      May 10 13:54:50 oss02 kernel: blk_cloned_rq_check_limits: over max size limit.
      May 10 13:54:50 oss02 kernel: blk_cloned_rq_check_limits: over max size limit.
      May 10 13:54:50 oss02 kernel: device-mapper: multipath: Failing path 8:48.
      May 10 13:54:50 oss02 kernel: blk_cloned_rq_check_limits: over max size limit.
      May 10 13:54:50 oss02 kernel: blk_cloned_rq_check_limits: over max size limit.
      May 10 13:54:50 oss02 kernel: device-mapper: multipath: Failing path 8:240.
      
      

      This issue can cause corruption on the storage side. It can be worked around by reverting an upstream patch or by adding a udev script (but since we change the max_sectors_kb value at mount time, it can still be triggered even with such a script). The better approach is to handle this value in the mount.lustre tool. I'll prepare a patch for master first.
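
      For reference, a udev-based workaround of the kind mentioned above might look roughly like the sketch below; the rule file name, the sd* match, and the 1024 KiB value are illustrative assumptions and would need to match the actual devices and the value applied at mount time.

      # Sketch of a udev workaround (assumptions: rule file name, sd* match,
      # 1024 KiB value); it pins max_sectors_kb on the underlying SCSI path
      # devices so it does not fall below the value used on the multipath device.
      cat > /etc/udev/rules.d/99-lustre-maxsectors.rules <<'EOF'
      SUBSYSTEM=="block", KERNEL=="sd*", ACTION=="add|change", ATTR{queue/max_sectors_kb}="1024"
      EOF
      udevadm control --reload-rules
      udevadm trigger --subsystem-match=block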

          Activity

            [LU-10510] Fix 'over max size limit' issue
            ys Yang Sheng added a comment -

            This issue should be fixed once https://review.whamcloud.com/31951/ has landed.

            chunteraa Chris Hunter (Inactive) added a comment -

            Known issue on certain kernels, related to a mismatch in the max_sectors_kb setting between dm devices (e.g. multipath) and the underlying block devices.
            If you have a Red Hat account you can read their KB articles:

            https://access.redhat.com/solutions/247991

            https://access.redhat.com/solutions/145163
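
            To make the mismatch visible on a given system, a check along these lines can be used (illustrative sketch; dm-0 is a placeholder for the affected device-mapper node):

            # Compare the multipath device's max_sectors_kb with that of each of
            # its path (slave) devices; "dm-0" is a placeholder for the dm node.
            cat /sys/block/dm-0/queue/max_sectors_kb
            for slave in /sys/block/dm-0/slaves/*; do
                echo "${slave##*/}: $(cat "$slave"/queue/max_sectors_kb)"
            done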
             

            adilger Andreas Dilger added a comment - - edited

            Yang Sheng, can you please look at what could be done for creating a udev script to handle this? I guess the tricky part is that we don't want to install a udev script from the RPM package that covers all devices; we only want to affect the Lustre target devices.

            One option would be to generate udev rules like /etc/udev/rules.d/99-lustre-<device>.rules at mkfs.lustre time for the multipath and children devices (see the sketch below). That wouldn't help existing filesystems, and it means users could see a performance regression after upgrading if they don't realize the tunable is no longer applied by mount.lustre.
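
            As an illustration, a generated rule of that kind might contain something along these lines (a sketch only; the ID_SERIAL match and the 16384 value are hypothetical placeholders that mkfs.lustre would fill in from the real target device):

            # Hypothetical contents of /etc/udev/rules.d/99-lustre-<device>.rules;
            # the ID_SERIAL match and the value are placeholders, not a real generated rule.
            SUBSYSTEM=="block", ENV{ID_SERIAL}=="<wwid-of-target>", ACTION=="add|change", ATTR{queue/max_sectors_kb}="16384"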

            A second option would be to change tune_max_sectors_kb() to generate and install udev tuning rules for each mounted device and its children if they do not already exist. This would be the least complex, but possibly a surprising action for mount.lustre to take. However, it is no worse than our current practice of changing the block device tunables at mount time.

            A third option (possibly in addition to the first) would be to change tune_max_sectors_kb() to complain about the lack of udev tuning rules for the device and its children, or if the multipath child device settings do not match the parent, and to indicate how to create the udev rules, but not actually install them. Then the admin can create such a rule and add any tunables desired to quiet mount.lustre, and this should avoid having it do the tuning itself.

            adilger Andreas Dilger added a comment -

            Cliff hit this on the soak test cluster. He will collect grep . /sys/block/{dm*,sd*}/queue/max*sectors_kb for the affected MDT multipath device and the underlying SCSI devices (we don't need all of the others). My suspicion is that one of the underlying devices was reset for some reason, and max_sectors_kb is at the default (maybe 128) while the multipath device is set larger (maybe 1024 or 16384), and this is causing the failures.

            Using mount.lustre -o max_sectors_kb=0 will prevent Lustre from changing these tunables in the first place, which may hurt OST performance, but likely has a lesser effect on MDT performance. However, if the system is already in this state (underlying devices have a lower max_sectors_kb than the parent), this needs to be fixed manually by changing the underlying settings (or a reboot would work).
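
            For the manual fix, something along these lines should bring the path devices back in line with the parent (a sketch only; dm-3 is a placeholder, and writes above a device's max_hw_sectors_kb are rejected by the kernel):

            # Sketch of the manual fix: raise each path (slave) device's
            # max_sectors_kb to the multipath parent's value.  "dm-3" is a
            # placeholder; values above a device's max_hw_sectors_kb are rejected.
            parent=/sys/block/dm-3
            want=$(cat "$parent"/queue/max_sectors_kb)
            for slave in "$parent"/slaves/*; do
                echo "$want" > "$slave"/queue/max_sectors_kb
            done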

            People

              Assignee: Yang Sheng
              Reporter: Yang Sheng
              Votes: 0
              Watchers: 12
