[LU-12506] Client unable to mount filesystem with very large number of MDTs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.7
    • Affects Version/s: Lustre 2.10.8, Lustre 2.12.3
    • Labels: None
    • Severity: 3

    Description

      Hello,
      There was a message on the lustre-discuss list about this issue back in May (http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-May/016475.html), and I've managed to reproduce the error. I couldn't find an open ticket for it, however, so I wanted to create one.

      My environment is the following:

      Servers and clients are running upstream Lustre 2.12.2 and the same kernel version:

      [root@dac-e-1 ~]# lfs --version
      lfs 2.12.2
      # Server kernel version
      3.10.0-957.10.1.el7_lustre.x86_64
      # Client kernel version (unpatched)
      3.10.0-957.10.1.el7.x86_64
      

      There are 24 servers, each containing 12x NVMe flash devices. For this test I am configuring the block devices on each server identically: 3 devices on each server are each partitioned into a 200G MDT, with the remaining space used as an OST.

      Altogether this makes 72 MDTs and 288 OSTs in the filesystem.
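
The totals above follow from the per-server layout. A minimal sketch of the arithmetic, assuming (as the totals imply, though the report does not say so explicitly) that every NVMe device carries one OST and that only the split devices also carry an MDT. The struct and function names are illustrative and not part of Lustre.

```c
#include <assert.h>

/* Hypothetical description of one server's device layout. */
struct server_layout {
        int devices;        /* NVMe devices per server */
        int split_devices;  /* devices partitioned into one MDT plus one OST */
};

/* Each split device contributes exactly one MDT. */
int total_mdts(int servers, const struct server_layout *l)
{
        assert(l->split_devices >= 0 && l->split_devices <= l->devices);
        return servers * l->split_devices;
}

/* Assumption: every device holds one OST, whether or not it is split. */
int total_osts(int servers, const struct server_layout *l)
{
        return servers * l->devices;
}
```

With 24 servers, 12 devices per server and 3 split devices, these helpers give 72 MDTs and 288 OSTs, matching the figures in the report.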

      Below are the syslog messages from the client and servers when attempting to mount the filesystem:

      Client syslog - Nid: 10.47.21.72@o2ib1
      -- Logs begin at Wed 2019-07-03 19:54:04 BST, end at Thu 2019-07-04 13:06:12 BST. --
      Jul 04 12:59:43 cpu-e-1095 kernel: Lustre: DEBUG MARKER: Attempting client mount from 10.47.21.72@o2ib1
      Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(mdc_request.c:2700:mdc_setup()) fs1-MDT0031-mdc-ffff9f4c85ad8000: failed to setup changelog char device: rc = -16
      Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(obd_config.c:559:class_setup()) setup fs1-MDT0031-mdc-ffff9f4c85ad8000 failed (-16)
      Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.47.18.1@o2ib1: cfg command failed: rc = -16
      Jul 04 12:59:56 cpu-e-1095 kernel: Lustre:    cmd=cf003 0:fs1-MDT0031-mdc  1:fs1-MDT0031_UUID  2:10.47.18.17@o2ib1  
      Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 15c-8: MGC10.47.18.1@o2ib1: The configuration from log 'fs1-client' failed (-16). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94774:0:(obd_config.c:610:class_cleanup()) Device 58 not setup
      Jul 04 12:59:56 cpu-e-1095 kernel: Lustre: Unmounted fs1-client
      Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94774:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-16)
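
The rc = -16 that propagates through the client log is a negated errno value; on Linux, errno 16 is EBUSY. A minimal userspace sketch for decoding such return codes follows. The helper name describe_rc is illustrative and not part of Lustre, and the message text comes from the C library's strerror(), so it can differ between systems.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a kernel-style return code such as -16 as
 * "rc = -16 (<system error text>)".
 * Returns the value of snprintf(), i.e. the number of characters
 * the full message needs, excluding the terminating NUL.
 */
int describe_rc(int rc, char *buf, size_t len)
{
        int err = rc < 0 ? -rc : rc;  /* kernel code returns -errno */

        assert(buf != NULL && len > 0);
        return snprintf(buf, len, "rc = %d (%s)", rc, strerror(err));
}
```

Calling describe_rc(-16, buf, sizeof(buf)) fills buf with the code followed by the system's description of EBUSY, which is the error mdc_setup() reports when it fails to set up the changelog char device.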
      
      Servers syslog
      [root@xcat1 ~]# xdsh csd3-buff 'journalctl -a --since "12:59" _TRANSPORT=kernel' | xdshbak -c                                                                                                  
      HOSTS -------------------------------------------------------------------------
      dac-e-1
      -------------------------------------------------------------------------------
      -- Logs begin at Thu 2019-03-21 15:42:02 GMT, end at Thu 2019-07-04 13:04:24 BST. --
      Jul 04 12:59:43 dac-e-1 kernel: Lustre: DEBUG MARKER: Attempting client mount from 10.47.21.72@o2ib1
      Jul 04 12:59:55 dac-e-1 kernel: Lustre: MGS: Connection restored to 08925711-bdfa-621f-89ec-0364645c915c (at 10.47.21.72@o2ib1)
      Jul 04 12:59:55 dac-e-1 kernel: Lustre: Skipped 2036 previous similar messages
      
      HOSTS -------------------------------------------------------------------------
      dac-e-10, dac-e-11, dac-e-12, dac-e-13, dac-e-14, dac-e-15, dac-e-16, dac-e-17, dac-e-18, dac-e-19, dac-e-2, dac-e-20, dac-e-21, dac-e-22, dac-e-23, dac-e-24, dac-e-3, dac-e-4, dac-e-5, dac-e-6, dac-e-7, dac-e-8, dac-e-9
      -------------------------------------------------------------------------------
      -- No entries --
      

      Attached are lustre debug logs from both the client and the dac-e-1 server which contains the MGT.

      I can provide debug logs from all 24 servers if that would help, just let me know.

      I've successfully used the same configuration with 2x MDTs per server, so 48 MDTs in total, without problem, but I haven't confirmed what Scott mentioned on the mailing list about the failure starting at 56 MDTs.

      Thanks,
      Matt

          Activity

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37917
            Subject: LU-12506 mdc: clean up code style for mdc_locks.c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d08b729acb70fba933da40e7699b621e2643355f

            hongchao.zhang Hongchao Zhang added a comment -

            Hi John,
            Thanks! Replacing the misc devices with dynamic devices is a better solution; I have updated the patch accordingly.
            jhammond John Hammond added a comment - edited

            This could/should be solved by using dynamic devices instead of misc devices. See https://review.whamcloud.com/#/c/37552/4/lustre/ofd/ofd_access_log.c@406 for an approach which should work here as well.

            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37759
            Subject: LU-12506 changelog: support large number of MDT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4d1e03fd208504854fbbf3631547b00a32d8c62f
            adilger Andreas Dilger added a comment - edited

            This issue was introduced with patch https://review.whamcloud.com/18900 "LU-7659 mdc: expose changelog through char devices" in commit v2_9_55_0-13-g1d40214d96, so it affects both the 2.10 and 2.12 LTS releases. Please add that as a Fixes: label in the patch commit message when fixing this issue.

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Thanks Patrick, that's great. I'll give this a test in a couple of weeks when I have a window to do some more benchmarking on this hardware - I was interested in just seeing how far we could scale DNE striped directories, so no changelogs on this system. I'll try this and report back then.

            Cheers,
            Matt

            pfarrell Patrick Farrell (Inactive) added a comment -

            Matt,

            The above is absolutely not a fix, it's just a quick hack, but as long as you're not using changelogs, that patch on the client should let you mount with > 64 MDTs.

            gerrit Gerrit Updater added a comment -

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36213
            Subject: LU-12506 mdc: Remove cdev_init
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a6c1ad680f1dc5422bec4483f7c5569ed10793d6
            adilger Andreas Dilger added a comment - edited

            I'd commented previously in LU-11626, but that comment would be better here:

            It makes more sense to multiplex a single character device across multiple MDTs, named "/dev/lustre-changelog". To track the MDT index on the open file handle (default = <onlyfs>-MDT0000, which will work for many systems without any change) add an ioctl() to specify the MDT name for that file handle if needed.

            That avoids the need to create so many character devices, avoids the need to share a single chlg_registered_dev between multiple OBDs (one for each opener), and this interface change can be encapsulated inside the llapi code. This will also avoid the complexity in chlg_registered_dev_find_by_obd() if we have only a single chlg_registered_dev per OBD.

            There would need to be some small changes to liblustreapi_chlg.c to open the lustre_changelog device and call the ioctl() to change the MDT index instead of opening a different device for each MDT, with a fallback to the old behavior if the new device name doesn't exist. Probably the best approach is to change chlg_dev_path() to chlg_dev_open() and return the open file handle or an error instead of the pathname.

            On the kernel side in mdc_changelog_cdev_init(), we might consider still creating a limited number of /dev/changelog-$fsname-MDTnnnn devices (maybe max 16?) for compatibility with userspace applications/libraries that open the old devices and are statically linked to liblustreapi.a (under LUSTRE_VERSION_CODE checks so they go away eventually). However, it shouldn't be an error if the compat devices cannot be created when there are many MDTs, since most clients will not be Changelog consumers.
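
The open-with-fallback behavior proposed in this comment could look roughly like the userspace sketch below. It is only an illustration under stated assumptions: chlg_dev_open_sketch() and chlg_legacy_path() are hypothetical names, the select_mdt callback stands in for the ioctl() that the comment proposes but does not define, and the legacy name is assumed to use a four-digit hexadecimal MDT index.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build the legacy per-MDT path, /dev/changelog-<fsname>-MDTnnnn.
 * Assumption: nnnn is a 4-digit hexadecimal MDT index. */
int chlg_legacy_path(char *buf, size_t len, const char *fsname,
                     unsigned int mdt_idx)
{
        int n;

        assert(buf != NULL && fsname != NULL);
        n = snprintf(buf, len, "/dev/changelog-%s-MDT%04x", fsname, mdt_idx);
        return (n < 0 || (size_t)n >= len) ? -ENAMETOOLONG : 0;
}

/*
 * Try the multiplexed device first, then fall back to the legacy device.
 * select_mdt is a placeholder for the proposed (undefined) ioctl() that
 * binds the file handle to one MDT; it returns 0 or a negative errno.
 * Returns an open file descriptor, or a negative errno on failure.
 */
int chlg_dev_open_sketch(const char *mux_path, const char *fsname,
                         unsigned int mdt_idx,
                         int (*select_mdt)(int fd, const char *fsname,
                                           unsigned int mdt_idx))
{
        char legacy[128];
        int fd, rc;

        fd = open(mux_path, O_RDONLY);
        if (fd >= 0) {
                rc = select_mdt ? select_mdt(fd, fsname, mdt_idx) : 0;
                if (rc == 0)
                        return fd;
                close(fd);
                return rc;
        }
        if (errno != ENOENT)
                return -errno;

        /* The new device name does not exist: use the old behavior. */
        rc = chlg_legacy_path(legacy, sizeof(legacy), fsname, mdt_idx);
        if (rc < 0)
                return rc;
        fd = open(legacy, O_RDONLY);
        return fd >= 0 ? fd : -errno;
}
```

A caller would pass "/dev/lustre-changelog" (the name suggested in the comment) as mux_path. Returning a file descriptor rather than a pathname keeps the device-selection detail inside the library, which is the encapsulation the comment argues for.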


            hongchao.zhang Hongchao Zhang added a comment -

            In the Linux kernel, the number of dynamically allocated misc device minors is limited to 64:

            in drivers/char/misc.c
            ...
            #define DYNAMIC_MINORS 64 /* like dynamic majors */
            static DECLARE_BITMAP(misc_minors, DYNAMIC_MINORS);
            ...

            When a Lustre client mounts the filesystem, one misc device is registered for the ChangeLog of each MDC:

            int mdc_changelog_cdev_init(struct obd_device *obd)
            {
                    ...
                    entry->ced_misc.minor = MISC_DYNAMIC_MINOR;
                    entry->ced_misc.name  = entry->ced_name;
                    entry->ced_misc.fops  = &chlg_fops;
                    ...

                    /* Register new character device */
                    rc = misc_register(&entry->ced_misc);
                    if (rc != 0)
                            GOTO(out_unlock, rc);
                    ...
            }

            misc_register() returns -EBUSY once all 64 dynamic minors are in use, so the mount fails if there are more than 64 MDTs (or fewer than 64 if some misc devices are used by other modules):

            in drivers/char/misc.c
            ...
            #define DYNAMIC_MINORS 64 /* like dynamic majors */
            static DECLARE_BITMAP(misc_minors, DYNAMIC_MINORS);
            ...
            int misc_register(struct miscdevice * misc)
            {
                    ...
                    if (misc->minor == MISC_DYNAMIC_MINOR) {
                            int i = find_first_zero_bit(misc_minors, DYNAMIC_MINORS);
                            if (i >= DYNAMIC_MINORS) {
                                    mutex_unlock(&misc_mtx);
                                    return -EBUSY;
                            }
                            misc->minor = DYNAMIC_MINORS - i - 1;
                            set_bit(i, misc_minors);
                    } else {
                    ...
            }
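
The exhaustion described in this comment can be reproduced with a small userspace model of the dynamic-minor bitmap. This is a sketch only: minor_pool, pool_register() and first_failing_mdt() are illustrative names, the model ignores locking and everything else misc_register() does, and the number of minors already taken by other modules is a free parameter rather than a value stated in the ticket.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DYNAMIC_MINORS 64  /* same limit as drivers/char/misc.c */

/* Userspace stand-in for the kernel's misc_minors bitmap. */
struct minor_pool {
        bool used[DYNAMIC_MINORS];
};

/* Model of the MISC_DYNAMIC_MINOR branch: take the first free slot and
 * return its minor number, or -EBUSY when every slot is in use. */
int pool_register(struct minor_pool *pool)
{
        int i;

        for (i = 0; i < DYNAMIC_MINORS; i++) {
                if (!pool->used[i]) {
                        pool->used[i] = true;
                        return DYNAMIC_MINORS - i - 1;
                }
        }
        return -EBUSY;
}

/*
 * Take 'preexisting' minors on behalf of other modules, then register one
 * changelog device per MDT. Returns the 0-based position of the first MDT
 * whose registration fails, or -1 if all of them succeed.
 */
int first_failing_mdt(int preexisting, int mdt_count)
{
        struct minor_pool pool = { { false } };
        int i;

        assert(preexisting >= 0 && mdt_count >= 0);
        for (i = 0; i < preexisting; i++)
                (void)pool_register(&pool);
        for (i = 0; i < mdt_count; i++)
                if (pool_register(&pool) < 0)
                        return i;
        return -1;
}
```

With no other misc devices, 64 MDTs register and the 65th fails. If, hypothetically, 15 minors were already taken, the first failure would be the MDC at position 49 (0x31), which would be consistent with the fs1-MDT0031 failure in the client log, assuming hexadecimal target indices and in-order setup.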
            pjones Peter Jones added a comment -

            Hongchao

            Can you please investigate?

            Thanks

            Peter


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: mrb Matt Rásó-Barnett (Inactive)
              Votes: 0
              Watchers: 14
