[LU-12506] Client unable to mount filesystem with very large number of MDTs Created: 04/Jul/19  Updated: 06/Apr/21  Resolved: 23/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.8, Lustre 2.12.3
Fix Version/s: Lustre 2.14.0, Lustre 2.12.7

Type: Bug Priority: Major
Reporter: Matt Rásó-Barnett (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Attachments: File cpu-e-1095.20190704-1300.log.gz     File dac-e-1.20190704-1300.log.gz    
Issue Links:
Duplicate
Related
is related to LU-13620 pool_add_targets() defect Resolved
is related to LU-7659 Replace KUC by more standard mechanisms Reopened
is related to LU-14523 can't mount more than 13 lustre files... Open
is related to LU-13508 crash in sanity test 160j Resolved
is related to LU-13321 sanity: 160f failed "mds3: user cl6 i... Resolved
is related to LU-11626 mdc: obd might go away while referenc... Resolved
is related to LU-14058 Create tests for large number of MDTs Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello,
There was a message on the lustre-discuss list about this issue back in May (http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-May/016475.html), and I've managed to reproduce the error. I couldn't find an open ticket for it, however, so I wanted to create one.

My environment is the following:

Servers and clients are using the upstream 2.12.2 release and the same kernel version:

[root@dac-e-1 ~]# lfs --version
lfs 2.12.2
# Server kernel version
3.10.0-957.10.1.el7_lustre.x86_64
# Client kernel version (unpatched)
3.10.0-957.10.1.el7.x86_64

There are 24 servers, each containing 12 NVMe flash devices. For this test I am configuring the block devices on each server identically, with 3 devices on each server partitioned into a 200G MDT plus the remaining space as an OST.

Altogether this makes 72 MDTs and 288 OSTs in the filesystem.

Below are the syslog messages from the client and servers when attempting to mount the filesystem:

Client syslog - Nid: 10.47.21.72@o2ib1
-- Logs begin at Wed 2019-07-03 19:54:04 BST, end at Thu 2019-07-04 13:06:12 BST. --
Jul 04 12:59:43 cpu-e-1095 kernel: Lustre: DEBUG MARKER: Attempting client mount from 10.47.21.72@o2ib1
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(mdc_request.c:2700:mdc_setup()) fs1-MDT0031-mdc-ffff9f4c85ad8000: failed to setup changelog char device: rc = -16
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(obd_config.c:559:class_setup()) setup fs1-MDT0031-mdc-ffff9f4c85ad8000 failed (-16)
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.47.18.1@o2ib1: cfg command failed: rc = -16
Jul 04 12:59:56 cpu-e-1095 kernel: Lustre:    cmd=cf003 0:fs1-MDT0031-mdc  1:fs1-MDT0031_UUID  2:10.47.18.17@o2ib1  
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 15c-8: MGC10.47.18.1@o2ib1: The configuration from log 'fs1-client' failed (-16). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94774:0:(obd_config.c:610:class_cleanup()) Device 58 not setup
Jul 04 12:59:56 cpu-e-1095 kernel: Lustre: Unmounted fs1-client
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94774:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-16)
Servers syslog
[root@xcat1 ~]# xdsh csd3-buff 'journalctl -a --since "12:59" _TRANSPORT=kernel' | xdshbak -c                                                                                                  
HOSTS -------------------------------------------------------------------------
dac-e-1
-------------------------------------------------------------------------------
-- Logs begin at Thu 2019-03-21 15:42:02 GMT, end at Thu 2019-07-04 13:04:24 BST. --
Jul 04 12:59:43 dac-e-1 kernel: Lustre: DEBUG MARKER: Attempting client mount from 10.47.21.72@o2ib1
Jul 04 12:59:55 dac-e-1 kernel: Lustre: MGS: Connection restored to 08925711-bdfa-621f-89ec-0364645c915c (at 10.47.21.72@o2ib1)
Jul 04 12:59:55 dac-e-1 kernel: Lustre: Skipped 2036 previous similar messages

HOSTS -------------------------------------------------------------------------
dac-e-10, dac-e-11, dac-e-12, dac-e-13, dac-e-14, dac-e-15, dac-e-16, dac-e-17, dac-e-18, dac-e-19, dac-e-2, dac-e-20, dac-e-21, dac-e-22, dac-e-23, dac-e-24, dac-e-3, dac-e-4, dac-e-5, dac-e-6, dac-e-7, dac-e-8, dac-e-9
-------------------------------------------------------------------------------
-- No entries --

Attached are Lustre debug logs from both the client and the dac-e-1 server, which contains the MGT.

I can provide debug logs from all 24 servers if that would help, just let me know.

I've successfully used the same configuration with 2 MDTs per server (48 MDTs in total) without problems, but I haven't confirmed what Scott mentioned on the mailing list about the failure starting at 56 MDTs.

Thanks,
Matt



 Comments   
Comment by Peter Jones [ 04/Jul/19 ]

Hongchao

Can you please investigate?

Thanks

Peter

Comment by Hongchao Zhang [ 09/Jul/19 ]

In the Linux kernel, the number of dynamically assigned misc device minors is limited to 64

in drivers/char/misc.c
...
#define DYNAMIC_MINORS 64 /* like dynamic majors */
static DECLARE_BITMAP(misc_minors, DYNAMIC_MINORS);
...

When mounting a Lustre filesystem, one misc device is registered for the changelog of each MDC

int mdc_changelog_cdev_init(struct obd_device *obd)
{
        ...
        entry->ced_misc.minor = MISC_DYNAMIC_MINOR;
        entry->ced_misc.name  = entry->ced_name;
        entry->ced_misc.fops  = &chlg_fops;
        ...    

        /* Register new character device */
        rc = misc_register(&entry->ced_misc);
        if (rc != 0) 
                GOTO(out_unlock, rc);
       ...
}       

misc_register() will return -EBUSY once all 64 dynamic minors are in use, so mounting fails with more than 64 MDTs (the effective limit is below 64 when other modules also use dynamic misc minors). Since EBUSY is 16 on Linux, this is the rc = -16 seen in the client log above.

in drivers/char/misc.c
...
#define DYNAMIC_MINORS 64 /* like dynamic majors */
static DECLARE_BITMAP(misc_minors, DYNAMIC_MINORS);
...
int misc_register(struct miscdevice * misc)
{
        ...
        if (misc->minor == MISC_DYNAMIC_MINOR) {
                int i = find_first_zero_bit(misc_minors, DYNAMIC_MINORS);
                if (i >= DYNAMIC_MINORS) {
                        mutex_unlock(&misc_mtx);
                        return -EBUSY;
                }
                misc->minor = DYNAMIC_MINORS - i - 1;
                set_bit(i, misc_minors);
        } else {
        ...
}
Comment by Andreas Dilger [ 16/Sep/19 ]

I'd commented previously in LU-11626, but that comment would be better here:

It makes more sense to multiplex a single character device, named "/dev/lustre-changelog", across multiple MDTs. To track the MDT index on the open file handle (default = <onlyfs>-MDT0000, which will work for many systems without any change), add an ioctl() to specify the MDT name for that file handle if needed.

That avoids the need to create so many character devices, avoids the need to share a single chlg_registered_dev between multiple OBDs (one for each opener), and this interface change can be encapsulated inside the llapi code. This will also avoid the complexity in chlg_registered_dev_find_by_obd() if we have only a single chlg_registered_dev per OBD.

There would need to be some small changes to liblustreapi_chlg.c to open the lustre_changelog device and call the ioctl() to change the MDT index instead of opening a different device for each MDT, with a fallback to the old behavior if the new device name doesn't exist. Probably the best approach is to change chlg_dev_path() to chlg_dev_open() and return the open file handle or an error instead of the pathname.

On the kernel side in mdc_changelog_cdev_init(), we might consider still creating some limited number of /dev/changelog-$fsname-MDTnnnn devices (maybe max 16?) for compatibility with userspace applications/libraries that are opening the old devices and are statically linked to liblustreapi.a (under LUSTRE_VERSION_CODE checks so they go away eventually). However, it shouldn't be an error if the compat devices cannot be created when there are many MDTs, since most clients will not be changelog consumers.
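To illustrate the proposed interface, below is a minimal userspace sketch of what a chlg_dev_open() along these lines could look like. The device name /dev/lustre-changelog and the LL_IOC_CHLG_SELECT_MDT ioctl are assumptions for illustration only, not actual Lustre interfaces:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical ioctl to select the MDT for this file handle. */
#define LL_IOC_CHLG_SELECT_MDT _IOW('f', 250, char[32])

static int chlg_dev_open(const char *fsname, unsigned int mdt_idx)
{
        char name[64];
        int fd;

        /* Preferred path: single multiplexed device, select MDT via ioctl. */
        fd = open("/dev/lustre-changelog", O_RDONLY);
        if (fd >= 0) {
                char target[32];

                snprintf(target, sizeof(target), "%s-MDT%04x",
                         fsname, mdt_idx);
                if (ioctl(fd, LL_IOC_CHLG_SELECT_MDT, target) == 0)
                        return fd;
                close(fd);
        }

        /* Fallback to the old per-MDT device name. */
        snprintf(name, sizeof(name), "/dev/changelog-%s-MDT%04x",
                 fsname, mdt_idx);
        return open(name, O_RDONLY);
}

If the multiplexed device is absent (e.g. an older client), this falls back to the per-MDT device names, matching the compatibility behavior suggested above.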

Comment by Gerrit Updater [ 17/Sep/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36213
Subject: LU-12506 mdc: Remove cdev_init
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a6c1ad680f1dc5422bec4483f7c5569ed10793d6

Comment by Patrick Farrell (Inactive) [ 17/Sep/19 ]

Matt,

The above is absolutely not a fix; it's just a quick hack. But as long as you're not using changelogs, that patch on the client should let you mount with > 64 MDTs.

Comment by Matt Rásó-Barnett (Inactive) [ 18/Sep/19 ]

Thanks Patrick, that's great. I'll give this a test in a couple of weeks when I have a window to do some more benchmarking on this hardware - I was interested in just seeing how far we could scale DNE striped directories, so no changelogs on this system. I'll try this and report back then.

Cheers,
Matt

Comment by Andreas Dilger [ 27/Jan/20 ]

This issue was introduced with patch https://review.whamcloud.com/18900 "LU-7659 mdc: expose changelog through char devices" in commit v2_9_55_0-13-g1d40214d96, so affects both 2.10 and 2.12 LTS releases. Please add that in Fixes: label in the patch commit message when fixing this issue.

Comment by Gerrit Updater [ 28/Feb/20 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37759
Subject: LU-12506 changelog: support large number of MDT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4d1e03fd208504854fbbf3631547b00a32d8c62f

Comment by John Hammond [ 02/Mar/20 ]

This could/should be solved by using dynamic devices instead of misc devices. See https://review.whamcloud.com/#/c/37552/4/lustre/ofd/ofd_access_log.c@406 for an approach which should work here as well.
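For reference, here is a condensed kernel-side sketch of the dynamic character device approach, loosely following the ofd_access_log.c pattern linked above; the identifiers (chlg_devt, chlg_class, chlg_device_add(), CHLG_MINOR_MAX) are illustrative, not taken from the landed patch:

#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>

#define CHLG_MINOR_MAX 65536    /* no longer capped at 64 dynamic misc minors */

static dev_t chlg_devt;         /* dynamically allocated major */
static struct class *chlg_class;

static int __init chlg_init(void)
{
        int rc;

        /* Reserve a whole dynamic major with a large minor range. */
        rc = alloc_chrdev_region(&chlg_devt, 0, CHLG_MINOR_MAX, "changelog");
        if (rc)
                return rc;

        chlg_class = class_create(THIS_MODULE, "changelog");
        if (IS_ERR(chlg_class)) {
                unregister_chrdev_region(chlg_devt, CHLG_MINOR_MAX);
                return PTR_ERR(chlg_class);
        }
        return 0;
}

/* Per-MDC setup: add a cdev on a free minor and create the /dev node,
 * without consuming any of the 64 dynamic misc minors. */
static int chlg_device_add(struct cdev *cdev,
                           const struct file_operations *fops,
                           unsigned int minor, const char *name)
{
        dev_t devt = MKDEV(MAJOR(chlg_devt), minor);
        struct device *dev;
        int rc;

        cdev_init(cdev, fops);
        cdev->owner = THIS_MODULE;
        rc = cdev_add(cdev, devt, 1);
        if (rc)
                return rc;

        dev = device_create(chlg_class, NULL, devt, NULL,
                            "changelog-%s", name);
        if (IS_ERR(dev)) {
                cdev_del(cdev);
                return PTR_ERR(dev);
        }
        return 0;
}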

Comment by Hongchao Zhang [ 04/Mar/20 ]

Hi John,
Thanks! Replacing the miscdevice with dynamic devices is a better solution; I have updated the patch accordingly.

Comment by Gerrit Updater [ 14/Mar/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37917
Subject: LU-12506 mdc: clean up code style for mdc_locks.c
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d08b729acb70fba933da40e7699b621e2643355f

Comment by Gerrit Updater [ 24/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37759/
Subject: LU-12506 changelog: support large number of MDT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d0423abc1adc717b08de61be3556688cccd52ddf

Comment by Gerrit Updater [ 25/Mar/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38058
Subject: LU-12506 tests: clean up MDT name generation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e5d323b7a9c1aa5969b90ef4fc3ec302a23d46e9

Comment by Gerrit Updater [ 23/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37917/
Subject: LU-12506 mdc: clean up code style for mdc_locks.c
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0716f5a9d98a4fa299b2cfc7cfee236313e3dbcc

Comment by Peter Jones [ 03/Jul/20 ]

mrb, have you re-tried this test with the fix in place?

Comment by Matt Rásó-Barnett (Inactive) [ 10/Jul/20 ]

Hi Peter, I'm afraid I haven't tested it, no, and I'm unlikely to be able to do so for some time, as I'm no longer actively working with this system.

It might be something I get to look at again in Q3/Q4 this year, as we will be installing more all-flash nodes to double the size of our current all-flash Lustre filesystem, so I imagine we will be doing some intensive benchmarking work once the system integration is done. Indeed, with the number of servers we'll have at that point, we will be getting close to needing this fix if we want an MDT on every server in the filesystem.

Cheers,
Matt

Comment by Peter Jones [ 11/Jul/20 ]

OK Matt, fair enough. Let's engage again if/when you are ready to start raising the bar again.

Comment by Alex Kulyavtsev [ 25/Sep/20 ]

Do you have this patch backported to b2_12?

Can it be backported to the upcoming 2.12.6 release?

The patch is required on clients but not on servers?

I likely hit this issue when trying to mount two large Lustre filesystems with 40 MDTs each on the same client: 2*40 = 80 MDTs > 64. I can mount these filesystems one at a time, but not both at the same time.

Comment by Cory Spitz [ 02/Oct/20 ]

> The patch is required on clients but not on servers?
Yes, https://review.whamcloud.com/#/c/37759/ only affects mdc.

Comment by Peter Jones [ 23/Oct/20 ]

The fix itself has landed for 2.14. The creation of a test is being tracked under LU-14058.

Comment by Alex Kulyavtsev [ 17/Nov/20 ]

Peter,

is it possible to backport this patch to 2.12 and include it in the 2.12.6 release? This would simplify upgrades on nodes with the upstream client installed; otherwise I will have to maintain a fork.

I have tested this patch on RHEL with 88 MDTs, with the code built from the HPE source tree.

Comment by Gerrit Updater [ 11/Feb/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41485
Subject: LU-12506 tests: handle more MDTs in sanity.sh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 28fa92e0552f0f9135256fa4611c68e5c6396773

Pushed to LU-14058 instead.

Comment by Gerrit Updater [ 18/Mar/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42087
Subject: LU-12506 changelog: support large number of MDT
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: b9380fe5ed814d91dac2d1d03ad817ffb0869766

Comment by Gerrit Updater [ 06/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42087/
Subject: LU-12506 changelog: support large number of MDT
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 0596a16841406b93ec1e348fcc9eecce62d9fe8b
