[LU-12506] Client unable to mount filesystem with very large number of MDTs Created: 04/Jul/19 Updated: 06/Apr/21 Resolved: 23/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.8, Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.7 |
| Type: | Bug | Priority: | Major |
| Reporter: | Matt Rásó-Barnett (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | | |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hello,

My environment is the following: servers and clients are using the upstream 2.12.2 release and the same kernel version:

[root@dac-e-1 ~]# lfs --version
lfs 2.12.2
# Server kernel version
3.10.0-957.10.1.el7_lustre.x86_64
# Client kernel version (unpatched)
3.10.0-957.10.1.el7.x86_64

There are 24 servers, each containing 12x NVMe flash devices. For this test I am configuring the block devices on each server identically, with 3 devices on each server partitioned into a 200G MDT and the remaining space as an OST. Altogether this makes 72 MDTs and 288 OSTs in the filesystem.

Below are the syslog messages from the client and servers when attempting to mount the filesystem:

Client syslog - Nid: 10.47.21.72@o2ib1

-- Logs begin at Wed 2019-07-03 19:54:04 BST, end at Thu 2019-07-04 13:06:12 BST. --
Jul 04 12:59:43 cpu-e-1095 kernel: Lustre: DEBUG MARKER: Attempting client mount from 10.47.21.72@o2ib1
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(mdc_request.c:2700:mdc_setup()) fs1-MDT0031-mdc-ffff9f4c85ad8000: failed to setup changelog char device: rc = -16
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(obd_config.c:559:class_setup()) setup fs1-MDT0031-mdc-ffff9f4c85ad8000 failed (-16)
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94792:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.47.18.1@o2ib1: cfg command failed: rc = -16
Jul 04 12:59:56 cpu-e-1095 kernel: Lustre: cmd=cf003 0:fs1-MDT0031-mdc 1:fs1-MDT0031_UUID 2:10.47.18.17@o2ib1
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 15c-8: MGC10.47.18.1@o2ib1: The configuration from log 'fs1-client' failed (-16). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94774:0:(obd_config.c:610:class_cleanup()) Device 58 not setup
Jul 04 12:59:56 cpu-e-1095 kernel: Lustre: Unmounted fs1-client
Jul 04 12:59:56 cpu-e-1095 kernel: LustreError: 94774:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-16)

Servers syslog
[root@xcat1 ~]# xdsh csd3-buff 'journalctl -a --since "12:59" _TRANSPORT=kernel' | xdshbak -c
HOSTS -------------------------------------------------------------------------
dac-e-1
-------------------------------------------------------------------------------
-- Logs begin at Thu 2019-03-21 15:42:02 GMT, end at Thu 2019-07-04 13:04:24 BST. --
Jul 04 12:59:43 dac-e-1 kernel: Lustre: DEBUG MARKER: Attempting client mount from 10.47.21.72@o2ib1
Jul 04 12:59:55 dac-e-1 kernel: Lustre: MGS: Connection restored to 08925711-bdfa-621f-89ec-0364645c915c (at 10.47.21.72@o2ib1)
Jul 04 12:59:55 dac-e-1 kernel: Lustre: Skipped 2036 previous similar messages
HOSTS -------------------------------------------------------------------------
dac-e-10, dac-e-11, dac-e-12, dac-e-13, dac-e-14, dac-e-15, dac-e-16, dac-e-17, dac-e-18, dac-e-19, dac-e-2, dac-e-20, dac-e-21, dac-e-22, dac-e-23, dac-e-24, dac-e-3, dac-e-4, dac-e-5, dac-e-6, dac-e-7, dac-e-8, dac-e-9
-------------------------------------------------------------------------------
-- No entries --
Attached are Lustre debug logs from both the client and the dac-e-1 server, which contains the MGT. I can provide debug logs from all 24 servers if that would help; just let me know.

I've successfully used the same configuration with 2x MDTs per server, so 48 MDTs in total, without problems, but I haven't confirmed what Scott mentioned on the mailing list about the failure starting at 56 MDTs.

Thanks, |
| Comments |
| Comment by Peter Jones [ 04/Jul/19 ] |
|
Hongchao, can you please investigate? Thanks, Peter |
| Comment by Hongchao Zhang [ 09/Jul/19 ] |
|
In the Linux kernel, the number of dynamic misc device minors is limited to 64, in drivers/char/misc.c:

...
#define DYNAMIC_MINORS 64 /* like dynamic majors */
static DECLARE_BITMAP(misc_minors, DYNAMIC_MINORS);
...

When mounting Lustre, one misc device is registered for the ChangeLog of each MDC:

int mdc_changelog_cdev_init(struct obd_device *obd)
{
...
entry->ced_misc.minor = MISC_DYNAMIC_MINOR;
entry->ced_misc.name = entry->ced_name;
entry->ced_misc.fops = &chlg_fops;
...
/* Register new character device */
rc = misc_register(&entry->ced_misc);
if (rc != 0)
GOTO(out_unlock, rc);
...
}
misc_register() returns -EBUSY once the 64 dynamic minors are exhausted, so the mount fails with more than 64 MDTs (or fewer, if other modules have already claimed some of the dynamic minors). In drivers/char/misc.c:
...
#define DYNAMIC_MINORS 64 /* like dynamic majors */
static DECLARE_BITMAP(misc_minors, DYNAMIC_MINORS);
...
int misc_register(struct miscdevice * misc)
{
...
if (misc->minor == MISC_DYNAMIC_MINOR) {
int i = find_first_zero_bit(misc_minors, DYNAMIC_MINORS);
if (i >= DYNAMIC_MINORS) {
mutex_unlock(&misc_mtx);
return -EBUSY;
}
misc->minor = DYNAMIC_MINORS - i - 1;
set_bit(i, misc_minors);
} else {
...
}
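
With 72 MDTs the client sets up 72 MDC devices, so the registrations run past the 64-entry bitmap and misc_register() returns -EBUSY, which is exactly the rc = -16 reported by mdc_setup() in the client log. The exhaustion can be reproduced with a minimal standalone module (a hypothetical sketch, not Lustre code):

/*
 * Hypothetical test module: register dynamic misc devices in a loop,
 * the same way mdc_changelog_cdev_init() does once per MDC, and watch
 * misc_register() start returning -EBUSY (-16) once the 64 dynamic
 * minors run out.
 */
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>

#define NDEVS 72 /* one per MDT in the reported configuration */

static const struct file_operations repro_fops = { .owner = THIS_MODULE };
static struct miscdevice repro_devs[NDEVS];
static char repro_names[NDEVS][24];
static int repro_count;

static int __init repro_init(void)
{
	int i, rc;

	for (i = 0; i < NDEVS; i++) {
		snprintf(repro_names[i], sizeof(repro_names[i]),
			 "chlg-repro-%03d", i);
		repro_devs[i].minor = MISC_DYNAMIC_MINOR;
		repro_devs[i].name  = repro_names[i];
		repro_devs[i].fops  = &repro_fops;
		rc = misc_register(&repro_devs[i]);
		if (rc) {
			/* Fires before i reaches 72; earlier still if other
			 * modules already hold some of the 64 dynamic minors. */
			pr_info("misc_register #%d failed: rc = %d\n", i, rc);
			break;
		}
		repro_count++;
	}
	return 0;
}

static void __exit repro_exit(void)
{
	int i;

	for (i = 0; i < repro_count; i++)
		misc_deregister(&repro_devs[i]);
}

module_init(repro_init);
module_exit(repro_exit);
MODULE_LICENSE("GPL");

On an otherwise idle node the failure should appear somewhere in the low sixties, which also explains why 48 MDTs mounted fine while 72 did not.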
|
| Comment by Andreas Dilger [ 16/Sep/19 ] |
|
I'd commented previously in
|
| Comment by Gerrit Updater [ 17/Sep/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36213 |
| Comment by Patrick Farrell (Inactive) [ 17/Sep/19 ] |
|
Matt,

The above is absolutely not a fix, it's just a quick hack, but as long as you're not using changelogs, that patch on the client should let you mount with > 64 MDTs.
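
For anyone reading along, the general shape of such a client-side hack (a hypothetical sketch of the idea, not the actual content of change 36213) would be to make the changelog device registration non-fatal in mdc_setup():

/* Hypothetical sketch: tolerate failure to register the per-MDC
 * changelog char device, so a client that never reads changelogs
 * can still mount with > 64 MDTs. */
rc = mdc_changelog_cdev_init(obd);
if (rc) {
	CWARN("%s: failed to setup changelog char device: rc = %d\n",
	      obd->obd_name, rc);
	rc = 0; /* carry on with the mount; changelog readers won't work */
}

Changelog consumers on such a client would silently lose access to the missing devices, which is why this is a stopgap rather than a fix. |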
| Comment by Matt Rásó-Barnett (Inactive) [ 18/Sep/19 ] |
|
Thanks Patrick, that's great. I'll give this a test in a couple of weeks when I have a window to do some more benchmarking on this hardware - I was interested in just seeing how far we could scale DNE striped directories, so no changelogs on this system. I'll try this and report back then. Cheers, |
| Comment by Andreas Dilger [ 27/Jan/20 ] |
|
This issue was introduced with patch https://review.whamcloud.com/18900 "LU-7659 mdc: expose changelog through char devices" in commit v2_9_55_0-13-g1d40214d96, so it affects both the 2.10 and 2.12 LTS releases. Please add that as a Fixes: label in the patch commit message when fixing this issue.
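For reference, that trailer would read: Fixes: 1d40214d96 ("LU-7659 mdc: expose changelog through char devices") |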
| Comment by Gerrit Updater [ 28/Feb/20 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37759 |
| Comment by John Hammond [ 02/Mar/20 ] |
|
This could/should be solved by using dynamic char devices instead of misc devices. See https://review.whamcloud.com/#/c/37552/4/lustre/ofd/ofd_access_log.c@406 for an approach which should work here as well.
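
In outline, that approach reserves a private char device region instead of competing for the 64 shared dynamic misc minors, so the only limit is the size of the region. A rough sketch of the idea (illustrative names and a shared-region assumption; not the code that eventually landed):

#include <linux/module.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/fs.h>
#include <linux/idr.h>

#define CHLG_MINOR_MAX 65536 /* bounded only by the region we reserve */

static dev_t chlg_devt;
static struct class *chlg_class;
static DEFINE_IDA(chlg_minor_ida);

/* Called once at module init: grab a dynamically-assigned major with
 * plenty of minors of its own. */
static int chlg_region_init(void)
{
	int rc;

	rc = alloc_chrdev_region(&chlg_devt, 0, CHLG_MINOR_MAX, "changelog");
	if (rc)
		return rc;

	chlg_class = class_create(THIS_MODULE, "changelog");
	if (IS_ERR(chlg_class)) {
		rc = PTR_ERR(chlg_class);
		unregister_chrdev_region(chlg_devt, CHLG_MINOR_MAX);
	}
	return rc;
}

/* Called once per MDC, replacing misc_register(); the caller is
 * assumed to have done cdev_init(cdev, &chlg_fops) already. */
static int chlg_dev_register(struct cdev *cdev, const char *obdname)
{
	int minor = ida_simple_get(&chlg_minor_ida, 0, CHLG_MINOR_MAX,
				   GFP_KERNEL);
	int rc;

	if (minor < 0)
		return minor;
	rc = cdev_add(cdev, MKDEV(MAJOR(chlg_devt), minor), 1);
	if (rc) {
		ida_simple_remove(&chlg_minor_ida, minor);
		return rc;
	}
	/* device_create() has udev publish the /dev node for this MDC */
	device_create(chlg_class, NULL, MKDEV(MAJOR(chlg_devt), minor),
		      NULL, "changelog-%s", obdname);
	return 0;
}

With a region like this, 72 (or 720) MDCs never collide with other users of the misc subsystem. |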
| Comment by Hongchao Zhang [ 04/Mar/20 ] |
|
Hi John, |
| Comment by Gerrit Updater [ 14/Mar/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37917 |
| Comment by Gerrit Updater [ 24/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37759/ |
| Comment by Gerrit Updater [ 25/Mar/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38058 |
| Comment by Gerrit Updater [ 23/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37917/ |
| Comment by Peter Jones [ 03/Jul/20 ] |
|
mrb, have you re-tried this test with the fix in place? |
| Comment by Matt Rásó-Barnett (Inactive) [ 10/Jul/20 ] |
|
Hi Peter,

I'm afraid I haven't tested it, no, and I'm unlikely to be able to do so for some time, as I'm no longer actively working with this system. It might be something I get to look at again in Q3/Q4 this year, as we will be installing more all-flash nodes to double the size of our current all-flash Lustre filesystem, so I imagine we will be doing some intensive benchmarking once the system integration is done. Indeed, with the number of servers we'll have at that point, we will be getting close to needing this fix if we want to have an MDT on every server in the filesystem.

Cheers, |
| Comment by Peter Jones [ 11/Jul/20 ] |
|
OK Matt, fair enough. Let's engage again if/when you are ready to start raising the bar again |
| Comment by Alex Kulyavtsev [ 25/Sep/20 ] |
|
Do you have this patch backported to b2_12? Can it be backported for the upcoming 2.12.6 release? The patch is required on clients but not on servers?

I likely hit this issue when trying to mount two large Lustre filesystems with 40 MDTs each on the same client: the MDT count is 2*40 = 80 > 64. I can mount these filesystems one at a time, but not both at the same time.
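
For what it's worth, how close a client is to the limit before mounting the second filesystem can be estimated from /proc/misc, since the dynamic range is minors 0-63 (per the misc.c excerpt above). A small userspace helper (hypothetical, not part of Lustre) to count them:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/misc", "r");
	char name[64];
	int minor, dynamic = 0;

	if (!f) {
		perror("/proc/misc");
		return 1;
	}
	/* /proc/misc lines are "<minor> <name>"; anything below 64 sits
	 * in the dynamic window (a few static devices live there too,
	 * so treat the count as an estimate). */
	while (fscanf(f, "%d %63s", &minor, name) == 2)
		if (minor < 64)
			dynamic++;
	fclose(f);
	printf("%d of 64 dynamic misc minors in use\n", dynamic);
	return 0;
}

With 40 changelog devices already registered by the first filesystem, the second mount runs out of minors partway through, which matches the one-at-a-time behaviour you describe. |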
| Comment by Cory Spitz [ 02/Oct/20 ] |
|
> The patch is required on clients but not on servers? |
| Comment by Peter Jones [ 23/Oct/20 ] |
|
The fix itself has landed for 2.14. The creation of a test is being tracked under LU-14058. |
| Comment by Alex Kulyavtsev [ 17/Nov/20 ] |
|
Peter, is it possible to backport this patch to 2.12 and include it in the 2.12.6 release? This would simplify upgrades on nodes with the upstream client installed; otherwise I will have to fork off. I have tested this patch on RHEL with 88 MDTs, with the code built from the HPE source tree.
|
| Comment by Gerrit Updater [ 11/Feb/21 ] |
|
Pushed to LU-14058 instead. |
| Comment by Gerrit Updater [ 18/Mar/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42087 |
| Comment by Gerrit Updater [ 06/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42087/ |