[LU-10760] MGS startup is happening asynchronously and taking a long time Created: 02/Mar/18 Updated: 17/Mar/18 Resolved: 17/Mar/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Brian Murrell (Inactive) | Assignee: | Nathaniel Clark |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I have noticed that from 2.10.3 to b2_10 MGS startup is happening asynchronously and taking much longer. In 2.10.3, immediately after (or within a very small number of seconds at most) the mount for the MGS returns it reports as being up: # mount -t lustre zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13/MGS /mnt/MGS # lctl dl: 0 UP osd-zfs MGS-osd MGS-osd_UUID 4 1 UP mgs MGS MGS 4 2 UP mgc MGC10.14.83.84@tcp f2845b74-6eb5-07a5-1aaa-d7c54f76df5f 4 However the same operation on b2_10 is taking on the order of 10-11 seconds after the mount returns before lctl dl shows the MGS as being up. I have attached debug-lctl-1519949571.log.xz |
| Comments |
| Comment by Peter Jones [ 06/Mar/18 ] |
|
Nathaniel Can you please investigate? Thanks Peter |
| Comment by Nathaniel Clark [ 06/Mar/18 ] |
|
Looking at the debug-lctl-1519949571.log.xz Thu Mar 1 19:07:59 2018 [17187] (mgs_handler.c:1010:mgs_iocontrol()) Process entered Thu Mar 1 19:07:59 2018 [17187] (mgs_handler.c:1011:mgs_iocontrol()) handling ioctl cmd 0x400866bb ... Thu Mar 1 19:07:59 2018 [17187] (mgs_llog.c:4832:mgs_set_conf_param()) Process entered ... Thu Mar 1 19:07:59 2018 [17187] (mgs_llog.c:3609:mgs_write_log_param()) next param 'llite.max_cached_mb=16' ... Thu Mar 1 19:07:59 2018 [17187] (mgs_llog.c:4941:mgs_set_conf_param()) Process leaving (rc=0 : 0 : 0) Thu Mar 1 19:07:59 2018 [17187] (mgs_handler.c:1132:mgs_iocontrol()) Process leaving (rc=0 : 0 : 0) Thu Mar 1 19:07:59 2018 [17187] (obd_class.h:1168:obd_iocontrol()) Process leaving (rc=0 : 0 : 0) There are a couple other call chains like that, none take longer than a second to complete. Likewise MGS startup seems to start and end within a second Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1514:lustre_fill_super()) Process entered Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1516:lustre_fill_super()) VFS Op: sb ffff880078e06000 Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:574:lustre_init_lsi()) Process entered Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:576:lustre_init_lsi()) kmalloced 'lsi': 984 at ffff880079cd7800. Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:579:lustre_init_lsi()) kmalloced 'lsi->lsi_lmd': 104 at ffff880077d2b700. Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:597:lustre_init_lsi()) Process leaving (rc=18446612134357727232 : -131939351824384 : ffff880079cd7800) Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1247:lmd_parse()) Process entered Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1265:lmd_parse()) kmalloced 'lmd->lmd_params': 4096 at ffff880049e40000. Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1107:lmd_parse_string()) kmalloced '*handle': 8 at ffff880078338d08. Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1107:lmd_parse_string()) kmalloced '*handle': 4 at ffff880078338cc8. Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1471:lmd_parse()) kmalloced 'lmd->lmd_dev': 43 at ffff880078dd4f40. Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:931:lmd_print()) mount data: Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:934:lmd_print()) device: zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk1/MGS Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:935:lmd_print()) flags: 2200 ... Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1578:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0) Thu Mar 1 19:06:14 2018 [12364] (obd_mount.c:1585:lustre_fill_super()) Mount zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk1/MGS complete What conf_param are you setting that should return error? |
| Comment by Brian Murrell (Inactive) [ 09/Mar/18 ] |
|
Please find attached a lustre debug log # lctl conf_param testfs.llite.max_cached_mb=16 No device found for name MGS: Invalid argument This command must be run on the MGS. error: conf_param: No such device # echo $? 19 Right after starting the MGS and where only a moment later the same command succeeded in this lustre debug log |
| Comment by Nathaniel Clark [ 12/Mar/18 ] |
|
Looking at debug-lctl-0-1520607668.log.xz Fri Mar 9 09:44:49 2018 [ 1153] (api-ni.c:1462:lnet_startup_lndni()) Added LNI 10.14.82.96@tcp [8/256/0/180] MGS Mounts: Fri Mar 9 09:54:25 2018 [12151] (obd_mount.c:1514:lustre_fill_super()) Process entered Fri Mar 9 09:54:25 2018 [12151] (obd_mount.c:1516:lustre_fill_super()) VFS Op: sb ffff88007a2ec000 ... Fri Mar 9 09:54:25 2018 [12151] (obd_mount.c:1578:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0) Fri Mar 9 09:54:25 2018 [12151] (obd_mount.c:1585:lustre_fill_super()) Mount zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk5/MGS complete MDS Mounts: Fri Mar 9 09:54:29 2018 [12419] (obd_mount.c:1514:lustre_fill_super()) Process entered Fri Mar 9 09:54:29 2018 [12419] (obd_mount.c:1516:lustre_fill_super()) VFS Op: sb ffff88007acf7800 ... Fri Mar 9 09:54:31 2018 [12419] (obd_mount.c:1578:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0) Fri Mar 9 09:54:31 2018 [12419] (obd_mount.c:1585:lustre_fill_super()) Mount zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk3/testfs-MDT0000 complete OST Mounts: Fri Mar 9 09:55:15 2018 [14657] (obd_mount.c:1514:lustre_fill_super()) Process entered Fri Mar 9 09:55:15 2018 [14657] (obd_mount.c:1516:lustre_fill_super()) VFS Op: sb ffff88007b29e000 ... Fri Mar 9 09:55:15 2018 [14657] (obd_mount_server.c:1757:osd_start()) Attempting to start testfs-OST0000, type=osd-ldiskfs, lsifl=200062, mountfl=0 ... Fri Mar 9 09:55:15 2018 [14657] (obd_mount.c:1578:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0) Fri Mar 9 09:55:15 2018 [14657] (obd_mount.c:1585:lustre_fill_super()) Mount /dev/sde complete # lctl conf_param testfs.llite.max_cached_mb=16 runs successfully Fri Mar 9 09:55:58 2018 [16654] (mgs_handler.c:1010:mgs_iocontrol()) Process entered Fri Mar 9 09:55:58 2018 [16654] (mgs_handler.c:1011:mgs_iocontrol()) handling ioctl cmd 0x400866bb ... Fri Mar 9 09:55:58 2018 [16654] (mgs_llog.c:4911:mgs_set_conf_param()) set_conf_param fs='testfs' device='testfs' param='llite.max_cached_mb=16' ... Fri Mar 9 09:55:58 2018 [16654] (linux-module.c:235:obd_class_release()) Process entered Fri Mar 9 09:55:58 2018 [16654] (linux-module.c:238:obd_class_release()) Process leaving (rc=0 : 0 : 0) The failing lctl doesn't seem to show up in the debug log. Can you run the initial (failing) lctl with strace -fytt? |
| Comment by Brian Murrell (Inactive) [ 14/Mar/18 ] |
Please find attached the strace output FWIW, the output of lctl dl immediately after the failed: # lctl conf_param testfs.llite.max_cached_mb=16 No device found for name MGS: Invalid argument This command must be run on the MGS. error: conf_param: No such device looked like this: # lctl dl # echo $? # 0 |
| Comment by Nathaniel Clark [ 14/Mar/18 ] |
|
Here is the ioctl call via strace: 24654 06:23:31.438314 ioctl(3</dev/obd>, _IOC(_IOC_READ|_IOC_WRITE, 0x66, 0x7f, 0x08), 0x7fffef9492a0) = -1 EINVAL (Invalid argument) Here is the corresponding debug log: Wed Mar 14 09:23:31 2018 [24654] (linux-module.c:226:obd_class_open()) Process entered Wed Mar 14 09:23:31 2018 [24654] (linux-module.c:229:obd_class_open()) Process leaving (rc=0 : 0 : 0) Wed Mar 14 09:23:31 2018 [24654] (linux-module.c:246:obd_class_ioctl()) Process entered Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:222:class_handle_ioctl()) Process entered Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:232:class_handle_ioctl()) cmd = c008667f Wed Mar 14 09:23:31 2018 [24654] (linux-module.c:155:obd_ioctl_getdata()) Process entered Wed Mar 14 09:23:31 2018 [24654] (linux-module.c:181:obd_ioctl_getdata()) kmalloced '*buf': 584 at ffff880043251400. Wed Mar 14 09:23:31 2018 [24654] (linux-module.c:219:obd_ioctl_getdata()) Process leaving (rc=0 : 0 : 0) Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:191:class_resolve_dev_name()) Process entered Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:201:class_resolve_dev_name()) device name MGS Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:204:class_resolve_dev_name()) No device for name MGS! Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:205:class_resolve_dev_name()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea) Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:212:class_resolve_dev_name()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea) Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:299:class_handle_ioctl()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea) Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:426:class_handle_ioctl()) kfreed 'buf': 584 at ffff880043251400. Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:427:class_handle_ioctl()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea) Wed Mar 14 09:23:31 2018 [24654] (linux-module.c:256:obd_class_ioctl()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea) |
| Comment by Nathaniel Clark [ 15/Mar/18 ] |
|
Now that I know what to look for I've found it in previous debug-lctl-0-1520607668.log.xz Fri Mar 9 09:54:25 2018 [12151] (obd_config.c:372:class_attach()) Process entered Fri Mar 9 09:54:25 2018 [12151] (genops.c:309:class_newdev()) Process entered ... Fri Mar 9 09:54:25 2018 [12151] (genops.c:140:class_get_type()) Loaded module 'mgs' Fri Mar 9 09:54:25 2018 [12151] (genops.c:373:class_newdev()) Allocate new device MGS (ffff880049e88f78) ... Fri Mar 9 09:54:25 2018 [12151] (genops.c:505:class_register_device()) Process leaving (rc=0 : 0 : 0) Fri Mar 9 09:54:25 2018 [12151] (obd_config.c:430:class_attach()) OBD: dev 1 attached type mgs with refcount 1 Fri Mar 9 09:54:25 2018 [12151] (obd_config.c:432:class_attach()) Process leaving (rc=0 : 0 : 0) But then ioctl can't find it: Fri Mar 9 10:01:08 2018 [25094] (class_obd.c:191:class_resolve_dev_name()) Process entered Fri Mar 9 10:01:08 2018 [25094] (class_obd.c:201:class_resolve_dev_name()) device name MGS Fri Mar 9 10:01:08 2018 [25094] (class_obd.c:204:class_resolve_dev_name()) No device for name MGS! Fri Mar 9 10:01:08 2018 [25094] (class_obd.c:205:class_resolve_dev_name()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea) Fri Mar 9 10:01:08 2018 [25094] (class_obd.c:212:class_resolve_dev_name()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea) |
| Comment by Nathaniel Clark [ 15/Mar/18 ] |
|
Re-looking at debug-lctl-0-1520607668.log.xz Fri Mar 9 10:00:35 2018 [24022] (obd_mount_server.c:1618:server_put_super()) server umount MGS complete The MGS is unmounting prior to you running conf param. Same with debug-lctl-0-1521033813.44.log.xz Wed Mar 14 09:16:44 2018 [12129] (obd_mount_server.c:236:server_start_mgs()) Start MGS service MGS ... Wed Mar 14 09:23:02 2018 [23609] (obd_mount_server.c:1618:server_put_super()) server umount MGS complete ... Wed Mar 14 09:23:31 2018 [24654] (class_obd.c:204:class_resolve_dev_name()) No device for name MGS! |
| Comment by Brian Murrell (Inactive) [ 15/Mar/18 ] |
|
utopiabound: Just want to say be careful. You might have realised this already, but just in case you haven't, there could be multiple MGS start/stop operations before the one where the subsequent lctl conf_parm ioctl can't find it. As in the case above, the almost 7 minutes between the MGS start and the No device for name MGS seems much too long. The delay I am seeing seems to be on the order of about 10 seconds, not many minutes. |
| Comment by Brian Murrell (Inactive) [ 17/Mar/18 ] |
|
This turned out to be a subtle difference in the way Pacemaker reports resource status in RHEL 7.5 which was causing IML to report a device as started before it actually was. |