[LU-11626] mdc: obd might go away while referenced by code in mdc_changelog Created: 06/Nov/18 Updated: 05/Aug/20 Resolved: 16/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | CEA | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
In lustre/mdc/mdc_changelog.c: /**
* Find the OBD device associated to a changelog character device.
* @param[in] cdev character device instance descriptor
* @return corresponding OBD device or NULL if none was found.
*/
static struct obd_device *chlg_obd_get(dev_t cdev)
{
int minor = MINOR(cdev);
struct obd_device *obd = NULL;
struct chlg_registered_dev *curr;
mutex_lock(&chlg_registered_dev_lock);
list_for_each_entry(curr, &chlg_registered_devices, ced_link) {
if (curr->ced_misc.minor == minor) {
/* take the first available OBD device attached */
obd = list_first_entry(&curr->ced_obds,
struct obd_device,
u.cli.cl_chg_dev_linkage);
break;
}
}
mutex_unlock(&chlg_registered_dev_lock);
return obd;
}
The "take the first available OBD device attached" approach is broken as that OBD might go away while the changelog is being read. Here is an example of how things can go wrong:
That triggers an LBUG. |
| Comments |
| Comment by Peter Jones [ 06/Nov/18 ] |
|
Quentin Is this something that you plan to work on a fix for? Peter |
| Comment by Quentin Bouget [ 07/Nov/18 ] |
|
I would rather not. Neil Brown is working on it, I am trying to help, but I am not available enough to try and write a whole fix for it. The discussion is also happening on lustre-devel, anyone interested is welcome to chime in. I'll report here if we reach some kind of consensus on how to fix this. |
| Comment by Quentin Bouget [ 07/Nov/18 ] |
|
I think someone familiar with the internals of llogs could help quite a lot here. |
| Comment by Peter Jones [ 07/Nov/18 ] |
|
simmonsja is this something that you are monitoring along with other upstream tasks? |
| Comment by James A Simmons [ 07/Nov/18 ] |
|
Yes I'm involved. |
| Comment by James A Simmons [ 19/Jul/19 ] |
|
The one change log device to N mdc obd devices seems strange to me. If you have multiple mounts on a client of the same file system do each obd device point to the same obd_import? |
| Comment by Andreas Dilger [ 01/Aug/19 ] |
There is already a problem in It makes more sense to multiplex a single character device across multiple MDTs, named changelog-$fsname. To track the MDT index on the open file handle (default = 0, which will work for most filesystems without any change) add an ioctl() to change the MDT index for that file handle if needed. That avoids the need to create so many character devices, avoids the need to share a single chlg_registered_dev between multiple OBDs (one for each opener), and this interface change can be encapsulated inside the llapi code. This will also avoid the complexity in chlg_registered_dev_find_by_obd() if we have only a single chlg_registered_dev per OBD. There would need to be some small changes to liblustreapi_chlg.c to open the changelog-$fsname device and call the ioctl() to change the MDT index instead of opening a different device for each MDT, with a fallback to the old behavior if the new device name doesn't exist. Probably the best is to change chlg_dev_path() to chlg_dev_open() and return the open file handle or an error instead of the pathname. On the kernel side in mdc_changelog_cdev_init(), we should consider still creating some limited number of changelog-$fsname-MDTnnnn devices (maybe max 16?) for compatibility with userspace applications/libraries that are opening the old devices and are statically linked to liblustreapi.a (under LUSTRE_VERSION_CODE checks so they go away eventually). However, it shouldn't be an error if the compat devices cannot be created if there are many MDTs, since most clients will not be changelog consumers. |
| Comment by Andreas Dilger [ 01/Aug/19 ] |
|
bougetq, |
| Comment by Andreas Dilger [ 01/Aug/19 ] |
|
Hongchao, assigning this to you since I think the fix will also resolve |
| Comment by Andreas Dilger [ 01/Aug/19 ] |
|
Actually, having looked at this code in more detail, it seems that the obvious first step would be to either get a device reference between chlg_open() and chlg_close(), or a device reference each chlg_load() and chlg_write(). For both cases:
For open-to-close reference:
For read/write references:
|
| Comment by Quentin Bouget [ 05/Aug/19 ] |
It uses llapi_changelog_start(). |
| Comment by Andreas Dilger [ 08/Aug/19 ] |
|
It might make sense to also allow writing to the open file descriptor like "mdt:mdt_idx" to set the specific MDT to be used. That allows userspace scripts to potentially read and manage the changelog file, instead of having to find some way to call ioctl(). |
| Comment by Gerrit Updater [ 13/Aug/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35784 |
| Comment by James A Simmons [ 15/Aug/19 ] |
|
Patch https://review.whamcloud.com/35668 which landed resolved this. It would be nice to push a reproducer to the test suite from the information in this ticket. That was missing in the patch. |
| Comment by Hongchao Zhang [ 20/Aug/19 ] |
|
Hi James, [ 106.920737] general protection fault: 0000 [#1] SMP [ 106.927417] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) intel_powerclamp iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg parport_pc pcspkr parport i2c_piix4 video i2c_core ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod crc_t10dif crct10dif_generic pata_acpi ahci libahci ata_piix libata e1000 crct10dif_pclmul crct10dif_common crc32c_intel serio_raw dm_mirror dm_region_hash dm_log dm_mod [ 106.941934] CPU: 0 PID: 1287 Comm: bash Kdump: loaded Tainted: P OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 [ 106.944006] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 106.945492] task: ffff8f61898b5ee0 ti: ffff8f61803e4000 task.ti: ffff8f61803e4000 [ 106.946817] RIP: 0010:[<ffffffffc09ed510>] [<ffffffffc09ed510>] chlg_write+0x450/0x6c0 [mdc] [ 106.948338] RSP: 0018:ffff8f61803e7e70 EFLAGS: 00010206 [ 106.949300] RAX: ffff8f6047e741e0 RBX: ffff8f6171eeac00 RCX: 0000000000000000 [ 106.950582] RDX: 0000000000000000 RSI: ffffffffc09f8d74 RDI: 0000000000000000 [ 106.951766] RBP: ffff8f61803e7ec0 R08: ffff8f6171eeac0a R09: 0000000000000000 [ 106.953063] R10: 000000000000000a R11: f000000000000000 R12: 000000000000000b [ 106.954279] R13: 00007f9075b49000 R14: 5a5a5a5a5a5a5a5a R15: 0000000000000000 [ 106.955473] FS: 00007f9075b43740(0000) GS:ffff8f619fc00000(0000) knlGS:0000000000000000 [ 106.956802] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.957723] CR2: 00005619854aa528 CR3: 0000000216e52000 CR4: 00000000000606f0 [ 106.958892] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 106.960067] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 106.961216] Call Trace: [ 106.961613] [<ffffffffac41b490>] vfs_write+0xc0/0x1f0 [ 106.962440] [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146 [ 106.963499] [<ffffffffac41c2bf>] SyS_write+0x7f/0xf0 [ 106.964300] [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146 [ 106.965349] [<ffffffffac920795>] system_call_fastpath+0x1c/0x21 [ 106.966311] [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146 [ 106.967362] Code: 27 7e 01 00 ef 01 00 00 c7 05 21 7e 01 00 01 00 00 00 48 c7 05 1e 7e 01 00 00 00 00 00 e8 89 d2 90 ff 4d 85 f6 0f 84 33 01 00 00 <49> 8b 86 f8 00 00 00 48 85 c0 0f 84 a3 00 00 00 48 8b 10 48 85 [ 106.970783] RIP [<ffffffffc09ed510>] chlg_write+0x450/0x6c0 [mdc] [ 106.971791] RSP <ffff8f61803e7e70> [ 0.794662] mce: Unable to init device /dev/mcelog (rc: -5) [ 0.800919] Failed to execute /init [ 0.801546] Kernel panic - not syncing: No init found. Try passing init= option to kernel. See Linux Documentation/init.txt for guidance. [ 0.803450] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-862.9.1.el7.x86_64 #1 [ 0.804651] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 0.805965] Call Trace: [ 0.806399] [<ffffffff8170e84e>] dump_stack+0x19/0x1b [ 0.807277] [<ffffffff81708b50>] panic+0xe8/0x21f [ 0.808041] [<ffffffff8122ccdd>] ? putname+0x3d/0x60 [ 0.808799] [<ffffffff816fd780>] ? rest_init+0x80/0x80 [ 0.809627] [<ffffffff816fd862>] kernel_init+0xe2/0xf0 [ 0.810741] [<ffffffff817205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 0.811744] [<ffffffff816fd780>] ? rest_init+0x80/0x80 |
| Comment by Gerrit Updater [ 16/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35784/ |
| Comment by Peter Jones [ 16/Sep/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 01/Oct/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36338 |
| Comment by James A Simmons [ 08/Oct/19 ] |
|
Peter, any chance this can land for the 2.12.3 release |
| Comment by Peter Jones [ 08/Oct/19 ] |
|
James We tagged RC1 this morning so, while there's still a chance, it is looking unlikely as things stand Peter |
| Comment by Gerrit Updater [ 21/Nov/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36338/ |