[LU-11626] mdc: obd might go away while referenced by code in mdc_changelog Created: 06/Nov/18  Updated: 05/Aug/20  Resolved: 16/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Bug Priority: Minor
Reporter: CEA Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Attachments: File reproducer.bash    
Issue Links:
Duplicate
is duplicated by LU-12566 GPF when umounting client Resolved
Related
is related to LU-12506 Client unable to mount filesystem wit... Resolved
is related to LU-7659 Replace KUC by more standard mechanisms Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In lustre/mdc/mdc_changelog.c:

/**
 * Find the OBD device associated to a changelog character device.
 * @param[in]  cdev  character device instance descriptor
 * @return corresponding OBD device or NULL if none was found.
 */
static struct obd_device *chlg_obd_get(dev_t cdev)
{
	int minor = MINOR(cdev);
	struct obd_device *obd = NULL;
	struct chlg_registered_dev *curr;

	mutex_lock(&chlg_registered_dev_lock);
	list_for_each_entry(curr, &chlg_registered_devices, ced_link) {
		if (curr->ced_misc.minor == minor) {
			/* take the first available OBD device attached */
			obd = list_first_entry(&curr->ced_obds,
					       struct obd_device,
					       u.cli.cl_chg_dev_linkage);
			break;
		}
	}
	mutex_unlock(&chlg_registered_dev_lock);
	return obd;
}

The "take the first available OBD device attached" approach is broken as that OBD might go away while the changelog is being read.

Here is an example of how things can go wrong:

  • mount lustre at /mnt/lustre;
  • mount the same lustre filesystem at /mnt/lustre2;
  • open /dev/changelog-lustre-MDT0000 in writing mode;
  • unmount /mnt/lustre;
  • write 'clear:cl0:0' to the opened chardevice.

That triggers an LBUG.



 Comments   
Comment by Peter Jones [ 06/Nov/18 ]

Quentin

Is this something that you plan to work on a fix for?

Peter

Comment by Quentin Bouget [ 07/Nov/18 ]

I would rather not. Neil Brown is working on it, I am trying to help, but I am not available enough to try and write a whole fix for it.

The discussion is also happening on lustre-devel, anyone interested is welcome to chime in. I'll report here if we reach some kind of consensus on how to fix this.

Comment by Quentin Bouget [ 07/Nov/18 ]

I think someone familiar with the internals of llogs could help quite a lot here.

Comment by Peter Jones [ 07/Nov/18 ]

simmonsja is this something that you are monitoring along with other upstream tasks?

Comment by James A Simmons [ 07/Nov/18 ]

Yes I'm involved.

Comment by James A Simmons [ 19/Jul/19 ]

The one change log device to N mdc obd devices seems strange to me. If you have multiple mounts on a client of the same file system do each obd device point to the same obd_import?

Comment by Andreas Dilger [ 01/Aug/19 ]

The one change log device to N mdc obd devices seems strange to me. If you have multiple mounts on a client of the same file system do each obd device point to the same obd_import?

There is already a problem in LU-12506 with the client trying to mount a filesystem with a lot of MDTs, because it is trying to create a misc character device for each MDT. That is limited to 64 misc devices per client, so if we create a separate character device for each mount it would cause mount failures with half the number of MDTs in half or fewer.

It makes more sense to multiplex a single character device across multiple MDTs, named changelog-$fsname. To track the MDT index on the open file handle (default = 0, which will work for most filesystems without any change) add an ioctl() to change the MDT index for that file handle if needed.

That avoids the need to create so many character devices, avoids the need to share a single chlg_registered_dev between multiple OBDs (one for each opener), and this interface change can be encapsulated inside the llapi code. This will also avoid the complexity in chlg_registered_dev_find_by_obd() if we have only a single chlg_registered_dev per OBD.

There would need to be some small changes to liblustreapi_chlg.c to open the changelog-$fsname device and call the ioctl() to change the MDT index instead of opening a different device for each MDT, with a fallback to the old behavior if the new device name doesn't exist. Probably the best is to change chlg_dev_path() to chlg_dev_open() and return the open file handle or an error instead of the pathname.

On the kernel side in mdc_changelog_cdev_init(), we should consider still creating some limited number of changelog-$fsname-MDTnnnn devices (maybe max 16?) for compatibility with userspace applications/libraries that are opening the old devices and are statically linked to liblustreapi.a (under LUSTRE_VERSION_CODE checks so they go away eventually). However, it shouldn't be an error if the compat devices cannot be created if there are many MDTs, since most clients will not be changelog consumers.

Comment by Andreas Dilger [ 01/Aug/19 ]

bougetq,
I see that in patch https://review.whamcloud.com/18900 "LU-7659 mdc: expose changelog through char devices" one of the motivations is "... to allow non-llapi to read them and to make delivery more efficient." Does RobinHood open and access the changelog-$fsname-MDTxxxx devices directly, or does it still use llapi_changelog_start() to open the char device?

Comment by Andreas Dilger [ 01/Aug/19 ]

Hongchao, assigning this to you since I think the fix will also resolve LU-12506.

Comment by Andreas Dilger [ 01/Aug/19 ]

Actually, having looked at this code in more detail, it seems that the obvious first step would be to either get a device reference between chlg_open() and chlg_close(), or a device reference each chlg_load() and chlg_write().

For both cases:

  • have chlg_obd_get() actually return a reference class_incref(obd) so that it is not a dangling pointer to a structure that might be freed. There should be a chlg_obd_put() that drops the reference using class_decref(crs_obd).

For open-to-close reference:

  • the reference on crs_obd should be gotten in chlg_open() and dropped in chlg_release(). One drawback is that an open Changelog reader would leave a danging OBD device configured when the client is unmounted, though "lsof" would at least show the "/dev/changelog-$fsname-MDTxxxx" device as open, and hopefully the cleanup would still work properly when the device is closed after the client is unmounted.

For read/write references:

  • obd_device would be looked up each time via chlg_obd_get(), by saving crs_rdev = i_rdev in chlg_open() instead of crs_obd, maybe after validating that the device actually exists. This adds some overhead, but may allow the Changelog code to be more robust, as the client could be unmounted/mounted cleanly if the changelog consumer is not actively using the device.
  • chlg_write->chlg_clear() is not as performance critical as the chlg_read() path, so doing a lookup each time is probably fine.
  • chlg_read() depends on chlg_load() in the background, which uses crs_obd to lookup llog_ctxt, which is tied to a specific OBD device. Storing a direct reference crs_llog_ctxt would avoid the crs_obd dereference, but this would still prevent the OBD device to be cleaned up, so is not better than getting a reference on obd_device directly in chlg_open(). Since chlg_load() is already getting a loop around the llog operations in patch https://review.whamcloud.com/35262 "LU-12553 mdc: polling mode for changelog reader" it isn't much harder to expand that loop a few more lines and include chlg_obd_get() and chlg_obd_put() on each loop.
Comment by Quentin Bouget [ 05/Aug/19 ]

Does RobinHood open and access the changelog-$fsname-MDTxxxx devices directly, or does it still use llapi_changelog_start() to open the char device?

It uses llapi_changelog_start().

Comment by Andreas Dilger [ 08/Aug/19 ]

It might make sense to also allow writing to the open file descriptor like "mdt:mdt_idx" to set the specific MDT to be used. That allows userspace scripts to potentially read and manage the changelog file, instead of having to find some way to call ioctl().

Comment by Gerrit Updater [ 13/Aug/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35784
Subject: LU-11626 mdc: hold obd while processing changelog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f866eac40cc761b960cc92907d1d5546b6e24e38

Comment by James A Simmons [ 15/Aug/19 ]

Patch 

https://review.whamcloud.com/35668 which landed resolved this. It would be nice to push a reproducer to the test suite from the information in this ticket. That was missing in the patch.

Comment by Hongchao Zhang [ 20/Aug/19 ]

Hi James,
The issue still exists on master branch with the patch https://review.whamcloud.com/35668.

[  106.920737] general protection fault: 0000 [#1] SMP 
[  106.927417] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) intel_powerclamp iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg parport_pc pcspkr parport i2c_piix4 video i2c_core ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod crc_t10dif crct10dif_generic pata_acpi ahci libahci ata_piix libata e1000 crct10dif_pclmul crct10dif_common crc32c_intel serio_raw dm_mirror dm_region_hash dm_log dm_mod
[  106.941934] CPU: 0 PID: 1287 Comm: bash Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
[  106.944006] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  106.945492] task: ffff8f61898b5ee0 ti: ffff8f61803e4000 task.ti: ffff8f61803e4000
[  106.946817] RIP: 0010:[<ffffffffc09ed510>]  [<ffffffffc09ed510>] chlg_write+0x450/0x6c0 [mdc]
[  106.948338] RSP: 0018:ffff8f61803e7e70  EFLAGS: 00010206
[  106.949300] RAX: ffff8f6047e741e0 RBX: ffff8f6171eeac00 RCX: 0000000000000000
[  106.950582] RDX: 0000000000000000 RSI: ffffffffc09f8d74 RDI: 0000000000000000
[  106.951766] RBP: ffff8f61803e7ec0 R08: ffff8f6171eeac0a R09: 0000000000000000
[  106.953063] R10: 000000000000000a R11: f000000000000000 R12: 000000000000000b
[  106.954279] R13: 00007f9075b49000 R14: 5a5a5a5a5a5a5a5a R15: 0000000000000000
[  106.955473] FS:  00007f9075b43740(0000) GS:ffff8f619fc00000(0000) knlGS:0000000000000000
[  106.956802] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  106.957723] CR2: 00005619854aa528 CR3: 0000000216e52000 CR4: 00000000000606f0
[  106.958892] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  106.960067] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  106.961216] Call Trace:
[  106.961613]  [<ffffffffac41b490>] vfs_write+0xc0/0x1f0
[  106.962440]  [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146
[  106.963499]  [<ffffffffac41c2bf>] SyS_write+0x7f/0xf0
[  106.964300]  [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146
[  106.965349]  [<ffffffffac920795>] system_call_fastpath+0x1c/0x21
[  106.966311]  [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146
[  106.967362] Code: 27 7e 01 00 ef 01 00 00 c7 05 21 7e 01 00 01 00 00 00 48 c7 05 1e 7e 01 00 00 00 00 00 e8 89 d2 90 ff 4d 85 f6 0f 84 33 01 00 00 <49> 8b 86 f8 00 00 00 48 85 c0 0f 84 a3 00 00 00 48 8b 10 48 85 
[  106.970783] RIP  [<ffffffffc09ed510>] chlg_write+0x450/0x6c0 [mdc]
[  106.971791]  RSP <ffff8f61803e7e70>
[    0.794662] mce: Unable to init device /dev/mcelog (rc: -5)
[    0.800919] Failed to execute /init
[    0.801546] Kernel panic - not syncing: No init found.  Try passing init= option to kernel. See Linux Documentation/init.txt for guidance.
[    0.803450] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-862.9.1.el7.x86_64 #1
[    0.804651] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[    0.805965] Call Trace:
[    0.806399]  [<ffffffff8170e84e>] dump_stack+0x19/0x1b
[    0.807277]  [<ffffffff81708b50>] panic+0xe8/0x21f
[    0.808041]  [<ffffffff8122ccdd>] ? putname+0x3d/0x60
[    0.808799]  [<ffffffff816fd780>] ? rest_init+0x80/0x80
[    0.809627]  [<ffffffff816fd862>] kernel_init+0xe2/0xf0
[    0.810741]  [<ffffffff817205f7>] ret_from_fork_nospec_begin+0x21/0x21
[    0.811744]  [<ffffffff816fd780>] ? rest_init+0x80/0x80
Comment by Gerrit Updater [ 16/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35784/
Subject: LU-11626 mdc: hold obd while processing changelog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d7bb6647cd4dd26949bceb6a099cd606623aff2b

Comment by Peter Jones [ 16/Sep/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 01/Oct/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36338
Subject: LU-11626 mdc: hold obd while processing changelog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 24935d83103d44dee84a16a7e304bb13180ec22d

Comment by James A Simmons [ 08/Oct/19 ]

Peter, any chance this can land for the 2.12.3 release

Comment by Peter Jones [ 08/Oct/19 ]

James

We tagged RC1 this morning so, while there's still a chance, it is looking unlikely as things stand

Peter

Comment by Gerrit Updater [ 21/Nov/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36338/
Subject: LU-11626 mdc: hold obd while processing changelog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 9d88030b8a532de3bec3f75d01ec108bfba97082

Generated at Sat Feb 10 02:45:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.