Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11626

mdc: obd might go away while referenced by code in mdc_changelog

Details

    • 3
    • 9223372036854775807

    Description

      In lustre/mdc/mdc_changelog.c:

      /**
       * Find the OBD device associated to a changelog character device.
       * @param[in]  cdev  character device instance descriptor
       * @return corresponding OBD device or NULL if none was found.
       */
      static struct obd_device *chlg_obd_get(dev_t cdev)
      {
      	int minor = MINOR(cdev);
      	struct obd_device *obd = NULL;
      	struct chlg_registered_dev *curr;
      
      	mutex_lock(&chlg_registered_dev_lock);
      	list_for_each_entry(curr, &chlg_registered_devices, ced_link) {
      		if (curr->ced_misc.minor == minor) {
      			/* take the first available OBD device attached */
      			obd = list_first_entry(&curr->ced_obds,
      					       struct obd_device,
      					       u.cli.cl_chg_dev_linkage);
      			break;
      		}
      	}
      	mutex_unlock(&chlg_registered_dev_lock);
      	return obd;
      }
      

      The "take the first available OBD device attached" approach is broken as that OBD might go away while the changelog is being read.

      Here is an example of how things can go wrong:

      • mount lustre at /mnt/lustre;
      • mount the same lustre filesystem at /mnt/lustre2;
      • open /dev/changelog-lustre-MDT0000 in writing mode;
      • unmount /mnt/lustre;
      • write 'clear:cl0:0' to the opened chardevice.

      That triggers an LBUG.

      Attachments

        Issue Links

          Activity

            [LU-11626] mdc: obd might go away while referenced by code in mdc_changelog
            pjones Peter Jones added a comment -

            James

            We tagged RC1 this morning so, while there's still a chance, it is looking unlikely as things stand

            Peter

            pjones Peter Jones added a comment - James We tagged RC1 this morning so, while there's still a chance, it is looking unlikely as things stand Peter

            Peter, any chance this can land for the 2.12.3 release

            simmonsja James A Simmons added a comment - Peter, any chance this can land for the 2.12.3 release

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36338
            Subject: LU-11626 mdc: hold obd while processing changelog
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 24935d83103d44dee84a16a7e304bb13180ec22d

            gerrit Gerrit Updater added a comment - Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36338 Subject: LU-11626 mdc: hold obd while processing changelog Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 24935d83103d44dee84a16a7e304bb13180ec22d
            pjones Peter Jones added a comment -

            Landed for 2.13

            pjones Peter Jones added a comment - Landed for 2.13

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35784/
            Subject: LU-11626 mdc: hold obd while processing changelog
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d7bb6647cd4dd26949bceb6a099cd606623aff2b

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35784/ Subject: LU-11626 mdc: hold obd while processing changelog Project: fs/lustre-release Branch: master Current Patch Set: Commit: d7bb6647cd4dd26949bceb6a099cd606623aff2b

            Hi James,
            The issue still exists on master branch with the patch https://review.whamcloud.com/35668.

            [  106.920737] general protection fault: 0000 [#1] SMP 
            [  106.927417] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) intel_powerclamp iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg parport_pc pcspkr parport i2c_piix4 video i2c_core ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod crc_t10dif crct10dif_generic pata_acpi ahci libahci ata_piix libata e1000 crct10dif_pclmul crct10dif_common crc32c_intel serio_raw dm_mirror dm_region_hash dm_log dm_mod
            [  106.941934] CPU: 0 PID: 1287 Comm: bash Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
            [  106.944006] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
            [  106.945492] task: ffff8f61898b5ee0 ti: ffff8f61803e4000 task.ti: ffff8f61803e4000
            [  106.946817] RIP: 0010:[<ffffffffc09ed510>]  [<ffffffffc09ed510>] chlg_write+0x450/0x6c0 [mdc]
            [  106.948338] RSP: 0018:ffff8f61803e7e70  EFLAGS: 00010206
            [  106.949300] RAX: ffff8f6047e741e0 RBX: ffff8f6171eeac00 RCX: 0000000000000000
            [  106.950582] RDX: 0000000000000000 RSI: ffffffffc09f8d74 RDI: 0000000000000000
            [  106.951766] RBP: ffff8f61803e7ec0 R08: ffff8f6171eeac0a R09: 0000000000000000
            [  106.953063] R10: 000000000000000a R11: f000000000000000 R12: 000000000000000b
            [  106.954279] R13: 00007f9075b49000 R14: 5a5a5a5a5a5a5a5a R15: 0000000000000000
            [  106.955473] FS:  00007f9075b43740(0000) GS:ffff8f619fc00000(0000) knlGS:0000000000000000
            [  106.956802] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  106.957723] CR2: 00005619854aa528 CR3: 0000000216e52000 CR4: 00000000000606f0
            [  106.958892] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [  106.960067] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [  106.961216] Call Trace:
            [  106.961613]  [<ffffffffac41b490>] vfs_write+0xc0/0x1f0
            [  106.962440]  [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146
            [  106.963499]  [<ffffffffac41c2bf>] SyS_write+0x7f/0xf0
            [  106.964300]  [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146
            [  106.965349]  [<ffffffffac920795>] system_call_fastpath+0x1c/0x21
            [  106.966311]  [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146
            [  106.967362] Code: 27 7e 01 00 ef 01 00 00 c7 05 21 7e 01 00 01 00 00 00 48 c7 05 1e 7e 01 00 00 00 00 00 e8 89 d2 90 ff 4d 85 f6 0f 84 33 01 00 00 <49> 8b 86 f8 00 00 00 48 85 c0 0f 84 a3 00 00 00 48 8b 10 48 85 
            [  106.970783] RIP  [<ffffffffc09ed510>] chlg_write+0x450/0x6c0 [mdc]
            [  106.971791]  RSP <ffff8f61803e7e70>
            [    0.794662] mce: Unable to init device /dev/mcelog (rc: -5)
            [    0.800919] Failed to execute /init
            [    0.801546] Kernel panic - not syncing: No init found.  Try passing init= option to kernel. See Linux Documentation/init.txt for guidance.
            [    0.803450] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-862.9.1.el7.x86_64 #1
            [    0.804651] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
            [    0.805965] Call Trace:
            [    0.806399]  [<ffffffff8170e84e>] dump_stack+0x19/0x1b
            [    0.807277]  [<ffffffff81708b50>] panic+0xe8/0x21f
            [    0.808041]  [<ffffffff8122ccdd>] ? putname+0x3d/0x60
            [    0.808799]  [<ffffffff816fd780>] ? rest_init+0x80/0x80
            [    0.809627]  [<ffffffff816fd862>] kernel_init+0xe2/0xf0
            [    0.810741]  [<ffffffff817205f7>] ret_from_fork_nospec_begin+0x21/0x21
            [    0.811744]  [<ffffffff816fd780>] ? rest_init+0x80/0x80
            
            hongchao.zhang Hongchao Zhang added a comment - Hi James, The issue still exists on master branch with the patch https://review.whamcloud.com/35668 . [ 106.920737] general protection fault: 0000 [#1] SMP [ 106.927417] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) intel_powerclamp iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg parport_pc pcspkr parport i2c_piix4 video i2c_core ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod crc_t10dif crct10dif_generic pata_acpi ahci libahci ata_piix libata e1000 crct10dif_pclmul crct10dif_common crc32c_intel serio_raw dm_mirror dm_region_hash dm_log dm_mod [ 106.941934] CPU: 0 PID: 1287 Comm: bash Kdump: loaded Tainted: P OE ------------ 3.10.0-862.9.1.el7.x86_64 #1 [ 106.944006] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 106.945492] task: ffff8f61898b5ee0 ti: ffff8f61803e4000 task.ti: ffff8f61803e4000 [ 106.946817] RIP: 0010:[<ffffffffc09ed510>] [<ffffffffc09ed510>] chlg_write+0x450/0x6c0 [mdc] [ 106.948338] RSP: 0018:ffff8f61803e7e70 EFLAGS: 00010206 [ 106.949300] RAX: ffff8f6047e741e0 RBX: ffff8f6171eeac00 RCX: 0000000000000000 [ 106.950582] RDX: 0000000000000000 RSI: ffffffffc09f8d74 RDI: 0000000000000000 [ 106.951766] RBP: ffff8f61803e7ec0 R08: ffff8f6171eeac0a R09: 0000000000000000 [ 106.953063] R10: 000000000000000a R11: f000000000000000 R12: 000000000000000b [ 106.954279] R13: 00007f9075b49000 R14: 5a5a5a5a5a5a5a5a R15: 0000000000000000 [ 106.955473] FS: 00007f9075b43740(0000) GS:ffff8f619fc00000(0000) knlGS:0000000000000000 [ 106.956802] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.957723] CR2: 00005619854aa528 CR3: 0000000216e52000 CR4: 00000000000606f0 [ 106.958892] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 106.960067] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 106.961216] Call Trace: [ 106.961613] [<ffffffffac41b490>] vfs_write+0xc0/0x1f0 [ 106.962440] [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146 [ 106.963499] [<ffffffffac41c2bf>] SyS_write+0x7f/0xf0 [ 106.964300] [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146 [ 106.965349] [<ffffffffac920795>] system_call_fastpath+0x1c/0x21 [ 106.966311] [<ffffffffac9206e1>] ? system_call_after_swapgs+0xae/0x146 [ 106.967362] Code: 27 7e 01 00 ef 01 00 00 c7 05 21 7e 01 00 01 00 00 00 48 c7 05 1e 7e 01 00 00 00 00 00 e8 89 d2 90 ff 4d 85 f6 0f 84 33 01 00 00 <49> 8b 86 f8 00 00 00 48 85 c0 0f 84 a3 00 00 00 48 8b 10 48 85 [ 106.970783] RIP [<ffffffffc09ed510>] chlg_write+0x450/0x6c0 [mdc] [ 106.971791] RSP <ffff8f61803e7e70> [ 0.794662] mce: Unable to init device /dev/mcelog (rc: -5) [ 0.800919] Failed to execute /init [ 0.801546] Kernel panic - not syncing: No init found. Try passing init= option to kernel. See Linux Documentation/init.txt for guidance. [ 0.803450] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-862.9.1.el7.x86_64 #1 [ 0.804651] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 0.805965] Call Trace: [ 0.806399] [<ffffffff8170e84e>] dump_stack+0x19/0x1b [ 0.807277] [<ffffffff81708b50>] panic+0xe8/0x21f [ 0.808041] [<ffffffff8122ccdd>] ? putname+0x3d/0x60 [ 0.808799] [<ffffffff816fd780>] ? rest_init+0x80/0x80 [ 0.809627] [<ffffffff816fd862>] kernel_init+0xe2/0xf0 [ 0.810741] [<ffffffff817205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 0.811744] [<ffffffff816fd780>] ? rest_init+0x80/0x80
            simmonsja James A Simmons added a comment - - edited

            Patch 

            https://review.whamcloud.com/35668 which landed resolved this. It would be nice to push a reproducer to the test suite from the information in this ticket. That was missing in the patch.

            simmonsja James A Simmons added a comment - - edited Patch  https://review.whamcloud.com/35668 which landed resolved this. It would be nice to push a reproducer to the test suite from the information in this ticket. That was missing in the patch.

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35784
            Subject: LU-11626 mdc: hold obd while processing changelog
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f866eac40cc761b960cc92907d1d5546b6e24e38

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35784 Subject: LU-11626 mdc: hold obd while processing changelog Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: f866eac40cc761b960cc92907d1d5546b6e24e38

            It might make sense to also allow writing to the open file descriptor like "mdt:mdt_idx" to set the specific MDT to be used. That allows userspace scripts to potentially read and manage the changelog file, instead of having to find some way to call ioctl().

            adilger Andreas Dilger added a comment - It might make sense to also allow writing to the open file descriptor like " mdt:mdt_idx " to set the specific MDT to be used. That allows userspace scripts to potentially read and manage the changelog file, instead of having to find some way to call ioctl() .

            Does RobinHood open and access the changelog-$fsname-MDTxxxx devices directly, or does it still use llapi_changelog_start() to open the char device?

            It uses llapi_changelog_start().

            bougetq Quentin Bouget (Inactive) added a comment - Does RobinHood open and access the changelog-$fsname-MDTxxxx devices directly, or does it still use llapi_changelog_start() to open the char device? It uses llapi_changelog_start() .
            adilger Andreas Dilger added a comment - - edited

            Actually, having looked at this code in more detail, it seems that the obvious first step would be to either get a device reference between chlg_open() and chlg_close(), or a device reference each chlg_load() and chlg_write().

            For both cases:

            • have chlg_obd_get() actually return a reference class_incref(obd) so that it is not a dangling pointer to a structure that might be freed. There should be a chlg_obd_put() that drops the reference using class_decref(crs_obd).

            For open-to-close reference:

            • the reference on crs_obd should be gotten in chlg_open() and dropped in chlg_release(). One drawback is that an open Changelog reader would leave a danging OBD device configured when the client is unmounted, though "lsof" would at least show the "/dev/changelog-$fsname-MDTxxxx" device as open, and hopefully the cleanup would still work properly when the device is closed after the client is unmounted.

            For read/write references:

            • obd_device would be looked up each time via chlg_obd_get(), by saving crs_rdev = i_rdev in chlg_open() instead of crs_obd, maybe after validating that the device actually exists. This adds some overhead, but may allow the Changelog code to be more robust, as the client could be unmounted/mounted cleanly if the changelog consumer is not actively using the device.
            • chlg_write->chlg_clear() is not as performance critical as the chlg_read() path, so doing a lookup each time is probably fine.
            • chlg_read() depends on chlg_load() in the background, which uses crs_obd to lookup llog_ctxt, which is tied to a specific OBD device. Storing a direct reference crs_llog_ctxt would avoid the crs_obd dereference, but this would still prevent the OBD device to be cleaned up, so is not better than getting a reference on obd_device directly in chlg_open(). Since chlg_load() is already getting a loop around the llog operations in patch https://review.whamcloud.com/35262 "LU-12553 mdc: polling mode for changelog reader" it isn't much harder to expand that loop a few more lines and include chlg_obd_get() and chlg_obd_put() on each loop.
            adilger Andreas Dilger added a comment - - edited Actually, having looked at this code in more detail, it seems that the obvious first step would be to either get a device reference between chlg_open() and chlg_close() , or a device reference each chlg_load() and chlg_write() . For both cases: have chlg_obd_get() actually return a reference class_incref(obd) so that it is not a dangling pointer to a structure that might be freed. There should be a chlg_obd_put() that drops the reference using class_decref(crs_obd) . For open-to-close reference: the reference on crs_obd should be gotten in chlg_open() and dropped in chlg_release() . One drawback is that an open Changelog reader would leave a danging OBD device configured when the client is unmounted, though " lsof " would at least show the " /dev/changelog-$fsname-MDTxxxx " device as open, and hopefully the cleanup would still work properly when the device is closed after the client is unmounted. For read/write references: obd_device would be looked up each time via chlg_obd_get() , by saving crs_rdev = i_rdev in chlg_open() instead of crs_obd , maybe after validating that the device actually exists. This adds some overhead, but may allow the Changelog code to be more robust, as the client could be unmounted/mounted cleanly if the changelog consumer is not actively using the device. chlg_write->chlg_clear() is not as performance critical as the chlg_read() path, so doing a lookup each time is probably fine. chlg_read() depends on chlg_load() in the background, which uses crs_obd to lookup llog_ctxt , which is tied to a specific OBD device. Storing a direct reference crs_llog_ctxt would avoid the crs_obd dereference, but this would still prevent the OBD device to be cleaned up, so is not better than getting a reference on obd_device directly in chlg_open() . Since chlg_load() is already getting a loop around the llog operations in patch https://review.whamcloud.com/35262 " LU-12553 mdc: polling mode for changelog reader " it isn't much harder to expand that loop a few more lines and include chlg_obd_get() and chlg_obd_put() on each loop.

            People

              hongchao.zhang Hongchao Zhang
              cealustre CEA
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: