Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.13.0, Lustre 2.12.3
    • Lustre 2.10.6
    • None
    • 3
    • 9223372036854775807

    Description

      We have seen this at least 4 time.

      5426573.363714] Lustre: Mounted nbp16-client
      [5428374.627398] general protection fault: 0000 [#1] 
      [5428374.627407] Lustre: Unmounted nbp14-client
      [5428374.636811] SMP 
      [5428374.639106] 
      5428374.639307] Modules linked in: vtsspp(OEN) sep5(OEN) socperf3(OEN) pax(OEN) osc(OEN) mgc(OEN) lustre(OEN) lmv(OEN) fld(OEN) mdc(OEN) fid(OEN) lov(OEN) ko2iblnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) beegfs(OEN) rdma_ucm(OEX) ib_ucm(OEX) rdma_cm(OEX) iw_cm(OEX) configfs(E) ib_ipoib(OEX) inet_lro(E) ib_cm(OEX) ib_uverbs(OEX) ib_umad(OEX) mlx4_ib(OEX) ib_core(OEX) mlx4_core(OEX) devlink(E) mlx_compat(OEX) iscsi_ibft(E) iscsi_boot_sysfs(E) msr(E) joydev(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) drbg(E) ansi_cprng(E) ipmi_ssif(E) iTCO_wdt(E) iTCO_vendor_support(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) mgag200(E) ablk_helper(E) cryptd(E) ttm(E)
      [5428374.711255]  acpi_cpufreq(E) drm_kms_helper(E) pcspkr(E) drm(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) lpc_ich(E) mei_me(E) i2c_i801(E) mfd_core(E) mei(E) ioatdma(E) shpchp(E) ipmi_si(E) wmi(E) ipmi_devintf(E) ipmi_msghandler(E) processor(E) button(E) tcp_bic(EN) hwperf(OEX) numatools(OEX) xpmem(OEX) gru(OEX) xvma(OEX) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) autofs4(E) nfsv3(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) bridge(E) stp(E) llc(E) hid_generic(E) usbhid(E) ahci(E) libahci(E) ehci_pci(E) libata(E) ehci_hcd(E) igb(E) i2c_algo_bit(E) dca(E) ptp(E) scsi_mod(E) usbcore(E) pps_core(E) usb_common(E) af_packet(E) crc32c_intel(E) fjes(E) [last unloaded: socperf2_0]
      [5428374.776352] Supported: No, Unsupported modules are loaded
      [5428374.782187] CPU: 23 PID: 85345 Comm: umount Tainted: G           OE   NX 4.4.162-94.72.1.20181113-nasa #1
      [5428374.792175] Hardware name: SGI.COM ICE-XIP113/X9DRT-Dakota, BIOS DA0E2016 02/01/2016
      [5428374.800341] task: ffff88026ade1000 ti: ffff88026ade4000 task.ti: ffff88026ade4000
      [5428374.808253] RIP: 0010:[<ffffffffa07a47dd>]  [<ffffffffa07a47dd>] mdc_changelog_cdev_finish+0x3d/0x1b1 [mdc]
      [5428374.818437] RSP: 0018:ffff88026ade7b68  EFLAGS: 00010286
      [5428374.824175] RAX: 5a5a5a5a5a5a4b62 RBX: ffff88040e20e008 RCX: ffff88037b826fb0
      [5428374.831741] RDX: 5a5a5a5a5a5a5a5a RSI: ffff88037b826f40 RDI: ffff88040e20e008
      [5428374.839306] RBP: 0000000000000000 R08: 0000000000000c3a R09: 0000000000000000
      [5428374.846863] R10: 0000000000000000 R11: ffff8807c8d833c6 R12: 0000000000000000
      [5428374.854421] R13: ffff88040e20e048 R14: ffff880d1635f000 R15: ffff880cf81e6b60
      [5428374.861978] FS:  00007ffff7fd1880(0000) GS:ffff88085fb40000(0000) knlGS:0000000000000000
      [5428374.870489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [5428374.876661] CR2: 00007ffff7ff6000 CR3: 0000000371dbe000 CR4: 0000000000160670
      [5428374.884227] Stack:
      [5428374.886678]  ffff88040e20e008 0000000000000000 ffffffffa07904fa ffff88040e20e008
      [5428374.894566]  0000000000000000 0000000000000000 ffffffffa0b8bc9c ffff88026ade7bf8
      [5428374.902452]  ffffffffa0a6afb7 ffff880200000010 ffff88026ade7c08 ffff88026ade7bc8
      [5428374.910337] Call Trace:
      [5428374.913246]  [<ffffffffa07904fa>] mdc_precleanup+0x2a/0x3f0 [mdc]
      [5428374.919816]  [<ffffffffa0b8bc9c>] class_cleanup+0x26c/0xc40 [obdclass]
      [5428374.926811]  [<ffffffffa0b8e5ba>] class_process_config+0x190a/0x2360 [obdclass]
      [5428374.934582]  [<ffffffffa0b8f1ba>] class_manual_cleanup+0x1aa/0x6a0 [obdclass]
      [5428374.942177]  [<ffffffffa0f6f341>] ll_put_super+0x111/0x9f0 [lustre]
      [5428374.948881]  [<ffffffff81212a1c>] generic_shutdown_super+0x6c/0xf0
      [5428374.955497]  [<ffffffff81212aae>] kill_anon_super+0xe/0x20
      [5428374.961416]  [<ffffffff8121236f>] deactivate_locked_super+0x3f/0x70
      [5428374.968117]  [<ffffffff8122da1b>] cleanup_mnt+0x3b/0x80
      [5428374.973775]  [<ffffffff8109f718>] task_work_run+0x78/0x90
      [5428374.979609]  [<ffffffff8107d3cf>] exit_to_usermode_loop+0x91/0xc2
      [5428374.986136]  [<ffffffff81003ae5>] syscall_return_slowpath+0x85/0xa0
      [5428374.992837]  [<ffffffff8161dfec>] int_ret_from_sys_call+0x8/0x6d
      [5428375.002321] DWARF2 unwinder stuck at int_ret_from_sys_call+0x8/0x6d
      [5428375.009019] 
      [5428375.010951] Leftover inexact backtrace:
                       
      [5428375.017130] Code: 3d 90 21 7b a0 48 8d b0 78 ff ff ff 0f 84 d0 00 00 00 48 8b 56 70 48 8d 4e 70 48 39 d1 48 8d 82 08 f1 ff ff 75 1c e9 9d 00 00 00 <48> 8b 90 f8 0e 00 00 48 39 d1 48 8d 82 08 f1 ff ff 0f 84 86 00 
      [5428375.037514] RIP  [<ffffffffa07a47dd>] mdc_changelog_cdev_finish+0x3d/0x1b1 [mdc]
      [5428375.045359]  RSP <ffff88026ade7b68>
      

      Attachments

        Issue Links

          Activity

            [LU-12566] GPF when umounting client

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35835/
            Subject: LU-12566 mdc: hold lock while walking changelog dev list
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: b574f5f88878167b846a90ab09ac3b98958d52e4

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35835/ Subject: LU-12566 mdc: hold lock while walking changelog dev list Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: b574f5f88878167b846a90ab09ac3b98958d52e4

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35835
            Subject: LU-12566 mdc: hold lock while walking changelog dev list
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 633905829fe733b1557fa3266e0b638674eca1b3

            gerrit Gerrit Updater added a comment - Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35835 Subject: LU-12566 mdc: hold lock while walking changelog dev list Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 633905829fe733b1557fa3266e0b638674eca1b3
            pjones Peter Jones added a comment -

            Landed for 2.13

            pjones Peter Jones added a comment - Landed for 2.13

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35668/
            Subject: LU-12566 mdc: hold lock while walking changelog dev list
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a260c530801db7f58efa93b774f06b0ce72649a3

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35668/ Subject: LU-12566 mdc: hold lock while walking changelog dev list Project: fs/lustre-release Branch: master Current Patch Set: Commit: a260c530801db7f58efa93b774f06b0ce72649a3

            I think the problem actually relates to the automount. The first lines of your posting are:

            [5426573.363714] Lustre: Mounted nbp16-client
            [5428374.627398] general protection fault: 0000 [#1] 
            [5428374.627407] Lustre: Unmounted nbp14-client
            

            so two different filesystems are mounting and unmounting at the same time. There was some unsafe list access in the unmount code that would cause the problem you hit if the list is being changed by another client mounting/unmounting at the same time.

            The patch is very simple and I verified it applies cleanly to the b2_10 code. It has just starting testing, but should hopefully be ready for your use by tomorrow unless there is some problem found in the patch.

            adilger Andreas Dilger added a comment - I think the problem actually relates to the automount. The first lines of your posting are: [5426573.363714] Lustre: Mounted nbp16-client [5428374.627398] general protection fault: 0000 [#1] [5428374.627407] Lustre: Unmounted nbp14-client so two different filesystems are mounting and unmounting at the same time. There was some unsafe list access in the unmount code that would cause the problem you hit if the list is being changed by another client mounting/unmounting at the same time. The patch is very simple and I verified it applies cleanly to the b2_10 code. It has just starting testing, but should hopefully be ready for your use by tomorrow unless there is some problem found in the patch.

            At max a client can have up to 11 filesystems and 14 MDTs.
            We don't have subdirectory mounts.
            We do run automounter, but we have been running automounter for some time, with 2.7 and 2.10 clients. We only recently started seeing these crashes. When I open this case we saw several in one day but since then we only see 1 every day.

            mhanafi Mahmoud Hanafi added a comment - At max a client can have up to 11 filesystems and 14 MDTs. We don't have subdirectory mounts. We do run automounter, but we have been running automounter for some time, with 2.7 and 2.10 clients. We only recently started seeing these crashes. When I open this case we saw several in one day but since then we only see 1 every day.

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35668
            Subject: LU-12566 mdc: hold lock while walking changelog dev list
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9b300db7ae6df0166dd5806d7be0cce1fb5aff4b

            gerrit Gerrit Updater added a comment - Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35668 Subject: LU-12566 mdc: hold lock while walking changelog dev list Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 9b300db7ae6df0166dd5806d7be0cce1fb5aff4b

            Reopen this ticket, since I think there is a simpler fix that could be used for the short term.

            adilger Andreas Dilger added a comment - Reopen this ticket, since I think there is a simpler fix that could be used for the short term.

            It looks like mdc_changelog_cdev_finish() is walking the chlg_registered_dev and cl_chg_dev_linkage lists without a lock, which is racy with other threads adding/removing entries from the list:

            void mdc_changelog_cdev_finish(struct obd_device *obd)
            {
                    struct chlg_registered_dev *dev = chlg_registered_dev_find_by_obd(obd);
                    ENTRY;
            
                    mutex_lock(&chlg_registered_dev_lock);
                    list_del_init(&obd->u.cli.cl_chg_dev_linkage);
                    kref_put(&dev->ced_refs, chlg_dev_clear);
                    mutex_unlock(&chlg_registered_dev_lock);
                    EXIT;
            }
            

            A simple fix for this particular issue might be to move the fetching of dev under chlg_registered_dev_lock. It would also be a good idea to add LASSERT(mutex_is_locked(&chlg_registered_dev_lock)) to the start of chlg_registered_dev_find_by_obd() and chlg_registered_dev_find_by_obd() to avoid such bugs in the future.

            This doesn't fix the larger bug in LU-11626 but may at least avoid the particular issue being seen here, and be simple enough to backport to older releases.

            adilger Andreas Dilger added a comment - It looks like mdc_changelog_cdev_finish() is walking the chlg_registered_dev and cl_chg_dev_linkage lists without a lock, which is racy with other threads adding/removing entries from the list: void mdc_changelog_cdev_finish(struct obd_device *obd) { struct chlg_registered_dev *dev = chlg_registered_dev_find_by_obd(obd); ENTRY; mutex_lock(&chlg_registered_dev_lock); list_del_init(&obd->u.cli.cl_chg_dev_linkage); kref_put(&dev->ced_refs, chlg_dev_clear); mutex_unlock(&chlg_registered_dev_lock); EXIT; } A simple fix for this particular issue might be to move the fetching of dev under chlg_registered_dev_lock . It would also be a good idea to add LASSERT(mutex_is_locked(&chlg_registered_dev_lock)) to the start of chlg_registered_dev_find_by_obd() and chlg_registered_dev_find_by_obd() to avoid such bugs in the future. This doesn't fix the larger bug in LU-11626 but may at least avoid the particular issue being seen here, and be simple enough to backport to older releases.

            How many filesystems are being mounted on this client, and what is the total number of MDTs being mounted? I'm wondering if this relates to LU-12506, which causes the client mount to fail if there are "too many" (in the range 56-64) different MDC devices registered on a client, and then there is some kind of bug during the unmount?

            Or do you have something like automount active and there are filesystems mounting/unmounting regularly (maybe with subdirectory mounts?) and there is a race where the same filesystem is mounted multiple times and unmounted?

            adilger Andreas Dilger added a comment - How many filesystems are being mounted on this client, and what is the total number of MDTs being mounted? I'm wondering if this relates to LU-12506 , which causes the client mount to fail if there are "too many" (in the range 56-64) different MDC devices registered on a client, and then there is some kind of bug during the unmount? Or do you have something like automount active and there are filesystems mounting/unmounting regularly (maybe with subdirectory mounts?) and there is a race where the same filesystem is mounted multiple times and unmounted?

            People

              hongchao.zhang Hongchao Zhang
              mhanafi Mahmoud Hanafi
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: