[LU-12566] GPF when umounting client Created: 18/Jul/19  Updated: 12/Sep/19  Resolved: 15/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.6
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Attachments: File mdc_changelog_cdev_finish.dis     File obd_struct.out    
Issue Links:
Duplicate
duplicates LU-11626 mdc: obd might go away while referenc... Resolved
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have seen this at least 4 time.

5426573.363714] Lustre: Mounted nbp16-client
[5428374.627398] general protection fault: 0000 [#1] 
[5428374.627407] Lustre: Unmounted nbp14-client
[5428374.636811] SMP 
[5428374.639106] 
5428374.639307] Modules linked in: vtsspp(OEN) sep5(OEN) socperf3(OEN) pax(OEN) osc(OEN) mgc(OEN) lustre(OEN) lmv(OEN) fld(OEN) mdc(OEN) fid(OEN) lov(OEN) ko2iblnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) beegfs(OEN) rdma_ucm(OEX) ib_ucm(OEX) rdma_cm(OEX) iw_cm(OEX) configfs(E) ib_ipoib(OEX) inet_lro(E) ib_cm(OEX) ib_uverbs(OEX) ib_umad(OEX) mlx4_ib(OEX) ib_core(OEX) mlx4_core(OEX) devlink(E) mlx_compat(OEX) iscsi_ibft(E) iscsi_boot_sysfs(E) msr(E) joydev(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) drbg(E) ansi_cprng(E) ipmi_ssif(E) iTCO_wdt(E) iTCO_vendor_support(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) mgag200(E) ablk_helper(E) cryptd(E) ttm(E)
[5428374.711255]  acpi_cpufreq(E) drm_kms_helper(E) pcspkr(E) drm(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) lpc_ich(E) mei_me(E) i2c_i801(E) mfd_core(E) mei(E) ioatdma(E) shpchp(E) ipmi_si(E) wmi(E) ipmi_devintf(E) ipmi_msghandler(E) processor(E) button(E) tcp_bic(EN) hwperf(OEX) numatools(OEX) xpmem(OEX) gru(OEX) xvma(OEX) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) autofs4(E) nfsv3(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) bridge(E) stp(E) llc(E) hid_generic(E) usbhid(E) ahci(E) libahci(E) ehci_pci(E) libata(E) ehci_hcd(E) igb(E) i2c_algo_bit(E) dca(E) ptp(E) scsi_mod(E) usbcore(E) pps_core(E) usb_common(E) af_packet(E) crc32c_intel(E) fjes(E) [last unloaded: socperf2_0]
[5428374.776352] Supported: No, Unsupported modules are loaded
[5428374.782187] CPU: 23 PID: 85345 Comm: umount Tainted: G           OE   NX 4.4.162-94.72.1.20181113-nasa #1
[5428374.792175] Hardware name: SGI.COM ICE-XIP113/X9DRT-Dakota, BIOS DA0E2016 02/01/2016
[5428374.800341] task: ffff88026ade1000 ti: ffff88026ade4000 task.ti: ffff88026ade4000
[5428374.808253] RIP: 0010:[<ffffffffa07a47dd>]  [<ffffffffa07a47dd>] mdc_changelog_cdev_finish+0x3d/0x1b1 [mdc]
[5428374.818437] RSP: 0018:ffff88026ade7b68  EFLAGS: 00010286
[5428374.824175] RAX: 5a5a5a5a5a5a4b62 RBX: ffff88040e20e008 RCX: ffff88037b826fb0
[5428374.831741] RDX: 5a5a5a5a5a5a5a5a RSI: ffff88037b826f40 RDI: ffff88040e20e008
[5428374.839306] RBP: 0000000000000000 R08: 0000000000000c3a R09: 0000000000000000
[5428374.846863] R10: 0000000000000000 R11: ffff8807c8d833c6 R12: 0000000000000000
[5428374.854421] R13: ffff88040e20e048 R14: ffff880d1635f000 R15: ffff880cf81e6b60
[5428374.861978] FS:  00007ffff7fd1880(0000) GS:ffff88085fb40000(0000) knlGS:0000000000000000
[5428374.870489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[5428374.876661] CR2: 00007ffff7ff6000 CR3: 0000000371dbe000 CR4: 0000000000160670
[5428374.884227] Stack:
[5428374.886678]  ffff88040e20e008 0000000000000000 ffffffffa07904fa ffff88040e20e008
[5428374.894566]  0000000000000000 0000000000000000 ffffffffa0b8bc9c ffff88026ade7bf8
[5428374.902452]  ffffffffa0a6afb7 ffff880200000010 ffff88026ade7c08 ffff88026ade7bc8
[5428374.910337] Call Trace:
[5428374.913246]  [<ffffffffa07904fa>] mdc_precleanup+0x2a/0x3f0 [mdc]
[5428374.919816]  [<ffffffffa0b8bc9c>] class_cleanup+0x26c/0xc40 [obdclass]
[5428374.926811]  [<ffffffffa0b8e5ba>] class_process_config+0x190a/0x2360 [obdclass]
[5428374.934582]  [<ffffffffa0b8f1ba>] class_manual_cleanup+0x1aa/0x6a0 [obdclass]
[5428374.942177]  [<ffffffffa0f6f341>] ll_put_super+0x111/0x9f0 [lustre]
[5428374.948881]  [<ffffffff81212a1c>] generic_shutdown_super+0x6c/0xf0
[5428374.955497]  [<ffffffff81212aae>] kill_anon_super+0xe/0x20
[5428374.961416]  [<ffffffff8121236f>] deactivate_locked_super+0x3f/0x70
[5428374.968117]  [<ffffffff8122da1b>] cleanup_mnt+0x3b/0x80
[5428374.973775]  [<ffffffff8109f718>] task_work_run+0x78/0x90
[5428374.979609]  [<ffffffff8107d3cf>] exit_to_usermode_loop+0x91/0xc2
[5428374.986136]  [<ffffffff81003ae5>] syscall_return_slowpath+0x85/0xa0
[5428374.992837]  [<ffffffff8161dfec>] int_ret_from_sys_call+0x8/0x6d
[5428375.002321] DWARF2 unwinder stuck at int_ret_from_sys_call+0x8/0x6d
[5428375.009019] 
[5428375.010951] Leftover inexact backtrace:
                 
[5428375.017130] Code: 3d 90 21 7b a0 48 8d b0 78 ff ff ff 0f 84 d0 00 00 00 48 8b 56 70 48 8d 4e 70 48 39 d1 48 8d 82 08 f1 ff ff 75 1c e9 9d 00 00 00 <48> 8b 90 f8 0e 00 00 48 39 d1 48 8d 82 08 f1 ff ff 0f 84 86 00 
[5428375.037514] RIP  [<ffffffffa07a47dd>] mdc_changelog_cdev_finish+0x3d/0x1b1 [mdc]
[5428375.045359]  RSP <ffff88026ade7b68>


 Comments   
Comment by Patrick Farrell (Inactive) [ 18/Jul/19 ]

Mahmoud,

It would be very helpful if you provide the actual crash dump, modules, and vmlinux as well.

Comment by Patrick Farrell (Inactive) [ 18/Jul/19 ]

Since you provided the disassembly and the obd_struct, I thought I'd add...

Looking at the disassembly, the crash occurs before the mutex is taken.  (This is a bit tricky to figure out because of all the jumps, but it's relatively easy to see if you just assume all of the conditional jumps starting at function entry are not taken - which would be valid, as they're conditionals - and then you can see a pretty straightforward path, running through 0xffffffffa07a47dd and then on to calling mutex_lock.)

That means it happened in:
chlg_registered_dev_find_by_obd

(Which doesn't appear because it's been inlined in to mdc_changelog_cdev_finish)

Which means dumping the obd struct isn't interesting, because the null pointer wasn't in there.  (You can confirm this by checking the one pointer in the obd struct we access - u.cli.cl_chg_dev_linkage.  It's fine.)

That means the interesting items are those accessed by chlg_registered_dev_find_by_obd, like:

chlg_registered_devices

But really, it's the lists in there that are interesting.

Comment by James A Simmons [ 18/Jul/19 ]

This might be https://jira.whamcloud.com/browse/LU-11626

Comment by Patrick Farrell (Inactive) [ 18/Jul/19 ]

It's worth noting that I am utterly unwilling to try to figure out how the nested loop walking translates to assembly, given the compiler translations, but...

Looking at the access:

0xef8(%rax),%rdx 

It's at an offset of 0xef8, which is 3832, which must be in to the OBD struct (the other one is way too small).

So this is likely u.cli.cl_chg_dev_linkage for a particular OBD struct (That's about the right distance in to the OBD struct, based on eyeballing it.  If I had the dump & modules I could check).  But it's presumably the linkage of a different OBD struct.

Comment by Patrick Farrell (Inactive) [ 18/Jul/19 ]

Ah, yeah, James, I suspect you are correct - That matches up nicely with the unmount here.

Did Neil ever get anywhere with that one?

Comment by James A Simmons [ 18/Jul/19 ]

No, never was seen before so no one looked to fixed it

Comment by James A Simmons [ 19/Jul/19 ]

Okay I'm looking into it. Please follow under ticket LU-11626

Comment by Peter Jones [ 19/Jul/19 ]

Duplicate of LU-11626

Comment by Andreas Dilger [ 01/Aug/19 ]

How many filesystems are being mounted on this client, and what is the total number of MDTs being mounted? I'm wondering if this relates to LU-12506, which causes the client mount to fail if there are "too many" (in the range 56-64) different MDC devices registered on a client, and then there is some kind of bug during the unmount?

Or do you have something like automount active and there are filesystems mounting/unmounting regularly (maybe with subdirectory mounts?) and there is a race where the same filesystem is mounted multiple times and unmounted?

Comment by Andreas Dilger [ 01/Aug/19 ]

It looks like mdc_changelog_cdev_finish() is walking the chlg_registered_dev and cl_chg_dev_linkage lists without a lock, which is racy with other threads adding/removing entries from the list:

void mdc_changelog_cdev_finish(struct obd_device *obd)
{
        struct chlg_registered_dev *dev = chlg_registered_dev_find_by_obd(obd);
        ENTRY;

        mutex_lock(&chlg_registered_dev_lock);
        list_del_init(&obd->u.cli.cl_chg_dev_linkage);
        kref_put(&dev->ced_refs, chlg_dev_clear);
        mutex_unlock(&chlg_registered_dev_lock);
        EXIT;
}

A simple fix for this particular issue might be to move the fetching of dev under chlg_registered_dev_lock. It would also be a good idea to add LASSERT(mutex_is_locked(&chlg_registered_dev_lock)) to the start of chlg_registered_dev_find_by_obd() and chlg_registered_dev_find_by_obd() to avoid such bugs in the future.

This doesn't fix the larger bug in LU-11626 but may at least avoid the particular issue being seen here, and be simple enough to backport to older releases.

Comment by Andreas Dilger [ 01/Aug/19 ]

Reopen this ticket, since I think there is a simpler fix that could be used for the short term.

Comment by Gerrit Updater [ 01/Aug/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35668
Subject: LU-12566 mdc: hold lock while walking changelog dev list
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9b300db7ae6df0166dd5806d7be0cce1fb5aff4b

Comment by Mahmoud Hanafi [ 01/Aug/19 ]

At max a client can have up to 11 filesystems and 14 MDTs.
We don't have subdirectory mounts.
We do run automounter, but we have been running automounter for some time, with 2.7 and 2.10 clients. We only recently started seeing these crashes. When I open this case we saw several in one day but since then we only see 1 every day.

Comment by Andreas Dilger [ 01/Aug/19 ]

I think the problem actually relates to the automount. The first lines of your posting are:

[5426573.363714] Lustre: Mounted nbp16-client
[5428374.627398] general protection fault: 0000 [#1] 
[5428374.627407] Lustre: Unmounted nbp14-client

so two different filesystems are mounting and unmounting at the same time. There was some unsafe list access in the unmount code that would cause the problem you hit if the list is being changed by another client mounting/unmounting at the same time.

The patch is very simple and I verified it applies cleanly to the b2_10 code. It has just starting testing, but should hopefully be ready for your use by tomorrow unless there is some problem found in the patch.

Comment by Gerrit Updater [ 15/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35668/
Subject: LU-12566 mdc: hold lock while walking changelog dev list
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a260c530801db7f58efa93b774f06b0ce72649a3

Comment by Peter Jones [ 15/Aug/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 19/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35835
Subject: LU-12566 mdc: hold lock while walking changelog dev list
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 633905829fe733b1557fa3266e0b638674eca1b3

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35835/
Subject: LU-12566 mdc: hold lock while walking changelog dev list
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: b574f5f88878167b846a90ab09ac3b98958d52e4

Generated at Sat Feb 10 02:53:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.