[LU-12135] general protection faults possibly caused by lustre Created: 29/Mar/19  Updated: 11/Dec/19  Resolved: 11/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shane Nehring Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

RHEL 7.6 clients


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We've had a couple of clients crash since updating to 2.12.0, and we think the crashes are being caused by Lustre. At least, that's where our OS vendor is leading us after their analysis of the kernel dumps.

 

Stack trace of one of the crashes:

[494022.831748] general protection fault: 0000 [#1] SMP
[494022.832460] Modules linked in: cmac nls_utf8 cifs ccm cts 8021q garp mrp rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack mpt2sas raid_class scsi_transport_sas mptctl mptbase ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security ib_isert iscsi_target_mod iptable_raw ebtable_filter ebtables ib_srpt target_core_mod ip6table_filter ip6_tables ib_srp iptable_filter scsi_transport_srp
[494022.837335] scsi_tgt ib_ucm i40iw dell_rbu vfat fat skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dell_smbios dcdbas dell_wmi_descriptor rpcrdma rdma_ucm ib_uverbs ib_iser libiscsi opa_vnic scsi_transport_iscsi ib_umad sg wdat_wdt pcspkr lpc_ich i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler wmi mei_me acpi_power_meter mei acpi_pad auth_rpcgss sunrpc ip_tables xfs libcrc32c rdma_cm iw_cm ib_ipoib ib_cm sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e igb drm_panel_orientation_quirks megaraid_sas ptp pps_core dca ahci libahci nfit libata libnvdimm dm_mirror dm_region_hash
[494022.843520] dm_log dm_mod hfi1 rdmavt ib_core i2c_algo_bit
[494022.845757] CPU: 100 PID: 71641 Comm: ptlrpcd_00_12 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.10.1.el7.x86_64 #1
[494022.848170] Hardware name: Dell Inc. PowerEdge R940/0D41HC, BIOS 1.6.12 11/20/2018
[494022.849441] task: ffff8c8331ed9040 ti: ffff8c831b338000 task.ti: ffff8c831b338000
[494022.850687] RIP: 0010:[<ffffffffa0e1d764>] [<ffffffffa0e1d764>] __kmalloc+0x94/0x230
[494022.851968] RSP: 0018:ffff8c831b33b888 EFLAGS: 00010286
[494022.853314] RAX: 0000000000000000 RBX: 000000000000002a RCX: 0000000000acf8fc
[494022.854641] RDX: 0000000000acf8fb RSI: 0000000000000000 RDI: 0000000000000003
[494022.855942] RBP: ffff8c831b33b8b8 R08: 000000000001f080 R09: ffffffffc0dc7519
[494022.857300] R10: ffff8b663fc07c00 R11: 0000000000000000 R12: 0000000000008250
[494022.858620] R13: 006fffff0000086c R14: 0000000000000020 R15: ffff8b663fc07c00
[494022.859923] FS: 0000000000000000(0000) GS:ffff8bc33fc40000(0000) knlGS:0000000000000000
[494022.861295] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[494022.862614] CR2: 00007f6a6ba6f1b0 CR3: 000000ca77010000 CR4: 00000000007607e0
[494022.863948] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[494022.865224] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[494022.865990] PKRU: 00000000
[494022.866748] Call Trace:
[494022.867569] [<ffffffffc0dc7519>] ? ptlrpc_new_bulk+0x469/0x870 [ptlrpc]
[494022.868401] [<ffffffffc0dc7519>] ptlrpc_new_bulk+0x469/0x870 [ptlrpc]
[494022.869222] [<ffffffffc0dc797d>] ptlrpc_prep_bulk_imp+0x5d/0x180 [ptlrpc]
[494022.870007] [<ffffffffc0ddcee7>] ? lustre_msg_set_timeout+0x27/0xa0 [ptlrpc]
[494022.870799] [<ffffffffc0f3b6e0>] osc_brw_prep_request+0x2d0/0x1330 [osc]
[494022.871597] [<ffffffffc0d9e77b>] ? __ldlm_handle2lock+0x3b/0x3f0 [ptlrpc]
[494022.872424] [<ffffffffc0f4b4be>] ? osc_obj_dlmlock_at_pgoff+0x15e/0x2c0 [osc]
[494022.873267] [<ffffffffc0f43df2>] ? osc_req_attr_set+0x152/0x610 [osc]
[494022.874066] [<ffffffffc0f3e702>] osc_build_rpc+0x562/0x1070 [osc]
[494022.874874] [<ffffffffc0f59007>] osc_io_unplug0+0xe27/0x1920 [osc]
[494022.875717] [<ffffffffc0f57cd7>] ? osc_extent_finish+0x5e7/0xaf0 [osc]
[494022.876580] [<ffffffffc0bdc7e9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[494022.877411] [<ffffffffa0ce015c>] ? update_curr+0x14c/0x1e0
[494022.878241] [<ffffffffc0f337a3>] brw_queue_work+0x33/0xd0 [osc]
[494022.879088] [<ffffffffc0dd0d2a>] work_interpreter+0x3a/0xf0 [ptlrpc]
[494022.879921] [<ffffffffc0dcdac1>] ptlrpc_check_set.part.23+0x481/0x1df0 [ptlrpc]
[494022.880752] [<ffffffffa0c2a59e>] ? __switch_to+0xce/0x580
[494022.881577] [<ffffffffc0dcf48b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[494022.882357] [<ffffffffc0dfaafb>] ptlrpcd_check+0x4ab/0x590 [ptlrpc]
[494022.883120] [<ffffffffc0dfaed9>] ptlrpcd+0x2f9/0x550 [ptlrpc]
[494022.883872] [<ffffffffa0cd67f0>] ? wake_up_state+0x20/0x20
[494022.884626] [<ffffffffc0dfabe0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[494022.885340] [<ffffffffa0cc1c71>] kthread+0xd1/0xe0
[494022.886035] [<ffffffffa0cc1ba0>] ? insert_kthread_work+0x40/0x40
[494022.886717] [<ffffffffa1375c1d>] ret_from_fork_nospec_begin+0x7/0x21
[494022.887360] [<ffffffffa0cc1ba0>] ? insert_kthread_work+0x40/0x40
[494022.888001] Code: 3a 1f 5f 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 29 01 00 00 48 85 c0 0f 84 20 01 00 00 49 63 42 20 48 8d 4a 01 4d 8b 02 <49> 8b 5c 05 00 4c 89 e8 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49
[494022.889372] RIP [<ffffffffa0e1d764>] __kmalloc+0x94/0x230
[494022.889988] RSP <ffff8c831b33b888>
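For context (my reading, not part of the vendor analysis): a #GP inside __kmalloc usually means the slab allocator loaded a corrupted freelist pointer. The R13 value in the oops above, 006fffff0000086c, is not a canonical x86-64 address, and dereferencing a non-canonical pointer raises a general protection fault rather than a page fault. That signature is consistent with a use-after-free clobbering a freed object's freelist pointer. An illustrative check of the canonical-address rule (plain Python, not from the ticket):

```python
# x86-64 requires bits 48-63 of a virtual address to sign-extend bit 47;
# dereferencing anything else raises #GP (general protection fault),
# not a page fault.
def is_canonical(addr: int) -> bool:
    top = addr >> 47  # bit 47 together with bits 48-63
    return top == 0 or top == 0x1FFFF

print(is_canonical(0x006FFFFF0000086C))  # R13 from the oops -> False (non-canonical)
print(is_canonical(0x00007F6A6BA6F1B0))  # CR2 from the oops -> True (valid user address)
```

A non-canonical value in a register that the allocator is about to dereference is why this shows up as "general protection fault" with no faulting address in CR2.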

 

I can provide the core dumps if desired, but am not comfortable posting them here.

 

Please let me know if you need more information.



 Comments   
Comment by Shane Nehring [ 02/Apr/19 ]

I should clarify: we've seen GPFs in non-Lustre code as well as the crash shown here. We're working on additional troubleshooting with our OS vendor, but it's looking like a use-after-free somewhere in Lustre.

Comment by Shane Nehring [ 03/May/19 ]

2.12.1 has been released, and the changelog mentions a use-after-free fix whose timeline fits with when we started seeing this issue. We'll update to that release and see if the issue recurs when we get the chance.

Comment by Shane Nehring [ 20/Jun/19 ]

We're still seeing some crashes with 2.12.2 that seem to point to a use-after-free somewhere, possibly in Lustre.

Comment by Shane Nehring [ 11/Dec/19 ]

We haven't had this occur again for some time. Either whatever I/O pattern was responsible has ceased, or the issue was resolved by a kernel update.

Comment by Shane Nehring [ 11/Dec/19 ]

I can't seem to close this, but it should probably be closed.

Comment by Peter Jones [ 11/Dec/19 ]

Thanks for the update, snehring
