[LU-9279] coral-beta-combined build 124 kernel BUG at include/linux/scatterlist.h:65! invalid opcode: 0000 [#1] SMP Created: 15/Mar/17  Updated: 18/Aug/17  Resolved: 18/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: John Salinas (Inactive) Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: LS_RZ, prod
Environment:

Lustre 2.9.0, but with a special zfs: fs/zfs -b coral-beta-combined build 124


Issue Links:
Related
is related to LU-9305 Running File System Aging create writ... Resolved
is related to LU-6020 Bugfixes for GSS/Kerberos Resolved
is related to LU-9304 BUG: Bad page state in process ll_ost... Resolved
Severity: 1
Rank (Obsolete): 9223372036854775807

 Description   

Running IOR, Mdtest, fsx, and FileAger on 4 clients against two OSSes with dRAID pools with metadata segregation and 1 MDS, we hit the following:

[78289.557925] -----------[ cut here ]-----------
[78289.564140] kernel BUG at include/linux/scatterlist.h:65!
[78289.571153] invalid opcode: 0000 [#1] SMP
[78289.576735] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_ssif sb_edac iTCO_wdt ipmi_devintf iTCO_vendor_support
[78289.665352] edac_core ioatdma ipmi_si sg pcspkr raid_class scsi_transport_sas ipmi_msghandler mei_me shpchp lpc_ich acpi_pad i2c_i801 mei acpi_power_meter mfd_core wmi dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ttm crct10dif_common ptp crc32c_intel ahci pps_core libahci drm mlx4_core dca libata i2c_algo_bit i2c_core [last unloaded: zunicode]
[78289.725227] CPU: 37 PID: 51095 Comm: ll_ost_io00_005 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
[78289.739202] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[78289.752574] task: ffff882001da8b80 ti: ffff882008ed8000 task.ti: ffff882008ed8000
[78289.762680] RIP: 0010:[<ffffffffa0996fef>] [<ffffffffa0996fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
[78289.775760] RSP: 0018:ffff882008edbab8 EFLAGS: 00010202
[78289.783469] RAX: 0000000000000002 RBX: ffff88068f158580 RCX: 0000000000000000
[78289.793264] RDX: 0000000000000020 RSI: 0000000000000000 RDI: ffff882008edbad8
[78289.803076] RBP: ffff882008edbb00 R08: 00000000000195a0 R09: ffff882008edbab8
[78289.812907] R10: ffff88103e807900 R11: 0000000000000001 R12: 3534333231303635
[78289.822752] R13: 0000000032313036 R14: 0000000000000433 R15: 0000000000000000
[78289.832613] FS: 0000000000000000(0000) GS:ffff88103f0c0000(0000) knlGS:0000000000000000
[78289.843577] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[78289.851940] CR2: 00007fa078dc8028 CR3: 000000000194a000 CR4: 00000000001407e0
[78289.861897] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[78289.871871] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[78289.881850] Stack:
[78289.886131] 0000000000000002 0000000000000000 0000000000000000 0000000000000000
[78289.896576] 00000000ec45cb06 0000000000000000 ffff880bc199b201 ffff880e5e4a7e00
[78289.907042] 0000000000000000 ffff882008edbb68 ffffffffa0dd5459 ffff880ff80640a8
[78289.917526] Call Trace:
[78289.922480] [<ffffffffa0dd5459>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
[78289.932997] [<ffffffffa0dae21d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
[78289.942464] [<ffffffff81632d15>] ? __slab_free+0x10e/0x277
[78289.950833] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
[78289.959081] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
[78289.968337] [<ffffffffa0d04560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[78289.978578] [<ffffffffa0daa225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[78289.988447] [<ffffffffa0d561ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[78289.999139] [<ffffffffa099d128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[78290.008864] [<ffffffffa0d53d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[78290.018520] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[78290.027562] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[78290.036171] [<ffffffffa0d5a260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[78290.045239] [<ffffffffa0d597c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[78290.055472] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[78290.062884] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[78290.072171] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[78290.080149] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[78290.089344] Code: 89 43 38 48 8b 43 20 ff 50 c0 48 8b 55 d8 65 48 33 14 25 28 00 00 00 75 0d 48 83 c4 28 5b 41 5c 41 5d 41 5e 5d c3 e8 61 40 6e e0 <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[78290.115466] RIP [<ffffffffa0996fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
[78290.126018] RSP <ffff882008edbab8>

Version information:
[root@wolf-3 10.8.1.3-2017-03-15-15:01:39]# rpm -qa |grep -i lustre
kmod-lustre-tests-2.9.0_dirty-1.el7.centos.x86_64
lustre-tests-2.9.0_dirty-1.el7.centos.x86_64
lustre-osd-zfs-mount-2.9.0_dirty-1.el7.centos.x86_64
lustre-2.9.0_dirty-1.el7.centos.x86_64
lustre-iokit-2.9.0_dirty-1.el7.centos.x86_64
kmod-lustre-2.9.0_dirty-1.el7.centos.x86_64
kmod-lustre-osd-zfs-2.9.0_dirty-1.el7.centos.x86_64
[root@wolf-3 10.8.1.3-2017-03-15-15:01:39]# rpm -qa |grep zfs
libzfs2-0.7.0-rc3_21_g6324695.el7.centos.x86_64
kmod-zfs-0.7.0-rc3_21_g6324695.el7.centos.x86_64
zfs-test-0.7.0-rc3_21_g6324695.el7.centos.x86_64
lustre-osd-zfs-mount-2.9.0_dirty-1.el7.centos.x86_64
zfs-0.7.0-rc3_21_g6324695.el7.centos.x86_64
kmod-lustre-osd-zfs-2.9.0_dirty-1.el7.centos.x86_64

PID: 51095 TASK: ffff882001da8b80 CPU: 37 COMMAND: "ll_ost_io00_005"
[ffff882008edac50] list_del at ffffffff8130c6dd
[ffff882008edac68] __rmqueue at ffffffff8117069a
[ffff882008edacb0] zone_statistics at ffffffff81189b89
[ffff882008edadb0] list_del at ffffffff8130c6dd
[ffff882008edadc8] __rmqueue at ffffffff8117069a
[ffff882008edae10] zone_statistics at ffffffff81189b89
[ffff882008edaed0] list_del at ffffffff8130c6dd
[ffff882008edaee8] get_partial_node at ffffffff8163306c
[ffff882008edaf40] __alloc_pages_nodemask at ffffffff81173327
[ffff882008edafd8] mempool_alloc_slab at ffffffff8116c235
[ffff882008edb070] kmem_cache_alloc at ffffffff811c1693
[ffff882008edb0b0] mempool_alloc_slab at ffffffff8116c235
[ffff882008edb0c0] mempool_alloc at ffffffff8116c379
[ffff882008edb118] __blk_segment_map_sg at ffffffff812d0736
[ffff882008edb128] update_curr at ffffffff810c15cc
[ffff882008edb140] account_entity_dequeue at ffffffff810be46e
[ffff882008edb168] dequeue_entity at ffffffff810c1a96
[ffff882008edb1b0] list_del at ffffffff8130c6dd
[ffff882008edb1e0] mga_dirty_update at ffffffffa01614e7 [mgag200]
[ffff882008edb230] mga_imageblit at ffffffffa016162f [mgag200]
[ffff882008edb250] bit_putcs at ffffffff81356997
[ffff882008edb260] mga_dirty_update at ffffffffa01614e7 [mgag200]
[ffff882008edb2b0] mga_imageblit at ffffffffa016162f [mgag200]
[ffff882008edb330] sys_fillrect at ffffffffa00101a8 [sysfillrect]
[ffff882008edb350] mga_dirty_update at ffffffffa01614e7 [mgag200]
[ffff882008edb380] mga_dirty_update at ffffffffa01614e7 [mgag200]
[ffff882008edb3e0] mga_dirty_update at ffffffffa01614e7 [mgag200]
[ffff882008edb430] mga_imageblit at ffffffffa016162f [mgag200]
[ffff882008edb450] bit_putcs at ffffffff81356997
[ffff882008edb460] mga_dirty_update at ffffffffa01614e7 [mgag200]
[ffff882008edb4b0] mga_imageblit at ffffffffa016162f [mgag200]
[ffff882008edb530] sys_fillrect at ffffffffa00101a8 [sysfillrect]
[ffff882008edb550] mga_dirty_update at ffffffffa01614e7 [mgag200]
[ffff882008edb5a0] append_elf_note at ffffffff810f12e4
[ffff882008edb5f8] crash_save_cpu at ffffffff810f2619
[ffff882008edb700] cfs_crypto_hash_update_page at ffffffffa0996fef [libcfs]
[ffff882008edb778] machine_kexec at ffffffff81051e9b
[ffff882008edb7d8] crash_kexec at ffffffff810f27e2
[ffff882008edb860] cfs_crypto_hash_update_page at ffffffffa0996fef [libcfs]
[ffff882008edb8a8] oops_end at ffffffff8163f448
[ffff882008edb8d0] die at ffffffff8101859b
[ffff882008edb900] do_trap at ffffffff8163eb00
[ffff882008edb950] do_invalid_op at ffffffff81015204
[ffff882008edb968] cfs_crypto_hash_update_page at ffffffffa0996fef [libcfs]
[ffff882008edb9a0] crypto_create_tfm at ffffffff812aaa08
[ffff882008edb9d0] crypto_init_shash_ops_async at ffffffff812b2607
[ffff882008edba00] invalid_op at ffffffff8164825e
[ffff882008edba88] cfs_crypto_hash_update_page at ffffffffa0996fef [libcfs]
[ffff882008edbab0] cfs_crypto_hash_update_page at ffffffffa0996f94 [libcfs]
[ffff882008edbb08] tgt_checksum_bulk at ffffffffa0dd5459 [ptlrpc]
[ffff882008edbb70] tgt_brw_write at ffffffffa0dae21d [ptlrpc]
[ffff882008edbb98] __slab_free at ffffffff81632d15
[ffff882008edbbc8] update_curr at ffffffff810c15cc
[ffff882008edbbe0] account_entity_dequeue at ffffffff810be46e
[ffff882008edbc88] target_bulk_timeout at ffffffffa0d04560 [ptlrpc]
[ffff882008edbcd8] tgt_request_handle at ffffffffa0daa225 [ptlrpc]
[ffff882008edbd20] ptlrpc_server_handle_request at ffffffffa0d561ab [ptlrpc]
[ffff882008edbd28] lc_watchdog_touch at ffffffffa099d128 [libcfs]
[ffff882008edbd50] ptlrpc_wait_event at ffffffffa0d53d68 [ptlrpc]
[ffff882008edbd58] default_wake_function at ffffffff810b8952
[ffff882008edbd68] __wake_up_common at ffffffff810af0b8
[ffff882008edbde8] ptlrpc_main at ffffffffa0d5a260 [ptlrpc]
[ffff882008edbea8] ptlrpc_main at ffffffffa0d597c0 [ptlrpc]
[ffff882008edbec8] kthread at ffffffff810a5b8f
[ffff882008edbf30] kthread at ffffffff810a5ac0
[ffff882008edbf50] ret_from_fork at ffffffff81646a98
[ffff882008edbf80] kthread at ffffffff810a5ac0



 Comments   
Comment by John Salinas (Inactive) [ 16/Mar/17 ]

Dump is in: /scratch/dumps/wolf-3.wolf.hpdd.intel.com/10.8.1.3-2017-03-15-15:01:39/

0x6f50 is in cfs_crypto_hash_update_page (/usr/src/debug/lustre-2.9.0_dirty/libcfs/libcfs/linux/linux-crypto.c:230).
225  * \retval negative errno on failure
226  */
227 int cfs_crypto_hash_update_page(struct cfs_crypto_hash_desc *hdesc,
228                                 struct page *page, unsigned int offset,
229                                 unsigned int len)
230 {
231         struct ahash_request *req = (void *)hdesc;
232         struct scatterlist sl;
233
234         sg_init_table(&sl, 1);

0xb80d0 is in tgt_brw_write (/usr/src/debug/lustre-2.9.0_dirty/lustre/target/tgt_handler.c:1955).
1950                        local_nb[npages - 1].lnb_len - 1,
1951                        client_cksum, server_cksum);
1952 }
1953
1954 int tgt_brw_write(struct tgt_session_info *tsi)
1955 {
1956         struct ptlrpc_request *req = tgt_ses_req(tsi);
1957         struct ptlrpc_bulk_desc *desc = NULL;
1958         struct obd_export *exp = req->rq_export;
1959         struct niobuf_remote *remote_nb;

0xf3f1 is in target_send_reply_msg (/usr/src/debug/lustre-2.9.0_dirty/lustre/ldlm/ldlm_lib.c:2902).
2897         RETURN(0);
2898 }
2899
2900 static int target_send_reply_msg(struct ptlrpc_request *req,
2901                                  int rc, int fail_id)
2902 {
2903         if (OBD_FAIL_CHECK_ORSET(fail_id & ~OBD_FAIL_ONCE, OBD_FAIL_ONCE)) {
2904                 DEBUG_REQ(D_ERROR, req, "dropping reply");
2905                 return -ECOMM;
2906         }


Comment by Peter Jones [ 31/Mar/17 ]

Nathaniel

Could you please assist with this one? Oleg wonders if it is a similar issue to LU-6020

Peter

Comment by John Salinas (Inactive) [ 12/Apr/17 ]

Nathaniel do you have any questions for us?

Comment by Nathaniel Clark [ 12/Apr/17 ]

This does look like the same area as LU-6020, but those patches should all be landed to master by 2.9.0.

The crash is due to an assertion in the scatterlist code:

0xacb1 is in cfs_crypto_hash_update_page (include/linux/scatterlist.h:65).
60 
61 /*
62 * In order for the low bit stealing approach to work, pages
63 * must be aligned at a 32-bit boundary as a minimum.
64 */
65 BUG_ON((unsigned long) page & 0x03);
66 #ifdef CONFIG_DEBUG_SG
67 BUG_ON(sg->sg_magic != SG_MAGIC);
68 BUG_ON(sg_is_chain(sg));
69 #endif

jsalians_intel,

Do any of your patches on 2.9 go anywhere near this code? Are you using the vanilla 2.9 code? It looks like the kiov_page in the ptlrpc_bulk_desc either didn't get initialized or was set from a bad value.
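
For reference, a minimal sketch of the failing path, pieced together from the gdb listings and the quoted scatterlist.h assertion above; it is a paraphrase rather than a verbatim copy of the 2.9 sources, and the bulk-descriptor field access in tgt_checksum_bulk is only described in the comment:

#include <crypto/hash.h>
#include <linux/scatterlist.h>

/* Sketch: tgt_checksum_bulk() walks the bulk descriptor and feeds each
 * kiov_page into the hash; cfs_crypto_hash_update_page() then wraps that
 * page in a one-entry scatterlist. */
int cfs_crypto_hash_update_page(struct cfs_crypto_hash_desc *hdesc,
				struct page *page, unsigned int offset,
				unsigned int len)
{
	struct ahash_request *req = (void *)hdesc;
	struct scatterlist sl;

	sg_init_table(&sl, 1);
	/* sg_set_page() -> sg_assign_page() runs the assertion that fired:
	 * BUG_ON((unsigned long)page & 0x03).  An uninitialized or clobbered
	 * kiov_page pointer is not 4-byte aligned and trips it. */
	sg_set_page(&sl, page, len, offset);

	ahash_request_set_crypt(req, &sl, NULL, sl.length);
	return crypto_ahash_update(req);
}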

Comment by John Salinas (Inactive) [ 13/Apr/17 ]

We use vanilla 2.9 code, but we have tried both of these settings on the clients:
/usr/sbin/lctl set_param osc.*.max_pages_per_rpc=1024
/usr/sbin/lctl set_param osc.*.max_pages_per_rpc=4096

And this on the server:
lctl set_param obdfilter.lsdraid-OST0000.brw_size=16

All of our code changes are in ZFS, where we have a dRAID pool instead of a RAIDz pool, and that pool has metadata segregation.

Comment by Nathaniel Clark [ 13/Apr/17 ]

jsalians_intel,

Do you have a crash dump from this? (possibly in /var/crash/<whateverdateandtime>/)

Could you also attach an sosreport from the MDS?

Thanks

Comment by John Salinas (Inactive) [ 13/Apr/17 ]

We are using the vanilla 2.9.0 code, but in the ZFS code we have added a new RAID type (dRAID). We are using 16MB RPCs from the Lustre client to the OSS and have set the BRW size to 16 as well.

Hope that helps,
john

Comment by John Salinas (Inactive) [ 13/Apr/17 ]

The dump was in: /scratch/dumps/wolf-3.wolf.hpdd.intel.com/10.8.1.3-2017-03-15-15:01:39/ – I have asked doc if it can be restored, but I am not sure. I guess we need a more persistent place for these ...

Comment by Nathaniel Clark [ 13/Apr/17 ]

Given this and LU-9304, could the new dRAID code be munging page pointers?

Comment by John Salinas (Inactive) [ 13/Apr/17 ]

I could try to reproduce this on raidz if that would help. I am not sure how to tell, but a bunch of work was done to integrate ABD (ARC buffer data), which included switching how ZFS handles Linux memory to make it more efficient.

Comment by Nathaniel Clark [ 13/Apr/17 ]

That sounds suspicious. Which branch is that on (just to simplify my searching)?

Comment by John Salinas (Inactive) [ 13/Apr/17 ]

fs/zfs -b coral-beta-combined

I have run heavy loads with just ZFS without issue.

However, running with Lustre 2.9.0 on top of it causes issues. The ABD work has already been merged, so in theory, if that is the issue, we could reproduce this with just 0.7.0.

Comment by Nathaniel Clark [ 14/Apr/17 ]

The ABD code uses scatterlists to track pages. It's as if the bits sg uses on page_link are leaking... but that should cause lots of problems.
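
For context, the "bit stealing" here refers to how scatterlist packs chain/end markers into the low bits of page_link; a trimmed sketch, paraphrased from include/linux/scatterlist.h of that kernel era (renamed with a my_ prefix, not a verbatim copy):

#include <linux/scatterlist.h>

/* The low two bits of sg->page_link encode "chain" (0x01) and "end" (0x02)
 * markers, so the struct page pointer itself must be at least 4-byte
 * aligned.  sg_assign_page() enforces that with the BUG_ON that fired. */
static inline void my_sg_assign_page(struct scatterlist *sg, struct page *page)
{
	unsigned long page_link = sg->page_link & 0x3;	/* keep marker bits */

	BUG_ON((unsigned long)page & 0x03);		/* scatterlist.h:65 */
	sg->page_link = page_link | (unsigned long)page;
}

/* The reader side masks the marker bits back off; if marker bits "leaked"
 * into the stored value, sg_page() would hand back a corrupted pointer. */
static inline struct page *my_sg_page(struct scatterlist *sg)
{
	return (struct page *)(sg->page_link & ~0x3);
}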

Getting a crash dump from this would be the most useful way forward.

Comment by Andreas Dilger [ 25/Apr/17 ]

If there is memory corruption in LU-9305, there is a definite possibility that this is related. List corruption is the most common fallout from random memory corruption, because the list_heads are traversed frequently, are randomly spread throughout memory, and can be validated easily for correctness.
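
As an aside, the "validated easily" part refers to the checks the kernel runs on list deletion when CONFIG_DEBUG_LIST is enabled; a trimmed sketch, paraphrased from lib/list_debug.c in 3.10-era kernels (illustrative only, not output from this ticket):

#include <linux/bug.h>
#include <linux/list.h>
#include <linux/poison.h>

/* Every deletion cross-checks the neighbouring pointers, so random memory
 * corruption that lands on a list_head is usually reported quickly as a
 * "list_del corruption" warning rather than silently propagating. */
static void checked_list_del(struct list_head *entry)
{
	struct list_head *prev = entry->prev;
	struct list_head *next = entry->next;

	if (WARN(next == LIST_POISON1,
		 "list_del corruption, %p->next is LIST_POISON1\n", entry) ||
	    WARN(prev == LIST_POISON2,
		 "list_del corruption, %p->prev is LIST_POISON2\n", entry) ||
	    WARN(prev->next != entry,
		 "list_del corruption, prev->next should be %p, was %p\n",
		 entry, prev->next) ||
	    WARN(next->prev != entry,
		 "list_del corruption, next->prev should be %p, was %p\n",
		 entry, next->prev))
		return;

	__list_del(prev, next);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}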

Comment by Jinshan Xiong (Inactive) [ 05/May/17 ]

Try CentOS 7.3 and check if you can still see the issue. I suspect this is a kernel bug in the crypto framework.

Comment by John Salinas (Inactive) [ 05/May/17 ]

Will do. I am working on getting 0.7.0 RC4 + CentOS 7.3 + the 2.10 tag.

Comment by Andreas Dilger [ 12/May/17 ]

Copying the comments from LU-9305:

This matches the bug description at https://www.spinics.net/lists/linux-crypto/msg22859.html and the corresponding redhat bug is https://bugzilla.redhat.com/show_bug.cgi?id=1399754 but unfortunately I don't have permission to look into the detail.

I tend to think this is a kernel BUG, and this piece of code has been changed in the RHEL 7.3 kernels. I also exercised the same code on CentOS 7.3 and didn't see the issue, so I would recommend upgrading our distro to RHEL 7.3.

Comment by Andreas Dilger [ 12/May/17 ]

John S., could you please comment on whether you have been able to test with RHEL7.3 and if you are still seeing the checksum errors/corruption?

Comment by Peter Jones [ 18/May/17 ]

Descoping from 2.10 in the absence of any information that this is still a live issue

Comment by Nathaniel Clark [ 18/Aug/17 ]

This appears to be fixed by the RHEL 7.3 kernel update.
