[LU-12004] Crash in do_csum Created: 25/Feb/19  Updated: 08/May/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I see this semi-frequently in master even after LU-11697, so this must be something else.

This typically happens only in racer, and the full crash looks like this:

[ 8628.366285] Lustre: DEBUG MARKER: == racer test 1: racer on clients: centos-70.localnet DURATION=2700 ================================== 05:27:21 (1549708041)
[ 8629.054425] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000402:0x4:0x0], use llapi_layout_get_by_path()
[ 8630.549219] Lustre: DEBUG MARKER: racer test_1: @@@@@@ FAIL: generate lss conf (mds1)
[ 8634.303466] LustreError: 14083:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 472 for EA 496 (max_mdsize 496): rc = -34
[ 8779.449264] BUG: unable to handle kernel paging request at ffff8800aa2dc000
[ 8779.449670] IP: [<ffffffff813ee500>] do_csum+0x70/0x180
[ 8779.449670] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 8779.449670] CPU: 9 PID: 15375 Comm: ll_ost_io04_000  3.10.0-7.6-debug #1
[ 8779.449670] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 8779.509742] Call Trace:
[ 8779.509742]  [<ffffffff813ee61e>] ip_compute_csum+0xe/0x30
[ 8779.509742]  [<ffffffffa035e62e>] obd_dif_ip_fn+0xe/0x10 [obdclass]
[ 8779.523520]  [<ffffffffa035e6f9>] obd_page_dif_generate_buffer+0xc9/0x190 [obdclass]
[ 8779.523520]  [<ffffffffa05e18db>] tgt_checksum_niobuf_rw+0x28b/0xea0 [ptlrpc]
[ 8779.541604]  [<ffffffffa05e7e8d>] tgt_brw_read+0xc2d/0x1e60 [ptlrpc]
[ 8779.541604]  [<ffffffffa05e62a5>] tgt_request_handle+0x915/0x1610 [ptlrpc]
[ 8779.541604]  [<ffffffffa058b3d9>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[ 8779.541604]  [<ffffffffa058f3bc>] ptlrpc_main+0xb7c/0x22c0 [ptlrpc]
[ 8779.541604]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 8779.541604]  [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21

Note that I saw this even before T10-DIF support was landed, just with a slightly different trace.

It seems that in all cases only tgt_brw_read is hitting this.
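
For reference, the trace says the crash happens while tgt_checksum_niobuf_rw() checksums the bulk pages for the read reply: do_csum()/ip_compute_csum() walk every byte of the region they are handed, and under DEBUG_PAGEALLOC a freed page is unmapped, so the first load from a stale page faults at a page-aligned address exactly as above. Below is a minimal user-space sketch of that access pattern, with a simplified 16-bit ones'-complement sum standing in for do_csum() and a made-up fake_niobuf in place of the real niobuf structures (all names are illustrative, not the Lustre code):

/*
 * User-space analogue of the read-side checksum loop; the real code is
 * tgt_checksum_niobuf_rw() iterating over the local niobufs.  All names
 * and structures below are made up for illustration.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct fake_niobuf {
	unsigned char *page;	/* stands in for the niobuf's page */
	unsigned int len;	/* bytes to checksum in that page */
};

/* Simplified 16-bit ones'-complement sum, the kind of thing
 * do_csum()/ip_compute_csum() compute (byte order simplified). */
static uint16_t ip_csum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	/* Every byte of the region is dereferenced here; in the kernel
	 * case, if the page was freed and DEBUG_PAGEALLOC unmapped it,
	 * this first load is where the paging request fails. */
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

int main(void)
{
	struct fake_niobuf nb[2];
	int i;

	for (i = 0; i < 2; i++) {
		nb[i].page = malloc(PAGE_SIZE);
		memset(nb[i].page, i, PAGE_SIZE);
		nb[i].len = PAGE_SIZE;
	}

	/* The checksum pass: if anything released nb[i].page between the
	 * I/O completing and this loop running, this is where it blows up. */
	for (i = 0; i < 2; i++)
		printf("niobuf %d csum %#x\n", i,
		       (unsigned int)ip_csum(nb[i].page, nb[i].len));

	for (i = 0; i < 2; i++)
		free(nb[i].page);
	return 0;
}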



 Comments   
Comment by Oleg Drokin [ 25/Feb/19 ]

Here's another report of the same nature, but in adler32:

BUG: unable to handle kernel paging request at ffff8802a91ff000
IP: [<ffffffffa020e3f0>] adler32_update+0x70/0x250 [libcfs]
PGD 241b067 PUD 33ebfa067 PMD 33eab1067 PTE 80000002a91ff060
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common virtio_balloon virtio_console i2c_piix4 pcspkr ip_tables rpcsec_gss_krb5 ata_generic pata_acpi drm_kms_helper ttm drm ata_piix drm_panel_orientation_quirks serio_raw virtio_blk libata i2c_core floppy [last unloaded: libcfs]
CPU: 11 PID: 22659 Comm: ll_ost_io05_006 Kdump: loaded Tainted: P OE ------------ 3.10.0-7.6-debug #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
task: ffff88029dfbe200 ti: ffff8802fa270000 task.ti: ffff8802fa270000
RIP: 0010:[<ffffffffa020e3f0>] [<ffffffffa020e3f0>] adler32_update+0x70/0x250 [libcfs]
RSP: 0018:ffff8802fa273840 EFLAGS: 00010212
RAX: 0000000000001000 RBX: 0000000000001000 RCX: 0000000000000002
RDX: 0000000000001000 RSI: ffff8802a91ff000 RDI: ffff8802a91ff000
RBP: ffff8802fa2738a8 R08: 0000000000000000 R09: 0000000000001000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffea00094fefca
R13: 0000000000001000 R14: ffffffffa0234cd0 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff88033dcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8802a91ff000 CR3: 0000000223968000 CR4: 00000000000006e0
Call Trace:
[<ffffffffa020d4cd>] ? cfs_crypto_hash_alloc+0xcd/0x440 [libcfs]
[<ffffffff813910b7>] crypto_shash_update+0x47/0x120
[<ffffffff813913de>] shash_ahash_update+0x3e/0x70
[<ffffffff81391422>] shash_async_update+0x12/0x20
[<ffffffffa020d3b3>] cfs_crypto_hash_update_page+0x93/0xc0 [libcfs]
[<ffffffffa061847e>] tgt_checksum_niobuf_rw+0x8ce/0xea0 [ptlrpc]
[<ffffffffa035f0e5>] ? lprocfs_stats_unlock+0x45/0x50 [obdclass]
[<ffffffffa0361119>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[<ffffffffa1232326>] ? ofd_preprw+0x5d6/0x1160 [ofd]
[<ffffffffa05dbcad>] ? __req_capsule_get+0x15d/0x700 [ptlrpc]
[<ffffffffa0393e10>] ? obd_dif_crc_fn+0x20/0x20 [obdclass]
[<ffffffffa061a41d>] tgt_brw_read+0xc2d/0x1e60 [ptlrpc]
[<ffffffff812127f4>] ? __kmalloc+0x634/0x660
[<ffffffff813eca64>] ? vsnprintf+0x234/0x6a0
[<ffffffffa0361119>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[<ffffffffa05b6fe6>] ? lustre_pack_reply_v2+0x166/0x290 [ptlrpc]
[<ffffffffa05b717f>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[<ffffffffa05b7301>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[<ffffffffa061e355>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
[<ffffffffa0211fa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[<ffffffffa05c2436>] ptlrpc_server_handle_request+0x256/0xad0 [ptlrpc]
[<ffffffffa05c6329>] ptlrpc_main+0xa99/0x1f60 [ptlrpc]
[<ffffffff810c32ed>] ? finish_task_switch+0x5d/0x1b0
[<ffffffffa05c5890>] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[<ffffffff810b4ed4>] kthread+0xe4/0xf0
[<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
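
The adler32 variant makes the same point from the other side: the faulting address equals the source pointer (RSI and CR2 are both ffff8802a91ff000), i.e. the very first byte of the page handed to crypto_shash_update() was already unmapped, so the checksum algorithm itself is irrelevant. A simplified user-space adler32 just to show that it, too, reads the buffer byte by byte (an illustration, not the libcfs/zlib implementation):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MOD_ADLER 65521u

/* Simplified adler32: reads buf[0..len-1] sequentially, the same kind of
 * byte walk adler32_update() performs over the page it is given. */
static uint32_t adler32_demo(const unsigned char *buf, size_t len)
{
	uint32_t a = 1, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a = (a + buf[i]) % MOD_ADLER;
		b = (b + a) % MOD_ADLER;
	}
	return (b << 16) | a;
}

int main(void)
{
	unsigned char *page = malloc(4096);

	memset(page, 0xab, 4096);
	/* In the crash the equivalent of this very first read faulted,
	 * because the page behind the pointer was no longer mapped. */
	printf("adler32 %#x\n", adler32_demo(page, 4096));
	free(page);
	return 0;
}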
Comment by James A Simmons [ 25/Feb/19 ]

Oleg, can you reproduce this?

Comment by Alex Zhuravlev [ 31/Oct/22 ]

Up-to-date master, ZFS backend:

[   59.782132] Lustre: DEBUG MARKER: == racer test 1: racer on clients: tmp.MXpJUErHf4 DURATION=2700 ========================================================== 12:36:17 (1667133377)
..
[ 2243.840113] BUG: unable to handle kernel paging request at ffff89346d5a4000
[ 2243.840277] PGD 110e01067 P4D 110e01067 PUD 1b0979067 PMD 1b080e067 PTE 800ffffe92a5b060
[ 2243.840328] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2243.840355] CPU: 0 PID: 6959 Comm: ll_ost_io00_002 Tainted: G        W  O     --------- -  - 4.18.0 #2
[ 2243.840409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2243.840443] RIP: 0010:do_csum+0x6a/0x160
[ 2243.840472] Code: c7 04 0f 85 bb 00 00 00 41 89 c1 c1 e8 04 41 d1 e9 85 c0 0f 84 ff 00 00 00 83 e8 01 45 31 c0 48 83 c0 01 48 c1 e0 06 48 01 f8 <48> 03 17 48 13 57 08 48 13 57 10 48 13 57 18 48 13 57 20 48 13 57
[ 2243.840560] RSP: 0018:ffff89349c66fb18 EFLAGS: 00010286
[ 2243.840586] RAX: ffff89346d5a5000 RBX: 0000000000001000 RCX: 0000000000000000
[ 2243.840622] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff89346d5a4000
[ 2243.840658] RBP: 0000000000001000 R08: 0000000000000000 R09: 0000000000000200
[ 2243.840694] R10: 0000000000000000 R11: ffff89346d5a4000 R12: ffff89346dde7002
[ 2243.840730] R13: 0000000000001000 R14: ffff89346d5a4000 R15: 0000000000000001
[ 2243.840767] FS:  0000000000000000(0000) GS:ffff8934a9c00000(0000) knlGS:0000000000000000
[ 2243.840816] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2243.840846] CR2: ffff89346d5a4000 CR3: 00000000d3b2c000 CR4: 00000000000006b0
[ 2243.840885] Call Trace:
[ 2243.840912]  ip_compute_csum+0x5/0x30
[ 2243.840975]  obd_page_dif_generate_buffer+0xf8/0x1b0 [obdclass]
[ 2243.841101]  tgt_checksum_niobuf_rw+0xa2d/0x15a0 [ptlrpc]
[ 2243.841200]  ? obd_dif_crc_fn+0x10/0x10 [obdclass]
[ 2243.841260]  ? obd_dif_crc_fn+0x10/0x10 [obdclass]
[ 2243.841340]  tgt_brw_read+0x1752/0x2010 [ptlrpc]
[ 2243.841393]  ? static_obj+0x2d/0x50
[ 2243.841422]  ? lockdep_init_map_waits+0x4b/0x210
[ 2243.841492]  ? lustre_pack_reply_v2+0x20b/0x2b0 [ptlrpc]
[ 2243.841580]  ? lustre_pack_reply_flags+0x55/0x1b0 [ptlrpc]
[ 2243.841671]  tgt_request_handle+0x977/0x1a40 [ptlrpc]
[ 2243.841756]  ptlrpc_main+0x1724/0x32c0 [ptlrpc]
[ 2243.841844]  ? ptlrpc_wait_event+0x4b0/0x4b0 [ptlrpc]