[LU-12004] Crash in do_csum Created: 25/Feb/19 Updated: 08/May/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
I see this semi-frequently in master even after This is typically only in racer and the full crash looks like this:
[ 8628.366285] Lustre: DEBUG MARKER: == racer test 1: racer on clients: centos-70.localnet DURATION=2700 ================================== 05:27:21 (1549708041)
[ 8629.054425] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000402:0x4:0x0], use llapi_layout_get_by_path()
[ 8630.549219] Lustre: DEBUG MARKER: racer test_1: @@@@@@ FAIL: generate lss conf (mds1)
[ 8634.303466] LustreError: 14083:0:(mdt_lvb.c:430:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 472 for EA 496 (max_mdsize 496): rc = -34
[ 8779.449264] BUG: unable to handle kernel paging request at ffff8800aa2dc000
[ 8779.449670] IP: [<ffffffff813ee500>] do_csum+0x70/0x180
[ 8779.449670] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 8779.449670] CPU: 9 PID: 15375 Comm: ll_ost_io04_000 3.10.0-7.6-debug #1
[ 8779.449670] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 8779.509742] Call Trace:
[ 8779.509742] [<ffffffff813ee61e>] ip_compute_csum+0xe/0x30
[ 8779.509742] [<ffffffffa035e62e>] obd_dif_ip_fn+0xe/0x10 [obdclass]
[ 8779.523520] [<ffffffffa035e6f9>] obd_page_dif_generate_buffer+0xc9/0x190 [obdclass]
[ 8779.523520] [<ffffffffa05e18db>] tgt_checksum_niobuf_rw+0x28b/0xea0 [ptlrpc]
[ 8779.541604] [<ffffffffa05e7e8d>] tgt_brw_read+0xc2d/0x1e60 [ptlrpc]
[ 8779.541604] [<ffffffffa05e62a5>] tgt_request_handle+0x915/0x1610 [ptlrpc]
[ 8779.541604] [<ffffffffa058b3d9>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[ 8779.541604] [<ffffffffa058f3bc>] ptlrpc_main+0xb7c/0x22c0 [ptlrpc]
[ 8779.541604] [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 8779.541604] [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
Note that even before t10dif was landed I still saw this, just with a slightly different trace. It seems that in all cases only tgt_brw_read is hitting this. |
| Comments |
| Comment by Oleg Drokin [ 25/Feb/19 ] |
|
Here's another report of the same nature, but in adler:
BUG: unable to handle kernel paging request at ffff8802a91ff000
IP: [<ffffffffa020e3f0>] adler32_update+0x70/0x250 [libcfs]
PGD 241b067 PUD 33ebfa067 PMD 33eab1067 PTE 80000002a91ff060
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common virtio_balloon virtio_console i2c_piix4 pcspkr ip_tables rpcsec_gss_krb5 ata_generic pata_acpi drm_kms_helper ttm drm ata_piix drm_panel_orientation_quirks serio_raw virtio_blk libata i2c_core floppy [last unloaded: libcfs]
CPU: 11 PID: 22659 Comm: ll_ost_io05_006 Kdump: loaded Tainted: P OE ------------ 3.10.0-7.6-debug #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
task: ffff88029dfbe200 ti: ffff8802fa270000 task.ti: ffff8802fa270000
RIP: 0010:[<ffffffffa020e3f0>] [<ffffffffa020e3f0>] adler32_update+0x70/0x250 [libcfs]
RSP: 0018:ffff8802fa273840 EFLAGS: 00010212
RAX: 0000000000001000 RBX: 0000000000001000 RCX: 0000000000000002
RDX: 0000000000001000 RSI: ffff8802a91ff000 RDI: ffff8802a91ff000
RBP: ffff8802fa2738a8 R08: 0000000000000000 R09: 0000000000001000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffea00094fefca
R13: 0000000000001000 R14: ffffffffa0234cd0 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff88033dcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8802a91ff000 CR3: 0000000223968000 CR4: 00000000000006e0
Call Trace:
[<ffffffffa020d4cd>] ? cfs_crypto_hash_alloc+0xcd/0x440 [libcfs]
[<ffffffff813910b7>] crypto_shash_update+0x47/0x120
[<ffffffff813913de>] shash_ahash_update+0x3e/0x70
[<ffffffff81391422>] shash_async_update+0x12/0x20
[<ffffffffa020d3b3>] cfs_crypto_hash_update_page+0x93/0xc0 [libcfs]
[<ffffffffa061847e>] tgt_checksum_niobuf_rw+0x8ce/0xea0 [ptlrpc]
[<ffffffffa035f0e5>] ? lprocfs_stats_unlock+0x45/0x50 [obdclass]
[<ffffffffa0361119>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[<ffffffffa1232326>] ? ofd_preprw+0x5d6/0x1160 [ofd]
[<ffffffffa05dbcad>] ? __req_capsule_get+0x15d/0x700 [ptlrpc]
[<ffffffffa0393e10>] ? obd_dif_crc_fn+0x20/0x20 [obdclass]
[<ffffffffa061a41d>] tgt_brw_read+0xc2d/0x1e60 [ptlrpc]
[<ffffffff812127f4>] ? __kmalloc+0x634/0x660
[<ffffffff813eca64>] ? vsnprintf+0x234/0x6a0
[<ffffffffa0361119>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[<ffffffffa05b6fe6>] ? lustre_pack_reply_v2+0x166/0x290 [ptlrpc]
[<ffffffffa05b717f>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[<ffffffffa05b7301>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[<ffffffffa061e355>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
[<ffffffffa0211fa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[<ffffffffa05c2436>] ptlrpc_server_handle_request+0x256/0xad0 [ptlrpc]
[<ffffffffa05c6329>] ptlrpc_main+0xa99/0x1f60 [ptlrpc]
[<ffffffff810c32ed>] ? finish_task_switch+0x5d/0x1b0
[<ffffffffa05c5890>] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[<ffffffff810b4ed4>] kthread+0xe4/0xf0
[<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140 |
| Comment by James A Simmons [ 25/Feb/19 ] |
|
Oleg, can you reproduce this? |
| Comment by Alex Zhuravlev [ 31/Oct/22 ] |
|
Up-to-date master, ZFS backend:
[ 59.782132] Lustre: DEBUG MARKER: == racer test 1: racer on clients: tmp.MXpJUErHf4 DURATION=2700 ========================================================== 12:36:17 (1667133377)
..
[ 2243.840113] BUG: unable to handle kernel paging request at ffff89346d5a4000
[ 2243.840277] PGD 110e01067 P4D 110e01067 PUD 1b0979067 PMD 1b080e067 PTE 800ffffe92a5b060
[ 2243.840328] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2243.840355] CPU: 0 PID: 6959 Comm: ll_ost_io00_002 Tainted: G W O --------- - - 4.18.0 #2
[ 2243.840409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2243.840443] RIP: 0010:do_csum+0x6a/0x160
[ 2243.840472] Code: c7 04 0f 85 bb 00 00 00 41 89 c1 c1 e8 04 41 d1 e9 85 c0 0f 84 ff 00 00 00 83 e8 01 45 31 c0 48 83 c0 01 48 c1 e0 06 48 01 f8 <48> 03 17 48 13 57 08 48 13 57 10 48 13 57 18 48 13 57 20 48 13 57
[ 2243.840560] RSP: 0018:ffff89349c66fb18 EFLAGS: 00010286
[ 2243.840586] RAX: ffff89346d5a5000 RBX: 0000000000001000 RCX: 0000000000000000
[ 2243.840622] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff89346d5a4000
[ 2243.840658] RBP: 0000000000001000 R08: 0000000000000000 R09: 0000000000000200
[ 2243.840694] R10: 0000000000000000 R11: ffff89346d5a4000 R12: ffff89346dde7002
[ 2243.840730] R13: 0000000000001000 R14: ffff89346d5a4000 R15: 0000000000000001
[ 2243.840767] FS: 0000000000000000(0000) GS:ffff8934a9c00000(0000) knlGS:0000000000000000
[ 2243.840816] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2243.840846] CR2: ffff89346d5a4000 CR3: 00000000d3b2c000 CR4: 00000000000006b0
[ 2243.840885] Call Trace:
[ 2243.840912] ip_compute_csum+0x5/0x30
[ 2243.840975] obd_page_dif_generate_buffer+0xf8/0x1b0 [obdclass]
[ 2243.841101] tgt_checksum_niobuf_rw+0xa2d/0x15a0 [ptlrpc]
[ 2243.841200] ? obd_dif_crc_fn+0x10/0x10 [obdclass]
[ 2243.841260] ? obd_dif_crc_fn+0x10/0x10 [obdclass]
[ 2243.841340] tgt_brw_read+0x1752/0x2010 [ptlrpc]
[ 2243.841393] ? static_obj+0x2d/0x50
[ 2243.841422] ? lockdep_init_map_waits+0x4b/0x210
[ 2243.841492] ? lustre_pack_reply_v2+0x20b/0x2b0 [ptlrpc]
[ 2243.841580] ? lustre_pack_reply_flags+0x55/0x1b0 [ptlrpc]
[ 2243.841671] tgt_request_handle+0x977/0x1a40 [ptlrpc]
[ 2243.841756] ptlrpc_main+0x1724/0x32c0 [ptlrpc]
[ 2243.841844] ? ptlrpc_wait_event+0x4b0/0x4b0 [ptlrpc] |