Details
Bug
Resolution: Fixed
Minor
Lustre 2.14.0
Description
sanity test_60g crashes with 'BUG: unable to handle kernel NULL pointer dereference at 0000000000000020'. The crash appears to have started on 14 May 2020 with Lustre 2.13.53.163; see https://testing.whamcloud.com/test_sets/43a85730-bdef-4e31-96c1-e55123e61834.
In a recent failure at https://testing.whamcloud.com/test_sets/2e77f7fb-66a5-4f96-8c6f-2da5e50d758c, MDS1,3 (vm4) crashes with:
[ 6267.308955] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung =========================================== 02:08:06 (1608775686)
[ 6268.654043] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6269.299436] Lustre: *** cfs_fail_loc=19a, val=0***
[ 6270.148017] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6270.553314] Lustre: *** cfs_fail_loc=19a, val=0***
[ 6271.630819] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6272.352231] Lustre: *** cfs_fail_loc=19a, val=0***
[ 6272.620944] LustreError: 90555:0:(llog_cat.c:757:llog_cat_cancel_arr_rec()) lustre-MDT0003-osp-MDT0002: fail to cancel 1 llog-records: rc = -116
[ 6272.623393] LustreError: 90555:0:(llog_cat.c:794:llog_cat_cancel_records()) lustre-MDT0003-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
[ 6273.154697] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6273.473004] LustreError: 90555:0:(llog_cat.c:757:llog_cat_cancel_arr_rec()) lustre-MDT0003-osp-MDT0002: fail to cancel 1 llog-records: rc = -116
[ 6273.475379] LustreError: 90555:0:(llog_cat.c:794:llog_cat_cancel_records()) lustre-MDT0003-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
[ 6274.668119] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6275.360154] Lustre: *** cfs_fail_loc=19a, val=0***
[ 6275.361107] Lustre: Skipped 1 previous similar message
[ 6275.362104] LustreError: 317468:0:(llog_cat.c:757:llog_cat_cancel_arr_rec()) lustre-MDT0000-osd: fail to cancel 1 llog-records: rc = -5
[ 6275.364276] LustreError: 317468:0:(llog_cat.c:794:llog_cat_cancel_records()) lustre-MDT0000-osd: fail to cancel 1 of 1 llog-records: rc = -5
[ 6276.182334] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6277.678186] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6279.180357] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6279.950227] Lustre: *** cfs_fail_loc=19a, val=0***
[ 6279.951313] Lustre: Skipped 2 previous similar messages
[ 6280.666107] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 6281.046877] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 6281.048259] PGD 0 P4D 0
[ 6281.048701] Oops: 0000 [#1] SMP PTI
[ 6281.049294] CPU: 1 PID: 268538 Comm: mdt_rdpg00_003 Kdump: loaded Tainted: P OE --------- - - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[ 6281.051461] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6281.052496] RIP: 0010:lquota_lqe_debug0+0x2c/0xd0 [lquota]
[ 6281.053453] Code: 66 66 90 55 48 89 e5 48 83 ec 60 48 89 4c 24 48 4c 89 44 24 50 4c 89 4c 24 58 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 <48> 8b 47 20 48 8b 48 10 48 83 79 10 00 74 55 49 89 f2 48 8d 75 10
[ 6281.056543] RSP: 0018:ffffb686c0b73b48 EFLAGS: 00010246
[ 6281.057413] RAX: 0000000000000000 RBX: ffff9743f99edc60 RCX: 0000000200000005
[ 6281.058620] RDX: ffffffffc1596030 RSI: ffffffffc15aad00 RDI: 0000000000000000
[ 6281.059796] RBP: ffffb686c0b73ba8 R08: 0000000000001008 R09: 0000000000000000
[ 6281.060976] R10: fffff9954140e080 R11: ffff974411e19b0c R12: fffffffffffffffb
[ 6281.062154] R13: 0000000000000000 R14: ffff9743eda32f20 R15: ffff9743f99edc68
[ 6281.063331] FS: 0000000000000000(0000) GS:ffff97443fd00000(0000) knlGS:0000000000000000
[ 6281.064659] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6281.065597] CR2: 0000000000000020 CR3: 000000007d80a002 CR4: 00000000000606e0
[ 6281.066767] Call Trace:
[ 6281.067290] ? libcfs_log_return+0x1e/0x30 [libcfs]
[ 6281.068170] ? osd_trans_stop+0x440/0x560 [osd_zfs]
[ 6281.068988] qmt_trans_start_with_slv+0x4b2/0x780 [lquota]
[ 6281.069912] qmt_dqacq0+0x3b1/0x2380 [lquota]
[ 6281.070654] ? qmt_dqacq+0x668/0x790 [lquota]
[ 6281.071384] qmt_dqacq+0x668/0x790 [lquota]
[ 6281.072234] mdt_quota_dqacq+0x59/0x120 [mdt]
[ 6281.073417] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[ 6281.074335] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[ 6281.075361] ptlrpc_main+0xba4/0x14a0 [ptlrpc]
[ 6281.076133] ? __schedule+0x2ae/0x700
[ 6281.076794] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[ 6281.077740] kthread+0x112/0x130
[ 6281.078318] ? kthread_flush_work_fn+0x10/0x10
[ 6281.079072] ret_from_fork+0x35/0x40
[ 6281.079677] Modules linked in: lustre(OE) obdecho(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp rpcrdma rdma_ucm ib_iser rdma_cm ib_umad iw_cm ib_ipoib libiscsi scsi_transport_iscsi ib_cm mlx4_ib ib_uverbs sunrpc ib_core intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel pcspkr virtio_balloon i2c_piix4 joydev ip_tables ext4 mbcache jbd2 mlx4_en ata_generic mlx4_core ata_piix 8139too libata 8139cp crc32c_intel serio_raw virtio_blk mii [last unloaded: llog_test]
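Some of the crashes, like the one at https://testing.whamcloud.com/test_sets/ad77dc9b-c06b-4763-8dfb-23d682d5608d, have a different call trace. Here qmt_dqacq0 is reached through the LDLM enqueue path (qmt_intent_policy) rather than through qmt_dqacq, but the fault is at the same instruction, lquota_lqe_debug0+0x2c: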
[ 6394.913521] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 6394.915034] PGD 0 P4D 0
[ 6394.915506] Oops: 0000 [#1] SMP PTI
[ 6394.916122] CPU: 1 PID: 46502 Comm: mdt_rdpg00_000 Kdump: loaded Tainted: P OE --------- - - 4.18.0-193.19.1.el8_lustre.x86_64 #1
[ 6394.918271] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6394.919484] RIP: 0010:lquota_lqe_debug0+0x2c/0xd0 [lquota]
[ 6394.920435] Code: 66 66 90 55 48 89 e5 48 83 ec 60 48 89 4c 24 48 4c 89 44 24 50 4c 89 4c 24 58 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 <48> 8b 47 20 48 8b 48 10 48 83 79 10 00 74 55 49 89 f2 48 8d 75 10
[ 6394.923548] RSP: 0018:ffffa87f00b379b8 EFLAGS: 00010246
[ 6394.924431] RAX: 0000000000000000 RBX: ffff964a759a7460 RCX: 0000000200000005
[ 6394.925619] RDX: ffffffffc1782d30 RSI: ffffffffc1797d00 RDI: 0000000000000000
[ 6394.926817] RBP: ffffa87f00b37a18 R08: 0000000000001011 R09: 0000000000000000
[ 6394.928005] R10: ffffe45740ef5980 R11: ffff964a5ddea312 R12: fffffffffffffffb
[ 6394.929198] R13: 0000000000000000 R14: ffff964a6281e6e0 R15: ffff964a759a7468
[ 6394.930408] FS: 0000000000000000(0000) GS:ffff964abfd00000(0000) knlGS:0000000000000000
[ 6394.931741] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6394.932705] CR2: 0000000000000020 CR3: 000000002b20a004 CR4: 00000000000606e0
[ 6394.933903] Call Trace:
[ 6394.934630] ? libcfs_log_return+0x1e/0x30 [libcfs]
[ 6394.935665] ? osd_trans_stop+0x440/0x560 [osd_zfs]
[ 6394.936508] qmt_trans_start_with_slv+0x4b2/0x780 [lquota]
[ 6394.937460] qmt_dqacq0+0x3b1/0x2380 [lquota]
[ 6394.938216] ? libcfs_log_return+0x1e/0x30 [libcfs]
[ 6394.939062] ? qmt_intent_policy+0x89d/0xeb0 [lquota]
[ 6394.939926] ? qmt_intent_policy+0x860/0xeb0 [lquota]
[ 6394.940801] qmt_intent_policy+0x89d/0xeb0 [lquota]
[ 6394.942105] mdt_intent_opc+0x9a1/0xa80 [mdt]
[ 6394.942915] mdt_intent_policy+0x111/0x380 [mdt]
[ 6394.945058] ldlm_lock_enqueue+0x4c1/0x9f0 [ptlrpc]
[ 6394.945965] ? cfs_hash_multi_bd_lock+0xa0/0xa0 [libcfs]
[ 6394.946912] ldlm_handle_enqueue0+0x60f/0x16d0 [ptlrpc]
[ 6394.947884] tgt_enqueue+0xa4/0x1f0 [ptlrpc]
[ 6394.948715] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[ 6394.949639] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[ 6394.950707] ptlrpc_main+0xba4/0x14a0 [ptlrpc]
[ 6394.951595] ? __schedule+0x257/0x650
[ 6394.952280] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[ 6394.953292] kthread+0x112/0x130
[ 6394.953887] ? kthread_flush_work_fn+0x10/0x10
[ 6394.954653] ret_from_fork+0x35/0x40
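When triaging several crash logs for this ticket, it can help to reduce each oops to its faulting symbol and its reliable stack frames, then check where two traces diverge. The following is a minimal sketch of that idea, not part of the Lustre test framework. The function names parse_oops and shared_inner_frames are hypothetical, and the regular expressions assume the x86_64 oops layout seen in the logs above. Frames the kernel marks with "?" are treated as unreliable and skipped.

```python
import re

# Faulting symbol, e.g. "RIP: 0010:lquota_lqe_debug0+0x2c/0xd0 [lquota]".
RIP_RE = re.compile(r"RIP: \w+:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# One stack frame, e.g. "? osd_trans_stop+0x440/0x560 [osd_zfs]".
# Group 1 is set when the frame carries the "?" (unreliable) marker.
FRAME_RE = re.compile(r"(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_oops(text):
    """Return (faulting_symbol, reliable_frames) for one oops log.

    Works on both line-per-entry logs and logs flattened onto one line.
    """
    rip = RIP_RE.search(text)
    # Only look at what follows "Call Trace:" and precedes the module list.
    _, _, trace = text.partition("Call Trace:")
    trace = trace.split("Modules linked in:")[0]
    frames = [m.group(2) for m in FRAME_RE.finditer(trace) if not m.group(1)]
    return (rip.group(1) if rip else None), frames


def shared_inner_frames(frames_a, frames_b):
    """Return the innermost frames two traces have in common."""
    shared = []
    for a, b in zip(frames_a, frames_b):
        if a != b:
            break
        shared.append(a)
    return shared
```

Applied to the two traces in this ticket, a parser like this should report lquota_lqe_debug0 as the faulting symbol in both, with the traces diverging only above qmt_dqacq0. In both oopses RDI is 0 and CR2 is 0x20, which is consistent with a NULL first argument being read at offset 0x20, although the logs alone do not show why that pointer was NULL.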
Logs for more crashes are at
https://testing.whamcloud.com/test_sets/d6b80c94-e3da-4ac6-bc2a-282f72de2a4c
https://testing.whamcloud.com/test_sets/ed800f02-efae-4354-a4ca-862609a7ca68