Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14307

sanity test 60g crashes with ‘unable to handle kernel NULL pointer dereference at 0000000000000020’ in lquota_lqe_debug

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • Lustre 2.14.0
    • None
    • 3
    • 9223372036854775807

    Description

      sanity test_60g crashes with ‘BUG: unable to handle kernel NULL pointer dereference at 0000000000000020’. It looks like this crash started on 14 MAY 2020 wth Lustre 2.13.53.163; https://testing.whamcloud.com/test_sets/43a85730-bdef-4e31-96c1-e55123e61834.

      Looking at the recent failure at https://testing.whamcloud.com/test_sets/2e77f7fb-66a5-4f96-8c6f-2da5e50d758c, MDS1,3 (vm4) crashes with

      [ 6267.308955] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung =========================================== 02:08:06 (1608775686)
      [ 6268.654043] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6269.299436] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 6270.148017] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6270.553314] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 6271.630819] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6272.352231] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 6272.620944] LustreError: 90555:0:(llog_cat.c:757:llog_cat_cancel_arr_rec()) lustre-MDT0003-osp-MDT0002: fail to cancel 1 llog-records: rc = -116
      [ 6272.623393] LustreError: 90555:0:(llog_cat.c:794:llog_cat_cancel_records()) lustre-MDT0003-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
      [ 6273.154697] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6273.473004] LustreError: 90555:0:(llog_cat.c:757:llog_cat_cancel_arr_rec()) lustre-MDT0003-osp-MDT0002: fail to cancel 1 llog-records: rc = -116
      [ 6273.475379] LustreError: 90555:0:(llog_cat.c:794:llog_cat_cancel_records()) lustre-MDT0003-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
      [ 6274.668119] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6275.360154] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 6275.361107] Lustre: Skipped 1 previous similar message
      [ 6275.362104] LustreError: 317468:0:(llog_cat.c:757:llog_cat_cancel_arr_rec()) lustre-MDT0000-osd: fail to cancel 1 llog-records: rc = -5
      [ 6275.364276] LustreError: 317468:0:(llog_cat.c:794:llog_cat_cancel_records()) lustre-MDT0000-osd: fail to cancel 1 of 1 llog-records: rc = -5
      [ 6276.182334] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6277.678186] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6279.180357] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6279.950227] Lustre: *** cfs_fail_loc=19a, val=0***
      [ 6279.951313] Lustre: Skipped 2 previous similar messages
      [ 6280.666107] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
      [ 6281.046877] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
      [ 6281.048259] PGD 0 P4D 0 
      [ 6281.048701] Oops: 0000 [#1] SMP PTI
      [ 6281.049294] CPU: 1 PID: 268538 Comm: mdt_rdpg00_003 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
      [ 6281.051461] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 6281.052496] RIP: 0010:lquota_lqe_debug0+0x2c/0xd0 [lquota]
      [ 6281.053453] Code: 66 66 90 55 48 89 e5 48 83 ec 60 48 89 4c 24 48 4c 89 44 24 50 4c 89 4c 24 58 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 <48> 8b 47 20 48 8b 48 10 48 83 79 10 00 74 55 49 89 f2 48 8d 75 10
      [ 6281.056543] RSP: 0018:ffffb686c0b73b48 EFLAGS: 00010246
      [ 6281.057413] RAX: 0000000000000000 RBX: ffff9743f99edc60 RCX: 0000000200000005
      [ 6281.058620] RDX: ffffffffc1596030 RSI: ffffffffc15aad00 RDI: 0000000000000000
      [ 6281.059796] RBP: ffffb686c0b73ba8 R08: 0000000000001008 R09: 0000000000000000
      [ 6281.060976] R10: fffff9954140e080 R11: ffff974411e19b0c R12: fffffffffffffffb
      [ 6281.062154] R13: 0000000000000000 R14: ffff9743eda32f20 R15: ffff9743f99edc68
      [ 6281.063331] FS:  0000000000000000(0000) GS:ffff97443fd00000(0000) knlGS:0000000000000000
      [ 6281.064659] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6281.065597] CR2: 0000000000000020 CR3: 000000007d80a002 CR4: 00000000000606e0
      [ 6281.066767] Call Trace:
      [ 6281.067290]  ? libcfs_log_return+0x1e/0x30 [libcfs]
      [ 6281.068170]  ? osd_trans_stop+0x440/0x560 [osd_zfs]
      [ 6281.068988]  qmt_trans_start_with_slv+0x4b2/0x780 [lquota]
      [ 6281.069912]  qmt_dqacq0+0x3b1/0x2380 [lquota]
      [ 6281.070654]  ? qmt_dqacq+0x668/0x790 [lquota]
      [ 6281.071384]  qmt_dqacq+0x668/0x790 [lquota]
      [ 6281.072234]  mdt_quota_dqacq+0x59/0x120 [mdt]
      [ 6281.073417]  tgt_request_handle+0xc78/0x1910 [ptlrpc]
      [ 6281.074335]  ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
      [ 6281.075361]  ptlrpc_main+0xba4/0x14a0 [ptlrpc]
      [ 6281.076133]  ? __schedule+0x2ae/0x700
      [ 6281.076794]  ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
      [ 6281.077740]  kthread+0x112/0x130
      [ 6281.078318]  ? kthread_flush_work_fn+0x10/0x10
      [ 6281.079072]  ret_from_fork+0x35/0x40
      [ 6281.079677] Modules linked in: lustre(OE) obdecho(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp rpcrdma rdma_ucm ib_iser rdma_cm ib_umad iw_cm ib_ipoib libiscsi scsi_transport_iscsi ib_cm mlx4_ib ib_uverbs sunrpc ib_core intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel pcspkr virtio_balloon i2c_piix4 joydev ip_tables ext4 mbcache jbd2 mlx4_en ata_generic mlx4_core ata_piix 8139too libata 8139cp crc32c_intel serio_raw virtio_blk mii [last unloaded: llog_test]
      

      Some of the crashes, like at https://testing.whamcloud.com/test_sets/ad77dc9b-c06b-4763-8dfb-23d682d5608d, have a different call trace

      [ 6394.913521] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
      [ 6394.915034] PGD 0 P4D 0 
      [ 6394.915506] Oops: 0000 [#1] SMP PTI
      [ 6394.916122] CPU: 1 PID: 46502 Comm: mdt_rdpg00_000 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-193.19.1.el8_lustre.x86_64 #1
      [ 6394.918271] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 6394.919484] RIP: 0010:lquota_lqe_debug0+0x2c/0xd0 [lquota]
      [ 6394.920435] Code: 66 66 90 55 48 89 e5 48 83 ec 60 48 89 4c 24 48 4c 89 44 24 50 4c 89 4c 24 58 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 <48> 8b 47 20 48 8b 48 10 48 83 79 10 00 74 55 49 89 f2 48 8d 75 10
      [ 6394.923548] RSP: 0018:ffffa87f00b379b8 EFLAGS: 00010246
      [ 6394.924431] RAX: 0000000000000000 RBX: ffff964a759a7460 RCX: 0000000200000005
      [ 6394.925619] RDX: ffffffffc1782d30 RSI: ffffffffc1797d00 RDI: 0000000000000000
      [ 6394.926817] RBP: ffffa87f00b37a18 R08: 0000000000001011 R09: 0000000000000000
      [ 6394.928005] R10: ffffe45740ef5980 R11: ffff964a5ddea312 R12: fffffffffffffffb
      [ 6394.929198] R13: 0000000000000000 R14: ffff964a6281e6e0 R15: ffff964a759a7468
      [ 6394.930408] FS:  0000000000000000(0000) GS:ffff964abfd00000(0000) knlGS:0000000000000000
      [ 6394.931741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6394.932705] CR2: 0000000000000020 CR3: 000000002b20a004 CR4: 00000000000606e0
      [ 6394.933903] Call Trace:
      [ 6394.934630]  ? libcfs_log_return+0x1e/0x30 [libcfs]
      [ 6394.935665]  ? osd_trans_stop+0x440/0x560 [osd_zfs]
      [ 6394.936508]  qmt_trans_start_with_slv+0x4b2/0x780 [lquota]
      [ 6394.937460]  qmt_dqacq0+0x3b1/0x2380 [lquota]
      [ 6394.938216]  ? libcfs_log_return+0x1e/0x30 [libcfs]
      [ 6394.939062]  ? qmt_intent_policy+0x89d/0xeb0 [lquota]
      [ 6394.939926]  ? qmt_intent_policy+0x860/0xeb0 [lquota]
      [ 6394.940801]  qmt_intent_policy+0x89d/0xeb0 [lquota]
      [ 6394.942105]  mdt_intent_opc+0x9a1/0xa80 [mdt]
      [ 6394.942915]  mdt_intent_policy+0x111/0x380 [mdt]
      [ 6394.945058]  ldlm_lock_enqueue+0x4c1/0x9f0 [ptlrpc]
      [ 6394.945965]  ? cfs_hash_multi_bd_lock+0xa0/0xa0 [libcfs]
      [ 6394.946912]  ldlm_handle_enqueue0+0x60f/0x16d0 [ptlrpc]
      [ 6394.947884]  tgt_enqueue+0xa4/0x1f0 [ptlrpc]
      [ 6394.948715]  tgt_request_handle+0xc78/0x1910 [ptlrpc]
      [ 6394.949639]  ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
      [ 6394.950707]  ptlrpc_main+0xba4/0x14a0 [ptlrpc]
      [ 6394.951595]  ? __schedule+0x257/0x650
      [ 6394.952280]  ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
      [ 6394.953292]  kthread+0x112/0x130
      [ 6394.953887]  ? kthread_flush_work_fn+0x10/0x10
      [ 6394.954653]  ret_from_fork+0x35/0x40
      

      Logs for more crashes are at
      https://testing.whamcloud.com/test_sets/d6b80c94-e3da-4ac6-bc2a-282f72de2a4c
      https://testing.whamcloud.com/test_sets/ed800f02-efae-4354-a4ca-862609a7ca68

      Attachments

        Issue Links

          Activity

            People

              scherementsev Sergey Cheremencev
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: