Lustre / LU-15326

sanity-quota test 40 crashing in llog_cat_cancel_arr_rec()

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.15.0
    • None
    • 3

    Description

      Several sanity-quota tests, not only test 40, crash with the following trace:

      [29469.651591] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
      [29469.652967] PGD 0 P4D 0 
      [29469.653393] Oops: 0000 [#1] SMP PTI
      [29469.653951] CPU: 0 PID: 603214 Comm: dist_txn-1 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-305.19.1.el8_lustre.x86_64 #1
      [29469.655910] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [29469.656976] RIP: 0010:llog_cat_cancel_arr_rec+0xcb/0x460 [obdclass]
      [29469.657976] Code: c3 48 b8 40 00 00 00 cb 02 00 00 c7 05 62 ad 0d 00 00 00 08 00 48 c7 c6 a8 3c 01 c1 48 c7 c7 20 44 06 c1 48 89 05 45 ad 0d 00 <49> 8b 84 24 b0 00 00 00 48 c7 05 42 ad 0d 00 00 00 00 00 48 c7 05
      [29469.660898] RSP: 0018:ffffbb91053e3de8 EFLAGS: 00010202
      [29469.661731] RAX: 000002cb00000040 RBX: 00000000fffffff7 RCX: ffff9cca74dcf880
      [29469.662856] RDX: ffffbb91053e3de8 RSI: ffffffffc1013ca8 RDI: ffffffffc1064420
      [29469.663978] RBP: ffff9cca8e182750 R08: ffff9cca74dcf89c R09: abcc77118461cefd
      [29469.665101] R10: ffffbb91053e3e70 R11: ffff9cca64c0596e R12: 0000000000000000
      [29469.666235] R13: ffff9cca74dcf880 R14: 0000000000000001 R15: ffff9cca74dcf89c
      [29469.667360] FS:  0000000000000000(0000) GS:ffff9ccabfc00000(0000) knlGS:0000000000000000
      [29469.668631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [29469.669536] CR2: 00000000000000b0 CR3: 0000000016010006 CR4: 00000000001706f0
      [29469.670666] Call Trace:
      [29469.671103]  llog_cat_cancel_records+0x61/0x190 [obdclass]
      [29469.672388]  distribute_txn_commit_thread+0x3f5/0xa80 [ptlrpc]
      [29469.673385]  ? distribute_txn_commit_batchid_update+0x890/0x890 [ptlrpc]
      [29469.674453]  kthread+0x116/0x130
      [29469.674970]  ? kthread_flush_work_fn+0x10/0x10
      [29469.675680]  ret_from_fork+0x35/0x40
      [29469.676256] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) dm_mod intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic virtio_net ata_piix crc32c_intel libata serio_raw virtio_blk net_failover failover
      [29469.685081] CR2: 00000000000000b0
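
One way to read the oops above: the bytes marked `<49> 8b 84 24 b0 00 00 00` in the Code line appear to decode to `mov 0xb0(%r12),%rax`, and R12 is 0, which would match CR2 = 0xb0, i.e. a field at offset 0xb0 read through a NULL pointer. The Python sketch below only checks that arithmetic against the register dump. The helper `decode_mov_base_disp32` is hypothetical and handles just this one instruction encoding; which structure R12 was supposed to point to cannot be determined from the trace alone.

```python
# Values copied from the register dump in the first oops.
R12 = 0x0000000000000000
CR2 = 0x00000000000000B0

# Bytes starting at the faulting instruction, marked "<49>" in the Code line.
FAULT_BYTES = bytes.fromhex("498b8424b0000000")

REG_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_mov_base_disp32(code: bytes):
    """Decode only 'REX 8b /r' with a SIB byte, no index, and a disp32.

    Hypothetical helper for this one encoding, not a general disassembler.
    Returns (destination register, base register, displacement).
    """
    rex, opcode, modrm, sib = code[0], code[1], code[2], code[3]
    if rex & 0xF0 != 0x40 or opcode != 0x8B:
        raise ValueError("not a REX-prefixed 'mov r64, r/m64'")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm != 0b100:
        raise ValueError("expected SIB byte followed by disp32")
    index, base = (sib >> 3) & 7, sib & 7
    rex_r, rex_x, rex_b = (rex >> 2) & 1, (rex >> 1) & 1, rex & 1
    if index != 0b100 or rex_x:
        raise ValueError("index register present; not handled here")
    dest = reg | (rex_r << 3)      # REX.R extends the destination register
    base = base | (rex_b << 3)     # REX.B extends the base register
    disp = int.from_bytes(code[4:8], "little", signed=True)
    return REG_NAMES[dest], REG_NAMES[base], disp


dest, base, disp = decode_mov_base_disp32(FAULT_BYTES)
effective = R12 + disp  # the base register is r12 in this encoding

print(f"mov {disp:#x}(%{base}),%{dest}")
print(f"effective address {effective:#x}, CR2 {CR2:#x}")
```

If the decoding is right, the effective address equals CR2, so the fault is consistent with a NULL base pointer in R12 rather than a corrupted one.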
      

      The crash was first seen on March 5, 2021, in replay-dual test_22d for patch https://review.whamcloud.com/41424. It has since been seen in:
      August 21, 2021 - sanity-quota test_40d at https://testing.whamcloud.com/test_sets/a00ff859-ed17-46a1-a374-9b1036d090dd
      August 25, 2021 - sanity-quota test_41 at https://testing.whamcloud.com/test_sets/563e7ce0-fcb9-4ca6-9e00-19df529e38e1
      October 27, 2021 - sanity-quota test_50 at https://testing.whamcloud.com/test_sets/2d17889e-99d1-4d17-b60a-8311a7fc74ad
      October 30, 2021 - sanity-quota test_41 at https://testing.whamcloud.com/test_sets/3bce26eb-02e4-4362-b245-8ccbc70295a5
      December 4, 2021 - sanity-quota test_40d at https://testing.whamcloud.com/test_sets/5c3cc4df-3055-46c6-a651-5bc9b03550b2

      The following crash may be related:

      [76868.809616] BUG: unable to handle kernel paging request at 00000060000000f0
      [76868.810911] PGD 0 P4D 0 
      [76868.811380] Oops: 0000 [#1] SMP PTI
      [76868.811999] CPU: 1 PID: 3223309 Comm: dist_txn-2 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-305.19.1.el8_lustre.x86_64 #1
      [76868.814127] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [76868.815154] RIP: 0010:rwsem_down_write_slowpath+0x3e9/0x610
      [76868.816134] Code: f8 01 0f 85 78 fe ff ff 48 8b 7c 24 18 41 be 02 00 00 00 e8 e9 1f 81 00 e9 55 fd ff ff 48 89 c2 48 83 e2 fc 74 21 a8 01 75 1d <8b> 42 38 85 c0 0f 84 75 fc ff ff 8b 7a 3c e8 f4 c5 f2 ff 66 90 84
      [76868.819295] RSP: 0018:ffffb4944b173ce0 EFLAGS: 00010246
      [76868.820190] RAX: 00000060000000b8 RBX: ffff8ecca011e950 RCX: 0000000000000000
      [76868.821399] RDX: 00000060000000b8 RSI: 0000000000000002 RDI: ffff8ecc7615ec00
      [76868.822607] RBP: ffffb4944b173d80 R08: 0000000000000bf8 R09: ffffb4944b173a84
      [76868.823818] R10: ffffb4944b173d40 R11: ffff8eccb80e1bf6 R12: 0000000000000000
      [76868.825026] R13: ffffffff00000000 R14: ffff8ecc7615ecb8 R15: ffff8ecc7615ec00
      [76868.826288] FS:  0000000000000000(0000) GS:ffff8eccffd00000(0000) knlGS:0000000000000000
      [76868.827708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [76868.828717] CR2: 00000060000000f0 CR3: 0000000087410004 CR4: 00000000000606e0
      [76868.829937] Call Trace:
      [76868.830663]  ? llog_init_handle+0x15d/0x9d0 [obdclass]
      [76868.831577]  ? llog_init_handle+0x15d/0x9d0 [obdclass]
      [76868.832482]  llog_cat_id2handle+0x3ce/0x6d0 [obdclass]
      [76868.833398]  llog_cat_cancel_arr_rec+0x59/0x460 [obdclass]
      [76868.834365]  llog_cat_cancel_records+0x61/0x190 [obdclass]
      [76868.835743]  distribute_txn_commit_thread+0x3f5/0xa80 [ptlrpc]
      [76868.836814]  ? distribute_txn_commit_batchid_update+0x890/0x890 [ptlrpc]
      [76868.837989]  kthread+0x116/0x130
      [76868.838581]  ? kthread_flush_work_fn+0x10/0x10
      [76868.839378]  ret_from_fork+0x35/0x40
      [76868.840040] Modules linked in: nfsd nfs_acl obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix crc32c_intel virtio_net libata serio_raw virtio_blk net_failover failover [last unloaded: llog_test]
      [76868.850790] CR2: 00000060000000f0
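
The second oops can be sanity-checked in a similar way. The bytes before the fault (`48 89 c2 48 83 e2 fc`) appear to copy RAX into RDX and clear its two low bits, and the faulting `<8b> 42 38` appears to be `mov 0x38(%rdx),%eax`. The Python sketch below checks that CR2 = RDX + 0x38 and that RDX does not look like a kernel address, unlike RDI, R14 and R15. It assumes 48-bit virtual addressing, where kernel addresses start at 0xffff800000000000; `looks_like_kernel_pointer` is a hypothetical helper, and the interpretation of the value as a pointer read from the semaphore is a guess from the instruction bytes, not something the trace states.

```python
# Assumption: x86-64 with 48-bit virtual addresses, so kernel-half
# addresses are at or above this value.
KERNEL_HALF_START = 0xFFFF_8000_0000_0000
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

# Values copied from the register dump in the second oops.
regs = {
    "RAX": 0x00000060000000B8,
    "RDX": 0x00000060000000B8,
    "RDI": 0xFFFF8ECC7615EC00,
    "R14": 0xFFFF8ECC7615ECB8,
    "R15": 0xFFFF8ECC7615EC00,
}
CR2 = 0x00000060000000F0


def looks_like_kernel_pointer(value: int) -> bool:
    """Hypothetical helper: True if value lies in the kernel half."""
    return value >= KERNEL_HALF_START


# "and $0xfffffffffffffffc,%rdx" clears the two low bits of the copied RAX.
masked = regs["RAX"] & ~0x3 & MASK64

# Displacement implied by the fault address and the base register.
fault_offset = CR2 - regs["RDX"]

print(f"masked RAX   = {masked:#x}")
print(f"fault offset = {fault_offset:#x}")
for name, value in regs.items():
    kind = "kernel" if looks_like_kernel_pointer(value) else "NOT kernel"
    print(f"{name} = {value:#018x} ({kind})")
```

Under these assumptions, the dereferenced value in RDX is not a valid kernel pointer, which would be consistent with the lock being read from memory that was freed or overwritten. That reading is tentative and would need a crash dump to confirm.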
      

      August 6, 2021 - sanity-quota test_40d at https://testing.whamcloud.com/test_sets/ae2e9fb4-49cc-479b-a7a3-5d5651edcac7
      September 1, 2021 - sanity-quota test_50 at https://testing.whamcloud.com/test_sets/376a8df2-04a0-4f78-aea0-6667d7a2a7f2
      September 12, 2021 - sanity-quota test_52 at https://testing.whamcloud.com/test_sets/68886881-3803-42c2-becc-0578a8c7aeb0
      September 22, 2021 - sanity-quota test_63 at https://testing.whamcloud.com/test_sets/2eefc60a-83d9-4455-9ad7-8d7ca29a4899
      September 30, 2021 - sanity-quota test_63 at https://testing.whamcloud.com/test_sets/61c3b7a4-8715-4bb1-9738-b3bdd397d67f
      December 2, 2021 - sanity-quota test_50 at https://testing.whamcloud.com/test_sets/e0d2923f-5df8-45ef-a29d-c1343f5cebef

          People

            wc-triage WC Triage
            jamesanunez James Nunez (Inactive)