[LU-11528] sanity-lfsck test_11a: soft lockup - CPU#0 stuck for 22s

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description

      This issue was created by maloo for wangshilong <wshilong@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/0081d858-d128-11e8-ad90-52540065bddc

      test_11a failed with the following error:

      trevis-3vm10 crashed during sanity-lfsck test_11a
      

      [ 8560.259642] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [tx_commit_cb:17212]
      [ 8560.260765] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod zfs(POE) zunicode(POE) zavl(POE) icp(POE) iosf_mbi crc32_pclmul ghash_clmulni_intel zcommon(POE) znvpair(POE) spl(OE) ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 i2c_core parport_pc parport ip_tables ata_generic pata_acpi ext4 mbcache jbd2 virtio_blk ata_piix
      [ 8560.271741] 8139too crct10dif_pclmul crct10dif_common libata crc32c_intel virtio_pci 8139cp serio_raw virtio_ring virtio mii floppy
      [ 8560.273464] CPU: 0 PID: 17212 Comm: tx_commit_cb Kdump: loaded Tainted: P OE ------------ 3.10.0-862.14.4.el7_lustre.x86_64 #1
      [ 8560.275047] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 8560.275786] task: ffff8cc8e51dcf10 ti: ffff8cc8d7594000 task.ti: ffff8cc8d7594000
      [ 8560.276737] RIP: 0010:[<ffffffffc0d03b69>] [<ffffffffc0d03b69>] dt_txn_hook_commit+0x49/0x60 [obdclass]
      [ 8560.278114] RSP: 0018:ffff8cc8d7597cc8 EFLAGS: 00000246
      [ 8560.278800] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff8cc8df5a6000
      [ 8560.279713] RDX: ffff8cc8df5a6048 RSI: 0000000000000000 RDI: ffff8cc8d83dba00
      [ 8560.280628] RBP: ffff8cc8d7597cd8 R08: ffff8cc8df18f020 R09: 0000000000000001
      [ 8560.281543] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8cc8ffc18bc0
      [ 8560.282456] R13: ffff8cc8e51dcf78 R14: 0000000100010700 R15: ffff8cc8df555c00
      [ 8560.283369] FS: 0000000000000000(0000) GS:ffff8cc8ffc00000(0000) knlGS:0000000000000000
      [ 8560.284416] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8560.285160] CR2: 00007fc5a1b15000 CR3: 000000000f20e000 CR4: 00000000000606f0
      [ 8560.286082] Call Trace:
      [ 8560.286457] [<ffffffffc116b788>] osd_trans_commit_cb+0xe8/0x480 [osd_zfs]
      [ 8560.287462] [<ffffffffc04fca98>] dmu_tx_do_callbacks+0x48/0x70 [zfs]
      [ 8560.288336] [<ffffffffc0547724>] txg_do_callbacks+0x14/0x30 [zfs]
      [ 8560.289169] [<ffffffffc03ddd2c>] taskq_thread+0x2ac/0x4f0 [spl]
      [ 8560.289981] [<ffffffff932d2010>] ? wake_up_state+0x20/0x20
      [ 8560.290715] [<ffffffffc03dda80>] ? taskq_thread_spawn+0x60/0x60 [spl]
      [ 8560.291584] [<ffffffff932bdf21>] kthread+0xd1/0xe0
      [ 8560.292220] [<ffffffff932bde50>] ? insert_kthread_work+0x40/0x40
      [ 8560.293026] [<ffffffff939255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [ 8560.293870] [<ffffffff932bde50>] ? insert_kthread_work+0x40/0x40
      [ 8560.294660] Code: 51 48 48 39 d0 48 8d 58 d8 74 31 0f 1f 80 00 00 00 00 48 8b 43 10 48 85 c0 74 10 48 8b 73 18 4c 89 e7 e8 5b b9 85 d2 49 8b 0c 24 <48> 8b 43 28 48 8d 51 48 48 39 d0 48 8d 58 d8 75 d6 5b 41 5c 5d
      [ 8560.298709] Kernel panic - not syncing: softlockup: hung tasks

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lfsck test_11a - trevis-3vm10 crashed during sanity-lfsck test_11a

          Activity

            adilger Andreas Dilger added a comment - This is the same as LU-9706.
            adilger Andreas Dilger added a comment - +1 https://testing.whamcloud.com/test_sets/b4abcd02-0044-11e9-93ea-52540065bddc

            adilger Andreas Dilger added a comment - This patch was landed to the zfs-0.7-release branch as zfs-0.7.5-18-g8d82a19de, so it should already be included in our testing.
            wshilong Wang Shilong (Inactive) added a comment - edited

            One possibly related fix:

            commit 823d48bfb182137c53b9432498f1f0564eaa8bfc
            Author: lidongyang <gnaygnodil@gmail.com>
            Date: Sat Dec 23 05:19:51 2017 +1100

            Call commit callbacks from the tail of the list

            Our zfs backed Lustre MDT had soft lockups while under heavy metadata
            workloads while handling transaction callbacks from osd_zfs.

            The problem is zfs is not taking advantage of the fast path in
            Lustre's trans callback handling, where Lustre will skip the calls
            to ptlrpc_commit_replies() when it already saw a higher transaction
            number.

            This patch corrects this, it also has a positive impact on metadata
            performance on Lustre with osd_zfs, plus some cleanup in the headers.

            A similar issue for ext4/ldiskfs is described on: LU-6527

            Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
            Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
            Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
            Closes #6986

            The above fix has been included in releases since zfs-0.8.0-rc1.
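
The fast path described in the commit message can be illustrated with a toy model. This is only a sketch, not Lustre or ZFS code: `CommitTracker`, `on_commit`, and `run_callbacks` are hypothetical names, the counter merely stands in for calls to ptlrpc_commit_replies(), and it assumes the pending callbacks sit in the list in ascending transaction-number order.

```python
# Toy model only: none of these names exist in Lustre or ZFS.

class CommitTracker:
    """Tracks the highest committed transaction number seen so far."""

    def __init__(self):
        self.last_committed = 0
        self.slow_path_calls = 0  # stands in for ptlrpc_commit_replies()

    def on_commit(self, transno):
        # Fast path: a higher (or equal) transaction number was already
        # seen, so this callback has nothing new to process.
        if transno <= self.last_committed:
            return
        self.last_committed = transno
        self.slow_path_calls += 1


def run_callbacks(transnos, from_tail):
    """Run one commit callback per entry and return the slow-path count."""
    tracker = CommitTracker()
    order = reversed(transnos) if from_tail else transnos
    for transno in order:
        tracker.on_commit(transno)
    return tracker.slow_path_calls


if __name__ == "__main__":
    # Assumption: callbacks were registered in ascending transno order.
    pending = list(range(1, 1001))
    print("head first:", run_callbacks(pending, from_tail=False))
    print("tail first:", run_callbacks(pending, from_tail=True))
```

Under these assumptions, walking the list from the head takes the slow path once per callback, while walking it from the tail takes the slow path only for the first (highest-numbered) callback and skips the rest.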

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo