Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for liuying <emoly.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ff7c62c8-13ad-11e6-9e5d-5254006e85c2.

      The sub-test test_47 failed with the following error:

      test failed to respond and timed out
      

      Please provide additional information about the failure here.

      MDS console showed:

      06:28:40:Call Trace:
      06:28:40: [<ffffffffa0c92a8c>] osd_trans_commit_cb+0x17c/0x3c0 [osd_zfs]
      06:28:40: [<ffffffffa021686d>] dmu_tx_do_callbacks+0x3d/0x60 [zfs]
      06:28:40: [<ffffffffa0258b58>] txg_do_callbacks+0x18/0x40 [zfs]
      06:28:40: [<ffffffffa019d727>] taskq_thread+0x1e7/0x3f0 [spl]
      06:28:40: [<ffffffff81067670>] ? default_wake_function+0x0/0x20
      06:28:40: [<ffffffffa019d540>] ? taskq_thread+0x0/0x3f0 [spl]
      06:28:40: [<ffffffff810a138e>] kthread+0x9e/0xc0
      06:28:40: [<ffffffff8100c28a>] child_rip+0xa/0x20
      06:28:40: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      06:28:40: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      06:28:40:Code: 48 8b 0f 48 8b 41 48 48 8d 51 48 48 39 d0 48 8d 58 d8 74 2e 0f 1f 80 00 00 00 00 48 8b 43 10 48 85 c0 74 0d 48 8b 73 18 4c 89 e7 <ff> d0 49 8b 0c 24 48 8b 43 28 48 8d 51 48 48 39 d0 48 8d 58 d8 
      06:28:40:RIP  [<ffffffffa136d550>] dt_txn_hook_commit+0x40/0x60 [obdclass]
      06:28:40: RSP <ffff88005a2f3d80>
      

      Info required for matching: conf-sanity 47

      Attachments

        Activity

          [LU-8119] conf-sanity test_47: timeout
          pjones Peter Jones added a comment -

          Landed for 2.10

          pjones Peter Jones added a comment - Landed for 2.10

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24202/
          Subject: LU-8119 osd: remove dcb entry from the list prior processing.
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: aa64476cef7d5f0050de20839925868013caa0b6

          gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24202/ Subject: LU-8119 osd: remove dcb entry from the list prior processing. Project: fs/lustre-release Branch: master Current Patch Set: Commit: aa64476cef7d5f0050de20839925868013caa0b6

          I see several kinds of this issue still when applying high load on OSD and then umount it. Alex reported the same issues are happening in his tests.

          tappro Mikhail Pershin added a comment - I see several kinds of this issue still when applying high load on OSD and then umount it. Alex reported the same issues are happening in his tests.

          I suppose that this bug is quite rare because there is only one commit callback in most cases but from JIRA I see it happens time to time and tickets are just closed when bug is not appearing for some time. Meanwhile it is critical bug anyway.

          tappro Mikhail Pershin added a comment - I suppose that this bug is quite rare because there is only one commit callback in most cases but from JIRA I see it happens time to time and tickets are just closed when bug is not appearing for some time. Meanwhile it is critical bug anyway.

          Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/24202
          Subject: LU-8119 osd: remove dcb entry from the list prior processing.
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 1b51c67568583be3d70506723c4ac8fa97176dcd

          gerrit Gerrit Updater added a comment - Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/24202 Subject: LU-8119 osd: remove dcb entry from the list prior processing. Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 1b51c67568583be3d70506723c4ac8fa97176dcd

          I saw this issue regularly in my tests and have patch for it. The problem is that code in osd_trans_commit_cb():

           /* call per-transaction callbacks if any */
           list_for_each_entry_safe(dcb, tmp, &oh->ot_dcb_list, dcb_linkage)
                   dcb->dcb_func(NULL, th, dcb, error);
          
          

          it misses list_del_init(); before dcb_func() call. the dcb entry can be destroyed by the dcb_func() and must be removed from the list prior that.

          Interesting that there is no such problem with ldiskfs OSD, the bug exists only in ZFS OSD code.

          I will push patch shortly.

          tappro Mikhail Pershin added a comment - I saw this issue regularly in my tests and have patch for it. The problem is that code in osd_trans_commit_cb():   /* call per-transaction callbacks if any */ list_for_each_entry_safe(dcb, tmp, &oh->ot_dcb_list, dcb_linkage) dcb->dcb_func(NULL, th, dcb, error); it misses list_del_init(); before dcb_func() call. the dcb entry can be destroyed by the dcb_func() and must be removed from the list prior that. Interesting that there is no such problem with ldiskfs OSD, the bug exists only in ZFS OSD code. I will push patch shortly.

          There's a similar stack trace for a conf-sanity test 44 timeout at https://testing.hpdd.intel.com/test_sets/ecc6376e-1892-11e6-855a-5254006e85c2. From the MDS console log:

          11:49:02:[10785.693635] Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
          11:49:02:[10786.476837] general protection fault: 0000 [#1] SMP 
          11:49:02:[10786.477004] Modules linked in: osd_zfs(OE) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd grace spl(OE) auth_rpcgss zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus ata_piix syscopyarea 8139too virtio_blk sysfillrect sysimgblt drm_kms_helper ttm drm floppy libata virtio_pci i2c_core serio_raw virtio_ring 8139cp virtio mii [last unloaded: libcfs]
          11:49:02:[10786.477004] CPU: 0 PID: 6194 Comm: tx_commit_cb Tainted: P           OE  ------------   3.10.0-327.13.1.el7_lustre.x86_64 #1
          11:49:02:[10786.477004] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
          11:49:02:[10786.477004] task: ffff88007a155080 ti: ffff88004d714000 task.ti: ffff88004d714000
          11:49:02:[10786.477004] RIP: 0010:[<ffffffffa0ae0910>]  [<ffffffffa0ae0910>] dt_txn_hook_commit+0x30/0x60 [obdclass]
          11:49:02:[10786.477004] RSP: 0018:ffff88004d717d88  EFLAGS: 00010207
          11:49:02:[10786.477004] RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a32 RCX: ffff88004e49c000
          11:49:02:[10786.477004] RDX: ffff88004e49c048 RSI: ffffffffa13485c0 RDI: ffff880053a15810
          11:49:02:[10786.477004] RBP: ffff88004d717d98 R08: 20737365636f7250 R09: 0a64657265746e65
          11:49:02:[10786.477004] R10: ffff880054140b1f R11: 0a64657265746e65 R12: ffff880053a15800
          11:49:02:[10786.477004] R13: 0000000000000000 R14: ffff880053a15850 R15: ffff88004e49c000
          11:49:02:[10786.477004] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
          11:49:02:[10786.477004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          11:49:02:[10786.477004] CR2: 00007f348ade1000 CR3: 000000007b9e0000 CR4: 00000000000006f0
          11:49:02:[10786.477004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          11:49:02:[10786.477004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          11:49:02:[10786.477004] Stack:
          11:49:02:[10786.477004]  ffff880052c72ec0 ffff880053a15800 ffff88004d717dd0 ffffffffa14bf75d
          11:49:02:[10786.477004]  ffff880052c72ec0 ffff880052c72e10 0000000000000000 ffff880052c72e00
          11:49:02:[10786.477004]  ffff88004db7dd00 ffff88004d717e00 ffffffffa03d1884 ffff880052c72e00
          11:49:02:[10786.477004] Call Trace:
          11:49:02:[10786.477004]  [<ffffffffa14bf75d>] osd_trans_commit_cb+0xed/0x440 [osd_zfs]
          11:49:02:[10786.477004]  [<ffffffffa03d1884>] dmu_tx_do_callbacks+0x44/0x70 [zfs]
          11:49:02:[10786.477004]  [<ffffffffa04182c4>] txg_do_callbacks+0x14/0x30 [zfs]
          11:49:02:[10786.477004]  [<ffffffffa02886de>] taskq_thread+0x21e/0x430 [spl]
          11:49:02:[10786.477004]  [<ffffffff810b8c00>] ? wake_up_state+0x20/0x20
          11:49:02:[10786.477004]  [<ffffffffa02884c0>] ? taskq_thread_spawn+0x60/0x60 [spl]
          11:49:02:[10786.477004]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
          11:49:02:[10786.477004]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
          11:49:02:[10786.477004]  [<ffffffff81646018>] ret_from_fork+0x58/0x90
          11:49:02:[10786.477004]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
          11:49:02:[10786.477004] Code: 55 48 89 e5 41 54 49 89 fc 53 f6 47 48 02 75 42 48 8b 0f 48 8b 41 48 48 8d 51 48 48 39 d0 48 8d 58 d8 74 2e 0f 1f 80 00 00 00 00 <48> 8b 43 10 48 85 c0 74 0d 48 8b 73 18 4c 89 e7 ff d0 49 8b 0c 
          11:49:02:[10786.477004] RIP  [<ffffffffa0ae0910>] dt_txn_hook_commit+0x30/0x60 [obdclass]
          11:49:02:[10786.477004]  RSP <ffff88004d717d88>
          
          jamesanunez James Nunez (Inactive) added a comment - There's a similar stack trace for a conf-sanity test 44 timeout at https://testing.hpdd.intel.com/test_sets/ecc6376e-1892-11e6-855a-5254006e85c2 . From the MDS console log: 11:49:02:[10785.693635] Lustre: DEBUG MARKER: umount -d -f /mnt/mds1 11:49:02:[10786.476837] general protection fault: 0000 [#1] SMP 11:49:02:[10786.477004] Modules linked in: osd_zfs(OE) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd grace spl(OE) auth_rpcgss zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus ata_piix syscopyarea 8139too virtio_blk sysfillrect sysimgblt drm_kms_helper ttm drm floppy libata virtio_pci i2c_core serio_raw virtio_ring 8139cp virtio mii [last unloaded: libcfs] 11:49:02:[10786.477004] CPU: 0 PID: 6194 Comm: tx_commit_cb Tainted: P OE ------------ 3.10.0-327.13.1.el7_lustre.x86_64 #1 11:49:02:[10786.477004] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 11:49:02:[10786.477004] task: ffff88007a155080 ti: ffff88004d714000 task.ti: ffff88004d714000 11:49:02:[10786.477004] RIP: 0010:[<ffffffffa0ae0910>] [<ffffffffa0ae0910>] dt_txn_hook_commit+0x30/0x60 [obdclass] 11:49:02:[10786.477004] RSP: 0018:ffff88004d717d88 EFLAGS: 00010207 11:49:02:[10786.477004] RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a32 RCX: ffff88004e49c000 11:49:02:[10786.477004] RDX: ffff88004e49c048 RSI: ffffffffa13485c0 RDI: ffff880053a15810 11:49:02:[10786.477004] RBP: ffff88004d717d98 R08: 20737365636f7250 R09: 0a64657265746e65 11:49:02:[10786.477004] R10: ffff880054140b1f R11: 0a64657265746e65 R12: ffff880053a15800 11:49:02:[10786.477004] R13: 0000000000000000 R14: ffff880053a15850 R15: ffff88004e49c000 11:49:02:[10786.477004] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 11:49:02:[10786.477004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 11:49:02:[10786.477004] CR2: 00007f348ade1000 CR3: 000000007b9e0000 CR4: 00000000000006f0 11:49:02:[10786.477004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 11:49:02:[10786.477004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 11:49:02:[10786.477004] Stack: 11:49:02:[10786.477004] ffff880052c72ec0 ffff880053a15800 ffff88004d717dd0 ffffffffa14bf75d 11:49:02:[10786.477004] ffff880052c72ec0 ffff880052c72e10 0000000000000000 ffff880052c72e00 11:49:02:[10786.477004] ffff88004db7dd00 ffff88004d717e00 ffffffffa03d1884 ffff880052c72e00 11:49:02:[10786.477004] Call Trace: 11:49:02:[10786.477004] [<ffffffffa14bf75d>] osd_trans_commit_cb+0xed/0x440 [osd_zfs] 11:49:02:[10786.477004] [<ffffffffa03d1884>] dmu_tx_do_callbacks+0x44/0x70 [zfs] 11:49:02:[10786.477004] [<ffffffffa04182c4>] txg_do_callbacks+0x14/0x30 [zfs] 11:49:02:[10786.477004] [<ffffffffa02886de>] taskq_thread+0x21e/0x430 [spl] 11:49:02:[10786.477004] [<ffffffff810b8c00>] ? wake_up_state+0x20/0x20 11:49:02:[10786.477004] [<ffffffffa02884c0>] ? taskq_thread_spawn+0x60/0x60 [spl] 11:49:02:[10786.477004] [<ffffffff810a5acf>] kthread+0xcf/0xe0 11:49:02:[10786.477004] [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 11:49:02:[10786.477004] [<ffffffff81646018>] ret_from_fork+0x58/0x90 11:49:02:[10786.477004] [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 11:49:02:[10786.477004] Code: 55 48 89 e5 41 54 49 89 fc 53 f6 47 48 02 75 42 48 8b 0f 48 8b 41 48 48 8d 51 48 48 39 d0 48 8d 58 d8 74 2e 0f 1f 80 00 00 00 00 <48> 8b 43 10 48 85 c0 74 0d 48 8b 73 18 4c 89 e7 ff d0 49 8b 0c 11:49:02:[10786.477004] RIP [<ffffffffa0ae0910>] dt_txn_hook_commit+0x30/0x60 [obdclass] 11:49:02:[10786.477004] RSP <ffff88004d717d88>

          People

            tappro Mikhail Pershin
            maloo Maloo
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: