
[LU-6756] sanity test_225b: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.9.0
    • Affects Version: Lustre 2.8.0
    • Components: None
    • Environment: server: lustre-master build #3071 EL7;
      client: lustre-master build #3071 EL6
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/32101cf6-13d8-11e5-b4b0-5254006e85c2.

      The sub-test test_225b failed with the following error:

      test failed to respond and timed out
      

      MDS log is missing


          Activity


            yong.fan nasf (Inactive) added a comment -

            The issue has been fixed by reverting commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43 via the patch http://review.whamcloud.com/21878.

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21492
            Subject: LU-6756 osd-zfs: not hold rwlock to do blocked operation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5920fea45eed704e233d8649f73856e4a2c021ba
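            The patch subject ("not hold rwlock to do blocked operation") describes the general problem behind the soft lockup in the trace below: a spinning rwlock was held across work that can block, so any other CPU trying to take the same lock spins until the soft-lockup watchdog fires. The following is a minimal sketch of that anti-pattern and the usual fix; the names (my_obj, obj_lock, do_blocking_update) are hypothetical and do not correspond to the actual osd-zfs structures or to what the patch actually changes.

            /* Sketch only: hypothetical names, not the actual osd-zfs code. */
            #include <linux/spinlock.h>
            #include <linux/list.h>

            struct my_attr {
                    struct list_head entry;
            };

            struct my_obj {
                    rwlock_t         obj_lock;        /* spinning reader/writer lock */
                    struct list_head obj_dirty_list;
            };

            /* Hypothetical helper that may sleep (e.g. waits on I/O or a transaction). */
            void do_blocking_update(struct my_obj *obj, struct my_attr *attr);

            /* Anti-pattern: write_lock() is a spinning lock. If this thread blocks
             * inside do_blocking_update(), every other CPU calling
             * write_lock(&obj->obj_lock) busy-spins, and after ~20 seconds the
             * watchdog prints "BUG: soft lockup" for the spinning task.
             */
            static void attr_update_bad(struct my_obj *obj, struct my_attr *attr)
            {
                    write_lock(&obj->obj_lock);
                    list_add_tail(&attr->entry, &obj->obj_dirty_list);
                    do_blocking_update(obj, attr);  /* may sleep: wrong under a rwlock */
                    write_unlock(&obj->obj_lock);
            }

            /* Fix pattern: touch the shared list under the lock, then drop the lock
             * before doing anything that can block.
             */
            static void attr_update_good(struct my_obj *obj, struct my_attr *attr)
            {
                    write_lock(&obj->obj_lock);
                    list_add_tail(&attr->entry, &obj->obj_dirty_list);
                    write_unlock(&obj->obj_lock);

                    do_blocking_update(obj, attr);  /* safe to block here */
            }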

            yong.fan nasf (Inactive) added a comment -

            Hit on master again with the logs:
            https://testing.hpdd.intel.com/test_sets/24db32f0-4e47-11e6-a80f-5254006e85c2

            05:23:45:[ 3944.065003] BUG: soft lockup - CPU#0 stuck for 23s! [lctl:30182]
            05:23:45:[ 3944.065003] Modules linked in: obdecho(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd auth_rpcgss grace spl(OE) zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt virtio_blk drm_kms_helper ttm ata_piix drm 8139too serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core libata floppy [last unloaded: obdecho]
            05:23:45:[ 3944.065003] CPU: 0 PID: 30182 Comm: lctl Tainted: P           OE  ------------   3.10.0-327.22.2.el7_lustre.x86_64 #1
            05:23:45:[ 3944.065003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
            05:23:45:[ 3944.065003] task: ffff880041bd0b80 ti: ffff8800462a8000 task.ti: ffff8800462a8000
            05:23:45:[ 3944.065003] RIP: 0010:[<ffffffff81302159>]  [<ffffffff81302159>] __write_lock_failed+0x9/0x20
            05:23:45:[ 3944.065003] RSP: 0018:ffff8800462ab738  EFLAGS: 00000297
            05:23:45:[ 3944.065003] RAX: 0000000000000000 RBX: ffff880047e78650 RCX: 0000000000000000
            05:23:45:[ 3944.065003] RDX: 0000000000000000 RSI: ffff88007ab3d400 RDI: ffff880044301c4c
            05:23:45:[ 3944.065003] RBP: ffff8800462ab738 R08: 0000000000000000 R09: 0000000000000000
            05:23:45:[ 3944.065003] R10: 0000000000000000 R11: 0000000000000002 R12: ffff880041bd1700
            05:23:45:[ 3944.065003] R13: ffff880000000003 R14: ffff88004c8e0400 R15: ffffffff810b8716
            05:23:45:[ 3944.065003] FS:  00007fb9a6767740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
            05:23:45:[ 3944.065003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
            05:23:45:[ 3944.065003] CR2: 00007fb9a5c1bd70 CR3: 00000000457c5000 CR4: 00000000000006f0
            05:23:45:[ 3944.065003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            05:23:45:[ 3944.065003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            05:23:45:[ 3944.065003] Stack:
            05:23:45:[ 3944.065003]  ffff8800462ab748 ffffffff8163da17 ffff8800462ab778 ffffffffa0ec8a43
            05:23:45:[ 3944.065003]  ffff88007ab3d400 ffff880044301b90 ffff88007ab3d400 0000000000000000
            05:23:45:[ 3944.065003]  ffff8800462ab7a0 ffffffffa0ec985f ffff88007a54e000 ffff880044301b90
            05:23:45:[ 3944.065003] Call Trace:
            05:23:45:[ 3944.065003]  [<ffffffff8163da17>] _raw_write_lock+0x17/0x20
            05:23:45:[ 3944.065003]  [<ffffffffa0ec8a43>] osd_object_sa_dirty_add+0x43/0x70 [osd_zfs]
            05:23:45:[ 3944.065003]  [<ffffffffa0ec985f>] osd_object_sa_update+0x3f/0xb0 [osd_zfs]
            05:23:45:[ 3944.065003]  [<ffffffffa0ecd980>] osd_write+0x260/0x380 [osd_zfs]
            05:23:45:[ 3944.065003]  [<ffffffffa0a939a9>] dt_record_write+0x39/0x120 [obdclass]
            05:23:45:[ 3944.065003]  [<ffffffffa0a55c51>] llog_osd_write_rec+0xbf1/0x1560 [obdclass]
            05:23:45:[ 3944.065003]  [<ffffffffa0a4592a>] llog_write_rec+0xaa/0x280 [obdclass]
            05:23:45:[ 3944.065003]  [<ffffffffa0a4a038>] llog_cat_add_rec+0x268/0x780 [obdclass]
            05:23:45:[ 3944.065003]  [<ffffffffa0a42aba>] llog_add+0x7a/0x1a0 [obdclass]
            05:23:45:[ 3944.065003]  [<ffffffffa03b790b>] ? dnode_rele_and_unlock+0x5b/0x90 [zfs]
            05:23:45:[ 3944.065003]  [<ffffffffa11d1360>] osp_sync_add_rec.isra.16+0x270/0x9d0 [osp]
            05:23:45:[ 3944.065003]  [<ffffffffa11d4bc8>] osp_sync_add+0x48/0x50 [osp]
            05:23:45:[ 3944.065003]  [<ffffffffa11c2e44>] osp_object_destroy+0xe4/0x220 [osp]
            05:23:45:[ 3944.065003]  [<ffffffffa1128d2b>] lod_sub_object_destroy+0x1bb/0x450 [lod]
            05:23:45:[ 3944.065003]  [<ffffffffa1116cab>] lod_object_destroy+0x40b/0x690 [lod]
            05:23:45:[ 3944.065003]  [<ffffffffa0ec6a9c>] ? osd_object_write_locked+0x3c/0x60 [osd_zfs]
            05:23:45:[ 3944.065003]  [<ffffffffa117abc3>] mdd_finish_unlink+0x123/0x410 [mdd]
            05:23:45:[ 3944.065003]  [<ffffffffa1113c18>] ? lod_attr_get+0xa8/0x120 [lod]
            05:23:45:[ 3944.065003]  [<ffffffffa117d008>] mdd_unlink+0xc78/0xd80 [mdd]
            05:23:45:[ 3944.065003]  [<ffffffffa12115f4>] echo_md_destroy_internal+0xc4/0x520 [obdecho]
            05:23:45:[ 3944.065003]  [<ffffffffa121624a>] echo_md_handler.isra.42+0x191a/0x2310 [obdecho]
            05:23:45:[ 3944.065003]  [<ffffffffa0f5dcf0>] ? lfsck_key_init+0x20/0xd0 [lfsck]
            05:23:45:[ 3944.065003]  [<ffffffff811c171e>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
            05:23:45:[ 3944.065003]  [<ffffffffa121d72e>] echo_client_iocontrol+0x10de/0x1c40 [obdecho]
            05:23:45:[ 3944.065003]  [<ffffffffa0a5bd2e>] class_handle_ioctl+0x19de/0x2150 [obdclass]
            05:23:45:[ 3944.065003]  [<ffffffff81197448>] ? handle_mm_fault+0x5b8/0xf50
            05:23:45:[ 3944.065003]  [<ffffffff81285e08>] ? security_capable+0x18/0x20
            05:23:45:[ 3944.065003]  [<ffffffffa0a40712>] obd_class_ioctl+0xd2/0x170 [obdclass]
            05:23:45:[ 3944.065003]  [<ffffffff811f24d5>] do_vfs_ioctl+0x2e5/0x4c0
            05:23:45:[ 3944.065003]  [<ffffffff81641f9d>] ? __do_page_fault+0x16d/0x450
            05:23:45:[ 3944.065003]  [<ffffffff811f2751>] SyS_ioctl+0xa1/0xc0
            05:23:45:[ 3944.065003]  [<ffffffff81646a89>] system_call_fastpath+0x16/0x1b
            05:23:45:[ 3944.065003] Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2 ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 3e ff 07 f3 90 <83> 3f 01 75 f9 3e ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00
            
            green Oleg Drokin added a comment -

            It looks like the MDS has crashed, but no logs are present; in fact, there is nothing about the MDS node at all.

            To make any progress here we need the MDS logs to see why it crashed.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 4
