[LU-6756] sanity test_225b: test failed to respond and timed out Created: 23/Jun/15 Updated: 12/Aug/16 Resolved: 12/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server: lustre-master build #3071 EL7 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/32101cf6-13d8-11e5-b4b0-5254006e85c2. The sub-test test_225b failed with the following error: test failed to respond and timed out MDS log is missing |
| Comments |
| Comment by Oleg Drokin [ 03/Jul/15 ] |
|
It looks like MDS has crashed, but no logs present. nothing about the MDS node in fact. To start being able to do anything here we need mds logs to see why did it crash. |
| Comment by nasf (Inactive) [ 20/Jul/16 ] |
|
Hit on master again with the logs: 05:23:45:[ 3944.065003] BUG: soft lockup - CPU#0 stuck for 23s! [lctl:30182] 05:23:45:[ 3944.065003] Modules linked in: obdecho(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd auth_rpcgss grace spl(OE) zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt virtio_blk drm_kms_helper ttm ata_piix drm 8139too serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core libata floppy [last unloaded: obdecho] 05:23:45:[ 3944.065003] CPU: 0 PID: 30182 Comm: lctl Tainted: P OE ------------ 3.10.0-327.22.2.el7_lustre.x86_64 #1 05:23:45:[ 3944.065003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 05:23:45:[ 3944.065003] task: ffff880041bd0b80 ti: ffff8800462a8000 task.ti: ffff8800462a8000 05:23:45:[ 3944.065003] RIP: 0010:[<ffffffff81302159>] [<ffffffff81302159>] __write_lock_failed+0x9/0x20 05:23:45:[ 3944.065003] RSP: 0018:ffff8800462ab738 EFLAGS: 00000297 05:23:45:[ 3944.065003] RAX: 0000000000000000 RBX: ffff880047e78650 RCX: 0000000000000000 05:23:45:[ 3944.065003] RDX: 0000000000000000 RSI: ffff88007ab3d400 RDI: ffff880044301c4c 05:23:45:[ 3944.065003] RBP: ffff8800462ab738 R08: 0000000000000000 R09: 0000000000000000 05:23:45:[ 3944.065003] R10: 0000000000000000 R11: 0000000000000002 R12: ffff880041bd1700 05:23:45:[ 3944.065003] R13: ffff880000000003 R14: ffff88004c8e0400 R15: ffffffff810b8716 05:23:45:[ 3944.065003] FS: 00007fb9a6767740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 05:23:45:[ 3944.065003] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 05:23:45:[ 3944.065003] CR2: 00007fb9a5c1bd70 CR3: 00000000457c5000 CR4: 00000000000006f0 05:23:45:[ 3944.065003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 05:23:45:[ 3944.065003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 05:23:45:[ 3944.065003] Stack: 05:23:45:[ 3944.065003] ffff8800462ab748 ffffffff8163da17 ffff8800462ab778 ffffffffa0ec8a43 05:23:45:[ 3944.065003] ffff88007ab3d400 ffff880044301b90 ffff88007ab3d400 0000000000000000 05:23:45:[ 3944.065003] ffff8800462ab7a0 ffffffffa0ec985f ffff88007a54e000 ffff880044301b90 05:23:45:[ 3944.065003] Call Trace: 05:23:45:[ 3944.065003] [<ffffffff8163da17>] _raw_write_lock+0x17/0x20 05:23:45:[ 3944.065003] [<ffffffffa0ec8a43>] osd_object_sa_dirty_add+0x43/0x70 [osd_zfs] 05:23:45:[ 3944.065003] [<ffffffffa0ec985f>] osd_object_sa_update+0x3f/0xb0 [osd_zfs] 05:23:45:[ 3944.065003] [<ffffffffa0ecd980>] osd_write+0x260/0x380 [osd_zfs] 05:23:45:[ 3944.065003] [<ffffffffa0a939a9>] dt_record_write+0x39/0x120 [obdclass] 05:23:45:[ 3944.065003] [<ffffffffa0a55c51>] llog_osd_write_rec+0xbf1/0x1560 [obdclass] 05:23:45:[ 3944.065003] [<ffffffffa0a4592a>] llog_write_rec+0xaa/0x280 [obdclass] 05:23:45:[ 3944.065003] [<ffffffffa0a4a038>] llog_cat_add_rec+0x268/0x780 [obdclass] 05:23:45:[ 3944.065003] [<ffffffffa0a42aba>] llog_add+0x7a/0x1a0 [obdclass] 05:23:45:[ 3944.065003] [<ffffffffa03b790b>] ? dnode_rele_and_unlock+0x5b/0x90 [zfs] 05:23:45:[ 3944.065003] [<ffffffffa11d1360>] osp_sync_add_rec.isra.16+0x270/0x9d0 [osp] 05:23:45:[ 3944.065003] [<ffffffffa11d4bc8>] osp_sync_add+0x48/0x50 [osp] 05:23:45:[ 3944.065003] [<ffffffffa11c2e44>] osp_object_destroy+0xe4/0x220 [osp] 05:23:45:[ 3944.065003] [<ffffffffa1128d2b>] lod_sub_object_destroy+0x1bb/0x450 [lod] 05:23:45:[ 3944.065003] [<ffffffffa1116cab>] lod_object_destroy+0x40b/0x690 [lod] 05:23:45:[ 3944.065003] [<ffffffffa0ec6a9c>] ? osd_object_write_locked+0x3c/0x60 [osd_zfs] 05:23:45:[ 3944.065003] [<ffffffffa117abc3>] mdd_finish_unlink+0x123/0x410 [mdd] 05:23:45:[ 3944.065003] [<ffffffffa1113c18>] ? lod_attr_get+0xa8/0x120 [lod] 05:23:45:[ 3944.065003] [<ffffffffa117d008>] mdd_unlink+0xc78/0xd80 [mdd] 05:23:45:[ 3944.065003] [<ffffffffa12115f4>] echo_md_destroy_internal+0xc4/0x520 [obdecho] 05:23:45:[ 3944.065003] [<ffffffffa121624a>] echo_md_handler.isra.42+0x191a/0x2310 [obdecho] 05:23:45:[ 3944.065003] [<ffffffffa0f5dcf0>] ? lfsck_key_init+0x20/0xd0 [lfsck] 05:23:45:[ 3944.065003] [<ffffffff811c171e>] ? kmem_cache_alloc_trace+0x1ce/0x1f0 05:23:45:[ 3944.065003] [<ffffffffa121d72e>] echo_client_iocontrol+0x10de/0x1c40 [obdecho] 05:23:45:[ 3944.065003] [<ffffffffa0a5bd2e>] class_handle_ioctl+0x19de/0x2150 [obdclass] 05:23:45:[ 3944.065003] [<ffffffff81197448>] ? handle_mm_fault+0x5b8/0xf50 05:23:45:[ 3944.065003] [<ffffffff81285e08>] ? security_capable+0x18/0x20 05:23:45:[ 3944.065003] [<ffffffffa0a40712>] obd_class_ioctl+0xd2/0x170 [obdclass] 05:23:45:[ 3944.065003] [<ffffffff811f24d5>] do_vfs_ioctl+0x2e5/0x4c0 05:23:45:[ 3944.065003] [<ffffffff81641f9d>] ? __do_page_fault+0x16d/0x450 05:23:45:[ 3944.065003] [<ffffffff811f2751>] SyS_ioctl+0xa1/0xc0 05:23:45:[ 3944.065003] [<ffffffff81646a89>] system_call_fastpath+0x16/0x1b 05:23:45:[ 3944.065003] Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2 ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 3e ff 07 f3 90 <83> 3f 01 75 f9 3e ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00 |
| Comment by Gerrit Updater [ 25/Jul/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21492 |
| Comment by nasf (Inactive) [ 12/Aug/16 ] |
|
The issue has been fixed by reverting commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43. |