[LU-6756] sanity test_225b: test failed to respond and timed out Created: 23/Jun/15  Updated: 12/Aug/16  Resolved: 12/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

server: lustre-master build #3071 EL7
client: lustre-master build #3071 EL6


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/32101cf6-13d8-11e5-b4b0-5254006e85c2.

The sub-test test_225b failed with the following error:

test failed to respond and timed out

MDS log is missing



 Comments   
Comment by Oleg Drokin [ 03/Jul/15 ]

It looks like MDS has crashed, but no logs present. nothing about the MDS node in fact.

To start being able to do anything here we need mds logs to see why did it crash.

Comment by nasf (Inactive) [ 20/Jul/16 ]

Hit on master again with the logs:
https://testing.hpdd.intel.com/test_sets/24db32f0-4e47-11e6-a80f-5254006e85c2

05:23:45:[ 3944.065003] BUG: soft lockup - CPU#0 stuck for 23s! [lctl:30182]
05:23:45:[ 3944.065003] Modules linked in: obdecho(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd nfs_acl lockd auth_rpcgss grace spl(OE) zlib_deflate sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt virtio_blk drm_kms_helper ttm ata_piix drm 8139too serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core libata floppy [last unloaded: obdecho]
05:23:45:[ 3944.065003] CPU: 0 PID: 30182 Comm: lctl Tainted: P           OE  ------------   3.10.0-327.22.2.el7_lustre.x86_64 #1
05:23:45:[ 3944.065003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
05:23:45:[ 3944.065003] task: ffff880041bd0b80 ti: ffff8800462a8000 task.ti: ffff8800462a8000
05:23:45:[ 3944.065003] RIP: 0010:[<ffffffff81302159>]  [<ffffffff81302159>] __write_lock_failed+0x9/0x20
05:23:45:[ 3944.065003] RSP: 0018:ffff8800462ab738  EFLAGS: 00000297
05:23:45:[ 3944.065003] RAX: 0000000000000000 RBX: ffff880047e78650 RCX: 0000000000000000
05:23:45:[ 3944.065003] RDX: 0000000000000000 RSI: ffff88007ab3d400 RDI: ffff880044301c4c
05:23:45:[ 3944.065003] RBP: ffff8800462ab738 R08: 0000000000000000 R09: 0000000000000000
05:23:45:[ 3944.065003] R10: 0000000000000000 R11: 0000000000000002 R12: ffff880041bd1700
05:23:45:[ 3944.065003] R13: ffff880000000003 R14: ffff88004c8e0400 R15: ffffffff810b8716
05:23:45:[ 3944.065003] FS:  00007fb9a6767740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
05:23:45:[ 3944.065003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
05:23:45:[ 3944.065003] CR2: 00007fb9a5c1bd70 CR3: 00000000457c5000 CR4: 00000000000006f0
05:23:45:[ 3944.065003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
05:23:45:[ 3944.065003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
05:23:45:[ 3944.065003] Stack:
05:23:45:[ 3944.065003]  ffff8800462ab748 ffffffff8163da17 ffff8800462ab778 ffffffffa0ec8a43
05:23:45:[ 3944.065003]  ffff88007ab3d400 ffff880044301b90 ffff88007ab3d400 0000000000000000
05:23:45:[ 3944.065003]  ffff8800462ab7a0 ffffffffa0ec985f ffff88007a54e000 ffff880044301b90
05:23:45:[ 3944.065003] Call Trace:
05:23:45:[ 3944.065003]  [<ffffffff8163da17>] _raw_write_lock+0x17/0x20
05:23:45:[ 3944.065003]  [<ffffffffa0ec8a43>] osd_object_sa_dirty_add+0x43/0x70 [osd_zfs]
05:23:45:[ 3944.065003]  [<ffffffffa0ec985f>] osd_object_sa_update+0x3f/0xb0 [osd_zfs]
05:23:45:[ 3944.065003]  [<ffffffffa0ecd980>] osd_write+0x260/0x380 [osd_zfs]
05:23:45:[ 3944.065003]  [<ffffffffa0a939a9>] dt_record_write+0x39/0x120 [obdclass]
05:23:45:[ 3944.065003]  [<ffffffffa0a55c51>] llog_osd_write_rec+0xbf1/0x1560 [obdclass]
05:23:45:[ 3944.065003]  [<ffffffffa0a4592a>] llog_write_rec+0xaa/0x280 [obdclass]
05:23:45:[ 3944.065003]  [<ffffffffa0a4a038>] llog_cat_add_rec+0x268/0x780 [obdclass]
05:23:45:[ 3944.065003]  [<ffffffffa0a42aba>] llog_add+0x7a/0x1a0 [obdclass]
05:23:45:[ 3944.065003]  [<ffffffffa03b790b>] ? dnode_rele_and_unlock+0x5b/0x90 [zfs]
05:23:45:[ 3944.065003]  [<ffffffffa11d1360>] osp_sync_add_rec.isra.16+0x270/0x9d0 [osp]
05:23:45:[ 3944.065003]  [<ffffffffa11d4bc8>] osp_sync_add+0x48/0x50 [osp]
05:23:45:[ 3944.065003]  [<ffffffffa11c2e44>] osp_object_destroy+0xe4/0x220 [osp]
05:23:45:[ 3944.065003]  [<ffffffffa1128d2b>] lod_sub_object_destroy+0x1bb/0x450 [lod]
05:23:45:[ 3944.065003]  [<ffffffffa1116cab>] lod_object_destroy+0x40b/0x690 [lod]
05:23:45:[ 3944.065003]  [<ffffffffa0ec6a9c>] ? osd_object_write_locked+0x3c/0x60 [osd_zfs]
05:23:45:[ 3944.065003]  [<ffffffffa117abc3>] mdd_finish_unlink+0x123/0x410 [mdd]
05:23:45:[ 3944.065003]  [<ffffffffa1113c18>] ? lod_attr_get+0xa8/0x120 [lod]
05:23:45:[ 3944.065003]  [<ffffffffa117d008>] mdd_unlink+0xc78/0xd80 [mdd]
05:23:45:[ 3944.065003]  [<ffffffffa12115f4>] echo_md_destroy_internal+0xc4/0x520 [obdecho]
05:23:45:[ 3944.065003]  [<ffffffffa121624a>] echo_md_handler.isra.42+0x191a/0x2310 [obdecho]
05:23:45:[ 3944.065003]  [<ffffffffa0f5dcf0>] ? lfsck_key_init+0x20/0xd0 [lfsck]
05:23:45:[ 3944.065003]  [<ffffffff811c171e>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
05:23:45:[ 3944.065003]  [<ffffffffa121d72e>] echo_client_iocontrol+0x10de/0x1c40 [obdecho]
05:23:45:[ 3944.065003]  [<ffffffffa0a5bd2e>] class_handle_ioctl+0x19de/0x2150 [obdclass]
05:23:45:[ 3944.065003]  [<ffffffff81197448>] ? handle_mm_fault+0x5b8/0xf50
05:23:45:[ 3944.065003]  [<ffffffff81285e08>] ? security_capable+0x18/0x20
05:23:45:[ 3944.065003]  [<ffffffffa0a40712>] obd_class_ioctl+0xd2/0x170 [obdclass]
05:23:45:[ 3944.065003]  [<ffffffff811f24d5>] do_vfs_ioctl+0x2e5/0x4c0
05:23:45:[ 3944.065003]  [<ffffffff81641f9d>] ? __do_page_fault+0x16d/0x450
05:23:45:[ 3944.065003]  [<ffffffff811f2751>] SyS_ioctl+0xa1/0xc0
05:23:45:[ 3944.065003]  [<ffffffff81646a89>] system_call_fastpath+0x16/0x1b
05:23:45:[ 3944.065003] Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2 ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 3e ff 07 f3 90 <83> 3f 01 75 f9 3e ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00
Comment by Gerrit Updater [ 25/Jul/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21492
Subject: LU-6756 osd-zfs: not hold rwlock to do blocked operation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5920fea45eed704e233d8649f73856e4a2c021ba

Comment by nasf (Inactive) [ 12/Aug/16 ]

The issue has been fixed by reverting commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43.
via the patch
http://review.whamcloud.com/21878

Generated at Sat Feb 10 02:02:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.