[LU-6727] parallel-scale test mdtestfpp hung Created: 15/Jun/15 Updated: 02/Sep/16 Resolved: 19/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Build: https://build.hpdd.intel.com/job/lustre-master/3059 |
||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Maloo report: https://testing.hpdd.intel.com/test_sets/4542545e-10e2-11e5-a4f9-5254006e85c2

Console log on MDS (shadow-2vm4):

[11694.671469] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale test mdtestfpp: mdtestfpp ========================================================== 12:43:53 \(1434051833\)
[11695.035400] Lustre: DEBUG MARKER: == parallel-scale test mdtestfpp: mdtestfpp ========================================================== 12:43:53 (1434051833)
[12064.045003] BUG: soft lockup - CPU#1 stuck for 22s! [mdt00_006:31733]
[12064.045003] Modules linked in: osp(OF) mdd(OF) lod(OF) mdt(OF) lfsck(OF) mgs(OF) mgc(OF) osd_ldiskfs(OF) lquota(OF) fid(OF) fld(OF) ksocklnd(OF) ptlrpc(OF) obdclass(OF) lnet(OF) sha512_generic libcfs(OF) ldiskfs(OF) dm_mod nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ppdev ib_sa pcspkr virtio_balloon serio_raw parport_pc parport i2c_piix4 ib_mad ib_core ib_addr ext4 mbcache jbd2 ata_generic pata_acpi cirrus virtio_blk syscopyarea sysfillrect sysimgblt drm_kms_helper ttm ata_piix drm 8139too virtio_pci 8139cp virtio_ring virtio mii i2c_core libata floppy
[12064.045003] CPU: 1 PID: 31733 Comm: mdt00_006 Tainted: GF O-------------- 3.10.0-229.4.2.el7_lustre.g1fee634.x86_64 #1
[12064.045003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[12064.045003] task: ffff880079498000 ti: ffff88004bd64000 task.ti: ffff88004bd64000
[12064.045003] RIP: 0010:[<ffffffffa0bc8b9c>] [<ffffffffa0bc8b9c>] iam_insert_key_lock+0x2c/0x50 [osd_ldiskfs]
[12064.045003] RSP: 0018:ffff88004bd673d0 EFLAGS: 00000206
[12064.045003] RAX: 000000000231c029 RBX: ffff88007b1e7240 RCX: 00000000000000cc
[12064.045003] RDX: ffff8800120137fc RSI: ffff88004bd67500 RDI: ffff88004bd674c8
[12064.045003] RBP: ffff88004bd673d8 R08: ffff88005bfa23a8 R09: 0000000000000055
[12064.045003] R10: ffff880079ecd1c8 R11: 7d1b000002000000 R12: ffff88004bd67360
[12064.045003] R13: 0000000000000000 R14: ffff88005662b930 R15: ffffffffa05c4e87
[12064.045003] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[12064.045003] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[12064.045003] CR2: 00007f3dc08ef000 CR3: 000000007874c000 CR4: 00000000000006e0
[12064.045003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12064.045003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[12064.045003] Stack:
[12064.045003]  ffff88004bd675a8 ffff88004bd67430 ffffffffa0bcb582 ffff88004bd67478
[12064.045003]  ffff8800583c19c0 000000aa4bd6746c ffff880015727000 ffff88004bd674c8
[12064.045003]  ffff88007b1e7240 ffff88005d72a338 ffff88004bd675a8 ffff88005d72a360
[12064.045003] Call Trace:
[12064.045003]  [<ffffffffa0bcb582>] iam_lfix_split+0x122/0x140 [osd_ldiskfs]
[12064.045003]  [<ffffffffa0bc9e16>] iam_add_rec+0x236/0x2f0 [osd_ldiskfs]
[12064.045003]  [<ffffffffa0bcb5a0>] ? iam_lfix_split+0x140/0x140 [osd_ldiskfs]
[12064.045003]  [<ffffffffa0bca77e>] iam_insert+0xce/0x120 [osd_ldiskfs]
[12064.045003]  [<ffffffffa0bc2115>] osd_oi_iam_refresh.isra.16+0x125/0x2a0 [osd_ldiskfs]
[12064.045003]  [<ffffffffa0bc44c7>] osd_oi_insert+0x147/0x490 [osd_ldiskfs]
[12064.045003]  [<ffffffff811eb4b2>] ? generic_setxattr+0x62/0x80
[12064.045003]  [<ffffffffa0bbf767>] osd_object_ea_create+0x527/0x980 [osd_ldiskfs]
[12064.045003]  [<ffffffffa0e4cebc>] lod_sub_object_create+0x22c/0x460 [lod]
[12064.045003]  [<ffffffffa0e4495f>] lod_object_create+0xaf/0x200 [lod]
[12064.045003]  [<ffffffffa0ea62e5>] mdd_object_create_internal+0xb5/0x280 [mdd]
[12064.045003]  [<ffffffffa0e91b96>] mdd_object_create+0x76/0xa20 [mdd]
[12064.045003]  [<ffffffffa0e9ac17>] ? mdd_declare_create+0x447/0xd30 [mdd]
[12064.045003]  [<ffffffffa0e9c1c0>] mdd_create+0xcc0/0x1260 [mdd]
[12064.045003]  [<ffffffffa0d8101b>] mdt_reint_open+0x1f4b/0x2da0 [mdt]
[12064.045003]  [<ffffffffa0784149>] ? upcall_cache_get_entry+0x3e9/0x8e0 [obdclass]
[12064.045003]  [<ffffffff812ddd72>] ? strlcpy+0x42/0x60
[12064.045003]  [<ffffffffa0d75920>] mdt_reint_rec+0x80/0x210 [mdt]
[12064.045003]  [<ffffffffa0d591ac>] mdt_reint_internal+0x58c/0x780 [mdt]
[12064.045003]  [<ffffffffa0d59502>] mdt_intent_reint+0x162/0x420 [mdt]
[12064.045003]  [<ffffffffa0d630ba>] mdt_intent_policy+0x58a/0xb80 [mdt]
[12064.045003]  [<ffffffffa0972103>] ldlm_lock_enqueue+0x353/0x930 [ptlrpc]
[12064.045003]  [<ffffffffa0999062>] ldlm_handle_enqueue0+0x4f2/0x1540 [ptlrpc]
[12064.045003]  [<ffffffffa09beeb0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[12064.045003]  [<ffffffffa0a14562>] tgt_enqueue+0x62/0x210 [ptlrpc]
[12064.045003]  [<ffffffffa0a18235>] tgt_request_handle+0x6d5/0x1060 [ptlrpc]
[12064.045003]  [<ffffffffa09c820b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[12064.045003]  [<ffffffffa09c5d88>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[12064.045003]  [<ffffffff810a9662>] ? default_wake_function+0x12/0x20
[12064.045003]  [<ffffffff810a0898>] ? __wake_up_common+0x58/0x90
[12064.045003]  [<ffffffffa09cc3f8>] ptlrpc_main+0xaf8/0x1ea0 [ptlrpc]
[12064.045003]  [<ffffffff810ad8b6>] ? __dequeue_entity+0x26/0x40
[12064.045003]  [<ffffffff810125f6>] ? __switch_to+0x136/0x4a0
[12064.045003]  [<ffffffffa09cb900>] ? ptlrpc_register_service+0xf00/0xf00 [ptlrpc]
[12064.045003]  [<ffffffff8109739f>] kthread+0xcf/0xe0
[12064.045003]  [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[12064.045003]  [<ffffffff81614f7c>] ret_from_fork+0x7c/0xb0
[12064.045003]  [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
[12064.045003] Code: 1f 44 00 00 55 48 89 e5 53 4c 8b 06 48 89 f3 f0 41 0f ba 28 19 19 c0 85 c0 74 12 0f 1f 40 00 49 8b 00 a9 00 00 00 02 74 e6 f3 90 <eb> f2 48 89 de e8 2a ff ff ff 48 8b 1b e8 22 7a 4d e0 f0 80 63
[12064.045003] Kernel panic - not syncing: softlockup: hung tasks

Please find more console logs in the attached file. |
| Comments |
| Comment by Sarah Liu [ 23/Jun/15 ] |
|
Not sure if this is the same issue since the MDS logs are missing. Server: lustre-master build #3071, EL7 ldiskfs. The client-1 console shows:

16:29:45:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:29:45:mdtest D 0000000000000001 0 4860 4858 0x00000080
16:29:45: ffff880034457898 0000000000000082 0000000000000000 0000000000000000
16:29:45: ffff880034457878 ffffffffa074fa22 0000000000000000 0000000000000000
16:29:45: ffffffffa07b75c0 ffff880033763400 ffff880071f55068 ffff880034457fd8
16:29:45:Call Trace:
16:29:45: [<ffffffffa074fa22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
16:29:45: [<ffffffff812940ba>] ? strlcpy+0x4a/0x60
16:29:45: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
16:29:45: [<ffffffffa0930a85>] ? mdc_open_pack+0x1d5/0x250 [mdc]
16:29:45: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
16:29:46: [<ffffffffa09342b2>] mdc_enqueue+0x222/0x1a50 [mdc]
16:29:46: [<ffffffffa0935cc2>] mdc_intent_lock+0x1e2/0x593 [mdc]
16:29:46: [<ffffffffa0a028a0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
16:29:46: [<ffffffffa07060b0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
16:29:46: [<ffffffffa08c9903>] ? lmv_fid_alloc+0x93/0x440 [lmv]
16:29:46: [<ffffffffa08e3860>] lmv_intent_lock+0x5a0/0x1b00 [lmv]
16:29:46: [<ffffffff811747e3>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
16:29:46: [<ffffffffa09fec09>] ? ll_i2suppgid+0x19/0x30 [lustre]
16:29:46: [<ffffffffa0a10f17>] ? ll_mdscapa_get+0x57/0x220 [lustre]
16:29:46: [<ffffffffa09e3706>] ? ll_prep_md_op_data+0x236/0x550 [lustre]
16:29:46: [<ffffffffa0a028a0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
16:29:46: [<ffffffffa0a045a9>] ll_lookup_it+0x249/0xdb0 [lustre]
16:29:46: [<ffffffffa0a05199>] ll_lookup_nd+0x89/0x5e0 [lustre]
16:29:46: [<ffffffff8119b992>] __lookup_hash+0x102/0x160
16:29:46: [<ffffffff8119c0ba>] lookup_hash+0x3a/0x50
16:29:46: [<ffffffff811a0f9e>] do_filp_open+0x2de/0xd20
16:29:46: [<ffffffff810f0445>] ? call_rcu_sched+0x15/0x20
16:29:46: [<ffffffff810f045e>] ? call_rcu+0xe/0x10
16:29:46: [<ffffffff8129935a>] ? strncpy_from_user+0x4a/0x90
16:29:46: [<ffffffff811ae272>] ? alloc_fd+0x92/0x160
16:29:46: [<ffffffff8118b0b7>] do_sys_open+0x67/0x130
16:29:46: [<ffffffff810e5d77>] ? audit_syscall_entry+0x1d7/0x200
16:29:46: [<ffffffff8118b1c0>] sys_open+0x20/0x30
16:29:46: [<ffffffff8118b1e5>] sys_creat+0x15/0x20
16:29:46: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
16:29:46:Lustre: 2268:0:(client.c:2003:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1434385723/real 1434385723] req@ffff880034cfd3c0 x1504051204641016/t0(0) o250->MGC10.2.4.185@tcp@10.2.4.185@tcp:26/25 lens 520/544 e 0 to 1 dl 1434385748 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
16:29:47:Lustre: 2268:0:(client.c:2003:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
16:29:47:INFO: task mdtest:4860 blocked for more than 120 seconds.
16:29:47: Not tainted 2.6.32-504.16.2.el6.x86_64 #1
16:29:47:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
16:29:47:mdtest D 0000000000000001 0 4860 4858 0x00000080
16:29:47: ffff880034457898 0000000000000082 0000000000000000 0000000000000000
16:29:47: ffff880034457878 ffffffffa074fa22 0000000000000000 0000000000000000
16:29:47: ffffffffa07b75c0 ffff880033763400 ffff880071f55068 ffff880034457fd8
16:29:47:Call Trace:
16:29:47: [<ffffffffa074fa22>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
16:29:47: [<ffffffff812940ba>] ? strlcpy+0x4a/0x60
16:29:47: [<ffffffff8152b7a6>] __mutex_lock_slowpath+0x96/0x210
16:29:47: [<ffffffffa0930a85>] ? mdc_open_pack+0x1d5/0x250 [mdc]
16:29:47: [<ffffffff8152b2cb>] mutex_lock+0x2b/0x50
16:29:47: [<ffffffffa09342b2>] mdc_enqueue+0x222/0x1a50 [mdc]
16:29:47: [<ffffffffa0935cc2>] mdc_intent_lock+0x1e2/0x593 [mdc]
16:29:47: [<ffffffffa0a028a0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
16:29:47: [<ffffffffa07060b0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
16:29:47: [<ffffffffa08c9903>] ? lmv_fid_alloc+0x93/0x440 [lmv]
16:29:47: [<ffffffffa08e3860>] lmv_intent_lock+0x5a0/0x1b00 [lmv]
16:29:47: [<ffffffff811747e3>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
16:29:47: [<ffffffffa09fec09>] ? ll_i2suppgid+0x19/0x30 [lustre]
16:29:47: [<ffffffffa0a10f17>] ? ll_mdscapa_get+0x57/0x220 [lustre]
16:29:47: [<ffffffffa09e3706>] ? ll_prep_md_op_data+0x236/0x550 [lustre]
16:29:47: [<ffffffffa0a028a0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
16:29:48: [<ffffffffa0a045a9>] ll_lookup_it+0x249/0xdb0 [lustre]
16:29:48: [<ffffffffa0a05199>] ll_lookup_nd+0x89/0x5e0 [lustre]
16:29:48: [<ffffffff8119b992>] __lookup_hash+0x102/0x160
16:29:48: [<ffffffff8119c0ba>] lookup_hash+0x3a/0x50
16:33:41: [<ffffffff811a0f9e>] do_filp_open+0x2de/0xd20
16:33:41: [<ffffffff810f0445>] ? call_rcu_sched+0x15/0x20
16:33:41: [<ffffffff810f045e>] ? call_rcu+0xe/0x10
16:33:41: [<ffffffff8129935a>] ? strncpy_from_user+0x4a/0x90
16:33:41: [<ffffffff811ae272>] ? alloc_fd+0x92/0x160
16:33:41: [<ffffffff8118b0b7>] do_sys_open+0x67/0x130
16:33:42: [<ffffffff810e5d77>] ? audit_syscall_entry+0x1d7/0x200
16:33:42: [<ffffffff8118b1c0>] sys_open+0x20/0x30
16:33:42: [<ffffffff8118b1e5>] sys_creat+0x15/0x20
16:33:42: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Alex Zhuravlev [ 28/Jul/15 ] |
|
Yu Jian, are the logs from the MDS missing? |
| Comment by Jian Yu [ 04/Aug/15 ] |
|
Hi Alex,

The failure occurred consistently while running parallel-scale tests with a RHEL 7.1 server. Could you please take a look at it?

FYI, the following test parameters can be used to reproduce the failure and verify the patch:

Test-Parameters: alwaysuploadlogs envdefinitions=SLOW=yes,ENABLE_QUOTA=yes \
mdtfilesystemtype=ldiskfs mdsfilesystemtype=ldiskfs ostfilesystemtype=ldiskfs \
clientdistro=el7 ossdistro=el7 mdsdistro=el7 mdtcount=1 testlist=parallel-scale |
| Comment by Sarah Liu [ 05/Aug/15 ] |
|
It also affected mds-survey test 2: https://testing.hpdd.intel.com/test_logs/2d4f69d6-3b1c-11e5-bdff-5254006e85c2/show_text |
| Comment by Gerrit Updater [ 05/Aug/15 ] |
|
Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/15854 |
| Comment by Gerrit Updater [ 12/Aug/15 ] |
|
Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/15961 |
| Comment by Gerrit Updater [ 19/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15961/ |
| Comment by Peter Jones [ 19/Aug/15 ] |
|
Landed for 2.8 |