[LU-7172] replay-single test_70d hung on MDT unmount Created: 16/Sep/15 Updated: 14/Jun/18 Resolved: 23/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | dne |
| Environment: | autotest |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Replay-single test_70d hangs on unmount of one of the MDTs in a DNE configuration. Logs for this failure are at https://testing.hpdd.intel.com/test_sets/0c5cca82-5c61-11e5-9065-5254006e85c2 .

From the MDS2, MDS3, and MDS4 console logs, we see:

00:57:23:Lustre: lustre-MDT0001 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
00:57:23:Lustre: lustre-MDT0001 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
00:57:23:Lustre: lustre-MDT0001: Not available for connect from 10.2.4.158@tcp (stopping)
00:57:23:Lustre: Skipped 336 previous similar messages
00:57:23:INFO: task umount:4687 blocked for more than 120 seconds.
00:57:23:      Not tainted 2.6.32-573.3.1.el6_lustre.gde57418.x86_64 #1
00:57:23:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00:57:23:umount        D 0000000000000000     0  4687   4686 0x00000080
00:57:23: ffff88005cbefa38 0000000000000082 0000000000000000 ffff88005cbef9d8
00:57:23: ffff88005cbef998 000200000a0204a1 00000440a839357a 0000000000000000
00:57:23: ffff880056458044 000000010042c434 ffff8800688e9068 ffff88005cbeffd8
00:57:23:Call Trace:
01:05:22: [<ffffffff81539a62>] schedule_timeout+0x192/0x2e0
01:05:22: [<ffffffff81089c10>] ? process_timeout+0x0/0x10
01:05:22: [<ffffffffa05ce6b6>] obd_exports_barrier+0xb6/0x190 [obdclass]
01:05:22: [<ffffffffa0e9c145>] mdt_device_fini+0x475/0xf40 [mdt]
01:05:22: [<ffffffffa05d0496>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
01:05:22: [<ffffffffa05ea8b2>] class_cleanup+0x572/0xd30 [obdclass]
01:05:23: [<ffffffffa05cb156>] ? class_name2dev+0x56/0xe0 [obdclass]
01:05:23: [<ffffffffa05ecf46>] class_process_config+0x1ed6/0x2840 [obdclass]
01:05:23: [<ffffffffa04b2b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
01:05:23: [<ffffffff8117892c>] ? __kmalloc+0x21c/0x230
01:05:23: [<ffffffffa05edd6f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
01:05:23: [<ffffffffa05cb156>] ? class_name2dev+0x56/0xe0 [obdclass]
01:05:23: [<ffffffffa062721c>] server_put_super+0xa0c/0xed0 [obdclass]
01:05:23: [<ffffffff811b0176>] ? invalidate_inodes+0xf6/0x190
01:05:23: [<ffffffff811943bb>] generic_shutdown_super+0x5b/0xe0
01:05:23: [<ffffffff811944a6>] kill_anon_super+0x16/0x60
01:05:23: [<ffffffffa05f0c26>] lustre_kill_super+0x36/0x60 [obdclass]
01:05:23: [<ffffffff81194c47>] deactivate_super+0x57/0x80
01:05:23: [<ffffffff811b4adf>] mntput_no_expire+0xbf/0x110
01:05:23: [<ffffffff811b562b>] sys_umount+0x7b/0x3a0
01:05:23: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
01:05:23:Lustre: lustre-MDT0001 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?

Other instances of this failure: there are a few other cases of this test failing with the same stack trace, but the MDS console messages differ. All of these failures are in review-dne-part-2. |
| Comments |
| Comment by Bob Glossman (Inactive) [ 07/Oct/15 ] |
|
another on master: |
| Comment by Bruno Faccini (Inactive) [ 16/Oct/15 ] |
|
+1 at https://testing.hpdd.intel.com/test_sets/7b1ad28c-73dd-11e5-ada9-5254006e85c2

The full stack dump of the node hosting MDS[2-4] also shows some other threads of interest:

19:41:17:mdt00_000     S 0000000000000000     0  9976      2 0x00000080
19:41:17: ffff88005c4eb9b0 0000000000000046 ffff88007bf72cc0 ffff88006a306698
19:41:17: 0000000000000286 0000000000000003 0000000000000001 0000000000000286
19:41:17: ffff88005c4eb960 ffffffff8105e173 ffff880037603068 ffff88005c4ebfd8
19:41:17:Call Trace:
19:41:17: [<ffffffff8105e173>] ? __wake_up+0x53/0x70
19:41:17: [<ffffffffa08a5848>] top_trans_stop+0x9f8/0xdf0 [ptlrpc]
19:41:17: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
19:41:17: [<ffffffffa0f7637a>] ? lod_load_striping+0x8a/0x190 [lod]
19:41:17: [<ffffffffa0f6991c>] lod_trans_stop+0x2bc/0x330 [lod]
19:41:18: [<ffffffffa101388a>] mdd_trans_stop+0x1a/0xac [mdd]
19:41:18: [<ffffffffa1001b45>] mdd_create+0x695/0x1aa0 [mdd]
19:41:18: [<ffffffffa0eb2784>] ? mdt_version_save+0x84/0x1a0 [mdt]
19:41:18: [<ffffffffa0eb4f46>] mdt_reint_create+0xbb6/0xcc0 [mdt]
19:41:18: [<ffffffffa061be80>] ? lu_ucred+0x20/0x30 [obdclass]
19:41:18: [<ffffffffa0e94675>] ? mdt_ucred+0x15/0x20 [mdt]
19:41:18: [<ffffffffa0ead83c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
19:41:18: [<ffffffffa084dbd2>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
19:41:18: [<ffffffffa0eb197d>] mdt_reint_rec+0x5d/0x200 [mdt]
19:41:18: [<ffffffffa0e9d77b>] mdt_reint_internal+0x62b/0xb80 [mdt]
19:41:18: [<ffffffffa0e9e16b>] mdt_reint+0x6b/0x120 [mdt]
19:41:18: [<ffffffffa088fe3c>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
19:41:18: [<ffffffffa08378c1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
19:41:19: [<ffffffffa0836a80>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
19:41:19: [<ffffffff810a0fce>] kthread+0x9e/0xc0
19:41:19: [<ffffffff8100c28a>] child_rip+0xa/0x20
19:41:19: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
19:41:19: [<ffffffff8100c280>] ? child_rip+0x0/0x20

19:41:59:mdt_rdpg00_00 S 0000000000000000     0 32690      2 0x00000080
19:41:59: ffff88007d0435e0 0000000000000046 0000000000000000 0000000000000000
19:41:59: ffff88007d043640 00000000a08d2b88 0000097942d7d1fc ffff88007acfb1ed
19:41:59: 000000497d043580 00000001009a5d44 ffff880054c8e5f8 ffff88007d043fd8
19:41:59:Call Trace:
19:41:59: [<ffffffff81539c52>] schedule_timeout+0x192/0x2e0
19:41:59: [<ffffffff81089be0>] ? process_timeout+0x0/0x10
19:42:00: [<ffffffffa081a4a9>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]
19:42:00: [<ffffffffa107a27b>] ? osp_prep_update_req+0x63b/0x690 [osp]
19:42:00: [<ffffffffa080fac0>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
19:42:00: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
19:42:00: [<ffffffffa0826715>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
19:42:00: [<ffffffffa081ac31>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
19:42:00: [<ffffffffa107a3f1>] osp_remote_sync+0x121/0x190 [osp]
19:42:00: [<ffffffffa105da0d>] osp_attr_get+0x40d/0x6c0 [osp]
19:42:00: [<ffffffffa105f427>] osp_object_init+0x1c7/0x330 [osp]
19:42:00: [<ffffffffa05fece8>] lu_object_alloc+0xd8/0x320 [obdclass]
19:42:01: [<ffffffffa06000d1>] lu_object_find_try+0x151/0x260 [obdclass]
19:42:01: [<ffffffffa0600291>] lu_object_find_at+0xb1/0xe0 [obdclass]
19:42:01: [<ffffffffa0f81d35>] ? lod_load_lmv_shards+0x5d5/0xaf0 [lod]
19:42:01: [<ffffffffa0f6948f>] ? lod_fld_lookup+0x34f/0x520 [lod]
19:42:01: [<ffffffffa060113c>] dt_locate_at+0x1c/0xa0 [obdclass]
19:42:01: [<ffffffffa0f80a07>] lod_parse_dir_striping+0x2f7/0x6b0 [lod]
19:42:01: [<ffffffffa0f76045>] lod_load_striping_locked+0x355/0x600 [lod]
19:42:01: [<ffffffffa0f76359>] lod_load_striping+0x69/0x190 [lod]
19:42:01: [<ffffffffa0f86eeb>] lod_declare_attr_set+0x22b/0x510 [lod]
19:42:02: [<ffffffffa100c750>] mdd_attr_set+0x660/0x1730 [mdd]
19:42:02: [<ffffffffa0ec2e48>] mdt_mfd_close+0x288/0x1980 [mdt]
19:42:02: [<ffffffffa0825aec>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
19:42:02: [<ffffffffa0825d2b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
19:42:02: [<ffffffffa0824ad5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
19:42:03: [<ffffffffa05dbbd5>] ? class_handle2object+0x95/0x190 [obdclass]
19:42:03: [<ffffffffa0ec473a>] mdt_close_internal+0x1fa/0x4e0 [mdt]
19:42:03: [<ffffffffa0ec4cd7>] mdt_close+0x2b7/0xa40 [mdt]
19:42:04: [<ffffffffa088fe3c>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
19:42:04: [<ffffffffa08378c1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
19:42:04: [<ffffffffa0836a80>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
19:42:04: [<ffffffff810a0fce>] kthread+0x9e/0xc0
19:42:04: [<ffffffff8100c28a>] child_rip+0xa/0x20
19:42:04: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
19:42:04: [<ffffffff8100c280>] ? child_rip+0x0/0x20

Also, according to the Console/Conman timestamps, the full stack dump took about 22 minutes to complete, and by then the umount thread that was claimed to be stuck was no longer present! |
| Comment by Bob Glossman (Inactive) [ 16/Nov/15 ] |
|
another on master: |
| Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ] |
|
I saw the same backtrace in replay-dual test_26:

umount        S 0000000000000000     0 25955  25954 0x00000000
 ffff8800724f3858 0000000000000086 ffff880000000010 ffff8800724f37f8
 ffff8800724f37b8 0000000000000003 0000000000000001 0000000000000286
 0000000000000000 0000000000000000 ffff8800786b1068 ffff8800724f3fd8
Call Trace:
 [<ffffffffa1b23658>] top_trans_stop+0xb08/0xe60 [ptlrpc]
 [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0bfc91c>] lod_trans_stop+0x2bc/0x330 [lod]
 [<ffffffffa0b1ff2a>] mdd_trans_stop+0x1a/0xac [mdd]
 [<ffffffffa0b163c0>] mdd_close+0x1a0/0xbf0 [mdd]
 [<ffffffffa1043d01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0b7e5a9>] mdt_mfd_close+0x359/0x1980 [mdt]
 [<ffffffffa11a09f5>] ? keys_fill+0xd5/0x1b0 [obdclass]
 [<ffffffffa11a11cb>] ? lu_context_init+0x8b/0x160 [obdclass]
 [<ffffffffa0b4fc42>] mdt_obd_disconnect+0x392/0x500 [mdt]
 [<ffffffffa11743e4>] class_disconnect_export_list+0x334/0x620 [obdclass]
 [<ffffffffa1174b56>] class_disconnect_exports+0x116/0x2f0 [obdclass]
 [<ffffffffa118e887>] class_cleanup+0x157/0xd20 [obdclass]
 [<ffffffffa1043d01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa116f816>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa1191326>] class_process_config+0x1ed6/0x2830 [obdclass]
 [<ffffffffa1043d01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa119213f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
 [<ffffffffa116f816>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa11ca85c>] server_put_super+0xa0c/0xed0 [obdclass]
 [<ffffffff811b0136>] ? invalidate_inodes+0xf6/0x190
 [<ffffffff8119439b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff81194486>] kill_anon_super+0x16/0x60
 [<ffffffffa1194ff6>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff81194c27>] deactivate_super+0x57/0x80
 [<ffffffff811b4a9f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811b55eb>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b |
| Comment by James Nunez (Inactive) [ 03/Feb/16 ] |
|
Experienced the same hang on MDS umount on master on recovery-small test_23 with the stack trace in the ticket description. Logs are at: |
| Comment by Lai Siyao [ 23/Sep/16 ] |
|
This is a duplicate of