[LU-7172] replay-single test_70d hung on MDT unmount Created: 16/Sep/15  Updated: 14/Jun/18  Resolved: 23/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: dne
Environment:

autotest


Issue Links:
Related
is related to LU-7117 replay-single test_70d: timeout and m... Resolved
is related to LU-7221 replay-ost-single test_3: ASSERTION( ... Resolved
is related to LU-7903 recovery-small test_23: hang on umount Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test 70d hangs on unmount of one of the MDTs in a DNE configuration. Logs for this failure are at https://testing.hpdd.intel.com/test_sets/0c5cca82-5c61-11e5-9065-5254006e85c2.

From the MDS2, MDS3, and MDS4 console logs, we see:

00:57:23:Lustre: lustre-MDT0001 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
00:57:23:Lustre: lustre-MDT0001 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
00:57:23:Lustre: lustre-MDT0001: Not available for connect from 10.2.4.158@tcp (stopping)
00:57:23:Lustre: Skipped 336 previous similar messages
00:57:23:INFO: task umount:4687 blocked for more than 120 seconds.
00:57:23:      Not tainted 2.6.32-573.3.1.el6_lustre.gde57418.x86_64 #1
00:57:23:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00:57:23:umount        D 0000000000000000     0  4687   4686 0x00000080
00:57:23: ffff88005cbefa38 0000000000000082 0000000000000000 ffff88005cbef9d8
00:57:23: ffff88005cbef998 000200000a0204a1 00000440a839357a 0000000000000000
00:57:23: ffff880056458044 000000010042c434 ffff8800688e9068 ffff88005cbeffd8
00:57:23:Call Trace:
01:05:22: [<ffffffff81539a62>] schedule_timeout+0x192/0x2e0
01:05:22: [<ffffffff81089c10>] ? process_timeout+0x0/0x10
01:05:22: [<ffffffffa05ce6b6>] obd_exports_barrier+0xb6/0x190 [obdclass]
01:05:22: [<ffffffffa0e9c145>] mdt_device_fini+0x475/0xf40 [mdt]
01:05:22: [<ffffffffa05d0496>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
01:05:22: [<ffffffffa05ea8b2>] class_cleanup+0x572/0xd30 [obdclass]
01:05:23: [<ffffffffa05cb156>] ? class_name2dev+0x56/0xe0 [obdclass]
01:05:23: [<ffffffffa05ecf46>] class_process_config+0x1ed6/0x2840 [obdclass]
01:05:23: [<ffffffffa04b2b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
01:05:23: [<ffffffff8117892c>] ? __kmalloc+0x21c/0x230
01:05:23: [<ffffffffa05edd6f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
01:05:23: [<ffffffffa05cb156>] ? class_name2dev+0x56/0xe0 [obdclass]
01:05:23: [<ffffffffa062721c>] server_put_super+0xa0c/0xed0 [obdclass]
01:05:23: [<ffffffff811b0176>] ? invalidate_inodes+0xf6/0x190
01:05:23: [<ffffffff811943bb>] generic_shutdown_super+0x5b/0xe0
01:05:23: [<ffffffff811944a6>] kill_anon_super+0x16/0x60
01:05:23: [<ffffffffa05f0c26>] lustre_kill_super+0x36/0x60 [obdclass]
01:05:23: [<ffffffff81194c47>] deactivate_super+0x57/0x80
01:05:23: [<ffffffff811b4adf>] mntput_no_expire+0xbf/0x110
01:05:23: [<ffffffff811b562b>] sys_umount+0x7b/0x3a0
01:05:23: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
01:05:23:Lustre: lustre-MDT0001 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
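
For reference, the "is waiting for obd_unlinked_exports" messages above are printed by obd_exports_barrier() in obdclass, which mdt_device_fini() calls (per the stack) after class_disconnect_exports(). The sketch below paraphrases the wait loop from the 2.x-era genops.c; exact details may differ by version, but it shows why umount spins here until every disconnected export drops its reference:

/* Paraphrased sketch of obd_exports_barrier() (obdclass genops.c, 2.x era);
 * not an exact copy of the shipped code. */
void obd_exports_barrier(struct obd_device *obd)
{
	int waited = 2;

	LASSERT(list_empty(&obd->obd_exports));
	spin_lock(&obd->obd_dev_lock);
	while (!list_empty(&obd->obd_unlinked_exports)) {
		spin_unlock(&obd->obd_dev_lock);
		/* sleep, then re-check; the wait doubles on every pass */
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_timeout(cfs_time_seconds(waited));
		if (waited > 5 && IS_PO2(waited)) {
			/* This is the console message seen above; here the
			 * obd refcount stays at 5 because some export is
			 * never released, so the loop never exits. */
			LCONSOLE_WARN("%s is waiting for obd_unlinked_exports "
				      "more than %d seconds. "
				      "The obd refcount = %d. Is it stuck?\n",
				      obd->obd_name, waited,
				      atomic_read(&obd->obd_refcount));
			dump_exports(obd, 1);
		}
		waited *= 2;
		spin_lock(&obd->obd_dev_lock);
	}
	spin_unlock(&obd->obd_dev_lock);
}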

Other instances of this failure:
2015-09-08 23:27:49 - https://testing.hpdd.intel.com/test_sets/fe085252-56cb-11e5-84d0-5254006e85c2
2015-09-11 10:01:48 - https://testing.hpdd.intel.com/test_sets/23ba7c20-58cd-11e5-baa0-5254006e85c2
2015-09-11 19:25:28 - https://testing.hpdd.intel.com/test_sets/a55b755e-5904-11e5-a4d9-5254006e85c2

There are a few other cases of this test failing with the same stack trace, but the MDS console messages differ:
2015-09-09 05:56:14 - https://testing.hpdd.intel.com/test_sets/3dda199c-56fe-11e5-8947-5254006e85c2
2015-09-12 19:16:06 - https://testing.hpdd.intel.com/test_sets/53eb4ab0-59d0-11e5-825b-5254006e85c2

All of these failures are in review-dne-part-2.



 Comments   
Comment by Bob Glossman (Inactive) [ 07/Oct/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/029076e0-6d12-11e5-ab7f-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 16/Oct/15 ]

+1 at https://testing.hpdd.intel.com/test_sets/7b1ad28c-73dd-11e5-ada9-5254006e85c2

The full stack dump of the node hosting MDS[2-4] also shows some other threads of interest:

19:41:17:mdt00_000     S 0000000000000000     0  9976      2 0x00000080
19:41:17: ffff88005c4eb9b0 0000000000000046 ffff88007bf72cc0 ffff88006a306698
19:41:17: 0000000000000286 0000000000000003 0000000000000001 0000000000000286
19:41:17: ffff88005c4eb960 ffffffff8105e173 ffff880037603068 ffff88005c4ebfd8
19:41:17:Call Trace:
19:41:17: [<ffffffff8105e173>] ? __wake_up+0x53/0x70
19:41:17: [<ffffffffa08a5848>] top_trans_stop+0x9f8/0xdf0 [ptlrpc]
19:41:17: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
19:41:17: [<ffffffffa0f7637a>] ? lod_load_striping+0x8a/0x190 [lod]
19:41:17: [<ffffffffa0f6991c>] lod_trans_stop+0x2bc/0x330 [lod]
19:41:18: [<ffffffffa101388a>] mdd_trans_stop+0x1a/0xac [mdd]
19:41:18: [<ffffffffa1001b45>] mdd_create+0x695/0x1aa0 [mdd]
19:41:18: [<ffffffffa0eb2784>] ? mdt_version_save+0x84/0x1a0 [mdt]
19:41:18: [<ffffffffa0eb4f46>] mdt_reint_create+0xbb6/0xcc0 [mdt]
19:41:18: [<ffffffffa061be80>] ? lu_ucred+0x20/0x30 [obdclass]
19:41:18: [<ffffffffa0e94675>] ? mdt_ucred+0x15/0x20 [mdt]
19:41:18: [<ffffffffa0ead83c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
19:41:18: [<ffffffffa084dbd2>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
19:41:18: [<ffffffffa0eb197d>] mdt_reint_rec+0x5d/0x200 [mdt]
19:41:18: [<ffffffffa0e9d77b>] mdt_reint_internal+0x62b/0xb80 [mdt]
19:41:18: [<ffffffffa0e9e16b>] mdt_reint+0x6b/0x120 [mdt]
19:41:18: [<ffffffffa088fe3c>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
19:41:18: [<ffffffffa08378c1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
19:41:19: [<ffffffffa0836a80>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
19:41:19: [<ffffffff810a0fce>] kthread+0x9e/0xc0
19:41:19: [<ffffffff8100c28a>] child_rip+0xa/0x20
19:41:19: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
19:41:19: [<ffffffff8100c280>] ? child_rip+0x0/0x20

19:41:59:mdt_rdpg00_00 S 0000000000000000     0 32690      2 0x00000080
19:41:59: ffff88007d0435e0 0000000000000046 0000000000000000 0000000000000000
19:41:59: ffff88007d043640 00000000a08d2b88 0000097942d7d1fc ffff88007acfb1ed
19:41:59: 000000497d043580 00000001009a5d44 ffff880054c8e5f8 ffff88007d043fd8
19:41:59:Call Trace:
19:41:59: [<ffffffff81539c52>] schedule_timeout+0x192/0x2e0
19:41:59: [<ffffffff81089be0>] ? process_timeout+0x0/0x10
19:42:00: [<ffffffffa081a4a9>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]
19:42:00: [<ffffffffa107a27b>] ? osp_prep_update_req+0x63b/0x690 [osp]
19:42:00: [<ffffffffa080fac0>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
19:42:00: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
19:42:00: [<ffffffffa0826715>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
19:42:00: [<ffffffffa081ac31>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
19:42:00: [<ffffffffa107a3f1>] osp_remote_sync+0x121/0x190 [osp]
19:42:00: [<ffffffffa105da0d>] osp_attr_get+0x40d/0x6c0 [osp]
19:42:00: [<ffffffffa105f427>] osp_object_init+0x1c7/0x330 [osp]
19:42:00: [<ffffffffa05fece8>] lu_object_alloc+0xd8/0x320 [obdclass]
19:42:01: [<ffffffffa06000d1>] lu_object_find_try+0x151/0x260 [obdclass]
19:42:01: [<ffffffffa0600291>] lu_object_find_at+0xb1/0xe0 [obdclass]
19:42:01: [<ffffffffa0f81d35>] ? lod_load_lmv_shards+0x5d5/0xaf0 [lod]
19:42:01: [<ffffffffa0f6948f>] ? lod_fld_lookup+0x34f/0x520 [lod]
19:42:01: [<ffffffffa060113c>] dt_locate_at+0x1c/0xa0 [obdclass]
19:42:01: [<ffffffffa0f80a07>] lod_parse_dir_striping+0x2f7/0x6b0 [lod]
19:42:01: [<ffffffffa0f76045>] lod_load_striping_locked+0x355/0x600 [lod]
19:42:01: [<ffffffffa0f76359>] lod_load_striping+0x69/0x190 [lod]
19:42:01: [<ffffffffa0f86eeb>] lod_declare_attr_set+0x22b/0x510 [lod]
19:42:02: [<ffffffffa100c750>] mdd_attr_set+0x660/0x1730 [mdd]
19:42:02: [<ffffffffa0ec2e48>] mdt_mfd_close+0x288/0x1980 [mdt]
19:42:02: [<ffffffffa0825aec>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
19:42:02: [<ffffffffa0825d2b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
19:42:02: [<ffffffffa0824ad5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
19:42:03: [<ffffffffa05dbbd5>] ? class_handle2object+0x95/0x190 [obdclass]
19:42:03: [<ffffffffa0ec473a>] mdt_close_internal+0x1fa/0x4e0 [mdt]
19:42:03: [<ffffffffa0ec4cd7>] mdt_close+0x2b7/0xa40 [mdt]
19:42:04: [<ffffffffa088fe3c>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
19:42:04: [<ffffffffa08378c1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
19:42:04: [<ffffffffa0836a80>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
19:42:04: [<ffffffff810a0fce>] kthread+0x9e/0xc0
19:42:04: [<ffffffff8100c28a>] child_rip+0xa/0x20
19:42:04: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
19:42:04: [<ffffffff8100c280>] ? child_rip+0x0/0x20

Also, according to the console/conman timestamps, the full stack dump took about 22 minutes to complete, and by then the umount thread claimed to be stuck was no longer present!

Comment by Bob Glossman (Inactive) [ 16/Nov/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/ba799724-8bfc-11e5-9b92-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 27/Jan/16 ]

I saw the same backtrace in a run of replay-dual test_26:

umount        S 0000000000000000     0 25955  25954 0x00000000
 ffff8800724f3858 0000000000000086 ffff880000000010 ffff8800724f37f8
 ffff8800724f37b8 0000000000000003 0000000000000001 0000000000000286
 0000000000000000 0000000000000000 ffff8800786b1068 ffff8800724f3fd8
Call Trace:
 [<ffffffffa1b23658>] top_trans_stop+0xb08/0xe60 [ptlrpc]
 [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
 [<ffffffffa0bfc91c>] lod_trans_stop+0x2bc/0x330 [lod]
 [<ffffffffa0b1ff2a>] mdd_trans_stop+0x1a/0xac [mdd]
 [<ffffffffa0b163c0>] mdd_close+0x1a0/0xbf0 [mdd]
 [<ffffffffa1043d01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0b7e5a9>] mdt_mfd_close+0x359/0x1980 [mdt]
 [<ffffffffa11a09f5>] ? keys_fill+0xd5/0x1b0 [obdclass]
 [<ffffffffa11a11cb>] ? lu_context_init+0x8b/0x160 [obdclass]
 [<ffffffffa0b4fc42>] mdt_obd_disconnect+0x392/0x500 [mdt]
 [<ffffffffa11743e4>] class_disconnect_export_list+0x334/0x620 [obdclass]
 [<ffffffffa1174b56>] class_disconnect_exports+0x116/0x2f0 [obdclass]
 [<ffffffffa118e887>] class_cleanup+0x157/0xd20 [obdclass]
 [<ffffffffa1043d01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa116f816>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa1191326>] class_process_config+0x1ed6/0x2830 [obdclass]
 [<ffffffffa1043d01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa119213f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
 [<ffffffffa116f816>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa11ca85c>] server_put_super+0xa0c/0xed0 [obdclass]
 [<ffffffff811b0136>] ? invalidate_inodes+0xf6/0x190
 [<ffffffff8119439b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff81194486>] kill_anon_super+0x16/0x60
 [<ffffffffa1194ff6>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff81194c27>] deactivate_super+0x57/0x80
 [<ffffffff811b4a9f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811b55eb>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Comment by James Nunez (Inactive) [ 03/Feb/16 ]

I experienced the same hang on MDS umount on master in recovery-small test_23, with the same stack trace as in the ticket description.

Logs are at:
2016-02-03 02:35:25 - https://testing.hpdd.intel.com/test_sets/5af4a4b6-ca30-11e5-9215-5254006e85c2
2016-02-03 12:51:17 - https://testing.hpdd.intel.com/test_sets/cd701a88-ca90-11e5-9609-5254006e85c2
2016-02-09 21:11:22 - https://testing.hpdd.intel.com/test_sets/e80a4f54-cf8a-11e5-bda4-5254006e85c2
2016-02-18 22:26:13 - https://testing.hpdd.intel.com/test_sets/7a0c1ed4-d6a0-11e5-96e2-5254006e85c2
2016-02-21 20:28:42 - https://testing.hpdd.intel.com/test_sets/c0300edc-d8eb-11e5-b4e5-5254006e85c2
2016-02-23 00:50:12 - https://testing.hpdd.intel.com/test_sets/455ee11e-d9eb-11e5-bab5-5254006e85c2

Comment by Lai Siyao [ 23/Sep/16 ]

This is a duplicate of LU-7903.
