Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Labels: None
- Affects Version: Lustre 2.1.5
- Fix Version: None
- Severity: 3
- Rank: 11860
Description
MDT threads were reported as inactive for >200s. The MDS backed up and required a reboot. But after the reboot the MDS hung again, requiring another reboot and a mount with abort recovery.
Uploading these files to the FTP site:
lustre-log.1385580907.6742.gz <- initial hang
lustre-log.1385589491.8362.gz <- recovery hang
We have a crash dump for the recovery hang.
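For reference, the abort-recovery remount mentioned above follows this general pattern. This is a sketch only: the device path and mount point are placeholders, not the actual values from this system.

```shell
# Placeholder device/mount point -- substitute the real MDT device and path.
# May need 'umount -f' if service threads are hung.
umount /mnt/lustre/mds
# abort_recov skips client recovery on mount, avoiding the recovery hang.
mount -t lustre -o abort_recov /dev/sdX /mnt/lustre/mds

# Recovery state on the MDT can then be inspected with:
lctl get_param mdt.*.recovery_status
```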
---- initial hang ----
Nov 27 10:30:14 nbp8-mds1 kernel: Lustre: 5687:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 377ef706-73ec-d593-7c7e-ac55fd582ec2@10.151.41.73@o2ib t0 exp (null) cur 1385577014 last 0
Nov 27 10:30:14 nbp8-mds1 kernel: Lustre: 6643:0:(ldlm_lib.c:952:target_handle_connect()) nbp8-MDT0000: connection from c2dc3e1e-9ec6-88a0-ddcb-182e74734295@10.151.41.73@o2ib t0 exp (null) cur 1385577014 last 0
Nov 27 10:30:53 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: haven't heard from client fd1a318e-556c-0397-a95e-9d2ed1998bc0 (at 10.151.41.73@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883f83c7fc00, cur 1385577053 expire 1385576903 last 1385576826
Nov 27 10:30:53 nbp8-mds1 kernel: Lustre: MGS: haven't heard from client 581c894b-1381-7f25-567b-57c83bbae311 (at 10.151.41.73@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883fa1722c00, cur 1385577053 expire 1385576903 last 1385576826
Nov 27 10:34:25 nbp8-mds1 kernel: LustreError: 5515:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
Nov 27 10:34:25 nbp8-mds1 kernel: LustreError: 5515:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with 10.151.32.5@o2ib (152): c: 6, oc: 0, rc: 8
Nov 27 10:34:52 nbp8-mds1 kernel: Lustre: 5687:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 243f629c-ac60-5881-7a9e-e96a02c21f7d@10.151.32.5@o2ib t0 exp (null) cur 1385577292 last 0
Nov 27 10:35:00 nbp8-mds1 kernel: LustreError: 5996:0:(quota_ctl.c:330:client_quota_ctl()) ptlrpc_queue_wait failed, rc: -3
Nov 27 10:35:00 nbp8-mds1 kernel: LustreError: 5996:0:(quota_ctl.c:330:client_quota_ctl()) Skipped 311 previous similar messages
Nov 27 10:35:39 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: haven't heard from client 3b493920-c724-792f-5966-cf83ffa67f75 (at 10.151.32.5@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881e3176fc00, cur 1385577339 expire 1385577189 last 1385577112
Nov 27 11:01:27 nbp8-mds1 kernel: LustreError: 5515:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
Nov 27 11:01:27 nbp8-mds1 kernel: LustreError: 5515:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with 10.151.27.18@o2ib (162): c: 7, oc: 0, rc: 8
Nov 27 11:03:43 nbp8-mds1 kernel: Lustre: 5686:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 236f42d4-1a48-7ef0-79e8-c65ae40bc796@10.151.27.18@o2ib t0 exp (null) cur 1385579023 last 0
Nov 27 11:03:43 nbp8-mds1 kernel: Lustre: 5686:0:(ldlm_lib.c:952:target_handle_connect()) Skipped 1 previous similar message
Nov 27 11:03:43 nbp8-mds1 kernel: Lustre: 7068:0:(ldlm_lib.c:952:target_handle_connect()) nbp8-MDT0000: connection from 5b0af5d5-d93c-d433-ec5c-7b17cd82c746@10.151.27.18@o2ib t0 exp (null) cur 1385579023 last 0
Nov 27 11:35:07 nbp8-mds1 kernel: Lustre: Service thread pid 6742 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 27 11:35:07 nbp8-mds1 kernel: Pid: 6742, comm: mdt_137
Nov 27 11:35:16 nbp8-mds1 kernel:
Nov 27 11:35:16 nbp8-mds1 kernel: Call Trace:
Nov 27 11:35:16 nbp8-mds1 kernel: [<ffffffffa04f819b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
Nov 27 11:35:16 nbp8-mds1 kernel: [<ffffffffa04f960e>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov 27 11:35:16 nbp8-mds1 kernel: [<ffffffffa0a9f6de>] qos_statfs_update+0x7fe/0xa70 [lov]
Nov 27 11:35:16 nbp8-mds1 kernel: [<ffffffff8110e42e>] ? find_get_page+0x1e/0xa0
Nov 27 11:35:16 nbp8-mds1 kernel: [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
Nov 27 11:35:16 nbp8-mds1 kernel: [<ffffffffa0aa00fd>] alloc_qos+0x1ad/0x21a0 [lov]
Nov 27 11:35:16 nbp8-mds1 kernel: [<ffffffffa0aa5fdf>] ? lsm_alloc_plain+0xff/0x930 [lov]
Nov 27 11:35:16 nbp8-mds1 kernel: [<ffffffffa0aa306c>] qos_prep_create+0x1ec/0x2380 [lov]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0a9c63a>] lov_prep_create_set+0xea/0x390 [lov]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0a84b0c>] lov_create+0x1ac/0x1400 [lov]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0d8b0d6>] ? mdd_get_md+0x96/0x2f0 [mdd]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0ea2f13>] ? osd_object_read_unlock+0x53/0xa0 [osd_ldiskfs]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0dab916>] ? mdd_read_unlock+0x26/0x30 [mdd]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0d8f90e>] mdd_lov_create+0x9ee/0x1ba0 [mdd]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0da1871>] mdd_create+0xf81/0x1a90 [mdd]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0ea9df3>] ? osd_oi_lookup+0x83/0x110 [osd_ldiskfs]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0ea456c>] ? osd_object_init+0xdc/0x3e0 [osd_ldiskfs]
Nov 27 11:35:22 nbp8-mds1 kernel: [<ffffffffa0eda3f7>] cml_create+0x97/0x250 [cmm]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e165e1>] ? mdt_version_get_save+0x91/0xd0 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e2c06e>] mdt_reint_open+0x1aae/0x28a0 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa078f724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0da456e>] ? md_ucred+0x1e/0x60 [mdd]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e14c81>] mdt_reint_rec+0x41/0xe0 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e0bed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e0c53d>] mdt_intent_reint+0x1ed/0x530 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e0ac09>] mdt_intent_policy+0x379/0x690 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa074b351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa07711ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e0b586>] mdt_enqueue+0x46/0x130 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e00772>] mdt_handle_common+0x932/0x1750 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa0e01665>] mdt_regular_handle+0x15/0x20 [mdt]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa079fb4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Nov 27 11:35:23 nbp8-mds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov 27 11:35:23 nbp8-mds1 kernel:
Nov 27 11:35:28 nbp8-mds1 kernel: LustreError: dumping log to /tmp/lustre-log.1385580907.6742
Nov 27 11:35:28 nbp8-mds1 kernel: Lustre: Service thread pid 6645 was inactive for 200.01s. The thread might be hung, or it might only be slow and will resume later.
---- after reboot hang ----
Nov 27 13:57:46 nbp8-mds1 kernel: LustreError: 6771:0:(ldlm_request.c:91:ldlm_expired_completion_wait()) Skipped 14 previous similar messages
Nov 27 13:58:11 nbp8-mds1 kernel: Lustre: Service thread pid 8362 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 27 13:58:11 nbp8-mds1 kernel: Pid: 8362, comm: mdt_454
Nov 27 13:58:14 nbp8-mds1 kernel:
Nov 27 13:58:14 nbp8-mds1 kernel: Call Trace:
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffff8151d552>] schedule_timeout+0x192/0x2e0
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffff8107bf80>] ? process_timeout+0x0/0x10
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0764c60>] ? ldlm_expired_completion_wait+0x0/0x260 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa04f95e1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0768d0d>] ldlm_completion_ast+0x48d/0x720 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0768506>] ldlm_cli_enqueue_local+0x1e6/0x560 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0768880>] ? ldlm_completion_ast+0x0/0x720 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0df9e60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0dfd2a0>] mdt_object_lock+0x320/0xb70 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0df9e60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0768880>] ? ldlm_completion_ast+0x0/0x720 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0e0dc62>] mdt_getattr_name_lock+0xe22/0x1880 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa078eb1d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa07b8486>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0790da4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0e0ec1d>] mdt_intent_getattr+0x2cd/0x4a0 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0e0ac09>] mdt_intent_policy+0x379/0x690 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa074b351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa07711ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0e0b586>] mdt_enqueue+0x46/0x130 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0e00772>] mdt_handle_common+0x932/0x1750 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa0e01665>] mdt_regular_handle+0x15/0x20 [mdt]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa079fb4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Nov 27 13:58:14 nbp8-mds1 kernel: [<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
Nov 27 13:58:15 nbp8-mds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov 27 13:58:15 nbp8-mds1 kernel:
Nov 27 13:58:15 nbp8-mds1 kernel: LustreError: dumping log to /tmp/lustre-log.1385589491.8362
Attachments
Issue Links
- is related to LU-4271: mds load goes very high and filesystem hangs after mounting mdt (Resolved)