Details
-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
Lustre 2.10.5
-
None
-
2
-
9223372036854775807
Description
Hi
We are experiencing the Lustre MDS and OSS server deadlock with high CPU usage and once the server is rebooted the file system goes back to normal. By looking at the stack traces this looks similar to LU-11465
Here are the stack trace from MDS server.
Nov 3 19:02:14 mds rsyslogd: -- MARK -- Nov 3 19:07:07 mds kernel: [468204.667688] INFO: task jbd2/dm-0:5862 blocked for more than 120 seconds. Nov 3 19:07:07 mds kernel: [468204.687998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 3 19:07:07 mds kernel: [468204.711686] jbd2/dm-0 D ffff883fcf614f10 0 5862 2 0x00000080 Nov 3 19:07:07 mds kernel: [468204.711693] Call Trace: Nov 3 19:07:07 mds kernel: [468204.719277] [<ffffffff81694cd9>] schedule+0x29/0x70 Nov 3 19:07:07 mds kernel: [468204.719303] [<ffffffffa01822b6>] jbd2_journal_commit_transaction+0x246/0x19a0 [jbd2] Nov 3 19:07:07 mds kernel: [468204.719309] [<ffffffff810cac9e>] ? account_entity_dequeue+0xae/0xd0 Nov 3 19:07:07 mds kernel: [468204.719312] [<ffffffff810ce77c>] ? dequeue_entity+0x11c/0x5e0 Nov 3 19:07:07 mds kernel: [468204.719315] [<ffffffff810c8185>] ? sched_clock_cpu+0x85/0xc0 Nov 3 19:07:07 mds kernel: [468204.719318] [<ffffffff8102954d>] ? __switch_to+0xcd/0x4b0 Nov 3 19:07:07 mds kernel: [468204.719322] [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30 Nov 3 19:07:07 mds kernel: [468204.719324] [<ffffffff816946f7>] ? __schedule+0x477/0xa30 Nov 3 19:07:07 mds kernel: [468204.719328] [<ffffffff810998ee>] ? try_to_del_timer_sync+0x5e/0x90 Nov 3 19:07:07 mds kernel: [468204.719334] [<ffffffffa0188ab9>] kjournald2+0xc9/0x260 [jbd2] Nov 3 19:07:07 mds kernel: [468204.719336] [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30 Nov 3 19:07:07 mds kernel: [468204.719341] [<ffffffffa01889f0>] ? commit_timeout+0x10/0x10 [jbd2] Nov 3 19:07:07 mds kernel: [468204.719343] [<ffffffff810b1131>] kthread+0xd1/0xe0 Nov 3 19:07:07 mds kernel: [468204.719345] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40 Nov 3 19:07:07 mds kernel: [468204.719347] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0 Nov 3 19:07:07 mds kernel: [468204.719349] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40 Nov 3 19:07:07 mds kernel: [468204.719351] INFO: task mdt00_002:5872 blocked for more than 120 seconds. Nov 3 19:07:07 mds kernel: [468204.719352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 3 19:07:07 mds kernel: [468204.719354] mdt00_002 D ffff881fc0dd4f10 0 5872 2 0x00000080 Nov 3 19:07:07 mds kernel: [468204.719354] Call Trace: Nov 3 19:07:07 mds kernel: [468204.719356] [<ffffffff81694cd9>] schedule+0x29/0x70 Nov 3 19:07:07 mds kernel: [468204.719361] [<ffffffffa017f085>] wait_transaction_locked+0x85/0xd0 [jbd2] Nov 3 19:07:07 mds kernel: [468204.719362] [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30 Nov 3 19:07:07 mds kernel: [468204.719367] [<ffffffffa017f368>] add_transaction_credits+0x268/0x2f0 [jbd2] Nov 3 19:07:07 mds kernel: [468204.719371] [<ffffffffa017f5e1>] start_this_handle+0x1a1/0x430 [jbd2] Nov 3 19:07:07 mds kernel: [468204.719385] [<ffffffffa1087fba>] ? osd_declare_write+0x1ea/0x450 [osd_ldiskfs] Nov 3 19:07:07 mds kernel: [468204.719390] [<ffffffff811df622>] ? kmem_cache_alloc+0x1c2/0x1f0 Nov 3 19:07:07 mds kernel: [468204.719394] [<ffffffffa017fa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2] Nov 3 19:07:07 mds kernel: [468204.719401] [<ffffffffa1069c8e>] ? osd_trans_start+0x1ae/0x460 [osd_ldiskfs] Nov 3 19:07:07 mds kernel: [468204.719410] [<ffffffffa0fcb909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Nov 3 19:07:07 mds kernel: [468204.719415] [<ffffffffa1069c8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs] Nov 3 19:07:07 mds kernel: [468204.719430] [<ffffffffa1232da7>] mdt_empty_transno+0xf7/0x840 [mdt] Nov 3 19:07:07 mds kernel: [468204.719439] [<ffffffffa1235ede>] mdt_mfd_open+0x8de/0xe70 [mdt] Nov 3 19:07:07 mds kernel: [468204.719446] [<ffffffffa120e012>] ? mdt_pack_acl2body+0x1b2/0x800 [mdt] Nov 3 19:07:07 mds kernel: [468204.719454] [<ffffffffa12369eb>] mdt_finish_open+0x57b/0x690 [mdt] Nov 3 19:07:07 mds kernel: [468204.719462] [<ffffffffa12382c8>] mdt_reint_open+0x17c8/0x3190 [mdt] Nov 3 19:07:07 mds kernel: [468204.719495] [<ffffffffa09b2311>] ? upcall_cache_get_entry+0x211/0x8f0 [obdclass] Nov 3 19:07:07 mds kernel: [468204.719513] [<ffffffffa09b726e>] ? lu_ucred+0x1e/0x30 [obdclass] Nov 3 19:07:07 mds kernel: [468204.719523] [<ffffffffa121d9e5>] ? mdt_ucred+0x15/0x20 [mdt] Nov 3 19:07:07 mds kernel: [468204.719532] [<ffffffffa121e2b1>] ? mdt_root_squash+0x21/0x430 [mdt] Nov 3 19:07:07 mds kernel: [468204.719542] [<ffffffffa122daf3>] mdt_reint_rec+0x83/0x210 [mdt] Nov 3 19:07:07 mds kernel: [468204.719551] [<ffffffffa120f33b>] mdt_reint_internal+0x5fb/0x9c0 [mdt] Nov 3 19:07:07 mds kernel: [468204.719560] [<ffffffffa120f862>] mdt_intent_reint+0x162/0x430 [mdt] Nov 3 19:07:07 mds kernel: [468204.719569] [<ffffffffa121a631>] mdt_intent_policy+0x441/0xc70 [mdt] Nov 3 19:07:07 mds kernel: [468204.719614] [<ffffffffa0b801db>] ? ldlm_resource_get+0xab/0xa60 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719640] [<ffffffffa0b792ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719678] [<ffffffffa0ba2b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719715] [<ffffffffa0bcada0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719761] [<ffffffffa0c28262>] tgt_enqueue+0x62/0x210 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719800] [<ffffffffa0c2beca>] tgt_request_handle+0x92a/0x1370 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719836] [<ffffffffa0bd44bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719872] [<ffffffffa0bd29b8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719877] [<ffffffff810c4d40>] ? wake_up_state+0x20/0x20 Nov 3 19:07:07 mds kernel: [468204.719911] [<ffffffffa0bd84a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719946] [<ffffffffa0bd7a10>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.719952] [<ffffffff810b1131>] kthread+0xd1/0xe0 Nov 3 19:07:07 mds kernel: [468204.719955] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40 Nov 3 19:07:07 mds kernel: [468204.719958] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0 Nov 3 19:07:07 mds kernel: [468204.719961] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40 Nov 3 19:07:07 mds kernel: [468204.719963] INFO: task mdt01_001:5874 blocked for more than 120 seconds. Nov 3 19:07:07 mds kernel: [468204.719965] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 3 19:07:07 mds kernel: [468204.719971] mdt01_001 D ffff881fc0dd0fd0 0 5874 2 0x00000080 Nov 3 19:07:07 mds kernel: [468204.719974] Call Trace: Nov 3 19:07:07 mds kernel: [468204.720005] [<ffffffffa0bae0a0>] ? ldlm_pool_add+0x80/0xf0 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.720010] [<ffffffff81694cd9>] schedule+0x29/0x70 Nov 3 19:07:07 mds kernel: [468204.720018] [<ffffffffa017f085>] wait_transaction_locked+0x85/0xd0 [jbd2] Nov 3 19:07:07 mds kernel: [468204.720022] [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30 Nov 3 19:07:07 mds kernel: [468204.720028] [<ffffffffa017f368>] add_transaction_credits+0x268/0x2f0 [jbd2] Nov 3 19:07:07 mds kernel: [468204.720035] [<ffffffffa017f5e1>] start_this_handle+0x1a1/0x430 [jbd2] Nov 3 19:07:07 mds kernel: [468204.720048] [<ffffffffa1087fba>] ? osd_declare_write+0x1ea/0x450 [osd_ldiskfs] Nov 3 19:07:07 mds kernel: [468204.720053] [<ffffffff811df622>] ? kmem_cache_alloc+0x1c2/0x1f0 Nov 3 19:07:07 mds kernel: [468204.720059] [<ffffffffa017fa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2] Nov 3 19:07:07 mds kernel: [468204.720068] [<ffffffffa1069c8e>] ? osd_trans_start+0x1ae/0x460 [osd_ldiskfs] Nov 3 19:07:07 mds kernel: [468204.720078] [<ffffffffa0fcb909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Nov 3 19:07:07 mds kernel: [468204.720088] [<ffffffffa1069c8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs] Nov 3 19:07:07 mds kernel: [468204.720100] [<ffffffffa1232da7>] mdt_empty_transno+0xf7/0x840 [mdt] Nov 3 19:07:07 mds kernel: [468204.720112] [<ffffffffa1235ede>] mdt_mfd_open+0x8de/0xe70 [mdt] Nov 3 19:07:07 mds kernel: [468204.720124] [<ffffffffa120e012>] ? mdt_pack_acl2body+0x1b2/0x800 [mdt] Nov 3 19:07:07 mds kernel: [468204.720136] [<ffffffffa12369eb>] mdt_finish_open+0x57b/0x690 [mdt] Nov 3 19:07:07 mds kernel: [468204.720149] [<ffffffffa12382c8>] mdt_reint_open+0x17c8/0x3190 [mdt] Nov 3 19:07:07 mds kernel: [468204.720170] [<ffffffffa09b2311>] ? upcall_cache_get_entry+0x211/0x8f0 [obdclass] Nov 3 19:07:07 mds kernel: [468204.720190] [<ffffffffa09b726e>] ? lu_ucred+0x1e/0x30 [obdclass] Nov 3 19:07:07 mds kernel: [468204.720202] [<ffffffffa121d9e5>] ? mdt_ucred+0x15/0x20 [mdt] Nov 3 19:07:07 mds kernel: [468204.720214] [<ffffffffa121e2b1>] ? mdt_root_squash+0x21/0x430 [mdt] Nov 3 19:07:07 mds kernel: [468204.720227] [<ffffffffa122daf3>] mdt_reint_rec+0x83/0x210 [mdt] Nov 3 19:07:07 mds kernel: [468204.720239] [<ffffffffa120f33b>] mdt_reint_internal+0x5fb/0x9c0 [mdt] Nov 3 19:07:07 mds kernel: [468204.720249] [<ffffffffa120f862>] mdt_intent_reint+0x162/0x430 [mdt] Nov 3 19:07:07 mds kernel: [468204.720262] [<ffffffffa121a631>] mdt_intent_policy+0x441/0xc70 [mdt] Nov 3 19:07:07 mds kernel: [468204.720291] [<ffffffffa0b801db>] ? ldlm_resource_get+0xab/0xa60 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.720319] [<ffffffffa0b792ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.720350] [<ffffffffa0ba2b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.720382] [<ffffffffa0bcada0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.720418] [<ffffffffa0c28262>] tgt_enqueue+0x62/0x210 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.720456] [<ffffffffa0c2beca>] tgt_request_handle+0x92a/0x1370 [ptlrpc] Nov 3 19:07:07 mds kernel: [468204.720491] [<ffffffffa0bd44bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] Nov 3 19:07:14 mds kernel: [468204.720525] [<ffffffffa0bd29b8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] Nov 3 19:07:14 mds kernel: [468204.720529] [<ffffffff810c4d52>] ? default_wake_function+0x12/0x20 Nov 3 19:07:14 mds kernel: [468204.720531] [<ffffffff810bad2b>] ? __wake_up_common+0x5b/0x90 Nov 3 19:07:14 mds kernel: [468204.720564] [<ffffffffa0bd84a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] Nov 3 19:07:14 mds kernel: [468204.720597] [<ffffffffa0bd7a10>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc] Nov 3 19:07:14 mds kernel: [468204.720600] [<ffffffff810b1131>] kthread+0xd1/0xe0 Nov 3 19:07:14 mds kernel: [468204.720602] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40 Nov 3 19:07:14 mds kernel: [468204.720603] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0 Nov 3 19:07:14 mds kernel: [468204.720605] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40 Nov 3 19:07:14 mds kernel: [468204.720609] INFO: task osp-pre-141-0:5909 blocked for more than 120 seconds. Nov 3 19:07:14 mds kernel: [468204.720610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 3 19:07:14 mds kernel: [468204.720612] osp-pre-141-0 D ffff881fd0621fa0 0 5909 2 0x00000080
Here are the stack trace from the OSS server.
Nov 3 19:08:40 oss kernel: [469273.871420] Pid: 85809, comm: ll_ost_io01_185 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018 Nov 3 19:08:40 oss kernel: [469273.871421] Call Trace: Nov 3 19:08:40 oss kernel: [469273.871450] [<ffffffffa0179085>] wait_transaction_locked+0x85/0xd0 [jbd2] Nov 3 19:08:40 oss kernel: [469273.892337] [<ffffffffa0179368>] add_transaction_credits+0x268/0x2f0 [jbd2] Nov 3 19:08:40 oss kernel: [469273.892340] [<ffffffffa01795e1>] start_this_handle+0x1a1/0x430 [jbd2] Nov 3 19:08:40 oss kernel: [469273.892344] [<ffffffffa0179a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2] Nov 3 19:08:40 oss kernel: [469273.892358] [<ffffffffa14a7909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Nov 3 19:08:40 oss kernel: [469273.892370] [<ffffffffa108dc8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs] Nov 3 19:08:40 oss kernel: [469273.892378] [<ffffffffa111127e>] ofd_trans_start+0x6e/0xf0 [ofd] Nov 3 19:08:40 oss kernel: [469273.892383] [<ffffffffa111772b>] ofd_commitrw_write+0x94b/0x1c90 [ofd] Nov 3 19:08:40 oss kernel: [469273.892394] [<ffffffffa111b779>] ofd_commitrw+0x4c9/0xae0 [ofd] Nov 3 19:08:40 oss kernel: [469273.892456] [<ffffffffa0c52510>] obd_commitrw+0x2f3/0x336 [ptlrpc] Nov 3 19:08:40 oss kernel: [469273.892504] [<ffffffffa0c25145>] tgt_brw_write+0xfb5/0x1780 [ptlrpc] Nov 3 19:08:40 oss kernel: [469273.892539] [<ffffffffa0c20eca>] tgt_request_handle+0x92a/0x1370 [ptlrpc] Nov 3 19:08:40 oss kernel: [469273.892570] [<ffffffffa0bc94bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] Nov 3 19:08:40 oss kernel: [469273.892600] [<ffffffffa0bcd4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] Nov 3 19:08:40 oss kernel: [469273.892609] [<ffffffff810b1131>] kthread+0xd1/0xe0 Nov 3 19:08:40 oss kernel: [469273.892616] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0 Nov 3 19:08:40 oss kernel: [469273.892624] [<ffffffffffffffff>] 0xffffffffffffffff Nov 3 19:08:40 oss kernel: [469273.892627] LustreError: dumping log to /tmp/lustre-log.1541297320.85809 Nov 3 19:08:53 oss kernel: [469287.156081] LNet: Service thread pid 85694 was inactive for 301.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Nov 3 19:08:53 oss kernel: [469287.207070] Pid: 85694, comm: ll_ost_io00_185 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018 Nov 3 19:08:53 oss kernel: [469287.207071] Call Trace: Nov 3 19:08:53 oss kernel: [469287.207100] [<ffffffffa0179085>] wait_transaction_locked+0x85/0xd0 [jbd2] Nov 3 19:08:53 oss kernel: [469287.227995] [<ffffffffa0179368>] add_transaction_credits+0x268/0x2f0 [jbd2] Nov 3 19:08:53 oss kernel: [469287.227999] [<ffffffffa01795e1>] start_this_handle+0x1a1/0x430 [jbd2] Nov 3 19:08:53 oss kernel: [469287.228002] [<ffffffffa0179a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2] Nov 3 19:08:53 oss kernel: [469287.228010] [<ffffffffa14a7909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] Nov 3 19:08:53 oss kernel: [469287.228024] [<ffffffffa108dc8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs] Nov 3 19:08:53 oss kernel: [469287.228036] [<ffffffffa1117ce3>] ofd_commitrw_write+0xf03/0x1c90 [ofd] Nov 3 19:08:53 oss kernel: [469287.228042] [<ffffffffa111b779>] ofd_commitrw+0x4c9/0xae0 [ofd] Nov 3 19:08:53 oss kernel: [469287.228084] [<ffffffffa0c52510>] obd_commitrw+0x2f3/0x336 [ptlrpc] Nov 3 19:08:53 oss kernel: [469287.228125] [<ffffffffa0c25145>] tgt_brw_write+0xfb5/0x1780 [ptlrpc] Nov 3 19:08:53 oss kernel: [469287.228160] [<ffffffffa0c20eca>] tgt_request_handle+0x92a/0x1370 [ptlrpc] Nov 3 19:08:53 oss kernel: [469287.228192] [<ffffffffa0bc94bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc] Nov 3 19:08:53 oss kernel: [469287.228224] [<ffffffffa0bcd4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] Nov 3 19:08:53 oss kernel: [469287.228227] [<ffffffff810b1131>] kthread+0xd1/0xe0 Nov 3 19:08:53 oss kernel: [469287.228230] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0 Nov 3 19:08:53 oss kernel: [469287.228235] [<ffffffffffffffff>] 0xffffffffffffffff
If you need complete crash dump then let me know and I will upload it.
Thank You,
Manish