[LU-11613] MDS and OSS locked up wait_transaction_locked+0x85/0xd0 [jbd2] Created: 04/Nov/18 Updated: 27/Aug/19 Resolved: 27/Aug/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Manish (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS Linux release 7.5.1804 (Core) |
||
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi, we are experiencing a deadlock on the Lustre MDS and OSS servers with high CPU usage; once the server is rebooted, the file system goes back to normal. Judging by the stack traces, this looks similar to a previously reported issue. Here are the stack traces from the MDS server.
Nov 3 19:02:14 mds rsyslogd: -- MARK --
Nov 3 19:07:07 mds kernel: [468204.667688] INFO: task jbd2/dm-0:5862 blocked for more than 120 seconds.
Nov 3 19:07:07 mds kernel: [468204.687998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 3 19:07:07 mds kernel: [468204.711686] jbd2/dm-0 D ffff883fcf614f10 0 5862 2 0x00000080
Nov 3 19:07:07 mds kernel: [468204.711693] Call Trace:
Nov 3 19:07:07 mds kernel: [468204.719277] [<ffffffff81694cd9>] schedule+0x29/0x70
Nov 3 19:07:07 mds kernel: [468204.719303] [<ffffffffa01822b6>] jbd2_journal_commit_transaction+0x246/0x19a0 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.719309] [<ffffffff810cac9e>] ? account_entity_dequeue+0xae/0xd0
Nov 3 19:07:07 mds kernel: [468204.719312] [<ffffffff810ce77c>] ? dequeue_entity+0x11c/0x5e0
Nov 3 19:07:07 mds kernel: [468204.719315] [<ffffffff810c8185>] ? sched_clock_cpu+0x85/0xc0
Nov 3 19:07:07 mds kernel: [468204.719318] [<ffffffff8102954d>] ? __switch_to+0xcd/0x4b0
Nov 3 19:07:07 mds kernel: [468204.719322] [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30
Nov 3 19:07:07 mds kernel: [468204.719324] [<ffffffff816946f7>] ? __schedule+0x477/0xa30
Nov 3 19:07:07 mds kernel: [468204.719328] [<ffffffff810998ee>] ? try_to_del_timer_sync+0x5e/0x90
Nov 3 19:07:07 mds kernel: [468204.719334] [<ffffffffa0188ab9>] kjournald2+0xc9/0x260 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.719336] [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30
Nov 3 19:07:07 mds kernel: [468204.719341] [<ffffffffa01889f0>] ? commit_timeout+0x10/0x10 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.719343] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 3 19:07:07 mds kernel: [468204.719345] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov 3 19:07:07 mds kernel: [468204.719347] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 3 19:07:07 mds kernel: [468204.719349] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov 3 19:07:07 mds kernel: [468204.719351] INFO: task mdt00_002:5872 blocked for more than 120 seconds.
Nov 3 19:07:07 mds kernel: [468204.719352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 3 19:07:07 mds kernel: [468204.719354] mdt00_002 D ffff881fc0dd4f10 0 5872 2 0x00000080
Nov 3 19:07:07 mds kernel: [468204.719354] Call Trace:
Nov 3 19:07:07 mds kernel: [468204.719356] [<ffffffff81694cd9>] schedule+0x29/0x70
Nov 3 19:07:07 mds kernel: [468204.719361] [<ffffffffa017f085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.719362] [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30
Nov 3 19:07:07 mds kernel: [468204.719367] [<ffffffffa017f368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.719371] [<ffffffffa017f5e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.719385] [<ffffffffa1087fba>] ? osd_declare_write+0x1ea/0x450 [osd_ldiskfs]
Nov 3 19:07:07 mds kernel: [468204.719390] [<ffffffff811df622>] ? kmem_cache_alloc+0x1c2/0x1f0
Nov 3 19:07:07 mds kernel: [468204.719394] [<ffffffffa017fa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.719401] [<ffffffffa1069c8e>] ? osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov 3 19:07:07 mds kernel: [468204.719410] [<ffffffffa0fcb909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov 3 19:07:07 mds kernel: [468204.719415] [<ffffffffa1069c8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov 3 19:07:07 mds kernel: [468204.719430] [<ffffffffa1232da7>] mdt_empty_transno+0xf7/0x840 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719439] [<ffffffffa1235ede>] mdt_mfd_open+0x8de/0xe70 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719446] [<ffffffffa120e012>] ? mdt_pack_acl2body+0x1b2/0x800 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719454] [<ffffffffa12369eb>] mdt_finish_open+0x57b/0x690 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719462] [<ffffffffa12382c8>] mdt_reint_open+0x17c8/0x3190 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719495] [<ffffffffa09b2311>] ? upcall_cache_get_entry+0x211/0x8f0 [obdclass]
Nov 3 19:07:07 mds kernel: [468204.719513] [<ffffffffa09b726e>] ? lu_ucred+0x1e/0x30 [obdclass]
Nov 3 19:07:07 mds kernel: [468204.719523] [<ffffffffa121d9e5>] ? mdt_ucred+0x15/0x20 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719532] [<ffffffffa121e2b1>] ? mdt_root_squash+0x21/0x430 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719542] [<ffffffffa122daf3>] mdt_reint_rec+0x83/0x210 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719551] [<ffffffffa120f33b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719560] [<ffffffffa120f862>] mdt_intent_reint+0x162/0x430 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719569] [<ffffffffa121a631>] mdt_intent_policy+0x441/0xc70 [mdt]
Nov 3 19:07:07 mds kernel: [468204.719614] [<ffffffffa0b801db>] ? ldlm_resource_get+0xab/0xa60 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719640] [<ffffffffa0b792ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719678] [<ffffffffa0ba2b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719715] [<ffffffffa0bcada0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719761] [<ffffffffa0c28262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719800] [<ffffffffa0c2beca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719836] [<ffffffffa0bd44bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719872] [<ffffffffa0bd29b8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719877] [<ffffffff810c4d40>] ? wake_up_state+0x20/0x20
Nov 3 19:07:07 mds kernel: [468204.719911] [<ffffffffa0bd84a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719946] [<ffffffffa0bd7a10>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.719952] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 3 19:07:07 mds kernel: [468204.719955] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov 3 19:07:07 mds kernel: [468204.719958] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 3 19:07:07 mds kernel: [468204.719961] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov 3 19:07:07 mds kernel: [468204.719963] INFO: task mdt01_001:5874 blocked for more than 120 seconds.
Nov 3 19:07:07 mds kernel: [468204.719965] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 3 19:07:07 mds kernel: [468204.719971] mdt01_001 D ffff881fc0dd0fd0 0 5874 2 0x00000080
Nov 3 19:07:07 mds kernel: [468204.719974] Call Trace:
Nov 3 19:07:07 mds kernel: [468204.720005] [<ffffffffa0bae0a0>] ? ldlm_pool_add+0x80/0xf0 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.720010] [<ffffffff81694cd9>] schedule+0x29/0x70
Nov 3 19:07:07 mds kernel: [468204.720018] [<ffffffffa017f085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.720022] [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30
Nov 3 19:07:07 mds kernel: [468204.720028] [<ffffffffa017f368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.720035] [<ffffffffa017f5e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.720048] [<ffffffffa1087fba>] ? osd_declare_write+0x1ea/0x450 [osd_ldiskfs]
Nov 3 19:07:07 mds kernel: [468204.720053] [<ffffffff811df622>] ? kmem_cache_alloc+0x1c2/0x1f0
Nov 3 19:07:07 mds kernel: [468204.720059] [<ffffffffa017fa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov 3 19:07:07 mds kernel: [468204.720068] [<ffffffffa1069c8e>] ? osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov 3 19:07:07 mds kernel: [468204.720078] [<ffffffffa0fcb909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov 3 19:07:07 mds kernel: [468204.720088] [<ffffffffa1069c8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov 3 19:07:07 mds kernel: [468204.720100] [<ffffffffa1232da7>] mdt_empty_transno+0xf7/0x840 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720112] [<ffffffffa1235ede>] mdt_mfd_open+0x8de/0xe70 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720124] [<ffffffffa120e012>] ? mdt_pack_acl2body+0x1b2/0x800 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720136] [<ffffffffa12369eb>] mdt_finish_open+0x57b/0x690 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720149] [<ffffffffa12382c8>] mdt_reint_open+0x17c8/0x3190 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720170] [<ffffffffa09b2311>] ? upcall_cache_get_entry+0x211/0x8f0 [obdclass]
Nov 3 19:07:07 mds kernel: [468204.720190] [<ffffffffa09b726e>] ? lu_ucred+0x1e/0x30 [obdclass]
Nov 3 19:07:07 mds kernel: [468204.720202] [<ffffffffa121d9e5>] ? mdt_ucred+0x15/0x20 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720214] [<ffffffffa121e2b1>] ? mdt_root_squash+0x21/0x430 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720227] [<ffffffffa122daf3>] mdt_reint_rec+0x83/0x210 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720239] [<ffffffffa120f33b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720249] [<ffffffffa120f862>] mdt_intent_reint+0x162/0x430 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720262] [<ffffffffa121a631>] mdt_intent_policy+0x441/0xc70 [mdt]
Nov 3 19:07:07 mds kernel: [468204.720291] [<ffffffffa0b801db>] ? ldlm_resource_get+0xab/0xa60 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.720319] [<ffffffffa0b792ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.720350] [<ffffffffa0ba2b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.720382] [<ffffffffa0bcada0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.720418] [<ffffffffa0c28262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.720456] [<ffffffffa0c2beca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 3 19:07:07 mds kernel: [468204.720491] [<ffffffffa0bd44bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 3 19:07:14 mds kernel: [468204.720525] [<ffffffffa0bd29b8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Nov 3 19:07:14 mds kernel: [468204.720529] [<ffffffff810c4d52>] ? default_wake_function+0x12/0x20
Nov 3 19:07:14 mds kernel: [468204.720531] [<ffffffff810bad2b>] ? __wake_up_common+0x5b/0x90
Nov 3 19:07:14 mds kernel: [468204.720564] [<ffffffffa0bd84a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 3 19:07:14 mds kernel: [468204.720597] [<ffffffffa0bd7a10>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Nov 3 19:07:14 mds kernel: [468204.720600] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 3 19:07:14 mds kernel: [468204.720602] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov 3 19:07:14 mds kernel: [468204.720603] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 3 19:07:14 mds kernel: [468204.720605] [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov 3 19:07:14 mds kernel: [468204.720609] INFO: task osp-pre-141-0:5909 blocked for more than 120 seconds.
Nov 3 19:07:14 mds kernel: [468204.720610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 3 19:07:14 mds kernel: [468204.720612] osp-pre-141-0 D ffff881fd0621fa0 0 5909 2 0x00000080
Here are the stack traces from the OSS server.
Nov 3 19:08:40 oss kernel: [469273.871420] Pid: 85809, comm: ll_ost_io01_185 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 3 19:08:40 oss kernel: [469273.871421] Call Trace:
Nov 3 19:08:40 oss kernel: [469273.871450] [<ffffffffa0179085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov 3 19:08:40 oss kernel: [469273.892337] [<ffffffffa0179368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov 3 19:08:40 oss kernel: [469273.892340] [<ffffffffa01795e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov 3 19:08:40 oss kernel: [469273.892344] [<ffffffffa0179a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov 3 19:08:40 oss kernel: [469273.892358] [<ffffffffa14a7909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov 3 19:08:40 oss kernel: [469273.892370] [<ffffffffa108dc8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov 3 19:08:40 oss kernel: [469273.892378] [<ffffffffa111127e>] ofd_trans_start+0x6e/0xf0 [ofd]
Nov 3 19:08:40 oss kernel: [469273.892383] [<ffffffffa111772b>] ofd_commitrw_write+0x94b/0x1c90 [ofd]
Nov 3 19:08:40 oss kernel: [469273.892394] [<ffffffffa111b779>] ofd_commitrw+0x4c9/0xae0 [ofd]
Nov 3 19:08:40 oss kernel: [469273.892456] [<ffffffffa0c52510>] obd_commitrw+0x2f3/0x336 [ptlrpc]
Nov 3 19:08:40 oss kernel: [469273.892504] [<ffffffffa0c25145>] tgt_brw_write+0xfb5/0x1780 [ptlrpc]
Nov 3 19:08:40 oss kernel: [469273.892539] [<ffffffffa0c20eca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 3 19:08:40 oss kernel: [469273.892570] [<ffffffffa0bc94bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 3 19:08:40 oss kernel: [469273.892600] [<ffffffffa0bcd4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 3 19:08:40 oss kernel: [469273.892609] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 3 19:08:40 oss kernel: [469273.892616] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 3 19:08:40 oss kernel: [469273.892624] [<ffffffffffffffff>] 0xffffffffffffffff
Nov 3 19:08:40 oss kernel: [469273.892627] LustreError: dumping log to /tmp/lustre-log.1541297320.85809
Nov 3 19:08:53 oss kernel: [469287.156081] LNet: Service thread pid 85694 was inactive for 301.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 3 19:08:53 oss kernel: [469287.207070] Pid: 85694, comm: ll_ost_io00_185 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov 3 19:08:53 oss kernel: [469287.207071] Call Trace:
Nov 3 19:08:53 oss kernel: [469287.207100] [<ffffffffa0179085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov 3 19:08:53 oss kernel: [469287.227995] [<ffffffffa0179368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov 3 19:08:53 oss kernel: [469287.227999] [<ffffffffa01795e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov 3 19:08:53 oss kernel: [469287.228002] [<ffffffffa0179a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov 3 19:08:53 oss kernel: [469287.228010] [<ffffffffa14a7909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov 3 19:08:53 oss kernel: [469287.228024] [<ffffffffa108dc8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov 3 19:08:53 oss kernel: [469287.228036] [<ffffffffa1117ce3>] ofd_commitrw_write+0xf03/0x1c90 [ofd]
Nov 3 19:08:53 oss kernel: [469287.228042] [<ffffffffa111b779>] ofd_commitrw+0x4c9/0xae0 [ofd]
Nov 3 19:08:53 oss kernel: [469287.228084] [<ffffffffa0c52510>] obd_commitrw+0x2f3/0x336 [ptlrpc]
Nov 3 19:08:53 oss kernel: [469287.228125] [<ffffffffa0c25145>] tgt_brw_write+0xfb5/0x1780 [ptlrpc]
Nov 3 19:08:53 oss kernel: [469287.228160] [<ffffffffa0c20eca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 3 19:08:53 oss kernel: [469287.228192] [<ffffffffa0bc94bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 3 19:08:53 oss kernel: [469287.228224] [<ffffffffa0bcd4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 3 19:08:53 oss kernel: [469287.228227] [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov 3 19:08:53 oss kernel: [469287.228230] [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov 3 19:08:53 oss kernel: [469287.228235] [<ffffffffffffffff>] 0xffffffffffffffff
If you need the complete crash dump then let me know and I will upload it. Thank You, Manish
|
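For context, all of the blocked service threads in the traces above are parked in start_this_handle()/wait_transaction_locked(), i.e. waiting to join a journal transaction that has already been locked for commit, while the commit thread (jbd2/dm-0) is itself waiting for the handles already attached to that transaction to be stopped. Below is a minimal userspace sketch of that dependency; the types and function names are simplified stand-ins, not the real jbd2 API. One handle held open too long is enough to stall the commit, and every other thread that tries to start a handle then queues up behind the locked transaction.

```c
/*
 * Toy model of the jbd2 behaviour shown in the hung-task traces.
 * Simplified names and types, NOT the real kernel API:
 *   journal_start()  ~ jbd2__journal_start()/start_this_handle()
 *   journal_stop()   ~ jbd2_journal_stop()
 *   journal_commit() ~ jbd2_journal_commit_transaction() in jbd2/dm-0
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static struct {
    pthread_mutex_t lock;
    pthread_cond_t  wait;      /* both handle starters and the committer wait here */
    bool            t_locked;  /* transaction locked for commit                     */
    int             updates;   /* handles still attached to the transaction         */
} j = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0 };

static void journal_start(void)
{
    pthread_mutex_lock(&j.lock);
    while (j.t_locked)                    /* the "wait_transaction_locked" state */
        pthread_cond_wait(&j.wait, &j.lock);
    j.updates++;
    pthread_mutex_unlock(&j.lock);
}

static void journal_stop(void)
{
    pthread_mutex_lock(&j.lock);
    if (--j.updates == 0)
        pthread_cond_broadcast(&j.wait);  /* last handle gone: commit may proceed */
    pthread_mutex_unlock(&j.lock);
}

static void journal_commit(void)
{
    pthread_mutex_lock(&j.lock);
    j.t_locked = true;                    /* lock the running transaction          */
    while (j.updates > 0)                 /* wait for already-open handles to stop */
        pthread_cond_wait(&j.wait, &j.lock);
    j.t_locked = false;                   /* commit done, new transaction opens    */
    pthread_cond_broadcast(&j.wait);      /* release the queued starters           */
    pthread_mutex_unlock(&j.lock);
}

static void *slow_handle(void *arg)       /* one handle held open too long */
{
    journal_start();
    sleep(3);
    journal_stop();
    return NULL;
}

static void *committer(void *arg)         /* plays the role of jbd2/dm-0 */
{
    sleep(1);
    journal_commit();
    return NULL;
}

static void *service_thread(void *arg)    /* mdt/ost threads starting handles */
{
    sleep(2);
    journal_start();                      /* piles up behind the locked transaction */
    printf("thread %ld finally got a handle\n", (long)arg);
    journal_stop();
    return NULL;
}

int main(void)
{
    pthread_t t[6];
    pthread_create(&t[0], NULL, slow_handle, NULL);
    pthread_create(&t[1], NULL, committer, NULL);
    for (long i = 2; i < 6; i++)
        pthread_create(&t[i], NULL, service_thread, (void *)i);
    for (int i = 0; i < 6; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

In the real system the "slow handle" would be whatever operation keeps its journal handle open while waiting on something else entirely, which is what the later comments in this ticket speculate about.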
| Comments |
| Comment by Peter Jones [ 04/Nov/18 ] |
|
Manish, it sounds like this is a bug that disrupts production rather than one that takes the whole system completely out of service, so I have adjusted the severity accordingly. Alex, can you please investigate? Thanks, Peter |
| Comment by Alex Zhuravlev [ 04/Nov/18 ] |
|
can you please attach backtraces for all the processes?
|
| Comment by Manish (Inactive) [ 04/Nov/18 ] |
|
Hi Alex, I have uploaded the stack trace to the FTP site; the file name is "foreach_bt.txt", under the ticket dir. Let me know if you are able to view that file. Thank You, Manish |
| Comment by Manish (Inactive) [ 05/Nov/18 ] |
|
Hi, we hit this issue again on one of our other file systems, and the stack traces look similar. I have already uploaded the new stack traces to the FTP site; the file names are foreach_bt_mds_11-05-2018.txt and foreach_bt_oss_11-05-2018.txt.
Can we please raise the priority, as this is now an ongoing issue on most of our Lustre file systems since the upgrade to 2.10.5. Thank You, Manish |
| Comment by Alex Zhuravlev [ 05/Nov/18 ] |
|
thanks, it will take some time to analyse the traces.
|
| Comment by Alex Zhuravlev [ 06/Nov/18 ] |
|
Manish, can you tell us how many MDSs are in use and whether LFSCK was running at that time?
|
| Comment by Mahmoud Hanafi [ 06/Nov/18 ] |
|
There is only one MDS and LFSCK was not running. |
| Comment by Marion Hakanson (Inactive) [ 07/Nov/18 ] |
|
FYI, our lockups associated with Another difference from our
|
| Comment by Alex Zhuravlev [ 07/Nov/18 ] |
|
It would be very helpful if you could provide Lustre logs and/or dmesg messages. I think I understand the cause (and have patches in testing), but I still don't understand how we got into that situation.
|
| Comment by Alex Zhuravlev [ 07/Nov/18 ] |
|
Do you use (or did you previously use) project quota?
|
| Comment by Alex Zhuravlev [ 07/Nov/18 ] |
|
AFAICS, some client sends a chown/chgrp to the quota-enabled MDS, which in turn needs to consult the OST about whether the file can be moved into another group. That file is being actively written, so the OST hits a deadlock (though I can't really explain why the OST hits it again and again; that is supposed to be a race-like condition). |
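A toy illustration of the ordering inversion described in the comment above, under the assumption that one thread keeps a journal handle open while it waits for the group/quota change to be resolved, and the thread resolving the group change needs to start a handle of its own. The names are hypothetical and this is not the actual osd/ofd/quota code path; the program intentionally deadlocks when run:

```c
/* Hypothetical AB-BA sketch of the chgrp-vs-write scenario described above.
 * "journal_handle" stands in for the jbd2 transaction and "quota_state"
 * for the per-group quota/ownership state; neither is a real Lustre symbol.
 * When run, both threads block forever, mirroring the hung-task reports. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t journal_handle = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t quota_state    = PTHREAD_MUTEX_INITIALIZER;

static void *ost_write(void *arg)
{
    pthread_mutex_lock(&journal_handle);   /* transaction handle opened   */
    sleep(1);
    puts("write: handle open, waiting for quota state...");
    pthread_mutex_lock(&quota_state);      /* blocks: chgrp side holds it */
    pthread_mutex_unlock(&quota_state);
    pthread_mutex_unlock(&journal_handle);
    return NULL;
}

static void *chgrp_request(void *arg)
{
    pthread_mutex_lock(&quota_state);      /* group change in progress    */
    sleep(1);
    puts("chgrp: quota state held, starting a journal handle...");
    pthread_mutex_lock(&journal_handle);   /* blocks: write side holds it */
    pthread_mutex_unlock(&journal_handle);
    pthread_mutex_unlock(&quota_state);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, ost_write, NULL);
    pthread_create(&b, NULL, chgrp_request, NULL);
    pthread_join(a, NULL);                 /* never returns: circular wait */
    pthread_join(b, NULL);
    return 0;
}
```

If either side gets stuck like this while attached to the running transaction, the jbd2 commit cannot complete, which would be consistent with every other MDT/OST thread ending up in wait_transaction_locked as in the traces above.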
| Comment by Manish (Inactive) [ 07/Nov/18 ] |
|
Hi Alex, we have never used project quota. I will share the dmesg logs soon with Oleg and you can get more details from him. Thank You, Manish |
| Comment by Manish (Inactive) [ 07/Nov/18 ] |
|
Hi Alex, sorry for the confusion about what logs I can share, as I am trying to get familiar with the security limitations on sharing data. I have uploaded the dmesg logs to the FTP site under the ticket directory; the files are "mds-vmcore-dmesg.txt" and "oss-vmcore-dmesg.txt". Thank You, Manish
|
| Comment by Alex Zhuravlev [ 07/Nov/18 ] |
|
$ pwd
/scratch/ftp/uploads/LU-11613
[alexeyzh@ssh-2 LU-11613]$ ls -l
total 4448
-rw-r--r-- 1 nobody ftp  115858 Nov 4 16:42 foreach_backtrace.zip
-rw-r--r-- 1 nobody ftp 1724395 Nov 5 19:23 foreach_bt_mds_11-05-2018.txt
-rw-r--r-- 1 nobody ftp 1315475 Nov 5 19:23 foreach_bt_oss_11-05-2018.txt
-rw-r--r-- 1 nobody ftp 1389778 Nov 4 19:16 foreach_bt.txt
probably into a different dir? |
| Comment by Manish (Inactive) [ 07/Nov/18 ] |
|
Hi Alex, no, it's in the same dir.
ftp.whamcloud.com:/uploads/LU-11613> ls
-rw-r--r-- 1 99 50  115858 Nov 04 16:42 foreach_backtrace.zip
-rw-r--r-- 1 99 50 1389778 Nov 04 19:16 foreach_bt.txt
-rw-r--r-- 1 99 50 1724395 Nov 05 19:23 foreach_bt_mds_11-05-2018.txt
-rw-r--r-- 1 99 50 1315475 Nov 05 19:23 foreach_bt_oss_11-05-2018.txt
-rw-r--r-- 1 99 50 1042487 Nov 07 16:25 mds-vmcore-dmesg.txt
-rw-r--r-- 1 99 50 1042487 Nov 07 16:26 oss-vmcore-dmesg.txt
Please refresh the dir at your end. Thank You, Manish |
| Comment by Alex Zhuravlev [ 08/Nov/18 ] |
|
Did you upgrade Lustre recently? If yes, what was the previous version? I'm able to reproduce the issue and the fix seems to work, but we'd like to understand why you started to experience the issue. |
| Comment by Mahmoud Hanafi [ 08/Nov/18 ] |
|
Yes, these file systems were updated from 2.7 to 2.10.5, and that is when we started to see this issue. It is unlikely that there are frequent chown/chgrp operations. These are our larger file systems and there are many applications running at the same time, so it is difficult to isolate the issue to a single type of application. |
| Comment by Nathan Dauchy (Inactive) [ 09/Nov/18 ] |
|
Hit this problem at NOAA too, within one day of upgrading servers from 2.7 to 2.10.5 and clients from 2.10.4 to 2.10.5. See DDN Case #112165 for details. At NOAA, chgrp is commonly used (to keep ownership aligned with the top-level "project" directory, which we soon hope to replace with project quotas).
|
| Comment by Marion Hakanson (Inactive) [ 09/Nov/18 ] |
|
We use chgrp for the same reasons as the NOAA poster above. While we wait for the fix to be released, can we avoid this lockup by disabling quota enforcement, or removing all our group quotas? Or must we disable the chown/chgrp commands cluster-wide?
|
| Comment by Cliff White (Inactive) [ 09/Nov/18 ] |
|
Adding MDS call traces from DDN 112165
Nov 8 17:18:50 lfs-mds-1-1 kernel: Lustre: lfs1-MDT0000: Connection restored to 4de02ddb-11dd-0c3b-e5a4-6b3f42554a8e (at 10.179.38.35@o2ib)
Nov 8 18:58:23 lfs-mds-1-1 kernel: Lustre: lfs1-MDT0000: Connection restored to a81b824f-2064-b0a5-e927-1bb9bd32bb32 (at 10.179.126.13@o2ib)
Nov 8 19:11:19 lfs-mds-1-1 kernel: INFO: task jbd2/dm-21-8:41769 blocked for more than 120 seconds.
Nov 8 19:11:19 lfs-mds-1-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 19:11:19 lfs-mds-1-1 kernel: jbd2/dm-21-8 D ffff922a486a9fa0 0 41769 2 0x00000000
Nov 8 19:11:19 lfs-mds-1-1 kernel: Call Trace:
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffaeb14029>] schedule+0x29/0x70
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc04a82ac>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4d52ce>] ? account_entity_dequeue+0xae/0xd0
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4d8e4c>] ? dequeue_entity+0x11c/0x5e0
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4d2765>] ? sched_clock_cpu+0x85/0xc0
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae42959e>] ? __switch_to+0xce/0x580
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bc610>] ? wake_up_atomic_t+0x30/0x30
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffaeb139fc>] ? __schedule+0x41c/0xa20
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4a3eae>] ? try_to_del_timer_sync+0x5e/0x90
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc04aeab9>] kjournald2+0xc9/0x260 [jbd2]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bc610>] ? wake_up_atomic_t+0x30/0x30
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc04ae9f0>] ? commit_timeout+0x10/0x10 [jbd2]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bb621>] kthread+0xd1/0xe0
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bb550>] ? insert_kthread_work+0x40/0x40
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffaeb205f7>] ret_from_fork_nospec_begin+0x21/0x21
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bb550>] ? insert_kthread_work+0x40/0x40
Nov 8 19:11:19 lfs-mds-1-1 kernel: INFO: task mdt00_002:41819 blocked for more than 120 seconds.
Nov 8 19:11:19 lfs-mds-1-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 19:11:19 lfs-mds-1-1 kernel: mdt00_002 D ffff922a48a9bf40 0 41819 2 0x00000000
Nov 8 19:11:19 lfs-mds-1-1 kernel: Call Trace:
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f05240>] ? ldlm_pool_add+0x80/0xf0 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffaeb14029>] schedule+0x29/0x70
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc04a5085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bc610>] ? wake_up_atomic_t+0x30/0x30
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc04a5368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc04a55e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc136a06a>] ? osd_declare_write+0x1ea/0x450 [osd_ldiskfs]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae5f7bc2>] ? kmem_cache_alloc+0x1c2/0x1f0
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc04a5a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc134bcde>] ? osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc1225179>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc134bcde>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc159b5d7>] mdt_empty_transno+0xf7/0x840 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc159e70e>] mdt_mfd_open+0x8de/0xe70 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc1576042>] ? mdt_pack_acl2body+0x1b2/0x800 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc159f21b>] mdt_finish_open+0x57b/0x690 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc15a0af8>] mdt_reint_open+0x17c8/0x3190 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f1fef7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0d5ca5e>] ? lu_ucred+0x1e/0x30 [obdclass]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc1585ad5>] ? mdt_ucred+0x15/0x20 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc1586401>] ? mdt_root_squash+0x21/0x430 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc1596213>] mdt_reint_rec+0x83/0x210 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc157736b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc1577892>] mdt_intent_reint+0x162/0x480 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc1582721>] mdt_intent_policy+0x441/0xc70 [mdt]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0ed71db>] ? ldlm_resource_get+0xab/0xa60 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0ed02ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0ef9cf3>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f21fa0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f824e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f8641a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f2bfeb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f28618>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4cf682>] ? default_wake_function+0x12/0x20
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4c52ab>] ? __wake_up_common+0x5b/0x90
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f2f732>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffc0f2eca0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bb621>] kthread+0xd1/0xe0
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bb550>] ? insert_kthread_work+0x40/0x40
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffaeb205f7>] ret_from_fork_nospec_begin+0x21/0x21
Nov 8 19:11:19 lfs-mds-1-1 kernel: [<ffffffffae4bb550>] ? insert_kthread_work+0x40/0x40
Nov 8 19:11:19 lfs-mds-1-1 kernel: INFO: task mdt02_000:41823 blocked for more than 120 seconds. |
| Comment by Peter Jones [ 09/Dec/18 ] |
|
The revert of |
| Comment by Alex Zhuravlev [ 10/Dec/18 ] |
|
cwhite_ddn please add all remaining stack traces if possible |
| Comment by Peter Jones [ 17/Dec/18 ] |
|
Let's not conflate similar (or even identical) issues from different sites with this support request from NASA. manishpatel, has this issue been seen with 2.10.6? |
| Comment by Manish (Inactive) [ 17/Dec/18 ] |
|
Hi Peter,
We are still on 2.10.5 and are working on a few other issues to address along with this one, so we may not move to 2.10.6 and may instead end up on a newer released version, depending on the downtime slots we get. Also, we have got the new build ready without that patch. Thank You, Manish |
| Comment by Andreas Dilger [ 27/Aug/19 ] |
|
The patch https://review.whamcloud.com/31293 |