[LU-11613] MDS and OSS locked up wait_transaction_locked+0x85/0xd0 [jbd2] Created: 04/Nov/18  Updated: 27/Aug/19  Resolved: 27/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Manish (Inactive) Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS Linux release 7.5.1804 (Core)
Kernel Version 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105
e2fsprogs-libs-1.44.3.wc1-0.el7.x86_64
e2fsprogs-1.44.3.wc1-0.el7.x86_64
e2fsprogs-static-1.44.3.wc1-0.el7.x86_64
e2fsprogs-devel-1.44.3.wc1-0.el7.x86_64
e2fsprogs-debuginfo-1.44.3.wc1-0.el7.x86_64


Issue Links:
Duplicate
duplicates LU-10048 osd-ldiskfs to truncate outside of ma... Resolved
Related
is related to LU-11465 OSS/MDS deadlock in 2.10.5 Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Hi 

We are experiencing a deadlock on the Lustre MDS and OSS servers with high CPU usage, and once the server is rebooted the file system goes back to normal. Judging by the stack traces, this looks similar to LU-11465.

Here are the stack traces from the MDS server.

 

Nov  3 19:02:14 mds rsyslogd: -- MARK --
Nov  3 19:07:07 mds kernel: [468204.667688] INFO: task jbd2/dm-0:5862 blocked for more than 120 seconds.
Nov  3 19:07:07 mds kernel: [468204.687998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  3 19:07:07 mds kernel: [468204.711686] jbd2/dm-0       D ffff883fcf614f10     0  5862      2 0x00000080
Nov  3 19:07:07 mds kernel: [468204.711693] Call Trace:
Nov  3 19:07:07 mds kernel: [468204.719277]  [<ffffffff81694cd9>] schedule+0x29/0x70
Nov  3 19:07:07 mds kernel: [468204.719303]  [<ffffffffa01822b6>] jbd2_journal_commit_transaction+0x246/0x19a0 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.719309]  [<ffffffff810cac9e>] ? account_entity_dequeue+0xae/0xd0
Nov  3 19:07:07 mds kernel: [468204.719312]  [<ffffffff810ce77c>] ? dequeue_entity+0x11c/0x5e0
Nov  3 19:07:07 mds kernel: [468204.719315]  [<ffffffff810c8185>] ? sched_clock_cpu+0x85/0xc0
Nov  3 19:07:07 mds kernel: [468204.719318]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x4b0
Nov  3 19:07:07 mds kernel: [468204.719322]  [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30
Nov  3 19:07:07 mds kernel: [468204.719324]  [<ffffffff816946f7>] ? __schedule+0x477/0xa30
Nov  3 19:07:07 mds kernel: [468204.719328]  [<ffffffff810998ee>] ? try_to_del_timer_sync+0x5e/0x90
Nov  3 19:07:07 mds kernel: [468204.719334]  [<ffffffffa0188ab9>] kjournald2+0xc9/0x260 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.719336]  [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30
Nov  3 19:07:07 mds kernel: [468204.719341]  [<ffffffffa01889f0>] ? commit_timeout+0x10/0x10 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.719343]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov  3 19:07:07 mds kernel: [468204.719345]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov  3 19:07:07 mds kernel: [468204.719347]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov  3 19:07:07 mds kernel: [468204.719349]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov  3 19:07:07 mds kernel: [468204.719351] INFO: task mdt00_002:5872 blocked for more than 120 seconds.
Nov  3 19:07:07 mds kernel: [468204.719352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  3 19:07:07 mds kernel: [468204.719354] mdt00_002       D ffff881fc0dd4f10     0  5872      2 0x00000080
Nov  3 19:07:07 mds kernel: [468204.719354] Call Trace:
Nov  3 19:07:07 mds kernel: [468204.719356]  [<ffffffff81694cd9>] schedule+0x29/0x70
Nov  3 19:07:07 mds kernel: [468204.719361]  [<ffffffffa017f085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.719362]  [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30
Nov  3 19:07:07 mds kernel: [468204.719367]  [<ffffffffa017f368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.719371]  [<ffffffffa017f5e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.719385]  [<ffffffffa1087fba>] ? osd_declare_write+0x1ea/0x450 [osd_ldiskfs]
Nov  3 19:07:07 mds kernel: [468204.719390]  [<ffffffff811df622>] ? kmem_cache_alloc+0x1c2/0x1f0
Nov  3 19:07:07 mds kernel: [468204.719394]  [<ffffffffa017fa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.719401]  [<ffffffffa1069c8e>] ? osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov  3 19:07:07 mds kernel: [468204.719410]  [<ffffffffa0fcb909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov  3 19:07:07 mds kernel: [468204.719415]  [<ffffffffa1069c8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov  3 19:07:07 mds kernel: [468204.719430]  [<ffffffffa1232da7>] mdt_empty_transno+0xf7/0x840 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719439]  [<ffffffffa1235ede>] mdt_mfd_open+0x8de/0xe70 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719446]  [<ffffffffa120e012>] ? mdt_pack_acl2body+0x1b2/0x800 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719454]  [<ffffffffa12369eb>] mdt_finish_open+0x57b/0x690 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719462]  [<ffffffffa12382c8>] mdt_reint_open+0x17c8/0x3190 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719495]  [<ffffffffa09b2311>] ? upcall_cache_get_entry+0x211/0x8f0 [obdclass]
Nov  3 19:07:07 mds kernel: [468204.719513]  [<ffffffffa09b726e>] ? lu_ucred+0x1e/0x30 [obdclass]
Nov  3 19:07:07 mds kernel: [468204.719523]  [<ffffffffa121d9e5>] ? mdt_ucred+0x15/0x20 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719532]  [<ffffffffa121e2b1>] ? mdt_root_squash+0x21/0x430 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719542]  [<ffffffffa122daf3>] mdt_reint_rec+0x83/0x210 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719551]  [<ffffffffa120f33b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719560]  [<ffffffffa120f862>] mdt_intent_reint+0x162/0x430 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719569]  [<ffffffffa121a631>] mdt_intent_policy+0x441/0xc70 [mdt]
Nov  3 19:07:07 mds kernel: [468204.719614]  [<ffffffffa0b801db>] ? ldlm_resource_get+0xab/0xa60 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719640]  [<ffffffffa0b792ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719678]  [<ffffffffa0ba2b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719715]  [<ffffffffa0bcada0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719761]  [<ffffffffa0c28262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719800]  [<ffffffffa0c2beca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719836]  [<ffffffffa0bd44bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719872]  [<ffffffffa0bd29b8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719877]  [<ffffffff810c4d40>] ? wake_up_state+0x20/0x20
Nov  3 19:07:07 mds kernel: [468204.719911]  [<ffffffffa0bd84a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719946]  [<ffffffffa0bd7a10>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.719952]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov  3 19:07:07 mds kernel: [468204.719955]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov  3 19:07:07 mds kernel: [468204.719958]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov  3 19:07:07 mds kernel: [468204.719961]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov  3 19:07:07 mds kernel: [468204.719963] INFO: task mdt01_001:5874 blocked for more than 120 seconds.
Nov  3 19:07:07 mds kernel: [468204.719965] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  3 19:07:07 mds kernel: [468204.719971] mdt01_001       D ffff881fc0dd0fd0     0  5874      2 0x00000080
Nov  3 19:07:07 mds kernel: [468204.719974] Call Trace:
Nov  3 19:07:07 mds kernel: [468204.720005]  [<ffffffffa0bae0a0>] ? ldlm_pool_add+0x80/0xf0 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.720010]  [<ffffffff81694cd9>] schedule+0x29/0x70
Nov  3 19:07:07 mds kernel: [468204.720018]  [<ffffffffa017f085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.720022]  [<ffffffff810b20a0>] ? wake_up_atomic_t+0x30/0x30
Nov  3 19:07:07 mds kernel: [468204.720028]  [<ffffffffa017f368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.720035]  [<ffffffffa017f5e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.720048]  [<ffffffffa1087fba>] ? osd_declare_write+0x1ea/0x450 [osd_ldiskfs]
Nov  3 19:07:07 mds kernel: [468204.720053]  [<ffffffff811df622>] ? kmem_cache_alloc+0x1c2/0x1f0
Nov  3 19:07:07 mds kernel: [468204.720059]  [<ffffffffa017fa93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov  3 19:07:07 mds kernel: [468204.720068]  [<ffffffffa1069c8e>] ? osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov  3 19:07:07 mds kernel: [468204.720078]  [<ffffffffa0fcb909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov  3 19:07:07 mds kernel: [468204.720088]  [<ffffffffa1069c8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov  3 19:07:07 mds kernel: [468204.720100]  [<ffffffffa1232da7>] mdt_empty_transno+0xf7/0x840 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720112]  [<ffffffffa1235ede>] mdt_mfd_open+0x8de/0xe70 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720124]  [<ffffffffa120e012>] ? mdt_pack_acl2body+0x1b2/0x800 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720136]  [<ffffffffa12369eb>] mdt_finish_open+0x57b/0x690 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720149]  [<ffffffffa12382c8>] mdt_reint_open+0x17c8/0x3190 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720170]  [<ffffffffa09b2311>] ? upcall_cache_get_entry+0x211/0x8f0 [obdclass]
Nov  3 19:07:07 mds kernel: [468204.720190]  [<ffffffffa09b726e>] ? lu_ucred+0x1e/0x30 [obdclass]
Nov  3 19:07:07 mds kernel: [468204.720202]  [<ffffffffa121d9e5>] ? mdt_ucred+0x15/0x20 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720214]  [<ffffffffa121e2b1>] ? mdt_root_squash+0x21/0x430 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720227]  [<ffffffffa122daf3>] mdt_reint_rec+0x83/0x210 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720239]  [<ffffffffa120f33b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720249]  [<ffffffffa120f862>] mdt_intent_reint+0x162/0x430 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720262]  [<ffffffffa121a631>] mdt_intent_policy+0x441/0xc70 [mdt]
Nov  3 19:07:07 mds kernel: [468204.720291]  [<ffffffffa0b801db>] ? ldlm_resource_get+0xab/0xa60 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.720319]  [<ffffffffa0b792ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.720350]  [<ffffffffa0ba2b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.720382]  [<ffffffffa0bcada0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.720418]  [<ffffffffa0c28262>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.720456]  [<ffffffffa0c2beca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov  3 19:07:07 mds kernel: [468204.720491]  [<ffffffffa0bd44bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov  3 19:07:14 mds kernel: [468204.720525]  [<ffffffffa0bd29b8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Nov  3 19:07:14 mds kernel: [468204.720529]  [<ffffffff810c4d52>] ? default_wake_function+0x12/0x20
Nov  3 19:07:14 mds kernel: [468204.720531]  [<ffffffff810bad2b>] ? __wake_up_common+0x5b/0x90
Nov  3 19:07:14 mds kernel: [468204.720564]  [<ffffffffa0bd84a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov  3 19:07:14 mds kernel: [468204.720597]  [<ffffffffa0bd7a10>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Nov  3 19:07:14 mds kernel: [468204.720600]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov  3 19:07:14 mds kernel: [468204.720602]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov  3 19:07:14 mds kernel: [468204.720603]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov  3 19:07:14 mds kernel: [468204.720605]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
Nov  3 19:07:14 mds kernel: [468204.720609] INFO: task osp-pre-141-0:5909 blocked for more than 120 seconds.
Nov  3 19:07:14 mds kernel: [468204.720610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  3 19:07:14 mds kernel: [468204.720612] osp-pre-141-0   D ffff881fd0621fa0     0  5909      2 0x00000080

 

 

 

Here are the stack traces from the OSS server.

 

Nov  3 19:08:40 oss kernel: [469273.871420] Pid: 85809, comm: ll_ost_io01_185 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov  3 19:08:40 oss kernel: [469273.871421] Call Trace:
Nov  3 19:08:40 oss kernel: [469273.871450]  [<ffffffffa0179085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov  3 19:08:40 oss kernel: [469273.892337]  [<ffffffffa0179368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov  3 19:08:40 oss kernel: [469273.892340]  [<ffffffffa01795e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov  3 19:08:40 oss kernel: [469273.892344]  [<ffffffffa0179a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov  3 19:08:40 oss kernel: [469273.892358]  [<ffffffffa14a7909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov  3 19:08:40 oss kernel: [469273.892370]  [<ffffffffa108dc8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov  3 19:08:40 oss kernel: [469273.892378]  [<ffffffffa111127e>] ofd_trans_start+0x6e/0xf0 [ofd]
Nov  3 19:08:40 oss kernel: [469273.892383]  [<ffffffffa111772b>] ofd_commitrw_write+0x94b/0x1c90 [ofd]
Nov  3 19:08:40 oss kernel: [469273.892394]  [<ffffffffa111b779>] ofd_commitrw+0x4c9/0xae0 [ofd]
Nov  3 19:08:40 oss kernel: [469273.892456]  [<ffffffffa0c52510>] obd_commitrw+0x2f3/0x336 [ptlrpc]
Nov  3 19:08:40 oss kernel: [469273.892504]  [<ffffffffa0c25145>] tgt_brw_write+0xfb5/0x1780 [ptlrpc]
Nov  3 19:08:40 oss kernel: [469273.892539]  [<ffffffffa0c20eca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov  3 19:08:40 oss kernel: [469273.892570]  [<ffffffffa0bc94bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov  3 19:08:40 oss kernel: [469273.892600]  [<ffffffffa0bcd4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov  3 19:08:40 oss kernel: [469273.892609]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov  3 19:08:40 oss kernel: [469273.892616]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov  3 19:08:40 oss kernel: [469273.892624]  [<ffffffffffffffff>] 0xffffffffffffffff
Nov  3 19:08:40 oss kernel: [469273.892627] LustreError: dumping log to /tmp/lustre-log.1541297320.85809
Nov  3 19:08:53 oss kernel: [469287.156081] LNet: Service thread pid 85694 was inactive for 301.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov  3 19:08:53 oss kernel: [469287.207070] Pid: 85694, comm: ll_ost_io00_185 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
Nov  3 19:08:53 oss kernel: [469287.207071] Call Trace:
Nov  3 19:08:53 oss kernel: [469287.207100]  [<ffffffffa0179085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov  3 19:08:53 oss kernel: [469287.227995]  [<ffffffffa0179368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov  3 19:08:53 oss kernel: [469287.227999]  [<ffffffffa01795e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov  3 19:08:53 oss kernel: [469287.228002]  [<ffffffffa0179a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov  3 19:08:53 oss kernel: [469287.228010]  [<ffffffffa14a7909>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov  3 19:08:53 oss kernel: [469287.228024]  [<ffffffffa108dc8e>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov  3 19:08:53 oss kernel: [469287.228036]  [<ffffffffa1117ce3>] ofd_commitrw_write+0xf03/0x1c90 [ofd]
Nov  3 19:08:53 oss kernel: [469287.228042]  [<ffffffffa111b779>] ofd_commitrw+0x4c9/0xae0 [ofd]
Nov  3 19:08:53 oss kernel: [469287.228084]  [<ffffffffa0c52510>] obd_commitrw+0x2f3/0x336 [ptlrpc]
Nov  3 19:08:53 oss kernel: [469287.228125]  [<ffffffffa0c25145>] tgt_brw_write+0xfb5/0x1780 [ptlrpc]
Nov  3 19:08:53 oss kernel: [469287.228160]  [<ffffffffa0c20eca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov  3 19:08:53 oss kernel: [469287.228192]  [<ffffffffa0bc94bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov  3 19:08:53 oss kernel: [469287.228224]  [<ffffffffa0bcd4a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov  3 19:08:53 oss kernel: [469287.228227]  [<ffffffff810b1131>] kthread+0xd1/0xe0
Nov  3 19:08:53 oss kernel: [469287.228230]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
Nov  3 19:08:53 oss kernel: [469287.228235]  [<ffffffffffffffff>] 0xffffffffffffffff

 

 

If you need the complete crash dump, let me know and I will upload it.

Thank You,

                  Manish

 



 Comments   
Comment by Peter Jones [ 04/Nov/18 ]

Manish

It sounds like this is a bug that is disruptive to production rather than one that takes the whole system completely out of service, so I have adjusted the severity accordingly.

Alex

Can you please investigate?

Thanks

Peter

Comment by Alex Zhuravlev [ 04/Nov/18 ]

can you please attach backtraces for all the processes?

 

Comment by Manish (Inactive) [ 04/Nov/18 ]

Hi Alex,

I have uploaded the stack trace to the FTP site; the file name is "foreach_bt.txt" under the ticket directory.

Let me know if you are able to view that file.

Thank You,

                  Manish

Comment by Manish (Inactive) [ 05/Nov/18 ]

Hi,

We hit this issue again on one of our other file systems, and the stack traces look similar. I have already uploaded the new stack traces to the FTP site; the file names are:

foreach_bt_mds_11-05-2018.txt

foreach_bt_oss_11-05-2018.txt

 

Can we please raise the priority? This is now an ongoing issue on most of our Lustre file systems that have been upgraded to 2.10.5.

Thank You,

                  Manish

Comment by Alex Zhuravlev [ 05/Nov/18 ]

thanks, it will take some time to analyse the traces.

 

Comment by Alex Zhuravlev [ 06/Nov/18 ]

Manish, please tell us how many MDSs are in use and whether LFSCK was running at that time.

 

Comment by Mahmoud Hanafi [ 06/Nov/18 ]

There is only one MDS, and LFSCK was not running.

Comment by Marion Hakanson (Inactive) [ 07/Nov/18 ]

FYI, our lockups associated with LU-11465 show a high load average on the affected server (the MDS in our case), but the CPU utilization is not high. In fact the MDS is quite responsive and the CPUs are mostly idle; the high load average comes from all of the stuck "D-state" processes, which count toward the load average but are deadlocked.

Another difference from our LU-11465 situation is that rebooting the stuck server does not resolve the problem for us.  We must force-unmount or reboot all the clients as well.

 

Comment by Alex Zhuravlev [ 07/Nov/18 ]

it would be very helpful if you could provide Lustre logs and/or dmesg output. I think I understand the cause (and have patches in testing), but I still don't understand how we got into that situation.

 

Comment by Alex Zhuravlev [ 07/Nov/18 ]

do you use (or used) project quota?

 

Comment by Alex Zhuravlev [ 07/Nov/18 ]

AFAICS, some client sends a chown/chgrp to the quota-enabled MDS, which in turn needs to consult the OST about whether the file can be moved to another group. That file is being actively written, so the OST hits a deadlock (LU-10048). Until that client is stopped, it keeps resending the RPC after each reboot.

though I can't really explain why the OST hits it again and again; that's supposed to be a race-like condition.
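
To make the suspected ordering inversion easier to follow, below is a minimal, hypothetical two-mutex sketch in plain C. It is not Lustre code: "object_lock" stands in for the local lock held on the actively written object, and "journal_lock" stands in for opening a jbd2 handle while the running transaction is locked for commit. Two threads taking the locks in opposite orders are enough to reproduce the kind of pile-up seen in the stack traces above, where new handles end up stuck in wait_transaction_locked.

/*
 * Minimal, hypothetical illustration only -- not Lustre code.
 * "object_lock" stands in for the local lock on the actively written
 * object; "journal_lock" stands in for starting a jbd2 handle while the
 * transaction is locked for commit.  The two threads take the locks in
 * opposite orders (the classic inversion); timed locks are used only so
 * the demo reports the deadlock and exits instead of hanging.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t object_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t journal_lock = PTHREAD_MUTEX_INITIALIZER;

static int lock_or_report(pthread_mutex_t *m, const char *who, const char *name)
{
        struct timespec ts;

        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 2;                         /* give up after 2 seconds */
        if (pthread_mutex_timedlock(m, &ts) == ETIMEDOUT) {
                printf("%s: timed out waiting for %s (deadlocked)\n", who, name);
                return -1;
        }
        return 0;
}

static void *io_thread(void *arg)               /* ~ the bulk write path */
{
        (void)arg;
        pthread_mutex_lock(&object_lock);       /* local lock first ...      */
        usleep(100 * 1000);
        if (lock_or_report(&journal_lock, "io_thread", "journal_lock") == 0)
                pthread_mutex_unlock(&journal_lock);  /* ... then the journal */
        pthread_mutex_unlock(&object_lock);
        return NULL;
}

static void *setattr_thread(void *arg)          /* ~ the chown/chgrp path */
{
        (void)arg;
        pthread_mutex_lock(&journal_lock);      /* journal handle first ...  */
        usleep(100 * 1000);
        if (lock_or_report(&object_lock, "setattr_thread", "object_lock") == 0)
                pthread_mutex_unlock(&object_lock);   /* ... then local lock */
        pthread_mutex_unlock(&journal_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, io_thread, NULL);
        pthread_create(&b, NULL, setattr_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

Compile with "cc -pthread deadlock_demo.c" (the file name is arbitrary); both threads print a timeout message instead of hanging forever, which is the only difference from the real deadlock, where the jbd2 commit thread and every new handle simply wait indefinitely.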

Comment by Manish (Inactive) [ 07/Nov/18 ]

Hi Alex,

We have never used project quota. I will share the dmesg logs soon with Oleg and you can get more details from him.

Thank You,

                   Manish

Comment by Manish (Inactive) [ 07/Nov/18 ]

Hi Alex,

Sorry for the confusion about what logs I can share; I am still getting familiar with the security limitations on sharing data. I have uploaded the dmesg logs to the FTP site under the ticket directory; the files are "mds-vmcore-dmesg.txt" and "oss-vmcore-dmesg.txt".

Thank You,

                  Manish

 

Comment by Alex Zhuravlev [ 07/Nov/18 ]
$ pwd
/scratch/ftp/uploads/LU-11613
[alexeyzh@ssh-2 LU-11613]$ ls -l
total 4448
-rw-r--r-- 1 nobody ftp  115858 Nov  4 16:42 foreach_backtrace.zip
-rw-r--r-- 1 nobody ftp 1724395 Nov  5 19:23 foreach_bt_mds_11-05-2018.txt
-rw-r--r-- 1 nobody ftp 1315475 Nov  5 19:23 foreach_bt_oss_11-05-2018.txt
-rw-r--r-- 1 nobody ftp 1389778 Nov  4 19:16 foreach_bt.txt

probably into a different dir?

Comment by Manish (Inactive) [ 07/Nov/18 ]

Hi Alex,

No, it's in the same dir.

 

ftp.whamcloud.com:/uploads/LU-11613> ls
-rw-r--r--    1 99       50         115858 Nov 04 16:42 foreach_backtrace.zip
-rw-r--r--    1 99       50        1389778 Nov 04 19:16 foreach_bt.txt
-rw-r--r--    1 99       50        1724395 Nov 05 19:23 foreach_bt_mds_11-05-2018.txt
-rw-r--r--    1 99       50        1315475 Nov 05 19:23 foreach_bt_oss_11-05-2018.txt
-rw-r--r--    1 99       50        1042487 Nov 07 16:25 mds-vmcore-dmesg.txt
-rw-r--r--    1 99       50        1042487 Nov 07 16:26 oss-vmcore-dmesg.txt

 

Please refresh the directory at your end.

Thank You,

                   Manish

Comment by Alex Zhuravlev [ 08/Nov/18 ]

did you upgrade Lustre recently? if yes, what was the previous version?
any idea about the application(s) the cluster is running? do you think chown/chgrp is a frequent operation?

I'm able to reproduce the issue and the fix seems to work, but we'd like to understand why you started to experience the issue.

Comment by Mahmoud Hanafi [ 08/Nov/18 ]

Yes, these file systems were updated from 2.7 to 2.10.5, and that is when we started to see this issue.

It is unlikely that there are frequent chown/chgrp operations.

These are our larger file systems and many applications run on them at the same time, so it is difficult to isolate the issue to a single type of app.

Comment by Nathan Dauchy (Inactive) [ 09/Nov/18 ]

Hit this problem at NOAA too, within 1 day of upgrading servers from 2.7 to 2.10.5 and clients from 2.10.4 to 2.10.5.  See DDN Case #112165 for details.

At NOAA, chgrp is commonly used (to keep ownership aligned with top-level "project" directory, which we soon hope to replace with project quotas).

 

Comment by Marion Hakanson (Inactive) [ 09/Nov/18 ]

We use chgrp for the same reasons as the NOAA poster above.

While we wait for the fix to be released, can we avoid this lockup by disabling quota enforcement, or removing all our group quotas?

Or must we disable the chown/chgrp commands cluster-wide?

 

Comment by Cliff White (Inactive) [ 09/Nov/18 ]

Adding MDS call traces from DDN 112165

Nov  8 17:18:50 lfs-mds-1-1 kernel: Lustre: lfs1-MDT0000: Connection restored to 4de02ddb-11dd-0c3b-e5a4-6b3f42554a8e (at 10.179.38.35@o2ib)
Nov  8 18:58:23 lfs-mds-1-1 kernel: Lustre: lfs1-MDT0000: Connection restored to a81b824f-2064-b0a5-e927-1bb9bd32bb32 (at 10.179.126.13@o2ib)
Nov  8 19:11:19 lfs-mds-1-1 kernel: INFO: task jbd2/dm-21-8:41769 blocked for more than 120 seconds.
Nov  8 19:11:19 lfs-mds-1-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 19:11:19 lfs-mds-1-1 kernel: jbd2/dm-21-8    D ffff922a486a9fa0     0 41769      2 0x00000000
Nov  8 19:11:19 lfs-mds-1-1 kernel: Call Trace:
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffaeb14029>] schedule+0x29/0x70
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc04a82ac>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4d52ce>] ? account_entity_dequeue+0xae/0xd0
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4d8e4c>] ? dequeue_entity+0x11c/0x5e0
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4d2765>] ? sched_clock_cpu+0x85/0xc0
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae42959e>] ? __switch_to+0xce/0x580
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bc610>] ? wake_up_atomic_t+0x30/0x30
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffaeb139fc>] ? __schedule+0x41c/0xa20
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4a3eae>] ? try_to_del_timer_sync+0x5e/0x90
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc04aeab9>] kjournald2+0xc9/0x260 [jbd2]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bc610>] ? wake_up_atomic_t+0x30/0x30
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc04ae9f0>] ? commit_timeout+0x10/0x10 [jbd2]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bb621>] kthread+0xd1/0xe0
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bb550>] ? insert_kthread_work+0x40/0x40
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffaeb205f7>] ret_from_fork_nospec_begin+0x21/0x21
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bb550>] ? insert_kthread_work+0x40/0x40
Nov  8 19:11:19 lfs-mds-1-1 kernel: INFO: task mdt00_002:41819 blocked for more than 120 seconds.
Nov  8 19:11:19 lfs-mds-1-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 19:11:19 lfs-mds-1-1 kernel: mdt00_002       D ffff922a48a9bf40     0 41819      2 0x00000000
Nov  8 19:11:19 lfs-mds-1-1 kernel: Call Trace:
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f05240>] ? ldlm_pool_add+0x80/0xf0 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffaeb14029>] schedule+0x29/0x70
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc04a5085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bc610>] ? wake_up_atomic_t+0x30/0x30
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc04a5368>] add_transaction_credits+0x268/0x2f0 [jbd2]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc04a55e1>] start_this_handle+0x1a1/0x430 [jbd2]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc136a06a>] ? osd_declare_write+0x1ea/0x450 [osd_ldiskfs]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae5f7bc2>] ? kmem_cache_alloc+0x1c2/0x1f0
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc04a5a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc134bcde>] ? osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc1225179>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc134bcde>] osd_trans_start+0x1ae/0x460 [osd_ldiskfs]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc159b5d7>] mdt_empty_transno+0xf7/0x840 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc159e70e>] mdt_mfd_open+0x8de/0xe70 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc1576042>] ? mdt_pack_acl2body+0x1b2/0x800 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc159f21b>] mdt_finish_open+0x57b/0x690 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc15a0af8>] mdt_reint_open+0x17c8/0x3190 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f1fef7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0d5ca5e>] ? lu_ucred+0x1e/0x30 [obdclass]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc1585ad5>] ? mdt_ucred+0x15/0x20 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc1586401>] ? mdt_root_squash+0x21/0x430 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc1596213>] mdt_reint_rec+0x83/0x210 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc157736b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc1577892>] mdt_intent_reint+0x162/0x480 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc1582721>] mdt_intent_policy+0x441/0xc70 [mdt]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0ed71db>] ? ldlm_resource_get+0xab/0xa60 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0ed02ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0ef9cf3>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f21fa0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f824e2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f8641a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f2bfeb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f28618>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4cf682>] ? default_wake_function+0x12/0x20
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4c52ab>] ? __wake_up_common+0x5b/0x90
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f2f732>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffc0f2eca0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bb621>] kthread+0xd1/0xe0
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bb550>] ? insert_kthread_work+0x40/0x40
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffaeb205f7>] ret_from_fork_nospec_begin+0x21/0x21
Nov  8 19:11:19 lfs-mds-1-1 kernel:  [<ffffffffae4bb550>] ? insert_kthread_work+0x40/0x40
Nov  8 19:11:19 lfs-mds-1-1 kernel: INFO: task mdt02_000:41823 blocked for more than 120 seconds.

Comment by Peter Jones [ 09/Dec/18 ]

The revert of LU-5152 in 2.10.6 should remove the possibility of hitting this issue.

Comment by Alex Zhuravlev [ 10/Dec/18 ]

cwhite_ddn, please add all remaining stack traces if possible.

Comment by Peter Jones [ 17/Dec/18 ]

Let's not conflate similar (or even identical) issues from different sites with this support request from NASA. manishpatel, has this issue been seen with 2.10.6?

Comment by Manish (Inactive) [ 17/Dec/18 ]

Hi Peter,

 

We are still on version 2.10.5 and are working on a few other issues to address along with this one, so we may not move to 2.10.6 and may instead end up on a newer released version, depending on the downtime slots we get. We also have a new build ready without the LU-5152 patch, but it all depends on when we get a chance to deploy it; if that happens, we will update and let you know whether we still experience this issue.

Thank You,

                   Manish

Comment by Andreas Dilger [ 27/Aug/19 ]

The patch https://review.whamcloud.com/31293 "LU-10048 ofd: take local locks within transaction" landed.
