[LU-952] Hung thread with HIGH OSS load Created: 26/Dec/11  Updated: 09/May/12  Resolved: 26/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 2.2.0, Lustre 2.1.2, Lustre 1.8.8

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre1.8.5 with ofed1.5.3 with kernel 2.6.18-238.12.1.el5. AT NASA AMES


Attachments: File messages.service113.gz     File service102.messages     File service103.messages    
Severity: 4
Epic: server
Rank (Obsolete): 4769

 Description   

We started getting the following error with the oss getting to a high load and filesystem becomeing unsable.

Dec 26 11:52:16 service102 kernel: Lustre: Service thread pid 10832 was inactive for 506.00s. The thread might be hung, or it
might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 26 11:52:16 service102 kernel: Lustre: Skipped 1 previous similar message
Dec 26 11:52:16 service102 kernel: Pid: 10832, comm: ll_ost_323
Dec 26 11:52:16 service102 kernel:
Dec 26 11:52:16 service102 kernel: Call Trace:
Dec 26 11:52:16 service102 kernel: [<ffffffff887d0d1d>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
Dec 26 11:52:16 service102 kernel: [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
Dec 26 11:52:16 service102 kernel: [<ffffffff800a2f36>] autoremove_wake_function+0x0/0x2e
Dec 26 11:52:18 service102 kernel: [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]
Dec 26 11:52:18 service102 kernel: [<ffffffff88ba4c25>] ldiskfs_journal_start_sb+0x55/0xa0 [ldiskfs]
Dec 26 11:52:18 service102 kernel: [<ffffffff88c16a72>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
Dec 26 11:52:18 service102 kernel: [<ffffffff8002cc0e>] mntput_no_expire+0x19/0x88
Dec 26 11:52:18 service102 kernel: [<ffffffff887fca00>] push_ctxt+0x370/0x380 [lvfs]
Dec 26 11:52:18 service102 kernel: [<ffffffff88c31a08>] filter_client_add+0x508/0xc30 [obdfilter]
Dec 26 11:52:18 service102 kernel: [<ffffffff88c30de7>] filter_export_stats_init+0x117/0x650 [obdfilter]
Dec 26 11:52:18 service102 kernel: [<ffffffff88c32665>] filter_connect+0x535/0x8c0 [obdfilter]
Dec 26 11:52:18 service102 kernel: [<ffffffff88936107>] lustre_msg_add_op_flags+0x47/0x120 [ptlrpc]
Dec 26 11:52:18 service102 kernel: [<ffffffff88bf6500>] ost_handle+0x0/0x55b0 [ost]
Dec 26 11:52:18 service102 kernel: [<ffffffff88900976>] target_handle_connect+0x21c6/0x2e80 [ptlrpc]
Dec 26 11:52:19 service102 kernel: [<ffffffff8892ca48>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Dec 26 11:52:19 service102 kernel: [<ffffffff88930f75>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
Dec 26 11:52:19 service102 kernel: [<ffffffff88931038>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Dec 26 11:52:19 service102 kernel: [<ffffffff88bf6daf>] ost_handle+0x8af/0x55b0 [ost]
Dec 26 11:52:19 service102 kernel: [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Dec 26 11:52:19 service102 kernel: [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Dec 26 11:52:19 service102 kernel: [<ffffffff8008ca4e>] __wake_up_common+0x3e/0x68
Dec 26 11:52:19 service102 kernel: [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Dec 26 11:52:20 service102 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 26 11:52:20 service102 kernel: [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Dec 26 11:52:20 service102 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 26 11:52:20 service102 kernel:
Dec 26 11:52:20 service102 kernel: Pid: 11020, comm: ll_ost_511
Dec 26 11:52:20 service102 kernel:
Dec 26 11:52:20 service102 kernel: Call Trace:
Dec 26 11:52:20 service102 kernel: [<ffffffff887d0d1d>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
Dec 26 11:52:20 service102 kernel: [<ffffffff801632b8>] list_add+0xc/0xe
Dec 26 11:52:20 service102 kernel: [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
Dec 26 11:52:20 service102 kernel: [<ffffffff800a2f36>] autoremove_wake_function+0x0/0x2e
Dec 26 11:52:20 service102 kernel: [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]
Dec 26 11:52:20 service102 kernel: [<ffffffff88ba4c25>] ldiskfs_journal_start_sb+0x55/0xa0 [ldiskfs]
Dec 26 11:52:20 service102 kernel: [<ffffffff88c16a72>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
Dec 26 11:52:20 service102 kernel: [<ffffffff8002cc0e>] mntput_no_expire+0x19/0x88
Dec 26 11:52:20 service102 kernel: [<ffffffff887fca00>] push_ctxt+0x370/0x380 [lvfs]
Dec 26 11:52:20 service102 kernel: [<ffffffff88c31a08>] filter_client_add+0x508/0xc30 [obdfilter]
Dec 26 11:52:20 service102 kernel: [<ffffffff88c30de7>] filter_export_stats_init+0x117/0x650 [obdfilter]
Dec 26 11:52:21 service102 kernel: [<ffffffff88c32665>] filter_connect+0x535/0x8c0 [obdfilter]
Dec 26 11:52:21 service102 kernel: [<ffffffff88936107>] lustre_msg_add_op_flags+0x47/0x120 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff88bf6500>] ost_handle+0x0/0x55b0 [ost]
Dec 26 11:52:21 service102 kernel: [<ffffffff88900976>] target_handle_connect+0x21c6/0x2e80 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff8892ca48>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff88930f75>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff88931038>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff88bf6daf>] ost_handle+0x8af/0x55b0 [ost]
Dec 26 11:52:21 service102 kernel: [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff8008ca4e>] __wake_up_common+0x3e/0x68
Dec 26 11:52:21 service102 kernel: [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 26 11:52:21 service102 kernel: [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Dec 26 11:52:21 service102 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 26 11:52:21 service102 kernel:
Dec 26 11:52:21 service102 kernel: Pid: 10674, comm: ll_ost_165



 Comments   
Comment by Jinshan Xiong (Inactive) [ 26/Dec/11 ]

This must be another case of slow IO. We're tracking this issue on LU-874 and LU-410. Disabling read only cache and write through cache will help this issue.

Comment by Mahmoud Hanafi [ 27/Dec/11 ]

Reading LU874 and LU410 not sure if this is similar error condition.

Comment by Peter Jones [ 27/Dec/11 ]

Niu

Could you please check whether this is the same issue as Jinshan suspected and advise about the steps to take to restore the filesystem?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 27/Dec/11 ]

Hi, Mahmoud

I saw lots of:

Dec 26 12:27:36 service102 kernel: LustreError: 11933:0:(obd_config.c:1011:class_process_proc_param()) nbp6-OST0031: unknown param writethrough=0

in the service102 messages, not sure if the write through cache was really disabled. (it should be writethrough_cache_enable=0)

And what's the service113? I saw lots of multipath errors in the service113 messages. Thanks.

Comment by Niu Yawei (Inactive) [ 28/Dec/11 ]

ok, then the "unknown param writethrough=0" error messages might come from older mistaken settings, but you eventually disabled the writethrough cache correctly. I just want make sure that the "disable writethrough cache' was setted permanently by 'lctl conf_param OSTNAME.ost.writethrough_cache_enable=0', so the cache was disabled in recovery stage too.

From the begining of the stack trace, I saw some io threads are wait for journal commit:

Dec 26 11:43:03 service102 kernel: Pid: 9175, comm: ll_ost_io_41
Dec 26 11:43:03 service102 kernel:
Dec 26 11:43:03 service102 kernel: Call Trace:
Dec 26 11:43:03 service102 kernel:  [<ffffffff88b6869a>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
Dec 26 11:43:03 service102 kernel:  [<ffffffff800a2f36>] autoremove_wake_function+0x0/0x2e
Dec 26 11:43:03 service102 kernel:  [<ffffffff88c1394b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
Dec 26 11:43:03 service102 kernel:  [<ffffffff88c54214>] filter_commitrw_write+0x1e04/0x2dc0 [obdfilter]
Dec 26 11:43:03 service102 kernel:  [<ffffffff8015ee31>] __next_cpu+0x19/0x28
Dec 26 11:43:03 service102 kernel:  [<ffffffff88c4b105>] filter_commitrw+0x65/0x2c0 [obdfilter]
Dec 26 11:43:03 service102 kernel:  [<ffffffff88bf5d19>] ost_brw_write+0x1c99/0x2480 [ost]
Dec 26 11:43:04 service102 kernel:  [<ffffffff8892ca48>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Dec 26 11:43:04 service102 kernel:  [<ffffffff88930f75>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
Dec 26 11:43:04 service102 kernel:  [<ffffffff8008eb34>] default_wake_function+0x0/0xf
Dec 26 11:43:04 service102 kernel:  [<ffffffff88bf90ae>] ost_handle+0x2bae/0x55b0 [ost]
Dec 26 11:43:04 service102 kernel:  [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Dec 26 11:43:04 service102 kernel:  [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Dec 26 11:43:04 service102 kernel:  [<ffffffff8008eb34>] default_wake_function+0x0/0xf
Dec 26 11:43:04 service102 kernel:  [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Dec 26 11:43:04 service102 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 26 11:43:05 service102 kernel:  [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Dec 26 11:43:05 service102 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

I don't see why the commit wasn't finished yet.

The multipath errors are when we were debugging the issue and intentionally failed path on the DDN.

how long will the path failover take (switch from the failed path to the standby one)? is it possible that you switch the path too often and that makes the io can't finished for a very long time?

Comment by Mahmoud Hanafi [ 03/Jan/12 ]

The multipath failed paths were after the issue add occurred. We had a second incident today. We have taken system dumps. Where can we upload them.

Comment by Niu Yawei (Inactive) [ 03/Jan/12 ]

you can upload them to ftp.whamcloud.com/uploads/ , thanks.

Comment by Mahmoud Hanafi [ 04/Jan/12 ]

Due to export controls it turns out I can upload the crash files. But we are able to provide you info that you may need from them. If you would to see some thing specific please let me know.

Here is a bt from one of the hung threads
crash> bt
PID: 9290 TASK: ffff810aa7639860 CPU: 5 COMMAND: "ll_ost_115"
#0 [ffff810aa76a77c0] schedule at ffffffff80062f8c
#1 [ffff810aa76a7898] start_this_handle at ffffffff88b63d02
#2 [ffff810aa76a7918] jbd2_journal_start at ffffffff88b63e77
#3 [ffff810aa76a7938] ldiskfs_journal_start_sb at ffffffff88ba4c25
#4 [ffff810aa76a7948] fsfilt_ldiskfs_start at ffffffff88c16a72
#5 [ffff810aa76a79b8] filter_client_add at ffffffff88c31a08
#6 [ffff810aa76a7a98] filter_connect at ffffffff88c32665
#7 [ffff810aa76a7b18] target_handle_connect at ffffffff88900976
#8 [ffff810aa76a7ca8] ost_handle at ffffffff88bf6daf
#9 [ffff810aa76a7e28] ptlrpc_server_handle_request at ffffffff889405e9
#10 [ffff810aa76a7ec8] ptlrpc_main at ffffffff88941cd6
#11 [ffff810aa76a7f48] kernel_thread at ffffffff8005dfb1

And I am uploading the /var/log/messages

Comment by Niu Yawei (Inactive) [ 04/Jan/12 ]

Thank you, Mahmoud. The attached message is quite helpful, I found a deadlock in it:

Jan  3 14:34:03 service103 kernel: Pid: 28262, comm: ll_ost_io_432
Jan  3 14:34:03 service103 kernel:
Jan  3 14:34:03 service103 kernel: Call Trace:
Jan  3 14:34:03 service103 kernel:  [<ffffffff88b63d02>] start_this_handle+0x301/0x3cb [jbd2]
Jan  3 14:34:03 service103 kernel:  [<ffffffff800a2f36>] autoremove_wake_function+0x0/0x2e
Jan  3 14:34:03 service103 kernel:  [<ffffffff88b63e77>] jbd2_journal_start+0xab/0xdf [jbd2]
Jan  3 14:34:03 service103 kernel:  [<ffffffff88ba4c25>] ldiskfs_journal_start_sb+0x55/0xa0 [ldiskfs]
Jan  3 14:34:03 service103 kernel:  [<ffffffff88ba8ba4>] ldiskfs_acquire_dquot+0x64/0xb0 [ldiskfs]
Jan  3 14:34:04 service103 kernel:  [<ffffffff80104a42>] dqget+0x286/0x2b6
Jan  3 14:34:04 service103 kernel:  [<ffffffff80104e40>] dquot_initialize+0x7b/0xac
Jan  3 14:34:04 service103 kernel:  [<ffffffff88c3cb7d>] filter_destroy+0x99d/0x1fb0 [obdfilter]
Jan  3 14:34:04 service103 kernel:  [<ffffffff88908910>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
Jan  3 14:34:04 service103 kernel:  [<ffffffff8890bff0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
Jan  3 14:34:04 service103 kernel:  [<ffffffff88869403>] lh_read_lock+0x13/0x20 [obdclass]
Jan  3 14:34:04 service103 kernel:  [<ffffffff88931ed4>] lustre_msg_add_version+0x34/0x110 [ptlrpc]
Jan  3 14:34:04 service103 kernel:  [<ffffffff88934b9a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Jan  3 14:34:04 service103 kernel:  [<ffffffff887c9309>] cfs_mem_cache_free+0x9/0x10 [libcfs]
Jan  3 14:34:04 service103 kernel:  [<ffffffff888f33eb>] ldlm_resource_putref_internal+0x3ab/0x460 [ptlrpc]
Jan  3 14:34:04 service103 kernel:  [<ffffffff888ee8d2>] ldlm_lock_put+0x372/0x3d0 [ptlrpc]
Jan  3 14:34:04 service103 kernel:  [<ffffffff88bee080>] ost_destroy+0x660/0x790 [ost]
Jan  3 14:34:04 service103 kernel:  [<ffffffff88930e85>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Jan  3 14:34:04 service103 kernel:  [<ffffffff88bf7a56>] ost_handle+0x1556/0x55b0 [ost]
Jan  3 14:34:05 service103 kernel:  [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Jan  3 14:34:05 service103 kernel:  [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Jan  3 14:34:05 service103 kernel:  [<ffffffff8008ca4e>] __wake_up_common+0x3e/0x68
Jan  3 14:34:05 service103 kernel:  [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Jan  3 14:34:05 service103 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jan  3 14:34:05 service103 kernel:  [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Jan  3 14:34:05 service103 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

This thread hold sb->s_dquot->dqptr_sem -> start journal, whereas:

 
Jan  3 14:33:57 service103 kernel: Pid: 28173, comm: ll_ost_io_346
Jan  3 14:33:57 service103 kernel:
Jan  3 14:33:57 service103 kernel: Call Trace:
Jan  3 14:33:57 service103 kernel:  [<ffffffff88b97adf>] ldiskfs_mb_free_blocks+0x64f/0x710 [ldiskfs]
Jan  3 14:33:57 service103 kernel:  [<ffffffff80064691>] __down_read+0x7a/0x92
Jan  3 14:33:57 service103 kernel:  [<ffffffff800a5073>] down_read+0x11/0x13
Jan  3 14:33:57 service103 kernel:  [<ffffffff80105c25>] __dquot_free_space+0x3d/0x139
Jan  3 14:33:57 service103 kernel:  [<ffffffff80105d3c>] dquot_free_space+0xb/0xd
Jan  3 14:33:57 service103 kernel:  [<ffffffff88b7e1a3>] ldiskfs_free_blocks+0xa3/0xc0 [ldiskfs]
Jan  3 14:33:57 service103 kernel:  [<ffffffff88b84e4a>] ldiskfs_ext_truncate+0x50a/0xa80 [ldiskfs]
Jan  3 14:33:57 service103 kernel:  [<ffffffff8002d270>] wake_up_bit+0x1e/0x23
Jan  3 14:33:57 service103 kernel:  [<ffffffff88b8c5c3>] ldiskfs_truncate+0xb3/0x5c0 [ldiskfs]
Jan  3 14:33:58 service103 kernel:  [<ffffffff80019d9f>] __getblk+0x25/0x236
Jan  3 14:33:58 service103 kernel:  [<ffffffff88b8000b>] __ldiskfs_handle_dirty_metadata+0xdb/0x110 [ldiskfs]
Jan  3 14:33:58 service103 kernel:  [<ffffffff800cff96>] unmap_mapping_range+0x59/0x204
Jan  3 14:33:58 service103 kernel:  [<ffffffff88b88fd5>] ldiskfs_mark_iloc_dirty+0x4a5/0x540 [ldiskfs]
Jan  3 14:33:58 service103 kernel:  [<ffffffff800d09e6>] vmtruncate+0xa2/0xc9
Jan  3 14:33:58 service103 kernel:  [<ffffffff80041afe>] inode_setattr+0x22/0x104
Jan  3 14:33:58 service103 kernel:  [<ffffffff88b8f6ce>] ldiskfs_setattr+0x2de/0x3a0 [ldiskfs]
Jan  3 14:33:58 service103 kernel:  [<ffffffff88c13c07>] fsfilt_ldiskfs_setattr+0x1a7/0x250 [fsfilt_ldiskfs]
Jan  3 14:33:58 service103 kernel:  [<ffffffff88c2f511>] filter_version_get_check+0x91/0x2a0 [obdfilter]
Jan  3 14:33:58 service103 kernel:  [<ffffffff800a502c>] up_write+0x9/0xb
Jan  3 14:33:58 service103 kernel:  [<ffffffff88c3cf7b>] filter_destroy+0xd9b/0x1fb0 [obdfilter]
Jan  3 14:33:58 service103 kernel:  [<ffffffff88908910>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
Jan  3 14:33:58 service103 kernel:  [<ffffffff8890bff0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
Jan  3 14:33:59 service103 kernel:  [<ffffffff88869403>] lh_read_lock+0x13/0x20 [obdclass]
Jan  3 14:33:59 service103 kernel:  [<ffffffff88931ed4>] lustre_msg_add_version+0x34/0x110 [ptlrpc]
Jan  3 14:33:59 service103 kernel:  [<ffffffff88934b9a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Jan  3 14:33:59 service103 kernel:  [<ffffffff887c9309>] cfs_mem_cache_free+0x9/0x10 [libcfs]
Jan  3 14:33:59 service103 kernel:  [<ffffffff888f33eb>] ldlm_resource_putref_internal+0x3ab/0x460 [ptlrpc]
Jan  3 14:33:59 service103 kernel:  [<ffffffff888ee8d2>] ldlm_lock_put+0x372/0x3d0 [ptlrpc]
Jan  3 14:33:59 service103 kernel:  [<ffffffff88bee080>] ost_destroy+0x660/0x790 [ost]
Jan  3 14:33:59 service103 kernel:  [<ffffffff88930e85>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Jan  3 14:33:59 service103 kernel:  [<ffffffff88bf7a56>] ost_handle+0x1556/0x55b0 [ost]
Jan  3 14:33:59 service103 kernel:  [<ffffffff889405e9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Jan  3 14:33:59 service103 kernel:  [<ffffffff88940d45>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Jan  3 14:33:59 service103 kernel:  [<ffffffff8008eb34>] default_wake_function+0x0/0xf
Jan  3 14:33:59 service103 kernel:  [<ffffffff88941cd6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Jan  3 14:34:00 service103 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jan  3 14:34:00 service103 kernel:  [<ffffffff88940d70>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Jan  3 14:34:00 service103 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

This thread start journal > hold the sb>s_dquot->dqptr_sem.

Actually, the correct locking order is: journal_lock > dqptr_sem, however, we called ll_vfs_dq_init() before journal start in several places, which breaks the locking order. I'll post a patch to fix this soon.

Comment by Niu Yawei (Inactive) [ 05/Jan/12 ]

patch for b1_8: http://review.whamcloud.com/1919

Comment by Niu Yawei (Inactive) [ 06/Jan/12 ]

patch for master: http://review.whamcloud.com/1923

Comment by Niu Yawei (Inactive) [ 06/Jan/12 ]

Hi, Mahmoud

Could you try the patch for b1_8 to see if it can resolve your problem? Thanks.

Comment by Jay Lan (Inactive) [ 10/Jan/12 ]

I saw Maloo passed on "patch for b1_8: http://review.whamcloud.com/1919".
What tests has Maloo run and passed? A complete acc-sm tests? Has it run SANITY_QUOTA test suite? The "parallel_grouplock" test?

Thanks,
Jay

Comment by Peter Jones [ 10/Jan/12 ]

Jay

That's the beauty of the system - click on the link in the comments of the changeset (under Maloo https://maloo.whamcloud.com/test_sessions/e0e2890a-3793-11e1-bd0b-5254004bbbd3) and you can see for yourself

Peter

Comment by Mahmoud Hanafi [ 12/Jan/12 ]

Install the Patch today. Will know if a few days if we run into the deadlock.

Thanks,
Mahmoud

Comment by Shuichi Ihara (Inactive) [ 16/Jan/12 ]

Niu,

what is trigger of this issue? It seems that we also hit same issue which is filed on LU-967.
We were beginning this issue from a couple of weeks ago, and happened quite often at the last week and fixed by your patch. But why showed up this issue? what is trigger?

Comment by Niu Yawei (Inactive) [ 16/Jan/12 ]

Hi, Ihara

It's good to hear that this patch fix your problem. This is a deadlock, which could be triggered when quota is enabled and there are some many parallel ios which caused some io threads have to wait for journal space.

Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » i686,server,el5,ofa #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 16/Jan/12 ]

Integrated in lustre-master » i686,client,el5,ofa #427
LU-952 quota: follow locking order of quota code (Revision 75fd0fbc949962acc3a624b33df4627cd7915313)

Result = SUCCESS
Oleg Drokin : 75fd0fbc949962acc3a624b33df4627cd7915313
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Peter Jones [ 26/Jan/12 ]

Landed to b1_8 and 2.2

Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 26/Jan/12 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #169
LU-952 quota: follow locking order of quota code (Revision 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5)

Result = SUCCESS
Johann Lombardi : 7dc058bfa97f4ae7a544ddb30109d5fe06c41dc5
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,client,sles11,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » i686,client,el6,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,server,el6,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » i686,client,el5,ofa #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,server,el5,ofa #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,client,el6,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » i686,server,el6,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,client,el5,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » i686,server,el5,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,server,el5,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » i686,server,el5,ofa #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter.c
  • lustre/obdfilter/filter_io_26.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,client,el5,ofa #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 08/Apr/12 ]

Integrated in lustre-b2_1 » i686,client,el5,inkernel #41
LU-952 quota: follow locking order of quota code (Revision 7303349e0672c94a45265229eb6b6b765748fabd)

Result = SUCCESS
Oleg Drokin : 7303349e0672c94a45265229eb6b6b765748fabd
Files :

  • lustre/obdfilter/filter_io_26.c
  • lustre/obdfilter/filter.c
Generated at Sat Feb 10 01:12:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.