[LU-1107] OSS dumps Call traces occasionally Created: 15/Feb/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10317 |
| Description |
|
Our customer is seeing the OSS dump the following call traces occasionally. This is similar to

Feb 14 15:03:45 oss02 kernel:
  [<ffffffff8002e024>] __wake_up+0x38/0x4f
  [<ffffffff88aba7f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
  [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
  [<ffffffff88b6590b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
  [<ffffffff88ba6194>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]
  [<ffffffff88b47d09>] ost_brw_write+0x1c99/0x2480 [ost]
  [<ffffffff88881ac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
  [<ffffffff8884c8b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
  [<ffffffff888860a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
  [<ffffffff88b4b09e>] ost_handle+0x2bae/0x55b0 [ost]
  [<ffffffff888956d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
  [<ffffffff88895e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
  [<ffffffff88896dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff88895e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
  [<ffffffff8005dfa7>] child_rip+0x0/0x11

The other call trace is here:

Feb 15 11:10:47 oss03 kernel:
  [<ffffffff80046823>] try_to_wake_up+0x27/0x484
  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
  [<ffffffff8028b1ca>] __down_trylock+0x39/0x4e
  [<ffffffff8006473d>] __down_failed_trylock+0x35/0x3a
  [<ffffffff8886c0c1>] ldlm_pool_shrink+0x31/0xf0 [ptlrpc]
  [<ffffffff8884a1e6>] .text.lock.ldlm_resource+0x7d/0x87 [ptlrpc]
  [<ffffffff8886d24c>] ldlm_pools_shrink+0x15c/0x2f0 [ptlrpc]
  [<ffffffff80064614>] __down_read+0x12/0x92
  [<ffffffff8003f285>] shrink_slab+0xdc/0x153
  [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd
  [<ffffffff800ca81e>] __rmqueue+0x44/0xc6
  [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442
  [<ffffffff8000f46f>] __alloc_pages+0x78/0x308
  [<ffffffff80025e20>] find_or_create_page+0x32/0x72
  [<ffffffff88b98445>] filter_get_page+0x35/0x70 [obdfilter]
  [<ffffffff88b9a68a>] filter_preprw+0x14da/0x1e00 [obdfilter]
  [<ffffffff8876d121>] LNetMDBind+0x301/0x450 [lnet]
  [<ffffffff887d5d30>] class_handle2object+0xe0/0x170 [obdclass]
  [<ffffffff88b4500c>] ost_brw_write+0xf9c/0x2480 [ost]
  [<ffffffff8876d121>] LNetMDBind+0x301/0x450 [lnet]
  [<ffffffff88889c65>] lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
  [<ffffffff88883fe5>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
  [<ffffffff88883ef5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
  [<ffffffff888840a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
  [<ffffffff88b4909e>] ost_handle+0x2bae/0x55b0 [ost]
  [<ffffffff887d5d30>] class_handle2object+0xe0/0x170 [obdclass]
  [<ffffffff8883e19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
  [<ffffffff88843168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
  [<ffffffff888936d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
  [<ffffffff88893e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
  [<ffffffff88894dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff88893e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Is this related to any known bugs? |
| Comments |
| Comment by Oleg Drokin [ 15/Feb/12 ] |
|
The two call traces are of quite different natures, I would say. The other stack trace looks like an out-of-memory-induced ldlm shrink problem; I think Johann battled quite a few of those not so long ago. |
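For context on how a page allocation in the OST I/O path ends up inside ldlm code at all (the __alloc_pages -> shrink_slab -> ldlm_pools_shrink sequence in the second trace), here is a minimal, self-contained sketch of the generic "registered cache shrinker" pattern involved. It is not kernel or Lustre code; every name in it (toy_register_shrinker, toy_shrink_slab, toy_lock_cache_shrink, toy_alloc_page) is invented for illustration.

/*
 * Toy model of the shrinker-callback pattern visible in the second trace:
 * an allocation made under memory pressure walks all registered cache
 * shrinkers, one of which is Lustre's ldlm pools shrinker.
 * Invented names throughout; compile with: cc -std=c99 toy_shrink.c
 */
#include <stdio.h>

typedef int (*shrinker_fn)(int nr, unsigned int gfp_mask);

#define MAX_SHRINKERS 8
static shrinker_fn shrinkers[MAX_SHRINKERS];
static int nr_shrinkers;

static void toy_register_shrinker(shrinker_fn fn)
{
        if (nr_shrinkers < MAX_SHRINKERS)
                shrinkers[nr_shrinkers++] = fn;
}

/* Stand-in for shrink_slab(): ask every registered cache to shrink. */
static void toy_shrink_slab(unsigned int gfp_mask)
{
        int i;

        for (i = 0; i < nr_shrinkers; i++)
                printf("shrinker %d now caches %d objects\n",
                       i, shrinkers[i](128, gfp_mask));
}

/* Stand-in for a lock-cache shrinker such as the ldlm pools one. */
static int toy_lock_cache_shrink(int nr, unsigned int gfp_mask)
{
        static int cached = 1000;

        (void)gfp_mask;
        cached -= (nr < cached) ? nr : cached;
        return cached;
}

/* Stand-in for an allocation in the OST write path (filter_get_page):
 * when free pages run out, reclaim calls back into the shrinkers. */
static void toy_alloc_page(unsigned int gfp_mask)
{
        toy_shrink_slab(gfp_mask);      /* pretend the free lists were empty */
}

int main(void)
{
        toy_register_shrinker(toy_lock_cache_shrink);
        toy_alloc_page(0);
        return 0;
}

The point is only the call shape: the thread doing ordinary I/O is the same thread that ends up running the lock-cache shrinker, which is why the shrinker has to be careful about what it touches.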
| Comment by Shuichi Ihara (Inactive) [ 15/Feb/12 ] |
|
Got a backtrace when the problem happened. In addition, there are some updates: while the OSS is dumping the call traces, the clients can't access the servers, but after 30-60 minutes the server recovered without a reboot or any other action. |
| Comment by Shuichi Ihara (Inactive) [ 16/Feb/12 ] |
|
The customer hit the same situation again today. They used sysrq-t to get a backtrace. Please have a look at all the attached OSS log files. |
| Comment by Cliff White (Inactive) [ 16/Feb/12 ] |
|
What is in the logs prior to the stack dumps? There should be a related error, indicating why the stack was dumped. |
| Comment by Shuichi Ihara (Inactive) [ 16/Feb/12 ] |
|
Hmm, they hit the same situation today, where clients can't access the servers, but there were not many call traces this time. |
| Comment by Shuichi Ihara (Inactive) [ 16/Feb/12 ] |
|
This is the client log; it was hanging when the problem happened. Is it related to |
| Comment by Shuichi Ihara (Inactive) [ 21/Feb/12 ] |
|
OK, I want to focus on the jbd2-related problem in this ticket. Please change the title to "jbd2 deadlock" or whatever title is reasonable. Today, we saw the same jbd2-related call trace on an OSS, and got sysrq-t output before umounting/mounting the OSTs. After umounting/mounting the OSTs without a reboot, the problem was gone. But, in fact, the same problem happened again. |
| Comment by Shuichi Ihara (Inactive) [ 21/Feb/12 ] |
|
The log files from when the problem happened again on 02/22. |
| Comment by Oleg Drokin [ 23/Feb/12 ] |
Feb 22 06:02:18 oss01 kernel: Lustre: Service thread pid 18979 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 22 06:02:18 oss01 kernel: Pid: 18979, comm: ll_ost_io_78
Feb 22 06:02:18 oss01 kernel: Call Trace:
  [<ffffffff889f47f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
  [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
  [<ffffffff88a9f90b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
  [<ffffffff88ae0194>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]
  [<ffffffff88a81d09>] ost_brw_write+0x1c99/0x2480 [ost]
  [<ffffffff887bbac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
  [<ffffffff887868b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
  [<ffffffff887c00a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
  [<ffffffff88a8509e>] ost_handle+0x2bae/0x55b0 [ost]
  [<ffffffff88711d30>] class_handle2object+0xe0/0x170 [obdclass]
  [<ffffffff8877a19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
  [<ffffffff8877f168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
  [<ffffffff887cf6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
  [<ffffffff887cfe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
  [<ffffffff887d0dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff887cfe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 22 06:02:19 oss01 kernel: LustreError: dumping log to /tmp/lustre-log.1329858138.18979
Feb 22 06:02:19 oss01 kernel: Lustre: Service thread pid 18919 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 22 06:02:19 oss01 kernel: Pid: 18919, comm: ll_ost_io_18
Feb 22 06:02:19 oss01 kernel: Call Trace:
  [<ffffffff80062ff2>] thread_return+0x62/0xfe
  [<ffffffff80046c6e>] try_to_wake_up+0x472/0x484
  [<ffffffff80064a0b>] __down+0xc3/0xd8
  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
  [<ffffffff800646c9>] __down_failed+0x35/0x3a
  [<ffffffff88a316a0>] ldiskfs_destroy_inode+0x0/0x90 [ldiskfs]
  [<ffffffff887861e6>] .text.lock.ldlm_resource+0x7d/0x87 [ptlrpc]
  [<ffffffff887a9337>] ldlm_pools_shrink+0x247/0x2f0 [ptlrpc]
  [<ffffffff80064614>] __down_read+0x12/0x92
  [<ffffffff800223e9>] __up_read+0x19/0x7f
  [<ffffffff8003f279>] shrink_slab+0xd0/0x153
  [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd
  [<ffffffff80017ce1>] cache_grow+0x360/0x3c7
  [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442
  [<ffffffff8000f46f>] __alloc_pages+0x78/0x308
  [<ffffffff80025e20>] find_or_create_page+0x32/0x72
  [<ffffffff88ad4445>] filter_get_page+0x35/0x70 [obdfilter]
  [<ffffffff88ad668a>] filter_preprw+0x14da/0x1e00 [obdfilter]
  [<ffffffff886a9121>] LNetMDBind+0x301/0x450 [lnet]
  [<ffffffff88711d30>] class_handle2object+0xe0/0x170 [obdclass]
  [<ffffffff88a8100c>] ost_brw_write+0xf9c/0x2480 [ost]
  [<ffffffff887bbac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
  [<ffffffff887868b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
  [<ffffffff887bffe5>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
  [<ffffffff887bfef5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
  [<ffffffff887c00a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
  [<ffffffff88a8509e>] ost_handle+0x2bae/0x55b0 [ost]
  [<ffffffff88711d30>] class_handle2object+0xe0/0x170 [obdclass]
  [<ffffffff8877a19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
  [<ffffffff8877f168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
  [<ffffffff887cf6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
  [<ffffffff887cfe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
  [<ffffffff887d0dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff887cfe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
  [<ffffffff8005dfa7>] child_rip+0x0/0x11

This second stack trace is the culprit, I guess. Attempting to allocate some memory got into the ldlm shrinker, which went to drop an ldiskfs inode that might hold a journal or something, upsetting the other thread as a result. I think we had some patches about this that I am now trying to find. |
| Comment by Johann Lombardi (Inactive) [ 23/Feb/12 ] |
|
Hm, on the server side the ldlm pool shrinker is "just" supposed to decrease the SLV (see ldlm_srv_pool_shrink()), so I'm not sure I understand how it can end up in ldiskfs_destroy_inode()... really odd. |
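As a rough illustration of the distinction Johann is drawing, the sketch below contrasts a server-style pool shrink (only lower the advertised server lock volume, free nothing locally) with a client-style shrink (actually drop cached locks, which is the path that can touch filesystem objects). This is a simplified model, not the actual Lustre pool code; the struct, function names, and numbers are invented.

/*
 * Simplified model (not Lustre code) of the two shrink behaviours:
 * the "server" variant only lowers the advertised lock volume so that
 * clients start cancelling locks, while the "client" variant frees
 * cached locks directly. All fields and values are invented.
 */
#include <stdio.h>

struct toy_pool {
        unsigned long slv;      /* advertised server lock volume */
        unsigned long cached;   /* locks cached locally */
};

/* Server side: export pressure through the SLV, free nothing here. */
static int toy_srv_pool_shrink(struct toy_pool *pl, unsigned long nr)
{
        if (nr && pl->slv > nr)
                pl->slv -= nr;
        return (int)pl->cached;         /* nothing actually released */
}

/* Client side: actually drop cached locks (the path that can end up
 * releasing filesystem objects such as inodes). */
static int toy_cli_pool_shrink(struct toy_pool *pl, unsigned long nr)
{
        unsigned long freed = (nr < pl->cached) ? nr : pl->cached;

        pl->cached -= freed;
        return (int)pl->cached;
}

int main(void)
{
        struct toy_pool srv = { .slv = 1000, .cached = 500 };
        struct toy_pool cli = { .slv = 0,    .cached = 500 };

        printf("srv shrink: cached=%d slv=%lu\n",
               toy_srv_pool_shrink(&srv, 100), srv.slv);
        printf("cli shrink: cached=%d\n", toy_cli_pool_shrink(&cli, 100));
        return 0;
}

On that model, a server-side shrink should never need to free inodes, which is why the ldiskfs_destroy_inode frame in the oss01 trace is surprising.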
| Comment by Shuichi Ihara (Inactive) [ 24/Feb/12 ] |
|
Sorry, they only applied the patches to the OSSs. We should apply the patches of |
| Comment by Johann Lombardi (Inactive) [ 24/Feb/12 ] |
|
Ihara, the patch from |
| Comment by Shuichi Ihara (Inactive) [ 24/Feb/12 ] |
|
Johann, |
| Comment by Shuichi Ihara (Inactive) [ 26/Feb/12 ] |
|
The customer tested the patch for

Please investigate. The production system is not working well when even one client runs rsync on the client... |
| Comment by Peter Jones [ 26/Feb/12 ] |
|
Ihara, I am sorry if we have not been clear enough - we did not expect the patch from

Regards

Peter |
| Comment by Shuichi Ihara (Inactive) [ 26/Feb/12 ] |
|
Peter, yes, agreed. Johann suggested the patch wouldn't help; the customer tested it just in case. |
| Comment by Shuichi Ihara (Inactive) [ 29/Feb/12 ] |
|
Oleg suggested the following change and we tested it:

--- a/lustre/ldlm/ldlm_pool.c
+++ b/lustre/ldlm/ldlm_pool.c
@@ -1062,8 +1062,7 @@ static int ldlm_pools_shrink(ldlm_side_t client, int nr,
         int total = 0, cached = 0, nr_ns;
         struct ldlm_namespace *ns;
 
-        if (client == LDLM_NAMESPACE_CLIENT && nr != 0 &&
-            !(gfp_mask & __GFP_FS))
+        if (nr != 0 && !(gfp_mask & __GFP_FS))
                 return -1;
 
         if (nr != 0)

We didn't see the same issue or the same jbd2 call trace on the OSS, and copying the files with rsync worked well.

Feb 29 16:41:17 oss03 kernel: Call Trace:
  [<ffffffff80046c1e>] try_to_wake_up+0x422/0x484
  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
  [<ffffffff8028b1ca>] __down_trylock+0x39/0x4e
  [<ffffffff8006473d>] __down_failed_trylock+0x35/0x3a
  [<ffffffff887c20f4>] ldlm_pool_shrink+0x64/0xf0 [ptlrpc]
  [<ffffffff887a01dc>] .text.lock.ldlm_resource+0x73/0x87 [ptlrpc]
  [<ffffffff887c332f>] ldlm_pools_shrink+0x23f/0x2d0 [ptlrpc]
  [<ffffffff80064614>] __down_read+0x12/0x92
  [<ffffffff800223e9>] __up_read+0x19/0x7f
  [<ffffffff8003f285>] shrink_slab+0xdc/0x153
  [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd
  [<ffffffff8000985a>] __d_lookup+0xb0/0xff
  [<ffffffff800ca81e>] __rmqueue+0x44/0xc6
  [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442
  [<ffffffff8000f46f>] __alloc_pages+0x78/0x308
  [<ffffffff80025e20>] find_or_create_page+0x32/0x72
  [<ffffffff88aee445>] filter_get_page+0x35/0x70 [obdfilter]
  [<ffffffff88aeea81>] filter_preprw_read+0x601/0xd30 [obdfilter]
  [<ffffffff886ca230>] LNetPut+0x730/0x840 [lnet]
  [<ffffffff887d5303>] ptl_send_buf+0x3f3/0x5b0 [ptlrpc]
  [<ffffffff887dfc45>] lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
  [<ffffffff88af0f47>] filter_preprw+0x1d97/0x1e00 [obdfilter]
  [<ffffffff887d5aa8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
  [<ffffffff887ddc3a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
  [<ffffffff887d5aa8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
  [<ffffffff88a96ac3>] ost_brw_read+0xb33/0x1a70 [ost]
  [<ffffffff887d9ed5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
  [<ffffffff887da088>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
  [<ffffffff88a9f363>] ost_handle+0x2e73/0x55b0 [ost]
  [<ffffffff8872bd30>] class_handle2object+0xe0/0x170 [obdclass]
  [<ffffffff8879419a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
  [<ffffffff88799168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
  [<ffffffff887e96b9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
  [<ffffffff887e9e15>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
  [<ffffffff887eada6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff887e9e40>] ptlrpc_main+0x0/0x1120 [ptlrpc]
  [<ffffffff8005dfa7>] child_rip+0x0/0x11 |
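To spell out what the quoted one-liner changes, here is a small, self-contained sketch (not the Lustre implementation; the names toy_shrink, toy_cached, and TOY_GFP_FS are invented) of the general rule it enforces: a shrinker that might re-enter filesystem code must refuse to do any work when the allocation that triggered reclaim was made without __GFP_FS, because such allocations come from inside filesystem/I-O paths like the ost_brw_write traces above. Before the patch this guard applied only when client == LDLM_NAMESPACE_CLIENT; after it, the server-side pools bail out too.

/*
 * Illustration of the __GFP_FS guard that the patch above extends to the
 * server-side pools. Invented names; not Lustre or kernel code.
 */
#include <stdio.h>

#define TOY_GFP_FS 0x80u        /* stand-in for the kernel's __GFP_FS flag */

static int toy_cached = 1000;   /* objects this cache could free */

/* nr: how many objects reclaim would like freed;
 * gfp_mask: flags of the allocation that triggered reclaim. */
static int toy_shrink(int nr, unsigned int gfp_mask)
{
        /*
         * If the triggering allocation may not re-enter the filesystem
         * (no __GFP_FS), doing work here risks blocking on filesystem
         * state -- for example a journal commit -- owned by the very
         * thread that is allocating. Report "cannot help" instead.
         */
        if (nr != 0 && !(gfp_mask & TOY_GFP_FS))
                return -1;

        if (nr > toy_cached)
                nr = toy_cached;
        toy_cached -= nr;
        return toy_cached;      /* objects still cached */
}

int main(void)
{
        printf("with GFP_FS:    %d\n", toy_shrink(100, TOY_GFP_FS));
        printf("without GFP_FS: %d\n", toy_shrink(100, 0));
        return 0;
}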
| Comment by Shuichi Ihara (Inactive) [ 29/Feb/12 ] |
|
Log files on the OSS after removing the "client == LDLM_NAMESPACE_CLIENT" check. |
| Comment by Shuichi Ihara (Inactive) [ 01/Mar/12 ] |
|
The same problem happened again. At this site, the client memory size is 96GB, each OSS has 24GB of memory, and there are 4 OSSs. We just disabled lru_resize and used a fixed lru_size, and we haven't seen the same issue so far. We will keep testing this configuration to see whether the problem is gone completely. |
| Comment by Oleg Drokin [ 01/Mar/12 ] |
|
Hm, I guess there's some balancing to do on the server side to better reclaim locks preemptively before an OOM situation arises. |
| Comment by Shuichi Ihara (Inactive) [ 21/Jun/12 ] |
|
Hello Oleg, the problem does not seem to be fixed yet. We are still getting many call traces on the OSSs even with dynamic lru_size disabled. Please have a look at them and investigate. Thanks! |
| Comment by Kit Westneat (Inactive) [ 29/Aug/12 ] |
|
We hit this bug again today, but this time it caused 80+ clients to be unable to reconnect for a while. The customer restarted the machine, so it didn't finish trying to do the shrink. There was, however, an earlier event on the 23rd of August, and that one took over 3 hours. I'll try to get the lustre-logs that were dumped. In the meantime, I'll upload the OSS log. It's worth looking at just to see this:
|
| Comment by Kit Westneat (Inactive) [ 29/Aug/12 ] |
|
Kernel log from OSS2, the one that hung |
| Comment by Kit Westneat (Inactive) [ 29/Aug/12 ] |
|
This looks very similar to |
| Comment by Peter Jones [ 29/Aug/12 ] |
|
Lai could you please comment? |
| Comment by Lai Siyao [ 30/Aug/12 ] |
|
Hmm, IMO it's the same issue as |
| Comment by Kit Westneat (Inactive) [ 30/Aug/12 ] |
|
Hi Lai, we are not carrying the patch in |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Close old ticket. |