[LU-1107] OSS dumps Call traces occasionally Created: 15/Feb/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: Text File UA-oss2-kern.log     File client_messages     File messages-2-29     File messages.tar.gz     File messages_cluster1     File oss01_messages     File oss02_messages     File oss03_messages     File sysrq-t    
Severity: 3
Rank (Obsolete): 10317

 Description   

Our customer occasionally sees their OSSes dump the following call traces.

This is similar to LU-745, but this customer is running lustre-1.8.7, in which LU-745 should already be fixed.

Feb 14 15:03:45 oss02 kernel:  [<ffffffff8002e024>] __wake_up+0x38/0x4f
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88aba7f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88b6590b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88ba6194>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88b47d09>] ost_brw_write+0x1c99/0x2480 [ost]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88881ac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff8884c8b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Feb 14 15:03:45 oss02 kernel:  [<ffffffff888860a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88b4b09e>] ost_handle+0x2bae/0x55b0 [ost]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff888956d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88895e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88896dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 14 15:03:45 oss02 kernel:  [<ffffffff88895e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Feb 14 15:03:45 oss02 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

The other call trace is here:

Feb 15 11:10:47 oss03 kernel:  [<ffffffff80046823>] try_to_wake_up+0x27/0x484
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8028b1ca>] __down_trylock+0x39/0x4e
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8006473d>] __down_failed_trylock+0x35/0x3a
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8886c0c1>] ldlm_pool_shrink+0x31/0xf0 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8884a1e6>] .text.lock.ldlm_resource+0x7d/0x87 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8886d24c>] ldlm_pools_shrink+0x15c/0x2f0 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff80064614>] __down_read+0x12/0x92
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8003f285>] shrink_slab+0xdc/0x153
Feb 15 11:10:47 oss03 kernel:  [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd
Feb 15 11:10:47 oss03 kernel:  [<ffffffff800ca81e>] __rmqueue+0x44/0xc6
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8000f46f>] __alloc_pages+0x78/0x308
Feb 15 11:10:47 oss03 kernel:  [<ffffffff80025e20>] find_or_create_page+0x32/0x72
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88b98445>] filter_get_page+0x35/0x70 [obdfilter]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88b9a68a>] filter_preprw+0x14da/0x1e00 [obdfilter]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8876d121>] LNetMDBind+0x301/0x450 [lnet]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff887d5d30>] class_handle2object+0xe0/0x170 [obdclass]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88b4500c>] ost_brw_write+0xf9c/0x2480 [ost]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8876d121>] LNetMDBind+0x301/0x450 [lnet]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88889c65>] lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88883fe5>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88883ef5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Feb 15 11:10:47 oss03 kernel:  [<ffffffff888840a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88b4909e>] ost_handle+0x2bae/0x55b0 [ost]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff887d5d30>] class_handle2object+0xe0/0x170 [obdclass]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8883e19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88843168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff888936d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88893e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88894dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 15 11:10:47 oss03 kernel:  [<ffffffff88893e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Feb 15 11:10:47 oss03 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Are these related to any known bugs?



 Comments   
Comment by Oleg Drokin [ 15/Feb/12 ]

The two call traces are of quite different natures, I would say.
The first sample looks like some sort of journal deadlock. In that case it would have been very useful to have backtraces for all processes in the system; I checked the logs and the samples there are very limited.

The other stacktrace looks like an out-of-memory-induced ldlm shrink problem; I think Johann battled quite a few of those not so long ago.
I propose separating these two reports into two different bugs.

Comment by Shuichi Ihara (Inactive) [ 15/Feb/12 ]

Got a backtrace when the problem happened.

In addition, there are some updates. While the OSS is dumping the call traces, the clients cannot access the servers, but after 30-60 minutes the servers recovered without a reboot or any other action.

Comment by Shuichi Ihara (Inactive) [ 16/Feb/12 ]

The customer hit the same situation again today. They used sysrq-t to get a backtrace; please have a look at all the attached OSS log files.

Comment by Cliff White (Inactive) [ 16/Feb/12 ]

What is in the logs prior to the stack dumps? There should be a related error, indicating why the stack was dumped.

Comment by Shuichi Ihara (Inactive) [ 16/Feb/12 ]

Hmm... they hit the same situation today (clients unable to access the servers), but without many call traces this time.
See messages_oss03 at Feb 16 13:12:15 and Feb 16 22:30:30; these are when the problem happened.

Comment by Shuichi Ihara (Inactive) [ 16/Feb/12 ]

This is the client log; the client was hanging when the problem happened. Is it related to LU-25?

Comment by Shuichi Ihara (Inactive) [ 21/Feb/12 ]

OK, I want to focus on the jbd2-related problem in this ticket. Please change the title to "jbd2 deadlock" or whatever title seems reasonable.

Today we saw the same jbd2-related call trace on an OSS and captured sysrq-t output before unmounting/remounting the OSTs.

After unmounting/remounting the OSTs without a reboot, the problem went away. In fact, however, the same problem happened again later.

Comment by Shuichi Ihara (Inactive) [ 21/Feb/12 ]

The log files from when the problem happened again on 02/22.

Comment by Oleg Drokin [ 23/Feb/12 ]
Feb 22 06:02:18 oss01 kernel: Lustre: Service thread pid 18979 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 22 06:02:18 oss01 kernel: Pid: 18979, comm: ll_ost_io_78
Feb 22 06:02:18 oss01 kernel: 
Feb 22 06:02:18 oss01 kernel: Call Trace:
Feb 22 06:02:18 oss01 kernel:  [<ffffffff889f47f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
Feb 22 06:02:18 oss01 kernel:  [<ffffffff88a9f90b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff88ae0194>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff88a81d09>] ost_brw_write+0x1c99/0x2480 [ost]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff887bbac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff887868b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Feb 22 06:02:18 oss01 kernel:  [<ffffffff887c00a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff88a8509e>] ost_handle+0x2bae/0x55b0 [ost]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff88711d30>] class_handle2object+0xe0/0x170 [obdclass]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff8877a19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
Feb 22 06:02:18 oss01 kernel:  [<ffffffff8877f168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887cf6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887cfe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887d0dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887cfe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Feb 22 06:02:19 oss01 kernel: 
Feb 22 06:02:19 oss01 kernel: LustreError: dumping log to /tmp/lustre-log.1329858138.18979
Feb 22 06:02:19 oss01 kernel: Lustre: Service thread pid 18919 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 22 06:02:19 oss01 kernel: Pid: 18919, comm: ll_ost_io_18
Feb 22 06:02:19 oss01 kernel: 
Feb 22 06:02:19 oss01 kernel: Call Trace:
Feb 22 06:02:19 oss01 kernel:  [<ffffffff80062ff2>] thread_return+0x62/0xfe
Feb 22 06:02:19 oss01 kernel:  [<ffffffff80046c6e>] try_to_wake_up+0x472/0x484
Feb 22 06:02:19 oss01 kernel:  [<ffffffff80064a0b>] __down+0xc3/0xd8
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Feb 22 06:02:19 oss01 kernel:  [<ffffffff800646c9>] __down_failed+0x35/0x3a
Feb 22 06:02:19 oss01 kernel:  [<ffffffff88a316a0>] ldiskfs_destroy_inode+0x0/0x90 [ldiskfs]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887861e6>] .text.lock.ldlm_resource+0x7d/0x87 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887a9337>] ldlm_pools_shrink+0x247/0x2f0 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff80064614>] __down_read+0x12/0x92
Feb 22 06:02:19 oss01 kernel:  [<ffffffff800223e9>] __up_read+0x19/0x7f
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8003f279>] shrink_slab+0xd0/0x153
Feb 22 06:02:19 oss01 kernel:  [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd
Feb 22 06:02:19 oss01 kernel:  [<ffffffff80017ce1>] cache_grow+0x360/0x3c7
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8000f46f>] __alloc_pages+0x78/0x308
Feb 22 06:02:19 oss01 kernel:  [<ffffffff80025e20>] find_or_create_page+0x32/0x72
Feb 22 06:02:19 oss01 kernel:  [<ffffffff88ad4445>] filter_get_page+0x35/0x70 [obdfilter]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff88ad668a>] filter_preprw+0x14da/0x1e00 [obdfilter]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff886a9121>] LNetMDBind+0x301/0x450 [lnet]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff88711d30>] class_handle2object+0xe0/0x170 [obdclass]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff88a8100c>] ost_brw_write+0xf9c/0x2480 [ost]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887bbac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887868b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887bffe5>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887bfef5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887c00a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff88a8509e>] ost_handle+0x2bae/0x55b0 [ost]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff88711d30>] class_handle2object+0xe0/0x170 [obdclass]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8877a19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8877f168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887cf6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887cfe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887d0dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 22 06:02:19 oss01 kernel:  [<ffffffff887cfe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Feb 22 06:02:19 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

This second stacktrace is the culprit, I guess. Attempting to allocate some memory got into ldlm to drop an ldiskfs inode, which might be holding the journal or something, upsetting the other thread as a result.

I think we had some patches about it that I am now trying to find.

Comment by Johann Lombardi (Inactive) [ 23/Feb/12 ]

Hm, on the server side the ldlm pool shrinker is "just" supposed to decrease the SLV (see ldlm_srv_pool_shrink()), so I'm not sure I understand how it can end up in ldiskfs_destroy_inode()... really odd.
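
To make that expectation concrete, here is a minimal standalone C sketch of the behaviour described above (hypothetical type and field names, placeholder logic; not the actual ldlm_srv_pool_shrink() code): on a query pass it only reports how many locks are cached, otherwise it merely lowers the SLV so that clients cancel locks themselves, and nothing in it should ever touch inodes or the filesystem.

#include <stdio.h>

/* Hypothetical sketch, not Lustre source: the server-side pool "shrink" is
 * only expected to report the cached lock count and to lower the server
 * lock volume (SLV) so that clients cancel locks on their own. */
struct srv_pool_sketch {
        unsigned long slv;      /* server lock volume */
        unsigned long granted;  /* number of locks currently granted */
};

static int srv_pool_shrink_sketch(struct srv_pool_sketch *pl, unsigned long nr)
{
        if (nr == 0)
                return (int)pl->granted;             /* query: how much could be freed */
        pl->slv = (pl->slv > nr) ? pl->slv - nr : 0; /* lower SLV; no memory is freed here */
        return 0;
}

int main(void)
{
        struct srv_pool_sketch pl = { 10000, 512 };
        printf("cached=%d\n", srv_pool_shrink_sketch(&pl, 0));
        srv_pool_shrink_sketch(&pl, 100);
        printf("slv=%lu\n", pl.slv);
        return 0;
}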

Comment by Shuichi Ihara (Inactive) [ 24/Feb/12 ]

Sorry, they only applied the patches to the OSSs. We should apply the LU-25 patches to both the OSSs and the clients, right?

Comment by Johann Lombardi (Inactive) [ 24/Feb/12 ]

Ihara, the patch from LU-25 only helps on the client side, so I don't think it will make any difference to this bug, which is an OSS issue.
Could you please tell us what patches you applied on the OSS side?

Comment by Shuichi Ihara (Inactive) [ 24/Feb/12 ]

Johann,
OK... they applied the LU-25 patches only to the OSSs, not to the clients. This was a miscommunication on our side, sorry. Let us try the same patch on the clients this time, just in case...

Comment by Shuichi Ihara (Inactive) [ 26/Feb/12 ]

The customer tested the LU-25 patch on the client, but it didn't help. The problem seems to be easy to replicate: just copy data from the local filesystem to Lustre with an rsync command (e.g. rsync -avHS --exclude=/lustre --exclude=/dev --exclude=/proc --exclude=/sys /* /lustre/).

Please investigate. The production system does not work well whenever even a single client runs rsync...

Comment by Peter Jones [ 26/Feb/12 ]

Ihara

I am sorry if we have not been clear enough - we did not expect the patch from LU-25 to help with this issue, as it is a different kind of issue. What is blocking us from moving forward is getting details from you about whether any patches have been applied on top of the vanilla release.

Regards

Peter

Comment by Shuichi Ihara (Inactive) [ 26/Feb/12 ]

Peter,

Yes, agreed. Johann suggested the patch wouldn't help; the customer tested it just in case.
Anyway, we applied only one patch (http://review.whamcloud.com/#change,1919) on top of lustre-1.8.7-wc1 at this site.
The reason we applied this patch is that we hit the issue from LU-967 and LU-952 at different sites, and we wanted to prevent the same issue here.

Comment by Shuichi Ihara (Inactive) [ 29/Feb/12 ]

Oleg suggested the following change and we tested it.

--- a/lustre/ldlm/ldlm_pool.c
+++ b/lustre/ldlm/ldlm_pool.c
@@ -1062,8 +1062,7 @@ static int ldlm_pools_shrink(ldlm_side_t client, int nr,
         int total = 0, cached = 0, nr_ns;
         struct ldlm_namespace *ns;
 
-        if (client == LDLM_NAMESPACE_CLIENT && nr != 0 &&
-            !(gfp_mask & __GFP_FS))
+        if (nr != 0 && !(gfp_mask & __GFP_FS))
                 return -1;
 
         if (nr != 0)
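
In other words (a standalone sketch of just this check, with a placeholder flag value rather than the real __GFP_FS; this is not the actual ldlm_pools_shrink()): with the client-only condition removed, the shrinker now bails out for server namespaces too whenever the caller's allocation context does not allow re-entering the filesystem, instead of recursing into ldiskfs/jbd2 from the write path.

#include <stdio.h>

#define GFP_FS_SKETCH 0x80u     /* placeholder for __GFP_FS, for illustration only */

/* Sketch of the post-patch behaviour: refuse to shrink when the allocation
 * cannot re-enter the filesystem, so reclaim never recurses into ldiskfs/jbd2
 * and deadlocks against a thread waiting on a journal commit. */
static int pools_shrink_sketch(int nr, unsigned int gfp_mask)
{
        if (nr != 0 && !(gfp_mask & GFP_FS_SKETCH))
                return -1;      /* tell the VM we cannot help from this context */
        /* ... otherwise walk the namespaces and cancel unused locks ... */
        return 0;
}

int main(void)
{
        printf("%d\n", pools_shrink_sketch(128, 0));             /* -1: FS re-entry not allowed */
        printf("%d\n", pools_shrink_sketch(128, GFP_FS_SKETCH)); /*  0: safe to try shrinking   */
        return 0;
}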

We didn't see the same issue or the same jbd2 call trace on the OSS, and copying the files with rsync worked well.
However, we got another call trace during the same testing:

Feb 29 16:41:17 oss03 kernel: Call Trace:
Feb 29 16:41:17 oss03 kernel:  [<ffffffff80046c1e>] try_to_wake_up+0x422/0x484
Feb 29 16:41:17 oss03 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Feb 29 16:41:17 oss03 kernel:  [<ffffffff8028b1ca>] __down_trylock+0x39/0x4e
Feb 29 16:41:17 oss03 kernel:  [<ffffffff8006473d>] __down_failed_trylock+0x35/0x3a
Feb 29 16:41:17 oss03 kernel:  [<ffffffff887c20f4>] ldlm_pool_shrink+0x64/0xf0 [ptlrpc]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff887a01dc>] .text.lock.ldlm_resource+0x73/0x87 [ptlrpc]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff887c332f>] ldlm_pools_shrink+0x23f/0x2d0 [ptlrpc]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff80064614>] __down_read+0x12/0x92
Feb 29 16:41:17 oss03 kernel:  [<ffffffff800223e9>] __up_read+0x19/0x7f
Feb 29 16:41:17 oss03 kernel:  [<ffffffff8003f285>] shrink_slab+0xdc/0x153
Feb 29 16:41:17 oss03 kernel:  [<ffffffff800ce4ce>] zone_reclaim+0x235/0x2cd
Feb 29 16:41:17 oss03 kernel:  [<ffffffff8000985a>] __d_lookup+0xb0/0xff
Feb 29 16:41:17 oss03 kernel:  [<ffffffff800ca81e>] __rmqueue+0x44/0xc6
Feb 29 16:41:17 oss03 kernel:  [<ffffffff8000a939>] get_page_from_freelist+0xbf/0x442
Feb 29 16:41:17 oss03 kernel:  [<ffffffff8000f46f>] __alloc_pages+0x78/0x308
Feb 29 16:41:17 oss03 kernel:  [<ffffffff80025e20>] find_or_create_page+0x32/0x72
Feb 29 16:41:17 oss03 kernel:  [<ffffffff88aee445>] filter_get_page+0x35/0x70 [obdfilter]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff88aeea81>] filter_preprw_read+0x601/0xd30 [obdfilter]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff886ca230>] LNetPut+0x730/0x840 [lnet]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff887d5303>] ptl_send_buf+0x3f3/0x5b0 [ptlrpc]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff887dfc45>] lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff88af0f47>] filter_preprw+0x1d97/0x1e00 [obdfilter]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff887d5aa8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff887ddc3a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Feb 29 16:41:17 oss03 kernel:  [<ffffffff887d5aa8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff88a96ac3>] ost_brw_read+0xb33/0x1a70 [ost]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff887d9ed5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Feb 29 16:41:18 oss03 kernel:  [<ffffffff887da088>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff88a9f363>] ost_handle+0x2e73/0x55b0 [ost]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff8872bd30>] class_handle2object+0xe0/0x170 [obdclass]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff8879419a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff88799168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff887e96b9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff887e9e15>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Feb 29 16:41:18 oss03 kernel:  [<ffffffff887eada6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 29 16:41:18 oss03 kernel:  [<ffffffff887e9e40>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Feb 29 16:41:18 oss03 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Comment by Shuichi Ihara (Inactive) [ 29/Feb/12 ]

Log files from the OSS after removing the "client == LDLM_NAMESPACE_CLIENT" check.

Comment by Shuichi Ihara (Inactive) [ 01/Mar/12 ]

The same problem happened again.

At this site, the client has 96GB of memory, each OSS has 24GB of memory, and there are 4 OSSs.
We have lru_resize enabled on this client, and it seems the OSSs come under memory pressure because OSS memory is not reclaimed quickly.

We just disabled lru_resize and used a fixed lru_size, and we haven't seen the same issue so far.
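
(For reference, a fixed LRU of this kind is usually set from the client side with something along the lines of "lctl set_param ldlm.namespaces.*.lru_size=<N>", where a non-zero value pins the per-namespace LRU and disables dynamic resizing, and setting it back to 0 re-enables lru_resize; the exact value and namespaces used at this site are not recorded here.)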

We are going to do more testing with this configuration to confirm whether the problem is completely gone.

Comment by Oleg Drokin [ 01/Mar/12 ]

Hm, I guess there's some balancing to do on the server side to better reclaim locks preemptively before an OOM situation arises.

Comment by Shuichi Ihara (Inactive) [ 21/Jun/12 ]

Hello, Oleg,

The problem does not seem to be fixed yet; we are still getting many call traces on the OSSs, even with dynamic lru_size disabled.
I just uploaded the recent OSS log files to the FTP site: uploads/LU-1107/2012-06-20.tar.gz

Please have a look at them and investigate.

Thanks!

Comment by Kit Westneat (Inactive) [ 29/Aug/12 ]

We hit this bug again today; it caused 80+ clients to be unable to reconnect for a while. The customer restarted the machine, so it never finished trying to do the shrink. There was, however, an earlier event on the 23rd of August, and that one took over 3 hours to resolve. This is Lustre 1.8.6 (gce5e033).

I'll try to get the lustre-logs that were dumped. In the meantime, I'll upload the oss log. It's worth looking at just to see this:
Aug 23 15:53:37 oss2 kernel: Lustre: Service thread pid 7842 completed after 11067.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

Comment by Kit Westneat (Inactive) [ 29/Aug/12 ]

Kernel log from OSS2, the one that hung

Comment by Kit Westneat (Inactive) [ 29/Aug/12 ]

This looks very similar to LU-1128, and therefore LU-1535. Can someone take a look at whether or not it's the same issue?

Comment by Peter Jones [ 29/Aug/12 ]

Lai could you please comment?

Comment by Lai Siyao [ 30/Aug/12 ]

Hmm, IMO it's the same issue as LU-1128 and LU-1535. Kit, could you verify that the patch is not included in your code?

Comment by Kit Westneat (Inactive) [ 30/Aug/12 ]

Hi Lai, we are not carrying the patch from LU-1535; it looks like it's targeted for the 1.8.9 release. I'll get a build together for the customer to test.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
