[LU-1444]  @@@ processing error (-16) Created: 29/May/12  Updated: 19/Jun/12  Resolved: 19/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: adam contois (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

CentOS 5.8; kernel 2.6.18-238.12.1.el5_lustre.g266a955


Severity: 3
Rank (Obsolete): 6390

 Description   

We have a large cluster installation where the /home directories are Lustre mounts. Occasionally (roughly once a week or so), we see the OSS lock up - this usually manifests as a hang when running 'df -h' or the like.

Before the crash, this is what we see on the OSS:

May 24 21:29:11 oss0 kernel: LustreError: 5460:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@ffff810311398c00 x1396618176669269/t0 o8->2bf3e0b9-b782-cd24-0e06-8141a03901ff@NET_0x500000a370002_UUID:0/0 lens 368/264 e 0 to 0 dl 1337909450 ref 1 fl Interpret:/0/0 rc -16/0
May 24 21:29:11 oss0 kernel: Lustre: 5475:0:(service.c:1434:ptlrpc_server_handle_request()) @@@ Request x1396618176668701 took longer than estimated (756+139s); client may timeout. req@ffff810322b3a000 x1396618176668701/t0 o101->2bf3e0b9-b782-cd24-0e06-8141a03901ff@NET_0x500000a370002_UUID:0/0 lens 296/352 e 1 to 0 dl 1337909211 ref 1 fl Complete:/0/0 rc 0/0
May 24 21:29:11 oss0 kernel: Lustre: lustre-OST0004: slow parent lock 892s due to heavy IO load
May 24 21:29:11 oss0 kernel: Lustre: Skipped 2 previous similar messages
May 24 21:29:11 oss0 kernel: Lustre: Service thread pid 5475 completed after 895.08s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
May 24 21:29:11 oss0 kernel: Lustre: lustre-OST0004: slow preprw_write setup 892s due to heavy IO load

The -16 error seems to show up every time this happens. The OSS itself is using an Adaptec 6805. The arrays all show as optimal.

The only fix we've found thus far is to reboot the OSS, then mount the OSS volumes (we don't have it set to automount on boot). Then, wait about 10 minutes for Lustre to recover. In the past, we've been able to pinpoint the problematic OSS (we have 2) by looking for unusual load via 'uptime', and then confirming -16 in /var/log/messages.
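
For reference, this is roughly what we run to find the problem host (just our ad-hoc procedure - oss0/oss1 are our two OSS hostnames, and the grep pattern assumes the error always shows up as 'rc -16' in syslog, as in the excerpt above):

# compare load on each OSS
uptime
# then confirm the -16 replies on the suspect one
grep 'rc -16' /var/log/messages | tail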

Any assistance would be extremely helpful. Please let me know if you need any further information. Thanks.



 Comments   
Comment by Peter Jones [ 30/May/12 ]

Niu

Could you please comment on this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 30/May/12 ]

Hi, adam

Before the crash, this is what we see on the OSS

Is the OSS locked up or crashed? Are there any other messages dumped on crash?

There are quite a few tickets for slow IO issues: LU-15, LU-874, LU-410. Disabling the read-only & write-through cache on the OSS could alleviate the situation - is the read-only & write-through cache disabled on your OSS? (You can verify/change them in /proc/fs/$FSNAME/obdfilter/$OSTNAME/read_cache_enable and /proc/fs/$FSNAME/obdfilter/$OSTNAME/writethrough_cache_enable.)
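
For a quick check, the lctl equivalent should be something like the following (assuming the usual obdfilter proc layout on 1.8; a value of 1 means the cache is enabled):

lctl get_param obdfilter.*.read_cache_enable
lctl get_param obdfilter.*.writethrough_cache_enable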

Thanks.

Comment by adam contois (Inactive) [ 30/May/12 ]

Hello Niu,

I'm checking on what exactly happens on the OSS, so I'll get back to you on that.

read_cache_enable and writethrough_cache_enable both return true.

Comment by adam contois (Inactive) [ 30/May/12 ]

Hi again Niu,

When this has occurred, only Lustre becomes unavailable (filesystem commands hang) and the load on the OSS systems goes up.

We did have one occurrence of the OSS being unreachable, but I'm not sure if it crashed or was under high load and unresponsive - we had to reboot it quickly to get users up and running again.

Thanks.

Comment by Niu Yawei (Inactive) [ 30/May/12 ]

Thanks, Adam. Could you try disabling the read-only cache & write-through cache on the OSS to see if the problem can still be reproduced? Is quota enabled? (I want to know if it's LU-952.)
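
If it helps, a quick way to check from a client is something like the following (assuming the filesystem is mounted at /home on the clients, and that lfs quota only reports real limits when quotas are actually enabled; replace <some_user> with an actual username):

lfs quota -u <some_user> /home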

Comment by adam contois (Inactive) [ 31/May/12 ]

Hi Niu,

We would be happy to disable the read and write-through cache - however, this is a production system with important data, etc. So, we need specific instructions. Do we just echo 0 to those /proc files on the OSSes only? Will this take effect immediately, or does it require a reboot? Furthermore, what are the potential negative ramifications of doing this? And (last question), can we try this on a single OSS, or would they all need to be changed? Thanks!

Comment by Niu Yawei (Inactive) [ 01/Jun/12 ]

You can either disable it via the config log or write the proc files directly; both take effect immediately, and no reboot is required.

Run the following commands on the MGS to disable them permanently for a given OST:

lctl conf_param $OSTNAME.ost.read_cache_enable=0
lctl conf_param $OSTNAME.ost.writethrough_cache_enable=0

You can try this on only the problematic OSS, and there should be no visible impact on user applications.
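
If you want to test it temporarily first, the "write proc file directly" route mentioned above would look roughly like the following (run on the OSS itself; it takes effect immediately but reverts when the OST is remounted - the parameter names are assumed to match the proc entries mentioned earlier):

lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0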

Comment by adam contois (Inactive) [ 18/Jun/12 ]

Hi Niu,

We have disabled the read and writethrough cache. So far, we haven't had further issues. You can close out the ticket, and I'll go ahead and reopen it if necessary. Thanks for your help.

Comment by Niu Yawei (Inactive) [ 19/Jun/12 ]

Thanks, adam.
