[LU-1444] @@@ processing error (-16) Created: 29/May/12 Updated: 19/Jun/12 Resolved: 19/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | adam contois (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 5.8; kernel 2.6.18-238.12.1.el5_lustre.g266a955 |
||
| Severity: | 3 |
| Rank (Obsolete): | 6390 |
| Description |
|
We have a large cluster installation where the /home directories are Lustre mounts. Occasionally (roughly once a week or so), we see the OSS lock up - this usually manifests itself as a hang doing 'df -h' or the like. Before the crash, this is what we see on the OSS:
May 24 21:29:11 oss0 kernel: LustreError: 5460:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)
The -16 error seems to show up every time this happens. The OSS itself is using an Adaptec 6805, and the arrays all show as optimal. The only fix we've found thus far is to reboot the OSS, then mount the OSS volumes (we don't have them set to automount on boot), and then wait about 10 minutes for Lustre to recover. In the past, we've been able to pinpoint the problematic OSS (we have 2) by looking for unusual load via 'uptime', and then confirming -16 in /var/log/messages. Any assistance would be extremely helpful. Please let me know if you need any further information. Thanks. |
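A minimal sketch of that triage, assuming the two OSS nodes are named oss0 and oss1, that they are reachable over ssh, and that syslog goes to /var/log/messages:
# Check load on each OSS; the one with unusually high load is usually the problematic one.
for host in oss0 oss1; do
    echo "== $host =="
    ssh "$host" uptime
    # Confirm the -16 processing error in its syslog.
    ssh "$host" "grep 'processing error (-16)' /var/log/messages | tail -5"
done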
| Comments |
| Comment by Peter Jones [ 30/May/12 ] |
|
Niu, could you please comment on this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 30/May/12 ] |
|
Hi, Adam,
Is the OSS locked up or crashed? Are there any other messages dumped on crash? There are quite a few tickets about slow IO issues related to the OSS read cache - could you check whether read_cache_enable and writethrough_cache_enable are turned on? Thanks. |
| Comment by adam contois (Inactive) [ 30/May/12 ] |
|
Hello Niu, I'm checking on what exactly happens on the OSS, so I'll get back to you on that. read_cache_enable and writethrough_cache_enable both return true. |
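For reference, a quick way to read those settings on an OSS (a sketch assuming the obdfilter parameter names used by the 1.8 OSS read cache):
lctl get_param obdfilter.*.read_cache_enable
lctl get_param obdfilter.*.writethrough_cache_enable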
| Comment by adam contois (Inactive) [ 30/May/12 ] |
|
Hi again Niu, When this has occurred, only Lustre becomes unavailable (filesystem commands hang) and the load on the OSS systems goes up. We did have one occurrence of the OSS being unreachable, but I'm not sure if it crashed or was under high load and unresponsive - we had to reboot it quickly to get users up and running again. Thanks. |
| Comment by Niu Yawei (Inactive) [ 30/May/12 ] |
|
Thanks, Adam. Could you try disabling the read cache & writethrough cache on the OSS to see if the problem can still be reproduced? Is quota enabled? (I want to know if it's a quota-related issue.) |
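A hedged sketch of how quota status could be spot-checked from a client, assuming /home is the Lustre mount point and 'someuser' is a placeholder user name:
lfs quota -u someuser /home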
| Comment by adam contois (Inactive) [ 31/May/12 ] |
|
Hi Niu, We would be happy to disable the read and write cache - however, this is a production system with important data, so we need specific instructions. Do we just echo 0 to those /proc files on the OSSes only? Will this take effect immediately, or does it require a reboot? Furthermore, what are the potential negative ramifications of doing this? And (last question), can we try this on a single OSS, or would they all need to be changed? Thanks! |
| Comment by Niu Yawei (Inactive) [ 01/Jun/12 ] |
|
You can either disable them through the config log or by writing the proc files directly; both take effect immediately, so no reboot is required. Run the following command on the MGS to disable the read cache for a given OST permanently:
lctl conf_param $OSTNAME.ost.read_cache_enable=0
You can try this on only the problematic OSS, and there should be no visible impact on the customer application. |
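A minimal sketch of both approaches, assuming the standard Lustre 1.8 OSS cache tunables; the writethrough line and the temporary set_param variants are extrapolated from the comment above rather than quoted from it:
# Permanent, run on the MGS (repeat per OST; $OSTNAME is the OST name, e.g. lustre-OST0000):
lctl conf_param $OSTNAME.ost.read_cache_enable=0
lctl conf_param $OSTNAME.ost.writethrough_cache_enable=0
# Temporary, run directly on the problematic OSS (takes effect immediately, not persistent across remount):
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0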
| Comment by adam contois (Inactive) [ 18/Jun/12 ] |
|
Hi Niu, We have disabled the read and writethrough cache. So far, we haven't had further issues. You can close out the ticket, and I'll go ahead and reopen it if necessary. Thanks for your help. |
| Comment by Niu Yawei (Inactive) [ 19/Jun/12 ] |
|
Thanks, adam. |