Details
- Bug
- Resolution: Cannot Reproduce
- Critical
- None
- Lustre 1.8.6
- None
- CentOS 5.8; kernel 2.6.18-238.12.1.el5_lustre.g266a955
- 3
- 6390
Description
We have a large cluster installation where the /home directories are Lustre mounts. Occasionally (roughly once a week), we see the OSS lock up; this usually manifests as a hang when running 'df -h' or a similar command.
Before the lockup, this is what we see in the logs on the OSS:
May 24 21:29:11 oss0 kernel: LustreError: 5460:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (16) req@ffff810311398c00 x1396618176669269/t0 o8->2bf3e0b9-b782-cd24-0e06-8141a03901ff@NET_0x500000a370002_UUID:0/0 lens 368/264 e 0 to 0 dl 1337909450 ref 1 fl Interpret:/0/0 rc -16/0
May 24 21:29:11 oss0 kernel: Lustre: 5475:0:(service.c:1434:ptlrpc_server_handle_request()) @@@ Request x1396618176668701 took longer than estimated (756+139s); client may timeout. req@ffff810322b3a000 x1396618176668701/t0 o101->2bf3e0b9-b782-cd24-0e06-8141a03901ff@NET_0x500000a370002_UUID:0/0 lens 296/352 e 1 to 0 dl 1337909211 ref 1 fl Complete:/0/0 rc 0/0
May 24 21:29:11 oss0 kernel: Lustre: lustre-OST0004: slow parent lock 892s due to heavy IO load
May 24 21:29:11 oss0 kernel: Lustre: Skipped 2 previous similar messages
May 24 21:29:11 oss0 kernel: Lustre: Service thread pid 5475 completed after 895.08s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
May 24 21:29:11 oss0 kernel: Lustre: lustre-OST0004: slow preprw_write setup 892s due to heavy IO load
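The "slow ... due to heavy IO load" messages above carry the affected target and the stall duration, so they can be pulled out of a log mechanically. The following is a minimal sketch in Python, assuming the message layout is exactly as in the excerpt; the regular expression and the function name `slow_operations` are illustrative assumptions, not part of Lustre.

```python
import re

# Assumed layout, taken from the excerpt above:
#   ... kernel: Lustre: <target>: slow <operation> <N>s due to heavy IO load
SLOW_OP = re.compile(
    r"Lustre: (?P<target>\S+): slow (?P<op>.+?) (?P<secs>\d+)s due to heavy IO load"
)

def slow_operations(lines, threshold_s=0):
    """Return (target, operation, seconds) for each slow-I/O message
    whose duration is at least threshold_s seconds."""
    found = []
    for line in lines:
        m = SLOW_OP.search(line)
        if m and int(m.group("secs")) >= threshold_s:
            found.append((m.group("target"), m.group("op"), int(m.group("secs"))))
    return found
```

Passing an open log file (any iterable of lines) with, say, `threshold_s=60` would list only the stalls of a minute or longer.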
The -16 error (EBUSY) seems to show up every time this happens. The OSS uses an Adaptec 6805 RAID controller, and the arrays all report as optimal.
The only fix we've found so far is to reboot the OSS, mount the OSS volumes by hand (they are not set to automount on boot), and then wait about 10 minutes for Lustre to recover. In the past, we've been able to pinpoint the problematic OSS (we have two) by looking for unusually high load via 'uptime', and then confirming the -16 errors in /var/log/messages.
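The check for -16 in /var/log/messages could be scripted along these lines. This is a rough sketch, assuming the syslog layout shown in the excerpt above; the pattern and the function name `count_ebusy_by_host` are hypothetical and would need adjusting for other log formats.

```python
import re
from collections import Counter

# Assumed syslog layout, based on the excerpt in this report:
#   May 24 21:29:11 oss0 kernel: LustreError: ... rc -16/0
EBUSY_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+LustreError:.*\brc -16/"
)

def count_ebusy_by_host(lines):
    """Count LustreError lines that report rc -16, grouped by hostname."""
    counts = Counter()
    for line in lines:
        m = EBUSY_LINE.match(line)
        if m:
            counts[m.group("host")] += 1
    return counts
```

For example, `count_ebusy_by_host(open("/var/log/messages", errors="replace"))` on each OSS would give a per-host count to compare against the load seen in 'uptime'.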
Any assistance would be extremely helpful. Please let me know if you need any further information. Thanks.
Issue Links
- Trackbacks
- Lustre 1.8.x known issues tracker
While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA