Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: lustre 1.8.5
- Severity: 3
- Bugzilla: 24531
- Rank: 5617
Description
NASA Ames is testing a cluster in preparation for upgrades.
"These errors are preventing us from completing a clean migration of the
filesystem, and we're not going to put it into production until they're
eliminated. We're planning to move filesystems on six more DDN 9900s to SFA
hardware and we'd like to complete the process as soon as possible."
Right now they are running a SLES10 SP3 variant plus Lustre 1.8.2 plus patches on the client, and
CentOS 5.5 plus Lustre 1.8.5 plus patches on the server. I'll add the patch list when I get it.
client: kernel 2.6.16.60-0.68.1.20100916-nasa, lustre-1.8.2-4nas_ofed151
server: kernel 2.6.18-194.3.1.el5.20110322lustre185, lustre-1.8.5-4nas_ofed152
The general problem is that they are seeing a lot of client evictions for no obvious reason.
For this bug in particular, here is the customer's description:
"client:
Jun 21 11:04:36 hw-burly1 kernel: LustreError: 11-0:
nbp1-OST004f-osc-ffff810c48fc7000: An error occurred while communicating with
NID 10.151.26.34@o2ib; the ost_statfs operation failed with -107
Jun 21 11:04:36 hw-burly1 kernel: Lustre: nbp1-OST004f-osc-ffff810c48fc7000:
Connection to nbp1-OST004f (at 10.151.26.34@o2ib) was lost; in progress
operations using this service will wait for recovery to complete
Jun 21 11:04:36 hw-burly1 kernel: LustreError: 167-0: This client was evicted
by nbp1-OST004f; in progress operations using this service will fail.
Jun 21 11:04:37 hw-burly1 kernel: LustreError:
27834:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace
nbp1-OST004f-osc-ffff810c48fc7000 resource refcount nonzero (2) after lock
cleanup; forcing cleanup.
Jun 21 11:04:37 hw-burly1 kernel: LustreError:
330:0:(llite_lib.c:1752:ll_setattr_raw()) obd_setattr_async fails: rc=-5
Jun 21 11:04:37 hw-burly1 kernel: LustreError:
330:0:(llite_lib.c:1761:ll_setattr_raw()) extent unlock failed: err=-5
Jun 21 11:04:37 hw-burly1 kernel: LustreError:
27834:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource:
ffff81092f055500 (233731/0/0/0) (rc: 1)
Jun 21 11:04:38 hw-burly1 kernel: LustreError:
27834:0:(import.c:1279:ptlrpc_invalidate_import_thread()) dump the log upon
eviction
Jun 21 11:04:38 hw-burly1 kernel: LustreError: dumping log to
/var/dumps/lustre-log.1308679478.27834
Jun 21 11:04:38 hw-burly1 kernel: Lustre: nbp1-OST004f-osc-ffff810c48fc7000:
Connection restored to nbp1-OST004f (at 10.151.26.34@o2ib)
server:
Jun 21 11:04:36 service178 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 100s: evicting client at 10.151.61.171@o2ib ns: filter-nbp1-OST004f_UUID
lock: ffff810a03cb3a00/0x50a73ed8006fda0b lrc: 3/0,0 mode: PW/PW res: 233731/0
rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote:
0x4cdfae7ceeea95bc expref: 9364 pid: 13239 timeout 5323842234
Jun 21 11:04:36 service178 kernel: LustreError: dumping log to
/var/dumps/lustre-log.1308679476.13105
I have debug logs with 'ioctl neterror net warning other dlmtrace error emerg
ha rpctrace config console' enabled for both the client and server, which I'll
upload once I have a case number. Unfortunately ntp wasn't running on the
server at the time, so the times are a bit off. I have logs from several other
events as well, though I have yet to examine them to see if the pattern is the
same in every case.
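Because the server clock is off, lining the client and server dumps up takes a little work. Here is a minimal sketch of the kind of parser that can interleave them on one timeline; the colon-separated field layout (subsystem mask, debug mask, CPU, seconds.microseconds, a flag, PID, another flag, then the source location) is assumed from the debug-log excerpts quoted below, and the clock skew has to be estimated by hand:
{code}
import re

# Field layout assumed from the debug-log excerpts in this report, e.g.
# 00000100:00000200:21:1308679375.166321:0:9771:0:(client.c:925:ptlrpc_check_reply()) @@@ REPLIED: ...
LINE_RE = re.compile(
    r'^(?P<subsys>[0-9a-f]+):(?P<mask>[0-9a-f]+):(?P<cpu>\d+):'
    r'(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:'
    r'\((?P<src>[^:]+):(?P<line>\d+):(?P<func>[^:()]+)\(\)\)\s*(?P<msg>.*)$'
)

def parse_dump(path, skew=0.0):
    """Yield (timestamp, pid, func, msg) from a lustre-log dump. 'skew' (in
    seconds, estimated by hand since ntp wasn't running on the server) is
    added to every timestamp so the client and server dumps can be merged
    onto a single timeline."""
    with open(path) as f:
        for raw in f:
            m = LINE_RE.match(raw.rstrip())
            if m:
                yield (float(m.group('ts')) + skew, int(m.group('pid')),
                       m.group('func'), m.group('msg'))
{code}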
Here's a quick rundown of the events for this particular eviction:
- The client sends a PW lock enqueue request at 1308679375.111977 in
x1371481946874614, which the server grants.
- At 1308679375.164893 it wants to upgrade the lock to PW, in
x1371481946874670. Since it conflicts with the existing lock, the server sends
back a blocking AST at 1308679375.958555 in x1371171161806020, which the client
responds to.
- The client sends the lock cancellation at 1308679375.165789 in
x1371481946874671. However, there's no record of the server receiving it.
Oddly, the client seems to think it got a response:
00000100:00000200:21:1308679375.166321:0:9771:0:(client.c:925:ptlrpc_check_reply()) @@@ REPLIED: rc = 1 for req@ffff811f4080dc00 x1371481946874671/t0 o103->nbp1-OST004f_UUID@10.151.26.34@o2ib:17/18 lens 296/384 e 0 to 1 dl 1308679413 ref 1 fl Rpc:RN/0/0 rc 0/0
00000100:00100000:21:1308679375.166329:0:9771:0:(client.c:2171:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_00:1ae3c145-b2f5-384f-a14f-5af93cbbb774:9771:x1371481946874671:10.151.26.34@o2ib:103
(or I'm misinterpreting the message).
Since the server never received the lock cancellation for the blocking AST it
issued, it evicts the client.
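The quickest way to double-check that is to pull every line mentioning the cancel's XID out of both dumps and compare the two sides. A minimal sketch, using the XID and the dump file names from the messages above (o103 should be the LDLM cancel opcode):
{code}
def xid_hits(dump_path, xid):
    """Return every debug-log line in 'dump_path' that mentions the given RPC
    XID, so the client and server views of one request can be compared."""
    with open(dump_path) as f:
        return [line.rstrip() for line in f if xid in line]

# The client dump shows the o103 (cancel) request completing; the server dump
# should show it arriving, e.g.:
# client = xid_hits('/var/dumps/lustre-log.1308679478.27834', 'x1371481946874671')
# server = xid_hits('/var/dumps/lustre-log.1308679476.13105', 'x1371481946874671')
{code}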
I checked the ib fabric and HCAs on each end, and didn't find anything
obviously wrong. I didn't catch any network-related errors in the Lustre logs.
I upped at_min from the default to 45, then 150, in case the LNet messages were
taking their time getting to the server. That didn't help.
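For reference, a sketch of how that kind of tuning can be scripted, assuming lctl is in the PATH and that at_min is reachable through lctl set_param (adjust the parameter path if your build exposes it elsewhere, e.g. under /proc/sys/lustre):
{code}
import subprocess

def set_at_min(seconds):
    """Raise the adaptive-timeout floor on this node; typically applied on
    both clients and servers. Assumes 'lctl set_param' can reach at_min."""
    subprocess.run(['lctl', 'set_param', 'at_min=%d' % seconds], check=True)

# the values tried here, in order:
# set_at_min(45)
# set_at_min(150)
{code}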
I upgraded the client to 1.8.5-4nas about an hour ago; we'll see if that helps.
If it doesn't, I'm going to consider downgrading the server to a 1.8.4-based
build, which we've been running on a production filesystem for a while.
Since the filesystem isn't in production, I have quite a bit of flexibility in
terms of being able to apply patches - unlike the rest of our filesystems."
I looked over this as well, and all I've come up with so far is that perhaps attachment 31479 on bug
23352 might be of some use.
I'll attach the aforementioned lustre debug logs.
===see bugzilla 24531 for logs===