Details
Type: Bug
Resolution: Won't Fix
Priority: Blocker
Environment:
StorP Storage Cluster: Dell R710 servers (20 OSS, 2 MDS), IB direct connected DDN99k storage on OSSes, FC direct attached DDN EF3000 storage on MDS, 24GB per server, dual socket 8 core Nehalem. StorP is dual-homed for Lustre clients with DDR IB and 10 Gig Ethernet via Chelsio T3 adapters. StorP is configured for failover MDS and OSS pairs with multipath.
StorP is running TOSS 1.4-2 (chaos 4.4-2) which includes:
lustre-1.8.5.0-3chaos_2.6.18_105chaos.ch4.4
lustre-modules-1.8.5.0-3chaos_2.6.18_105chaos.ch4.4
chaos-kernel-2.6.18-105chaos
Multiple compute clusters interconnect to StorP via a set of IB(client)-to-IB(server) lnet routers and a set of IB(client)-to-10gig(server) lnet routers. The IB-to-IB lnet routers deal with <300 Lustre client nodes. The IB-to-10gig routers deal with ~2700 Lustre client nodes.
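For anyone looking at the routed setup above, a minimal sketch of the kind of LNET state snapshot that could be grabbed on a server or one of the lnet routers while the problem is happening; the /proc/sys/lnet paths are the 1.8-era interface, and the helper itself (names and all) is an illustrative assumption rather than a script we actually run:

#!/usr/bin/env python
# Sketch: snapshot LNET routing/peer state on a Lustre server or lnet router.
# Assumes the Lustre 1.8-era /proc interface; paths and formats may differ.

LNET_PROC_FILES = [
    "/proc/sys/lnet/nis",     # local network interfaces (NIDs) and credits
    "/proc/sys/lnet/routes",  # configured routes and their up/down state
    "/proc/sys/lnet/peers",   # per-peer credits and queue depths
]

def dump(path):
    try:
        with open(path) as f:
            print("==== %s ====" % path)
            print(f.read())
    except IOError as err:
        print("could not read %s: %s" % (path, err))

if __name__ == "__main__":
    for path in LNET_PROC_FILES:
        dump(path)

Run on a router, this would at least show whether the routes toward the o2ib and tcp networks are still marked up while the MDS is thrashing.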
Description
We are experiencing major MDS problems that are severely affecting the stability of our Lustre filesystem. We have made no changes to the fundamental configuration or setup of the storage cluster that we can point to as a cause.
The general symptoms are that the load on the active MDS node is unusually high and filesystem access hangs intermittently. When logged into the active MDS node, we noticed that the command line also hangs intermittently. The ptlrpcd process was pegged at 100%+ CPU usage, followed by ~50% CPU usage for the kiblnd_sd_* processes. Furthermore, iowait time is less than 1% while system time ranges from 25% to 80%. It appears that the active MDS is spinning as fast as it can handling some kind of RPC traffic coming in over the IB LND, but so far we have not been able to isolate the traffic involved. In one isolation step we took all the lnet routers feeding in from the compute clusters offline, and the MDS was still churning vigorously in the ptlrpcd and kiblnd processes.
Another symptom we are seeing now is that when an MDS node becomes active and starts trying to serve clients, we can watch the node rather quickly consume all available memory via Slab allocations and then die an OOM death (we sketch a small sampler for these CPU and Slab symptoms after the list below).
Some other observations:
- OSTs evicting the mdtlov over the IB path
- fun 'sluggish network' log messages like: ===> Sep 7 12:39:21 amds1 LustreError: 13000:0:(import.c:357:ptlrpc_invalidate_import()) scratch2-OST0053_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out. <===
- MGS evicting itself over localhost connection: ==> Sep 7 12:39:14 amds1 Lustre: MGS: client 8337bacb-6b62-b0f0-261d-53678e2e56a9 (at 0@lo) evicted, unresponsive for 227s <==
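As a rough illustration of how the CPU-spinning and Slab-growth symptoms could be captured over time, a sketch put together for this report; the thread-name patterns, 5-second interval, and output format are assumptions:

#!/usr/bin/env python
# Sketch: periodically sample CPU time of the ptlrpcd/kiblnd_sd_* kernel
# threads and the total kernel Slab size, to track the symptoms described
# above. Thread-name patterns and the 5-second interval are assumptions.
import os, re, time

THREAD_RE = re.compile(r"^(ptlrpcd|kiblnd_sd_)")

def thread_cpu_ticks():
    """Return {pid: (comm, utime+stime)} for matching kernel threads."""
    ticks = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                stat = f.read()
        except IOError:
            continue
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        if not THREAD_RE.match(comm):
            continue
        fields = stat[stat.rindex(")") + 2:].split()
        utime, stime = int(fields[11]), int(fields[12])  # stat fields 14 and 15
        ticks[pid] = (comm, utime + stime)
    return ticks

def slab_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Slab:"):
                return int(line.split()[1])
    return -1

if __name__ == "__main__":
    prev = thread_cpu_ticks()
    while True:
        time.sleep(5)
        cur = thread_cpu_ticks()
        print("%s  Slab: %d kB" % (time.ctime(), slab_kb()))
        for pid, (comm, total) in sorted(cur.items()):
            delta = total - prev.get(pid, (comm, total))[1]
            print("  %-16s pid %-6s +%d ticks" % (comm, pid, delta))
        prev = cur

A steadily climbing Slab figure together with large per-interval tick deltas on ptlrpcd would match what we observe right before the OOM.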
At this point we have been through 3 or more MDS failover sequences, and we have also rebooted all the StorP Lustre servers and restarted the filesystem cleanly to see if that would clear things up.
We have syslog and Lustre debug message logs from various phases of debugging this. I'm not sure at this point what logs will be the most useful, but after I submit this issue I'll attach some files.
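In case it is useful, a rough sketch of how the material could be gathered for attachment; the output paths, syslog location, and keyword list are assumptions layered on top of the standard lctl dk debug-buffer dump:

#!/usr/bin/env python
# Sketch: collect a Lustre kernel debug dump via "lctl dk" and pull the
# eviction / "Network is sluggish" lines out of syslog for attachment.
# Output paths, syslog location, and keywords are assumptions.
import subprocess, time

stamp = time.strftime("%Y%m%d-%H%M%S")
debug_dump = "/tmp/lustre-debug-%s.log" % stamp
syslog_extract = "/tmp/lustre-syslog-%s.log" % stamp

# Dump the Lustre kernel debug buffer to a file.
subprocess.call(["lctl", "dk", debug_dump])

# Extract syslog lines matching the symptoms noted above.
keywords = ("evicted", "Network is sluggish", "ptlrpc_invalidate_import")
with open("/var/log/messages") as src, open(syslog_extract, "w") as dst:
    for line in src:
        if any(k in line for k in keywords):
            dst.write(line)

print("wrote %s and %s" % (debug_dump, syslog_extract))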
Attachments
Issue Links
Trackbacks
- Lustre 1.8.x known issues tracker: "While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA"