Description
Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64
Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64
On our IBM BGQ system Vulcan at LLNL, the IONs (I/O nodes) have been experiencing what we believe are repeated OST connection issues affecting user jobs. Recently, two IONs reporting these issues were identified. The rack has been drained and the IONs left as-is. The command "lfs check servers" reports the following errors:
vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable
vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable
Output from the proc "import" file for the affected OST on each ION:
{noformat}
vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import
import:
    name: fsv-OST0017-osc-c0000003e09f49c0
    target: fsv-OST0017_UUID
    state: REPLAY
    connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
    connect_data:
        flags: 0x4af0e3440478
        instance: 45
        target_version: 2.5.5.0
        initial_grant: 2097152
        max_brw_size: 4194304
        grant_block_size: 0
        grant_inode_size: 0
        grant_extent_overhead: 0
        cksum_types: 0x2
        max_easize: 32768
        max_object_bytes: 9223372036854775807
    import_flags: [ replayable, pingable, connect_tried ]
    connection:
        failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
        current_connection: 172.20.20.23@o2ib500
        connection_attempts: 39
        generation: 1
        in-progress_invalidations: 0
    rpcs:
        inflight: 168
        unregistering: 1
        timeouts: 20977
        avg_waittime: 209959 usec
    service_estimates:
        services: 48 sec
        network: 45 sec
    transactions:
        last_replay: 0
        peer_committed: 150323856033
        last_checked: 150323856033
    read_data_averages:
        bytes_per_rpc: 69000
        usec_per_rpc: 4389
        MB_per_sec: 15.72
    write_data_averages:
        bytes_per_rpc: 893643
        usec_per_rpc: 2458
        MB_per_sec: 363.56
vulcanio121-ib0@root:

vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import
import:
    name: fsv-OST0017-osc-c0000003c4483300
    target: fsv-OST0017_UUID
    state: REPLAY
    connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
    connect_data:
        flags: 0x4af0e3440478
        instance: 45
        target_version: 2.5.5.0
        initial_grant: 2097152
        max_brw_size: 4194304
        grant_block_size: 0
        grant_inode_size: 0
        grant_extent_overhead: 0
        cksum_types: 0x2
        max_easize: 32768
        max_object_bytes: 9223372036854775807
    import_flags: [ replayable, pingable, connect_tried ]
    connection:
        failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
        current_connection: 172.20.20.23@o2ib500
        connection_attempts: 36
        generation: 1
        in-progress_invalidations: 0
    rpcs:
        inflight: 131
        unregistering: 1
        timeouts: 19341
        avg_waittime: 144395 usec
    service_estimates:
        services: 45 sec
        network: 50 sec
    transactions:
        last_replay: 0
        peer_committed: 150323856116
        last_checked: 150323856116
    read_data_averages:
        bytes_per_rpc: 67548
        usec_per_rpc: 3326
        MB_per_sec: 20.30
    write_data_averages:
        bytes_per_rpc: 913996
        usec_per_rpc: 5909
        MB_per_sec: 154.67
vulcanio127-ib0@root:
{noformat}
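For reference, a minimal sketch of how an ION can be scanned for imports stuck in a non-FULL state, using only the same /proc "import" files shown above (anything beyond those paths is an assumption):

{noformat}
#!/bin/bash
# Minimal sketch: on an ION (e.g. vulcanio121), report any OSC import that is
# not in the FULL state, such as the two fsv-OST0017 imports stuck in REPLAY above.
for f in /proc/fs/lustre/osc/*/import; do
    [ -e "$f" ] || continue
    name=$(awk '$1 == "name:"  { print $2; exit }' "$f")
    state=$(awk '$1 == "state:" { print $2; exit }' "$f")
    if [ "$state" != "FULL" ]; then
        echo "$(hostname): $name state=$state"
    fi
done
{noformat}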
The disconnects appear to have happened when we updated our Lustre server cluster. All other IONs reconnected with no problem once the Lustre cluster was back up and running.
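Since the other IONs recovered on their own, a rough way to watch whether one of the stuck imports ever completes its reconnect is sketched below, again reading only the /proc files quoted above. The lctl recover step in the final comment is an assumption about what this client version supports; rebooting the affected ION, as suggested in the email discussion preserved in the history below, remains the fallback.

{noformat}
#!/bin/bash
# Minimal sketch: poll the stuck fsv-OST0017 import until it reaches FULL
# (reconnect complete) or give up after ~30 minutes.
osc=$(ls -d /proc/fs/lustre/osc/fsv-OST0017-osc-* 2>/dev/null | head -n 1)
[ -n "$osc" ] || { echo "no fsv-OST0017 osc device found"; exit 2; }
for _ in $(seq 1 60); do
    state=$(awk '$1 == "state:" { print $2; exit }' "$osc/import")
    echo "$(date +%T) ${osc##*/} state=$state"
    [ "$state" = "FULL" ] && exit 0
    sleep 30
done
# Still not FULL: manual intervention is needed, for example
#   lctl --device <osc device number from "lctl dl"> recover
# (hedged: if device-level recover is supported on this client version),
# or rebooting the affected ION.
exit 1
{noformat}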
Attachments
Activity
Resolution | | New: Not a Bug [ 6 ] |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
End date | | New: 09/May/16 |
Start date | | New: 27/Apr/16 |
Epic/Theme | Original: LLNL-1 | |
Assignee | Original: WC Triage [ wc-triage ] | New: Joseph Gmitter [ jgmitter ] |
Description |
Original:
[Identical to the current Description text above, followed by:]
Have appended the following email discussion on the matter:

Subject: RE: Incident #: 1057380 Suspect Vulcan IO-node to Lustre connections
To: LLNL Bgl-Admin <bgl-admin@lists.llnl.gov>
CC: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P." <legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, LC Hotline <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov>
From: Roy Musselman <roymuss@us.ibm.com>
Date: Mon, 25 Apr 2016 10:49:44 -0700

The ownership of this incident should probably shift to Lustre as per the evidence gathered this past weekend. (I trimmed out a lot of the old message history from this dialog)

Examining the IO-node log R17-ID-J00.log revealed the following Lustre-related messages that occurred on 4/14, the same day that the user first started seeing symptoms, and coincidentally the same day the efix 12 & 13 (glibc and MPI libraries) updates were made to Vulcan. Many Lustre connections were lost and restored. However fsv-OST0017, which was identified in the "lfs check servers" messages from R17-ID-J00/J06, was not restored.

{noformat}
2016-04-14 10:18:18.376415 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre: fsv-OST0010-osc-c0000003e09f49c0: Connection to fsv-OST0010 (at 172.20.20.17@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2016-04-14 10:18:18.377098 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre: Skipped 1 previous similar message
2016-04-14 10:19:07.295373 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2016-04-14 10:19:07.296193 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre: Skipped 2 previous similar messages
2016-04-14 10:19:56.415101 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre: fsv-OST001c-osc-c0000003e09f49c0: Connection to fsv-OST001c (at 172.20.20.29@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2016-04-14 10:19:56.415554 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre: Skipped 5 previous similar messages
2016-04-14 10:33:50.702803 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:33:56.909074 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:33:56.909684 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 2852 previous similar messages
2016-04-14 10:34:07.495345 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:34:07.495964 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 3807 previous similar messages
2016-04-14 10:34:09.694464 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0015-osc-c0000003e09f49c0: Connection restored to fsv-OST0015 (at 172.20.20.22@o2ib500)
2016-04-14 10:34:10.702159 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:34:10.702500 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 21982 previous similar messages
2016-04-14 10:34:34.440285 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0012-osc-c0000003e09f49c0: Connection restored to fsv-OST0012 (at 172.20.20.19@o2ib500)
2016-04-14 10:34:34.440649 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 1 previous similar message
2016-04-14 10:34:38.442605 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0011-osc-c0000003e09f49c0: Connection restored to fsv-OST0011 (at 172.20.20.18@o2ib500)
2016-04-14 10:34:40.651655 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0014-osc-c0000003e09f49c0: Connection restored to fsv-OST0014 (at 172.20.20.21@o2ib500)
2016-04-14 10:34:40.652034 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 1 previous similar message
2016-04-14 10:34:52.237143 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0019-osc-c0000003e09f49c0: Connection restored to fsv-OST0019 (at 172.20.20.26@o2ib500)
2016-04-14 10:35:02.029847 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST001c-osc-c0000003e09f49c0: Connection restored to fsv-OST001c (at 172.20.20.29@o2ib500)
2016-04-14 10:35:02.030420 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 3 previous similar messages
2016-04-14 10:35:57.721359 {RMP14Ap004250453} [mmcs]{128}.11.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:35:57.721779 {RMP14Ap004250453} [mmcs]{128}.11.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 388 previous similar messages
2016-04-14 10:38:53.015474 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:38:53.015922 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 6739 previous similar messages
2016-04-14 10:46:23.018342 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre: fsv-OST001f-osc-c0000003e09f49c0: Connection restored to fsv-OST001f (at 172.20.20.32@o2ib500)
2016-04-14 10:46:23.019057 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre: Skipped 2 previous similar messages
2016-04-14 10:48:21.567554 {RMP14Ap004250453} [mmcs]{128}.13.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.24@o2ib500)
2016-04-14 10:54:06.288585 {RMP14Ap004250453} [mmcs]{128}.15.0: LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with 172.20.20.24@o2ib500, operation obd_ping failed with -19.
2016-04-14 10:54:06.288965 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2016-04-14 10:54:06.289253 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre: Skipped 5 previous similar messages
2016-04-14 11:01:36.246847 {RMP14Ap004250453} [mmcs]{128}.15.1: Lustre: fsv-OST0017-osc-c0000003e09f49c0: Connection to fsv-OST0017 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2016-04-14 11:08:55.698603 {RMP14Ap004250453} [mmcs]{128}.15.3: Lustre: fsv-MDT0000-mdc-c0000003e09f49c0: Connection to fsv-MDT0000 (at 172.20.20.201@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2016-04-14 11:08:55.758601 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre: fsv-MDT0000-mdc-c0000003e09f49c0: Connection restored to fsv-MDT0000 (at 172.20.20.201@o2ib500)
2016-04-14 11:08:55.860691 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre: Skipped 1 previous similar message
2016-04-14 11:10:05.732381 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 11:10:05.760709 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 13819 previous similar messages
2016-04-14 11:11:15.470182 {RMP14Ap004250453} [mmcs]{128}.9.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.23@o2ib500)
2016-04-14 11:11:40.369420 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 11:11:40.369775 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 6981 previous similar messages
2016-04-14 11:19:26.617798 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.23@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2016-04-14 12:15:45.337488 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.24@o2ib500)
2016-04-14 13:08:53.563675 {RMP14Ap004250453} [mmcs]{128}.7.0: LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with 172.20.20.24@o2ib500, operation obd_ping failed with -19.
2016-04-14 13:08:53.564909 {RMP14Ap004250453} [mmcs]{128}.7.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2016-04-14 13:42:56.966595 {RMP14Ap004250453} [mmcs]{128}.7.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.23@o2ib500)
{noformat}

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Roy Musselman
IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab

From: Roy Musselman/Rochester/Contr/IBM
To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov>
Cc: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P." <legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "LC Hotline" <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov>
Date: 04/24/2016 08:05 PM
Subject: Re: Suspect Vulcan IO-node to Lustre connections for incident 1057380

The user reports that the test that consistently fails on R16-R17 when running on Lustre, runs without error when using NFS. So, I've deleted the reservation on R16-R17 and asked that the user rerun the other failing scenarios ( 8K nodes and the all-ranks-to-stdout test ) from NFS. I've also ask that the user keep his Lustre directories in tact so that the failing test can be rerun to verify that the connections to Lustre have been resolved.

Can you consider periodically (daily?, hourly?) monitoring the Lustre connections using the "lfs check servers" so that failures like this can be avoided in the future? Is it possible to generate an alert when ever a connection goes bad?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Roy Musselman
IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab
email: roymuss@us.ibm.com
LLNL office: 925-422-6033
Cell: 507-358-8895, Home: 507-281-9565

From: Roy Musselman/Rochester/Contr/IBM
To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov>
Cc: "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Ahn, Dong H." <ahn1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Legendre, Matthew P." <legendre1@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>
Date: 04/23/2016 08:50 PM
Subject: Suspect Vulcan IO-node to Lustre connections for incident 1057380

Greetings, Sending this to LLNL staff only.

Friday evening I reported in incident 1057380 that the hang on R16/R17 still occured after some of the service actions were done. STAT could still not attach to some of the ranks, and the coreprocessor stack trace tool could not connect to 172.20.17.121 which is IO node R17-ID-J00. Based on these additional symptoms, I executed "lfs check servers" on all the Vulcan IO nodes, which was something Richard had suggested to me last month. The exception results are below. Note the errors on R17-ID-J00 and J06. Also R15 and R20 are nodes that were used in 8K node jobs that also had hangs. I really hope this is the root cause of the problem. I believe that the fix may be to reboot the IO node. I'm not sure when R17-ID-J00 was last booted but it could have been after the diagnostic run on Thursday evening or Friday morning.

Was there anything done to the system during the update on 4/14, other than the efixes which just updated the glibc and MPI libraries, which would have affected the IO nodes?

In the mean time, I've suggested to the user to try running their test from NFS using the existing reservation on R16/R17, since they've assured me that there application does not do parallel IO. I'm not sure if they will receive my email this weekend to be able to attempt the NFS jobs before Monday.

Results from "lsf check servers"
{noformat}
172.20.17.23 vulcanio23 R02-ID-J06 vulcanio23-ib0 pdsh@vulcanlac4: 172.20.17.23: mcmd: xpoll (setting up stderr): Interrupted system call
172.20.17.51 vulcanio51 R06-ID-J02 vulcanio51-ib0 pdsh@vulcanlac4: 172.20.17.51: mcmd: xpoll (setting up stderr): Interrupted system call
172.20.17.109 vulcanio109 R15-ID-J04 vulcanio109-ib0 pdsh@vulcanlac4: 172.20.17.109: mcmd: xpoll (setting up stderr): Interrupted system call
172.20.17.121 vulcanio121 R17-ID-J00 vulcanio121-ib0 172.20.17.121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable
172.20.17.127 vulcanio127 R17-ID-J06 vulcanio127-ib0 172.20.17.127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable
172.20.17.133 vulcanio133 R20-ID-J04 vulcanio133-ib0 pdsh@vulcanlac4: 172.20.17.133: mcmd: xpoll (setting up stderr): Interrupted system call
172.20.17.137 vulcanio137 R21-ID-J00 vulcanio137-ib0 pdsh@vulcanlac4: 172.20.17.137: mcmd: xpoll (setting up stderr): Interrupted system call
172.20.17.154 vulcanio154 R23-ID-J01 vulcanio154-ib0 pdsh@vulcanlac4: 172.20.17.154: mcmd: xpoll (setting up stderr): Interrupted system call
172.20.17.156 vulcanio156 R23-ID-J03 vulcanio156-ib0 pdsh@vulcanlac4: 172.20.17.156: mcmd: xpoll (setting up stderr): Interrupted system call
172.20.17.171 vulcanio171 R25-ID-J02 vulcanio171-ib0 pdsh@vulcanlac4: 172.20.17.171: mcmd: xpoll (setting up stderr): Interrupted system call
{noformat}
|
New:
[Identical to the current Description text above; the appended email discussion was removed in this edit.] |
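On the periodic monitoring and alerting Roy asks about in the email above, a minimal cron-style sketch follows. It reuses pdsh and "lfs check servers" from the thread; the host-list path and the notification address are placeholders, not real site values.

{noformat}
#!/bin/bash
# Minimal sketch: run "lfs check servers" on every ION via pdsh and mail any
# "check error" lines, so stuck connections are caught before user jobs hang.
ION_LIST=/path/to/ion-hosts          # hypothetical host list, one ION hostname per line
ERRORS=$(pdsh -w "^${ION_LIST}" lfs check servers 2>&1 | grep "check error")
if [ -n "$ERRORS" ]; then
    # Replace "mail" and the address with whatever alerting mechanism the site prefers.
    echo "$ERRORS" | mail -s "Vulcan ION Lustre 'lfs check servers' errors" lustre-admin@example.com
fi
{noformat}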
Description |
Original:
Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64 Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64 On our IBM BGQ system Vulcan at LLNL, the ion's have been experiencing what is believed to be repeated ost connection issues affecting user jobs. Recently two ions have reported issues have been identified. The rack has been drained and the ions left as is. The command "lfs check servers" reports the following errors: vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable Output from proc "import" file for affect ost: {noformat} vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import import: name: fsv-OST0017-osc-c0000003e09f49c0 target: fsv-OST0017_UUID state: REPLAY connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ] connect_data: flags: 0x4af0e3440478 instance: 45 target_version: 2.5.5.0 initial_grant: 2097152 max_brw_size: 4194304 grant_block_size: 0 grant_inode_size: 0 grant_extent_overhead: 0 cksum_types: 0x2 max_easize: 32768 max_object_bytes: 9223372036854775807 import_flags: [ replayable, pingable, connect_tried ] connection: failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ] current_connection: 172.20.20.23@o2ib500 connection_attempts: 39 generation: 1 in-progress_invalidations: 0 rpcs: inflight: 168 unregistering: 1 timeouts: 20977 avg_waittime: 209959 usec service_estimates: services: 48 sec network: 45 sec transactions: last_replay: 0 peer_committed: 150323856033 last_checked: 150323856033 read_data_averages: bytes_per_rpc: 69000 usec_per_rpc: 4389 MB_per_sec: 15.72 write_data_averages: bytes_per_rpc: 893643 usec_per_rpc: 2458 MB_per_sec: 363.56 vulcanio121-ib0@root: vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import import: name: fsv-OST0017-osc-c0000003c4483300 target: fsv-OST0017_UUID state: REPLAY connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ] connect_data: flags: 0x4af0e3440478 instance: 45 target_version: 2.5.5.0 initial_grant: 2097152 max_brw_size: 4194304 grant_block_size: 0 grant_inode_size: 0 grant_extent_overhead: 0 cksum_types: 0x2 max_easize: 32768 max_object_bytes: 9223372036854775807 import_flags: [ replayable, pingable, connect_tried ] connection: failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ] current_connection: 172.20.20.23@o2ib500 connection_attempts: 36 generation: 1 in-progress_invalidations: 0 rpcs: inflight: 131 unregistering: 1 timeouts: 19341 avg_waittime: 144395 usec service_estimates: services: 45 sec network: 50 sec transactions: last_replay: 0 peer_committed: 150323856116 last_checked: 150323856116 read_data_averages: bytes_per_rpc: 67548 usec_per_rpc: 3326 MB_per_sec: 20.30 write_data_averages: bytes_per_rpc: 913996 usec_per_rpc: 5909 MB_per_sec: 154.67 vulcanio127-ib0@root: {noformat} The disconnects appear to have happen when we updated our lustre cluster. 
All other ion's reconnect with not problem once the lustre cluster was back up and running. Have appended the following email discussion on the matter: Subject: RE: Incident #: 1057380 Suspect Vulcan IO-node to Lustre connections To: LLNL Bgl-Admin <bgl-admin@lists.llnl.gov> CC: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P." <legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, LC Hotline <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov> From: Roy Musselman <roymuss@us.ibm.com> Date: Mon, 25 Apr 2016 10:49:44 -0700 The ownership of this incident should probably shift to Lustre as per the evidence gathered this past weekend. (I trimmed out a lot of the old message history from this dialog) Examining the IO-node log R17-ID-J00.log revealed the following Lustre-related messages that occurred on 4/14, the same day that the user first started seeing symptoms, and coincidentally the same day the efix 12 & 13 ( glibc and MPI libraries) updates were made to Vulcan. Many Lustre connections were lost and restored. However fsv-OST0017, which was identified in the "lfs check servers" messages from R17-ID-J00/J06, was not restored. 2016-04-14 10:18:18.376415 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre: fsv-OST0010-osc-c0000003e09f49c0: Connection to fsv-OST0010 (at 172.20.20.17@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 10:18:18.377098 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre: Skipped 1 previous similar message 2016-04-14 10:19:07.295373 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 10:19:07.296193 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre: Skipped 2 previous similar messages 2016-04-14 10:19:56.415101 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre: fsv-OST001c-osc-c0000003e09f49c0: Connection to fsv-OST001c (at 172.20.20.29@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 10:19:56.415554 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre: Skipped 5 previous similar messages 2016-04-14 10:33:50.702803 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:33:56.909074 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:33:56.909684 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 2852 previous sim= ilar messages 2016-04-14 10:34:07.495345 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:34:07.495964 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 3807 previous sim= ilar messages 2016-04-14 10:34:09.694464 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0015-osc-c0000003e09f49c0: Connection restored to fsv-OST0015 (at 172.20.20.22@o2ib500) 2016-04-14 10:34:10.702159 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:34:10.702500 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError: 
4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 21982 previous similar messages 2016-04-14 10:34:34.440285 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0012-osc-c0000003e09f49c0: Connection restored to fsv-OST0012 (at 172.20.20.19@o2ib500) 2016-04-14 10:34:34.440649 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 1 previous similar message 2016-04-14 10:34:38.442605 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0011-osc-c0000003e09f49c0: Connection restored to fsv-OST0011 (at 172.20.20.18@o2ib500) 2016-04-14 10:34:40.651655 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0014-osc-c0000003e09f49c0: Connection restored to fsv-OST0014 (at 172.20.20.21@o2ib500) 2016-04-14 10:34:40.652034 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 1 previous similar message 2016-04-14 10:34:52.237143 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0019-osc-c0000003e09f49c0: Connection restored to fsv-OST0019 (at 172.20.20.26@o2ib500) 2016-04-14 10:35:02.029847 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST001c-osc-c0000003e09f49c0: Connection restored to fsv-OST001c (at 172.20.20.29@o2ib500) 2016-04-14 10:35:02.030420 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 3 previous similar messages 2016-04-14 10:35:57.721359 {RMP14Ap004250453} [mmcs]{128}.11.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:35:57.721779 {RMP14Ap004250453} [mmcs]{128}.11.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 388 previous similar messages 2016-04-14 10:38:53.015474 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:38:53.015922 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 6739 previous sim= ilar messages 2016-04-14 10:46:23.018342 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre: fsv-OST001f-osc-c0000003e09f49c0: Connection restored to fsv-OST001f (at 172.20.20.32@o2ib500) 2016-04-14 10:46:23.019057 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre: Skipped 2 previous similar messages 2016-04-14 10:48:21.567554 {RMP14Ap004250453} [mmcs]{128}.13.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.24@o2ib500) 2016-04-14 10:54:06.288585 {RMP14Ap004250453} [mmcs]{128}.15.0: LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with 172.20.20.24@o2ib500, operation obd=5Fping failed with -19. 
2016-04-14 10:54:06.288965 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 10:54:06.289253 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre: Skipped 5 previous similar messages 2016-04-14 11:01:36.246847 {RMP14Ap004250453} [mmcs]{128}.15.1: Lustre: fsv-OST0017-osc-c0000003e09f49c0: Connection to fsv-OST0017 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 11:08:55.698603 {RMP14Ap004250453} [mmcs]{128}.15.3: Lustre: fsv-MDT0000-mdc-c0000003e09f49c0: Connection to fsv-MDT0000 (at 172.20.20.201@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 11:08:55.758601 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre: fsv-MDT0000-mdc-c0000003e09f49c0: Connection restored to fsv-MDT0000 (at 172.20.20.201@o2ib500) 2016-04-14 11:08:55.860691 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre: Skipped 1 previous similar message 2016-04-14 11:10:05.732381 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 11:10:05.760709 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 13819 previous similar messages 2016-04-14 11:11:15.470182 {RMP14Ap004250453} [mmcs]{128}.9.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.23@o2ib500) 2016-04-14 11:11:40.369420 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 11:11:40.369775 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 6981 previous sim= ilar messages 2016-04-14 11:19:26.617798 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.23@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 12:15:45.337488 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.24@o2ib500) 2016-04-14 13:08:53.563675 {RMP14Ap004250453} [mmcs]{128}.7.0: LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with 172.20.20.24@o2ib500, operation obd=5Fping failed with -19. 2016-04-14 13:08:53.564909 {RMP14Ap004250453} [mmcs]{128}.7.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 13:42:56.966595 {RMP14Ap004250453} [mmcs]{128}.7.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.23@o2ib500) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Roy Musselman IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab email: roymuss@us.ibm.com LLNL office: 925-422-6033 Cell: 507-358-8895, Home: 507-281-9565 From: Roy Musselman/Rochester/Contr/IBM To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov> Cc: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P." 
<legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "LC Hotline" <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov> Date: 04/24/2016 08:05 PM Subject: Re: Suspect Vulcan IO-node to Lustre connections for incident 1057380 The user reports that the test that consistently fails on R16-R17 when running on Lustre, runs without error when using NFS. So, I've deleted the reservation on R16-R17 and asked that the user rerun the other failing scenarios ( 8K nodes and the all-ranks-to-stdout test ) from NFS. I've also ask that the user keep his Lustre directories in tact so that the failing test can be rerun to verify that the connections to Lustre have been resolved. Can you consider periodically (daily?, hourly?) monitoring the Lustre connections using the "lfs check servers" so that failures like this can be avoided in the future? Is it possible to generate an alert when ever a connection goes bad? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Roy Musselman IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab email: roymuss@us.ibm.com LLNL office: 925-422-6033 Cell: 507-358-8895, Home: 507-281-9565 From: Roy Musselman/Rochester/Contr/IBM To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov> Cc: "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Ahn, Dong H." <ahn1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Legendre, Matthew P." <legendre1@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov> Date: 04/23/2016 08:50 PM Subject: Suspect Vulcan IO-node to Lustre connections for incident 1057380 Greetings, Sending this to LLNL staff only. Friday evening I reported in incident 1057380 that the hang on R16/R17 still occured after some of the service actions were done. STAT could still not attach to some of the ranks, and the coreprocessor stack trace tool could not connect to 172.20.17.121 which is IO node R17-ID-J00. Based on these additional symptoms, I executed "lfs check servers" on all the Vulcan IO nodes, which was something Richard had suggested to me last month. The exception results are below. Note the errors on R17-ID-J00 and J06. Also R15 and R20 are nodes that were used in 8K node jobs that also had hangs. I really hope this is the root cause of the problem. I believe that the fix may be to reboot the IO node. I'm not sure when R17-ID-J00 was last booted but it could have been after the diagnostic run on Thursday evening or Friday morning. Was there anything done to the system during the update on 4/14, other than the efixes which just updated the glibc and MPI libraries, which would have affected the IO nodes? In the mean time, I've suggested to the user to try running their test from NFS using the existing reservation on R16/R17, since they've assured me that there application does not do parallel IO. I'm not sure if they will receive my email this weekend to be able to attempt the NFS jobs before Monday. 
Results from "lsf check servers" 172.20.17.23 vulcanio23 R02-ID-J06 vulcanio23-ib0 pdsh@vulcanlac4: 172.20.17.23: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.51 vulcanio51 R06-ID-J02 vulcanio51-ib0 pdsh@vulcanlac4: 172.20.17.51: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.109 vulcanio109 R15-ID-J04 vulcanio109-ib0 pdsh@vulcanlac4: 172.20.17.109: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.121 vulcanio121 R17-ID-J00 vulcanio121-ib0 172.20.17.121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable 172.20.17.127 vulcanio127 R17-ID-J06 vulcanio127-ib0 172.20.17.127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable 172.20.17.133 vulcanio133 R20-ID-J04 vulcanio133-ib0 pdsh@vulcanlac4: 172.20.17.133: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.137 vulcanio137 R21-ID-J00 vulcanio137-ib0 pdsh@vulcanlac4: 172.20.17.137: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.154 vulcanio154 R23-ID-J01 vulcanio154-ib0 pdsh@vulcanlac4: 172.20.17.154: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.156 vulcanio156 R23-ID-J03 vulcanio156-ib0 pdsh@vulcanlac4: 172.20.17.156: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.171 vulcanio171 R25-ID-J02 vulcanio171-ib0 pdsh@vulcanlac4: 172.20.17.171: mcmd: xpoll (setting up stderr): Interrupted system call |
New:
Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64 Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64 On our IBM BGQ system Vulcan at LLNL, the ion's have been experiencing what is believed to be repeated ost connection issues affecting user jobs. Recently two ions have reported issues have been identified. The rack has been drained and the ions left as is. The command "lfs check servers" reports the following errors: vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable Output from proc "import" file for affect ost: {noformat} vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import import: name: fsv-OST0017-osc-c0000003e09f49c0 target: fsv-OST0017_UUID state: REPLAY connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ] connect_data: flags: 0x4af0e3440478 instance: 45 target_version: 2.5.5.0 initial_grant: 2097152 max_brw_size: 4194304 grant_block_size: 0 grant_inode_size: 0 grant_extent_overhead: 0 cksum_types: 0x2 max_easize: 32768 max_object_bytes: 9223372036854775807 import_flags: [ replayable, pingable, connect_tried ] connection: failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ] current_connection: 172.20.20.23@o2ib500 connection_attempts: 39 generation: 1 in-progress_invalidations: 0 rpcs: inflight: 168 unregistering: 1 timeouts: 20977 avg_waittime: 209959 usec service_estimates: services: 48 sec network: 45 sec transactions: last_replay: 0 peer_committed: 150323856033 last_checked: 150323856033 read_data_averages: bytes_per_rpc: 69000 usec_per_rpc: 4389 MB_per_sec: 15.72 write_data_averages: bytes_per_rpc: 893643 usec_per_rpc: 2458 MB_per_sec: 363.56 vulcanio121-ib0@root: vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import import: name: fsv-OST0017-osc-c0000003c4483300 target: fsv-OST0017_UUID state: REPLAY connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ] connect_data: flags: 0x4af0e3440478 instance: 45 target_version: 2.5.5.0 initial_grant: 2097152 max_brw_size: 4194304 grant_block_size: 0 grant_inode_size: 0 grant_extent_overhead: 0 cksum_types: 0x2 max_easize: 32768 max_object_bytes: 9223372036854775807 import_flags: [ replayable, pingable, connect_tried ] connection: failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ] current_connection: 172.20.20.23@o2ib500 connection_attempts: 36 generation: 1 in-progress_invalidations: 0 rpcs: inflight: 131 unregistering: 1 timeouts: 19341 avg_waittime: 144395 usec service_estimates: services: 45 sec network: 50 sec transactions: last_replay: 0 peer_committed: 150323856116 last_checked: 150323856116 read_data_averages: bytes_per_rpc: 67548 usec_per_rpc: 3326 MB_per_sec: 20.30 write_data_averages: bytes_per_rpc: 913996 usec_per_rpc: 5909 MB_per_sec: 154.67 vulcanio127-ib0@root: {noformat} The disconnects appear to have happen when we updated our lustre cluster. 
All other ion's reconnect with not problem once the lustre cluster was back up and running. Have appended the following email discussion on the matter: Subject: RE: Incident #: 1057380 Suspect Vulcan IO-node to Lustre connections To: LLNL Bgl-Admin <bgl-admin@lists.llnl.gov> CC: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P." <legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, LC Hotline <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov> From: Roy Musselman <roymuss@us.ibm.com> Date: Mon, 25 Apr 2016 10:49:44 -0700 The ownership of this incident should probably shift to Lustre as per the evidence gathered this past weekend. (I trimmed out a lot of the old message history from this dialog) Examining the IO-node log R17-ID-J00.log revealed the following Lustre-related messages that occurred on 4/14, the same day that the user first started seeing symptoms, and coincidentally the same day the efix 12 & 13 ( glibc and MPI libraries) updates were made to Vulcan. Many Lustre connections were lost and restored. However fsv-OST0017, which was identified in the "lfs check servers" messages from R17-ID-J00/J06, was not restored. 2016-04-14 10:18:18.376415 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre: fsv-OST0010-osc-c0000003e09f49c0: Connection to fsv-OST0010 (at 172.20.20.17@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 10:18:18.377098 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre: Skipped 1 previous similar message 2016-04-14 10:19:07.295373 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 10:19:07.296193 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre: Skipped 2 previous similar messages 2016-04-14 10:19:56.415101 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre: fsv-OST001c-osc-c0000003e09f49c0: Connection to fsv-OST001c (at 172.20.20.29@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 10:19:56.415554 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre: Skipped 5 previous similar messages 2016-04-14 10:33:50.702803 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:33:56.909074 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:33:56.909684 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 2852 previous sim= ilar messages 2016-04-14 10:34:07.495345 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:34:07.495964 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 3807 previous sim= ilar messages 2016-04-14 10:34:09.694464 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0015-osc-c0000003e09f49c0: Connection restored to fsv-OST0015 (at 172.20.20.22@o2ib500) 2016-04-14 10:34:10.702159 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:34:10.702500 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError: 
4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 21982 previous similar messages 2016-04-14 10:34:34.440285 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0012-osc-c0000003e09f49c0: Connection restored to fsv-OST0012 (at 172.20.20.19@o2ib500) 2016-04-14 10:34:34.440649 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 1 previous similar message 2016-04-14 10:34:38.442605 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0011-osc-c0000003e09f49c0: Connection restored to fsv-OST0011 (at 172.20.20.18@o2ib500) 2016-04-14 10:34:40.651655 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0014-osc-c0000003e09f49c0: Connection restored to fsv-OST0014 (at 172.20.20.21@o2ib500) 2016-04-14 10:34:40.652034 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 1 previous similar message 2016-04-14 10:34:52.237143 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST0019-osc-c0000003e09f49c0: Connection restored to fsv-OST0019 (at 172.20.20.26@o2ib500) 2016-04-14 10:35:02.029847 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: fsv-OST001c-osc-c0000003e09f49c0: Connection restored to fsv-OST001c (at 172.20.20.29@o2ib500) 2016-04-14 10:35:02.030420 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre: Skipped 3 previous similar messages 2016-04-14 10:35:57.721359 {RMP14Ap004250453} [mmcs]{128}.11.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:35:57.721779 {RMP14Ap004250453} [mmcs]{128}.11.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 388 previous similar messages 2016-04-14 10:38:53.015474 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 10:38:53.015922 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 6739 previous sim= ilar messages 2016-04-14 10:46:23.018342 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre: fsv-OST001f-osc-c0000003e09f49c0: Connection restored to fsv-OST001f (at 172.20.20.32@o2ib500) 2016-04-14 10:46:23.019057 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre: Skipped 2 previous similar messages 2016-04-14 10:48:21.567554 {RMP14Ap004250453} [mmcs]{128}.13.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.24@o2ib500) 2016-04-14 10:54:06.288585 {RMP14Ap004250453} [mmcs]{128}.15.0: LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with 172.20.20.24@o2ib500, operation obd=5Fping failed with -19. 
2016-04-14 10:54:06.288965 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 10:54:06.289253 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre: Skipped 5 previous similar messages 2016-04-14 11:01:36.246847 {RMP14Ap004250453} [mmcs]{128}.15.1: Lustre: fsv-OST0017-osc-c0000003e09f49c0: Connection to fsv-OST0017 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 11:08:55.698603 {RMP14Ap004250453} [mmcs]{128}.15.3: Lustre: fsv-MDT0000-mdc-c0000003e09f49c0: Connection to fsv-MDT0000 (at 172.20.20.201@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 11:08:55.758601 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre: fsv-MDT0000-mdc-c0000003e09f49c0: Connection restored to fsv-MDT0000 (at 172.20.20.201@o2ib500) 2016-04-14 11:08:55.860691 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre: Skipped 1 previous similar message 2016-04-14 11:10:05.732381 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 11:10:05.760709 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 13819 previous similar messages 2016-04-14 11:11:15.470182 {RMP14Ap004250453} [mmcs]{128}.9.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.23@o2ib500) 2016-04-14 11:11:40.369420 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting 2016-04-14 11:11:40.369775 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 6981 previous sim= ilar messages 2016-04-14 11:19:26.617798 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.23@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 12:15:45.337488 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.24@o2ib500) 2016-04-14 13:08:53.563675 {RMP14Ap004250453} [mmcs]{128}.7.0: LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with 172.20.20.24@o2ib500, operation obd=5Fping failed with -19. 2016-04-14 13:08:53.564909 {RMP14Ap004250453} [mmcs]{128}.7.0: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at 172.20.20.24@o2ib500) was lost; in progress operations using this service will wait for recovery to complete 2016-04-14 13:42:56.966595 {RMP14Ap004250453} [mmcs]{128}.7.1: Lustre: fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at 172.20.20.23@o2ib500) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Roy Musselman IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab From: Roy Musselman/Rochester/Contr/IBM To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov> Cc: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P." 
<legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "LC Hotline" <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov> Date: 04/24/2016 08:05 PM Subject: Re: Suspect Vulcan IO-node to Lustre connections for incident 1057380 The user reports that the test that consistently fails on R16-R17 when running on Lustre, runs without error when using NFS. So, I've deleted the reservation on R16-R17 and asked that the user rerun the other failing scenarios ( 8K nodes and the all-ranks-to-stdout test ) from NFS. I've also ask that the user keep his Lustre directories in tact so that the failing test can be rerun to verify that the connections to Lustre have been resolved. Can you consider periodically (daily?, hourly?) monitoring the Lustre connections using the "lfs check servers" so that failures like this can be avoided in the future? Is it possible to generate an alert when ever a connection goes bad? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Roy Musselman IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab email: roymuss@us.ibm.com LLNL office: 925-422-6033 Cell: 507-358-8895, Home: 507-281-9565 From: Roy Musselman/Rochester/Contr/IBM To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov> Cc: "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Ahn, Dong H." <ahn1@llnl.gov>, "Hedges, Richard M." <hedges1@llnl.gov>, "Legendre, Matthew P." <legendre1@llnl.gov>, "Gyllenhaal, John C." <gyllenhaal1@llnl.gov> Date: 04/23/2016 08:50 PM Subject: Suspect Vulcan IO-node to Lustre connections for incident 1057380 Greetings, Sending this to LLNL staff only. Friday evening I reported in incident 1057380 that the hang on R16/R17 still occured after some of the service actions were done. STAT could still not attach to some of the ranks, and the coreprocessor stack trace tool could not connect to 172.20.17.121 which is IO node R17-ID-J00. Based on these additional symptoms, I executed "lfs check servers" on all the Vulcan IO nodes, which was something Richard had suggested to me last month. The exception results are below. Note the errors on R17-ID-J00 and J06. Also R15 and R20 are nodes that were used in 8K node jobs that also had hangs. I really hope this is the root cause of the problem. I believe that the fix may be to reboot the IO node. I'm not sure when R17-ID-J00 was last booted but it could have been after the diagnostic run on Thursday evening or Friday morning. Was there anything done to the system during the update on 4/14, other than the efixes which just updated the glibc and MPI libraries, which would have affected the IO nodes? In the mean time, I've suggested to the user to try running their test from NFS using the existing reservation on R16/R17, since they've assured me that there application does not do parallel IO. I'm not sure if they will receive my email this weekend to be able to attempt the NFS jobs before Monday. 
Results from "lsf check servers" 172.20.17.23 vulcanio23 R02-ID-J06 vulcanio23-ib0 pdsh@vulcanlac4: 172.20.17.23: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.51 vulcanio51 R06-ID-J02 vulcanio51-ib0 pdsh@vulcanlac4: 172.20.17.51: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.109 vulcanio109 R15-ID-J04 vulcanio109-ib0 pdsh@vulcanlac4: 172.20.17.109: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.121 vulcanio121 R17-ID-J00 vulcanio121-ib0 172.20.17.121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable 172.20.17.127 vulcanio127 R17-ID-J06 vulcanio127-ib0 172.20.17.127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable 172.20.17.133 vulcanio133 R20-ID-J04 vulcanio133-ib0 pdsh@vulcanlac4: 172.20.17.133: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.137 vulcanio137 R21-ID-J00 vulcanio137-ib0 pdsh@vulcanlac4: 172.20.17.137: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.154 vulcanio154 R23-ID-J01 vulcanio154-ib0 pdsh@vulcanlac4: 172.20.17.154: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.156 vulcanio156 R23-ID-J03 vulcanio156-ib0 pdsh@vulcanlac4: 172.20.17.156: mcmd: xpoll (setting up stderr): Interrupted system call 172.20.17.171 vulcanio171 R25-ID-J02 vulcanio171-ib0 pdsh@vulcanlac4: 172.20.17.171: mcmd: xpoll (setting up stderr): Interrupted system call |