Lustre / LU-8077

LLNL BGQ IO-node Lustre Client Not Reconnecting

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.4
    • Severity: 3

    Description

      Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64
      Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64

      On our IBM BGQ system Vulcan at LLNL, the I/O nodes (IONs) have been experiencing what is believed to be repeated OST connection issues affecting user jobs. Two IONs exhibiting the problem have recently been identified; the rack has been drained and the IONs left as-is. The command "lfs check servers" reports the following errors:

      vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable
      vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable
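
      For reference, a minimal sketch of how this check can be gathered across all IO nodes at once, assuming pdsh is available on the management node and that "ion" is a valid host group there (both are assumptions, not site values); healthy targets normally report "active":

      # Hypothetical fleet-wide check: run "lfs check servers" on every IO node
      # and keep only the targets that do not report the normal "active" status.
      # The "-g ion" host group is a placeholder for the site's pdsh/genders setup.
      pdsh -g ion 'lfs check servers' 2>&1 | grep -v 'active' | sort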

      Output from the proc "import" file for the affected OST:

      vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import
      import:
          name: fsv-OST0017-osc-c0000003e09f49c0
          target: fsv-OST0017_UUID
          state: REPLAY
          connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
          connect_data:
             flags: 0x4af0e3440478
             instance: 45
             target_version: 2.5.5.0
             initial_grant: 2097152
             max_brw_size: 4194304
             grant_block_size: 0
             grant_inode_size: 0
             grant_extent_overhead: 0
             cksum_types: 0x2
             max_easize: 32768
             max_object_bytes: 9223372036854775807
          import_flags: [ replayable, pingable, connect_tried ]
          connection:
             failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
             current_connection: 172.20.20.23@o2ib500
             connection_attempts: 39
             generation: 1
             in-progress_invalidations: 0
          rpcs:
             inflight: 168
             unregistering: 1
             timeouts: 20977
             avg_waittime: 209959 usec
          service_estimates:
             services: 48 sec
             network: 45 sec
          transactions:
             last_replay: 0
             peer_committed: 150323856033
             last_checked: 150323856033
          read_data_averages:
             bytes_per_rpc: 69000
             usec_per_rpc: 4389
             MB_per_sec: 15.72
          write_data_averages:
             bytes_per_rpc: 893643
             usec_per_rpc: 2458
             MB_per_sec: 363.56
      vulcanio121-ib0@root: 
      
      vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import 
      import:
          name: fsv-OST0017-osc-c0000003c4483300
          target: fsv-OST0017_UUID
          state: REPLAY
          connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
          connect_data:
             flags: 0x4af0e3440478
             instance: 45
             target_version: 2.5.5.0
             initial_grant: 2097152
             max_brw_size: 4194304
             grant_block_size: 0
             grant_inode_size: 0
             grant_extent_overhead: 0
             cksum_types: 0x2
             max_easize: 32768
             max_object_bytes: 9223372036854775807
          import_flags: [ replayable, pingable, connect_tried ]
          connection:
             failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
             current_connection: 172.20.20.23@o2ib500
             connection_attempts: 36
             generation: 1
             in-progress_invalidations: 0
          rpcs:
             inflight: 131
             unregistering: 1
             timeouts: 19341
             avg_waittime: 144395 usec
          service_estimates:
             services: 45 sec
             network: 50 sec
          transactions:
             last_replay: 0
             peer_committed: 150323856116
             last_checked: 150323856116
          read_data_averages:
             bytes_per_rpc: 67548
             usec_per_rpc: 3326
             MB_per_sec: 20.30
          write_data_averages:
             bytes_per_rpc: 913996
             usec_per_rpc: 5909
             MB_per_sec: 154.67
      vulcanio127-ib0@root: 
      

      The disconnects appear to have happened when we updated our Lustre cluster. All other IONs reconnected without problem once the Lustre cluster was back up and running.
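
      One way to spot imports stuck outside the FULL state (such as the REPLAY state seen above) directly on a client, as a sketch assuming lctl is available on the ION image; "osc.*.import" mirrors the /proc files quoted above:

      # Hypothetical: list OSC imports whose state is anything other than FULL.
      lctl get_param osc.*.import | awk '
          /^osc\./                       { param = $1 }        # parameter name line
          $1 == "state:" && $2 != "FULL" { print param, $2 }   # e.g. REPLAY, DISCONN
      '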

      Attachments

        Activity

          [LU-8077] LLNL BGQ IO-node Lustre Client Not Reconnecting
          jgmitter Joseph Gmitter (Inactive) made changes -
          Resolution New: Not a Bug [ 6 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones made changes -
          End date New: 09/May/16
          Start date New: 27/Apr/16
          morrone Christopher Morrone (Inactive) made changes -
          Epic/Theme Original: LLNL-1
          jgmitter Joseph Gmitter (Inactive) made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Joseph Gmitter [ jgmitter ]
          nedbass Ned Bass (Inactive) made changes -
           Description Original: (the same description text as shown above, with the following email discussion appended:)

          Subject: RE: Incident #: 1057380 Suspect Vulcan IO-node to Lustre connections
          To: LLNL Bgl-Admin <bgl-admin@lists.llnl.gov>
          CC: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>,
          "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M."
          <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P."
          <legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>,
          LC Hotline <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov>
          From: Roy Musselman <roymuss@us.ibm.com>
          Date: Mon, 25 Apr 2016 10:49:44 -0700

          The ownership of this incident should probably shift to Lustre as per the
          evidence gathered this past weekend.

          (I trimmed out a lot of the old message history from this dialog)

          Examining the IO-node log R17-ID-J00.log revealed the following
          Lustre-related messages that occurred on 4/14, the same day that the user
          first started seeing symptoms, and coincidentally the same day the efix 12
          & 13 ( glibc and MPI libraries) updates were made to Vulcan. Many Lustre
          connections were lost and restored. However fsv-OST0017, which was
          identified in the "lfs check servers" messages from R17-ID-J00/J06, was not
          restored.
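
           A sketch of how such messages can be pulled out of the IO-node log (the
           log file name follows the example above; the grep patterns are an
           assumption, not a site tool):

           # Hypothetical: extract the Lustre client messages for 4/14 from the
           # mmcs IO-node log.
           grep '2016-04-14' R17-ID-J00.log | grep -E 'Lustre(Error)?:'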

          2016-04-14 10:18:18.376415 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre:
          fsv-OST0010-osc-c0000003e09f49c0: Connection to fsv-OST0010 (at
          172.20.20.17@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:18:18.377098 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:19:07.295373 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:19:07.296193 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre:
          Skipped 2 previous similar messages
          2016-04-14 10:19:56.415101 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre:
          fsv-OST001c-osc-c0000003e09f49c0: Connection to fsv-OST001c (at
          172.20.20.29@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:19:56.415554 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre:
          Skipped 5 previous similar messages
           2016-04-14 10:33:50.702803 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
           2016-04-14 10:33:56.909074 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
           2016-04-14 10:33:56.909684 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 2852 previous similar
           messages
           2016-04-14 10:34:07.495345 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
           2016-04-14 10:34:07.495964 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 3807 previous similar
           messages
          2016-04-14 10:34:09.694464 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0015-osc-c0000003e09f49c0: Connection restored to fsv-OST0015 (at
          172.20.20.22@o2ib500)
          2016-04-14 10:34:10.702159 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
          2016-04-14 10:34:10.702500 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 21982 previous
          similar messages
          2016-04-14 10:34:34.440285 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0012-osc-c0000003e09f49c0: Connection restored to fsv-OST0012 (at
          172.20.20.19@o2ib500)
          2016-04-14 10:34:34.440649 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:34:38.442605 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0011-osc-c0000003e09f49c0: Connection restored to fsv-OST0011 (at
          172.20.20.18@o2ib500)
          2016-04-14 10:34:40.651655 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0014-osc-c0000003e09f49c0: Connection restored to fsv-OST0014 (at
          172.20.20.21@o2ib500)
          2016-04-14 10:34:40.652034 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:34:52.237143 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0019-osc-c0000003e09f49c0: Connection restored to fsv-OST0019 (at
          172.20.20.26@o2ib500)
          2016-04-14 10:35:02.029847 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST001c-osc-c0000003e09f49c0: Connection restored to fsv-OST001c (at
          172.20.20.29@o2ib500)
          2016-04-14 10:35:02.030420 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 3 previous similar messages
           2016-04-14 10:35:57.721359 {RMP14Ap004250453} [mmcs]{128}.11.1:
           LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already
           connecting
           2016-04-14 10:35:57.721779 {RMP14Ap004250453} [mmcs]{128}.11.1:
           LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 388
           previous similar messages
           2016-04-14 10:38:53.015474 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
           2016-04-14 10:38:53.015922 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 6739 previous similar
           messages
          2016-04-14 10:46:23.018342 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre:
          fsv-OST001f-osc-c0000003e09f49c0: Connection restored to fsv-OST001f (at
          172.20.20.32@o2ib500)
          2016-04-14 10:46:23.019057 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre:
          Skipped 2 previous similar messages
          2016-04-14 10:48:21.567554 {RMP14Ap004250453} [mmcs]{128}.13.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.24@o2ib500)
          2016-04-14 10:54:06.288585 {RMP14Ap004250453} [mmcs]{128}.15.0:
          LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with
           172.20.20.24@o2ib500, operation obd_ping failed with -19.
          2016-04-14 10:54:06.288965 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:54:06.289253 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre:
          Skipped 5 previous similar messages
          2016-04-14 11:01:36.246847 {RMP14Ap004250453} [mmcs]{128}.15.1: Lustre:
          fsv-OST0017-osc-c0000003e09f49c0: Connection to fsv-OST0017 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 11:08:55.698603 {RMP14Ap004250453} [mmcs]{128}.15.3: Lustre:
          fsv-MDT0000-mdc-c0000003e09f49c0: Connection to fsv-MDT0000 (at
          172.20.20.201@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 11:08:55.758601 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre:
          fsv-MDT0000-mdc-c0000003e09f49c0: Connection restored to fsv-MDT0000 (at
          172.20.20.201@o2ib500)
          2016-04-14 11:08:55.860691 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre:
          Skipped 1 previous similar message
          2016-04-14 11:10:05.732381 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
          2016-04-14 11:10:05.760709 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 13819 previous
          similar messages
          2016-04-14 11:11:15.470182 {RMP14Ap004250453} [mmcs]{128}.9.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.23@o2ib500)
           2016-04-14 11:11:40.369420 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
           2016-04-14 11:11:40.369775 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
           4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 6981 previous similar
           messages
          2016-04-14 11:19:26.617798 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.23@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 12:15:45.337488 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.24@o2ib500)
          2016-04-14 13:08:53.563675 {RMP14Ap004250453} [mmcs]{128}.7.0: LustreError:
          11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with
           172.20.20.24@o2ib500, operation obd_ping failed with -19.
          2016-04-14 13:08:53.564909 {RMP14Ap004250453} [mmcs]{128}.7.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 13:42:56.966595 {RMP14Ap004250453} [mmcs]{128}.7.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.23@o2ib500)

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          Roy Musselman
          IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab

          From: Roy Musselman/Rochester/Contr/IBM
          To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov>
          Cc: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott"
                      <futral2@llnl.gov>, "Gyllenhaal, John C."
                      <gyllenhaal1@llnl.gov>, "Hedges, Richard M."
                      <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre,
                      Matthew P." <legendre1@llnl.gov>, "Paul Szepietowski
                      (pauls@llnl.gov)" <pauls@llnl.gov>, "LC Hotline"
                      <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov>
          Date: 04/24/2016 08:05 PM
          Subject: Re: Suspect Vulcan IO-node to Lustre connections for incident
                      1057380

           The user reports that the test that consistently fails on R16-R17 when
           running on Lustre runs without error when using NFS. So, I've deleted the
           reservation on R16-R17 and asked that the user rerun the other failing
           scenarios (8K nodes and the all-ranks-to-stdout test) from NFS. I've
           also asked that the user keep his Lustre directories intact so that the
           failing test can be rerun to verify that the connections to Lustre have
           been resolved.

           Can you consider periodically (daily? hourly?) monitoring the Lustre
           connections using "lfs check servers" so that failures like this can be
           avoided in the future?
           Is it possible to generate an alert whenever a connection goes bad?
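
           A minimal sketch of what such a periodic check could look like (cron-driven;
           the pdsh host group and the alert address are placeholders, not site values):

           #!/bin/bash
           # Hypothetical periodic cron job: run "lfs check servers" on all IO nodes
           # and mail any output that is not the normal "active" response.
           out=$(pdsh -g ion 'lfs check servers' 2>&1 | grep -v 'active')
           if [ -n "$out" ]; then
               echo "$out" | mail -s "Lustre connection errors on Vulcan IO nodes" admin@example.com
           fi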

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          Roy Musselman
          IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab
          email: roymuss@us.ibm.com
          LLNL office: 925-422-6033
          Cell: 507-358-8895, Home: 507-281-9565

          From: Roy Musselman/Rochester/Contr/IBM
          To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov>
          Cc: "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "Futral,
                      Scott" <futral2@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Ahn,
                      Dong H." <ahn1@llnl.gov>, "Hedges, Richard M."
                      <hedges1@llnl.gov>, "Legendre, Matthew P."
                      <legendre1@llnl.gov>, "Gyllenhaal, John C."
                      <gyllenhaal1@llnl.gov>
          Date: 04/23/2016 08:50 PM
          Subject: Suspect Vulcan IO-node to Lustre connections for incident
                      1057380

          Greetings,

          Sending this to LLNL staff only.

          Friday evening I reported in incident 1057380 that the hang on
           R16/R17 still occurred after some of the service actions were done. STAT
          could still not attach to some of the ranks, and the coreprocessor stack
          trace tool could not connect to 172.20.17.121 which is IO node R17-ID-J00.

          Based on these additional symptoms, I executed "lfs check servers" on
          all the Vulcan IO nodes, which was something Richard had suggested to me
          last month. The exception results are below. Note the errors on
          R17-ID-J00 and J06. Also R15 and R20 are nodes that were used in 8K node
          jobs that also had hangs. I really hope this is the root cause of the
          problem. I believe that the fix may be to reboot the IO node. I'm not sure
          when R17-ID-J00 was last booted but it could have been after the diagnostic
          run on Thursday evening or Friday morning.

          Was there anything done to the system during the update on 4/14, other than
          the efixes which just updated the glibc and MPI libraries, which would have
          affected the IO nodes?

           In the meantime, I've suggested to the user to try running their test from
           NFS using the existing reservation on R16/R17, since they've assured me
           that their application does not do parallel IO. I'm not sure if they will
          receive my email this weekend to be able to attempt the NFS jobs before
          Monday.

          Results from "lsf check servers"

          172.20.17.23 vulcanio23 R02-ID-J06 vulcanio23-ib0
          pdsh@vulcanlac4: 172.20.17.23: mcmd: xpoll (setting up stderr): Interrupted
          system call

          172.20.17.51 vulcanio51 R06-ID-J02 vulcanio51-ib0
          pdsh@vulcanlac4: 172.20.17.51: mcmd: xpoll (setting up stderr): Interrupted
          system call

          172.20.17.109 vulcanio109 R15-ID-J04 vulcanio109-ib0
          pdsh@vulcanlac4: 172.20.17.109: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.121 vulcanio121 R17-ID-J00 vulcanio121-ib0
          172.20.17.121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource
          temporarily unavailable

          172.20.17.127 vulcanio127 R17-ID-J06 vulcanio127-ib0
          172.20.17.127: fsv-OST0017-osc-c0000003c4483300: check error: Resource
          temporarily unavailable

          172.20.17.133 vulcanio133 R20-ID-J04 vulcanio133-ib0
          pdsh@vulcanlac4: 172.20.17.133: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.137 vulcanio137 R21-ID-J00 vulcanio137-ib0
          pdsh@vulcanlac4: 172.20.17.137: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.154 vulcanio154 R23-ID-J01 vulcanio154-ib0
          pdsh@vulcanlac4: 172.20.17.154: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.156 vulcanio156 R23-ID-J03 vulcanio156-ib0
          pdsh@vulcanlac4: 172.20.17.156: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.171 vulcanio171 R25-ID-J02 vulcanio171-ib0
          pdsh@vulcanlac4: 172.20.17.171: mcmd: xpoll (setting up stderr):
          Interrupted system call
           New: (the current description as shown at the top of this issue; the appended email discussion was removed)
          nedbass Ned Bass (Inactive) made changes -
          Description Original: Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64
          Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64

          On our IBM BGQ system Vulcan at LLNL, the ion's have been experiencing what is believed to be repeated ost connection issues affecting user jobs. Recently two ions have reported issues have been identified. The rack has been drained and the ions left as is. The command "lfs check servers" reports the following errors:

          vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable
          vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable

          Output from proc "import" file for affect ost:

          {noformat}
          vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import
          import:
              name: fsv-OST0017-osc-c0000003e09f49c0
              target: fsv-OST0017_UUID
              state: REPLAY
              connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
              connect_data:
                 flags: 0x4af0e3440478
                 instance: 45
                 target_version: 2.5.5.0
                 initial_grant: 2097152
                 max_brw_size: 4194304
                 grant_block_size: 0
                 grant_inode_size: 0
                 grant_extent_overhead: 0
                 cksum_types: 0x2
                 max_easize: 32768
                 max_object_bytes: 9223372036854775807
              import_flags: [ replayable, pingable, connect_tried ]
              connection:
                 failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
                 current_connection: 172.20.20.23@o2ib500
                 connection_attempts: 39
                 generation: 1
                 in-progress_invalidations: 0
              rpcs:
                 inflight: 168
                 unregistering: 1
                 timeouts: 20977
                 avg_waittime: 209959 usec
              service_estimates:
                 services: 48 sec
                 network: 45 sec
              transactions:
                 last_replay: 0
                 peer_committed: 150323856033
                 last_checked: 150323856033
              read_data_averages:
                 bytes_per_rpc: 69000
                 usec_per_rpc: 4389
                 MB_per_sec: 15.72
              write_data_averages:
                 bytes_per_rpc: 893643
                 usec_per_rpc: 2458
                 MB_per_sec: 363.56
          vulcanio121-ib0@root:

          vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import
          import:
              name: fsv-OST0017-osc-c0000003c4483300
              target: fsv-OST0017_UUID
              state: REPLAY
              connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
              connect_data:
                 flags: 0x4af0e3440478
                 instance: 45
                 target_version: 2.5.5.0
                 initial_grant: 2097152
                 max_brw_size: 4194304
                 grant_block_size: 0
                 grant_inode_size: 0
                 grant_extent_overhead: 0
                 cksum_types: 0x2
                 max_easize: 32768
                 max_object_bytes: 9223372036854775807
              import_flags: [ replayable, pingable, connect_tried ]
              connection:
                 failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
                 current_connection: 172.20.20.23@o2ib500
                 connection_attempts: 36
                 generation: 1
                 in-progress_invalidations: 0
              rpcs:
                 inflight: 131
                 unregistering: 1
                 timeouts: 19341
                 avg_waittime: 144395 usec
              service_estimates:
                 services: 45 sec
                 network: 50 sec
              transactions:
                 last_replay: 0
                 peer_committed: 150323856116
                 last_checked: 150323856116
              read_data_averages:
                 bytes_per_rpc: 67548
                 usec_per_rpc: 3326
                 MB_per_sec: 20.30
              write_data_averages:
                 bytes_per_rpc: 913996
                 usec_per_rpc: 5909
                 MB_per_sec: 154.67
          vulcanio127-ib0@root:
          {noformat}

          The disconnects appear to have happen when we updated our lustre cluster. All other ion's reconnect with not problem once the lustre cluster was back up and running.

          Have appended the following email discussion on the matter:

          Subject: RE: Incident #: 1057380 Suspect Vulcan IO-node to Lustre connections
          To: LLNL Bgl-Admin <bgl-admin@lists.llnl.gov>
          CC: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>,
          "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M."
          <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P."
          <legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>,
          LC Hotline <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov>
          From: Roy Musselman <roymuss@us.ibm.com>
          Date: Mon, 25 Apr 2016 10:49:44 -0700

          The ownership of this incident should probably shift to Lustre as per the
          evidence gathered this past weekend.

          (I trimmed out a lot of the old message history from this dialog)

          Examining the IO-node log R17-ID-J00.log revealed the following
          Lustre-related messages that occurred on 4/14, the same day that the user
          first started seeing symptoms, and coincidentally the same day the efix 12
          & 13 ( glibc and MPI libraries) updates were made to Vulcan. Many Lustre
          connections were lost and restored. However fsv-OST0017, which was
          identified in the "lfs check servers" messages from R17-ID-J00/J06, was not
          restored.

          2016-04-14 10:18:18.376415 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre:
          fsv-OST0010-osc-c0000003e09f49c0: Connection to fsv-OST0010 (at
          172.20.20.17@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:18:18.377098 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:19:07.295373 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:19:07.296193 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre:
          Skipped 2 previous similar messages
          2016-04-14 10:19:56.415101 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre:
          fsv-OST001c-osc-c0000003e09f49c0: Connection to fsv-OST001c (at
          172.20.20.29@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:19:56.415554 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre:
          Skipped 5 previous similar messages
          2016-04-14 10:33:50.702803 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting
          2016-04-14 10:33:56.909074 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting
          2016-04-14 10:33:56.909684 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 2852 previous sim=
          ilar
          messages
          2016-04-14 10:34:07.495345 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting
          2016-04-14 10:34:07.495964 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 3807 previous sim=
          ilar
          messages
          2016-04-14 10:34:09.694464 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0015-osc-c0000003e09f49c0: Connection restored to fsv-OST0015 (at
          172.20.20.22@o2ib500)
          2016-04-14 10:34:10.702159 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting
          2016-04-14 10:34:10.702500 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 21982 previous
          similar messages
          2016-04-14 10:34:34.440285 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0012-osc-c0000003e09f49c0: Connection restored to fsv-OST0012 (at
          172.20.20.19@o2ib500)
          2016-04-14 10:34:34.440649 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:34:38.442605 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0011-osc-c0000003e09f49c0: Connection restored to fsv-OST0011 (at
          172.20.20.18@o2ib500)
          2016-04-14 10:34:40.651655 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0014-osc-c0000003e09f49c0: Connection restored to fsv-OST0014 (at
          172.20.20.21@o2ib500)
          2016-04-14 10:34:40.652034 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:34:52.237143 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0019-osc-c0000003e09f49c0: Connection restored to fsv-OST0019 (at
          172.20.20.26@o2ib500)
          2016-04-14 10:35:02.029847 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST001c-osc-c0000003e09f49c0: Connection restored to fsv-OST001c (at
          172.20.20.29@o2ib500)
          2016-04-14 10:35:02.030420 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 3 previous similar messages
          2016-04-14 10:35:57.721359 {RMP14Ap004250453} [mmcs]{128}.11.1:
          LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already
          connecting
          2016-04-14 10:35:57.721779 {RMP14Ap004250453} [mmcs]{128}.11.1:
          LustreError: 4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 388
          previous similar messages
          2016-04-14 10:38:53.015474 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting
          2016-04-14 10:38:53.015922 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 6739 previous sim=
          ilar
          messages
          2016-04-14 10:46:23.018342 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre:
          fsv-OST001f-osc-c0000003e09f49c0: Connection restored to fsv-OST001f (at
          172.20.20.32@o2ib500)
          2016-04-14 10:46:23.019057 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre:
          Skipped 2 previous similar messages
          2016-04-14 10:48:21.567554 {RMP14Ap004250453} [mmcs]{128}.13.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.24@o2ib500)
          2016-04-14 10:54:06.288585 {RMP14Ap004250453} [mmcs]{128}.15.0:
          LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with
          172.20.20.24@o2ib500, operation obd=5Fping failed with -19.
          2016-04-14 10:54:06.288965 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:54:06.289253 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre:
          Skipped 5 previous similar messages
          2016-04-14 11:01:36.246847 {RMP14Ap004250453} [mmcs]{128}.15.1: Lustre:
          fsv-OST0017-osc-c0000003e09f49c0: Connection to fsv-OST0017 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 11:08:55.698603 {RMP14Ap004250453} [mmcs]{128}.15.3: Lustre:
          fsv-MDT0000-mdc-c0000003e09f49c0: Connection to fsv-MDT0000 (at
          172.20.20.201@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 11:08:55.758601 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre:
          fsv-MDT0000-mdc-c0000003e09f49c0: Connection restored to fsv-MDT0000 (at
          172.20.20.201@o2ib500)
          2016-04-14 11:08:55.860691 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre:
          Skipped 1 previous similar message
          2016-04-14 11:10:05.732381 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting
          2016-04-14 11:10:05.760709 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 13819 previous
          similar messages
          2016-04-14 11:11:15.470182 {RMP14Ap004250453} [mmcs]{128}.9.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.23@o2ib500)
          2016-04-14 11:11:40.369420 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) already connecting
          2016-04-14 11:11:40.369775 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
          4029:0:(import.c:648:ptlrpc=5Fconnect=5Fimport()) Skipped 6981 previous sim=
          ilar
          messages
          2016-04-14 11:19:26.617798 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.23@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 12:15:45.337488 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.24@o2ib500)
          2016-04-14 13:08:53.563675 {RMP14Ap004250453} [mmcs]{128}.7.0: LustreError:
          11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with
          172.20.20.24@o2ib500, operation obd=5Fping failed with -19.
          2016-04-14 13:08:53.564909 {RMP14Ap004250453} [mmcs]{128}.7.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 13:42:56.966595 {RMP14Ap004250453} [mmcs]{128}.7.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.23@o2ib500)

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          Roy Musselman
          IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab
          email: roymuss@us.ibm.com
          LLNL office: 925-422-6033
          Cell: 507-358-8895, Home: 507-281-9565


          From: Roy Musselman/Rochester/Contr/IBM
          To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov>
          Cc: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott"
                      <futral2@llnl.gov>, "Gyllenhaal, John C."
                      <gyllenhaal1@llnl.gov>, "Hedges, Richard M."
                      <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre,
                      Matthew P." <legendre1@llnl.gov>, "Paul Szepietowski
                      (pauls@llnl.gov)" <pauls@llnl.gov>, "LC Hotline"
                      <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov>
          Date: 04/24/2016 08:05 PM
          Subject: Re: Suspect Vulcan IO-node to Lustre connections for incident
                      1057380

          The user reports that the test that consistently fails on R16-R17 when
          running on Lustre, runs without error when using NFS. So, I've deleted the
          reservation on R16-R17 and asked that the user rerun the other failing
          scenarios ( 8K nodes and the all-ranks-to-stdout test ) from NFS. I've
          also ask that the user keep his Lustre directories in tact so that the
          failing test can be rerun to verify that the connections to Lustre have
          been resolved.

          Can you consider periodically (daily?, hourly?) monitoring the Lustre
          connections using the "lfs check servers" so that failures like this can be
          avoided in the future?
          Is it possible to generate an alert when ever a connection goes bad?

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          Roy Musselman
          IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab
          email: roymuss@us.ibm.com
          LLNL office: 925-422-6033
          Cell: 507-358-8895, Home: 507-281-9565

          From: Roy Musselman/Rochester/Contr/IBM
          To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov>
          Cc: "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "Futral,
                      Scott" <futral2@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Ahn,
                      Dong H." <ahn1@llnl.gov>, "Hedges, Richard M."
                      <hedges1@llnl.gov>, "Legendre, Matthew P."
                      <legendre1@llnl.gov>, "Gyllenhaal, John C."
                      <gyllenhaal1@llnl.gov>
          Date: 04/23/2016 08:50 PM
          Subject: Suspect Vulcan IO-node to Lustre connections for incident
                      1057380

          Greetings,

          Sending this to LLNL staff only.

          Friday evening I reported in incident 1057380 that the hang on
          R16/R17 still occured after some of the service actions were done. STAT
          could still not attach to some of the ranks, and the coreprocessor stack
          trace tool could not connect to 172.20.17.121 which is IO node R17-ID-J00.

          Based on these additional symptoms, I executed "lfs check servers" on
          all the Vulcan IO nodes, which was something Richard had suggested to me
          last month. The exception results are below. Note the errors on
          R17-ID-J00 and J06. Also R15 and R20 are nodes that were used in 8K node
          jobs that also had hangs. I really hope this is the root cause of the
          problem. I believe that the fix may be to reboot the IO node. I'm not sure
          when R17-ID-J00 was last booted but it could have been after the diagnostic
          run on Thursday evening or Friday morning.

          Was there anything done to the system during the update on 4/14, other than
          the efixes which just updated the glibc and MPI libraries, which would have
          affected the IO nodes?

          In the mean time, I've suggested to the user to try running their test from
          NFS using the existing reservation on R16/R17, since they've assured me
          that there application does not do parallel IO. I'm not sure if they will
          receive my email this weekend to be able to attempt the NFS jobs before
          Monday.

          Results from "lsf check servers"

          172.20.17.23 vulcanio23 R02-ID-J06 vulcanio23-ib0
          pdsh@vulcanlac4: 172.20.17.23: mcmd: xpoll (setting up stderr): Interrupted
          system call

          172.20.17.51 vulcanio51 R06-ID-J02 vulcanio51-ib0
          pdsh@vulcanlac4: 172.20.17.51: mcmd: xpoll (setting up stderr): Interrupted
          system call

          172.20.17.109 vulcanio109 R15-ID-J04 vulcanio109-ib0
          pdsh@vulcanlac4: 172.20.17.109: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.121 vulcanio121 R17-ID-J00 vulcanio121-ib0
          172.20.17.121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource
          temporarily unavailable

          172.20.17.127 vulcanio127 R17-ID-J06 vulcanio127-ib0
          172.20.17.127: fsv-OST0017-osc-c0000003c4483300: check error: Resource
          temporarily unavailable

          172.20.17.133 vulcanio133 R20-ID-J04 vulcanio133-ib0
          pdsh@vulcanlac4: 172.20.17.133: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.137 vulcanio137 R21-ID-J00 vulcanio137-ib0
          pdsh@vulcanlac4: 172.20.17.137: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.154 vulcanio154 R23-ID-J01 vulcanio154-ib0
          pdsh@vulcanlac4: 172.20.17.154: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.156 vulcanio156 R23-ID-J03 vulcanio156-ib0
          pdsh@vulcanlac4: 172.20.17.156: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.171 vulcanio171 R25-ID-J02 vulcanio171-ib0
          pdsh@vulcanlac4: 172.20.17.171: mcmd: xpoll (setting up stderr):
          Interrupted system call
          New: Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64
          Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64

On our IBM BGQ system Vulcan at LLNL, the IO nodes (IONs) have been experiencing what is believed to be repeated OST connection issues affecting user jobs. Two IONs exhibiting the problem have recently been identified; the rack has been drained and the IONs left as is. The command "lfs check servers" reports the following errors:

          vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable
          vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable

Output from the proc "import" file for the affected OST:

          {noformat}
          vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import
          import:
              name: fsv-OST0017-osc-c0000003e09f49c0
              target: fsv-OST0017_UUID
              state: REPLAY
              connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
              connect_data:
                 flags: 0x4af0e3440478
                 instance: 45
                 target_version: 2.5.5.0
                 initial_grant: 2097152
                 max_brw_size: 4194304
                 grant_block_size: 0
                 grant_inode_size: 0
                 grant_extent_overhead: 0
                 cksum_types: 0x2
                 max_easize: 32768
                 max_object_bytes: 9223372036854775807
              import_flags: [ replayable, pingable, connect_tried ]
              connection:
                 failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
                 current_connection: 172.20.20.23@o2ib500
                 connection_attempts: 39
                 generation: 1
                 in-progress_invalidations: 0
              rpcs:
                 inflight: 168
                 unregistering: 1
                 timeouts: 20977
                 avg_waittime: 209959 usec
              service_estimates:
                 services: 48 sec
                 network: 45 sec
              transactions:
                 last_replay: 0
                 peer_committed: 150323856033
                 last_checked: 150323856033
              read_data_averages:
                 bytes_per_rpc: 69000
                 usec_per_rpc: 4389
                 MB_per_sec: 15.72
              write_data_averages:
                 bytes_per_rpc: 893643
                 usec_per_rpc: 2458
                 MB_per_sec: 363.56
          vulcanio121-ib0@root:

          vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import
          import:
              name: fsv-OST0017-osc-c0000003c4483300
              target: fsv-OST0017_UUID
              state: REPLAY
              connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
              connect_data:
                 flags: 0x4af0e3440478
                 instance: 45
                 target_version: 2.5.5.0
                 initial_grant: 2097152
                 max_brw_size: 4194304
                 grant_block_size: 0
                 grant_inode_size: 0
                 grant_extent_overhead: 0
                 cksum_types: 0x2
                 max_easize: 32768
                 max_object_bytes: 9223372036854775807
              import_flags: [ replayable, pingable, connect_tried ]
              connection:
                 failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
                 current_connection: 172.20.20.23@o2ib500
                 connection_attempts: 36
                 generation: 1
                 in-progress_invalidations: 0
              rpcs:
                 inflight: 131
                 unregistering: 1
                 timeouts: 19341
                 avg_waittime: 144395 usec
              service_estimates:
                 services: 45 sec
                 network: 50 sec
              transactions:
                 last_replay: 0
                 peer_committed: 150323856116
                 last_checked: 150323856116
              read_data_averages:
                 bytes_per_rpc: 67548
                 usec_per_rpc: 3326
                 MB_per_sec: 20.30
              write_data_averages:
                 bytes_per_rpc: 913996
                 usec_per_rpc: 5909
                 MB_per_sec: 154.67
          vulcanio127-ib0@root:
          {noformat}
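
For reference, the same import information can also be read through lctl get_param instead of cat'ing /proc directly; a minimal sketch (the device-name pattern below is just an example for the OST in question):

{noformat}
# On the affected ION: dump the import state of the suspect OSC device.
lctl get_param osc.fsv-OST0017-osc-*.import

# Quick summary of connection state for every OSC import on the node.
lctl get_param osc.*.import | grep -E 'name:|state:|current_connection:'
{noformat}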

The disconnects appear to have happened when we updated our Lustre cluster. All other IONs reconnected with no problem once the Lustre cluster was back up and running.

I have appended the following email discussion on the matter:

          Subject: RE: Incident #: 1057380 Suspect Vulcan IO-node to Lustre connections
          To: LLNL Bgl-Admin <bgl-admin@lists.llnl.gov>
          CC: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott" <futral2@llnl.gov>,
          "Gyllenhaal, John C." <gyllenhaal1@llnl.gov>, "Hedges, Richard M."
          <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre, Matthew P."
          <legendre1@llnl.gov>, "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>,
          LC Hotline <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov>
          From: Roy Musselman <roymuss@us.ibm.com>
          Date: Mon, 25 Apr 2016 10:49:44 -0700

          The ownership of this incident should probably shift to Lustre as per the
          evidence gathered this past weekend.

          (I trimmed out a lot of the old message history from this dialog)

          Examining the IO-node log R17-ID-J00.log revealed the following
          Lustre-related messages that occurred on 4/14, the same day that the user
          first started seeing symptoms, and coincidentally the same day the efix 12
& 13 (glibc and MPI libraries) updates were made to Vulcan. Many Lustre
          connections were lost and restored. However fsv-OST0017, which was
          identified in the "lfs check servers" messages from R17-ID-J00/J06, was not
          restored.

          2016-04-14 10:18:18.376415 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre:
          fsv-OST0010-osc-c0000003e09f49c0: Connection to fsv-OST0010 (at
          172.20.20.17@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:18:18.377098 {RMP14Ap004250453} [mmcs]{128}.10.0: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:19:07.295373 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:19:07.296193 {RMP14Ap004250453} [mmcs]{128}.11.3: Lustre:
          Skipped 2 previous similar messages
          2016-04-14 10:19:56.415101 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre:
          fsv-OST001c-osc-c0000003e09f49c0: Connection to fsv-OST001c (at
          172.20.20.29@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:19:56.415554 {RMP14Ap004250453} [mmcs]{128}.8.3: Lustre:
          Skipped 5 previous similar messages
          2016-04-14 10:33:50.702803 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:33:56.909074 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:33:56.909684 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 2852 previous similar
messages
2016-04-14 10:34:07.495345 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:34:07.495964 {RMP14Ap004250453} [mmcs]{128}.1.0: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 3807 previous similar
messages
          2016-04-14 10:34:09.694464 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0015-osc-c0000003e09f49c0: Connection restored to fsv-OST0015 (at
          172.20.20.22@o2ib500)
          2016-04-14 10:34:10.702159 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:34:10.702500 {RMP14Ap004250453} [mmcs]{128}.1.1: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 21982 previous
          similar messages
          2016-04-14 10:34:34.440285 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0012-osc-c0000003e09f49c0: Connection restored to fsv-OST0012 (at
          172.20.20.19@o2ib500)
          2016-04-14 10:34:34.440649 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:34:38.442605 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0011-osc-c0000003e09f49c0: Connection restored to fsv-OST0011 (at
          172.20.20.18@o2ib500)
          2016-04-14 10:34:40.651655 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0014-osc-c0000003e09f49c0: Connection restored to fsv-OST0014 (at
          172.20.20.21@o2ib500)
          2016-04-14 10:34:40.652034 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 1 previous similar message
          2016-04-14 10:34:52.237143 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST0019-osc-c0000003e09f49c0: Connection restored to fsv-OST0019 (at
          172.20.20.26@o2ib500)
          2016-04-14 10:35:02.029847 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          fsv-OST001c-osc-c0000003e09f49c0: Connection restored to fsv-OST001c (at
          172.20.20.29@o2ib500)
          2016-04-14 10:35:02.030420 {RMP14Ap004250453} [mmcs]{128}.1.1: Lustre:
          Skipped 3 previous similar messages
          2016-04-14 10:35:57.721359 {RMP14Ap004250453} [mmcs]{128}.11.1:
LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) already
connecting
2016-04-14 10:35:57.721779 {RMP14Ap004250453} [mmcs]{128}.11.1:
LustreError: 4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 388
previous similar messages
2016-04-14 10:38:53.015474 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 10:38:53.015922 {RMP14Ap004250453} [mmcs]{128}.0.1: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 6739 previous similar
messages
          2016-04-14 10:46:23.018342 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre:
          fsv-OST001f-osc-c0000003e09f49c0: Connection restored to fsv-OST001f (at
          172.20.20.32@o2ib500)
          2016-04-14 10:46:23.019057 {RMP14Ap004250453} [mmcs]{128}.13.3: Lustre:
          Skipped 2 previous similar messages
          2016-04-14 10:48:21.567554 {RMP14Ap004250453} [mmcs]{128}.13.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.24@o2ib500)
          2016-04-14 10:54:06.288585 {RMP14Ap004250453} [mmcs]{128}.15.0:
          LustreError: 11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with
172.20.20.24@o2ib500, operation obd_ping failed with -19.
          2016-04-14 10:54:06.288965 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 10:54:06.289253 {RMP14Ap004250453} [mmcs]{128}.15.0: Lustre:
          Skipped 5 previous similar messages
          2016-04-14 11:01:36.246847 {RMP14Ap004250453} [mmcs]{128}.15.1: Lustre:
          fsv-OST0017-osc-c0000003e09f49c0: Connection to fsv-OST0017 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 11:08:55.698603 {RMP14Ap004250453} [mmcs]{128}.15.3: Lustre:
          fsv-MDT0000-mdc-c0000003e09f49c0: Connection to fsv-MDT0000 (at
          172.20.20.201@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 11:08:55.758601 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre:
          fsv-MDT0000-mdc-c0000003e09f49c0: Connection restored to fsv-MDT0000 (at
          172.20.20.201@o2ib500)
          2016-04-14 11:08:55.860691 {RMP14Ap004250453} [mmcs]{128}.9.2: Lustre:
          Skipped 1 previous similar message
          2016-04-14 11:10:05.732381 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 11:10:05.760709 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 13819 previous
          similar messages
          2016-04-14 11:11:15.470182 {RMP14Ap004250453} [mmcs]{128}.9.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.23@o2ib500)
          2016-04-14 11:11:40.369420 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) already connecting
2016-04-14 11:11:40.369775 {RMP14Ap004250453} [mmcs]{128}.9.1: LustreError:
4029:0:(import.c:648:ptlrpc_connect_import()) Skipped 6981 previous similar
messages
          2016-04-14 11:19:26.617798 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.23@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 12:15:45.337488 {RMP14Ap004250453} [mmcs]{128}.11.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.24@o2ib500)
          2016-04-14 13:08:53.563675 {RMP14Ap004250453} [mmcs]{128}.7.0: LustreError:
          11-0: fsv-OST0016-osc-c0000003e09f49c0: Communicating with
172.20.20.24@o2ib500, operation obd_ping failed with -19.
          2016-04-14 13:08:53.564909 {RMP14Ap004250453} [mmcs]{128}.7.0: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection to fsv-OST0016 (at
          172.20.20.24@o2ib500) was lost; in progress operations using this service
          will wait for recovery to complete
          2016-04-14 13:42:56.966595 {RMP14Ap004250453} [mmcs]{128}.7.1: Lustre:
          fsv-OST0016-osc-c0000003e09f49c0: Connection restored to fsv-OST0016 (at
          172.20.20.23@o2ib500)

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          Roy Musselman
          IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab
          email: roymuss@us.ibm.com
          LLNL office: 925-422-6033
          Cell: 507-358-8895, Home: 507-281-9565


          From: Roy Musselman/Rochester/Contr/IBM
          To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov>
          Cc: "Ahn, Dong H." <ahn1@llnl.gov>, "Futral, Scott"
                      <futral2@llnl.gov>, "Gyllenhaal, John C."
                      <gyllenhaal1@llnl.gov>, "Hedges, Richard M."
                      <hedges1@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Legendre,
                      Matthew P." <legendre1@llnl.gov>, "Paul Szepietowski
                      (pauls@llnl.gov)" <pauls@llnl.gov>, "LC Hotline"
                      <lc-hotline@llnl.gov>, itsme2i <itsme2i@llnl.gov>
          Date: 04/24/2016 08:05 PM
          Subject: Re: Suspect Vulcan IO-node to Lustre connections for incident
                      1057380

          The user reports that the test that consistently fails on R16-R17 when
          running on Lustre runs without error when using NFS. So, I've deleted the
          reservation on R16-R17 and asked that the user rerun the other failing
          scenarios (8K nodes and the all-ranks-to-stdout test) from NFS. I've
          also asked that the user keep his Lustre directories intact so that the
          failing test can be rerun to verify that the connections to Lustre have
          been resolved.

          Can you consider periodically (daily? hourly?) monitoring the Lustre
          connections using the "lfs check servers" command so that failures like
          this can be avoided in the future?
          Is it possible to generate an alert whenever a connection goes bad?
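
          A minimal cron-style sweep along these lines could do that; this is only
          a sketch under assumptions (the pdsh host range, alert address, and the
          script itself are placeholders, not an existing LLNL tool):

            #!/bin/bash
            # Hypothetical sketch: run "lfs check servers" on all IO nodes via pdsh
            # and send an alert if any import reports a check error.
            IONODES="vulcanio[1-192]"              # placeholder pdsh host range
            ALERT_TO="lustre-admins@example.com"   # placeholder alert address

            errors=$(pdsh -w "$IONODES" "lfs check servers" 2>&1 | grep -i "check error")
            if [ -n "$errors" ]; then
                echo "$errors" | mail -s "Vulcan IO-node Lustre check errors" "$ALERT_TO"
            fi

          Run hourly or daily from cron on a service node, something like this would
          flag "check error" imports without waiting for a user job to hang.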

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          Roy Musselman
          IBM Contracted HPC Application Analyst at Lawrence Livermore National Lab
          email: roymuss@us.ibm.com
          LLNL office: 925-422-6033
          Cell: 507-358-8895, Home: 507-281-9565

          From: Roy Musselman/Rochester/Contr/IBM
          To: "LLNL Bgl-Admin" <bgl-admin@lists.llnl.gov>
          Cc: "Paul Szepietowski (pauls@llnl.gov)" <pauls@llnl.gov>, "Futral,
                      Scott" <futral2@llnl.gov>, "Lee, Greg" <lee218@llnl.gov>, "Ahn,
                      Dong H." <ahn1@llnl.gov>, "Hedges, Richard M."
                      <hedges1@llnl.gov>, "Legendre, Matthew P."
                      <legendre1@llnl.gov>, "Gyllenhaal, John C."
                      <gyllenhaal1@llnl.gov>
          Date: 04/23/2016 08:50 PM
          Subject: Suspect Vulcan IO-node to Lustre connections for incident
                      1057380

          Greetings,

          Sending this to LLNL staff only.

          Friday evening I reported in incident 1057380 that the hang on
          R16/R17 still occurred after some of the service actions were done. STAT
          could still not attach to some of the ranks, and the coreprocessor stack
          trace tool could not connect to 172.20.17.121, which is IO node R17-ID-J00.

          Based on these additional symptoms, I executed "lfs check servers" on
          all the Vulcan IO nodes, which was something Richard had suggested to me
          last month. The exception results are below. Note the errors on
          R17-ID-J00 and J06. Also, R15 and R20 are nodes that were used in 8K-node
          jobs that also had hangs. I really hope this is the root cause of the
          problem. I believe the fix may be to reboot the IO node. I'm not sure
          when R17-ID-J00 was last booted, but it could have been after the
          diagnostic run on Thursday evening or Friday morning.
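
          The exact sweep invocation isn't shown here; a plausible reconstruction
          that would produce the per-node format in the results below (the node-list
          file is a placeholder) is:

            # Hypothetical reconstruction: walk a list of IO nodes (IP, hostname,
            # BGQ location, IB alias) and report only the non-"active" imports.
            while read ip name loc ib; do
                echo "$ip $name $loc $ib"
                pdsh -w "$name" "lfs check servers" </dev/null 2>&1 | grep -v "active"
                echo
            done < vulcan-ionodes.list   # placeholder node list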

          Was anything done to the system during the update on 4/14, other than the
          efixes (which only updated the glibc and MPI libraries), that would have
          affected the IO nodes?

          In the meantime, I've suggested that the user try running their test from
          NFS using the existing reservation on R16/R17, since they've assured me
          that their application does not do parallel IO. I'm not sure whether they
          will see my email this weekend in time to attempt the NFS jobs before
          Monday.

          Results from "lfs check servers":

          172.20.17.23 vulcanio23 R02-ID-J06 vulcanio23-ib0
          pdsh@vulcanlac4: 172.20.17.23: mcmd: xpoll (setting up stderr): Interrupted
          system call

          172.20.17.51 vulcanio51 R06-ID-J02 vulcanio51-ib0
          pdsh@vulcanlac4: 172.20.17.51: mcmd: xpoll (setting up stderr): Interrupted
          system call

          172.20.17.109 vulcanio109 R15-ID-J04 vulcanio109-ib0
          pdsh@vulcanlac4: 172.20.17.109: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.121 vulcanio121 R17-ID-J00 vulcanio121-ib0
          172.20.17.121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource
          temporarily unavailable

          172.20.17.127 vulcanio127 R17-ID-J06 vulcanio127-ib0
          172.20.17.127: fsv-OST0017-osc-c0000003c4483300: check error: Resource
          temporarily unavailable

          172.20.17.133 vulcanio133 R20-ID-J04 vulcanio133-ib0
          pdsh@vulcanlac4: 172.20.17.133: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.137 vulcanio137 R21-ID-J00 vulcanio137-ib0
          pdsh@vulcanlac4: 172.20.17.137: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.154 vulcanio154 R23-ID-J01 vulcanio154-ib0
          pdsh@vulcanlac4: 172.20.17.154: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.156 vulcanio156 R23-ID-J03 vulcanio156-ib0
          pdsh@vulcanlac4: 172.20.17.156: mcmd: xpoll (setting up stderr):
          Interrupted system call

          172.20.17.171 vulcanio171 R25-ID-J02 vulcanio171-ib0
          pdsh@vulcanlac4: 172.20.17.171: mcmd: xpoll (setting up stderr):
          Interrupted system call

          People

            Assignee: jgmitter Joseph Gmitter (Inactive)
            Reporter: weems2 Lance Weems