Description
Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64
Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64
On our IBM BG/Q system Vulcan at LLNL, the I/O nodes (IONs) have been experiencing what is believed to be repeated OST connection issues affecting user jobs. Recently two IONs reporting these issues have been identified. The rack has been drained and the IONs left as is. The command "lfs check servers" reports the following errors on them:
vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable
vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable
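For reference, a minimal sketch of how the same check could be swept across a set of IONs, keeping only the OSCs that report an error. The use of pdsh and the host list are assumptions (only the two affected nodes above are shown):

  # Hedged sketch: run "lfs check servers" on the listed IONs (assumed pdsh
  # hostlist) and keep only the lines reporting a check error.
  pdsh -w vulcanio[121,127] 'lfs check servers 2>&1' | grep -i 'check error'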
Output from the proc "import" file for the affected OST:
vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import
import:
    name: fsv-OST0017-osc-c0000003e09f49c0
    target: fsv-OST0017_UUID
    state: REPLAY
    connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
    connect_data:
       flags: 0x4af0e3440478
       instance: 45
       target_version: 2.5.5.0
       initial_grant: 2097152
       max_brw_size: 4194304
       grant_block_size: 0
       grant_inode_size: 0
       grant_extent_overhead: 0
       cksum_types: 0x2
       max_easize: 32768
       max_object_bytes: 9223372036854775807
    import_flags: [ replayable, pingable, connect_tried ]
    connection:
       failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
       current_connection: 172.20.20.23@o2ib500
       connection_attempts: 39
       generation: 1
       in-progress_invalidations: 0
    rpcs:
       inflight: 168
       unregistering: 1
       timeouts: 20977
       avg_waittime: 209959 usec
    service_estimates:
       services: 48 sec
       network: 45 sec
    transactions:
       last_replay: 0
       peer_committed: 150323856033
       last_checked: 150323856033
    read_data_averages:
       bytes_per_rpc: 69000
       usec_per_rpc: 4389
       MB_per_sec: 15.72
    write_data_averages:
       bytes_per_rpc: 893643
       usec_per_rpc: 2458
       MB_per_sec: 363.56
vulcanio121-ib0@root:

vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import
import:
    name: fsv-OST0017-osc-c0000003c4483300
    target: fsv-OST0017_UUID
    state: REPLAY
    connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
    connect_data:
       flags: 0x4af0e3440478
       instance: 45
       target_version: 2.5.5.0
       initial_grant: 2097152
       max_brw_size: 4194304
       grant_block_size: 0
       grant_inode_size: 0
       grant_extent_overhead: 0
       cksum_types: 0x2
       max_easize: 32768
       max_object_bytes: 9223372036854775807
    import_flags: [ replayable, pingable, connect_tried ]
    connection:
       failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
       current_connection: 172.20.20.23@o2ib500
       connection_attempts: 36
       generation: 1
       in-progress_invalidations: 0
    rpcs:
       inflight: 131
       unregistering: 1
       timeouts: 19341
       avg_waittime: 144395 usec
    service_estimates:
       services: 45 sec
       network: 50 sec
    transactions:
       last_replay: 0
       peer_committed: 150323856116
       last_checked: 150323856116
    read_data_averages:
       bytes_per_rpc: 67548
       usec_per_rpc: 3326
       MB_per_sec: 20.30
    write_data_averages:
       bytes_per_rpc: 913996
       usec_per_rpc: 5909
       MB_per_sec: 154.67
vulcanio127-ib0@root:
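A minimal sketch of how the same information could be summarized for every OSC on an ION, to spot imports stuck in REPLAY. The awk patterns are assumptions based on the "state:" and "inflight:" lines in the import layout shown above:

  # Hedged sketch: print connection state and in-flight RPC count for every
  # OSC import on this node; an import that stays in REPLAY with a large
  # inflight count matches the two stuck OSTs above.
  for f in /proc/fs/lustre/osc/*/import; do
      osc=$(basename "$(dirname "$f")")
      state=$(awk '/^[[:space:]]*state:/ {print $2; exit}' "$f")
      inflight=$(awk '/^[[:space:]]*inflight:/ {print $2; exit}' "$f")
      printf '%s state=%s inflight=%s\n' "$osc" "$state" "$inflight"
  done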
The disconnects appear to have happened when we updated our Lustre server cluster. All other IONs reconnected with no problem once the Lustre cluster was back up and running.