Description
Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64
Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64
On our IBM BGQ system Vulcan at LLNL, the IONs have been experiencing what is believed to be repeated OST connection issues affecting user jobs. Recently, two IONs exhibiting the issue were identified. The rack has been drained and the IONs left as-is. The command "lfs check servers" reports the following errors:
vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable
vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable
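For reference, a minimal sketch of how the check can be re-run across the two affected IONs from a management node (assumes pdsh is available; hostnames are taken from the output above):

# Hypothetical sweep of the two affected IONs; only lines reporting a
# check error are kept.
pdsh -w vulcanio[121,127] 'lfs check servers 2>&1 | grep -i "check error"'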
Output from the proc "import" file for the affected OST:
vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import
import:
name: fsv-OST0017-osc-c0000003e09f49c0
target: fsv-OST0017_UUID
state: REPLAY
connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
connect_data:
flags: 0x4af0e3440478
instance: 45
target_version: 2.5.5.0
initial_grant: 2097152
max_brw_size: 4194304
grant_block_size: 0
grant_inode_size: 0
grant_extent_overhead: 0
cksum_types: 0x2
max_easize: 32768
max_object_bytes: 9223372036854775807
import_flags: [ replayable, pingable, connect_tried ]
connection:
failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
current_connection: 172.20.20.23@o2ib500
connection_attempts: 39
generation: 1
in-progress_invalidations: 0
rpcs:
inflight: 168
unregistering: 1
timeouts: 20977
avg_waittime: 209959 usec
service_estimates:
services: 48 sec
network: 45 sec
transactions:
last_replay: 0
peer_committed: 150323856033
last_checked: 150323856033
read_data_averages:
bytes_per_rpc: 69000
usec_per_rpc: 4389
MB_per_sec: 15.72
write_data_averages:
bytes_per_rpc: 893643
usec_per_rpc: 2458
MB_per_sec: 363.56
vulcanio121-ib0@root:
vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import
import:
name: fsv-OST0017-osc-c0000003c4483300
target: fsv-OST0017_UUID
state: REPLAY
connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
connect_data:
flags: 0x4af0e3440478
instance: 45
target_version: 2.5.5.0
initial_grant: 2097152
max_brw_size: 4194304
grant_block_size: 0
grant_inode_size: 0
grant_extent_overhead: 0
cksum_types: 0x2
max_easize: 32768
max_object_bytes: 9223372036854775807
import_flags: [ replayable, pingable, connect_tried ]
connection:
failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
current_connection: 172.20.20.23@o2ib500
connection_attempts: 36
generation: 1
in-progress_invalidations: 0
rpcs:
inflight: 131
unregistering: 1
timeouts: 19341
avg_waittime: 144395 usec
service_estimates:
services: 45 sec
network: 50 sec
transactions:
last_replay: 0
peer_committed: 150323856116
last_checked: 150323856116
read_data_averages:
bytes_per_rpc: 67548
usec_per_rpc: 3326
MB_per_sec: 20.30
write_data_averages:
bytes_per_rpc: 913996
usec_per_rpc: 5909
MB_per_sec: 154.67
vulcanio127-ib0@root:
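For completeness, a minimal sketch of a sweep over all OSC imports on an ION to flag any others that are stuck outside the FULL state (paths follow the /proc layout shown above; treating REPLAY and other non-FULL states as suspect is an assumption for this sketch):

# Hypothetical check: print the import name of any OSC whose state is not FULL
# (e.g. REPLAY, as seen above for fsv-OST0017).
for f in /proc/fs/lustre/osc/*/import; do
    state=$(awk '/^ *state:/ {print $2; exit}' "$f")
    [ "$state" != "FULL" ] && echo "$f: $state"
done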
The disconnects appear to have happened when we updated our Lustre cluster. All other IONs reconnected with no problem once the Lustre cluster was back up and running.