Lustre / LU-8077

LLNL BGQ IO-node Lustre Client Not Reconnecting

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.4
    • Severity: 3

    Description

      Lustre Clients: lustre-client-ion-2.5.4-16chaos_2.6.32_504.8.2.bgq.4blueos.V1R2M3.bl2.2_11.ppc64.ppc64
      Lustre Servers: lustre-2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64

      On our IBM BGQ system Vulcan at LLNL, the IONs (I/O nodes) have been experiencing what we believe are repeated OST connection issues that affect user jobs. Two IONs showing the problem have recently been identified; the rack has been drained and those IONs left as-is. The command "lfs check servers" reports the following errors:

      vulcanio121: fsv-OST0017-osc-c0000003e09f49c0: check error: Resource temporarily unavailable
      vulcanio127: fsv-OST0017-osc-c0000003c4483300: check error: Resource temporarily unavailable

      Output from the proc "import" file for the affected OST on each node:

      vulcanio121-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003e09f49c0/import
      import:
          name: fsv-OST0017-osc-c0000003e09f49c0
          target: fsv-OST0017_UUID
          state: REPLAY
          connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
          connect_data:
             flags: 0x4af0e3440478
             instance: 45
             target_version: 2.5.5.0
             initial_grant: 2097152
             max_brw_size: 4194304
             grant_block_size: 0
             grant_inode_size: 0
             grant_extent_overhead: 0
             cksum_types: 0x2
             max_easize: 32768
             max_object_bytes: 9223372036854775807
          import_flags: [ replayable, pingable, connect_tried ]
          connection:
             failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
             current_connection: 172.20.20.23@o2ib500
             connection_attempts: 39
             generation: 1
             in-progress_invalidations: 0
          rpcs:
             inflight: 168
             unregistering: 1
             timeouts: 20977
             avg_waittime: 209959 usec
          service_estimates:
             services: 48 sec
             network: 45 sec
          transactions:
             last_replay: 0
             peer_committed: 150323856033
             last_checked: 150323856033
          read_data_averages:
             bytes_per_rpc: 69000
             usec_per_rpc: 4389
             MB_per_sec: 15.72
          write_data_averages:
             bytes_per_rpc: 893643
             usec_per_rpc: 2458
             MB_per_sec: 363.56
      vulcanio121-ib0@root: 
      
      vulcanio127-ib0@root: cat /proc/fs/lustre/osc/fsv-OST0017-osc-c0000003c4483300/import 
      import:
          name: fsv-OST0017-osc-c0000003c4483300
          target: fsv-OST0017_UUID
          state: REPLAY
          connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type ]
          connect_data:
             flags: 0x4af0e3440478
             instance: 45
             target_version: 2.5.5.0
             initial_grant: 2097152
             max_brw_size: 4194304
             grant_block_size: 0
             grant_inode_size: 0
             grant_extent_overhead: 0
             cksum_types: 0x2
             max_easize: 32768
             max_object_bytes: 9223372036854775807
          import_flags: [ replayable, pingable, connect_tried ]
          connection:
             failover_nids: [ 172.20.20.23@o2ib500, 172.20.20.24@o2ib500 ]
             current_connection: 172.20.20.23@o2ib500
             connection_attempts: 36
             generation: 1
             in-progress_invalidations: 0
          rpcs:
             inflight: 131
             unregistering: 1
             timeouts: 19341
             avg_waittime: 144395 usec
          service_estimates:
             services: 45 sec
             network: 50 sec
          transactions:
             last_replay: 0
             peer_committed: 150323856116
             last_checked: 150323856116
          read_data_averages:
             bytes_per_rpc: 67548
             usec_per_rpc: 3326
             MB_per_sec: 20.30
          write_data_averages:
             bytes_per_rpc: 913996
             usec_per_rpc: 5909
             MB_per_sec: 154.67
      vulcanio127-ib0@root: 
      

      The disconnects appear to have happened when we updated our Lustre cluster. All of the other IONs reconnected without a problem once the Lustre cluster was back up and running.
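
      A minimal shell sketch, assuming the 2.5-era /proc layout shown above, for collecting the same import details across every OSC on an ION in one pass (illustrative only):

      # Summarize connection state for every OSC import on this client.
      for imp in /proc/fs/lustre/osc/*/import; do
          echo "== ${imp} =="
          grep -E 'state:|current_connection:|connection_attempts:|inflight:' "${imp}"
      done

      # Equivalent via lctl, which hides proc/sys path differences:
      lctl get_param osc.*.import | grep -E '^osc\.|state:|connection_attempts:|inflight:'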

      Attachments

        Activity


          jgmitter Joseph Gmitter (Inactive) added a comment -

          Can we close this issue?
          green Oleg Drokin added a comment -

          On the off chance can you see if any of that info is still there in the "xterm" scroll buffer somewhere from the time you did the bt?

          I did a "foreach bt" from crash. That is easier on these nodes than doing sysrq-t. Console access is possible, but a pain. And I don't believe that these nodes have been set up to get crash dumps. But at least crash works on a live system. That is much better than we had on previous BG systems.

          I didn't know the nodes were being rebooted without getting more info, or I would have saved it myself. I agree, more needs to be gathered next time this happens.

          morrone Christopher Morrone (Inactive) added a comment - I did a "foreach bt" from crash. That is easier on these nodes than doing sysrq-t. Console access is possible, but a pain. And I don't believe that these nodes have been set up to get crash dumps. But at least crash works on a live system. That is much better than we had on previous BG systems. I didn't know the nodes were being rebooted without getting more info, or I would have saved it myself. I agree, more needs to be gathered next time this happens.
          green Oleg Drokin added a comment -

          Chris, it must have been doing something somewhere, though. Was there an ll_imp_inval, for example?

          It would be great if you can get a sysrq-t (or an equivalent from a crash dump) should something like this happen again.
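
          A minimal sketch of how that could be captured on a live ION, assuming root access and that the vmlinux debuginfo path below matches the installed kernel:

          # Dump stack traces of all tasks to the kernel log (the sysrq-t
          # equivalent, no console needed), then save them before the ring
          # buffer wraps:
          echo t > /proc/sysrq-trigger
          dmesg > /tmp/sysrq-t.$(hostname).log

          # Save "foreach bt" output from live crash instead of relying on
          # xterm scrollback (the debuginfo path is an assumption):
          printf 'foreach bt\nquit\n' > /tmp/crash.cmds
          crash -i /tmp/crash.cmds /usr/lib/debug/lib/modules/$(uname -r)/vmlinux > /tmp/foreach-bt.$(hostname).log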

          nedbass Ned Bass (Inactive) added a comment -

          Please keep the ticket open until the underlying bug has been fixed.

          jfc John Fuchs-Chesney (Inactive) added a comment -

          Hello Lance and team,

          Is there any more work required on this ticket? Or can we go ahead and mark it as resolved?

          Many thanks,
          ~ jfc.

          morrone Christopher Morrone (Inactive) added a comment -

          I looked at the backtraces on one of the nodes and didn't see any Lustre threads stuck on anything. It looked like all were just in their normal idle waiting states.
          green Oleg Drokin added a comment -

          Next time something like this happens, it would be great if you can get backtraces from the Lustre threads to see what they were stuck on.
          green Oleg Drokin added a comment -

          Hm, I wonder if it was one of the recently uncovered bulk or RPC state hangups like LU-7434.

          morrone Christopher Morrone (Inactive) added a comment -

          I glanced at the clients, and I believe that they both had started replay but didn't finish, and then got stuck on the subsequent replay.
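
          A minimal sketch for cross-checking replay progress on the affected target from both ends, assuming 2.5-era parameter names and an ldiskfs-backed OST (so the server-side namespace is obdfilter):

          # Client side (on the stuck ION): import state and replay transnos.
          lctl get_param osc.fsv-OST0017-osc-*.import | grep -E 'state:|last_replay:|peer_committed:'

          # Server side (on the OSS serving fsv-OST0017): per-target recovery status.
          lctl get_param obdfilter.fsv-OST0017.recovery_status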

          People

            Assignee: jgmitter Joseph Gmitter (Inactive)
            Reporter: weems2 Lance Weems
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: