  1. Lustre
  2. LU-11230

QIB route to OPA LNet drops / selftest fail

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.10.4
    • OPA, Lustre 2.10.4, QLogic QDR IB, CentOS 7.5

    Description

      Hi Folks,

      Looking for some assistance on this one. We're having trouble with reliable LNet routing between QLogic and OPA clients. Basically, we see long pauses in I/O transfers when moving data between the two fabric types. Testing with lnet_selftest has shown that over many hours, some tests (300-second runs) will randomly fail.

      In recent testing over the last few nights, it seems to fail more often when LTO = OPA and LFROM = QIB. As far as I can tell, buffers, lnetctl stats, etc. look fine during transfers; then msgs_alloc and the /proc/sys/lnet/peers queue suddenly drop to zero right when lnet_selftest starts showing zero-sized transfers.
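
      To catch the moment the peer queue drops, the peers table can be sampled while a selftest runs. The following is a minimal sketch, assuming /proc/sys/lnet/peers is a whitespace-separated table with a header row that contains a "queue" column. The function name parse_peer_queues and the sample rows are hypothetical and only illustrate the layout; they are not taken from the affected nodes.

```python
def parse_peer_queues(text):
    """Map each peer NID to its 'queue' value from a peers-table dump.

    Assumes a header row naming the columns, with one row per peer.
    Rows whose column count differs from the header are skipped.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return {}
    header, peers = rows[0], rows[1:]
    queue_col = header.index("queue")
    return {row[0]: int(row[queue_col])
            for row in peers if len(row) == len(header)}


# Illustrative sample only (values are made up for the example).
sample = """\
nid                      refs state  last   max   rtr   min    tx   min queue
192.168.55.77@o2ib          4    up    -1     8     8    -8     8   -13 0
192.168.44.199@o2ib44       2    up    -1   128   128   120   128    64 1048576
"""

queues = parse_peer_queues(sample)
idle = [nid for nid, queued in queues.items() if queued == 0]
print(idle)
```

      On a router, the same function could be fed the contents of /proc/sys/lnet/peers once a second during a run, logging a timestamp whenever a peer that had queued bytes falls back to zero.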

      For LNet settings: with mismatched settings (i.e. ko2iblnd settings that aren't the same), LNet router OPA <-> compute/storage node OPA would basically always give me errors. With matched and 'Intel optimized' settings I've not yet seen it fail. Ethernet routing to OPA also seems to work fine.

      We have the QIB's LNet configuration set to the same as the other nodes on the QIB fabric. I'll attach the config to this ticket in case we have some settings incorrectly applied to one of the IB nets.
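
      Since matched versus mismatched ko2iblnd settings made the difference on the OPA side, a quick way to spot a mismatch is to diff the module parameters of two nodes. This is a minimal sketch, assuming the loaded module exposes its parameters under /sys/module/ko2iblnd/parameters (the standard sysfs location for kernel module parameters). The helper names and the example values are hypothetical and do not come from the attached configs.

```python
from pathlib import Path

# Standard sysfs location for a loaded module's parameters.
PARAM_DIR = Path("/sys/module/ko2iblnd/parameters")


def read_module_params(param_dir=PARAM_DIR):
    """Read every parameter file in param_dir into a {name: value} dict."""
    return {p.name: p.read_text().strip()
            for p in sorted(param_dir.iterdir()) if p.is_file()}


def diff_params(a, b):
    """Return {name: (value_a, value_b)} for parameters that differ.

    A parameter present on only one side shows up with None on the other.
    """
    return {name: (a.get(name), b.get(name))
            for name in sorted(set(a) | set(b))
            if a.get(name) != b.get(name)}


# Illustrative values only, standing in for two nodes' settings.
router = {"peer_credits": "128", "concurrent_sends": "256", "map_on_demand": "32"}
client = {"peer_credits": "8", "concurrent_sends": "8", "map_on_demand": "32"}

mismatches = diff_params(router, client)
print(mismatches)
```

      Running read_module_params() on each node and comparing the results gives a short list of parameters to reconcile before re-running the selftest.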

      Are there any special settings we need to apply when trying routing between old & new 'Truescale' fabrics?

      Shortened example of failed selftest:

      [root@gstar057 ~]# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.77@o2ib /lfs/data0/lst-bench.sh
      LST_SESSION = 755
      SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
      192.168.55.77@o2ib are added to session
      192.168.44.199@o2ib44 are added to session
      Test was added successfully
      bulk_read is running now
      Capturing statistics for 300 secs [LNet Rates of lfrom]
      [R] Avg: 3163     RPC/s Min: 3163     RPC/s Max: 3163     RPC/s
      [W] Avg: 1580     RPC/s Min: 1580     RPC/s Max: 1580     RPC/s
      [LNet Bandwidth of lfrom]
      [R] Avg: 1581.81  MiB/s Min: 1581.81  MiB/s Max: 1581.81  MiB/s
      [W] Avg: 0.24     MiB/s Min: 0.24     MiB/s Max: 0.24     MiB/s
      
      etc...
      
      [LNet Bandwidth of lfrom]
      [R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
      [W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
      [LNet Rates of lto]
      [R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
      [W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
      [LNet Bandwidth of lto]
      [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
      [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
      
      lfrom:
      12345-192.168.55.77@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 18 errors, 0 dropped, 94 expired]
      Total 1 error nodes in lfrom
      lto:
      Total 0 error nodes in lto
      Batch is stopped
      session is ended
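
      For multi-hour runs it is easier to scan the captured output for the stall than to read it by eye. The following is a minimal sketch that parses the statistics lines in the format shown above and reports which bandwidth samples fell below a threshold. The function names and the 1 MiB/s threshold are assumptions for illustration.

```python
import re

# Section headers such as "[LNet Bandwidth of lfrom]".
HEADER = re.compile(r"\[LNet (Rates|Bandwidth) of (\w+)\]")
# Stat lines such as "[R] Avg: 1581.81  MiB/s Min: ...".
STAT_LINE = re.compile(r"\[(R|W)\]\s+Avg:\s+([\d.]+)\s+(RPC/s|MiB/s)")


def parse_lst_stat(text):
    """Return (group, kind, direction, avg) tuples in input order."""
    samples = []
    group = kind = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            kind, group = header.group(1), header.group(2)
            continue
        stat = STAT_LINE.search(line)
        if stat and group is not None:
            samples.append((group, kind, stat.group(1), float(stat.group(2))))
    return samples


def stalled_reads(samples, group="lfrom", min_mib_s=1.0):
    """Indices of read-bandwidth samples for group below min_mib_s."""
    reads = [avg for g, kind, direction, avg in samples
             if g == group and kind == "Bandwidth" and direction == "R"]
    return [i for i, avg in enumerate(reads) if avg < min_mib_s]


# Lines copied from the run above.
capture = """\
Capturing statistics for 300 secs [LNet Rates of lfrom]
[R] Avg: 3163     RPC/s Min: 3163     RPC/s Max: 3163     RPC/s
[W] Avg: 1580     RPC/s Min: 1580     RPC/s Max: 1580     RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 1581.81  MiB/s Min: 1581.81  MiB/s Max: 1581.81  MiB/s
[W] Avg: 0.24     MiB/s Min: 0.24     MiB/s Max: 0.24     MiB/s
[LNet Bandwidth of lfrom]
[R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
"""

print(stalled_reads(parse_lst_stat(capture)))  # prints [1]
```

      The index of the first stalled sample, multiplied by the stat interval, gives a rough offset into the run to line up against router logs.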
      

      A bit of help with this would be really appreciated. Let me know which logs would be the most helpful, e.g. repeating tests with debug flags enabled can be done if that helps. I could certainly have made a configuration error, so if something doesn't look right with the lnet.conf, let me know. We can't seem to find any ko2iblnd settings that are reliable.

      Cheers,
      Simon

      Attachments

        1. lnet.conf
          2 kB
        2. lnet-tests-21_aug_2018.txt
          12 kB
        3. lnetctl_export_lnet-router.txt
          16 kB
        4. lnetctl_export_opa.txt
          10 kB
        5. lnetctl_export_qlogic.txt
          4 kB
        6. dk.log.xz
          6.23 MB

          People

            Assignee: Amir Shehata (ashehata, Inactive)
            Reporter: SC Admin (scadmin)