Lustre / LU-11972

Lustre IB client always hangs when memory size is smaller than 40GB


Details

    • Question/Request
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • None
    • None
    • [VM]
      CentOS Linux release 7.5.1804 (Core)
      3.10.0-862.14.4.el7.x86_64
      CPU: Intel Xeon Processor (Skylake) 6 cores
      MemTotal : 40GB
      mlx5_core: 4.4-2.0.7 / 4.5-1.0.1.0
      Lustre : 2.10.5 / 2.10.6

    Description

      I created a VM with an IB interface, and the Lustre client mounts without problems.

      We tested Lustre client I/O in the VM (dd if=/dev/zero of=/mnt/lustre/testfile bs=1M).
      However, the Lustre client over IB always hangs when the VM has only 40GB of memory.

      The issue cannot be reproduced when the VM has 80GB of memory.

      We tested with mlx5_core 4.4-2.0.7 and 4.5-1.0.1.0, and with Lustre 2.10.5 and 2.10.6;
      they all show the same problem when the VM has only 40GB of memory.
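
      For anyone trying to reproduce this, the dd workload can be approximated by a small bounded writer. This is only a sketch: the function name write_zeros and the count parameter are assumptions added here so the run terminates (the dd command in this report has no count), and the target path is whatever file on the Lustre mount you choose.

```python
import os


def write_zeros(path, block_size=1024 * 1024, count=16):
    """Write `count` zero-filled blocks of `block_size` bytes to `path`.

    Roughly equivalent to: dd if=/dev/zero of=<path> bs=1M count=<count>
    (the count bound is an assumption; the original command has none).
    """
    block = bytes(block_size)  # a buffer of zero bytes
    written = 0
    with open(path, "wb") as f:
        for _ in range(count):
            written += f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the filesystem
    return written
```

      Calling write_zeros("/mnt/lustre/testfile", count=1024) would write 1 GiB to the path used in this report; on an affected client the call would be expected to stall rather than return.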

      The VM syslog prints these messages (see the attached file ib_err.txt):
      mlx5_0:dump_cqe:285:(pid 1854): dump error cqe

      .....

      LustreError: 1854:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8bafb2976400
      Jan 21 17:15:14 slurm-vm-1 kernel: Lustre: 1860:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error:
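
      When scanning a long syslog such as ib_err.txt, it can help to pull the status out of each client_bulk_callback() line. The sketch below assumes the status field follows the usual kernel convention of a negative errno value, so -5 maps to EIO; the regular expression is written against the single line quoted above, and the helper name parse_bulk_error is hypothetical.

```python
import errno
import re

# Pattern for the client_bulk_callback() line format quoted in this report.
BULK_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<src>[\w.]+):(?P<line>\d+):client_bulk_callback\(\)\) "
    r"event type (?P<etype>\d+), status (?P<status>-?\d+), desc (?P<desc>[0-9a-f]+)"
)


def parse_bulk_error(line):
    """Return a dict describing a client_bulk_callback() log line, or None."""
    m = BULK_RE.search(line)
    if m is None:
        return None
    status = int(m.group("status"))
    # Assumption: a negative status is a negated Linux errno value.
    name = errno.errorcode.get(-status, "unknown") if status < 0 else None
    return {
        "pid": int(m.group("pid")),
        "event_type": int(m.group("etype")),
        "status": status,
        "errno_name": name,
        "desc": m.group("desc"),
    }


sample = (
    "LustreError: 1854:0:(events.c:199:client_bulk_callback()) "
    "event type 1, status -5, desc ffff8bafb2976400"
)
info = parse_bulk_error(sample)
```

      For the line in this report, info["errno_name"] is "EIO", a generic I/O error, which is consistent with the bulk transfer failing at the network layer but does not by itself identify the cause.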

       

      We received the following response from Mellanox:

      >>Our RnD review the syslog and checked code, they give conclusion below, FYI. I think this issue is related with uplevel Lustre design, you can open defect for their community to fix.
      >>Parsing CQE shows that it is local protection error – ERR_EXE_BIND_GAHER_TPT.
      >>If an application gets a local protection error, in most cases it means that it is using wrong/insufficient mkey in the WR.
      >>In this case the application is Lustre - which is open source and maintained by the community, not Mellanox.

       

      Attachments

        1. dbg1.tgz
          1.04 MB
        2. ib_err.txt
          17 kB


          People

            pjones Peter Jones
            sebg-crd-pm sebg-crd-pm (Inactive)