Lustre / LU-15116

Crash when writing files in parallel on LTS Lustre version

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.12.7
    • None
    • CentOS 7.9 servers with Rocky Linux 8.4 clients. InfiniBand network using HP-branded Mellanox ConnectX-3 Pro cards.
    • 3

    Description

      We are attempting to use Lustre as a parallel file system for StarCCM+, which can write large files in parallel. We have a combined MGS/MDS server and a single OSS. All file systems are ZFS-backed.

      Based on the compatibility matrix for the LTS release (https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix), I'm using Lustre 2.12.7 with CentOS 7.9 on the servers and Rocky Linux 8.4 on the clients. So far I have only installed Lustre from the Whamcloud repositories. I have tried both the in-kernel driver version and the MOFED version for CentOS 7.9.

      Whenever we attempt to write a file in parallel (from StarCCM+), we get the following errors:

      [60662.673922] LustreError: 3073:0:(pack_generic.c:605:__lustre_unpack_msg()) message length 0 too small for magic/version check
      [60662.674340] LustreError: 3073:0:(pack_generic.c:605:__lustre_unpack_msg()) Skipped 348 previous similar messages
      [60662.674654] LustreError: 3073:0:(sec.c:2217:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-172.24.33.1@o2ib1 x1713655326257920
      [60662.675251] LustreError: 3073:0:(sec.c:2217:sptlrpc_svc_unwrap_request()) Skipped 348 previous similar messages
      [60721.038766] LustreError: 3023:0:(events.c:310:request_in_callback()) event type 2, status -103, service ost_io
      [60779.408713] LustreError: 3023:0:(events.c:310:request_in_callback()) event type 2, status -103, service ost_io
      [60837.774776] LustreError: 3023:0:(events.c:310:request_in_callback()) event type 2, status -103, service ost_io
      [60896.137635] LustreError: 3027:0:(events.c:310:request_in_callback()) event type 2, status -5, service ost_io
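
      For triage, the request_in_callback() lines above can be tallied by service and status. The following is only a rough sketch: the function name summarize_status is made up for illustration, and it assumes the log text keeps exactly the "status N, service NAME" layout shown above. On Linux, status -103 corresponds to ECONNABORTED and -5 to EIO.

```shell
# Tally request_in_callback() failures by service and status.
# Input: kernel log text on stdin (for example, the output of dmesg).
# Output: one line per pair, formatted as "<service> <status> <count>".
# summarize_status is a hypothetical helper name, not part of Lustre.
summarize_status() {
    awk '/request_in_callback\(\)\)/ {
        status = ""; service = ""
        for (i = 1; i <= NF; i++) {
            # The value follows the keyword; strip the trailing comma.
            if ($i == "status")  { status = $(i + 1); sub(/,$/, "", status) }
            if ($i == "service") { service = $(i + 1) }
        }
        if (status != "" && service != "") count[service " " status]++
    }
    END { for (k in count) print k, count[k] }' | LC_ALL=C sort
}
```

      It would be run as "dmesg | summarize_status" on the server that logged the errors.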
      
      

      StarCCM+ also supports a serial (non-parallel) write mode, which works some of the time but fails intermittently with the same error.

       

      All of the software related to Lustre, including ZFS, was installed from the Whamcloud repos (https://downloads.whamcloud.com/public/lustre/). One thing to note is that I had a lot of trouble getting the Lustre ZFS OSD kernel module to install. When I try to install the kmod-lustre-osd-zfs package, yum reports a long list of unresolved symbols and refuses to install it. I also tried building the module with DKMS using the lustre-zfs-dkms package, but it seems to be broken and does not build properly against the zfs-dkms and spl-dkms packages. I spent a fair bit of time going through the configure files it generated, and it appeared to be looking in the wrong directories for the ZFS and SPL source. In the end I forced the kmod-lustre-osd-zfs package to install using

      rpm -Uvh --nodeps $(repoquery --location kmod-lustre-osd-zfs)
      

      I do not receive any errors when I load the kernel modules, including the ZFS modules, and nothing in dmesg obviously points to missing symbols. I therefore suspect the dependency errors reported by yum are spurious, but I mention them here in case I missed something.
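
      One way to make the dmesg check above repeatable is to filter the kernel log for the messages the module loader prints when a symbol cannot be resolved. This is a minimal sketch, not a complete verification: find_symbol_errors is a hypothetical helper name, and it only matches the two standard kernel message forms ("Unknown symbol" and "disagrees about version of symbol").

```shell
# Print kernel log lines that indicate a module symbol problem.
# Input: kernel log text on stdin (for example, the output of dmesg).
# find_symbol_errors is a hypothetical helper name for illustration.
find_symbol_errors() {
    # grep exits non-zero when nothing matches; "|| true" keeps an
    # empty result from being treated as a failure by the caller.
    grep -E 'Unknown symbol|disagrees about version of symbol' || true
}
```

      Running "dmesg | find_symbol_errors" after loading the modules should print nothing if every symbol resolved.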

            People

              wc-triage WC Triage
              godbolt Bryan Godbolt
              Votes: 0
              Watchers: 3
