[LU-15116] crash when writing files in parallel on LTS lustre version Created: 15/Oct/21  Updated: 18/Oct/21  Resolved: 18/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Bryan Godbolt Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS 7.9 servers with Rocky Linux 8.4 clients. InfiniBand network using HP-branded ConnectX-3 Pro Mellanox cards.


Attachments: File kmod-lustre-osd-zfs unsatisfied dependencies.log    
Issue Links:
Related
is related to LU-14733 brw_bulk_ready() BRW bulk READ failed... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are attempting to use Lustre as a parallel file system for StarCCM+, which can write large files in parallel.  We have a combined MGS/MDS server and a single OSS.  All file systems are ZFS-backed.

Based on the compatibility matrix for the LTS release (https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix), I'm using Lustre 2.12.7 with CentOS 7.9 on the servers and Rocky Linux 8.4 on the clients.  So far I have only installed Lustre from the Whamcloud repositories.  I have tried both the in-kernel driver version and the MOFED version for CentOS 7.9.

Whenever we attempt to write a file in parallel (from StarCCM+) we get the following errors:

[60662.673922] LustreError: 3073:0:(pack_generic.c:605:__lustre_unpack_msg()) message length 0 too small for magic/version check
[60662.674340] LustreError: 3073:0:(pack_generic.c:605:__lustre_unpack_msg()) Skipped 348 previous similar messages
[60662.674654] LustreError: 3073:0:(sec.c:2217:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-172.24.33.1@o2ib1 x1713655326257920
[60662.675251] LustreError: 3073:0:(sec.c:2217:sptlrpc_svc_unwrap_request()) Skipped 348 previous similar messages
[60721.038766] LustreError: 3023:0:(events.c:310:request_in_callback()) event type 2, status -103, service ost_io
[60779.408713] LustreError: 3023:0:(events.c:310:request_in_callback()) event type 2, status -103, service ost_io
[60837.774776] LustreError: 3023:0:(events.c:310:request_in_callback()) event type 2, status -103, service ost_io
[60896.137635] LustreError: 3027:0:(events.c:310:request_in_callback()) event type 2, status -5, service ost_io

StarCCM+ also supports a serial (non-parallel) write mode, which works some of the time but fails intermittently with the same errors.
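For what it's worth, the failure can also be exercised without StarCCM+ by running several concurrent writers against the mount. The sketch below is a hypothetical reproducer, not what Star itself does: TARGET, the writer count, and the 64 MiB file size are all assumptions, and TARGET defaults to a local temp directory so the script is self-contained (point it at a directory on the Lustre mount to actually test):

```shell
# Hypothetical parallel-write reproducer (not StarCCM+'s actual I/O pattern).
# TARGET is an assumption: set it to a directory on the Lustre mount; it
# defaults to a throwaway local temp dir so the script runs anywhere.
TARGET="${TARGET:-$(mktemp -d)}"
NWRITERS=4

for i in $(seq 1 "$NWRITERS"); do
    # Each writer streams 64 MiB of zeros into its own file, in the background.
    dd if=/dev/zero of="$TARGET/writer.$i" bs=1M count=64 status=none &
done
wait   # let all writers finish before inspecting the results

ls -l "$TARGET"
```

If the bug is triggered, the ost_io errors shown above should appear in the server's dmesg while the writers run.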

 

All of the software related to Lustre was installed from the Whamcloud repos (https://downloads.whamcloud.com/public/lustre/), including ZFS.  One thing to note is that I had a lot of trouble getting the Lustre ZFS OSD kernel module to install.  When trying to install the kmod-lustre-osd-zfs package I got a long list of unresolved symbols, and yum would not allow me to install the package.  I also tried building it with DKMS using the lustre-zfs-dkms package, but it seems to be broken and does not build properly against the zfs-dkms and spl-dkms packages.  I spent a fair bit of time going through the configure files it generated, and it appeared to be looking in the wrong directories for the ZFS and SPL source.  In the end I forced the kmod-lustre-osd-zfs package to install using

rpm -Uvh --nodeps $(repoquery --location kmod-lustre-osd-zfs)

I do not receive any errors when I load the kernel modules, including the ZFS modules, and there are no errors in dmesg that obviously point to missing symbols.  Therefore, I suspect the yum dependency errors are spurious, but I mention them here in case I missed something.
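As a quick sanity check that the forced --nodeps install really did resolve at load time, the kernel log can be scanned for the usual missing-symbol signatures. The grep pattern below is an assumption about the exact messages the module loader emits ("Unknown symbol" and "disagrees about version of symbol" are the two common ones); empty grep output means nothing obviously broke:

```shell
# Scan the kernel log for the classic signatures of a kmod loaded against
# mismatched or missing symbols. The pattern is an assumption covering the
# two usual module-loader complaints.
PATTERN='unknown symbol|disagrees about version of symbol'
dmesg 2>/dev/null | grep -iE "$PATTERN" || echo "no unresolved-symbol errors logged"
```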



 Comments   
Comment by Andreas Dilger [ 18/Oct/21 ]

The error reported looks similar to LU-14733 on RHEL 8.4.

The patches from LU-14733 have already landed on b2_12 for the upcoming 2.12.8 release, but you could test out the b2_12 branch, for which the most recent build is at https://build.whamcloud.com/job/lustre-b2_12
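For reference, a client-side yum repo stanza pointing at that Jenkins job might look like the sketch below. The `<matrix-configuration>` segment is a placeholder, not a real path: browse the job page linked above and substitute the distro/arch/IB-stack combination that matches your nodes.

```
# Hypothetical test-build repo; replace <matrix-configuration> with the
# actual configuration directory shown on the Jenkins job page.
[lustre-b2_12-test]
name=Lustre b2_12 test build
baseurl=https://build.whamcloud.com/job/lustre-b2_12/<matrix-configuration>/lastSuccessfulBuild/artifact/artifacts/
enabled=0
gpgcheck=0
```

Keeping `enabled=0` and installing with `yum --enablerepo=lustre-b2_12-test ...` limits the test packages to explicit installs.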

Comment by Bryan Godbolt [ 18/Oct/21 ]

Hi Andreas,

Thanks very much for the prompt reply.  We have installed the version you mentioned and so far it has resolved the issue!

Generated at Sat Feb 10 03:15:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.