[LU-16907] sanity test_123f: crashed MDS with Max IOV exceeded: 257 should be < 256
Created: 16/Jun/23  Updated: 16/Jun/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15550 WBC: retry the batched RPC when the r... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/85009820-d429-41a1-8948-90b2f66d7f02

test_123f failed with the following error:

onyx-37vm9 crashed during sanity test_123f

LNetError: 191018:0:(socklnd_cb.c:1036:ksocknal_send()) ASSERTION( tx->tx_nkiov <= 256 ) failed: 
LNetError: 191018:0:(socklnd_cb.c:1036:ksocknal_send()) LBUG
Pid: 191018, comm: mdt_out00_001 4.18.0-425.10.1.el8_lustre.x86_64 #1 SMP Wed May 3 16:22:26 UTC 2023
Call Trace TBD:
 libcfs_call_trace+0x6f/0xa0 [libcfs]
 lbug_with_loc+0x3f/0x70 [libcfs]
 ksocknal_send+0x27a/0x320 [ksocklnd]
 lnet_ni_send+0x4c/0xe0 [lnet]
 lnet_send+0xae/0x1e0 [lnet]
 LNetPut+0x318/0x940 [lnet]
 ptl_send_buf+0x208/0x5a0 [ptlrpc]
 ptlrpc_send_reply+0x2ad/0x8d0 [ptlrpc]
 target_send_reply+0x328/0x7d0 [ptlrpc]
 tgt_request_handle+0xe85/0x1920 [ptlrpc]
 ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
 ptlrpc_main+0xc52/0x1510

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/94706 - 5.4.0-131-generic
servers: https://build.whamcloud.com/job/lustre-reviews/94706 - 4.18.0-425.10.1.el8_lustre.x86_64

This crash has been seen about 10 times since 2023-05-09, after patch https://review.whamcloud.com/46540 "LU-15550 ptlrpc: retry mechanism for overflowed batched RPCs" landed, but I'm not sure whether it is directly related.
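
For illustration only, here is a minimal sketch of how a reply buffer could end up needing 257 fragments against the 256 bound in the assertion above. It assumes 4 KiB pages and one fragment per page touched. The helper name and constants are hypothetical and are not taken from the Lustre source, and this is not a confirmed root cause for this crash.

```c
#include <stddef.h>

/* Assumptions for illustration: 4 KiB pages and a 256-fragment limit,
 * matching the bound checked by the failed assertion. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_MAX_IOV   256UL

/* Count the pages touched by a buffer of `len` bytes that starts
 * `offset` bytes into its first page (hypothetical helper). */
static unsigned long pages_spanned(unsigned long offset, unsigned long len)
{
	if (len == 0)
		return 0;
	offset %= SKETCH_PAGE_SIZE;
	/* Round the end of the buffer up to the next page boundary. */
	return (offset + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Return 1 if the buffer fits within the fragment limit, 0 otherwise. */
static int fits_in_one_send(unsigned long offset, unsigned long len)
{
	return pages_spanned(offset, len) <= SKETCH_MAX_IOV;
}
```

Under these assumptions, a page-aligned 1 MiB buffer spans exactly 256 pages and fits. The same 1 MiB starting 512 bytes into a page, or a buffer of 1 MiB plus one byte, spans 257 pages and would trip a check of this kind.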

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_123f - onyx-37vm9 crashed during sanity test_123f



 Comments   
Comment by Andreas Dilger [ 16/Jun/23 ]

One of the failures was marked with LU-15550, but I'm not sure if that is the cause.
