[LU-5533] Wrong buffer for field `dlm_rep' (1 of 1) in format `LDLM_INTENT_GETATTR' Created: 22/Aug/14  Updated: 11/May/15  Resolved: 12/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Liang Zhen (Inactive) Assignee: Li Wei (Inactive)
Resolution: Duplicate Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-5528 Race - connect vs resend Resolved
Severity: 3
Rank (Obsolete): 15405

 Description   

When we simulate message drops for portal 17 (LDLM_CANCEL_REQUEST_PORTAL) and portal 18 (LDLM_CANCEL_REPLY_PORTAL), I saw this failure on the client, and the application failed.

LustreError: 13507:0:(layout.c:2042:__req_capsule_get()) @@@ Wrong buffer for field `dlm_rep' (1 of 1) in format `LDLM_INTENT_GETATTR': 0 vs. 112 (server)
  req@ffff880ecfc8c000 x1476486707242176/t0(0) o101->soaked-MDT0000-mdc-ffff881029559c00@192.168.1.108@o2ib:12/10 lens 576/192 e 0 to 0 dl 1408692442 ref 1 fl Complete:R/2/0 rc 0/0
LustreError: 13507:0:(file.c:3238:ll_inode_revalidate_fini()) soaked: revalidate FID [0x3800004c4:0x79:0x0] error: rc = -71
LustreError: 11-0: soaked-MDT0000-mdc-ffff881029559c00: Communicating with 192.168.1.108@o2ib, operation mds_reint failed with -107.
Lustre: soaked-MDT0000-mdc-ffff881029559c00: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: soaked-MDT0000-mdc-ffff881029559c00: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
LustreError: 13506:0:(llite_lib.c:1522:ll_md_setattr()) md_setattr fails: rc = -5
LustreError: 13508:0:(llite_lib.c:1522:ll_md_setattr()) md_setattr fails: rc = -108
LustreError: 13509:0:(llite_lib.c:1522:ll_md_setattr()) md_setattr fails: rc = -108

The application is quite simple: an MPI program that repeatedly calls fstat and fchmod on two nodes.
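
For readers who want a feel for the workload, here is a minimal single-process sketch of the fstat/fchmod loop. It is not the original reproducer: the MPI launch across two nodes is omitted, and the function name `fstat_fchmod_loop`, the iteration count, and the alternating mode bits are all assumptions made for illustration.

```python
import os


def fstat_fchmod_loop(path, iterations=1000):
    """Repeatedly fstat and fchmod one open file.

    Hypothetical stand-in for the MPI test program: each rank in the
    original would presumably run a loop like this against a shared file.
    Returns the permission bits seen by the final fstat.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for i in range(iterations):
            os.fstat(fd)  # attribute fetch on the open descriptor
            # Alternate between two modes so every fchmod changes something.
            mode = 0o644 if i % 2 == 0 else 0o600
            os.fchmod(fd, mode)
        return os.fstat(fd).st_mode & 0o777
    finally:
        os.close(fd)
```

Running this against a file on the Lustre mount from two clients at once would approximate the described load, under the assumptions above.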



 Comments   
Comment by Li Wei (Inactive) [ 28/Aug/14 ]

http://review.whamcloud.com/11629

Comment by Christopher Morrone [ 01/Oct/14 ]

We have seen this error in production using 2.4.2-14chaos (see github.com/chaos/lustre). We will need a fix for b2_5 (we are going to transition to b2_5 in a few weeks).

Comment by Christopher Morrone [ 02/Oct/14 ]

We have additionally seen the error message with the LDLM_ENQUEUE_LVB name instead of LDLM_INTENT_GETATTR. Am I correct in assuming that is the same problem?

Comment by Li Wei (Inactive) [ 03/Oct/14 ]

Chris, if the request corresponding to the LDLM_ENQUEUE_LVB name had a 192-byte reply length (please look for "lens ???/192"), I would assume that was due to the same problem (i.e., a ping being mistaken for a normal reply). I'll get the patch going as soon as my vacation is over.

Comment by Li Wei (Inactive) [ 03/Oct/14 ]

Oops, s/a ping/an early reply/.
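
The "lens ???/192" check can be scripted when scanning many console logs. The sketch below pulls the two numbers out of a `req@...` line and flags a 192-byte reply length. The helper names `parse_lens` and `has_suspect_reply_len` are hypothetical, and reading the first number as the request length is an assumption; only the 192-byte reply length comes from this ticket.

```python
import re

# Matches the "lens <request>/<reply>" field of a req@... console line.
LENS_RE = re.compile(r"\blens (\d+)/(\d+)\b")


def parse_lens(log_line):
    """Return (request_len, reply_len) from a log line, or None if absent."""
    match = LENS_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_suspect_reply_len(log_line, suspect_reply_len=192):
    """True if the line reports the reply length discussed in this ticket."""
    lens = parse_lens(log_line)
    return lens is not None and lens[1] == suspect_reply_len
```

A match only suggests the same symptom; it does not by itself confirm that an early reply was mistaken for a normal reply.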

Comment by Li Wei (Inactive) [ 08/Oct/14 ]

The patch has been updated with a problem description.

Comment by Christopher Morrone [ 23/Oct/14 ]

Yes, for LDLM_ENQUEUE_LVB it is 328/192. Here is a console snippet:

2014-09-23 18:56:43 Lustre: 5696:0:(client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 51 > total measured time 10
2014-09-23 18:56:43 LustreError: 5696:0:(layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `dlm_rep' (1 of 1) in format `LDLM_ENQUEUE_LVB': 0 vs. 112 (server)
2014-09-23 18:56:43   req@ffff880c1422a400 x1476341002572612/t0(0) o101->lse-OST0047-osc-ffff880c2690a800@172.19.1.240@o2ib100:28/4 lens 328/192 e 0 to 0 dl 1411523899 ref 1 fl Interpret:R/2/0 rc 0/0
2014-09-23 18:56:43 Lustre: 5698:0:(client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 51 > total measured time 35

Comment by Li Wei (Inactive) [ 12/Nov/14 ]

Let's use Alexander's patch under LU-5528 instead.
