[LU-5533] Wrong buffer for field `dlm_rep' (1 of 1) in format `LDLM_INTENT_GETATTR' Created: 22/Aug/14 Updated: 11/May/15 Resolved: 12/Nov/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Liang Zhen (Inactive) | Assignee: | Li Wei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 15405 | ||||||||
| Description |
|
When we simulated message drops for portal 17 (LDLM_CANCEL_REQUEST_PORTAL) and portal 18 (LDLM_CANCEL_REPLY_PORTAL), I saw this failure on the client, and the application failed.

LustreError: 13507:0:(layout.c:2042:__req_capsule_get()) @@@ Wrong buffer for field `dlm_rep' (1 of 1) in format `LDLM_INTENT_GETATTR': 0 vs. 112 (server)
req@ffff880ecfc8c000 x1476486707242176/t0(0) o101->soaked-MDT0000-mdc-ffff881029559c00@192.168.1.108@o2ib:12/10 lens 576/192 e 0 to 0 dl 1408692442 ref 1 fl Complete:R/2/0 rc 0/0
LustreError: 13507:0:(file.c:3238:ll_inode_revalidate_fini()) soaked: revalidate FID [0x3800004c4:0x79:0x0] error: rc = -71
LustreError: 11-0: soaked-MDT0000-mdc-ffff881029559c00: Communicating with 192.168.1.108@o2ib, operation mds_reint failed with -107.
Lustre: soaked-MDT0000-mdc-ffff881029559c00: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: soaked-MDT0000-mdc-ffff881029559c00: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
LustreError: 13506:0:(llite_lib.c:1522:ll_md_setattr()) md_setattr fails: rc = -5
LustreError: 13508:0:(llite_lib.c:1522:ll_md_setattr()) md_setattr fails: rc = -108
LustreError: 13509:0:(llite_lib.c:1522:ll_md_setattr()) md_setattr fails: rc = -108

The application is quite simple: an MPI program that repeatedly calls fstat and fchmod on two nodes. |
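For triage, the key fields of the `Wrong buffer` message and the return codes in the log above can be pulled out with a small script. This is only a sketch: `parse_wrong_buffer` and `errno_name` are hypothetical helper names, the errno table holds the standard Linux values for the codes seen here, and reading the two numbers in `0 vs. 112` as "length found vs. length wanted" is an assumption about the message, not something this ticket states.

```python
import re

# Linux errno names for the return codes that appear in the log above.
LINUX_ERRNO = {5: "EIO", 71: "EPROTO", 107: "ENOTCONN", 108: "ESHUTDOWN"}

# Matches: Wrong buffer for field `X' (i of n) in format `Y': a vs. b (side)
WRONG_BUFFER_RE = re.compile(
    r"Wrong buffer for field `(?P<field>\w+)' "
    r"\((?P<index>\d+) of (?P<count>\d+)\) "
    r"in format `(?P<format>\w+)': "
    r"(?P<found>\d+) vs\. (?P<wanted>\d+) \((?P<side>\w+)\)"
)


def parse_wrong_buffer(line):
    """Return the fields of a 'Wrong buffer' message as a dict, or None."""
    match = WRONG_BUFFER_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields; 'found'/'wanted' naming is an assumption.
    for key in ("index", "count", "found", "wanted"):
        info[key] = int(info[key])
    return info


def errno_name(rc):
    """Map a negative return code such as -71 to its Linux errno name."""
    return LINUX_ERRNO.get(abs(rc), "unknown")
```

Applied to the first log line above, this yields field `dlm_rep`, format `LDLM_INTENT_GETATTR`, and the sizes 0 and 112; `errno_name(-71)` gives `EPROTO`.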
| Comments |
| Comment by Li Wei (Inactive) [ 28/Aug/14 ] |
| Comment by Christopher Morrone [ 01/Oct/14 ] |
|
We have seen this error in production using 2.4.2-14chaos (see github.com/chaos/lustre). We will need a fix for b2_5 (we are going to transition to b2_5 in a few weeks). |
| Comment by Christopher Morrone [ 02/Oct/14 ] |
|
We have additionally seen the error message with the LDLM_ENQUEUE_LVB name instead of LDLM_INTENT_GETATTR. Am I correct in assuming that is the same problem? |
| Comment by Li Wei (Inactive) [ 03/Oct/14 ] |
|
Chris, if the request corresponding to the LDLM_ENQUEUE_LVB name had a 192-byte reply length (please look for "lens ???/192"), I would assume that was due to the same problem (i.e., a ping being mistaken for a normal reply). I'll get the patch going as soon as my vacation is over. |
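The check described here (look for `lens ???/192` on the failing request) can be scripted when scanning many console logs. A minimal sketch, assuming the `Wrong buffer` text and the `req@...` dump have been joined into one string; `reply_len` and `looks_like_same_problem` are hypothetical helper names, and 192 is the reply length observed in this ticket, not a general constant.

```python
import re

# Matches the "lens <request>/<reply>" pair in a request dump.
LENS_RE = re.compile(r"\blens (?P<req>\d+)/(?P<rep>\d+)\b")

# Reply length seen in this ticket; treat it as an assumption elsewhere.
SUSPECT_REPLY_LEN = 192


def reply_len(text):
    """Return the reply length from 'lens req/rep', or None if absent."""
    match = LENS_RE.search(text)
    return int(match.group("rep")) if match else None


def looks_like_same_problem(text, suspect=SUSPECT_REPLY_LEN):
    """True if text has a 'Wrong buffer' error and the suspect reply length."""
    return "Wrong buffer for field" in text and reply_len(text) == suspect
```

A match is only a hint that the message belongs to the same problem; it does not by itself confirm the cause.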
| Comment by Li Wei (Inactive) [ 03/Oct/14 ] |
|
Oops, s/a ping/an early reply/. |
| Comment by Li Wei (Inactive) [ 08/Oct/14 ] |
|
The patch has been updated with a problem description. |
| Comment by Christopher Morrone [ 23/Oct/14 ] |
|
Yes, for LDLM_ENQUEUE_LVB it is 328/192. Here is a console snippet:

2014-09-23 18:56:43 Lustre: 5696:0:(client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 51 > total measured time 10
2014-09-23 18:56:43 LustreError: 5696:0:(layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `dlm_rep' (1 of 1) in format `LDLM_ENQUEUE_LVB': 0 vs. 112 (server)
2014-09-23 18:56:43 req@ffff880c1422a400 x1476341002572612/t0(0) o101->lse-OST0047-osc-ffff880c2690a800@172.19.1.240@o2ib100:28/4 lens 328/192 e 0 to 0 dl 1411523899 ref 1 fl Interpret:R/2/0 rc 0/0
2014-09-23 18:56:43 Lustre: 5698:0:(client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 51 > total measured time 35 |
| Comment by Li Wei (Inactive) [ 12/Nov/14 ] |
|
Let's use Alexander's patch under |