[LU-3306] layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (6 of 8) in format `LDLM_INTENT_OPEN': 3 vs. 0 (client) Created: 09/May/13 Updated: 16/Oct/13 Resolved: 16/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.5.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 8192 |
| Description |
|
Fresh master checkout, hit this in racer: [61845.122433] LustreError: 7885:0:(pack_generic.c:770:lustre_msg_string()) can't unpack short string in msg ffffc90006c54ff0 buffer[6] len 3: strlen 1 [61845.131556] LustreError: 7885:0:(layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (6 of 8) in format `LDLM_INTENT_OPEN': 3 vs. 0 (client) [61845.131557] req@ffff88003c01e7f0 x1434586131710596/t0(0) o101->13783025-666d-77a1-7c55-b4fef65827f1@0@lo:0/0 lens 576/0 e 0 to 0 dl 1368130054 ref 1 fl Interpret:/0/ffffffff rc 0/-1 [61845.133834] LustreError: 7885:0:(mdt_handler.c:1727:mdt_reint_internal()) Can't unpack reint, rc -14 [61845.134310] LustreError: 7885:0:(lustre_mdt.h:66:err_serious()) ASSERTION( -rc < ESERIOUS ) failed: [61845.134747] LustreError: 7885:0:(lustre_mdt.h:66:err_serious()) LBUG [61845.134992] Pid: 7885, comm: mdt00_008 [61845.135199] [61845.135199] Call Trace: [61845.135558] [<ffffffffa0af28a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [61845.135817] [<ffffffffa0af2ea7>] lbug_with_loc+0x47/0xb0 [libcfs] [61845.136103] [<ffffffffa05c32eb>] mdt_enqueue+0x10b/0x110 [mdt] [61845.140089] [<ffffffffa05b4d28>] mdt_handle_common+0x648/0x1660 [mdt] [61845.140375] [<ffffffffa05f0f25>] mds_regular_handle+0x15/0x20 [mdt] [61845.140681] [<ffffffffa12a66c8>] ptlrpc_server_handle_request+0x3a8/0xc70 [ptlrpc] [61845.141158] [<ffffffffa0af35ee>] ? cfs_timer_arm+0xe/0x10 [libcfs] [61845.141448] [<ffffffffa0b04e9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] [61845.142006] [<ffffffffa129de11>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [61845.142386] [<ffffffff81054613>] ? __wake_up+0x53/0x70 [61845.143683] [<ffffffffa12a79d5>] ptlrpc_main+0xa45/0x1650 [ptlrpc] [61845.143981] [<ffffffffa12a6f90>] ? ptlrpc_main+0x0/0x1650 [ptlrpc] [61845.144265] [<ffffffff8100c10a>] child_rip+0xa/0x20 [61845.144542] [<ffffffffa12a6f90>] ? ptlrpc_main+0x0/0x1650 [ptlrpc] [61845.144854] [<ffffffffa12a6f90>] ? ptlrpc_main+0x0/0x1650 [ptlrpc] [61845.145134] [<ffffffff8100c100>] ? child_rip+0x0/0x20 [61845.145399] [61845.220106] Kernel panic - not syncing: LBUG Crashdump and modules are in /exports/crashdumps/192.168.10.210-2013-05-09-16\:06\:54/ |
| Comments |
| Comment by Oleg Drokin [ 12/May/13 ] |
|
Had another identical crash happen, so I guess this is more than just some sort of random memory corruption. |
| Comment by Oleg Drokin [ 13/May/13 ] |
|
and another one, /exports/crashdumps/192.168.10.220-2013-05-13-06\:05\:21/ |
| Comment by Oleg Drokin [ 18/May/13 ] |
|
Well, I am hitting this pretty regularly again and again, so I did some digging and here's what I uncovered: 00000080:00200000:3.0:1368694813.256766:0:21370:0:(dcache.c:401:ll_revalidate_it ()) VFS Op:name=13,intent=open ... 00000002:00010000:3.0:1368694813.266750:0:21370:0:(mdc_locks.c:1083:mdc_intent_lock()) (name: 13,[0x200000401:0x330:0x0]) in obj [0x200000400:0x1:0x0], intent: open flags 02304001 I found op_data for this request and it's: crash> p *(struct md_op_data *)0xffff88000b5d7df0
$3 = {
op_fid1 = {
f_seq = 8589935616,
f_oid = 1,
f_ver = 0
},
op_fid2 = {
f_seq = 8589935617,
f_oid = 816,
f_ver = 0
},
...
op_name = 0xffff8800b5076fd0 "2", <====
op_namelen = 2, <==== !!
...
So op_data has different name than what was reported in the log, but the length is the same. 00000080:00200000:1.0:1368694813.261175:0:21387:0:(namei.c:1306:ll_rename_generic()) VFS Op:oldname=13,src_dir=144115205255725057/33554436(ffff880048d9cb08),newname=2,tgt_dir=144115205255725057/33554436(ffff880048d9cb08) This is all thanks to us not storing the name in op_data, but merely a pointer to some other location (dentry in this case, but e.g. in 3270 there is a bug of similar nature where op_data points to sai entry that gets freed unexpectedly). So after sending this malformed request, on the server side our string check fails (can't unpack short string) because we expect string length 2, but got only 1, this in turn returns error to mdt_reint_internal: rc = mdt_reint_unpack(info, op);
if (rc != 0) {
CERROR("Can't unpack reint, rc %d\n", rc);
RETURN(err_serious(rc));
}
That sets err_serious on the rc. Now we return all the way through the stack to mdt_enqueue, and there: return rc ? err_serious(rc) : req->rq_status; So we call err_serious(rc) AGAIN which triggers the assertion. I think we should just remove the assertion and allow double err_serious setting, as otherwise we will need to go through all the callchains to ensure it is never called twice, yet everywhere where it's needed it should be called once for sure, which might be quite cumbersome. |
| Comment by Oleg Drokin [ 18/May/13 ] |
|
http://review.whamcloud.com/6383 is err_serious change that I think is the most important. Applying it prevents the server from crashing on data from network. |
| Comment by Jodi Levi (Inactive) [ 31/May/13 ] |
|
Are more patches needed than Change, 6383 that has landed to master? Or can this ticket be closed? |
| Comment by Jodi Levi (Inactive) [ 16/Oct/13 ] |
|
Closing ticket as patch landed to Master. Please let me know if more work is needed in this ticket and I will reopen. |