[LU-4222] Oops in mdt_dump_lmm+0x16/0x410 [mdt] Created: 07/Nov/13 Updated: 20/Jan/14 Resolved: 07/Jan/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Minh Diep | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4, sdsc | ||
| Severity: | 3 |
| Rank (Obsolete): | 11490 |
| Description |
|
Nov 6 16:38:07 lustre-mds-0-0 kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000001c We were trying to set up active/active MDS/MDT on a cluster with two MDSs and two MDTs. While trying to mount the 1.8.9 clients, we hit this panic on the MDS. Our goal was: start from a 1.8.x server -> upgrade to 2.4.1 -> backup and restore the single MDT to a new system with 1 MDT -> add another MDT on a different MDS as a remote MDT. The last step is to use tunefs.lustre to configure active/active HA on the MDS(s). |
| Comments |
| Comment by Peter Jones [ 07/Nov/13 ] |
|
Hi Bobijam, could you please look into this issue? Peter |
| Comment by Minh Diep [ 22/Nov/13 ] |
|
We hit this again: Nov 21 11:51:30 lustre-mds-0-0 kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000001c |
| Comment by Di Wang [ 23/Nov/13 ] |
|
It seems ma_lmm is NULL even though MA_LOV is set and ma_lmm_size > 0. I posted a debug patch, http://review.whamcloud.com/8377 ; Minh will try it later, and hopefully we can get more information. |
| Comment by Ned Bass [ 11/Dec/13 ] |
|
We started hitting this same oops on a production filesystem. MDS is stuck in a crash/reboot cycle. It was upgraded from 2.1 last week. Has there been any progress on this bug? |
| Comment by Ned Bass [ 11/Dec/13 ] |
|
Our tree is here: https://github.com/chaos/lustre/commits/2.4.0-19chaos The backtrace looks like this:
mdt_getattr_internal
mdt_getattr_name_lock
mdt_intent_getattr
mdt_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
mdt_enqueue
mdt_handle_common
mds_regular_handle
ptlrpc_server_handle_request
ptlrpc_main
I can't provide crash dumps, but I can carry out debugging procedures and post results. |
| Comment by Ned Bass [ 11/Dec/13 ] |
|
Interestingly, I just learned that two different filesystems started hitting this within minutes of each other. Perhaps a common client is sending problematic requests to both MDSs. On the other system, many of the NULL pointer derefs are preceded by a message: LustreError: 4367:0:(mdt_lvb.c:175:mdt_lvbo_fill()) ls6-MDT0000: expected 80 actual 0. |
| Comment by Ned Bass [ 11/Dec/13 ] |
|
I started the MDS with the debug patch and got this: ASSERTION( ma->ma_lmm_size != 0 && ma->ma_lmm != NULL ) failed: Invalid EA 80:(null) [0x1a89081cdcb:0x13b:0x0] |
| Comment by Minh Diep [ 11/Dec/13 ] |
|
Hi Ned, after it crashes there should be a debug log in /tmp. Could you please attach it? Thanks |
| Comment by Ned Bass [ 11/Dec/13 ] |
|
Hi Minh, sorry this system is classified so I can't provide debug logs. As a test, I modified mdt_getattr_internal() to log an error and return EFAULT if ma->ma_lmm == NULL. In that case, it logged the error for the same FID twice. I was able to use fid2path to get the file name, and I was able to read the file and get its striping EA info. But I don't know how safe it is to run with that patch. |
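A minimal sketch of the kind of guard described above, assuming it sits in mdt_getattr_internal() where the LOV EA is handled; the message text and the use of mdt_object_fid()/PFID here are illustrative, not the actual local patch:
        /* Hypothetical guard, roughly as described above: log and refuse the
         * getattr instead of dereferencing a NULL LOV EA buffer. */
        if ((ma->ma_valid & MA_LOV) && ma->ma_lmm == NULL) {
                CERROR("NULL ma_lmm but MA_LOV set for "DFID"\n",
                       PFID(mdt_object_fid(o)));
                RETURN(-EFAULT);
        }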
| Comment by Di Wang [ 11/Dec/13 ] |
|
Are there any 1.8 clients attached to this server? The reason for this LBUG, which happened at SDSC, is that some 1.8 clients send RPCs with an invalid format to the 2.4 server, which 2.4 does not handle well enough. |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
I don't think there are any 1.8 clients, but do you know of a way to determine that from the server side? |
| Comment by Di Wang [ 12/Dec/13 ] |
|
Hmm, how about checking the debug log to find out which client the RPC is from? I can probably provide a patch to stop the LBUG, but we still need to find out why the client sends an invalid RPC here. |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
Di, OK I'm trying to get a debug log. |
| Comment by Di Wang [ 12/Dec/13 ] |
|
Here is a patch to check the RPC on the MDT: http://review.whamcloud.com/8550 With this patch it might not LBUG again, but we still need to figure out why the client sends this RPC. |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
Hmm, it still crashes in the same place with patch 8550. My only guess is that it changes the ma->ma_lmm pointer in mdt_attr_get_lov(). |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
I moved the check from patch 8550 to just above the call to mdt_dump_lmm(). Now I get the debug message with the client uuid. On both MDSs the bad request came from the same x86_64 client cluster, which is running lustre-2.4.0-19chaos. |
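For reference, a rough sketch of how a check like that can name the sender; the mti_exp field and obd_uuid2str() usage are assumptions about the surrounding code, not the actual debug patch:
        /* Illustrative only: identify the client whose RPC left ma_lmm NULL */
        if (ma->ma_lmm == NULL)
                CDEBUG(D_ERROR, "RPC with NULL ma_lmm from client %s\n",
                       obd_uuid2str(&info->mti_exp->exp_client_uuid));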
| Comment by Ned Bass [ 12/Dec/13 ] |
|
I can reproduce the bad request by opening any file from some, but not all, nodes on that client cluster (graph). The open on the client fails with EFAULT and logs:
LustreError: ... ll_inode_revalidate_fini() ls6: revalidate FID [0x...] error: rc = -14
Dropping caches or remounting the filesystem doesn't fix it. Rebooting the client node does fix it. That client cluster was updated today, and used to run Lustre 2.1. We updated another client cluster (zin) to the same software levels today, but I don't see the problem from zin. So it seems there is something funny going on with graph. |
| Comment by Di Wang [ 12/Dec/13 ] |
|
There is a code snippet at the end of mdt_attr_get_complex():
int mdt_attr_get_complex(struct mdt_thread_info *info,
                         struct mdt_object *o, struct md_attr *ma)
        .....
        ma->ma_need = need;
        CDEBUG(D_INODE, "after getattr rc = %d, ma_valid = "LPX64" ma_lmm=%p\n",
               rc, ma->ma_valid, ma->ma_lmm);
        RETURN(rc);
Could you please tell me what you get when this -14 happens? |
| Comment by Di Wang [ 12/Dec/13 ] |
How did you check it there? You cannot check buffer->lb_buf anymore if you move this check above mdt_dump_lmm(), because buffer->lb_buf comes from thread_info, which will be reused by the following functions. |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
I checked ma->ma_lmm. I'll post the other debug information you requested tomorrow morning PST when I get back to the office. |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
Di, here is the CDEBUG message. mdt_attr_get_complex() after getattr rc = 0, ma_valid = 0x3 ma_lmm=(null) |
| Comment by Di Wang [ 12/Dec/13 ] |
|
Ned, please try this patch: http://review.whamcloud.com/#/c/8550/ It will not retrieve the lovEA if the client does not require it, and also adds some other debug information in case this does not resolve the issue. |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
Di, with that patch I get:
mdt_getattr_internal() ls4-MDT0000: RPC from <UUID>: does not need LOVEA
mdt_attr_get_lov() [<FID>] retrieve lovEA with (null):0
mdt_attr_get_lov() [<FID>] got lovEA with (null):80
This would have crashed, but I kept the check before mdt_dump_lmm() to log an error and return EFAULT if ma->ma_lmm is NULL. |
| Comment by Di Wang [ 12/Dec/13 ] |
|
Hmm, so the RPC does not need the LOVEA, but somehow the server is still trying to get it. |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
My only guess is that mdt_getattr_name_lock() sets MA_LOV in ma_need:
        if (try_layout) {
                child_bits |= MDS_INODELOCK_LAYOUT;
                /* try layout lock, it may fail to be granted due to
                 * contention at LOOKUP or UPDATE */
                if (!mdt_object_lock_try(info, child, lhc, child_bits,
                                         MDT_CROSS_LOCK)) {
                        child_bits &= ~MDS_INODELOCK_LAYOUT;
                        LASSERT(child_bits != 0);
                        rc = mdt_object_lock(info, child, lhc,
                                             child_bits, MDT_CROSS_LOCK);
                } else {
                        ma_need |= MA_LOV;
        ...
        rc = mdt_getattr_internal(info, child, ma_need); |
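One conceivable server-side mitigation, sketched only under the assumption that the reply capsule's RMF_MDT_MD size reflects what the client packed (this is not the 8550 patch or the final fix):
        /* Sketch: if the client packed no reply space for the LOV EA,
         * drop MA_LOV from ma_need instead of fetching an EA we cannot
         * return. */
        if (req_capsule_get_size(info->mti_pill, &RMF_MDT_MD,
                                 RCL_SERVER) == 0)
                ma_need &= ~MA_LOV;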
| Comment by Di Wang [ 12/Dec/13 ] |
|
Hmm, it seems to me cl_max_md_size is not being set correctly in time in some cases, and the client relies on it to pack the getattr RPC. I updated the patch, but it requires remounting the client. You said a remount can fix the problem? So after a remount you cannot reproduce the problem anymore? Anyway, can you try this patch? Thanks. |
| Comment by Ned Bass [ 12/Dec/13 ] |
|
After rebooting more clients, many did come up still having this problem. So we should be able to verify pretty easily if the patch works. |
| Comment by Ned Bass [ 13/Dec/13 ] |
|
Di, initial results are that the client-side patch does not fix the problem. |
| Comment by Di Wang [ 13/Dec/13 ] |
|
Hmm, I think there are two problems here |
| Comment by Ned Bass [ 13/Dec/13 ] |
|
Di, this client normally mounts 6 filesystems. The problem seems more likely to happen if 5 or more of those filesystems are mounted. I have yet to see it if I only mount 4 or fewer of the filesystems. I'm using about 100 client nodes to test so the evidence is pretty compelling. |
| Comment by Ned Bass [ 13/Dec/13 ] |
|
My test looks something like this, where <hostlist> contains about 100 nodes (pdsh is a distributed remote command invoker):
pdsh -w <hostlist> 'umount -a -t lustre ; for x in 1 2 3 4 5 6 ; do mount /p/lscratch$x ; done ; cat /p/lscratch5/bass6/x 2>&1'
If I mount all six filesystems, I always get some subset of nodes for which the cat command fails with 'Bad address'. If I mount only five, I usually but not always get failures. If I mount four or fewer, I never get failures. |
| Comment by Di Wang [ 13/Dec/13 ] |
|
Ned, could you try this patch
diff --git a/lustre/mdc/mdc_locks.c b/lustre/mdc/mdc_locks.c
index 28c07b4..4541f1a 100644
--- a/lustre/mdc/mdc_locks.c
+++ b/lustre/mdc/mdc_locks.c
@@ -434,6 +434,10 @@ static struct ptlrpc_request *mdc_intent_getattr_pack(struct obd_export *exp,
lit = req_capsule_client_get(&req->rq_pill, &RMF_LDLM_INTENT);
lit->opc = (__u64)it->it_op;
+ if (obddev->u.cli.cl_max_mds_easize == 0) {
+ CERROR("%s: cl_max_mds_easize is zero!\n", obddev->obd_name);
+ RETURN(ERR_PTR(-EINVAL));
+ }
/* pack the intended request */
mdc_getattr_pack(req, valid, it->it_flags, op_data,
obddev->u.cli.cl_max_mds_easize);
to see whether you can see the error message from the client side? |
| Comment by Ned Bass [ 13/Dec/13 ] |
|
I'll try it. Regarding my earlier comment, I do sometimes see failures with 3 or 4 mounts, but usually only on one client out of 100. So it seems to become more likely with more mounts. Also the order of mounts seems to matter. I don't get errors if I mount all six in reverse order :/ I can't figure out what the pattern is yet. |
| Comment by Ned Bass [ 13/Dec/13 ] |
|
No, the error does not show up on the client when I reproduce the bug. |
| Comment by Di Wang [ 13/Dec/13 ] |
|
Hmm, this probably means cl_max_md_size initialization is being delayed when there are multiple clients being mounted at the same time, since cl_max_md_size is updated when the OSC is connected. So we need to initialize cl_max_md_size synchronously during the mount process. |
| Comment by Di Wang [ 13/Dec/13 ] |
|
Hmm, could you please try this plus 8550:
diff --git a/lustre/mdc/mdc_request.c b/lustre/mdc/mdc_request.c
index cc592a7..f9ef390 100644
--- a/lustre/mdc/mdc_request.c
+++ b/lustre/mdc/mdc_request.c
@@ -189,14 +189,13 @@ static int mdc_getattr_common(struct obd_export *exp,
CDEBUG(D_NET, "mode: %o\n", body->mode);
- if (body->eadatasize != 0) {
- mdc_update_max_ea_from_body(exp, body);
-
- eadata = req_capsule_server_sized_get(pill, &RMF_MDT_MD,
- body->eadatasize);
- if (eadata == NULL)
- RETURN(-EPROTO);
- }
+ mdc_update_max_ea_from_body(exp, body);
+ if (body->eadatasize != 0) {
+ eadata = req_capsule_server_sized_get(pill, &RMF_MDT_MD,
+ body->eadatasize);
+ if (eadata == NULL)
+ RETURN(-EPROTO);
+ }
if (body->valid & OBD_MD_FLRMTPERM) {
|
| Comment by Di Wang [ 13/Dec/13 ] |
|
Hmm, I compared the lustre b2_4 code with lustre chaos 2.4 in mdc_intent_getattr_pack().
chaos lustre 2.4:
        /* pack the intended request */
        mdc_getattr_pack(req, valid, it->it_flags, op_data,
                         obddev->u.cli.cl_default_mds_easize);
my lustre b2_4 branch:
        /* pack the intended request */
        mdc_getattr_pack(req, valid, it->it_flags, op_data,
                         obddev->u.cli.cl_max_mds_easize);
Ned, could you please confirm that? Did someone change this code in chaos? |
| Comment by Di Wang [ 13/Dec/13 ] |
|
Oh, it seems from |
| Comment by Ned Bass [ 13/Dec/13 ] |
|
Yes, we are running with the |
| Comment by Ned Bass [ 13/Dec/13 ] |
|
Initial results with your latest patch to mdc_getattr_common() are good. The problem is not happening so far with that patch. |
| Comment by Di Wang [ 13/Dec/13 ] |
|
Hmm, the patch tries to get the max_md_size from the MDS initially, which is MAX_MD_SIZE (4 stripes), so initially it will set the default_md_size to 4 stripes instead of one:
        cli->cl_default_mds_easize =
                min_t(__u32, body->max_mdsize, PAGE_CACHE_SIZE);
I am not so sure this is the right way to fix this. |
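As background, the LOV EA that these sizes describe grows linearly with the stripe count; here is a standalone illustration of that relationship (the 32/24-byte constants are assumptions for the v1 layout, and this is not Lustre source):
#include <stdio.h>

/* Roughly: a LOV v1 EA is a fixed header (struct lov_mds_md_v1) plus one
 * per-stripe entry (struct lov_ost_data_v1). */
static unsigned int lov_v1_ea_size(unsigned int stripe_count)
{
        return 32 + stripe_count * 24;
}

int main(void)
{
        unsigned int n;

        for (n = 1; n <= 4; n++)
                printf("%u stripe(s): %u bytes\n", n, lov_v1_ea_size(n));
        return 0;
}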
| Comment by Di Wang [ 16/Dec/13 ] |
| Comment by Di Wang [ 17/Dec/13 ] |
|
http://review.whamcloud.com/8599 for master branch |
| Comment by Peter Jones [ 07/Jan/14 ] |
|
Fix landed for 2.6 |
| Comment by James Beal [ 08/Jan/14 ] |
|
Is it too late for this to hit 2.5.1? |
| Comment by Peter Jones [ 08/Jan/14 ] |
|
No, it is not too late, so this is still possible. |
| Comment by James Beal [ 08/Jan/14 ] |
|
Thank you, that would be appreciated... |