[LU-5294] mdd_unlink() returning -7 Created: 03/Jul/14 Updated: 14/Jul/14 Resolved: 14/Jul/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Blake Caldwell | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
RHEL 6.4, kernel 2.6.32_358.23.2.el6, |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14769 |
| Description |
|
Any client on this filesystem will get back -7 from an unlink or rm. Creates, reads, writes work fine. We tried multiple users, and found no difference. Upcall is successful:
+trace and +rpctrace debug logs were captured on the MDT while performing the unlink on a client (10.36.226.85@o2ib). I can upload a full debug log. This is easy to recreate and capture, so just let me know which debug flags would be useful.
00000040:00000001:19.0:1404405371.756302:0:13436:0:(llog_osd.c:317:llog_osd_declare_write_rec()) Process leaving (rc=0 : 0 : 0) ... ... |
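For readers decoding the return value: kernel-style return codes are negated errno values, and on Linux errno 7 is E2BIG. The helper below is a minimal userspace sketch of that mapping. The function name `rc_to_errno_name` and the short list of codes it knows are illustrative assumptions, not part of Lustre.

```c
#include <errno.h>

/*
 * Hypothetical helper (not a Lustre function): translate a negative
 * kernel-style return code into the matching errno symbol name.
 * Only a handful of common codes are covered in this sketch.
 */
static const char *rc_to_errno_name(int rc)
{
    switch (-rc) {
    case 0:      return "OK";
    case ENOENT: return "ENOENT";
    case EIO:    return "EIO";
    case E2BIG:  return "E2BIG";   /* errno 7 on Linux */
    case ENOMEM: return "ENOMEM";
    case EACCES: return "EACCES";
    case ENOSPC: return "ENOSPC";
    default:     return "UNKNOWN";
    }
}
```

Calling `rc_to_errno_name(-7)` on a Linux host yields the string "E2BIG", which is the code to search for when looking for the failing check.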
| Comments |
| Comment by James Nunez (Inactive) [ 03/Jul/14 ] |
|
Mike, Would you please take a look at this issue and comment? Thank you, |
| Comment by Oleg Drokin [ 03/Jul/14 ] |
|
Is this vanilla 2.4.3 (servers)? |
| Comment by Blake Caldwell [ 03/Jul/14 ] |
|
This is the same 2.4.3 as the rest of our production servers with several patches listed below. It is for any file on the filesystem. Wide striping is not in use (only 1 OST) on this filesystem. Any client (including 1.8) can reproduce this error.
|
| Comment by Oleg Drokin [ 03/Jul/14 ] |
|
Hm, kind of weird in that apparently whatever function returned -7 is not in your trace. I assume you do not get any messages in dmesg on the MDS either? Did this appear all of a sudden after you applied some patches, or did this combination of patches work until it stopped working? |
| Comment by Blake Caldwell [ 03/Jul/14 ] |
|
Nope, no dmesg output on mds. I'm attaching a complete log. This didn't correlate to any patch changes (we've been on this one for a couple months maybe now). We had strange things going on with DNS (outage this week) and syslog (was blocking on misconfigured server), but they have been resolved. I was initially looking for a correlation to an upcall result, but that part seems fine. |
| Comment by Oleg Drokin [ 03/Jul/14 ] |
|
Well, I assume you already tried to reboot the MDS and this did not clear the condition, so perhaps as a first step we should add a small patch that prints the ops address that returned the failure. Then we can trace where it came from and perhaps add more debug there if it's still unclear what's going on? |
| Comment by Blake Caldwell [ 03/Jul/14 ] |
|
We haven't rebooted the MDS yet, but since that sounds like the next step, we will plan on doing that. Probably Tuesday if we can hobble along in this mode for a bit. |
| Comment by Oleg Drokin [ 03/Jul/14 ] |
|
Is there anything else you will be able to do on the system before then? Also, can you grab an MDS log at -1 debug level while reproducing the error, on the off chance it catches more messages at some other level and sheds some more light on what's going on? |
| Comment by John Fuchs-Chesney (Inactive) [ 03/Jul/14 ] |
|
Thank you for picking this up Oleg. |
| Comment by Andreas Dilger [ 04/Jul/14 ] |
|
It may be that this is caused by the file having a too-large ACL xattr? I recall a similar problem in the past, maybe the patch is not landed on b2_4? Alternately, does this filesystem have wide striping enabled (for more than 160 OSTs)? |
| Comment by Matt Ezell [ 04/Jul/14 ] |
We don't use and typically disable ACLs on our file systems. This seems to be happening for all files.
This is our "link farm" file system. It houses symlinks for all users that point to one of the two large file systems. It's just 1 OST, so no wide striping. |
| Comment by Andreas Dilger [ 04/Jul/14 ] |
|
Can you please post the output of dumpe2fs -h /dev/mdtdev on the MDS. In particular, I'd like to check the features that are enabled on the filesystem. Also, is this problem specific to some files, or is this happening for all files? If it is specific to certain files, could you please run debugfs -c -R "stat /ROOT/path/to/file" /dev/mdtdev, where /path/to/file is just the part below the mountpoint (i.e. excluding the "/mnt/linkfarm" part, or whatever). |
| Comment by Blake Caldwell [ 07/Jul/14 ] |
|
Some updates after trying the above: The issue is still present after rebooting the MDS. Setting lnet.printk to -1 overwhelmed rsyslog and necessitated a reboot. We need to fix rsyslog to drop messages. Attached is a debug log with all options turned on in /proc/sys/lnet/debug. Also, as requested, I'm attaching the output from debugfs "stat $FILE" /dev/mdt and dumpe2fs -h /dev/mdtdev. The version of e2fsprogs is a little out of date (e2fsprogs-1.42.9.wc1-7.el6), so I noticed that no "parent" fid was given. Ideally it'd be best to wait until tomorrow if we need to update that version on the running nodes. |
| Comment by Oleg Drokin [ 09/Jul/14 ] |
|
Hm, still nothing additionally useful in the full debug log. I inspected the code again. lod_object.c:464 looks like this in my 2.4.3 tree:
rc = dt_declare_xattr_set(env, next, buf, name, fl, th);
RETURN(rc);
So the error must have come from dt_declare_xattr_set(), which is declared as:
static inline int dt_declare_xattr_set(const struct lu_env *env,
struct dt_object *dt,
const struct lu_buf *buf,
const char *name, int fl,
struct thandle *th)
{
LASSERT(dt);
LASSERT(dt->do_ops);
LASSERT(dt->do_ops->do_declare_xattr_set);
return dt->do_ops->do_declare_xattr_set(env, dt, buf, name, fl, th);
}
Now, I checked the code for all definitions of do_declare_xattr_set and there are only four:
static int osd_declare_xattr_set(const struct lu_env *env,
struct dt_object *dt,
const struct lu_buf *buf, const char *name,
int fl, struct thandle *handle)
{
struct osd_thandle *oh;
LASSERT(handle != NULL);
oh = container_of0(handle, struct osd_thandle, ot_super);
LASSERT(oh->ot_handle == NULL);
osd_trans_declare_op(env, oh, OSD_OT_XATTR_SET,
strcmp(name, XATTR_NAME_VERSION) == 0 ?
osd_dto_credits_noquota[DTO_ATTR_SET_BASE] :
osd_dto_credits_noquota[DTO_XATTR_SET]);
return 0;
}
Master version of osd_declare_xattr_set is defined a bit differently, but still always returns 0. |
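The earlier suggestion of printing the ops address that returned the failure can be illustrated with a small userspace model of the same function-pointer dispatch. This is only a sketch under stated assumptions: `struct obj`, `struct obj_ops`, `MODEL_MAX_XATTR`, the two ops tables and `traced_declare_xattr_set` are hypothetical stand-ins, not Lustre's `dt_object` API. One implementation always returns 0, like the `osd_declare_xattr_set` shown above; the other is an invented variant with a size check, included to show how a patched op would stand out in the trace.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define MODEL_MAX_XATTR 4096 /* assumed limit, for illustration only */

struct obj;

/* Simplified stand-in for an ops table; not Lustre's real structure. */
struct obj_ops {
    int (*declare_xattr_set)(struct obj *o, const char *name, size_t len);
};

struct obj {
    const struct obj_ops *ops;
};

/* Mirrors an implementation that unconditionally succeeds. */
static int always_ok_declare(struct obj *o, const char *name, size_t len)
{
    (void)o; (void)name; (void)len;
    return 0;
}

/* Hypothetical implementation that rejects oversized values. */
static int size_checking_declare(struct obj *o, const char *name, size_t len)
{
    (void)o; (void)name;
    return len > MODEL_MAX_XATTR ? -E2BIG : 0;
}

static const struct obj_ops model_ok_ops = { always_ok_declare };
static const struct obj_ops model_checking_ops = { size_checking_declare };

/*
 * Dispatch through the ops table and, on failure, log which ops table
 * produced the error so the failing implementation can be identified.
 */
static int traced_declare_xattr_set(struct obj *o, const char *name, size_t len)
{
    int rc = o->ops->declare_xattr_set(o, name, len);

    if (rc != 0)
        fprintf(stderr, "declare_xattr_set: ops %p returned %d for %s\n",
                (const void *)o->ops, rc, name);
    return rc;
}
```

Comparing the logged ops address against the known tables tells you which layer's implementation ran, which is useful when the code being read and the code actually built differ by a patch.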
| Comment by Blake Caldwell [ 10/Jul/14 ] |
|
Thanks Oleg. From that I found that we have patch set 2 of The conditional that returns -E2BIG was marked "unlikely". Any idea what's causing the comparison to fail now? The relevant pieces of our LASSERT(handle != NULL); +#if defined(LDISKFS_FEATURE_INCOMPAT_EA_INODE) |
| Comment by Oleg Drokin [ 10/Jul/14 ] |
|
So, according to comments in In the end I think you just need to drop patchset 2 from your tree and add patchset 3. |
| Comment by James A Simmons [ 10/Jul/14 ] |
|
I merged the latest version of the |
| Comment by James A Simmons [ 10/Jul/14 ] |
|
One more note. Lustre 2.5 already has all the correct needed fixes. |
| Comment by Blake Caldwell [ 14/Jul/14 ] |
|
Our new build with |
| Comment by James A Simmons [ 14/Jul/14 ] |
|
This ticket can be closed. |
| Comment by John Fuchs-Chesney (Inactive) [ 14/Jul/14 ] |
|
Thanks James! |