[LU-3140] fid sequence may reused for different objects. Created: 10/Apr/13 Updated: 22/Jul/18 Resolved: 22/Jul/18 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexey Lyashkov | Assignee: | Mikhail Pershin |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
single node setup. |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7624 |
| Description |
|
While investing a Cray issues we create a debug patch to verify a correct mode assigned to the md object after revalidating, but result is very confusing. same fid assigned to the different files. 00000004:00000002:2.0:1365510202.512393:0:19372:0:(mdt_reint.c:281:mdt_md_create()) @@@ Create (bdir->[0x200000bd0:0x1:0x0]) in [0x200000400:0x7:0x0] req@ffff880036324928 x1431841008517781/t0(17179869191) o36->1e287f29-a399-fb10-7bd0-d9d83ebc1ce1@0@lo:0/0 lens 456/536 e 0 to 0 dl 1365510223 ref 1 fl Complete:/4/0 rc 0/0 00000004:00000002:6.0:1365510375.912867:0:23947:0:(mdt_reint.c:281:mdt_md_create()) @@@ Create (f9-2->[0x200000bd0:0x1:0x0]) in [0x61ab:0x54d71902:0x0] req@ffff88002a88c008 x1431841008518517/t0(0) o36->e35c9b0d-1221-a3ae-0357-637f30903536@0@lo:0/0 lens 456/536 e 0 to 0 dl 1365510395 ref 1 fl Interpret:/0/0 rc 0/0 as you see - bdir and f9-2 have a same fid. replicating a very easy bash-3.2$ git diff diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c index 7dad8f9..c40cfd2 100644 --- a/lustre/llite/llite_lib.c +++ b/lustre/llite/llite_lib.c @@ -1654,6 +1654,20 @@ void ll_update_inode(struct inode *inode, struct lustre_md *md) struct ll_sb_info *sbi = ll_i2sbi(inode); LASSERT ((lsm != NULL) == ((body->valid & OBD_MD_FLEASIZE) != 0)); + if (body->valid & OBD_MD_FLTYPE) { + if ((!S_ISDIR(inode->i_mode) && S_ISDIR(body->mode)) || + (S_ISDIR(inode->i_mode) && !S_ISDIR(body->mode))) { +// if ((inode->i_mode & S_IFMT) != (body->mode & S_IFMT)) { + CERROR("ino %p->%lu type is %s but server reply has %s " + "type (%o and %o). FID "DFID"\n", inode, inode->i_ino, + S_ISDIR(inode->i_mode) ? "directory" :"file", + S_ISDIR(body->mode) ? "directory" : "file", + inode->i_mode & S_IFMT, body->mode & S_IFMT, + PFID(&lli->lli_fid)); + LBUG(); + } + } + if (lsm != NULL) { LASSERT(S_ISREG(inode->i_mode)); 2) run PTLDEBUG=-1 SUBSYSTEM=-1 DEBUG_SIZE=400 llmount.sh to setup a single node lustre cluster 3) run a replay-dual with simple script. in 90% times you have hit a bug after first loop. |
| Comments |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
full debug log from hit assert. |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
the same fid would hit -EEXIST at insertion into OI... at least I saw this in development few times. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
Alex, thanks for point - i will ask to look in logs about it, until I worked on patch to waiting a commit on client side, before we may use a sequence - like a synchronous rpc. |
| Comment by Alexander Boyko [ 10/Apr/13 ] |
|
Xyratex-bug-id: MRP-877 |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
that's true, with rdonly we can't ensure a sequence is unique. I do remember there was a discussion on this topic, something about rdonly vs. commit callbacks (which are vital to maintain sequences unique). I'm adding Mike to the conversation.. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
quick look - say that is it situation. |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
iirc, in some tests we did something like file creation at the beginning and before R/O to ensure the sequence has been committed properly. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
and assume creation isn't blocked and fail always called after it. I prepared a prof of concept path, to illustrate my ideas - quick test say it's worked - but we need to fix much tests to new behavior. |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
creation itself should not block, but sequence allocation MUST wait until it's committed. IOW, no sequence should be provided if it's not committed. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
that is same as i make, but most tests in replay dual have ... test_9() {
replay_barrier $SINGLEMDS
mcreate $MOUNT1/$tfile-1
mcreate $MOUNT2/$tfile-2
# drop first reint reply
do_facet $SINGLEMDS lctl set_param fail_loc=0x80000119
fail $SINGLEMDS
so fail $MDS will never called as mcreate blocked on allocation FID sequence. same for other tests in replay-dual - until we have free FID's - it's works, when need allocate - broke a test. some test was have same before. test_0a() {
touch $MOUNT2/$tfile-A # force sync FLD/SEQ update before barrier
replay_barrier $SINGLEMDS
#define OBD_FAIL_PTLRPC_FINISH_REPLAY | OBD_FAIL_ONCE
touch $MOUNT2/$tfile
createmany -o $MOUNT1/$tfile- 50
$LCTL set_param fail_loc=0x80000514
..
or
test_4() {
mkdir $MOUNT1/adir
replay_barrier $SINGLEMDS
mkdir $MOUNT1/adir && return 1
mkdir $MOUNT2/adir/bdir
fail $SINGLEMDS
checkstat $MOUNT2/adir || return 2
but we have no guaranteed fid will requested on first creation, because depend of previous state. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
prof of concept for fix. |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
I don't understand the last comment. I'd think it should be enough to request and commit few sequences in mdt_iocontrol(), just before calling to ->dt_ro() |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
how you have send these sequences to the client? and we have no way to think that is enough for client or not. |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
there is no need to send them to the client - the client will request them when needed and get as the sequences have been already allocated and committed. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
sequences on client side or in server side? |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
strictly speaking it doesn't matter, both the server and the client should be getting only committed sequences. |
| Comment by Mikhail Pershin [ 10/Apr/13 ] |
|
I agree with Alex that this can be just artifact of replay_barrier(). It tries to simulate real behavior of device but it doesn't that at 100%. Any sync operation will be a problem, because it will finish like non-sync. You can check if your situation is about that. In osd_ro() we can set new flag on osd_device, say od_readonly. Then osd_trans_stop() should check that there is no th_sync and od_readonly toghether and put error message with dump_stack(). Therefore we will know when we have such situation and shouldn't rely on test results. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
Mike, same situation you will have with WB enabled cache for a device, as ldiskfs disable a barrier and some other. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
btw. it's not an artifact of replay_barrier as ro/no transno flag used to official way to shutdown target with recovery. |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
sequence requests are never replayed as sequences are not supposed to be reused. when MDT shuts down it loses its sequences as well and will get new one upon reboot. |
| Comment by Alexey Lyashkov [ 10/Apr/13 ] |
|
Alex 1) sync operation only in case... if (seq->lss_set_transno > dev->ld_obd->obd_last_committed) 3) as seq allocation need replayed anyway - we may use async transaction so will be speedup for MD operations. |
| Comment by Mikhail Pershin [ 10/Apr/13 ] |
|
Alexey, could you give an example how r/o (barrier) is used for shutdown officially? What do you mean by that? the replay_barrier has known bugs and it doesn't work correctly with sync operations, it is real artifact no matter how official the way of using it. seq allocaions don't need replay if the only benefit is speedup. Note that we have pre-created window of many sequences, so in most cases allocation don't produce sync and doesn't affect MD operations performance, sync is very rare case, but during tests it may occur more often due to many server failover in recovery tests |
| Comment by Alex Zhuravlev [ 10/Apr/13 ] |
|
by the time rdonly is turned on we shouldn't be handling new requests, so no new sequences are needed while old in-use have been already committed. |
| Comment by Andreas Dilger [ 11/Apr/13 ] |
|
Alex, do we stille use rdonly for shutdown? I thought that was removed some time ago by Fan Yong. |
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
yes, I actually checked the code today and AFAICS we do not use rdonly. even with umount -f. |
| Comment by Mikhail Pershin [ 23/Apr/13 ] |
|
Not blocker for 2.4 because happens on old Lustre 2.1 version. The issue is related to sync sequence update during barrier, which can't handle sync. Current replay_barrier contains workaround to prevent that by creating extra file before barrier which bumps sequence update but that is not so in pre-2.3 Lustre and replay-dual may encounter such conditions. Strictly speaking this is not bug even though may be considered as request for some improvements in test infrastructure to avoid sync updated during replay_barrier() in better way. |
| Comment by Jodi Levi (Inactive) [ 23/Apr/13 ] |
|
Dropping priority per Mike's last comment. |
| Comment by Alexey Lyashkov [ 11/Dec/13 ] |
|
Finally investigated that issue. 2.4 send an 'name' as one of parameters of ll_intent_file_open request, but set also an MDS_OPEN_BY_FID flag. in that case open send an request with 'f1' name but with fid from older 'F1' file, but MDS don't know about OPEN_BY_FID flag and don't execute an lookup by fid to find an object in directory. execute an lookup by name 'f1' and return a data from older 'f0' object, after reply is accepted - client tried to update an older inode from 'f1' object with data from a server and client confuses about different FID in answer. SO bug in handling an interop in |
| Comment by Mikhail Pershin [ 22/Jul/18 ] |
|
outdated issue |