[LU-3140] fid sequence may reused for different objects. Created: 10/Apr/13  Updated: 22/Jul/18  Resolved: 22/Jul/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexey Lyashkov Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

single node setup.


Attachments: File lelus-134-prof.diff     File lustre.bin.bz2    
Severity: 3
Rank (Obsolete): 7624

 Description   

While investing a Cray issues we create a debug patch to verify a correct mode assigned to the md object after revalidating, but result is very confusing.

same fid assigned to the different files.

00000004:00000002:2.0:1365510202.512393:0:19372:0:(mdt_reint.c:281:mdt_md_create()) @@@ Create (bdir->[0x200000bd0:0x1:0x0]) in [0x200000400:0x7:0x0] req@ffff880036324928 x1431841008517781/t0(17179869191) o36->1e287f29-a399-fb10-7bd0-d9d83ebc1ce1@0@lo:0/0 lens 456/536 e 0 to 0 dl 1365510223 ref 1 fl Complete:/4/0 rc 0/0

00000004:00000002:6.0:1365510375.912867:0:23947:0:(mdt_reint.c:281:mdt_md_create()) @@@ Create (f9-2->[0x200000bd0:0x1:0x0]) in [0x61ab:0x54d71902:0x0] req@ffff88002a88c008 x1431841008518517/t0(0) o36->e35c9b0d-1221-a3ae-0357-637f30903536@0@lo:0/0 lens 456/536 e 0 to 0 dl 1365510395 ref 1 fl Interpret:/0/0 rc 0/0

as you see - bdir and f9-2 have a same fid.

replicating a very easy
1) apply an patch (to hit assert and panic) or dump log after test loop.

bash-3.2$ git diff
diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c
index 7dad8f9..c40cfd2 100644
--- a/lustre/llite/llite_lib.c
+++ b/lustre/llite/llite_lib.c
@@ -1654,6 +1654,20 @@ void ll_update_inode(struct inode *inode, struct lustre_md *md)
         struct ll_sb_info *sbi = ll_i2sbi(inode);
 
         LASSERT ((lsm != NULL) == ((body->valid & OBD_MD_FLEASIZE) != 0));
+        if (body->valid & OBD_MD_FLTYPE) {
+               if ((!S_ISDIR(inode->i_mode) && S_ISDIR(body->mode)) ||
+                   (S_ISDIR(inode->i_mode) && !S_ISDIR(body->mode))) {
+//             if ((inode->i_mode & S_IFMT) != (body->mode & S_IFMT)) {
+                       CERROR("ino %p->%lu type is %s but server reply has %s "
+                              "type (%o and %o). FID "DFID"\n", inode, inode->i_ino,
+                              S_ISDIR(inode->i_mode) ? "directory" :"file",
+                              S_ISDIR(body->mode) ? "directory" : "file",
+                              inode->i_mode & S_IFMT, body->mode & S_IFMT,
+                              PFID(&lli->lli_fid));
+                       LBUG();
+               }
+       }
+
         if (lsm != NULL) {
                 LASSERT(S_ISREG(inode->i_mode));

2) run PTLDEBUG=-1 SUBSYSTEM=-1 DEBUG_SIZE=400 llmount.sh to setup a single node lustre cluster

3) run a replay-dual with simple script.
for ((i=0; i!=1000; i++)) do
for ((j=1; j!=10; j++)) do
echo "Iteration $i"
SLOW=YES ONLY=$j DIR=/mnt/lustre sh ./replay-dual.sh
done
done

in 90% times you have hit a bug after first loop.



 Comments   
Comment by Alexey Lyashkov [ 10/Apr/13 ]

full debug log from hit assert.

Comment by Alex Zhuravlev [ 10/Apr/13 ]

the same fid would hit -EEXIST at insertion into OI... at least I saw this in development few times.

Comment by Alexey Lyashkov [ 10/Apr/13 ]

Alex, thanks for point - i will ask to look in logs about it, until I worked on patch to waiting a commit on client side, before we may use a sequence - like a synchronous rpc.
But that is may be valid situation - when both seq allocation and OI update if under RO state and don't pushed to disk, but client think operation is OK.

Comment by Alexander Boyko [ 10/Apr/13 ]

Xyratex-bug-id: MRP-877

Comment by Alex Zhuravlev [ 10/Apr/13 ]

that's true, with rdonly we can't ensure a sequence is unique. I do remember there was a discussion on this topic, something about rdonly vs. commit callbacks (which are vital to maintain sequences unique). I'm adding Mike to the conversation..

Comment by Alexey Lyashkov [ 10/Apr/13 ]

quick look - say that is it situation.
SEQ allocation after r/o set and OI insert likely will be in same time, so ... both lost, but client have a valid inode in cache.

Comment by Alex Zhuravlev [ 10/Apr/13 ]

iirc, in some tests we did something like file creation at the beginning and before R/O to ensure the sequence has been committed properly.

Comment by Alexey Lyashkov [ 10/Apr/13 ]

and assume creation isn't blocked and fail always called after it.

I prepared a prof of concept path, to illustrate my ideas - quick test say it's worked - but we need to fix much tests to new behavior.

Comment by Alex Zhuravlev [ 10/Apr/13 ]

creation itself should not block, but sequence allocation MUST wait until it's committed. IOW, no sequence should be provided if it's not committed.
in the current implementation we're trying to pre-allocate a window of sequences (proportional to the number of clients) and it becomes "available"
only up on commit callback.

Comment by Alexey Lyashkov [ 10/Apr/13 ]

that is same as i make, but most tests in replay dual have ...

test_9() {
    replay_barrier $SINGLEMDS
    mcreate $MOUNT1/$tfile-1
    mcreate $MOUNT2/$tfile-2
    # drop first reint reply
    do_facet $SINGLEMDS lctl set_param fail_loc=0x80000119
    fail $SINGLEMDS

so fail $MDS will never called as mcreate blocked on allocation FID sequence.

same for other tests in replay-dual - until we have free FID's - it's works, when need allocate - broke a test.

some test was have same before.

test_0a() {
    touch $MOUNT2/$tfile-A # force sync FLD/SEQ update before barrier
    replay_barrier $SINGLEMDS
#define OBD_FAIL_PTLRPC_FINISH_REPLAY | OBD_FAIL_ONCE
    touch $MOUNT2/$tfile
    createmany -o $MOUNT1/$tfile- 50
    $LCTL set_param fail_loc=0x80000514

..
or
test_4() {
    mkdir $MOUNT1/adir
    replay_barrier $SINGLEMDS
    mkdir $MOUNT1/adir  && return 1
    mkdir $MOUNT2/adir/bdir
    
    fail $SINGLEMDS
    checkstat $MOUNT2/adir      || return 2

but we have no guaranteed fid will requested on first creation, because depend of previous state.

Comment by Alexey Lyashkov [ 10/Apr/13 ]

prof of concept for fix.
need have result about test fixes, and restore allocation after replay. Look we need to send same allocation as last client have and resend on client side if that range exhausted.

Comment by Alex Zhuravlev [ 10/Apr/13 ]

I don't understand the last comment. I'd think it should be enough to request and commit few sequences in mdt_iocontrol(), just before calling to ->dt_ro()

Comment by Alexey Lyashkov [ 10/Apr/13 ]

how you have send these sequences to the client? and we have no way to think that is enough for client or not.
and what about power lost before an update on storage?

Comment by Alex Zhuravlev [ 10/Apr/13 ]

there is no need to send them to the client - the client will request them when needed and get as the sequences have been already allocated and committed.
power lost before the update implies no commit callback is called and no sequence is available.

Comment by Alexey Lyashkov [ 10/Apr/13 ]

sequences on client side or in server side?
if you talk about server side - have a good recovery with requests replay will enough to it.
i that case we don't need a waiting a commit callback on client - but will be need more work for recovery.

Comment by Alex Zhuravlev [ 10/Apr/13 ]

strictly speaking it doesn't matter, both the server and the client should be getting only committed sequences.
in the context of sequence allocation, it's commit callback on the server making a sequence available.

Comment by Mikhail Pershin [ 10/Apr/13 ]

I agree with Alex that this can be just artifact of replay_barrier(). It tries to simulate real behavior of device but it doesn't that at 100%. Any sync operation will be a problem, because it will finish like non-sync. You can check if your situation is about that. In osd_ro() we can set new flag on osd_device, say od_readonly. Then osd_trans_stop() should check that there is no th_sync and od_readonly toghether and put error message with dump_stack(). Therefore we will know when we have such situation and shouldn't rely on test results.

Comment by Alexey Lyashkov [ 10/Apr/13 ]

Mike,

same situation you will have with WB enabled cache for a device, as ldiskfs disable a barrier and some other.
most async operations is able to replayed except a SEQ request. I think question is incorrect replay for a sequence as that is FS modification request and should be replayed correctly or implemented as sync request to waiting a commits from MDT side.

Comment by Alexey Lyashkov [ 10/Apr/13 ]

btw. it's not an artifact of replay_barrier as ro/no transno flag used to official way to shutdown target with recovery.

Comment by Alex Zhuravlev [ 10/Apr/13 ]

sequence requests are never replayed as sequences are not supposed to be reused. when MDT shuts down it loses its sequences as well and will get new one upon reboot.
i think you missed the point that when MDS sends a sequence back to the client it (sequence) is already committed to the disk.

Comment by Alexey Lyashkov [ 10/Apr/13 ]

Alex

1) sync operation only in case...

if (seq->lss_set_transno > dev->ld_obd->obd_last_committed)
sync = 1;
2) notransno and r/o is official way to shutdown target with failover, so we can't say - that is replay_barrier artifact, we need to make that way work correctly.

3) as seq allocation need replayed anyway - we may use async transaction so will be speedup for MD operations.

Comment by Mikhail Pershin [ 10/Apr/13 ]

Alexey,

could you give an example how r/o (barrier) is used for shutdown officially? What do you mean by that? the replay_barrier has known bugs and it doesn't work correctly with sync operations, it is real artifact no matter how official the way of using it.

seq allocaions don't need replay if the only benefit is speedup. Note that we have pre-created window of many sequences, so in most cases allocation don't produce sync and doesn't affect MD operations performance, sync is very rare case, but during tests it may occur more often due to many server failover in recovery tests

Comment by Alex Zhuravlev [ 10/Apr/13 ]

by the time rdonly is turned on we shouldn't be handling new requests, so no new sequences are needed while old in-use have been already committed.

Comment by Andreas Dilger [ 11/Apr/13 ]

Alex, do we stille use rdonly for shutdown? I thought that was removed some time ago by Fan Yong.

Comment by Alex Zhuravlev [ 11/Apr/13 ]

yes, I actually checked the code today and AFAICS we do not use rdonly. even with umount -f.

Comment by Mikhail Pershin [ 23/Apr/13 ]

Not blocker for 2.4 because happens on old Lustre 2.1 version. The issue is related to sync sequence update during barrier, which can't handle sync. Current replay_barrier contains workaround to prevent that by creating extra file before barrier which bumps sequence update but that is not so in pre-2.3 Lustre and replay-dual may encounter such conditions. Strictly speaking this is not bug even though may be considered as request for some improvements in test infrastructure to avoid sync updated during replay_barrier() in better way.

Comment by Jodi Levi (Inactive) [ 23/Apr/13 ]

Dropping priority per Mike's last comment.

Comment by Alexey Lyashkov [ 11/Dec/13 ]

Finally investigated that issue.
Regression introduced by LU-220 and MDS_OPEN_BY_FID, and easily replicated with sanity 27B with 2.4 client and 2.1 server.

2.4 send an 'name' as one of parameters of ll_intent_file_open request, but set also an MDS_OPEN_BY_FID flag.
in script
create f0, f1;
open f1
mv f0 -> f1
ll_intent_file_open (via ll_file_ioctl)
....
close $f1

in that case open send an request with 'f1' name but with fid from older 'F1' file, but MDS don't know about OPEN_BY_FID flag and don't execute an lookup by fid to find an object in directory. execute an lookup by name 'f1' and return a data from older 'f0' object, after reply is accepted - client tried to update an older inode from 'f1' object with data from a server and client confuses about different FID in answer.

SO bug in handling an interop in LU-220 patch.
if it's need i may attach a full logs from that.

Comment by Mikhail Pershin [ 22/Jul/18 ]

outdated issue

Generated at Sat Feb 10 01:31:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.