[LU-2808] osp_object_assign_fid()) ASSERTION( fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)) ) failed: Created: 13/Feb/13  Updated: 23/Jun/21  Resolved: 23/Jun/21

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: John Hammond
Assignee: John Hammond
Resolution: Duplicate
Votes: 0
Labels: osp

Attachments: File sys_o_delay_open_ftruncate.c    
Issue Links:
Duplicate
is duplicated by LU-2399 create objects on truncate Resolved
Related
is related to LU-2523 ll_update_inode()) ASSERTION( lu_fid_... Resolved
Severity: 3
Rank (Obsolete): 6797

 Description   

Racy reproducer:

# llmount.sh
# cd /mnt/lustre
# while true; do lfs setstripe -c 1 f0; rm f0; done &
[1] 3814
# while true; do truncate --size=1 f0; done
cannot truncate `f0' to length 21705: No such file or directory
cannot truncate `f0' to length 17399: No such file or directory
cannot truncate `f0' to length 18024: No such file or directory
cannot truncate `f0' to length 25593: No such file or directory
cannot truncate `f0' to length 19126: No such file or directory
cannot truncate `f0' to length 29680: No such file or directory
cannot truncate `f0' to length 14928: No such file or directory
cannot truncate `f0' to length 23877: No such file or directory
cannot truncate `f0' to length 6911: No such file or directory
cannot truncate `f0' to length 868: No such file or directory
cannot truncate `f0' to length 791: No such file or directory
cannot truncate `f0' to length 28593: No such file or directory
cannot truncate `f0' to length 7330: No such file or directory
cannot truncate `f0' to length 9708: No such file or directory

Message from syslogd@m at Feb 13 10:29:39 ...
 kernel:LustreError: 3088:0:(osp_object.c:56:osp_object_assign_fid()) ASSERTION( fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)) ) failed:

Message from syslogd@m at Feb 13 10:29:39 ...
 kernel:LustreError: 3088:0:(osp_object.c:56:osp_object_assign_fid()) LBUG

Message from syslogd@m at Feb 13 10:29:39 ...
 kernel:Kernel panic - not syncing: LBUG

crash> bt
PID: 31940  TASK: ffff8801663b5500  CPU: 0   COMMAND: "mdt00_001"
 #0 [ffff8801663b78e8] machine_kexec at ffffffff81031f7b
 #1 [ffff8801663b7948] crash_kexec at ffffffff810b8c22
 #2 [ffff8801663b7a18] panic at ffffffff814eae18
 #3 [ffff8801663b7a98] lbug_with_loc at ffffffffa0ef3eeb [libcfs]
 #4 [ffff8801663b7ab8] osp_object_assign_fid at ffffffffa0b98942 [osp]
 #5 [ffff8801663b7ae8] osp_declare_attr_set at ffffffffa0b98b11 [osp]
 #6 [ffff8801663b7b38] lod_declare_attr_set at ffffffffa0b68083 [lod]
 #7 [ffff8801663b7b88] mdd_attr_set at ffffffffa051f5e9 [mdd]
 #8 [ffff8801663b7c08] mdt_attr_set at ffffffffa0a4deb8 [mdt]
 #9 [ffff8801663b7c58] mdt_reint_setattr at ffffffffa0a4e7ad [mdt]
#10 [ffff8801663b7cc8] mdt_reint_rec at ffffffffa0a486b1 [mdt]
#11 [ffff8801663b7ce8] mdt_reint_internal at ffffffffa0a41d13 [mdt]
#12 [ffff8801663b7d28] mdt_reint at ffffffffa0a42044 [mdt]
#13 [ffff8801663b7d48] mdt_handle_common at ffffffffa0a32fb8 [mdt]
#14 [ffff8801663b7d98] mds_regular_handle at ffffffffa0a6a5c5 [mdt]
#15 [ffff8801663b7da8] ptlrpc_server_handle_request at ffffffffa062c00c [ptlrpc]
#16 [ffff8801663b7ea8] ptlrpc_main at ffffffffa062d556 [ptlrpc]
#17 [ffff8801663b7f48] kernel_thread at ffffffff8100c0ca

A deterministic reproducer is attached. It does the following:

fd1 = open("f0", O_RDWR|O_CREAT|O_LOV_DELAY_CREATE, 0666);
fd2 = open("f0", O_RDWR|O_CREAT, 0666);
ftruncate(fd2, 1);
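
For readers without the attachment, the three calls above can be wrapped into a small helper. This is a minimal sketch, not the attached sys_o_delay_open_ftruncate.c: the function name delay_open_ftruncate() is hypothetical, and it assumes that O_LOV_DELAY_CREATE is supplied by the Lustre user-space header on a Lustre client (for example lustre/lustre_user.h, which is an assumption about the local install). The fallback definition of 0 only lets the sketch build on other systems, where it exercises nothing Lustre-specific.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * ASSUMPTION: on a Lustre client this flag comes from the Lustre user-space
 * header. The fallback of 0 is only here so the sketch compiles elsewhere;
 * with it, the first open() is an ordinary open and no LBUG can be triggered.
 */
#ifndef O_LOV_DELAY_CREATE
#define O_LOV_DELAY_CREATE 0
#endif

/*
 * Hypothetical helper: open the same path twice (first with delayed object
 * creation, then normally) and truncate through the second descriptor.
 * Returns 0 if all three calls succeed, -1 otherwise with errno preserved.
 */
static int delay_open_ftruncate(const char *path, off_t length)
{
	int fd1, fd2, rc, saved_errno;

	/* First open: create the file but defer creation of the OST objects. */
	fd1 = open(path, O_RDWR | O_CREAT | O_LOV_DELAY_CREATE, 0666);
	if (fd1 < 0)
		return -1;

	/* Second open: a normal open of the same file. */
	fd2 = open(path, O_RDWR | O_CREAT, 0666);
	if (fd2 < 0) {
		saved_errno = errno;
		close(fd1);
		errno = saved_errno;
		return -1;
	}

	/* Truncate to a non-zero length through the second descriptor. */
	rc = ftruncate(fd2, length);
	saved_errno = errno;

	close(fd2);
	close(fd1);
	errno = saved_errno;
	return rc;
}
```

Calling delay_open_ftruncate("f0", 1) from /mnt/lustre corresponds to the sequence described above.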

This seems to be independent of LU-2523. I tried this with and without the patch from LU-2523 and got the same result.

Note that if fd1 is used instead of fd2 in the call to ftruncate(), or if a zero length is used, there is no LBUG. If truncate("f0", 1) is used, there is also no LBUG.

Interestingly, if MOUNT_2=yes is used and the first open is of "/mnt/lustre/f0" and the second is of "/mnt/lustre2/f0", then there is no LBUG. However, while the setstripe ioctl() will appear to succeed, the default striping will in fact be applied to the file.



 Comments   
Comment by Alex Zhuravlev [ 14/Feb/13 ]

We discussed a few times that it would be better for truncate to create the striping if it doesn't exist yet. This would help to reduce the number of cases to handle (to LU-2794, I think) and fix one outstanding issue.

Comment by Andreas Dilger [ 15/Feb/13 ]

Alex, I agree. Not a critical issue, but it would be nice to get this fixed for the final 2.4 release.

Comment by John Hammond [ 19/Feb/13 ]

In this case the striping has already been created (by the second open()) when ftruncate() is called. The problem seems to be that the lsm is not returned to the client after the second open(), or the client ignores it. Then in ftruncate() the client does not have the lsm for the file, so it sends a setattr with the new size to the MDS. So you have the MDS setting the size of an already striped file, which triggers the LASSERT() in osp_declare_attr_set()/osp_object_assign_fid(). It seems like the easiest fix here would be to check for an already assigned fid:

@@ -53,7 +53,9 @@ static void osp_object_assign_fid(const struct lu_env *env,
 {
        struct osp_thread_info *osi = osp_env_info(env);
 
-       LASSERT(fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)));
+       if (!fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)))
+               return;
+
        LASSERT(o->opo_reserved);
        o->opo_reserved = 0;

Please correct me if I'm wrong.

Comment by John Hammond [ 19/Feb/13 ]

Please see http://review.whamcloud.com/5473.

Comment by Andreas Dilger [ 19/Feb/13 ]

I agree that fixing the LASSERT() is the first priority. I don't think this is a critical problem that is going to be hit in normal usage.

However, if the truncate is sent to the MDS, but is then ignored, the client cannot send the RPC to the OSTs. Either the MDS needs to proxy the RPC for the client, or return some error code to the client (-EAGAIN?) so that it will get the layout and send the RPC on to the OSTs.

Comment by John Hammond [ 19/Feb/13 ]

I admit that the reproducer seems pretty unlikely. To me the upshot of fixing this is that truncate()/ftruncate() could probably be added to racer.sh. It won't hurt my feelings if this gets dropped from the blocker list.

Comment by Alex Zhuravlev [ 20/Feb/13 ]

Frankly, this doesn't look like a solution.

Comment by Alex Zhuravlev [ 20/Feb/13 ]

I think a few paths (including open/w, setattr) should be checking whether the striping has already been created and, if not, creating it while holding an exclusive lock.
Later we'll wrap all this with an appropriate layout lock.

Comment by Andreas Dilger [ 05/Mar/13 ]

John, are you going to rework your patch based on Alex's comment, or does Alex need to work on that?

Comment by John Hammond [ 05/Mar/13 ]

Sure, I'll give it a shot. I had stopped work on this after Alex's comment, which I may have misunderstood. Is disabling the assertion the correct band-aid here?

Comment by Alex Zhuravlev [ 06/Mar/13 ]

I tend to think that removing the assertion will just remove the only symptom and hide all the subsequent troubles we may get into.
There are a number of issues in this area and I hope they will be solved almost automatically once we start to take locks around transactions.
As a temporary solution, we could recognize that the striping has already been created, return -EEXIST, and handle it in the caller?
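
To illustrate the difference between asserting and returning -EEXIST to the caller, here is a toy user-space model. It is not Lustre code and not taken from any patch in this thread: every type and function name (toy_fid, toy_object, toy_assign_fid(), toy_declare_attr_set()) is hypothetical, the structures do not mirror the real ones, and the caller's policy of treating -EEXIST as "already striped, carry on" is only an assumption about what "handle it in the caller" could mean.

```c
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for a FID; NOT the real struct lu_fid. */
struct toy_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Toy stand-in for an object that may or may not have a FID yet. */
struct toy_object {
	struct toy_fid fid;
	int reserved;
};

static int toy_fid_is_zero(const struct toy_fid *fid)
{
	return fid->seq == 0 && fid->oid == 0 && fid->ver == 0;
}

/*
 * Instead of asserting that the object has no FID yet, report -EEXIST
 * so that the caller can decide what to do. Returns 0 on assignment.
 */
static int toy_assign_fid(struct toy_object *o, const struct toy_fid *new_fid)
{
	if (!toy_fid_is_zero(&o->fid))
		return -EEXIST;

	o->fid = *new_fid;
	o->reserved = 0;
	return 0;
}

/*
 * ASSUMED caller policy: an object that already has a FID is not an error
 * for a size-changing setattr, so -EEXIST is swallowed here. Any other
 * error is passed up unchanged.
 */
static int toy_declare_attr_set(struct toy_object *o,
				const struct toy_fid *new_fid)
{
	int rc = toy_assign_fid(o, new_fid);

	if (rc == -EEXIST)
		return 0;
	return rc;
}
```

Unlike a silent early return inside the assignment function, the error code keeps the "already created" condition visible to the caller, which can then choose to ignore it, retry, or fail.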

Comment by Jinshan Xiong (Inactive) [ 06/Mar/13 ]

I did a similar thing in patch 5291 to support sending size info to the MDT in the truncate RPC.

Comment by John Hammond [ 20/Mar/13 ]

Jinshan's #5291 has resolved the LBUG(). There remains the issue of creating objects on truncate discussed at http://review.whamcloud.com/5473.

Comment by Andreas Dilger [ 15/Oct/13 ]

Creating objects on truncate is what LU-2399 is all about. Closing this bug, and assigning that one to John.

Generated at Sat Feb 10 01:28:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.