[LU-4363] (llite_lib.c:1683:ll_update_inode()) ASSERTION( lu_fid_eq(&lli->lli_fid, &body->fid1) ) failed Created: 09/Dec/13  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sebastien Buisson (Inactive) Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: mn1

Severity: 3
Rank (Obsolete): 11946

 Description   

Hi,

At the IFERC customer site, 7 compute nodes crashed with the following message in the console:

2013-11-21 00:57:45 LustreError: 92325:0:(llite_lib.c:1683:ll_update_inode()) ASSERTION( lu_fid_eq(&lli->lli_fid, &body->fid1) ) failed: Trying to change FID [0x217294ce4:0x107f0:0x0] to the [0x217294ce4:0x107f1:0x0], inode 150634522759727089/35072332(ffff8807dcbf85f8)
2013-11-21 00:57:45 LustreError: 92325:0:(llite_lib.c:1683:ll_update_inode()) LBUG
2013-11-21 00:57:45 Pid: 92325, comm: writer_v131
2013-11-21 00:57:45
2013-11-21 00:57:45 Call Trace:
2013-11-21 00:57:45  [<ffffffffa046f7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-11-21 00:57:45  [<ffffffffa046fe07>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-11-21 00:57:45  [<ffffffffa0a91ca0>] ll_update_inode+0x4a0/0xf60 [lustre]
2013-11-21 00:57:45  [<ffffffffa0a928ea>] ll_prep_inode+0x18a/0xae0 [lustre]
2013-11-21 00:57:45  [<ffffffffa0a7c8c3>] ll_intent_file_open+0x563/0xb80 [lustre]
2013-11-21 00:57:45  [<ffffffffa0aa6a90>] ? ll_md_blocking_ast+0x0/0x700 [lustre]
2013-11-21 00:57:45  [<ffffffff8108163e>] ? down+0x2e/0x50
2013-11-21 00:57:45  [<ffffffffa0a7cf67>] ll_lov_setstripe_ea_info+0x87/0x2b0 [lustre]
2013-11-21 00:57:45  [<ffffffffa0a831a5>] ll_lov_setstripe+0x85/0x5a0 [lustre]
2013-11-21 00:57:45  [<ffffffffa0aa3e8b>] ? ll_stats_ops_tally+0x6b/0xd0 [lustre]
2013-11-21 00:57:45  [<ffffffffa0a84ac6>] ll_file_ioctl+0x826/0xe00 [lustre]
2013-11-21 00:57:45  [<ffffffff81179ff2>] vfs_ioctl+0x22/0xa0
2013-11-21 00:57:45  [<ffffffff8117a4ba>] do_vfs_ioctl+0x3aa/0x580
2013-11-21 00:57:45  [<ffffffff8117a711>] sys_ioctl+0x81/0xa0
2013-11-21 00:57:45  [<ffffffff8149970e>] ? do_device_not_available+0xe/0x10
2013-11-21 00:57:45  [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
2013-11-21 00:57:45
2013-11-21 00:57:45 Kernel panic - not syncing: LBUG
2013-11-21 00:57:45 Pid: 92325, comm: writer_v131 Tainted: G        W  ---------------    2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
2013-11-21 00:57:45 Call Trace:
2013-11-21 00:57:45  [<ffffffff81495fe3>] ? panic+0xa0/0x168
2013-11-21 00:57:45  [<ffffffffa046fe5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2013-11-21 00:57:45  [<ffffffffa0a91ca0>] ? ll_update_inode+0x4a0/0xf60 [lustre]
2013-11-21 00:57:45  [<ffffffffa0a928ea>] ? ll_prep_inode+0x18a/0xae0 [lustre]
2013-11-21 00:57:45  [<ffffffffa0a7c8c3>] ? ll_intent_file_open+0x563/0xb80 [lustre]
2013-11-21 00:57:45  [<ffffffffa0aa6a90>] ? ll_md_blocking_ast+0x0/0x700 [lustre]
2013-11-21 00:57:45  [<ffffffff8108163e>] ? down+0x2e/0x50
2013-11-21 00:57:45  [<ffffffffa0a7cf67>] ? ll_lov_setstripe_ea_info+0x87/0x2b0 [lustre]
2013-11-21 00:57:45  [<ffffffffa0a831a5>] ? ll_lov_setstripe+0x85/0x5a0 [lustre]
2013-11-21 00:57:45  [<ffffffffa0aa3e8b>] ? ll_stats_ops_tally+0x6b/0xd0 [lustre]
2013-11-21 00:57:45  [<ffffffffa0a84ac6>] ? ll_file_ioctl+0x826/0xe00 [lustre]
2013-11-21 00:57:45  [<ffffffff81179ff2>] ? vfs_ioctl+0x22/0xa0
2013-11-21 00:57:45  [<ffffffff8117a4ba>] ? do_vfs_ioctl+0x3aa/0x580
2013-11-21 00:57:45  [<ffffffff8117a711>] ? sys_ioctl+0x81/0xa0
2013-11-21 00:57:45  [<ffffffff8149970e>] ? do_device_not_available+0xe/0x10
2013-11-21 00:57:45  [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b

This issue looks like LU-2523 and LU-3311, but the patch for b2_1 has not made any progress since July.

I have tested with the following reproducer, given in LU-2523:

llmount.sh
cd /mnt/lustre
touch file1

In a single process do:
  struct lov_user_md_v3 *lum;
  /* Initialize lum */
  fd2 = open("file2", O_RDWR|O_CREAT|O_LOV_DELAY_CREATE, 0666);
  rename("file1", "file2");
  ioctl(fd2, LL_IOC_LOV_SETSTRIPE, lum);

With a stock 2.1.6 I can easily reproduce the issue. Unfortunately, even with the patch at http://review.whamcloud.com/6775 I am still able to hit the bug.

Thanks,
Sebastien.



 Comments   
Comment by Lai Siyao [ 09/Dec/13 ]

http://review.whamcloud.com/#/c/7476/ should be able to fix this, but this patch is for the master code, and it depends on patches that are not in 2.1.

Comment by Peter Jones [ 09/Dec/13 ]

Lai

Would this be easier to port to b2_4?

Sebastien

If the answer to the above is yes, would you consider deploying a 2.4.x release at IFERC?

Peter

Comment by Sebastien Buisson (Inactive) [ 09/Dec/13 ]

Peter,

The problem is that the upgrade to 2.4 at IFERC is planned for Q4 2014.

Comment by Peter Jones [ 10/Dec/13 ]

ok Sebastien. We are looking into options that would work for b2_1

Comment by Lai Siyao [ 10/Dec/13 ]

Yes, Sebastien, I'm looking for a simpler way to handle only this open-by-fid case. I'm still testing and will commit the patch tomorrow.

Comment by Lai Siyao [ 10/Dec/13 ]

Hi Sebastien, I just committed a patch, http://review.whamcloud.com/#/c/8529/. You can apply it together with http://review.whamcloud.com/#/c/7476/ to make the test pass. However, as John noted in LU-2523, setstripe will return -ENOENT in your test. This is because the MDS has a strict check that forbids re-opening an unlinked file or creating OST objects for it (even if it is currently open). Do you think this is acceptable?

Comment by Sebastien Buisson (Inactive) [ 11/Dec/13 ]

Hi,

Patch http://review.whamcloud.com/8529 can be applied on Lustre 2.1.6, but http://review.whamcloud.com/7476 cannot because it is a master version (more than 20 hunks failed when trying on 2.1).
That being said, I do not clearly understand the relationship between the assertion we are hitting and the open-by-fid feature. All we need is something to prevent Lustre 2.1 from crashing when users do something like the reproducer detailed in the description of this ticket.

Sebastien.

Comment by Lai Siyao [ 12/Dec/13 ]

Sorry Sebastien, I posted the wrong patch, it should be http://review.whamcloud.com/6775 + http://review.whamcloud.com/#/c/8529/.

You hit that assertion because the MDS_OPEN_BY_FID flag is not in the 2.1 code, so the open tends to be done by name on the MDS. Therefore, when a rename happens, the new file with a different FID is opened, which triggers the assertion on FID change on the client.

2.4 has this flag, and patch http://review.whamcloud.com/#/c/8529/ backports it to 2.1, so the assertion will no longer be hit.

Comment by Sebastien Buisson (Inactive) [ 12/Dec/13 ]

Hi,

Thank you very much for the explanations! Now with http://review.whamcloud.com/6775 + http://review.whamcloud.com/8529 I am not able to hit the assertion anymore.
So we definitely need both fixes in b2_1.

One more question: could you explain again the drawback you identified with this solution? It was related to setstripe returning -ENOENT, but I did not get your point.

Thanks,
Sebastien.

Comment by Lai Siyao [ 12/Dec/13 ]
ioctl(fd2, LL_IOC_LOV_SETSTRIPE, lum);

As you can see, setstripe is done via ioctl on an already opened file handle, but in the code setstripe is implemented as an open (so it is actually a re-open). It looks like this should succeed, but the current MDS code does not allow re-opening an unlinked file or creating OST objects for it. However, since there is no POSIX standard for the setstripe call, this can be regarded as normal, but it should be documented somewhere IMO.

Generated at Sat Feb 10 01:42:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.