[LU-762] Hyperion - mdtest failure Created: 14/Oct/11 Updated: 10/Jan/12 Resolved: 16/Dec/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL5/x86_64 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4852 |
| Description |
|
mdtest in a shared directory fails with 50 clients and MDTEST_NFILES=1024, and also fails with 100 clients and MDTEST_NFILES=256. The failure is silent: no Lustre errors or server messages of any kind. It is consistent and repeatable, and occurs only when run in shared-directory mode. 000: Command line used: /opt/mdtest-1.8.3/bin/mdtest -d /p/l_wham/white215/hyperion.3867/mdtest -i3 -n1024 |
| Comments |
| Comment by Sarah Liu [ 14/Oct/11 ] |
|
I tried to reproduce this bug on Toro; it passes with 3 clients and MDTEST_NFILES=1024, but fails with 3 clients and MDTEST_NFILES=10240:
MDTEST_NFILES=1024: 12 tasks, 12288 files/directories; SUMMARY: (of 3 iterations) – finished at 10/14/2011 16:45:06 –
MDTEST_NFILES=10240: 12 tasks, 122880 files/directories; NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. |
| Comment by Peter Jones [ 17/Oct/11 ] |
|
Oleg will look into this one |
| Comment by Oleg Drokin [ 19/Oct/11 ] |
|
Ok, so I was diving into it and it seems to be related to some ldiskfs problem. Here is the relevant MDS log part: This tells us that ldiskfs somehow returned an inode from lookup that had zero nlink, which is not right at all, as nobody could have deleted it yet. After the job died I located this dir (from the client logs) and did ls on it - it was still there - and then removed it, and it was removed with no fuss. Another interesting point: in my testing I am only able to reproduce the problem if I run the two mdtests sequentially as specified here - one with 1024 files followed by 10240 files in the same dir. If I run just the 10240-file run in another dir, or in the same dir after it was deleted, I cannot reproduce it, so I wonder if there is some leftover state in the large dir that ruins the day later on? |
| Comment by Andreas Dilger [ 19/Oct/11 ] |
|
Has anyone tried running this same test with 1.8.6 on the MDS to see if it passes? It would also be useful to either compile the 1.8.6 Lustre against the 2.6.18-274.3.1.el5 (RHEL 5) kernel from 1.8.7, or 1.8.7 against the 2.6.18-238.12.1.el5 (RHEL 5) from 1.8.6. It is possible that this is either an upstream ldiskfs bug, or possibly an existing bug that we didn't notice before due to a different test environment. |
| Comment by Oleg Drokin [ 19/Oct/11 ] |
|
I asked Sarah to try 1.8.5 and 1.8.6 to see if it happens there too. |
| Comment by Sarah Liu [ 19/Oct/11 ] |
|
I finished the same tests on 1.8.6, ran them 4 times; all pass:
[root@client-15 ~]# mpirun -np 12 -H client-15,client-5,client-2 -mca plm_rsh_agent rsh /usr/bin/mdtest -d /mnt/lustre/mdtest -i3 -n1024
mdtest-1.8.3 was launched with 12 total task(s) on 3 nodes
12 tasks, 12288 files/directories
SUMMARY: (of 3 iterations) – finished at 10/19/2011 13:01:42 –
mdtest-1.8.3 was launched with 12 total task(s) on 3 nodes
12 tasks, 122880 files/directories
SUMMARY: (of 3 iterations) – finished at 10/19/2011 13:09:44 – |
| Comment by Sarah Liu [ 20/Oct/11 ] |
|
As Peter and Andreas suggested, besides 1.8.6 I tried two builds between 1.8.6 and 1.8.7-rc1: #130 and #131. In #131 the kernel was updated from 2.6.18-238.12.1.el5 to 2.6.18-274.3.1.el5. The result: #130 passes both mdtest runs (1024 and 10240), while #131 hits this problem, and after the failure the MDTEST_NFILES=1024 run no longer passes either.
#130:
mdtest-1.8.3 was launched with 12 total task(s) on 3 nodes
12 tasks, 12288 files/directories
SUMMARY: (of 3 iterations) – finished at 10/19/2011 17:20:12 –
mdtest-1.8.3 was launched with 12 total task(s) on 3 nodes
12 tasks, 122880 files/directories
SUMMARY: (of 3 iterations) – finished at 10/19/2011 17:36:58 –
[root@client-15 ~]# rpm -qa|grep lustre
#131:
mdtest-1.8.3 was launched with 12 total task(s) on 3 nodes
12 tasks, 12288 files/directories
SUMMARY: (of 3 iterations) – finished at 10/19/2011 21:53:21 –
mdtest-1.8.3 was launched with 12 total task(s) on 3 nodes
12 tasks, 122880 files/directories
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. |
| Comment by Peter Jones [ 20/Oct/11 ] |
|
YangSheng, could you please look into how the checkin identified as the culprit caused this problem? Thanks, Peter |
| Comment by Yang Sheng [ 20/Oct/11 ] |
|
Ok. Taken. |
| Comment by Andreas Dilger [ 20/Oct/11 ] |
|
Yang Sheng, can you please attach a diff between those two kernels for fs/*.[ch] and fs/ext4/*.[ch]. It may be that a change in the ext4 code and/or VFS is causing the problem. |
| Comment by Yang Sheng [ 20/Oct/11 ] |
|
fs diff |
| Comment by Yang Sheng [ 20/Oct/11 ] |
|
ext4 diff |
| Comment by Andreas Dilger [ 20/Oct/11 ] |
|
I don't see any obvious changes in those two patches that might be causing this problem. In the ext4 code there is a patch in ext4_free_inode() that duplicates our own ext4-claim_inode-free_inode-race.patch, which has been removed from the ldiskfs-2.6-rhel5-ext4.series. The rest of the patches are for AIO with O_DIRECT files, which is completely irrelevant for the MDS, and a small fix to the quota code. There is one change in the VFS code in d_obtain_alias(), but I don't see that we use this on the MDS. Possibly the problem relates to our iopen patch? I can't see anything obvious. Hmm, I thought the two builds being tested differed only in the before/after kernel update, but I see that there is another patch (bec818434c27bb390b4c8866e73d1afb0dd9e884, http://review.whamcloud.com/1259) that was also included in that build, and it seems very relevant - a good candidate for the cause of this bug. |
| Comment by Andreas Dilger [ 20/Oct/11 ] |
|
That patch was from LU-146. |
| Comment by Oleg Drokin [ 20/Oct/11 ] |
|
Actually I kind of don't think the LU-146 patch is at fault. We only see the problem in the mkdir/rmdir case. Besides, the erroneous lookup (rather, the invalid nlink value) clearly comes from ldiskfs, and ldiskfs was not touched by LU-146. Can you please elaborate? |
| Comment by Oleg Drokin [ 20/Oct/11 ] |
|
BTW, I would agree that running a test with just the kernel different, and perhaps with just LU-146 applied, would be a good idea to clear up who's at fault. |
| Comment by Oleg Drokin [ 20/Oct/11 ] |
|
BTW, another note. /* Step 1: Lookup parent */ |
| Comment by Sarah Liu [ 20/Oct/11 ] |
|
I will test the review builds, which contain one patch at a time. Will keep you updated. |
| Comment by Sarah Liu [ 21/Oct/11 ] |
|
I tested review build #2398, which included the patch for LU-146. |
| Comment by Johann Lombardi (Inactive) [ 21/Oct/11 ] |
|
From ext4_iget() (called from iopen):
if (inode->i_nlink == 0) {
if (inode->i_mode == 0 ||
!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS)) {
/* this inode is deleted */
ret = -ESTALE;
goto bad_inode;
}
/* The only unlinked inodes we let through here have
* valid i_mode and are being read by the orphan
* recovery code: that's fine, we're about to complete
* the process of deleting those. */
}
So the fs has EXT4_ORPHAN_FS set? How is this possible? |
| Comment by Oleg Drokin [ 21/Oct/11 ] |
|
Note that nlink=0 is reported on the parent dir. Regarding the code you quote, I believe it's just that i_mode is NOT 0 (of course) and EXT4_ORPHAN_FS is NOT set either. So, time to see what might be doing nlink-- on the rmdir path, even if only briefly? |
| Comment by Andreas Dilger [ 21/Oct/11 ] |
|
Yes, I see this now - the directory link count is temporarily set to 0 if a subdirectory is deleted:
int ext4_rmdir(struct inode *dir, struct dentry *dentry)
{
:
:
ext4_dec_count(handle, dir);
:
:
}
static void ext4_dec_count(handle_t *handle, struct inode *inode)
{
drop_nlink(inode);
if (S_ISDIR(inode->i_mode) && inode->i_nlink == 0)
inc_nlink(inode);
}
This is called via ll_vfs_rmdir(). In mds_reint_create() and mds_mfd_close() it calls ll_vfs_rmdir(dir) under LOCK_INODE_MUTEX(dir). Perhaps we are not doing the lookup under i_mutex, or not refreshing the attributes under i_mutex?
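To make that window concrete, here is a minimal single-threaded userspace model (a sketch of mine, not kernel or MDS code; the struct, helper names, and the i_nlink=1 starting value, which assumes the dir_nlink overflow state a very large directory can reach, are all illustrative):
#include <stdio.h>

struct toy_inode { int i_nlink; };      /* stand-in for the kernel inode */

/* Models the unlocked MDS nlink check: run inside ext4_dec_count()'s
 * window, a perfectly healthy directory looks deleted. */
static int toy_mds_check(const struct toy_inode *dir)
{
        return dir->i_nlink == 0 ? -2 /* -ENOENT */ : 0;
}

int main(void)
{
        /* Any count that reaches 0 mid-ext4_dec_count() opens the window;
         * i_nlink == 1 is the assumed dir_nlink overflow case. */
        struct toy_inode dir = { 1 };

        dir.i_nlink--;          /* ext4_dec_count(): drop_nlink(dir) */
        /* ... a concurrent lookup reading i_nlink lands HERE ... */
        printf("check inside the window: %d\n", toy_mds_check(&dir));  /* -2 */
        if (dir.i_nlink == 0)
                dir.i_nlink++;  /* ext4_dec_count(): inc_nlink(dir) */
        printf("check after the window:  %d\n", toy_mds_check(&dir));  /* 0 */
        return 0;
}
|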
| Comment by Andreas Dilger [ 21/Oct/11 ] |
|
It looks like the patch that landed in ext4 is not the same as what was in our original patch:
static inline void ext3_dec_count(handle_t *handle, struct inode *inode)
{
if (!S_ISDIR(inode->i_mode) || inode->i_nlink > 2)
drop_nlink(inode);
}
I don't know if the locking changes in LU-146 are a factor. |
| Comment by Oleg Drokin [ 21/Oct/11 ] |
|
|
| Comment by Andreas Dilger [ 21/Oct/11 ] |
|
What I meant was whether the changes in the LU-146 patch are what exposed this race. It is possible to add another ext4 patch to change ext4_dec_count() back to how it was in our patches, which is probably the most expedient thing to do for 1.8.7, but it would be better to avoid the race in Lustre instead of adding more patches to ldiskfs. That said, I don't think this is needed for 2.x, so maybe that is OK until 1.8.x is gone. |
| Comment by Oleg Drokin [ 21/Oct/11 ] |
|
I don't think LU-146 touches anything related to attribute reading and the inode mutex around it; it's likely a problem we have had from the very beginning. |
| Comment by Oleg Drokin [ 21/Oct/11 ] |
|
BTW, that said, why would it be fine in the kernel to have nlink 0 even temporarily? It's error-prone; why not have it correct at all times instead? |
| Comment by Andreas Dilger [ 21/Oct/11 ] |
|
It is OK for the kernel because this is always accessed under i_mutex in the cases that matter? I haven't checked yet, but my guess is that mds_get_parent_child_locked() (or somewhere else) is accessing i_nlink without a lock. That said, now I'm a bit confused, because the code in question (ext4_iget()) only hits this path when reading the inode from disk. That means either we are getting inode aliases in memory (which seems unlikely, since it would be getting a valid i_nlink from disk - ext4_dec_count() only modifies the in-memory inode), or we are accessing the parent inode before it is fully set up. It may possibly relate to strangeness in the iopen code; I haven't looked at that in a long time. It may be that my source of the i_nlink=0 condition is misdiagnosed. |
| Comment by Oleg Drokin [ 21/Oct/11 ] |
|
Your source of the problem seems very plausible. I am not sure what iget has to do with this: we are getting this inode via a normal lookup, and we don't get any sort of error from ldiskfs itself; just our own check of nlink finds that it is suddenly 0. And yes, we do check it without any locking:
        result = mds_lookup(obd, fid_name, mds->mds_fid_de, strlen(fid_name));
        if (IS_ERR(result))
                RETURN(result);
        inode = result->d_inode;
        if (!inode)
                RETURN(ERR_PTR(-ENOENT));
        if (inode->i_nlink == 0) {
                if (inode->i_mode == 0 &&
                    LTIME_S(inode->i_ctime) == 0) {
                        LCONSOLE_WARN("Found inode with zero nlink, mode and "
                                      "ctime -- this may indicate disk "
                                      "corruption (device %s, inode %lu, "
                                      "link: %lu, count: %d)\n",
                                      obd->obd_name, inode->i_ino,
                                      (unsigned long)inode->i_nlink,
                                      atomic_read(&inode->i_count));
                }
                dput(result);
                RETURN(ERR_PTR(-ENOENT));
        }
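For illustration, one shape a Lustre-side mitigation could take (a sketch only, not the fix that actually landed; it assumes taking the inode's i_mutex at this point is safe with respect to the callers' lock ordering, which would need verifying, and reuses the LOCK/UNLOCK_INODE_MUTEX() helpers seen elsewhere in the 1.8 MDS code):
        if (inode->i_nlink == 0) {
                /* Re-validate under i_mutex: ext4_rmdir() manipulates the
                 * parent's count under this lock, so a zero that is stable
                 * here means the inode is genuinely unlinked, while the
                 * transient zero from ext4_dec_count() cannot be observed. */
                LOCK_INODE_MUTEX(inode);
                if (inode->i_nlink == 0) {
                        UNLOCK_INODE_MUTEX(inode);
                        dput(result);
                        RETURN(ERR_PTR(-ENOENT));
                }
                UNLOCK_INODE_MUTEX(inode);      /* false alarm: carry on */
        }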
|
| Comment by Andreas Dilger [ 21/Oct/11 ] |
|
I've just created http://review.whamcloud.com/1579, which has a patch that may fix this problem. It hasn't been tested at all, but hopefully it is right.
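For reference, the direction of that fix, reconstructed from the ext3_dec_count() quoted earlier in this thread (a sketch of the idea, not the verbatim content of change 1579):
static void ext4_dec_count(handle_t *handle, struct inode *inode)
{
        /* Restore the original patch's behavior: a still-linked directory
         * never legitimately drops below nlink == 2 (or below the dir_nlink
         * overflow value of 1), so skip the decrement entirely rather than
         * letting the count pass through zero as drop_nlink()/inc_nlink()
         * did. */
        if (!S_ISDIR(inode->i_mode) || inode->i_nlink > 2)
                drop_nlink(inode);
}
|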
| Comment by Sarah Liu [ 21/Oct/11 ] |
|
I will test it when Toro is accessible. |
| Comment by Sarah Liu [ 23/Oct/11 ] |
|
I've tested patch http://review.whamcloud.com/#change,1579 3 times; both 1024 and 10240 pass. |
| Comment by Build Master (Inactive) [ 24/Oct/11 ] |
|
Integrated in Johann Lombardi : 9500ebf6146dcd50dab094b502d4b50d0123296a
|
| Comment by Cliff White (Inactive) [ 28/Oct/11 ] |
|
The test fails on RC2 also: 0000: mdtest-1.7.3 was launched with 1032 total task(s) on 129 nodes |
| Comment by Oleg Drokin [ 28/Oct/11 ] |
|
This seems to be a somewhat different problem now: it fails in create because the file does not exist? Need some logs, I guess, probably from the MDS, e.g. +vfstrace +dlmtrace +...trace (this last one is the one that traces execution, though I forget the name). |
| Comment by Cliff White (Inactive) [ 28/Oct/11 ] |
|
I re-ran the test with the requested debug options and it failed. Am uploading to Maloo now. |
| Comment by Cliff White (Inactive) [ 31/Oct/11 ] |
|
Maloo results https://maloo.whamcloud.com/test_sessions/02a44c1a-01a0-11e1-bb4f-52540025f9af |
| Comment by Cliff White (Inactive) [ 31/Oct/11 ] |
|
MDS debug log |
| Comment by Oleg Drokin [ 31/Oct/11 ] |
|
Ok, this log is way too small; it only covers 9 directory creates out of 1038. To get more consistent results, can you please increase debug_mb to 1G on the clients and to 2G on the MDS, and in addition use debug_daemon (with a 10240M argument) on the MDS. Once it fails, capture the MDS log from the debug daemon, and also capture a couple of Lustre debug logs from clients that reported unique-create failures. I also checked the mdtest 1.7.3 source, and the failure to create appears to be a failure to create the unique dirs themselves, not in the middle of the test, surprisingly. |
| Comment by Cliff White (Inactive) [ 01/Nov/11 ] |
|
Ran a second test for Oleg. Results at: https://maloo.whamcloud.com/test_sessions/c4849490-04af-11e1-aa99-52540025f9af |
| Comment by Peter Jones [ 02/Nov/11 ] |
|
So, my understanding from talking to Oleg and Cliff is that the reported "recurrence" has been linked to using an old and unreliable version of mdtest, so this issue is actually fixed in 1.8.7 RC2. Can this ticket be marked as resolved, or does an equivalent fix need to land on master? |
| Comment by Oleg Drokin [ 02/Nov/11 ] |
|
I suspect we need this on master for the case of running on an upgraded filesystem that was originally formatted under 1.8.x, so no OI is used. Somebody needs to confirm my understanding of this. |
| Comment by Peter Jones [ 03/Nov/11 ] |
|
Talking to Andreas, it seems safest to port this change to master. YangSheng, can you please take care of that? |
| Comment by Andreas Dilger [ 03/Nov/11 ] |
|
I'm not sure what you are asking about related to the OI. When a 1.8.x-formatted filesystem is upgraded to 2.x it will get an OI to hold new FIDs, but old IGIF FIDs are not added to the OI (at least currently; adding them might make life easier in some cases, but that is a separate discussion). I think it is reasonable to land the ldiskfs patch to master as well, just to keep the ldiskfs patches in sync. The patch was already accepted upstream for ext4, so eventually the patch can go away. Yang Sheng, can you please add the ext4-nlink-2.6-rhel5.patch to the master ldiskfs series as well. |
| Comment by Yang Sheng [ 03/Nov/11 ] |
|
Hi Peter, Adilger - patch uploaded to: http://review.whamcloud.com/#change,1644 |
| Comment by Build Master (Inactive) [ 11/Nov/11 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Yang Sheng [ 16/Dec/11 ] |
|
Closing for now. Please feel free to reopen. |