Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: None
- 3
- 12530
Description
Some users have reported to us that the "rm" command is taking a long time. Some investigation revealed that at least the first "rm" in a directory takes just over 100 seconds, which of course sounds like OBD_TIMEOUT_DEFAULT.
This isn't necessarily the simplest reproducer, but the following reproducer is completely consistent:
- set directory striping default count to 48
- touch a file on client A
- rm file on client B
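The reproducer above can be sketched as a shell session. This is a hedged illustration only: the mount point `/mnt/lustre`, directory name `testdir`, and file name `foo` are assumptions, not from the original report; the `lfs setstripe -c 48` invocation is the standard way to set a directory's default stripe count.

```shell
# Assumed mount point and names for illustration only.
# On any client: set the directory's default stripe count to 48.
lfs setstripe -c 48 /mnt/lustre/testdir

# On client A: create a file in that directory.
touch /mnt/lustre/testdir/foo

# On client B: remove it. Per the report, the first rm in the
# directory stalls for just over 100 seconds (OBD_TIMEOUT_DEFAULT).
time rm /mnt/lustre/testdir/foo
```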
The clients are running 2.4.0-19chaos, servers are at 2.4.0-21chaos. The servers are using zfs as the backend.
I have some Lustre logs that I will share and discuss in additional posts to this ticket. But essentially it looks like the server always times out on an AST to client A (explaining the 100-second delay). It is not yet clear to me why that happens, because client A appears to be completely responsive. My current suspicion is that the MDT is to blame.
Issue Links
- duplicates
  - LU-4963 client eviction during IOR test - lock callback timer expired (Closed)
- is related to
  - LU-5525 ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed (Resolved)
  - LU-5632 ldlm_lock_addref()) ASSERTION( lock != ((void *)0) ) (Resolved)
  - LU-5686 (mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed (Resolved)
- is related to
  - LU-2827 mdt_intent_fixup_resent() cannot find the proper lock in hash (Resolved)
Actually, our Cray clients don't have the LU-3338 patch, and the problems of LU-5530 still showed up, so these patches are still needed. We plan to do a full-scale test shot Tuesday next week with all the above patches plus a few others. If it goes well, we will leave this system running 2.5 clients and 2.5 servers. We still need this work because another file system running at the lab will be stuck at 2.4 for the time being, but some of our clients will be moving to 2.5, so these problems will show up.