LU-4584: Lock revocation process fails consistently

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical

    Description

      Some users have reported to us that the "rm" command is taking a long time. Some investigation revealed that at least the first "rm" in a directory takes just over 100 seconds, which of course sounds like OBD_TIMEOUT_DEFAULT.
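
      As a quick sanity check (a hedged sketch; 100 seconds is only the stock default and may be tuned differently on a given site), the configured obd timeout can be read with lctl on a client or server:

          # Print the configured obd timeout; 100 is the stock OBD_TIMEOUT_DEFAULT
          lctl get_param timeout
          # timeout=100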

      This isn't necessarily the simplest reproducer, but the following steps reproduce the problem completely consistently (a shell sketch follows the list):

      1. set the directory's default stripe count to 48
      2. touch a file on client A
      3. rm file on client B
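
      A command-level sketch of the reproducer; the mount point and file names are placeholders, assuming the same /mnt/lustre mount on both clients:

          # Client A: set the directory's default stripe count to 48, then create a file
          lfs setstripe -c 48 /mnt/lustre/testdir
          touch /mnt/lustre/testdir/victim

          # Client B: remove the file; before the fix this first rm stalls for ~100 seconds
          time rm /mnt/lustre/testdir/victim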

      The clients are running 2.4.0-19chaos and the servers are at 2.4.0-21chaos. The servers use ZFS as the backend.

      I have some Lustre logs that I will share and discuss in additional posts to this ticket. But essentially it looks like the server always times out on an AST to client A (explaining the 100-second delay). It is not yet clear to me why that happens, because client A appears to be completely responsive. My current suspicion is that the MDT is to blame.
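
      One hedged way to confirm that the server is timing out a blocking AST (rather than client A misbehaving in some other fashion) is to watch the MDS console log for the message ldlm prints when its lock callback timer fires; the exact grep below is an assumption about how the site collects kernel logs:

          # On the MDS: look for blocking-AST timeouts and the resulting evictions
          dmesg | grep -i "lock callback timer expired"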

      Attachments

        1. 172.16.66.4@tcp.log.bz2
          40 kB
        2. 172.16.66.5@tcp.log.bz2
          53 kB
        3. 172.20.20.201@o2ib500.log.bz2
          8.52 MB
        4. client_log_20140206.txt
          375 kB
        5. inflames.log
          2.40 MB


          Activity

            [LU-4584] Lock revocation process fails consistently

            simmonsja James A Simmons added a comment -

            I have been testing with the LU-4584 patch and I'm still seeing client evictions. Could it be possible to get the LU-2827 patch working on 2.4?

            simmonsja James A Simmons added a comment -

            It was my bad: the last test shot used our 2.4 production file system, which didn't have the patch from here, so the breakage above is expected. We are in the process of testing this at larger scale on a 500-node production machine. Yes, ORNL has created a public git tree:

            https://github.com/ORNL-TechInt/lustre

            so people can examine our special sauce.

            morrone Christopher Morrone (Inactive) added a comment -

            James, can you share the patch stack you are using? That might help us figure out if you are reporting the same issue or something else. And if it isn't exactly the same issue, we really need to get you to report it in another ticket.

            simmonsja James A Simmons added a comment -

            Just finished a test shot with Cray 2.5 clients to see if the client evictions stopped. Their default client, which is some 2.5 version with many, many patches, lacked the LU-2827 and LU-4861 patches that I found helped with 2.5.2. So I applied the patches from LU-2827 and LU-4861 but still had client evictions. I collected the logs from the server side and have placed them here:

            ftp.whamcloud.com/uploads/LU-4584/atlas2_testshot_Jul_29_2014_debug_logs.tar.gz

            bfaccini Bruno Faccini (Inactive) added a comment -

            BTW, I forgot to indicate here that my b2_4 patch/back-port for LU-2827 (http://review.whamcloud.com/10902) still has some problems and needs some rework, because the MDS bombs with "(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed" when running the LLNL reproducer from LU-4584 or recovery-small/test_53 in auto-tests.
            More to come; the crash dump is under investigation, but we can still use http://review.whamcloud.com/9488 as a fix for b2_4.
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            A merged b2_4 backport of both the #5978 and #10378 master changes from LU-2827 is at http://review.whamcloud.com/10902.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Because the client's reply buffer was not big enough to receive the first server reply, which includes the LVB/layout, due to the default/large striping.
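
            Since the oversized reply stems from the wide default layout, one hedged way to confirm what the directory hands out (the path is a placeholder from the reproducer) is:

                # Show only the directory's default layout; a 48-stripe default means the
                # layout/LVB carried in the reply is correspondingly large
                lfs getstripe -d /mnt/lustre/testdir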

            morrone Christopher Morrone (Inactive) added a comment -

            Why are messages being resent?

            bfaccini Bruno Faccini (Inactive) added a comment -

            Prior to the patch from this ticket and/or LU-2827, there was a bug during server handling of resent requests (the new lock, created early, was found during lookup instead of the first one), causing the first/old lock to become orphaned and replicated.
            It has been decided that the patch(es) from LU-2827 will be used to fix this issue, because they are more generic and handle all cases, mainly by detecting the resent case earlier and avoiding the unnecessary new lock creation.
            I am presently porting and testing a b2_4 backport of the LU-2827 patches and will provide updates ASAP.

            morrone Christopher Morrone (Inactive) added a comment -

                So now that the unrelated issue being tracked for ORNL has moved to LU-5225 and the patches from LU-2827 have landed to master, can this issue be marked as a duplicate of LU-2827?

            First, can we get an explanation of how that fixes this problem, and then a clear list of the patch(es) I need to apply to b2_4?
            pjones Peter Jones added a comment -

            So now that the unrelated issue being tracked for ORNL has moved to LU-5225 and the patches from LU-2827 have landed to master, can this issue be marked as a duplicate of LU-2827?

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 1
              Watchers: 29
