[LU-6664] (ost_handler.c:1765:ost_blocking_ast()) Error -2 syncing data on lock cancel Created: 29/May/15 Updated: 27/Jul/16 Resolved: 31/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On all of our filesystems, the following error message is extremely common:

LustreError: 8746:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel

There is nothing else in the logs that gives any hint as to why this message is appearing.

Our filesystems all use osd-zfs, and we are currently running Lustre 2.5.3-5chaos (see github.com/chaos/lustre).

If this is a symptom of a bug, then please fix it. If this is not a symptom of a bug, then please stop scaring our system administrators with this message. |
| Comments |
| Comment by Peter Jones [ 29/May/15 ] |
|
Bobijam, could you please look into this issue? Thanks, Peter |
| Comment by Oleg Drokin [ 29/May/15 ] |
|
Bobi: Error -2 here is ENOENT, returned when we are supposedly trying to flush data on lock cancel. Technically, if we hold a lock, the object should be there. The only exception I can think of is when the actual object destroy happens under the lock: at the end of the destroy the lock is still there and the object is not, but then there should be nothing to flush. |
| Comment by Zhenyu Xu [ 01/Jun/15 ] |
|
Chris, did your system undergo recovery just before these messages appeared? |
| Comment by Christopher Morrone [ 01/Jun/15 ] |
|
No, it has not just undergone recovery. There are no other messages in the logs surrounding these ost_blocking_ast() messages. |
| Comment by Zhenyu Xu [ 06/Jun/15 ] |
|
http://review.whamcloud.com/15167

commit message:

LU-6664 ofd: LDLM lock should cover object destroy

The exclusive PW lock protecting OST object destroy should be released after object destroy procedure. Quench error messages of object unavailability when trying to cancel a LDLM lock. |
| Comment by Oleg Drokin [ 09/Jun/15 ] |
|
While reviewing this patch (the comments are in the patch), I remembered that LLNL had suspicions of double-referenced objects in the past ( I wonder if lfsck in 2.5.4 is already in good enough shape to detect that. |
| Comment by Andreas Dilger [ 06/Jul/15 ] |
|
The LFSCK in 2.5.x does not check MDT<->OST consistency. That feature ("lctl lfsck_start -t layout") wasn't added until 2.6.0. |
| Comment by Andreas Dilger [ 14/Aug/15 ] |
|
Bobijam, can you please also make a version of your patch for master. |
| Comment by Gerrit Updater [ 15/Aug/15 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15997 |
| Comment by Andreas Dilger [ 21/Aug/15 ] |
|
Oleg and I looked into this issue more closely, and the current patch doesn't really solve the problem: the race occurs when the two destroy threads are getting and dropping the DLM lock, not when the actual destroy is happening. In master, the equivalent function tgt_blocking_ast() already checks dt_object_exists() and skips the call into ofd_sync() that generates this message entirely. I think the right fix (for 2.5.x only) is to just skip this message for rc == -ENOENT, as is already done in master. |
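The proposed 2.5.x behaviour can be illustrated with a minimal standalone sketch. This is not the Lustre code: `sync_error_is_reportable()` and `report_sync_result()` are hypothetical helpers invented for illustration, and only the `rc == -ENOENT` rule and the message text come from this ticket.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a Lustre function): decide whether the result
 * of a sync on lock cancel deserves an error message.
 *
 * rc == 0        -> the sync succeeded, nothing to report
 * rc == -ENOENT  -> the object is already gone, so there is nothing left
 *                   to flush; treated as benign, per the comment above
 * anything else  -> a real failure that should still be logged
 */
static bool sync_error_is_reportable(int rc)
{
	return rc != 0 && rc != -ENOENT;
}

/*
 * Hypothetical stand-in for the logging step in the blocking AST.
 * Returns the number of messages written so callers can check it.
 */
static int report_sync_result(int rc)
{
	if (!sync_error_is_reportable(rc))
		return 0;

	fprintf(stderr, "Error %d syncing data on lock cancel\n", rc);
	return 1;
}
```

With this rule, `report_sync_result(-ENOENT)` stays silent while other failures such as `-EIO` are still reported, so administrators only see the message when the sync genuinely failed.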
| Comment by Amit (Inactive) [ 24/Aug/15 ] |
|
I too see tons of these error messages, and they appear consistently on most of our OSSs:

LustreError: 7577:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel

Any help in resolving them would be appreciated. I can provide debug logs if required.

Thank you, |
| Comment by Peter Jones [ 31/Aug/15 ] |
|
As per LLNL, OK to close. |