[LU-4403] ASSERTION( lock->l_readers > 0 ) Created: 20/Dec/13 Updated: 04/Aug/14 Resolved: 08/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl, mn4 | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12085 |
| Description |
|
<0>LustreError: 5766:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed: |
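To illustrate what the assertion protects against, here is a small standalone toy program (not Lustre code; the struct and helper names are made up for illustration) that reproduces the same failure pattern: dropping a reader reference on a lock whose reader count is already zero.

    #include <assert.h>
    #include <stdio.h>

    /* Toy stand-in for struct ldlm_lock; only the reader refcount matters here. */
    struct toy_lock {
            int l_readers;
    };

    static void toy_addref_read(struct toy_lock *lock)
    {
            lock->l_readers++;
    }

    static void toy_decref_read(struct toy_lock *lock)
    {
            /* Counterpart of ASSERTION( lock->l_readers > 0 ) in
             * ldlm_lock_decref_internal_nolock(). */
            assert(lock->l_readers > 0);
            lock->l_readers--;
    }

    int main(void)
    {
            struct toy_lock lock = { 0 };

            toy_addref_read(&lock);
            toy_decref_read(&lock);  /* fine: a reference was held */
            toy_decref_read(&lock);  /* double decref: assert() fires, analogous
                                      * to the LBUG on the MDS */
            printf("never reached\n");
            return 0;
    }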
| Comments |
| Comment by Peter Jones [ 20/Dec/13 ] |
|
Mike Do you think that this might be related to Peter |
| Comment by Di Wang [ 20/Dec/13 ] |
|
IMHO, the fix in |
| Comment by Di Wang [ 20/Dec/13 ] |
|
It seems this is caused by a race between mdt_intent_fixup_resent and mdt_object_unlock, i.e. mdt_intent_fixup_resent might return a released lock here. Here is the patch: http://review.whamcloud.com/8642 |
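A rough sketch of the suspected race described above (only mdt_intent_fixup_resent and mdt_object_unlock are taken from the comment; the rest of the flow is an assumption for illustration):

    /*
     * Resent open intent                      Concurrent unlock of the same lock
     * --------------------------------        -----------------------------------
     * mdt_intent_fixup_resent()
     *   finds the lock already granted
     *   for the original request and
     *   hands it back to the intent path
     *                                         mdt_object_unlock()
     *                                           drops the reference(s) on that
     *                                           lock and releases it
     * the intent path later decrefs the
     * lock it was handed, but no reader
     * reference is held any more
     *   -> ASSERTION( lock->l_readers > 0 )
     */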
| Comment by Mikhail Pershin [ 21/Dec/13 ] |
|
Thanks Di! |
| Comment by Di Wang [ 31/Dec/13 ] |
|
http://review.whamcloud.com/8680 patch for master. |
| Comment by Jay Lan (Inactive) [ 01/Jan/14 ] |
|
We just hit the same problem half an hour ago. The MDS runs the lustre server with patch #8642 included. |
| Comment by Peter Jones [ 01/Jan/14 ] |
|
Jay How long has the patch been applied? Is it possible to ascertain yet whether the frequency of occurrence has been altered since it was applied? Peter |
| Comment by Di Wang [ 01/Jan/14 ] |
|
Jay, Same stack trace? If not, please post here. Are there any other console error messages? WangDi |
| Comment by Mahmoud Hanafi [ 01/Jan/14 ] |
|
Here is the stack trace.
bp7-mds1 login: Lustre: MGS: haven't heard from client fd1923ac-e3da-a3a1-46c2-d0613e7a86a3 (at 10.151.0.150@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880696019800, cur 1388538333 expire 1388538183 last 1388538106 |
| Comment by Mahmoud Hanafi [ 01/Jan/14 ] |
|
One thing to note: when the MDS hits this LBUG, several OSSes will crash. Here is their bt. |
| Comment by Di Wang [ 01/Jan/14 ] |
|
Hmm, I think this OSS crash is a different issue; you probably need to open a new ticket for it. |
| Comment by Di Wang [ 01/Jan/14 ] |
|
Hmm, does your Lustre version include the fix from LU-4179 mdt: skip open lock enqueue during resent
Skip open lock enqueue, if the open lock has been
acquired(mdt_intent_fixup_resent) during resent.
Signed-off-by: wang di <di.wang@intel.com>
Change-Id: I625ca438e28520416ee2af884d0a9f9e6f21cf2e
Reviewed-on: http://review.whamcloud.com/8173
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
and LU-3273 mdt: Hold med_open_lock before walking med_open_head
Fixed a bug where during replay mdt_mfd_open() calls mdt_handle2mfd()
without acquiring the med_open_lock.
We now take the med_open_lock before traversing med_open_head list.
This bug was noticed during the analysis of LU-3233.
Signed-off-by: Swapnil Pimpale <spimpale@ddn.com>
Change-Id: Ib879f65d41d35f266897e8961dac78e6c4f0d9ec
Reviewed-on: http://review.whamcloud.com/7272
Tested-by: Hudson
Tested-by: Maloo <whamcloud.maloo@gmail.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Thanks. |
| Comment by Jay Lan (Inactive) [ 01/Jan/14 ] |
|
Peter, the patch was installed Saturday 12/28 afternoon. I do not know how frequently it crashed before; Mahmoud would know better. Di, the source is at https://github.com/jlan/lustre-nas/commits/nas-2.4.0-1. |
| Comment by Di Wang [ 02/Jan/14 ] |
|
Mahmoud, could you please trace in the source code which line this "[<ffffffffa0e05a5c>] mdt_object_open_unlock+0xac/0x110 [mdt]" refers to? |
| Comment by Mahmoud Hanafi [ 02/Jan/14 ] |
|
FILE: mdt_open.c
crash> bt -l |
| Comment by Di Wang [ 02/Jan/14 ] |
|
Thanks, Mahmoud! It seems that in 2.4 the MDS will enqueue the open lock anyway, and if the client does not require the open lock, the MDT will release it later. This is different from b2_1, which only enqueues the open lock if the client requires it (probably brought in by the layout lock patch?). Anyway, this change adds some "local" open locks to the export hash list, which triggers this problem. I will post a patch soon. |
| Comment by Di Wang [ 02/Jan/14 ] |
|
OK, I just updated the patch for b2_4: http://review.whamcloud.com/8642 |
| Comment by Jay Lan (Inactive) [ 03/Jan/14 ] |
|
Di, is |
| Comment by Di Wang [ 03/Jan/14 ] |
|
Jay, probably not; the patch here should be enough for this problem. Btw, did you create a new ticket for the crash you found on the OSS? |
| Comment by Mahmoud Hanafi [ 06/Jan/14 ] |
|
We installed the latest version of the patch. It still crashed twice after the update. We now know what is causing this: a user is using HDF5 to do parallel writes to one file from ~2000 processes. I am working on getting a reproducer. The new crashes were exactly the same. |
| Comment by Di Wang [ 07/Jan/14 ] |
|
Interesting. Hmm, if you can reproduce this on a test system and collect some debug logs when the crash happens, that would be very helpful. Thanks. |
| Comment by Mahmoud Hanafi [ 07/Jan/14 ] |
|
Which debug logs would you like me to collect? Since the system drops into kdb as soon as we hit the LBUG, we need a way to collect those logs. |
| Comment by Di Wang [ 07/Jan/14 ] |
|
You can disable panic_on_lbug on the MDS (lctl set_param panic_on_lbug=0); then, if an LBUG happens, the system will automatically dump the debug log (the console messages will say where). Also, setting the debug level to -1 (lctl set_param debug=-1) and enlarging the debug buffer (lctl set_param debug_size=30) on the MDS will make sure the debug log includes enough information at the time of the LBUG. Note that these parameter changes (debug and debug_size) will slow down your system, so please be aware of this. |
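For reference, the commands described above, run on the MDS:

    lctl set_param panic_on_lbug=0   # keep the node up so the debug log can be dumped
    lctl set_param debug=-1          # enable all debug flags (slows the system down)
    lctl set_param debug_size=30     # enlarge the debug buffer, as suggested above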
| Comment by Jinshan Xiong (Inactive) [ 07/Jan/14 ] |
|
Hi Mahmoud, will you please tell me what the tip of your branch is, in reference to your comment at "Mahmoud Hanafi added a comment - 02/Jan/14 9:05 AM"? Jinshan |
| Comment by Jinshan Xiong (Inactive) [ 07/Jan/14 ] |
|
Just a quick update, will you try the patch below to see if I have some good luck here:

[jinxiong@intel nasa]$ git diff
diff --git a/lustre/mdt/mdt_open.c b/lustre/mdt/mdt_open.c
index 545507f..f2a23ee 100644
--- a/lustre/mdt/mdt_open.c
+++ b/lustre/mdt/mdt_open.c
@@ -1437,7 +1437,7 @@ int mdt_reint_open(struct mdt_thread_info *info, struct mdt_lock_handle *lhc)
        struct lu_fid *child_fid = &info->mti_tmp_fid1;
        struct md_attr *ma = &info->mti_attr;
        __u64 create_flags = info->mti_spec.sp_cr_flags;
-       __u64 ibits;
+       __u64 ibits = 0;
        struct mdt_reint_record *rr = &info->mti_rr;
        struct lu_name *lname;
        int result, rc; |
| Comment by Jinshan Xiong (Inactive) [ 07/Jan/14 ] |
|
Obviously ibits is not initialized in this case, which causes the lock in @lhc to be dropped even though we do not hold any references on it. |
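A simplified sketch of the failure mode; the control flow below is an assumption for illustration and is not the verbatim mdt_reint_open() code:

    /* In mdt_reint_open(), before the fix: */
    __u64 ibits;                        /* never set on some open paths, so it
                                         * can hold stack garbage               */

    /* ... open processing; certain paths return without taking an open lock
     *     or setting ibits ... */

    if (ibits & MDS_INODELOCK_OPEN)     /* garbage bits can make this test true */
            mdt_object_open_unlock(...);/* decrefs the lock in @lhc although no
                                         * reference was ever taken on it:
                                         *   ASSERTION( lock->l_readers > 0 )   */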
| Comment by Di Wang [ 08/Jan/14 ] |
|
Ah, good catch, Jinshan. I really missed this. |
| Comment by Mahmoud Hanafi [ 08/Jan/14 ] |
|
I have attached mdt_thread_info structure from the dump. It may help... |
| Comment by Di Wang [ 08/Jan/14 ] |
|
Mahmoud: please try the updated patch http://review.whamcloud.com/#/c/8642/ |
| Comment by Jinshan Xiong (Inactive) [ 08/Jan/14 ] |
|
Patch http://review.whamcloud.com/6511 already fixes this problem. It is worth trying it alone if you guys have a chance. |
| Comment by Jinshan Xiong (Inactive) [ 14/Jan/14 ] |
|
Dropping the priority, as there has been no response from the customer; meanwhile, I believe we've found the root cause of this issue. |
| Comment by Jay Lan (Inactive) [ 14/Jan/14 ] |
|
We had patch set 5 of #8642 installed on Jan 8th. Yesterday morning the MDS crashed (still running patch set 4) and booted up with patch set 5. Early this morning the MDS crashed again; however, that crash was caused by another bug on an OSS, and the OSS crash brought down the MDS. So we have had patch set 5 running for > 1 day without hitting this problem. We will let it soak for more time. |
| Comment by Jinshan Xiong (Inactive) [ 15/Jan/14 ] |
|
Thanks for the update, Jay, and good luck with patch set 5. |
| Comment by Mahmoud Hanafi [ 28/Jan/14 ] |
|
Patch set 5 didn't fix the issue. We just hit this bug again.
LustreError: 45299:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed:
PID: 20719  TASK: ffff880368864aa0  CPU: 24  COMMAND: "mdt01_059" |
| Comment by Jinshan Xiong (Inactive) [ 28/Jan/14 ] |
|
Can you share the following info with me: Jinshan |
| Comment by Jay Lan (Inactive) [ 28/Jan/14 ] |
|
My bad. The patch set #5 was in my nas-2.4.0-1 branch, but not We just upgraded our server to 2.4.1 yesterday. |
| Comment by Peter Jones [ 08/Feb/14 ] |
|
Patch landed for 2.6 |
| Comment by javed shaikh (Inactive) [ 12/Feb/14 ] |
|
Just FYI, we were hit on 9th Feb... I've attached the mds.log. |
| Comment by James Nunez (Inactive) [ 18/Apr/14 ] |
|
Patch for b2_5 at http://review.whamcloud.com/#/c/9779/ |