[LU-5233] 2.6 DNE stress testing: (lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed
Created: 19/Jun/14  Updated: 26/Jun/14  Resolved: 26/Jun/14
| Status:            | Resolved                   |
| Project:           | Lustre                     |
| Component/s:       | None                       |
| Affects Version/s: | None                       |
| Fix Version/s:     | Lustre 2.6.0               |
| Type:              | Bug                        |
| Priority:          | Critical                   |
| Reporter:          | Patrick Farrell (Inactive) |
| Assignee:          | Di Wang                    |
| Resolution:        | Fixed                      |
| Votes:             | 0                          |
| Labels:            | HB, dne2                   |
| Severity:          | 3                          |
| Rank (Obsolete):   | 14584                      |
| Description |
On the same system as before, we hit:

<0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed:

Additionally, for some time before the LBUG we had a stuck thread. In all of these instances, this thread is stuck in a rather odd spot in cfs_hash_bd_lookup_intent: specifically, it reports as being stuck on the cfs_hash_keycmp line. It's not clear to me how a thread could get stuck there; I may be missing some operation it's doing as part of that. I'll make the dump available shortly.
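To make the two failure modes concrete, here are two minimal, hypothetical C sketches; the names and structures are assumptions for illustration, not the actual Lustre source. The first shows the shape of the failed assertion: declaring an attr_set against a striped object presumes the stripe array has already been loaded, so an object whose layout was never instantiated trips the check.

```c
/* Hypothetical sketch of the failed precondition -- assumed field
 * names, not the actual lod_object.c source.  Declaring an attr_set
 * against a striped object only makes sense once the stripe array
 * has been populated; an object whose layout was never loaded trips
 * the check, which is the LBUG reported above. */
#include <assert.h>

struct lod_object {
	struct lod_object **ldo_stripe;	/* per-stripe sub-objects */
	int ldo_stripenr;		/* number of stripes */
};

static void declare_attr_set(struct lod_object *lo)
{
	assert(lo->ldo_stripe);	/* stand-in for LASSERT(lo->ldo_stripe) */
	/* ... declare the attribute update against each stripe ... */
}

int main(void)
{
	struct lod_object lo = { 0 };	/* stripe info never populated */
	declare_attr_set(&lo);		/* aborts here, like the LBUG */
	return 0;
}
```

The second sketches a bucket scan in the style of cfs_hash_bd_lookup_intent(). The key comparison itself cannot sleep, so a thread that keeps reporting its position on the cfs_hash_keycmp line is most plausibly spinning in the surrounding list walk, for instance because the bucket list has been corrupted into a cycle, rather than blocking on the compare itself.

```c
/* Toy bucket-scan loop in the style of cfs_hash_bd_lookup_intent() --
 * assumed structure, not the libcfs source. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct hnode {
	struct hnode *next;
	const char *key;
};

/* Stand-in for the cfs_hash_keycmp() callback: nonzero on match.
 * The compare itself cannot sleep or block. */
static int keycmp(const char *key, const struct hnode *hn)
{
	return strcmp(key, hn->key) == 0;
}

/* Walk the bucket's list, calling keycmp on each entry.  A thread
 * sampled repeatedly "on the keycmp line" is really spinning in this
 * loop; if the list looped back on itself, the walk would never
 * terminate while stack dumps kept landing on the compare. */
static struct hnode *bucket_lookup(struct hnode *head, const char *key)
{
	for (struct hnode *hn = head; hn != NULL; hn = hn->next)
		if (keycmp(key, hn))
			return hn;
	return NULL;
}

int main(void)
{
	struct hnode b = { NULL, "bar" }, a = { &b, "foo" };

	printf("found: %s\n", bucket_lookup(&a, "bar") ? "yes" : "no");
	return 0;
}
```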
| Comments |
| Comment by Patrick Farrell (Inactive) [ 19/Jun/14 ] |
MDS dump will be here in < 10 minutes:
| Comment by Patrick Farrell (Inactive) [ 19/Jun/14 ] |
There was also a client that was stuck waiting on a reply from MDS001/MDT000 before the MDS crashed (obviously there were many timeouts after the crash, but this was before that), and the times match roughly with those of the stuck MDS thread. The stuck thread is probably a separate issue from the LBUG, but I don't want to separate them until we're further along.

Here's the client-side picture: one of the client nodes got stuck before the crash, with a thread refusing to exit because it's stuck in Lustre (many other client threads were also stuck behind this one on the MDC rpc lock in mdc_close). The client is waiting for a ptlrpc reply, and I strongly suspect this corresponds to the stuck-thread messages on the MDS.

The first stuck-thread message on the MDS comes here:

Jun 18 23:16:36 galaxy-esf-mds001 kernel: INFO: task mdt01_020:26426 blocked for more than 120 seconds.

and is repeated up until the LBUG (always the same task). The stuck-thread message from the client is printed on task exit, so the thread had already been stuck for some time by then. The first stuck-thread message on the MDS (stuck for 600 seconds) comes 9 minutes or so after the client reports its stuck thread, so the time frames line up well. Without digging through the data structures in the MDS dump I can't be sure, but it seems likely the stuck thread on the MDS is the cause of the problem on the client.
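As a hypothetical sketch of the pile-up pattern described above (the names are assumptions, not the actual mdc code): if the client serializes metadata RPCs behind one per-target lock and the holder waits for the server's reply without releasing it, a single unanswered request backs up every other close.

```c
/* Hypothetical sketch, not the actual Lustre mdc code: one close RPC
 * holds the per-target lock while waiting for the reply, so every
 * later caller blocks in pthread_mutex_lock() -- the "many threads
 * stuck behind this one for the MDC rpc lock in mdc_close" symptom. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mdc_rpc_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t reply;	/* posted when the "server" finally answers */

static void *close_rpc(void *arg)
{
	pthread_mutex_lock(&mdc_rpc_lock);	/* queue behind peers */
	sem_wait(&reply);	/* reply wait, lock still held */
	pthread_mutex_unlock(&mdc_rpc_lock);
	printf("close %ld done\n", (long)arg);
	return NULL;
}

int main(void)
{
	pthread_t t[4];

	sem_init(&reply, 0, 0);
	for (long i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, close_rpc, (void *)i);

	sleep(1);	/* one thread holds the lock; three queue behind it */
	for (int i = 0; i < 4; i++)
		sem_post(&reply);	/* the server finally replies */

	for (long i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}
```

Build with cc -pthread; dropping the sem_post() loop reproduces the indefinite pile-up, which matches the client-side symptom described above.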
| Comment by Jodi Levi (Inactive) [ 20/Jun/14 ] |
Di,
| Comment by Di Wang [ 21/Jun/14 ] |
Jodi: Yes, since it is an LBUG it probably could be a blocker, or at least a critical one. But I think I know the reason; I will cook a patch soon.
| Comment by Di Wang [ 21/Jun/14 ] |
| Comment by Jodi Levi (Inactive) [ 26/Jun/14 ] |
Patch landed to master. Please reopen the ticket if more work is needed.