[LU-5314] Lustre 2.4.2 MDS hit LBUG and crash Created: 10/Jul/14 Updated: 23/Sep/22 Resolved: 19/Jul/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Haisong Cai (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | sdsc | ||
| Environment: |
Linux meerkat-mds-10-1.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux |
||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 14850 | ||||||||||||
| Description |
|
Our MDS hit LBUG and crashed this evening. Here are the /var/log/messages:

Jul 9 18:40:22 meerkat-mds-10-1 kernel: Lustre: meerkat-MDT0000: Client e27741dc-f76c-ea5a-c426-4c6b5e86a758 (at 198.202.118.120@tcp) reconnecting |
| Comments |
| Comment by John Fuchs-Chesney (Inactive) [ 10/Jul/14 ] |
|
Could you please take a look at this Niu? |
| Comment by Niu Yawei (Inactive) [ 10/Jul/14 ] |
|
mdt_getattr_name_lock() cleared the MSG_RESENT flag:

    CWARN("Although resent, but still not get child lock"
          "parent:"DFID" child:"DFID"\n",
          PFID(mdt_object_fid(parent)),
          PFID(mdt_object_fid(child)));
    lustre_msg_clear_flags(req->rq_reqmsg, MSG_RESENT);
    LDLM_LOCK_PUT(lock);
    GOTO(relock, 0);

That'll trigger the LASSERT on MSG_RESENT in mdt_intent_lock_replace():

    if (new_lock->l_export == req->rq_export) {
            /*
             * Already gave this to the client, which means that we
             * reconstructed a reply.
             */
            LASSERT(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT);
            lh->mlh_reg_lh.cookie = 0;
            RETURN(ELDLM_LOCK_REPLACED);
    }

This part of code has been heavily changed in |
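The interaction described in this comment can be modeled outside the kernel with a small sketch. This is not Lustre code: the `sketch_*` names, the struct layouts, and the flag value are hypothetical stand-ins chosen for illustration, and the assertion is modeled as a boolean check instead of a crash. It only shows the ordering problem: once the relock path clears the resent flag, a later check that expects the flag on a lock owned by the same export can no longer pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value, for illustration only (not taken from Lustre headers). */
#define SKETCH_MSG_RESENT 0x2u

/* Simplified stand-in for a request: message flags plus the owning export. */
struct sketch_req {
    uint32_t flags;     /* models the flags on req->rq_reqmsg */
    int      export_id; /* models req->rq_export */
};

/* Simplified stand-in for the lock returned by the intent handling. */
struct sketch_lock {
    int export_id;      /* models new_lock->l_export */
};

/* Models the relock path: the resent request found no child lock,
 * so the resent flag is cleared before retrying. */
static void sketch_getattr_relock(struct sketch_req *req)
{
    assert(req != NULL);
    req->flags &= ~SKETCH_MSG_RESENT;
}

/* Models the check in the lock-replace step. Returns false in the case
 * where the real LASSERT would fire: the lock already belongs to the
 * request's export, yet the request is no longer marked as resent. */
static bool sketch_lock_replace_check(const struct sketch_req *req,
                                      const struct sketch_lock *new_lock)
{
    assert(req != NULL && new_lock != NULL);
    if (new_lock->export_id == req->export_id)
        return (req->flags & SKETCH_MSG_RESENT) != 0;
    return true; /* different export: the assertion branch is not taken */
}
```

Calling `sketch_getattr_relock()` on a request and then `sketch_lock_replace_check()` with a lock from the same export returns false, which corresponds to the assertion failure reported here, under the assumptions stated above.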
| Comment by Haisong Cai (Inactive) [ 10/Jul/14 ] |
|
Hi Niu,

The same MDS has thrown FID errors like below and caused clients to hang. Do you think they are related?

Jul 4 20:51:33 meerkat-mds-10-1 kernel: LustreError: 19626:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000030d7:0x220:0x0] is used by two objects: 647048266/1095373013 647048267/1095373014 |
| Comment by Bruno Faccini (Inactive) [ 09/Oct/14 ] |
|
Hello Niu, |
| Comment by Peter Jones [ 19/Jul/17 ] |
|
This issue is fixed on newer releases and SDSC upgraded some time back |