[LU-9196] MDS server for Atlas file system crashed due to memory exhaustion. Created: 08/Mar/17 Updated: 05/Jun/18 Resolved: 05/Jun/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James A Simmons | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6.8 running an unpatched Lustre 2.8 server using ldiskfs. |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Our MDS server crashed due to memory exhaustion. Examination of the system logs showed nothing out of the ordinary, except that an OI scrub had started on the MDS server:
Lustre: atlas1-MDT0000-o: trigger OI scrub by RPC for the [0x20003f37b:0x3e5:0x0] with flags 0x4a, rc = 0
Some time after that we encountered the following crash, which is attached.
|
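For context, the OI scrub state that produced the "trigger OI scrub by RPC" message can be inspected directly on the MDS. A minimal sketch follows, assuming the MDT name atlas1-MDT0000 from the log line and the osd-ldiskfs parameter layout used by Lustre 2.8:
# Show OI scrub status, flags and statistics on the MDS (ldiskfs backend)
lctl get_param osd-ldiskfs.atlas1-MDT0000.oi_scrub
# Filter only the fields that matter for triage
lctl get_param -n osd-ldiskfs.atlas1-MDT0000.oi_scrub | grep -E 'status|flags|checked|failed'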
| Comments |
| Comment by James A Simmons [ 08/Mar/17 ] |
|
The OI scrubber was triggered at 02:06, 05:12, 08:26, 11:36, and 14:47 during the day. |
| Comment by Jian Yu [ 08/Mar/17 ] |
|
Hi Nasf, |
| Comment by nasf (Inactive) [ 09/Mar/17 ] |
|
According to the logs, there are several issues:
1) OI scrub was triggered because of OI inconsistency, including the following three FIDs: Would you please find out the file or object/inode corresponding to these FIDs via "lfs fid2path"? The FID [0x1000:0x15c5020:0x0] is an IGIF, which is somewhat abnormal; please dump that inode (#4096) via debugfs.
2) LBUG() during osd_object_release():
<0>[2339122.255892] LustreError: 16352:0:(osd_handler.c:1610:osd_object_release()) LBUG
<4>[2339122.264739] Pid: 16352, comm: mdt01_381
<4>[2339122.269446]
<4>[2339122.269446] Call Trace:
<4>[2339122.274687] [<ffffffffa05b4875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[2339122.282893] [<ffffffffa05b4e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[2339122.290225] [<ffffffffa0ea93e8>] osd_object_release+0x88/0x90 [osd_ldiskfs]
<4>[2339122.298765] [<ffffffffa074d6fd>] lu_object_put+0x16d/0x3b0 [obdclass]
<4>[2339122.306500] [<ffffffffa102abc7>] mdt_getattr_name_lock+0x5f7/0x1900 [mdt]
<4>[2339122.314609] [<ffffffffa102c3f2>] mdt_intent_getattr+0x292/0x470 [mdt]
<4>[2339122.322330] [<ffffffffa101d93e>] mdt_intent_policy+0x4be/0xc70 [mdt]
<4>[2339122.329981] [<ffffffffa091c0c7>] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
<4>[2339122.337912] [<ffffffffa0946307>] ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
<4>[2339122.346565] [<ffffffffa09b9a71>] ? tgt_lookup_reply+0x31/0x190 [ptlrpc]
<4>[2339122.354501] [<ffffffffa09cbbe1>] tgt_enqueue+0x61/0x230 [ptlrpc]
<4>[2339122.361753] [<ffffffffa09cc69c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
<4>[2339122.369877] [<ffffffffa09796f1>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
<4>[2339122.377324] [<ffffffffa09789d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
<4>[2339122.384747] [<ffffffff810a640e>] kthread+0x9e/0xc0
<4>[2339122.390625] [<ffffffff8100c28a>] child_rip+0xa/0x20
<4>[2339122.396587] [<ffffffff810a6370>] ? kthread+0x0/0xc0
<4>[2339122.402556] [<ffffffff8100c280>] ? child_rip+0x0/0x20
It seems that the inode nlink attribute is invalid (marked as zero but nobody destroyed it). We hit similar trouble in
3) A lot of mdt_getattr_internal() failures such as the following:
<3>[2230069.055437] LustreError: 16356:0:(mdt_handler.c:893:mdt_getattr_internal()) atlas1-MDT0000: getattr error for [0x2003863b4:0xc7fd:0x0]: rc = -2
This may be normal because of a racing unlink operation from another node. Let's ignore this failure for now.
4) Some directories are full:
<4>[401753.420890] LDISKFS-fs warning (device dm-5): ldiskfs_dx_add_entry: Directory (ino: 438307289) index full, reach max htree level :2
<4>[401753.434706] LDISKFS-fs warning (device dm-5): ldiskfs_dx_add_entry: Large directory feature is not enabled on this filesystem
This is because ldiskfs only supports a two-level htree-based directory index by default. If too many entries are inserted into a single directory, the index becomes exhausted even though there is still free space on disk.
5) Out of memory. |
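For anyone reproducing the triage steps in 1) and 4) above, a rough sketch follows. The client mountpoint /mnt/atlas1 is an assumption; /dev/dm-5 is assumed to be the MDT block device because the ldiskfs warnings reference it; and enabling large_dir requires an e2fsprogs/tune2fs build that supports that feature:
# 1) Map an inconsistent FID back to a pathname (run on a Lustre client;
#    /mnt/atlas1 is an assumed mountpoint)
lfs fid2path /mnt/atlas1 '[0x20003f37b:0x3e5:0x0]'
# 1) Dump the suspicious IGIF inode #4096 from the MDT backend, read-only
#    (/dev/dm-5 assumed from the LDISKFS warnings)
debugfs -c -R 'stat <4096>' /dev/dm-5
# 4) Check whether large_dir is enabled, and enable it with the MDT offline
#    if the installed e2fsprogs supports it
dumpe2fs -h /dev/dm-5 | grep -i 'features'
tune2fs -O large_dir /dev/dm-5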
| Comment by nasf (Inactive) [ 19/Apr/17 ] |
|
Any further feedback, logs, or reproduction steps? |
| Comment by James A Simmons [ 21/Apr/17 ] |
|
We haven't seen this problem since. As for the OI problems you saw, we are running lfsck to clean those up. Once lfsck is done, I will report whether everything is fixed. |
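For completeness, a sketch of how such an LFSCK run is typically started and monitored from the MDS; the target name atlas1-MDT0000 is taken from the logs above, and the parameter names assume Lustre 2.8 with an ldiskfs backend:
# Kick off a full LFSCK (OI scrub, layout and namespace phases) on the MDT
lctl lfsck_start -M atlas1-MDT0000 -t all
# Monitor progress of the namespace phase and the OI scrub
lctl get_param mdd.atlas1-MDT0000.lfsck_namespace
lctl get_param osd-ldiskfs.atlas1-MDT0000.oi_scrub
# Stop it early if needed
lctl lfsck_stop -M atlas1-MDT0000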
| Comment by nasf (Inactive) [ 07/Aug/17 ] |
|
Any update? Thanks! |
| Comment by Jian Yu [ 04/Dec/17 ] |
|
Hi James,
Is everything fixed after running lfsck? |
| Comment by nasf (Inactive) [ 05/Jun/18 ] |
|
The main issues should have been fixed via LFSCK. Please reopen it if you have more questions. |