[LU-1648] MDS Crash Created: 20/Jul/12 Updated: 22/Apr/14 Resolved: 21/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Fabio Verzelloni | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Servers ---> 2.2.51.0 |
||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 4502 | ||||||||
| Description |
|
Lustre hung this morning; we are running e2fsck on the MDT at the moment. Mounting the MDT as ldiskfs, we saw many large files named like 'oi.XX.XX'. What are these files? [root@weisshorn01 mdt]# ls |
| Comments |
| Comment by Liang Zhen (Inactive) [ 20/Jul/12 ] |
|
Please don't touch these files, so I'm correct on Again, we need Fan Yong to comment on this. I have added him to the CC list; I believe he has some way to fix this, which might require you to run a tool to rebuild these files. |
| Comment by Liang Zhen (Inactive) [ 20/Jul/12 ] |
|
Btw, I don't think the crash is related to these files. Could you please post the console output, or whatever information you have from the MDS about the crash, so we can investigate it? Thanks |
| Comment by Fabio Verzelloni [ 20/Jul/12 ] |
|
[root@weisshorn01 ~]# e2fsck -fp /dev/mapper/mds
scratch-MDT0000: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. |
| Comment by nasf (Inactive) [ 20/Jul/12 ] |
|
These oi.16.xx files map the global identifier (FID) to the local identifier (ino# & gen) for the ldiskfs-based backend filesystem. They are used on the server only and are invisible to clients. With the current design and implementation, the size and space of an OI file cannot be shrunk. I am making a patch in Anyway, the OI file size/space issue should not hang the system. Did you see any "-ENOSPC" error messages on the MDS when the system hung? |
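To make the FID-to-inode mapping concrete, here is a minimal toy model of one OI file. It is only a sketch: the class and field names (`Fid`, `InodeId`, `ToyObjectIndex`, `seq`/`oid`/`ver`) are hypothetical, the three-field FID layout is an assumption not stated in this ticket, and the real on-disk format is not modelled. The one behaviour it illustrates is the one described above: deleting an entry does not give space back, so the file never shrinks. Whether the real implementation reuses freed space is not stated here, and the toy does not model reuse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fid:
    """Global identifier (assumed layout, for illustration only)."""
    seq: int
    oid: int
    ver: int = 0


@dataclass(frozen=True)
class InodeId:
    """Local identifier on the ldiskfs backend: inode number and generation."""
    ino: int
    gen: int


class ToyObjectIndex:
    """Toy stand-in for one oi.16.xx file; NOT the real on-disk format."""

    def __init__(self):
        self._slots = []    # backing store: grows on insert, never truncated
        self._by_fid = {}   # fid -> index into self._slots

    def insert(self, fid, inode_id):
        if fid in self._by_fid:
            raise KeyError(f"{fid} is already indexed")
        self._by_fid[fid] = len(self._slots)
        self._slots.append((fid, inode_id))

    def lookup(self, fid):
        return self._slots[self._by_fid[fid]][1]

    def delete(self, fid):
        # Leave an empty slot behind: space is not returned to the filesystem.
        self._slots[self._by_fid.pop(fid)] = None

    @property
    def live_entries(self):
        return len(self._by_fid)

    @property
    def allocated_slots(self):
        return len(self._slots)
```

In this model, a filesystem that has created and then removed many objects ends up with `allocated_slots` far larger than `live_entries`, which is one way to picture the large oi.XX.XX files seen on the MDT.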
| Comment by Liang Zhen (Inactive) [ 20/Jul/12 ] |
|
many threads are stuck at "start_this_handle":
Jul 20 09:00:34 weisshorn02 kernel: Call Trace:
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff8127466d>] ? pointer+0xad/0xa60
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0142072>] start_this_handle+0x282/0x500 [jbd2]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff812731ee>] ? number+0x2ee/0x320
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa01424f0>] jbd2_journal_start+0xd0/0x110 [jbd2]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0af8b08>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa05d0d41>] fsfilt_ldiskfs_start+0x91/0x480 [fsfilt_ldiskfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa063fdaa>] llog_origin_handle_cancel+0x3ea/0xa20 [ptlrpc]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03c4903>] ? cfs_alloc+0x63/0x90 [libcfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa04d10df>] ? keys_fill+0x6f/0x1a0 [obdclass]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa060da87>] ldlm_cancel_handler+0x157/0x4a0 [ptlrpc]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa06363c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03c44ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03ceef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0630462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa06373cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
it looks like |
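When a console log contains many traces like the one above, it can help to tally where each thread is blocked. The following is a rough sketch, not a Lustre or kernel tool: the function names `reliable_frames` and `top_frame_histogram` are hypothetical, and the regular expression only assumes the `[<address>] symbol+0xoff/0xsize [module]` frame layout visible in this trace. Frames prefixed with `?` are skipped, since the kernel marks those as unreliable.

```python
import re
from collections import Counter

# One stack frame, e.g. "[<ffffffffa0142072>] start_this_handle+0x282/0x500 [jbd2]"
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(log_text):
    """Return (function, module) pairs for frames not marked with '?'."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        if match.group("unreliable"):
            continue
        frames.append((match.group("func"), match.group("module")))
    return frames


def top_frame_histogram(log_text):
    """Count, per 'Call Trace:' section, the innermost reliable function."""
    counts = Counter()
    for trace in log_text.split("Call Trace:")[1:]:
        frames = reliable_frames(trace)
        if frames:
            counts[frames[0][0]] += 1
    return counts
```

Run over the full MDS console log, `top_frame_histogram` would show how many of the dumped threads have `start_this_handle` as their innermost reliable frame. It works the same whether the frames are on separate lines or run together on one line.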
| Comment by nasf (Inactive) [ 25/Jul/12 ] |
|
Yes, it is The original patch of But it ignored the case of the journal handle restarting. In that case, the caller may block while holding "lgh_lock". I do not yet know how to resolve this deadlock. Jul 20 09:06:37 weisshorn02 kernel: Call Trace: |
| Comment by nasf (Inactive) [ 25/Jul/12 ] |
|
My current idea is to increase the journal credits reserved for llog cancel, to prevent a journal restart during the transaction. It may not be a perfect solution, but it should be workable. This is the patch: |
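The reasoning behind reserving more credits can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyHandle` and `run_cancel` are hypothetical names, one credit is assumed to cover one modified block, and the block counts are made-up numbers. It is not the jbd2 or Lustre API and not the actual patch. In the model, a handle that runs out of credits mid-operation must restart, and the restart is the point where a real thread could block while still holding a lock such as "lgh_lock".

```python
class ToyHandle:
    """Toy journal handle: each modified block consumes one reserved credit."""

    def __init__(self, reserved_credits):
        if reserved_credits < 1:
            raise ValueError("a handle needs at least one credit")
        self.reserved = reserved_credits
        self.remaining = reserved_credits
        self.restarts = 0

    def dirty_block(self):
        if self.remaining == 0:
            # Out of credits: the handle has to restart to obtain more.
            # In a real journal this is where the caller may have to wait.
            self.restarts += 1
            self.remaining = self.reserved
        self.remaining -= 1


def run_cancel(blocks_touched, reserved_credits):
    """Simulate one cancel operation; return how many restarts it needed."""
    handle = ToyHandle(reserved_credits)
    for _ in range(blocks_touched):
        handle.dirty_block()
    return handle.restarts
```

With these hypothetical numbers, `run_cancel(10, 4)` needs restarts, while `run_cancel(10, 10)` needs none. Reserving enough credits up front removes the restart, and with it the window in which the caller could block while holding the lock, at the cost of a larger reservation per transaction.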
| Comment by nasf (Inactive) [ 09/Aug/12 ] |
|
The patch has landed in Lustre-2.3. Fabio, would you have a chance to verify it on your system? Thanks! |
| Comment by Emoly Liu [ 05/Dec/12 ] |
|
Port for b2_1 is at http://review.whamcloud.com/4743 |