[LU-1421] Client LBUG in ll_file_write after filesystem expansion Created: 18/May/12 Updated: 15/May/14 Resolved: 14/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Marek Magrys | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | client | ||
| Environment: |
SL5 on clients and servers, mix of 2.1.0, 2.1.1 and 2.2.0 clients, 2.2.0 on all servers |
||
| Attachments: |
|
| Severity: | 4 |
| Epic: | client |
| Rank (Obsolete): | 4593 |
| Description |
|
After adding 24 OSTs to the file system, we get client LBUGs and crashes on Lustre 2.2.0. We expanded the file system by adding new resources, and the new OSTs were seen properly by the clients; however, we now get dozens of crashes every day. The trace looks like this:
May 18 15:18:36 <user.notice> n3-1-13.local Pid[]: 9127, comm: dtf3d_qdot.out
The problem is hard to reproduce even though we know which binaries caused it. For now it looks like the problem disappears after a client reboot; however, a subsequent crash might simply not have happened yet. We don't have a crashkernel dump yet. There is nothing suspicious in the server logs. |
| Comments |
| Comment by Brian Behlendorf [ 18/May/12 ] |
|
We also observed this issue last night in the latest Orion branch when we started up our stress testing on Grove. Lustre version: lustre-orion-2.2.49.57-45chaos.
LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) page@ffff880c749b5680[4 ffff880ffdf34448:1024 ^(null)_ffff880c749b55c0 1 0 1 ffff880e6a836928 (null) 0x0] |
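For triage, the header of a console line like the LustreError message quoted above can be split mechanically to recover the source file, line number, and function that raised it. The following is a minimal Python sketch, assuming the header has the shape seen in that single line (two numbers, then `(file:line:function())`, then the message). The helper name `parse_lustre_error` and the field names are hypothetical, and the meaning of the two leading numbers is not stated in this ticket, so they are kept as opaque fields.

```python
import re

# Pattern inferred only from the single line quoted in this ticket;
# other Lustre console messages may not match it.
HEADER = re.compile(
    r"^LustreError: (?P<n1>\d+):(?P<n2>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)


def parse_lustre_error(text):
    """Return a dict of header fields, or None if the line does not match.

    Hypothetical helper for illustration; not part of Lustre.
    """
    m = HEADER.match(text)
    if m is None:
        return None
    fields = m.groupdict()
    fields["line"] = int(fields["line"])  # source line number as an integer
    return fields


# The message exactly as quoted in the comment above.
SAMPLE = (
    "LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) "
    "page@ffff880c749b5680[4 ffff880ffdf34448:1024 ^(null)_ffff880c749b55c0 "
    "1 0 1 ffff880e6a836928 (null) 0x0]"
)

info = parse_lustre_error(SAMPLE)
```

Grouping crash reports by the `file`, `line`, and `func` fields is one way to check whether different clients are hitting the same assertion.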
| Comment by Jinshan Xiong (Inactive) [ 18/May/12 ] |
|
Hi Brian, what's the kernel you were running? EL5 too? |
| Comment by Jinshan Xiong (Inactive) [ 29/May/12 ] |
|
The page was in a strange state: the top page was in the owned state but the sub-page was in the cached state, and the vmpage was not locked at all. It would be helpful if I could get a backtrace of all tasks in the system when the crash happens. The exact kernel version is also needed in case the problem is kernel-related. |
| Comment by Marek Magrys [ 29/May/12 ] |
|
The kernel version is 2.6.18-308.4.1, the latest from SL5. As soon as we are able to reproduce the problem, I'll post the information. We have dedicated some hardware resources just to reproducing the problem with the codes that are possibly the cause of the LBUG, but we have had no luck yet. |
| Comment by Brian Behlendorf [ 30/May/12 ] |
|
In our case this was with 2.6.32-220.17.1.2chaos.ch5.x86_64 which is RHEL6.2 plus a few patches. |
| Comment by Prakash Surya (Inactive) [ 31/May/12 ] |
|
We seem to hit this when running a specific test on our Grove/Sequoia filesystem. We don't see it when running the same test on our 2.1.1-based test system. Perhaps it is a regression in 2.2? |
| Comment by Marek Magrys [ 31/May/12 ] |
|
Yes, as I mentioned, this bug doesn't affect 2.1.1, which is the version we are rolling back to. |
| Comment by Prakash Surya (Inactive) [ 31/May/12 ] |
|
Jinshan, here is a dump of the backtraces for all processes running on the node at the time of the ASSERT. Let me know if any other information would be helpful to get out of the crash dump. |
| Comment by Peter Jones [ 31/May/12 ] |
|
Jinshan, could you please look into this one? Thanks, Peter |
| Comment by Jinshan Xiong (Inactive) [ 01/Jun/12 ] |
|
This issue was introduced by the page writeback support in commit c5361360e51de22a59d4427327bddf9fd398f352. I'll cook up a patch soon. |
| Comment by Jinshan Xiong (Inactive) [ 05/Jun/12 ] |
|
A patch has been pushed to: http://review.whamcloud.com/3027 |
| Comment by Prakash Surya (Inactive) [ 05/Jun/12 ] |
|
Thanks, Jinshan! I've pulled it into our Orion branch. |
| Comment by Peter Jones [ 14/Jun/12 ] |
|
Landed for 2.3 |