[LU-7777] toss 3 client kernel panic in ll_get_dir_page() Created: 15/Feb/16 Updated: 23/Feb/16 Resolved: 23/Feb/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ruth Klundt (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
toss-release-3.0-36alpha.ch6 lustre 2.5.5-3chaos-CHANGED-3.10.0-327.0.0.1chaos.ch6.x86_64 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We've started some client testing of TOSS 3 alpha against servers running toss-release-2.4-2.ch5.4. This is more of a heads up, perhaps there are already newer builds of the client in action somewhere? If not and you cannot reproduce I'll start an Intel issue. We can repeatably cause a kernel panic by doing active I/O to one file system (cp -ar <something large>) while simultaneously running ls -l in the other mounted file system, or occasionally in the same file system from a different session or node. This is regardless of the backing fs of the file system. Here's a representative trace: |
| Comments |
| Comment by Ruth Klundt (Inactive) [ 15/Feb/16 ] |
|
ah so I created this in the wrong jira, but I was likely about to report this at Intel as well. Sorry for the confusion. |
| Comment by Peter Jones [ 16/Feb/16 ] |
|
Bobijam, could you please advise on this issue? Thanks, Peter |
| Comment by Ruth Klundt (Inactive) [ 17/Feb/16 ] |
|
Update and clarification: the panic is also repeatable when one process is doing rm -rf on some part of the Lustre tree and another is doing a recursive ls -l. Also, I misspoke in the description: the processes are not on different nodes but on the same node. The processes can be on the same file system or on two different ones. Thanks for taking a look. |
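The workload described so far (a tree being copied or removed while another process on the same node runs a recursive ls -l) could be scripted roughly as follows. This is only a sketch under assumptions: the function names, the directory arguments, and the round count are hypothetical and do not come from the reporter, and pairing cp -ar with rm -rf in one loop is a simplification of the two separately reported triggers. It should only be run on a disposable test client, since the race is reported to panic the kernel.

```shell
#!/usr/bin/env bash
# Hypothetical reproducer sketch; all names and arguments are placeholders.

# churn_tree SRC DST ROUNDS
# Repeatedly copy SRC into DST and remove the copy again, so that directory
# contents under DST keep changing while another process lists them.
churn_tree() {
    local src=$1 dst=$2 rounds=$3 i
    for ((i = 0; i < rounds; i++)); do
        cp -ar "$src" "$dst/churn.$i"
        rm -rf "$dst/churn.$i"
    done
}

# list_tree DIR ROUNDS
# Run a recursive long listing of DIR, discarding the output. Errors are
# ignored because entries may vanish while the listing is in progress.
list_tree() {
    local dir=$1 rounds=$2 i
    for ((i = 0; i < rounds; i++)); do
        ls -lR "$dir" > /dev/null 2>&1 || true
    done
}

# race_dir_io SRC DST LISTDIR ROUNDS
# Run both workloads concurrently on the same node and wait for the writer.
race_dir_io() {
    local src=$1 dst=$2 listdir=$3 rounds=$4
    churn_tree "$src" "$dst" "$rounds" &
    local writer=$!
    list_tree "$listdir" "$rounds"
    wait "$writer"
}
```

For example, race_dir_io /some/large/tree /mnt/fs1/scratch /mnt/fs2 100 would exercise the two-file-system case, where the mount points shown are placeholders for the real Lustre mounts.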
| Comment by Zhenyu Xu [ 18/Feb/16 ] |
|
Can we find out which line of code ll_get_dir_page+0x3e5 points at? |
| Comment by Ruth Klundt (Inactive) [ 18/Feb/16 ] |
|
I haven't got the source for this particular build, I've asked for it and will let you know. I believe it is built from the latest supported Intel 2.5.5 tag, but there may be other patches applied. |
| Comment by Peter Jones [ 18/Feb/16 ] |
|
Ruth, we have access to the source, but Bobijam is hoping that you can run a gdb command to map the address referenced in the crash to the affected line of code. Are you familiar with this process? Peter |
| Comment by Ruth Klundt (Inactive) [ 18/Feb/16 ] |
|
I was missing the debuginfo rpm; I have that now. It's a bit busy here today. I'm familiar with disassembly in various ways and was going to use objdump, but I'll do whatever is needed, probably tomorrow. |
| Comment by Oleg Drokin [ 19/Feb/16 ] |
|
objdump is a bit overkill. Just use gdb on the lustre.ko (with debug symbols still in, out of the debuginfo rpm) and issue a "l *(ll_get_dir_page+0x3e5)" command. |
| Comment by Ruth Klundt (Inactive) [ 19/Feb/16 ] |
|
here you go, let me know if more is needed. (gdb) l *(ll_get_dir_page+0x3e5) 321 |
| Comment by Oleg Drokin [ 19/Feb/16 ] |
|
Hm, that's a bit less useful than we hoped, I guess. Can you keep subtracting 4 from the 0x3e5 offset until the output of the l command in gdb shows us something inside ll_get_dir_page, please? |
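The offset stepping requested here can be generated rather than typed by hand. The helper below is a hypothetical sketch (the function name and its arguments are not from the ticket): it only prints one gdb list command per candidate offset, stepping back 4 bytes each time, and leaves running them to the user.

```shell
#!/usr/bin/env bash
# print_gdb_probes SYMBOL START_OFFSET COUNT
# Hypothetical helper: print COUNT gdb "l" commands for SYMBOL, starting at
# START_OFFSET (decimal or 0x-prefixed hex) and subtracting 4 each step.
print_gdb_probes() {
    local sym=$1 off=$(( $2 )) count=$3 i
    for ((i = 0; i < count; i++)); do
        printf 'l *(%s+0x%x)\n' "$sym" "$off"
        off=$(( off - 4 ))
        # Stop before the offset would go negative.
        if (( off < 0 )); then
            break
        fi
    done
}

# Possible usage (the module path is a placeholder for the lustre.ko taken
# from the debuginfo rpm):
#   print_gdb_probes ll_get_dir_page 0x3e5 8 > probes.gdb
#   gdb -batch -x probes.gdb /path/to/lustre.ko
```

Writing the commands to a file and passing it with -x keeps the gdb session non-interactive, so the listings for all offsets can be pasted into the ticket in one go.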
| Comment by Ruth Klundt (Inactive) [ 19/Feb/16 ] |
|
sure, this looks a bit better: |
| Comment by Oleg Drokin [ 19/Feb/16 ] |
|
OK, that clears it up. Your bug is a duplicate of |
| Comment by Ruth Klundt (Inactive) [ 19/Feb/16 ] |
|
thanks, will do fyi, the patch in |
| Comment by Christopher Morrone [ 22/Feb/16 ] |
|
Intel, you will be providing a patch for b2_5_fe, I assume? |
| Comment by Peter Jones [ 22/Feb/16 ] |
|
Yup. One is in flight already and will be flagged when it is ready for you to pick up. |
| Comment by Ruth Klundt (Inactive) [ 22/Feb/16 ] |
|
I'm not able to trigger the oops with the following change built in:
diff --git a/lustre/llite/dir.c b/lustre/llite/dir.c
index 4f1b853..cf69cc4 100644
--- a/lustre/llite/dir.c
+++ b/lustre/llite/dir.c
@@ -276,7 +276,7 @@ static struct page *ll_dir_page_locate(struct inode *dir, __u64 *hash,
         spin_lock_irq(&mapping->tree_lock);
         found = radix_tree_gang_lookup(&mapping->page_tree,
                                        (void **)&page, offset, 1);
-        if (found > 0) {
+        if (found > 0 && !radix_tree_exceptional_entry(page)) {
                 struct lu_dirpage *dp;
                 page_cache_get(page);
|
| Comment by Peter Jones [ 23/Feb/16 ] |
|
2.5.x FE fix for |