Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.2, Lustre 2.5.3
-
2
-
14213
Description
After upgrading our servers from Lustre 2.4.0-28chaos to Lustre 2.4.2-11chaos (see github.com/chaos/lustre), we are seeing many client crashes with a NULL pointer dereference in posix_acl_valid() under mdc_get_lustre_md(). Note that both 2.4.0-19chaos client nodes and 2.4.2-11chaos client nodes are exhibiting this behavior.
The backtrace looks like:
PID: 3690 TASK: ffff880338d69540 CPU: 7 COMMAND: "ll_sa_3689" #0 [ffff8802ddf51800] machine_kexec+0x18b at ffffffff810391ab #1 [ffff8802ddf51860] crash_kexec+0x72 at ffffffff810c5d52 #2 [ffff8802ddf51930] oops_end+0xc0 at ffffffff8152e630 #3 [ffff8802ddf51960] no_context+0xfb at ffffffff8104a00b #4 [ffff8802ddf519b0] __bad_area_nosemaphore+0x125 at ffffffff8104a295 #5 [ffff8802ddf51a00] bad_area_nosemaphore+0x13 at ffffffff8104a363 #6 [ffff8802ddf51a10] __do_page_fault+0x32f at ffffffff8104aacf #7 [ffff8802ddf51b30] do_page_fault+0x3e at ffffffff8153057e #8 [ffff8802ddf51b60] page_fault+0x25 at ffffffff8152d935 [exception RIP: posix_acl_valid+9] RIP: ffffffff811ea9b9 RSP: ffff8802ddf51c10 RFLAGS: 00010207 RAX: 0000000000000000 RBX: ffff8805ec607000 RCX: ffff8805ebddda00 RDX: 0000000000000004 RSI: 0000000000000004 RDI: 0000000000000000 RBP: ffff8802ddf51c10 R8: 0000000000000000 R9: ffff8805ec46dc40 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88063987bc00 R13: ffff8802ddf51cf0 R14: 0000000000000000 R15: 0000000000000050 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff8802ddf51c18] mdc_get_lustre_md+0xc5a at ffffffffa0af4faa [mdc] #10 [ffff8802ddf51c78] lmv_get_lustre_md+0x153 at ffffffffa0d668d3 [lmv] #11 [ffff8802ddf51cc8] ll_prep_inode+0x3f7 at ffffffffa0c7e217 [lustre] #12 [ffff8802ddf51da8] ll_post_statahead+0x2f7 at ffffffffa0ca0577 [lustre] #13 [ffff8802ddf51e18] ll_statahead_thread+0xd38 at ffffffffa0ca4ff8 [lustre] #14 [ffff8802ddf51f48] child_rip+0xa at ffffffff8100c24a
The crash is on this line in posix_acl_valid():
crash> gdb list *(posix_acl_valid+9) 0xffffffff811ea9b9 is in posix_acl_valid (fs/posix_acl.c:88). 83 const struct posix_acl_entry *pa, *pe; 84 int state = ACL_USER_OBJ; 85 unsigned int id = 0; /* keep gcc happy */ 86 int needs_mask = 0; 87 88 FOREACH_ACL_ENTRY(pa, acl, pe) { 89 if (pa->e_perm & ~(ACL_READ|ACL_WRITE|ACL_EXECUTE)) 90 return -EINVAL; 91 switch (pa->e_tag) { 92 case ACL_USER_OBJ:
The problem is not particular to the statahead thread. That was just one example. Here is another where I ran getfattr on the problem file:
2014-06-05 14:51:29 BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 2014-06-05 14:51:29 IP: [<ffffffff811ea9b9>] posix_acl_valid+0x9/0x120 2014-06-05 14:51:29 PGD 638f93067 PUD 5f1873067 PMD 0 2014-06-05 14:51:29 Oops: 0000 [#1] SMP 2014-06-05 14:51:29 last sysfs file: /sys/devices/system/edac/pci/pci_parity_count 2014-06-05 14:51:29 CPU 10 2014-06-05 14:51:29 Modules linked in: lmv(U) mgc(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate lustre(U) lov(U) osc(U) mdc(U 2014-06-05 14:51:29 2014-06-05 14:51:29 Pid: 6114, comm: getfattr Tainted: P --------------- 2.6.32-431.17.2.1chaos.ch5.2.x86_64 #1 Dell XS23-TY35 /0GW08P 2014-06-05 14:51:29 RIP: 0010:[<ffffffff811ea9b9>] [<ffffffff811ea9b9>] posix_acl_valid+0x9/0x120 2014-06-05 14:51:29 RSP: 0018:ffff8805f0e698d8 EFLAGS: 00010207 2014-06-05 14:51:29 RAX: 0000000000000000 RBX: ffff880639212000 RCX: ffff8805f1870a00 2014-06-05 14:51:29 RDX: 0000000000000004 RSI: 0000000000000004 RDI: 0000000000000000 2014-06-05 14:51:29 RBP: ffff8805f0e698d8 R08: 0000000000000000 R09: 0000000000000040 2014-06-05 14:51:29 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8805f3b22c00 2014-06-05 14:51:29 R13: ffff8805f0e699b8 R14: 0000000000000000 R15: 0000000000000050 2014-06-05 14:51:29 FS: 00002aaaab266fa0(0000) GS:ffff88034ac80000(0000) knlGS:0000000000000000 2014-06-05 14:51:29 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2014-06-05 14:51:29 CR2: 0000000000000004 CR3: 0000000638355000 CR4: 00000000000007e0 2014-06-05 14:51:29 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2014-06-05 14:51:29 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2014-06-05 14:51:29 Process getfattr (pid: 6114, threadinfo ffff8805f0e68000, task ffff88063a007500) 2014-06-05 14:51:29 Stack: 2014-06-05 14:51:29 ffff8805f0e69938 ffffffffa091dfaa ffff8805f1870a00 ffff8806389acc00 2014-06-05 14:51:29 <d> 0000000000000000 0000000000000000 6c0db3630c468718 ffff8806389acc00 2014-06-05 14:51:29 <d> ffff880639212000 ffff880639212400 ffff8805f0e699b8 ffff8805f3b22c00 2014-06-05 14:51:29 Call Trace: 2014-06-05 14:51:29 [<ffffffffa091dfaa>] mdc_get_lustre_md+0xc5a/0x1420 [mdc] 2014-06-05 14:51:29 [<ffffffffa0d6c8d3>] lmv_get_lustre_md+0x153/0x3d0 [lmv] 2014-06-05 14:51:29 [<ffffffffa0a99217>] ll_prep_inode+0x3f7/0xf60 [lustre] 2014-06-05 14:51:29 [<ffffffffa079faa8>] ? req_capsule_server_get+0x18/0x20 [ptlrpc] 2014-06-05 14:51:29 [<ffffffffa0d85e4a>] ? lmv_intent_lookup+0x25a/0x770 [lmv] 2014-06-05 14:51:29 [<ffffffffa0aa53b0>] ? ll_md_blocking_ast+0x0/0x740 [lustre] 2014-06-05 14:51:29 [<ffffffffa0aa89aa>] ll_lookup_it_finish+0x1da/0xe80 [lustre] 2014-06-05 14:51:29 [<ffffffffa0d86fca>] ? lmv_intent_lock+0x32a/0x380 [lmv] 2014-06-05 14:51:29 [<ffffffffa0aa53b0>] ? ll_md_blocking_ast+0x0/0x740 [lustre] 2014-06-05 14:51:29 [<ffffffffa0aa9a3d>] ll_lookup_it+0x3ed/0xbd0 [lustre] 2014-06-05 14:51:29 [<ffffffffa0aa53b0>] ? ll_md_blocking_ast+0x0/0x740 [lustre] 2014-06-05 14:51:29 [<ffffffffa0aaa2ac>] ll_lookup_nd+0x8c/0x430 [lustre] 2014-06-05 14:51:29 [<ffffffff811a457e>] ? d_alloc+0x13e/0x1b0 2014-06-05 14:51:29 [<ffffffff811998a5>] do_lookup+0x1a5/0x230 2014-06-05 14:51:29 [<ffffffff81199fb7>] __link_path_walk+0x587/0x850 2014-06-05 14:51:29 [<ffffffff811680ea>] ? alloc_pages_current+0xaa/0x110 2014-06-05 14:51:29 [<ffffffff8119a97a>] path_walk+0x6a/0xe0 2014-06-05 14:51:29 [<ffffffff8119ab8b>] filename_lookup+0x6b/0xc0 2014-06-05 14:51:29 [<ffffffff8119bcb7>] user_path_at+0x57/0xa0 2014-06-05 14:51:29 [<ffffffff8104a9a4>] ? __do_page_fault+0x204/0x490 2014-06-05 14:51:29 [<ffffffff8128ae05>] ? rb_insert_color+0x125/0x160 2014-06-05 14:51:29 [<ffffffff8114f020>] ? __vma_link_rb+0x30/0x40 2014-06-05 14:51:29 [<ffffffff8118f7a0>] vfs_fstatat+0x50/0xa0 2014-06-05 14:51:29 [<ffffffff8118f85e>] vfs_lstat+0x1e/0x20 2014-06-05 14:51:29 [<ffffffff8118f884>] sys_newlstat+0x24/0x50 2014-06-05 14:51:29 [<ffffffff8153057e>] ? do_page_fault+0x3e/0xa0 2014-06-05 14:51:29 [<ffffffff8152d935>] ? page_fault+0x25/0x30 2014-06-05 14:51:29 [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b