[LU-5150] NULL pointer dereference in posix_acl_valid() under mdc_get_lustre_md() - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.4
Affects Version/s: Lustre 2.4.2, Lustre 2.5.3
Labels:
- llnl
- mn4
- prz

Severity:
2
Rank (Obsolete):
14213

Description

After upgrading our servers from Lustre 2.4.0-28chaos to Lustre 2.4.2-11chaos (see github.com/chaos/lustre), we are seeing many client crashes with a NULL pointer dereference in posix_acl_valid() under mdc_get_lustre_md(). Note that both 2.4.0-19chaos client nodes and 2.4.2-11chaos client nodes are exhibiting this behavior.

The backtrace looks like:

PID: 3690   TASK: ffff880338d69540  CPU: 7   COMMAND: "ll_sa_3689"
 #0 [ffff8802ddf51800] machine_kexec+0x18b at ffffffff810391ab
 #1 [ffff8802ddf51860] crash_kexec+0x72 at ffffffff810c5d52
 #2 [ffff8802ddf51930] oops_end+0xc0 at ffffffff8152e630
 #3 [ffff8802ddf51960] no_context+0xfb at ffffffff8104a00b
 #4 [ffff8802ddf519b0] __bad_area_nosemaphore+0x125 at ffffffff8104a295
 #5 [ffff8802ddf51a00] bad_area_nosemaphore+0x13 at ffffffff8104a363
 #6 [ffff8802ddf51a10] __do_page_fault+0x32f at ffffffff8104aacf
 #7 [ffff8802ddf51b30] do_page_fault+0x3e at ffffffff8153057e
 #8 [ffff8802ddf51b60] page_fault+0x25 at ffffffff8152d935
    [exception RIP: posix_acl_valid+9]
    RIP: ffffffff811ea9b9  RSP: ffff8802ddf51c10  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff8805ec607000  RCX: ffff8805ebddda00
    RDX: 0000000000000004  RSI: 0000000000000004  RDI: 0000000000000000
    RBP: ffff8802ddf51c10   R8: 0000000000000000   R9: ffff8805ec46dc40
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88063987bc00
    R13: ffff8802ddf51cf0  R14: 0000000000000000  R15: 0000000000000050
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8802ddf51c18] mdc_get_lustre_md+0xc5a at ffffffffa0af4faa [mdc]
#10 [ffff8802ddf51c78] lmv_get_lustre_md+0x153 at ffffffffa0d668d3 [lmv]
#11 [ffff8802ddf51cc8] ll_prep_inode+0x3f7 at ffffffffa0c7e217 [lustre]
#12 [ffff8802ddf51da8] ll_post_statahead+0x2f7 at ffffffffa0ca0577 [lustre]
#13 [ffff8802ddf51e18] ll_statahead_thread+0xd38 at ffffffffa0ca4ff8 [lustre]
#14 [ffff8802ddf51f48] child_rip+0xa at ffffffff8100c24a

The crash is on this line in posix_acl_valid():

crash> gdb list *(posix_acl_valid+9)
0xffffffff811ea9b9 is in posix_acl_valid (fs/posix_acl.c:88).
83              const struct posix_acl_entry *pa, *pe;
84              int state = ACL_USER_OBJ;
85              unsigned int id = 0;  /* keep gcc happy */
86              int needs_mask = 0;
87     
88              FOREACH_ACL_ENTRY(pa, acl, pe) {
89                      if (pa->e_perm & ~(ACL_READ|ACL_WRITE|ACL_EXECUTE))
90                              return -EINVAL;
91                      switch (pa->e_tag) {
92                              case ACL_USER_OBJ:

The problem is not particular to the statahead thread. That was just one example. Here is another where I ran getfattr on the problem file:

2014-06-05 14:51:29 BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
2014-06-05 14:51:29 IP: [<ffffffff811ea9b9>] posix_acl_valid+0x9/0x120
2014-06-05 14:51:29 PGD 638f93067 PUD 5f1873067 PMD 0 
2014-06-05 14:51:29 Oops: 0000 [#1] SMP 
2014-06-05 14:51:29 last sysfs file: /sys/devices/system/edac/pci/pci_parity_count
2014-06-05 14:51:29 CPU 10 
2014-06-05 14:51:29 Modules linked in: lmv(U) mgc(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate lustre(U) lov(U) osc(U) mdc(U
2014-06-05 14:51:29 
2014-06-05 14:51:29 Pid: 6114, comm: getfattr Tainted: P           ---------------    2.6.32-431.17.2.1chaos.ch5.2.x86_64 #1 Dell     XS23-TY35   /0GW08P
2014-06-05 14:51:29 RIP: 0010:[<ffffffff811ea9b9>]  [<ffffffff811ea9b9>] posix_acl_valid+0x9/0x120
2014-06-05 14:51:29 RSP: 0018:ffff8805f0e698d8  EFLAGS: 00010207
2014-06-05 14:51:29 RAX: 0000000000000000 RBX: ffff880639212000 RCX: ffff8805f1870a00
2014-06-05 14:51:29 RDX: 0000000000000004 RSI: 0000000000000004 RDI: 0000000000000000
2014-06-05 14:51:29 RBP: ffff8805f0e698d8 R08: 0000000000000000 R09: 0000000000000040
2014-06-05 14:51:29 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8805f3b22c00
2014-06-05 14:51:29 R13: ffff8805f0e699b8 R14: 0000000000000000 R15: 0000000000000050
2014-06-05 14:51:29 FS:  00002aaaab266fa0(0000) GS:ffff88034ac80000(0000) knlGS:0000000000000000
2014-06-05 14:51:29 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2014-06-05 14:51:29 CR2: 0000000000000004 CR3: 0000000638355000 CR4: 00000000000007e0
2014-06-05 14:51:29 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2014-06-05 14:51:29 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2014-06-05 14:51:29 Process getfattr (pid: 6114, threadinfo ffff8805f0e68000, task ffff88063a007500)
2014-06-05 14:51:29 Stack:
2014-06-05 14:51:29  ffff8805f0e69938 ffffffffa091dfaa ffff8805f1870a00 ffff8806389acc00
2014-06-05 14:51:29 <d> 0000000000000000 0000000000000000 6c0db3630c468718 ffff8806389acc00
2014-06-05 14:51:29 <d> ffff880639212000 ffff880639212400 ffff8805f0e699b8 ffff8805f3b22c00
2014-06-05 14:51:29 Call Trace:
2014-06-05 14:51:29  [<ffffffffa091dfaa>] mdc_get_lustre_md+0xc5a/0x1420 [mdc]
2014-06-05 14:51:29  [<ffffffffa0d6c8d3>] lmv_get_lustre_md+0x153/0x3d0 [lmv]
2014-06-05 14:51:29  [<ffffffffa0a99217>] ll_prep_inode+0x3f7/0xf60 [lustre]
2014-06-05 14:51:29  [<ffffffffa079faa8>] ? req_capsule_server_get+0x18/0x20 [ptlrpc]
2014-06-05 14:51:29  [<ffffffffa0d85e4a>] ? lmv_intent_lookup+0x25a/0x770 [lmv]
2014-06-05 14:51:29  [<ffffffffa0aa53b0>] ? ll_md_blocking_ast+0x0/0x740 [lustre]
2014-06-05 14:51:29  [<ffffffffa0aa89aa>] ll_lookup_it_finish+0x1da/0xe80 [lustre]
2014-06-05 14:51:29  [<ffffffffa0d86fca>] ? lmv_intent_lock+0x32a/0x380 [lmv]
2014-06-05 14:51:29  [<ffffffffa0aa53b0>] ? ll_md_blocking_ast+0x0/0x740 [lustre]
2014-06-05 14:51:29  [<ffffffffa0aa9a3d>] ll_lookup_it+0x3ed/0xbd0 [lustre]
2014-06-05 14:51:29  [<ffffffffa0aa53b0>] ? ll_md_blocking_ast+0x0/0x740 [lustre]
2014-06-05 14:51:29  [<ffffffffa0aaa2ac>] ll_lookup_nd+0x8c/0x430 [lustre]
2014-06-05 14:51:29  [<ffffffff811a457e>] ? d_alloc+0x13e/0x1b0
2014-06-05 14:51:29  [<ffffffff811998a5>] do_lookup+0x1a5/0x230
2014-06-05 14:51:29  [<ffffffff81199fb7>] __link_path_walk+0x587/0x850
2014-06-05 14:51:29  [<ffffffff811680ea>] ? alloc_pages_current+0xaa/0x110
2014-06-05 14:51:29  [<ffffffff8119a97a>] path_walk+0x6a/0xe0
2014-06-05 14:51:29  [<ffffffff8119ab8b>] filename_lookup+0x6b/0xc0
2014-06-05 14:51:29  [<ffffffff8119bcb7>] user_path_at+0x57/0xa0
2014-06-05 14:51:29  [<ffffffff8104a9a4>] ? __do_page_fault+0x204/0x490
2014-06-05 14:51:29  [<ffffffff8128ae05>] ? rb_insert_color+0x125/0x160
2014-06-05 14:51:29  [<ffffffff8114f020>] ? __vma_link_rb+0x30/0x40
2014-06-05 14:51:29  [<ffffffff8118f7a0>] vfs_fstatat+0x50/0xa0
2014-06-05 14:51:29  [<ffffffff8118f85e>] vfs_lstat+0x1e/0x20
2014-06-05 14:51:29  [<ffffffff8118f884>] sys_newlstat+0x24/0x50
2014-06-05 14:51:29  [<ffffffff8153057e>] ? do_page_fault+0x3e/0xa0
2014-06-05 14:51:29  [<ffffffff8152d935>] ? page_fault+0x25/0x30
2014-06-05 14:51:29  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

vmcore-dmesg.txt
69 kB
09/Jun/14 8:43 AM

Issue Links

is related to

LU-4787 OOPS from null pointer dereference in mdc_unpack_acl

Resolved

is related to

LU-4680 Invalid system.posix_acl_default breaks umask on ZFS MDT

Resolved

mentioned in: Page Loading...

NULL pointer dereference in posix_acl_valid() under mdc_get_lustre_md()

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates