[LU-3487] Oops in vvp_io_fault_iter_init() Created: 20/Jun/13  Updated: 29/Sep/14  Resolved: 28/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.4.3
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Blocker
Reporter: John Hammond Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: llite, sdsc

Severity: 3
Rank (Obsolete): 8769

 Description   

Running racer.sh 2.4.50-79-gaed8203 with some local patches (LU-3072, LU-3348, LU-3233, LU-3448) I can reproduce the following oops in vvp_io_fault_iter_init():

00000100:00100000:0.0:1371755423.271136:0:30780:0:(client.c:1805:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc 19:1868864a-63ce-938d-b909-2dada17703ae:30780:1438389041886196:0@lo:49
00000080:00020000:0.0:1371755423.271146:0:30780:0:(vvp_io.c:1241:vvp_io_init()) lustre: refresh file layout [0x2c0000401:0x3467:0x0] error -13.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
IP: [<ffffffffa0ce2bbc>] vvp_io_fault_iter_init+0x4c/0xc0 [lustre]
...
Pid: 30780, comm: 19 Tainted: P           ---------------    2.6.32-279.19.1.el6_lustre_gcov.x86_64 #1 Bochs Bochs
RIP: 0010:[<ffffffffa0ce2bbc>]  [<ffffffffa0ce2bbc>] vvp_io_fault_iter_init+0x4c/0xc0 [lustre]
RSP: 0018:ffff88014196daf8  EFLAGS: 00010292
RAX: 0000000000000000 RBX: ffff88016554b870 RCX: 0000000000000000
RDX: ffff880160625400 RSI: ffffffffa0d0f0a0 RDI: ffff8801647e4ca0
RBP: ffff88014196db18 R08: 0000000000000000 R09: ffff880164fe68c8
R10: 0000000000000003 R11: 0000000000000000 R12: ffff8801647e4c68
R13: ffff880147805738 R14: ffff88016554c610 R15: ffff88014196dbd8
FS:  00007f6d52008700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001eac45c CR3: 0000000141551000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process 19 (pid: 30780, threadinfo ffff88014196c000, task ffff88016547e040)
Stack:
 ffff88016554b870 ffff880164fe68c8 ffff8801647e4c68 ffff88014196dbc8
<d> ffff88014196db48 ffffffffa04a529d ffff88016547e040 ffff880164fe68c8
<d> ffff8801647e4c68 ffff88016356b148 ffff88014196db78 ffffffffa04a9cec
Call Trace:
 [<ffffffffa04a529d>] cl_io_iter_init+0x5d/0x110 [obdclass]
 [<ffffffffa04a9cec>] cl_io_loop+0x4c/0x1b0 [obdclass]
 [<ffffffffa0cc6552>] ll_fault+0x2c2/0x4d0 [lustre]
 [<ffffffff8113bd54>] __do_fault+0x54/0x510
 [<ffffffff81128750>] ? __lru_cache_add+0x40/0x90
 [<ffffffff8113c307>] handle_pte_fault+0xf7/0xb50
 [<ffffffff81278cec>] ? __bitmap_weight+0x8c/0xb0
 [<ffffffff8116ba07>] ? mem_cgroup_update_file_mapped+0x17/0x90
 [<ffffffff8114536a>] ? page_remove_rmap+0x7a/0xa0
 [<ffffffff8113cf9a>] handle_mm_fault+0x23a/0x310
 [<ffffffff810432d9>] __do_page_fault+0x139/0x480
 [<ffffffff81196b40>] ? mntput_no_expire+0x30/0x110
 [<ffffffff811793e1>] ? __fput+0x1a1/0x210
 [<ffffffff8113fcee>] ? remove_vma+0x6e/0x90
 [<ffffffff814f0f5e>] do_page_fault+0x3e/0xa0
 [<ffffffff814ee315>] page_fault+0x25/0x30
Code: 89 f3 49 89 fc e8 65 ff ff ff 48 8b 7b 08 49 89 c6 e8 69 71 ff ff 48 89 de 4c 89 e7 49 89 c5 e8 db 96 ff ff 48 8b 80 a8 00 00 00 <48> 8b 80 b0 00 00 00 48 8b 40 18 4c 3b 68 10 75 22 49 8b 85 80 
RIP  [<ffffffffa0ce2bbc>] vvp_io_fault_iter_init+0x4c/0xc0 [lustre]
 RSP <ffff88014196daf8>
CR2: 00000000000000b0

The oops is in the assertion

LASSERT(inode == cl2ccc_io(env, ios)->cui_fd->fd_file->f_dentry->d_inode);

because the ccc_io has a NULL cui_fd member:

crash> p *((struct cl_io *)0xffff880164fe68c8)
$3 = {
  ci_type = CIT_FAULT,
  ci_state = CIS_ZERO,
  ci_obj = 0xffff880142b48148,
  ci_parent = 0x0,
  ci_layers = {
    next = 0xffff88016554b888,
    prev = 0xffff88016554b888
  },
  ...
  },
  ci_lockreq = CILR_MANDATORY,
  u = {
    ...
    ci_fault = {
      ft_index = 3,
      ft_nob = 0,
      ft_writable = 0,
      ft_executable = 4,
      ft_mkwrite = 0,
      ft_page = 0x0
    },
    ...
  },
  ...
  ci_nob = 0,
  ci_result = 0,
  ci_continue = 0,
  ci_no_srvlock = 0,
  ci_need_restart = 0,
  ci_ignore_layout = 0,
  ci_verify_layout = 0,
  ci_owned_nr = 0
}

crash> p *((struct ccc_io *)0xffff88016554b870)
$9 = {
  cui_cl = {
    cis_io = 0xffff880164fe68c8,
    cis_obj = 0xffff880142b48148,
    cis_iop = 0xffffffffa0ce8200,
    cis_linkage = {
      next = 0xffff880164fe68e0,
      prev = 0xffff880164fe68e0
    }
  },
  cui_link = {
    cill_linkage = {
      next = 0x0,
      prev = 0x0
    },
    cill_descr = {
      cld_obj = 0x0,
      cld_start = 0,
      cld_end = 0,
      cld_gid = 0,
      cld_mode = CLM_PHANTOM,
      cld_enq_flags = 0
    },
    cill_lock = 0x0,
    cill_fini = 0
  },
  cui_iov = 0x0,
  cui_nrsegs = 0,
  cui_tot_nrsegs = 0,
  cui_iov_olen = 0,
  cui_tot_count = 0,
  u = {
    setattr = {
      cui_local_lock = SETATTR_NOLOCK
    }
  },
  cui_glimpse = 0,
  cui_layout_gen = 4294967294,
  cui_fd = 0x0,
  cui_iocb = 0x0
}

I'm not sure why MDS_GETXATTR is returning -EACCES, but form here it's easy to see that the ll_fault_io_init() fails to handle the subsequent error from cl_io_init() and returns io without cui_fd being set.



 Comments   
Comment by John Hammond [ 21/Jun/13 ]

Please see http://review.whamcloud.com/6735.

Comment by John Hammond [ 28/Jun/13 ]

Patch landed to master.

Comment by Marek Magrys [ 31/Oct/13 ]

I think we hit the same bug with 2.4.X:

Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffffa138825c>]: vvp_io_fault_iter_init+0x4c/0xc0 [lustre]
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus Call[]: Trace:
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffffa0f8186d>]: cl_io_iter_init+0x5d/0x110 [obdclass]
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffffa0f85e0c>]: cl_io_loop+0x4c/0x1b0 [obdclass]
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffffa136b6c2>]: ll_fault+0x2c2/0x4d0 [lustre]
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff81143294>]: __do_fault+0x54/0x530
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff8112fdd0>]: ? __lru_cache_add+0x40/0x90
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff81143867>]: handle_pte_fault+0xf7/0xb50
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffffa1381171>]: ? cl2ccc_io+0x21/0xa0 [lustre]
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffffa13874a5>]: ? vvp_io_fini+0x25/0x1b0 [lustre]
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff811444fa>]: handle_mm_fault+0x23a/0x310
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff810474e9>]: __do_page_fault+0x139/0x480
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff8100988e>]: ? __switch_to+0x26e/0x320
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff8150e130>]: ? thread_return+0x4e/0x76e
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff81513b6e>]: do_page_fault+0x3e/0xa0
Oct 31 10:26:22 <user.notice> n0466-g6l.zeus [<ffffffff81510f25>]: page_fault+0x25/0x30

So maybe it's worth to push this patch into b2_4, unless you don't plan to release 2.4.2.

Comment by Mike O'Connor [ 29/Sep/14 ]

We appear to have hit the same bug with 2.4.2, as well.

Is there a plan to backport this fix to the 2.4 tree?

Generated at Sat Feb 10 01:34:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.