[LU-2175] sanity-quota test 8 crash Created: 14/Oct/12 Updated: 17/Oct/12 Resolved: 17/Oct/12 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Johann Lombardi (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 5207 |
| Description |
|
This might not be directly quota-related, but I saw this for the first time only when trying sanity-quota and I do lots of other testing including dbench itself.

[ 1544.513445] Lustre: DEBUG MARKER: == sanity-quota test 8: Run dbench with quota enabled == 03:01:54 (1350198114)
[ 2076.385452] BUG: unable to handle kernel paging request at ffff89001f9ddd60
[ 2076.385990] IP: [<ffffffffa05a2915>] ptlrpc_set_wait+0x605/0x8b0 [ptlrpc]
[ 2076.386515] PGD 0
[ 2076.386851] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2076.387283] last sysfs file: /sys/devices/system/cpu/possible
[ 2076.387727] CPU 0
[ 2076.387788] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[ 2076.389330]
[ 2076.389330] Pid: 18077, comm: dbench Not tainted 2.6.32-debug #6 Bochs Bochs
[ 2076.389330] RIP: 0010:[<ffffffffa05a2915>] [<ffffffffa05a2915>] ptlrpc_set_wait+0x605/0x8b0 [ptlrpc]
[ 2076.389330] RSP: 0018:ffff880010b01748 EFLAGS: 00010206
[ 2076.389330] RAX: 0000000000000000 RBX: ffff88001f9ddf30 RCX: 0000000000000000
[ 2076.389330] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b017d400
[ 2076.389330] RBP: ffff880010b017d8 R08: 0000000000000001 R09: ffff880000000000
[ 2076.389330] R10: 0000000000000000 R11: 0000000087654321 R12: ffff88001f9ddf70
[ 2076.389330] R13: ffff89001f9ddf70 R14: ffff88008d4f3bf0 R15: ffff880010b019a4
[ 2076.389330] FS: 00007f33f47fa700(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
[ 2076.389330] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2076.389330] CR2: ffff89001f9ddd60 CR3: 0000000012034000 CR4: 00000000000006f0
[ 2076.389330] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2076.389330] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2076.389330] Process dbench (pid: 18077, threadinfo ffff880010b00000, task ffff8800b52541c0)
[ 2076.389330] Stack:
[ 2076.389330] ffff8800ffffff9d ffff8800b5254860 00000000000001f4 d84156c5635688c0
[ 2076.389330] <d> ffff880010b01798 ffffffffa1089fa4 ffff880010b01798 00000020a05ab4fc
[ 2076.389330] <d> ffff8800ae00cbf0 0000000000000000 ffff880010b017b8 ffffffffa05ac5e6
[ 2076.389330] [<ffffffffa05049a3>] lmv_intent_lock+0x2b3/0x380 [lmv]
[ 2076.389330] [<ffffffffa0c238e0>] ? ll_md_blocking_ast+0x0/0x780 [lustre]
[ 2076.389330] [<ffffffffa0c2382e>] ? ll_i2gids+0x2e/0xe0 [lustre]
[ 2076.389330] [<ffffffffa0c07b3f>] ? ll_prep_md_op_data+0x12f/0x3f0 [lustre]
[ 2076.389330] [<ffffffffa0c22fb0>] ll_lookup_it+0x3b0/0xbd0 [lustre]
[ 2076.389330] [<ffffffffa0c238e0>] ? ll_md_blocking_ast+0x0/0x780 [lustre]
[ 2076.389330] [<ffffffffa0f68c10>] ? cfs_alloc+0x30/0x60 [libcfs]
[ 2076.389330] [<ffffffffa0c24839>] ll_lookup_nd+0x89/0x3b0 [lustre]
[ 2076.389330] [<ffffffff81193f36>] ? d_alloc+0x166/0x1d0
[ 2076.389330] [<ffffffff811877e2>] __lookup_hash+0x102/0x160
[ 2076.389330] [<ffffffff81187e8a>] lookup_hash+0x3a/0x50
[ 2076.389330] [<ffffffff8118bda6>] do_filp_open+0x2c6/0xd90
[ 2076.389330] [<ffffffff8118827b>] ? getname+0x3b/0x250
[ 2076.389330] [<ffffffff8119861b>] ? alloc_fd+0xab/0x160
[ 2076.389330] [<ffffffff81178839>] do_sys_open+0x69/0x140
[ 2076.389330] [<ffffffff81178950>] sys_open+0x20/0x30
[ 2076.389330] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[ 2076.389330] Code: 00 00 00 0f 85 5a fa ff ff 4c 8b 6b 40 4c 8d 63 40 4d 39 e5 75 15 e9 52 ff ff ff 0f 1f 00 4d 8b 6d 00 4d 39 e5 0f 84 96 fa ff ff <41> 81 bd f0 fd ff ff 00 de c0 eb 49 8d bd 68 fd ff ff 75 df e8
[ 2076.389330] RIP [<ffffffffa05a2915>] ptlrpc_set_wait+0x605/0x8b0 [ptlrpc]
[ 2076.389330] RSP <ffff880010b01748>
[ 2076.389330] CR2: ffff89001f9ddd60

I have a crashdump for this. |
| Comments |
| Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ] |
|
This is indeed not related to quota (the crash is in the client code). I'm checking the lmv code in case I can put my finger on something, but I'm afraid that more information (a crash dump?) will be required ... |
| Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ] |
Ah, I missed this. Could you please upload it to brent? |
| Comment by Oleg Drokin [ 15/Oct/12 ] |
|
lu2175.tar.bz2 and vmlinux.bz2 are in my home dir on brent. |
| Comment by Johann Lombardi (Inactive) [ 16/Oct/12 ] |
|
Trying to map the faulting address (the RIP offset 0x605 is 1541 in decimal), I get:

(gdb) list *(ptlrpc_set_wait+1541)
0x41945 is in ptlrpc_set_wait (/home/green/git/lustre-release/lustre/ptlrpc/client.c:2121).
2116 /home/green/git/lustre-release/lustre/ptlrpc/client.c: No such file or directory.
2107 int ptlrpc_set_wait(struct ptlrpc_request_set *set)
2108 {
2109 cfs_list_t *tmp;
2110 struct ptlrpc_request *req;
2111 struct l_wait_info lwi;
2112 int rc, timeout;
2113 ENTRY;
2114
2115 if (set->set_producer)
2116 (void)ptlrpc_set_producer(set);
2117 else
2118 cfs_list_for_each(tmp, &set->set_requests) {
2119 req = cfs_list_entry(tmp, struct ptlrpc_request,
2120 rq_set_chain);
2121 if (req->rq_phase == RQ_PHASE_NEW)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2122 (void)ptlrpc_send_new_req(req);
2123 }
So it seems that we crashed while trying to access a request from set_requests.

crash> px *((struct ptlrpc_request_set *)0xffff88001f9ddf30)
$9 = {
set_refcount = {
counter = 0x1
},
set_new_count = {
counter = 0x0
},
set_remaining = {
counter = 0x1
},
set_waitq = {
lock = {
raw_lock = {
slock = 0x0
},
magic = 0xdead4ead,
owner_cpu = 0xffffffff,
owner = 0xffffffffffffffff
},
task_list = {
next = 0xffff88001f9ddf58,
prev = 0xffff88001f9ddf58
}
},
set_wakeup_ptr = 0x0,
set_requests = {
next = 0xffff8800ae00de88,
prev = 0xffff8800ae00de88
^^^^^^^^^^^^^^^^^^^
},
set_cblist = {
next = 0xffff88001f9ddf80,
prev = 0xffff88001f9ddf80
},
set_interpret = 0,
set_arg = 0x0,
set_new_req_lock = {
raw_lock = {
slock = 0x0
},
magic = 0xdead4ead,
owner_cpu = 0xffffffff,
owner = 0xffffffffffffffff
},
set_new_requests = {
next = 0xffff88001f9ddfb8,
prev = 0xffff88001f9ddfb8
},
set_rc = 0x0,
set_max_inflight = 0xffffffff,
set_producer = 0,
set_producer_arg = 0x0
}
So there is indeed a request in set_requests, which is:

(gdb) p &(((struct ptlrpc_request *)0)->rq_set_chain)
$4 = (cfs_list_t *) 0x298
(gdb) p/x (0xffff8800ae00de88 - 0x298)
$5 = 0xffff8800ae00dbf0
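For reference, the subtraction above is the usual container_of arithmetic that cfs_list_entry() performs: the request address is the list-member address minus the member's offset within struct ptlrpc_request. A minimal user-space sketch of the same computation, with the 0x298 offset hard-coded from the gdb output above (so it is specific to this particular build):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* next pointer stored in set->set_requests (from the set dump above) */
        uint64_t set_requests_next = 0xffff8800ae00de88ULL;
        /* offsetof(struct ptlrpc_request, rq_set_chain) as reported by gdb */
        uint64_t rq_set_chain_off = 0x298ULL;

        /* cfs_list_entry()/container_of(): member address - member offset */
        printf("request = 0x%llx\n",
               (unsigned long long)(set_requests_next - rq_set_chain_off));
        /* prints 0xffff8800ae00dbf0, the address examined with crash below */
        return 0;
}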
crash> px *((struct ptlrpc_request *)(0xffff8800ae00dbf0))
$18 = {
rq_type = 0x1267,
rq_status = 0x0,
rq_list = {
next = 0xffff88008ad47890,
prev = 0xffff88008ad47890
},
rq_timed_list = {
next = 0xffff8900ae00dc08,
prev = 0xffff8800ae00dc08
},
rq_history_list = {
next = 0xffff8900ae00dc18,
prev = 0xffff8900ae00dc18
},
rq_exp_list = {
next = 0xffff8900ae00dc28,
prev = 0xffff8800ae00dc28
},
rq_ops = 0x10000000000,
rq_svc_thread = 0x0,
rq_history_seq = 0x0,
rq_at_index = 0x0,
rq_lock = {
raw_lock = {
slock = 0x0
},
magic = 0xdead4ead,
owner_cpu = 0xffffffff,
owner = 0xffffffffffffffff
},
rq_intr = 0x0,
rq_replied = 0x1,
rq_err = 0x0,
rq_timedout = 0x0,
rq_resend = 0x0,
rq_restart = 0x0,
rq_replay = 0x1,
rq_no_resend = 0x0,
rq_waiting = 0x0,
rq_receiving_reply = 0x0,
rq_no_delay = 0x0,
rq_net_err = 0x0,
rq_wait_ctx = 0x0,
rq_early = 0x0,
rq_must_unlink = 0x1,
rq_fake = 0x0,
rq_memalloc = 0x0,
rq_packed_final = 0x0,
rq_hp = 0x0,
rq_at_linked = 0x0,
rq_reply_truncate = 0x0,
rq_committed = 0x0,
rq_invalid_rqset = 0x0,
rq_generation_set = 0x0,
rq_no_retry_einprogress = 0x1,
rq_nr_resend = 0x0,
rq_phase = 3955285505,
rq_next_phase = 3955285510,
rq_refcount = {
counter = 0x2
},
rq_request_portal = 0x10c,
rq_reply_portal = 0xa,
rq_nob_received = 0x258,
rq_reqlen = 0x348,
rq_replen = 0x470,
rq_reqmsg = 0xffff8900ae00cbf0,
rq_repmsg = 0x0,
rq_transno = 0x10000000000,
rq_xid = 0x507a5fe4aac3c,
rq_replay_list = {
next = 0xffff8800ae00dcb8,
prev = 0xffff8800ae00dcb8
},
rq_cli_ctx = 0xffffffffa0677380,
rq_svc_ctx = 0x0,
rq_ctx_chain = {
next = 0xffff8800ae00dcd8,
prev = 0xffff8800ae00dcd8
},
rq_flvr = {
sf_rpc = 0x0,
sf_flags = 0x0,
u_rpc = {<No data fields>},
u_bulk = {
hash = {
hash_alg = 0x0
}
}
},
rq_sp_from = LUSTRE_SP_CLI,
rq_ctx_init = 0x0,
rq_ctx_fini = 0x0,
rq_bulk_read = 0x0,
rq_bulk_write = 0x0,
rq_auth_gss = 0x0,
rq_auth_remote = 0x0,
rq_auth_usr_root = 0x0,
rq_auth_usr_mdt = 0x0,
rq_auth_usr_ost = 0x0,
rq_pack_udesc = 0x0,
rq_pack_bulk = 0x0,
rq_no_reply = 0x0,
rq_pill_init = 0x1,
rq_auth_uid = 0x0,
rq_auth_mapped_uid = 0x0,
rq_user_desc = 0x10000000000,
rq_reqbuf = 0xffff8800ae00cbf0,
rq_repbuf = 0xffff8800ae00e7f0 "",
rq_repdata = 0x0,
rq_clrbuf = 0x0,
rq_reqbuf_len = 0x400,
rq_reqdata_len = 0x248,
rq_repbuf_len = 0x800,
rq_repdata_len = 0x0,
rq_clrbuf_len = 0x0,
rq_clrdata_len = 0x0,
rq_reply_off = 0xc0,
rq_req_swab_mask = 0x0,
rq_rep_swab_mask = 0x0,
rq_import_generation = 0x1,
rq_send_state = LUSTRE_IMP_FULL,
rq_early_count = 0x0,
rq_req_md_h = {
cookie = 0x2266a11
},
rq_req_cbid = {
cbid_fn = 0xffffffffa05b1720,
cbid_arg = 0xffff8800ae00dbf0
},
rq_delay_limit = 0x0,
rq_queued_time = 0x10006c6be,
rq_arrival_time = {
tv_sec = 0x507a6575,
tv_usec = 0xe6e6b
},
rq_reply_state = 0x10000000000,
rq_rqbd = 0x0,
rq_reply_md_h = {
cookie = 0x2266a09
},
rq_reply_waitq = {
lock = {
raw_lock = {
slock = 0x0
},
magic = 0xdead4ead,
owner_cpu = 0xffffffff,
owner = 0xffffffffffffffff
},
task_list = {
next = 0xffff8800ae00ddc8,
prev = 0xffff8800ae00ddc8
}
},
rq_reply_cbid = {
cbid_fn = 0xffffffffa05b22c0,
cbid_arg = 0xffff8800ae00dbf0
},
rq_self = 0x0,
rq_peer = {
nid = 0x0,
pid = 0x0
},
rq_export = 0x10000000000,
rq_import = 0xffff89008ad477f0,
rq_replay_cb = 0,
rq_commit_cb = 0x10000000000,
rq_cb_data = 0x10000000000,
rq_bulk = 0x0,
rq_sent = 0x507a6575,
rq_real_sent = 0x507a6575,
rq_deadline = 0x507a65a1,
rq_reply_deadline = 0x0,
rq_bulk_deadline = 0x0,
rq_timeout = 0x2b,
rq_set_waitq = {
lock = {
raw_lock = {
slock = 0x0
},
magic = 0xdead4ead,
owner_cpu = 0xffffffff,
owner = 0xffffffffffffffff
},
task_list = {
next = 0xffff8800ae00de78,
prev = 0xffff8800ae00de78
}
},
rq_set_chain = {
next = 0xffff89001f9ddf70,
prev = 0xffff89001f9ddf70
},
rq_set = 0xffff88001f9ddf30,
rq_interpret_reply = 0,
rq_async_args = {
pointer_arg = {0x0, 0x10000000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
space = {0x0, 0x10000000000, 0x0, 0x0, 0x0, 0x0, 0x0}
},
rq_pool = 0x0,
rq_session = {
lc_tags = 0x0,
lc_state = 0,
lc_thread = 0x10000000000,
lc_value = 0x10000000000,
lc_remember = {
next = 0x0,
prev = 0x0
},
lc_version = 0x0,
lc_cookie = 0x0
},
rq_recov_session = {
lc_tags = 0x0,
lc_state = 0,
lc_thread = 0x0,
lc_value = 0x0,
lc_remember = {
next = 0x0,
prev = 0x0
},
lc_version = 0x0,
lc_cookie = 0x0
},
rq_pill = {
rc_req = 0xffff8800ae00dbf0,
rc_fmt = 0xffffffffa064f8a0,
rc_loc = RCL_CLIENT,
rc_area = {{0xb8, 0x68, 0x8, 0x88, 0x100, 0x0, 0x10b, 0x48, 0xfffffeff}, {0xb8, 0x70, 0xd8, 0x38, 0x104, 0x178, 0x78, 0xfffffeff, 0xffffffff}}
}
|
| Comment by Johann Lombardi (Inactive) [ 16/Oct/12 ] |
rq_set_chain = {
next = 0xffff89001f9ddf70,
prev = 0xffff89001f9ddf70
},
So the request almost points back to the set: the expected value is 0xffff88001f9ddf30 (set address) + 0x40 (set_requests offset) = 0xffff88001f9ddf70, but the stored value is 0xffff89001f9ddf70, i.e. off by exactly 0x10000000000. It looks like a single bit flip (bit 40).
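As a quick cross-check (just arithmetic on the two addresses quoted above, not part of the crash output), XORing the expected and observed pointers confirms they differ in exactly one bit, bit 40:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t expected = 0xffff88001f9ddf70ULL; /* set + 0x40 (set_requests) */
        uint64_t observed = 0xffff89001f9ddf70ULL; /* value found in rq_set_chain */
        uint64_t diff = expected ^ observed;

        /* diff == 0x10000000000 == 1ULL << 40: a single flipped bit */
        printf("xor = 0x%llx, single bit: %s\n",
               (unsigned long long)diff,
               (diff != 0 && (diff & (diff - 1)) == 0) ? "yes" : "no");
        return 0;
}
|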
| Comment by Johann Lombardi (Inactive) [ 17/Oct/12 ] |
|
I talked to Oleg and he indeed experienced some memory issues which are likely the root cause of this crash. |