[LU-2175] sanity-quota test 8 crash Created: 14/Oct/12  Updated: 17/Oct/12  Resolved: 17/Oct/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Johann Lombardi (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5207

 Description   

This might not be directly quota-related, but I saw this for the first time only when running sanity-quota, even though I do lots of other testing, including dbench itself. So for now I suspect it's due to quota.

[ 1544.513445] Lustre: DEBUG MARKER: == sanity-quota test 8: Run dbench with quota enabled == 03:01:54 (1350198114)
[ 2076.385452] BUG: unable to handle kernel paging request at ffff89001f9ddd60
[ 2076.385990] IP: [<ffffffffa05a2915>] ptlrpc_set_wait+0x605/0x8b0 [ptlrpc]
[ 2076.386515] PGD 0 
[ 2076.386851] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2076.387283] last sysfs file: /sys/devices/system/cpu/possible
[ 2076.387727] CPU 0 
[ 2076.387788] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[ 2076.389330] 
[ 2076.389330] Pid: 18077, comm: dbench Not tainted 2.6.32-debug #6 Bochs Bochs
[ 2076.389330] RIP: 0010:[<ffffffffa05a2915>]  [<ffffffffa05a2915>] ptlrpc_set_wait+0x605/0x8b0 [ptlrpc]
[ 2076.389330] RSP: 0018:ffff880010b01748  EFLAGS: 00010206
[ 2076.389330] RAX: 0000000000000000 RBX: ffff88001f9ddf30 RCX: 0000000000000000
[ 2076.389330] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b017d400
[ 2076.389330] RBP: ffff880010b017d8 R08: 0000000000000001 R09: ffff880000000000
[ 2076.389330] R10: 0000000000000000 R11: 0000000087654321 R12: ffff88001f9ddf70
[ 2076.389330] R13: ffff89001f9ddf70 R14: ffff88008d4f3bf0 R15: ffff880010b019a4
[ 2076.389330] FS:  00007f33f47fa700(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
[ 2076.389330] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2076.389330] CR2: ffff89001f9ddd60 CR3: 0000000012034000 CR4: 00000000000006f0
[ 2076.389330] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2076.389330] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2076.389330] Process dbench (pid: 18077, threadinfo ffff880010b00000, task ffff8800b52541c0)
[ 2076.389330] Stack:
[ 2076.389330]  ffff8800ffffff9d ffff8800b5254860 00000000000001f4 d84156c5635688c0
[ 2076.389330] <d> ffff880010b01798 ffffffffa1089fa4 ffff880010b01798 00000020a05ab4fc
[ 2076.389330] <d> ffff8800ae00cbf0 0000000000000000 ffff880010b017b8 ffffffffa05ac5e6
[ 2076.389330]  [<ffffffffa05049a3>] lmv_intent_lock+0x2b3/0x380 [lmv]
[ 2076.389330]  [<ffffffffa0c238e0>] ? ll_md_blocking_ast+0x0/0x780 [lustre]
[ 2076.389330]  [<ffffffffa0c2382e>] ? ll_i2gids+0x2e/0xe0 [lustre]
[ 2076.389330]  [<ffffffffa0c07b3f>] ? ll_prep_md_op_data+0x12f/0x3f0 [lustre]
[ 2076.389330]  [<ffffffffa0c22fb0>] ll_lookup_it+0x3b0/0xbd0 [lustre]
[ 2076.389330]  [<ffffffffa0c238e0>] ? ll_md_blocking_ast+0x0/0x780 [lustre]
[ 2076.389330]  [<ffffffffa0f68c10>] ? cfs_alloc+0x30/0x60 [libcfs]
[ 2076.389330]  [<ffffffffa0c24839>] ll_lookup_nd+0x89/0x3b0 [lustre]
[ 2076.389330]  [<ffffffff81193f36>] ? d_alloc+0x166/0x1d0
[ 2076.389330]  [<ffffffff811877e2>] __lookup_hash+0x102/0x160
[ 2076.389330]  [<ffffffff81187e8a>] lookup_hash+0x3a/0x50
[ 2076.389330]  [<ffffffff8118bda6>] do_filp_open+0x2c6/0xd90
[ 2076.389330]  [<ffffffff8118827b>] ? getname+0x3b/0x250
[ 2076.389330]  [<ffffffff8119861b>] ? alloc_fd+0xab/0x160
[ 2076.389330]  [<ffffffff81178839>] do_sys_open+0x69/0x140
[ 2076.389330]  [<ffffffff81178950>] sys_open+0x20/0x30
[ 2076.389330]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[ 2076.389330] Code: 00 00 00 0f 85 5a fa ff ff 4c 8b 6b 40 4c 8d 63 40 4d 39 e5 75 15 e9 52 ff ff ff 0f 1f 00 4d 8b 6d 00 4d 39 e5 0f 84 96 fa ff ff <41> 81 bd f0 fd ff ff 00 de c0 eb 49 8d bd 68 fd ff ff 75 df e8 
[ 2076.389330] RIP  [<ffffffffa05a2915>] ptlrpc_set_wait+0x605/0x8b0 [ptlrpc]
[ 2076.389330]  RSP <ffff880010b01748>
[ 2076.389330] CR2: ffff89001f9ddd60

I have a crashdump for this.



 Comments   
Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ]

This is indeed not related to quota (the crash is in the client code). I'm checking the lmv code in case I can put my finger on something, but I'm afraid that more information (a crash dump?) will be required ...

Comment by Johann Lombardi (Inactive) [ 15/Oct/12 ]

I have a crashdump for this.

Ah, I missed this. Could you please upload it to brent?

Comment by Oleg Drokin [ 15/Oct/12 ]

lu2175.tar.bz2 and vmlinux.bz2 in my home dir on brent

Comment by Johann Lombardi (Inactive) [ 16/Oct/12 ]

Trying to map the address, I get:

(gdb) list *(ptlrpc_set_wait+1541)
0x41945 is in ptlrpc_set_wait (/home/green/git/lustre-release/lustre/ptlrpc/client.c:2121).
2116    /home/green/git/lustre-release/lustre/ptlrpc/client.c: No such file or directory.

2107 int ptlrpc_set_wait(struct ptlrpc_request_set *set)
2108 {
2109         cfs_list_t            *tmp;
2110         struct ptlrpc_request *req;
2111         struct l_wait_info     lwi;
2112         int                    rc, timeout;
2113         ENTRY;
2114
2115         if (set->set_producer)
2116                 (void)ptlrpc_set_producer(set);
2117         else
2118                 cfs_list_for_each(tmp, &set->set_requests) {
2119                         req = cfs_list_entry(tmp, struct ptlrpc_request,
2120                                              rq_set_chain);
2121                         if (req->rq_phase == RQ_PHASE_NEW)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2122                                 (void)ptlrpc_send_new_req(req);
2123                 }

So it seems that we crashed while trying to access a request from set_requests.
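For reference, a minimal sketch (not the actual Lustre source; names and types simplified) of what this loop does: cfs_list_entry() is container_of(), so each chain pointer is converted back to its containing request before rq_phase is read, which means a corrupted next pointer in the chain turns directly into a wild dereference.

/* Simplified sketch of the cfs_list_for_each() walk at client.c:2118-2122;
 * structure layout is illustrative only. */
#include <stddef.h>

struct list_node {
        struct list_node *next, *prev;
};

struct request {
        int              rq_phase;
        struct list_node rq_set_chain;   /* at offset 0x298 in the real struct */
};

/* cfs_list_entry() is container_of(): back up from the embedded list node
 * to the request that contains it. */
#define node_to_request(node) \
        ((struct request *)((char *)(node) - offsetof(struct request, rq_set_chain)))

void walk_set(struct list_node *set_requests)
{
        struct list_node *tmp;

        for (tmp = set_requests->next; tmp != set_requests; tmp = tmp->next) {
                struct request *req = node_to_request(tmp);

                /* The first access through the computed pointer is rq_phase,
                 * so a corrupted ->next in the chain faults right here. */
                if (req->rq_phase == 0 /* RQ_PHASE_NEW, illustrative value */) {
                        /* ptlrpc_send_new_req(req) in the real code */
                }
        }
}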
Dumping the set, I get:

crash> px *((struct ptlrpc_request_set *)0xffff88001f9ddf30)
$9 = {
  set_refcount = {
    counter = 0x1
  },
  set_new_count = {
    counter = 0x0
  },
  set_remaining = {
    counter = 0x1
  },
  set_waitq = {
    lock = {
      raw_lock = {
        slock = 0x0
      },
      magic = 0xdead4ead,
      owner_cpu = 0xffffffff,
      owner = 0xffffffffffffffff
    },
    task_list = {
      next = 0xffff88001f9ddf58,
      prev = 0xffff88001f9ddf58
    }
  },
  set_wakeup_ptr = 0x0,
  set_requests = {
    next = 0xffff8800ae00de88,
    prev = 0xffff8800ae00de88
^^^^^^^^^^^^^^^^^^^
  },
  set_cblist = {
    next = 0xffff88001f9ddf80,
    prev = 0xffff88001f9ddf80
  },
  set_interpret = 0,
  set_arg = 0x0,
  set_new_req_lock = {
    raw_lock = {
      slock = 0x0
    },
    magic = 0xdead4ead,
    owner_cpu = 0xffffffff,
    owner = 0xffffffffffffffff
  },
  set_new_requests = {
    next = 0xffff88001f9ddfb8,
    prev = 0xffff88001f9ddfb8
  },
  set_rc = 0x0,
  set_max_inflight = 0xffffffff,
  set_producer = 0,
  set_producer_arg = 0x0
}

So there is indeed a request in set_requests, which is:

(gdb) p &(((struct ptlrpc_request *)0)->rq_set_chain)
$4 = (cfs_list_t *) 0x298
(gdb) p/x (0xffff8800ae00de88 - 0x298)
$5 = 0xffff8800ae00dbf0
crash> px *((struct ptlrpc_request *)(0xffff8800ae00dbf0))
$18 = {
  rq_type = 0x1267,
  rq_status = 0x0,
  rq_list = {
    next = 0xffff88008ad47890,
    prev = 0xffff88008ad47890
  },
  rq_timed_list = {
    next = 0xffff8900ae00dc08,
    prev = 0xffff8800ae00dc08
  },
  rq_history_list = {
    next = 0xffff8900ae00dc18,
    prev = 0xffff8900ae00dc18
  },
  rq_exp_list = {
    next = 0xffff8900ae00dc28,
    prev = 0xffff8800ae00dc28
  },
  rq_ops = 0x10000000000,
  rq_svc_thread = 0x0,
  rq_history_seq = 0x0,
  rq_at_index = 0x0,
  rq_lock = {
    raw_lock = {
      slock = 0x0
    },
    magic = 0xdead4ead,
    owner_cpu = 0xffffffff,
    owner = 0xffffffffffffffff
  },
  rq_intr = 0x0,
  rq_replied = 0x1,
  rq_err = 0x0,
  rq_timedout = 0x0,
  rq_resend = 0x0,
  rq_restart = 0x0,
  rq_replay = 0x1,
  rq_no_resend = 0x0,
  rq_waiting = 0x0,
  rq_receiving_reply = 0x0,
  rq_no_delay = 0x0,
  rq_net_err = 0x0,
  rq_wait_ctx = 0x0,
  rq_early = 0x0,
  rq_must_unlink = 0x1,
  rq_fake = 0x0,
  rq_memalloc = 0x0,
  rq_packed_final = 0x0,
  rq_hp = 0x0,
  rq_at_linked = 0x0,
  rq_reply_truncate = 0x0,
  rq_committed = 0x0,
  rq_invalid_rqset = 0x0,
  rq_generation_set = 0x0,
  rq_no_retry_einprogress = 0x1,
  rq_nr_resend = 0x0,
  rq_phase = 3955285505,
  rq_next_phase = 3955285510,
  rq_refcount = {
    counter = 0x2
  },
  rq_request_portal = 0x10c,
  rq_reply_portal = 0xa,
  rq_nob_received = 0x258,
  rq_reqlen = 0x348,
  rq_replen = 0x470,
  rq_reqmsg = 0xffff8900ae00cbf0,
  rq_repmsg = 0x0,
  rq_transno = 0x10000000000,
  rq_xid = 0x507a5fe4aac3c,
  rq_replay_list = {
    next = 0xffff8800ae00dcb8,
    prev = 0xffff8800ae00dcb8
  },
  rq_cli_ctx = 0xffffffffa0677380,
  rq_svc_ctx = 0x0,
  rq_ctx_chain = {
    next = 0xffff8800ae00dcd8,
    prev = 0xffff8800ae00dcd8
  },
  rq_flvr = {
    sf_rpc = 0x0,
    sf_flags = 0x0,
    u_rpc = {<No data fields>},
    u_bulk = {
      hash = {
        hash_alg = 0x0
      }
    }
  },
  rq_sp_from = LUSTRE_SP_CLI,
  rq_ctx_init = 0x0,
  rq_ctx_fini = 0x0,
  rq_bulk_read = 0x0,
  rq_bulk_write = 0x0,
  rq_auth_gss = 0x0,
  rq_auth_remote = 0x0,
  rq_auth_usr_root = 0x0,
  rq_auth_usr_mdt = 0x0,
  rq_auth_usr_ost = 0x0,
  rq_pack_udesc = 0x0,
  rq_pack_bulk = 0x0,
  rq_no_reply = 0x0,
  rq_pill_init = 0x1,
  rq_auth_uid = 0x0,
  rq_auth_mapped_uid = 0x0,
  rq_user_desc = 0x10000000000,
  rq_reqbuf = 0xffff8800ae00cbf0,
  rq_repbuf = 0xffff8800ae00e7f0 "",
  rq_repdata = 0x0,
  rq_clrbuf = 0x0,
  rq_reqbuf_len = 0x400,
  rq_reqdata_len = 0x248,
  rq_repbuf_len = 0x800,
  rq_repdata_len = 0x0,
  rq_clrbuf_len = 0x0,
  rq_clrdata_len = 0x0,
  rq_reply_off = 0xc0,
  rq_req_swab_mask = 0x0,
  rq_rep_swab_mask = 0x0,
  rq_import_generation = 0x1,
  rq_send_state = LUSTRE_IMP_FULL,
  rq_early_count = 0x0,
  rq_req_md_h = {
    cookie = 0x2266a11
  },
  rq_req_cbid = {
    cbid_fn = 0xffffffffa05b1720,
    cbid_arg = 0xffff8800ae00dbf0
  },
  rq_delay_limit = 0x0,
  rq_queued_time = 0x10006c6be,
  rq_arrival_time = {
    tv_sec = 0x507a6575,
    tv_usec = 0xe6e6b
  },
  rq_reply_state = 0x10000000000,
  rq_rqbd = 0x0,
  rq_reply_md_h = {
    cookie = 0x2266a09
  },
  rq_reply_waitq = {
    lock = {
      raw_lock = {
        slock = 0x0
      },
      magic = 0xdead4ead,
      owner_cpu = 0xffffffff,
      owner = 0xffffffffffffffff
    },
    task_list = {
      next = 0xffff8800ae00ddc8,
      prev = 0xffff8800ae00ddc8
    }
  },
  rq_reply_cbid = {
    cbid_fn = 0xffffffffa05b22c0,
    cbid_arg = 0xffff8800ae00dbf0
  },
  rq_self = 0x0,
  rq_peer = {
    nid = 0x0,
    pid = 0x0
  },
  rq_export = 0x10000000000,
  rq_import = 0xffff89008ad477f0,
  rq_replay_cb = 0,
  rq_commit_cb = 0x10000000000,
  rq_cb_data = 0x10000000000,
  rq_bulk = 0x0,
  rq_sent = 0x507a6575,
  rq_real_sent = 0x507a6575,
  rq_deadline = 0x507a65a1,
  rq_reply_deadline = 0x0,
  rq_bulk_deadline = 0x0,
  rq_timeout = 0x2b,
  rq_set_waitq = {
    lock = {
      raw_lock = {
        slock = 0x0
      },
      magic = 0xdead4ead,
      owner_cpu = 0xffffffff,
      owner = 0xffffffffffffffff
    },
    task_list = {
      next = 0xffff8800ae00de78,
      prev = 0xffff8800ae00de78
    }
  },
  rq_set_chain = {
    next = 0xffff89001f9ddf70,
    prev = 0xffff89001f9ddf70
  },
  rq_set = 0xffff88001f9ddf30,
  rq_interpret_reply = 0,
  rq_async_args = {
    pointer_arg = {0x0, 0x10000000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
    space = {0x0, 0x10000000000, 0x0, 0x0, 0x0, 0x0, 0x0}
  },
  rq_pool = 0x0,
  rq_session = {
    lc_tags = 0x0,
    lc_state = 0,
    lc_thread = 0x10000000000,
    lc_value = 0x10000000000,
    lc_remember = {
      next = 0x0,
      prev = 0x0
    },
    lc_version = 0x0,
    lc_cookie = 0x0
  },
  rq_recov_session = {
    lc_tags = 0x0,
    lc_state = 0,
    lc_thread = 0x0,
    lc_value = 0x0,
    lc_remember = {
      next = 0x0,
      prev = 0x0
    },
    lc_version = 0x0,
    lc_cookie = 0x0
  },
  rq_pill = {
    rc_req = 0xffff8800ae00dbf0,
    rc_fmt = 0xffffffffa064f8a0,
    rc_loc = RCL_CLIENT,
    rc_area = {{0xb8, 0x68, 0x8, 0x88, 0x100, 0x0, 0x10b, 0x48, 0xfffffeff}, {0xb8, 0x70, 0xd8, 0x38, 0x104, 0x178, 0x78, 0xfffffeff, 0xffffffff}}
  }
}

Comment by Johann Lombardi (Inactive) [ 16/Oct/12 ]

  rq_set_chain = {
    next = 0xffff89001f9ddf70,
    prev = 0xffff89001f9ddf70
  },

So the request almost points back to the set: rq_set_chain should point at 0xffff88001f9ddf70 = 0xffff88001f9ddf30 (set address) + 0x40 (set_requests offset), but the stored pointer is 0xffff89001f9ddf70, which differs from the expected value by 0x10000000000. It seems that a single bit was flipped.
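To make the arithmetic explicit, here is a tiny user-space check (hypothetical, not part of the ticket) using the addresses from the dumps above:

#include <stdio.h>

int main(void)
{
        unsigned long long set      = 0xffff88001f9ddf30ULL;  /* set address from the dump */
        unsigned long long expected = set + 0x40;             /* + set_requests offset */
        unsigned long long observed = 0xffff89001f9ddf70ULL;  /* rq_set_chain.next/prev */

        /* prints: xor = 0x10000000000, i.e. only bit 40 differs */
        printf("expected %#llx, observed %#llx, xor %#llx\n",
               expected, observed, expected ^ observed);
        return 0;
}

The same 0x10000000000 pattern also shows up in several other fields of the dumped request (rq_ops, rq_transno, rq_export, ...), which fits the corruption theory.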

Comment by Johann Lombardi (Inactive) [ 17/Oct/12 ]

I talked to Oleg and he indeed experienced some memory issues, which are likely the root cause of this crash.
