[LU-4229] crash in NRS cleanup during mount failure Created: 08/Nov/13  Updated: 13/Feb/14  Resolved: 14/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Andreas Dilger
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

Lustre master v2_5_50_0-3-g6229525
Single node test setup, 1 MDT, 3 OST, client
RHEL6.3 2.6.32-279.5.1


Issue Links:
Duplicate
duplicates LU-4357 page allocation failure. mode:0x40 ca... Resolved
duplicates LU-3772 Crash in ptlrpc_service_nrs_cleanup()... Resolved
Severity: 3
Rank (Obsolete): 11519

 Description   

I was running a memory-intensive workload on the same node and then mounted the MDS. An allocation failed during setup, and the subsequent cleanup then oopsed.

LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. quota=on. Opts: 
mount.lustre: page allocation failure. order:1, mode:0x40
Pid: 6512, comm: mount.lustre Tainted: P      D W  ---------------    2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1
Call Trace:
[<ffffffff811276cf>] ? __alloc_pages_nodemask+0x77f/0x940
[<ffffffff81161e92>] ? kmem_getpages+0x62/0x170
[<ffffffff81162aaa>] ? fallback_alloc+0x1ba/0x270
[<ffffffff811624ff>] ? cache_grow+0x2cf/0x320
[<ffffffff81162829>] ? ____cache_alloc_node+0x99/0x160
[<ffffffffa10116c1>] ? cfs_cpt_malloc+0x31/0x60 [libcfs]
[<ffffffff811636ef>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
[<ffffffff8116392b>] ? __kmalloc_node+0x7b/0x100
[<ffffffffa10116c1>] ? cfs_cpt_malloc+0x31/0x60 [libcfs]
[<ffffffffa0a54f88>] ? ptlrpc_alloc_rqbd+0x1e8/0x6d0 [ptlrpc]
[<ffffffffa0a55555>] ? ptlrpc_grow_req_bufs+0xe5/0x2a0 [ptlrpc]
[<ffffffffa0a55d25>] ? ptlrpc_register_service+0x615/0x17c0 [ptlrpc]
[<ffffffffa0cee1a5>] ? mgs_init0+0x1285/0x1760 [mgs]
[<ffffffffa0a9bb90>] ? tgt_request_handle+0x0/0xe40 [ptlrpc]
[<ffffffffa0a6b610>] ? target_print_req+0x0/0xa0 [ptlrpc]
[<ffffffffa0ce74e9>] ? mgs_type_start+0x19/0x20 [mgs]
[<ffffffffa0cee78f>] ? mgs_device_alloc+0x10f/0x260 [mgs]
[<ffffffffa0901a2f>] ? obd_setup+0x1bf/0x290 [obdclass]
[<ffffffffa0901d08>] ? class_setup+0x208/0x870 [obdclass]
[<ffffffffa090954c>] ? class_process_config+0xc6c/0x1ad0 [obdclass]
[<ffffffffa090e3d3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
[<ffffffffa090e929>] ? do_lcfg+0x149/0x480 [obdclass]
[<ffffffffa090ecf4>] ? lustre_start_simple+0x94/0x200 [obdclass]
[<ffffffffa0948479>] ? server_fill_super+0x1159/0x19ea [obdclass]
[<ffffffffa09148f8>] ? lustre_fill_super+0x1d8/0x530 [obdclass]
[<ffffffffa0914720>] ? lustre_fill_super+0x0/0x530 [obdclass]
[<ffffffff8117e16f>] ? get_sb_nodev+0x5f/0xa0
[<ffffffffa090c425>] ? lustre_get_sb+0x25/0x30 [obdclass]
[<ffffffff8117ddcb>] ? vfs_kern_mount+0x7b/0x1b0
[<ffffffff8117df72>] ? do_kern_mount+0x52/0x130
[<ffffffff8119c652>] ? do_mount+0x2d2/0x8d0
[<ffffffff8119cce0>] ? sys_mount+0x90/0xe0

LustreError: 6512:0:(service.c:156:ptlrpc_grow_req_bufs()) mgs: Can't allocate request buffer
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa0a8ac5c>] ptlrpc_service_nrs_cleanup+0xec/0x440 [ptlrpc]
PGD 1b078067 PUD 20d38067 PMD 0 
Pid: 6512, comm: mount.lustre Tainted: P      D W  ---------------    2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1 Dell Inc.                 Dell DXP051                  /0FJ030
RIP: 0010:[<ffffffffa0a8ac5c>]  [<ffffffffa0a8ac5c>] ptlrpc_service_nrs_cleanup+0xec/0x440 [ptlrpc]
RSP: 0018:ffff88001fc536c8  EFLAGS: 00010217
RAX: 0000000000000000 RBX: ffff8800709834e0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0b29640
RBP: ffff88001fc53708 R08: 0000000000000002 R09: 0000000000000000
R10: ffff8800244cc000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8800adc70cc0 R14: ffff880070983618 R15: ffff8800709834e8
FS:  00007fb3066b0700(0000) GS:ffff880002280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000053c91000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Process mount.lustre (pid: 6512, threadinfo ffff88001fc52000, task ffff880017014080)
Stack:
ffff880070983400 00ff880017014080 ffff88001fc53708 ffff8800adc70cc0
<d> ffff880070983400 ffff880070983448 ffff880070983618 ffff880017014080
<d> ffff88001fc537b8 ffffffffa0a52583 ffff88001fc53728 ffff8800adc70cc0
Call Trace:
[<ffffffffa0a52583>] ptlrpc_unregister_service+0x673/0xff0 [ptlrpc]
[<ffffffffa0a556a1>] ? ptlrpc_grow_req_bufs+0x231/0x2a0 [ptlrpc]
[<ffffffffa0a55ee2>] ptlrpc_register_service+0x7d2/0x17c0 [ptlrpc]
[<ffffffffa0cee1a5>] mgs_init0+0x1285/0x1760 [mgs]
[rest of the stack is the same as above]

This resolves to:

(gdb) list *(ptlrpc_service_nrs_cleanup+0xec)
0x90c8c is in ptlrpc_service_nrs_cleanup_locked (/usr/src/lustre-head/lustre/ptlrpc/nrs.c:1030).
1025
1026    again:
1027            nrs = nrs_svcpt2nrs(svcpt, hp);
1028            nrs->nrs_stopping = 1;
1029
1030            cfs_list_for_each_entry_safe(policy, tmp, &nrs->nrs_policy_list,
1031                                         pol_list) {
1032                    rc = nrs_policy_unregister(nrs, policy->pol_desc->pd_name);
1033                    LASSERT(rc == 0);
1034            }

It looks like nrs_policy_list has not been initialized by the time this cleanup is called. The cleanup path needs to check whether this struct was ever set up, and so whether it needs to be cleaned up at all.
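To make the idea of such a check concrete, here is a minimal userspace sketch. It is not the Lustre code and not the fix that landed for LU-3772. The names nrs_sketch, nrs_sketch_initialized() and nrs_sketch_cleanup() are hypothetical, and the sketch assumes the struct is zero-filled at allocation, so that an uninitialized list head has NULL pointers. That assumption is consistent with the NULL dereference in the oops, but it is not confirmed in this ticket.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Append a node at the tail of an initialized list. */
void list_add_tail_sketch(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical, heavily simplified stand-in for struct ptlrpc_nrs. */
struct nrs_sketch {
	struct list_head nrs_policy_list;
	int nrs_stopping;
};

/*
 * Hypothetical check (assumption: the struct was zero-filled when
 * allocated). A list head that never went through initialization
 * still has NULL next/prev pointers and must not be walked.
 */
int nrs_sketch_initialized(const struct nrs_sketch *nrs)
{
	return nrs != NULL && nrs->nrs_policy_list.next != NULL;
}

/*
 * Returns the number of entries unlinked, or -1 if the struct was
 * never set up and the cleanup was skipped.
 */
int nrs_sketch_cleanup(struct nrs_sketch *nrs)
{
	struct list_head *pos, *tmp;
	int removed = 0;

	if (!nrs_sketch_initialized(nrs))
		return -1;

	nrs->nrs_stopping = 1;

	/* Safe iteration: remember the next node before unlinking. */
	for (pos = nrs->nrs_policy_list.next, tmp = pos->next;
	     pos != &nrs->nrs_policy_list;
	     pos = tmp, tmp = pos->next) {
		pos->prev->next = pos->next;
		pos->next->prev = pos->prev;
		removed++;
	}

	/* Mirrors the LASSERT() in the original loop: list is now empty. */
	assert(nrs->nrs_policy_list.next == &nrs->nrs_policy_list);
	return removed;
}
```

With this guard, a cleanup call on a zero-filled struct returns early instead of walking a NULL list head. Whether the real code should test the list head, a separate flag, or the order of setup steps is a design decision for the actual fix.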



 Comments   
Comment by Niu Yawei (Inactive) [ 11/Nov/13 ]

Looks like a duplicate of LU-3772.

Comment by Andreas Dilger [ 14/Nov/13 ]

Duplicate of LU-3772.

Comment by Andreas Dilger [ 13/Feb/14 ]

The allocation failure shows mode:0x40 == __GFP_IO, i.e. __GFP_WAIT is missing from the allocation mask, which is the problem tracked in LU-4357.
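For reference, the mode value can be decoded with a small userspace helper like the sketch below. Only 0x40 == __GFP_IO is stated in this ticket; the other bit values are an assumption taken from the 2.6.32-era include/linux/gfp.h and differ in later kernels. The helper names gfp_decode() and gfp_can_wait() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

struct gfp_bit {
	unsigned int mask;
	const char *name;
};

/*
 * Assumed 2.6.32-era values. Only 0x40 == __GFP_IO is confirmed by
 * the ticket itself; check the others against the kernel in use.
 */
static const struct gfp_bit gfp_bits[] = {
	{ 0x10u, "__GFP_WAIT" },
	{ 0x20u, "__GFP_HIGH" },
	{ 0x40u, "__GFP_IO" },
	{ 0x80u, "__GFP_FS" },
};

/* Write the known flag names set in mode into buf, joined by '|'. */
const char *gfp_decode(unsigned int mode, char *buf, size_t len)
{
	size_t i, used = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (i = 0; i < sizeof(gfp_bits) / sizeof(gfp_bits[0]); i++) {
		int n;

		if (!(mode & gfp_bits[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", gfp_bits[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			/* Out of space: drop the partial name and stop. */
			buf[used] = '\0';
			break;
		}
		used += (size_t)n;
	}
	return buf;
}

/* Nonzero if the allocation is allowed to sleep (assumed bit 0x10). */
int gfp_can_wait(unsigned int mode)
{
	return (mode & 0x10u) != 0;
}
```

Under these assumed values, 0x40 decodes to __GFP_IO alone, so the allocation could not sleep to reclaim memory, which fits a failure under memory pressure.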
