Lustre / LU-4229

crash in NRS cleanup during mount failure


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.6.0
    • Components: None
    • Environment: Lustre master v2_5_50_0-3-g6229525
      Single node test setup, 1 MDT, 3 OSTs, client
      RHEL6.3 2.6.32-279.5.1
    • Severity: 3
    • Rank: 11519

    Description

      A memory-intensive workload was running on the same node when the MDS was mounted. The mount failed an allocation during service setup and then oopsed in the subsequent cleanup.

      LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. quota=on. Opts: 
      mount.lustre: page allocation failure. order:1, mode:0x40
      Pid: 6512, comm: mount.lustre Tainted: P      D W  ---------------    2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1
      Call Trace:
      [<ffffffff811276cf>] ? __alloc_pages_nodemask+0x77f/0x940
      [<ffffffff81161e92>] ? kmem_getpages+0x62/0x170
      [<ffffffff81162aaa>] ? fallback_alloc+0x1ba/0x270
      [<ffffffff811624ff>] ? cache_grow+0x2cf/0x320
      [<ffffffff81162829>] ? ____cache_alloc_node+0x99/0x160
      [<ffffffffa10116c1>] ? cfs_cpt_malloc+0x31/0x60 [libcfs]
      [<ffffffff811636ef>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
      [<ffffffff8116392b>] ? __kmalloc_node+0x7b/0x100
      [<ffffffffa10116c1>] ? cfs_cpt_malloc+0x31/0x60 [libcfs]
      [<ffffffffa0a54f88>] ? ptlrpc_alloc_rqbd+0x1e8/0x6d0 [ptlrpc]
      [<ffffffffa0a55555>] ? ptlrpc_grow_req_bufs+0xe5/0x2a0 [ptlrpc]
      [<ffffffffa0a55d25>] ? ptlrpc_register_service+0x615/0x17c0 [ptlrpc]
      [<ffffffffa0cee1a5>] ? mgs_init0+0x1285/0x1760 [mgs]
      [<ffffffffa0a9bb90>] ? tgt_request_handle+0x0/0xe40 [ptlrpc]
      [<ffffffffa0a6b610>] ? target_print_req+0x0/0xa0 [ptlrpc]
      [<ffffffffa0ce74e9>] ? mgs_type_start+0x19/0x20 [mgs]
      [<ffffffffa0cee78f>] ? mgs_device_alloc+0x10f/0x260 [mgs]
      [<ffffffffa0901a2f>] ? obd_setup+0x1bf/0x290 [obdclass]
      [<ffffffffa0901d08>] ? class_setup+0x208/0x870 [obdclass]
      [<ffffffffa090954c>] ? class_process_config+0xc6c/0x1ad0 [obdclass]
      [<ffffffffa090e3d3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
      [<ffffffffa090e929>] ? do_lcfg+0x149/0x480 [obdclass]
      [<ffffffffa090ecf4>] ? lustre_start_simple+0x94/0x200 [obdclass]
      [<ffffffffa0948479>] ? server_fill_super+0x1159/0x19ea [obdclass]
      [<ffffffffa09148f8>] ? lustre_fill_super+0x1d8/0x530 [obdclass]
      [<ffffffffa0914720>] ? lustre_fill_super+0x0/0x530 [obdclass]
      [<ffffffff8117e16f>] ? get_sb_nodev+0x5f/0xa0
      [<ffffffffa090c425>] ? lustre_get_sb+0x25/0x30 [obdclass]
      [<ffffffff8117ddcb>] ? vfs_kern_mount+0x7b/0x1b0
      [<ffffffff8117df72>] ? do_kern_mount+0x52/0x130
      [<ffffffff8119c652>] ? do_mount+0x2d2/0x8d0
      [<ffffffff8119cce0>] ? sys_mount+0x90/0xe0
      
      LustreError: 6512:0:(service.c:156:ptlrpc_grow_req_bufs()) mgs: Can't allocate request buffer
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffffa0a8ac5c>] ptlrpc_service_nrs_cleanup+0xec/0x440 [ptlrpc]
      PGD 1b078067 PUD 20d38067 PMD 0 
      Pid: 6512, comm: mount.lustre Tainted: P      D W  ---------------    2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1 Dell Inc.                 Dell DXP051                  /0FJ030
      RIP: 0010:[<ffffffffa0a8ac5c>]  [<ffffffffa0a8ac5c>] ptlrpc_service_nrs_cleanup+0xec/0x440 [ptlrpc]
      RSP: 0018:ffff88001fc536c8  EFLAGS: 00010217
      RAX: 0000000000000000 RBX: ffff8800709834e0 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0b29640
      RBP: ffff88001fc53708 R08: 0000000000000002 R09: 0000000000000000
      R10: ffff8800244cc000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff8800adc70cc0 R14: ffff880070983618 R15: ffff8800709834e8
      FS:  00007fb3066b0700(0000) GS:ffff880002280000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 0000000053c91000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
      Process mount.lustre (pid: 6512, threadinfo ffff88001fc52000, task ffff880017014080)
      Stack:
      ffff880070983400 00ff880017014080 ffff88001fc53708 ffff8800adc70cc0
      <d> ffff880070983400 ffff880070983448 ffff880070983618 ffff880017014080
      <d> ffff88001fc537b8 ffffffffa0a52583 ffff88001fc53728 ffff8800adc70cc0
      Call Trace:
      [<ffffffffa0a52583>] ptlrpc_unregister_service+0x673/0xff0 [ptlrpc]
      [<ffffffffa0a556a1>] ? ptlrpc_grow_req_bufs+0x231/0x2a0 [ptlrpc]
      [<ffffffffa0a55ee2>] ptlrpc_register_service+0x7d2/0x17c0 [ptlrpc]
      [<ffffffffa0cee1a5>] mgs_init0+0x1285/0x1760 [mgs]
      [rest of the stack is the same as above]
      

      This resolves to:

      (gdb) list *(ptlrpc_service_nrs_cleanup+0xec)
      0x90c8c is in ptlrpc_service_nrs_cleanup_locked (/usr/src/lustre-head/lustre/ptlrpc/nrs.c:1030).
      1025
      1026    again:
      1027            nrs = nrs_svcpt2nrs(svcpt, hp);
      1028            nrs->nrs_stopping = 1;
      1029
      1030            cfs_list_for_each_entry_safe(policy, tmp, &nrs->nrs_policy_list,
      1031                                         pol_list) {
      1032                    rc = nrs_policy_unregister(nrs, policy->pol_desc->pd_name);
      1033                    LASSERT(rc == 0);
      1034            }
      

      It looks like nrs_policy_list isn't initialized by the time this cleanup is called. We need to check whether the NRS state was ever set up before attempting to tear it down.

            People

              Assignee: wc-triage (WC Triage)
              Reporter: adilger (Andreas Dilger)
              Votes: 0
              Watchers: 4
