[LU-4229] crash in NRS cleanup during mount failure

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.6.0
    • None
    • Lustre master v2_5_50_0-3-g6229525
      Single node test setup, 1 MDT, 3 OST, client
      RHEL6.3 2.6.32-279.5.1
    • 3
    • 11519

    Description

      I was running a memory-intensive workload on the same node and then mounted the MDS. An allocation failed during setup, and the subsequent cleanup oopsed.

      LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. quota=on. Opts: 
      mount.lustre: page allocation failure. order:1, mode:0x40
      Pid: 6512, comm: mount.lustre Tainted: P      D W  ---------------    2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1
      Call Trace:
      [<ffffffff811276cf>] ? __alloc_pages_nodemask+0x77f/0x940
      [<ffffffff81161e92>] ? kmem_getpages+0x62/0x170
      [<ffffffff81162aaa>] ? fallback_alloc+0x1ba/0x270
      [<ffffffff811624ff>] ? cache_grow+0x2cf/0x320
      [<ffffffff81162829>] ? ____cache_alloc_node+0x99/0x160
      [<ffffffffa10116c1>] ? cfs_cpt_malloc+0x31/0x60 [libcfs]
      [<ffffffff811636ef>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
      [<ffffffff8116392b>] ? __kmalloc_node+0x7b/0x100
      [<ffffffffa10116c1>] ? cfs_cpt_malloc+0x31/0x60 [libcfs]
      [<ffffffffa0a54f88>] ? ptlrpc_alloc_rqbd+0x1e8/0x6d0 [ptlrpc]
      [<ffffffffa0a55555>] ? ptlrpc_grow_req_bufs+0xe5/0x2a0 [ptlrpc]
      [<ffffffffa0a55d25>] ? ptlrpc_register_service+0x615/0x17c0 [ptlrpc]
      [<ffffffffa0cee1a5>] ? mgs_init0+0x1285/0x1760 [mgs]
      [<ffffffffa0a9bb90>] ? tgt_request_handle+0x0/0xe40 [ptlrpc]
      [<ffffffffa0a6b610>] ? target_print_req+0x0/0xa0 [ptlrpc]
      [<ffffffffa0ce74e9>] ? mgs_type_start+0x19/0x20 [mgs]
      [<ffffffffa0cee78f>] ? mgs_device_alloc+0x10f/0x260 [mgs]
      [<ffffffffa0901a2f>] ? obd_setup+0x1bf/0x290 [obdclass]
      [<ffffffffa0901d08>] ? class_setup+0x208/0x870 [obdclass]
      [<ffffffffa090954c>] ? class_process_config+0xc6c/0x1ad0 [obdclass]
      [<ffffffffa090e3d3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
      [<ffffffffa090e929>] ? do_lcfg+0x149/0x480 [obdclass]
      [<ffffffffa090ecf4>] ? lustre_start_simple+0x94/0x200 [obdclass]
      [<ffffffffa0948479>] ? server_fill_super+0x1159/0x19ea [obdclass]
      [<ffffffffa09148f8>] ? lustre_fill_super+0x1d8/0x530 [obdclass]
      [<ffffffffa0914720>] ? lustre_fill_super+0x0/0x530 [obdclass]
      [<ffffffff8117e16f>] ? get_sb_nodev+0x5f/0xa0
      [<ffffffffa090c425>] ? lustre_get_sb+0x25/0x30 [obdclass]
      [<ffffffff8117ddcb>] ? vfs_kern_mount+0x7b/0x1b0
      [<ffffffff8117df72>] ? do_kern_mount+0x52/0x130
      [<ffffffff8119c652>] ? do_mount+0x2d2/0x8d0
      [<ffffffff8119cce0>] ? sys_mount+0x90/0xe0
      
      LustreError: 6512:0:(service.c:156:ptlrpc_grow_req_bufs()) mgs: Can't allocate request buffer
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffffa0a8ac5c>] ptlrpc_service_nrs_cleanup+0xec/0x440 [ptlrpc]
      PGD 1b078067 PUD 20d38067 PMD 0 
      Pid: 6512, comm: mount.lustre Tainted: P      D W  ---------------    2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1 Dell Inc.                 Dell DXP051                  /0FJ030
      RIP: 0010:[<ffffffffa0a8ac5c>]  [<ffffffffa0a8ac5c>] ptlrpc_service_nrs_cleanup+0xec/0x440 [ptlrpc]
      RSP: 0018:ffff88001fc536c8  EFLAGS: 00010217
      RAX: 0000000000000000 RBX: ffff8800709834e0 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0b29640
      RBP: ffff88001fc53708 R08: 0000000000000002 R09: 0000000000000000
      R10: ffff8800244cc000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff8800adc70cc0 R14: ffff880070983618 R15: ffff8800709834e8
      FS:  00007fb3066b0700(0000) GS:ffff880002280000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 0000000053c91000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
      Process mount.lustre (pid: 6512, threadinfo ffff88001fc52000, task ffff880017014080)
      Stack:
      ffff880070983400 00ff880017014080 ffff88001fc53708 ffff8800adc70cc0
      <d> ffff880070983400 ffff880070983448 ffff880070983618 ffff880017014080
      <d> ffff88001fc537b8 ffffffffa0a52583 ffff88001fc53728 ffff8800adc70cc0
      Call Trace:
      [<ffffffffa0a52583>] ptlrpc_unregister_service+0x673/0xff0 [ptlrpc]
      [<ffffffffa0a556a1>] ? ptlrpc_grow_req_bufs+0x231/0x2a0 [ptlrpc]
      [<ffffffffa0a55ee2>] ptlrpc_register_service+0x7d2/0x17c0 [ptlrpc]
      [<ffffffffa0cee1a5>] mgs_init0+0x1285/0x1760 [mgs]
      [rest of the stack is the same as above]
      

      This resolves to:

      (gdb) list *(ptlrpc_service_nrs_cleanup+0xec)
      0x90c8c is in ptlrpc_service_nrs_cleanup_locked (/usr/src/lustre-head/lustre/ptlrpc/nrs.c:1030).
      1025
      1026    again:
      1027            nrs = nrs_svcpt2nrs(svcpt, hp);
      1028            nrs->nrs_stopping = 1;
      1029
      1030            cfs_list_for_each_entry_safe(policy, tmp, &nrs->nrs_policy_list,
      1031                                         pol_list) {
      1032                    rc = nrs_policy_unregister(nrs, policy->pol_desc->pd_name);
      1033                    LASSERT(rc == 0);
      1034            }
      

      It looks like nrs_policy_list isn't initialized by the time this cleanup is called. The cleanup path needs to check whether this struct was ever set up, and so whether it needs to be cleaned up at all.
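
To illustrate the suspected failure mode, here is a minimal userspace sketch, not the Lustre code. It assumes the NRS head is zero-filled at allocation and only later passed through the list initializer. A zero-filled list head has a NULL `next` pointer, so a "safe" list walk dereferences NULL on its first step, whereas an initialized empty head points at itself and the loop body never runs. All `demo_` names are hypothetical stand-ins, and the NULL check is only one possible guard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Same idea as the kernel's INIT_LIST_HEAD(): an empty list points at itself. */
static void demo_init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Append an entry; only valid on a head that has been initialized. */
static void demo_list_add_tail(struct list_head *entry, struct list_head *head)
{
	assert(head->next != NULL && head->prev != NULL);
	entry->next = head;
	entry->prev = head->prev;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical stand-in for the NRS head; NOT the real struct ptlrpc_nrs. */
struct demo_nrs {
	int              stopping;
	struct list_head policy_list;
};

/*
 * Hypothetical cleanup with a guard. Returns the number of entries
 * unlinked, or -1 if the list head was never initialized and the walk
 * was skipped. Without the guard, the first "pos->next" below would
 * dereference NULL on a zero-filled head.
 */
static int demo_nrs_cleanup(struct demo_nrs *nrs)
{
	struct list_head *pos, *tmp;
	int unlinked = 0;

	if (nrs->policy_list.next == NULL)
		return -1;

	nrs->stopping = 1;

	/* Open-coded equivalent of a "for each entry, safe" walk. */
	for (pos = nrs->policy_list.next, tmp = pos->next;
	     pos != &nrs->policy_list;
	     pos = tmp, tmp = pos->next) {
		pos->prev->next = pos->next;
		pos->next->prev = pos->prev;
		unlinked++;
	}
	return unlinked;
}
```

With this guard, cleanup of a head that never reached initialization becomes a no-op instead of an oops. Whether the real fix belongs in the cleanup function or in the error path of `ptlrpc_register_service()` is not something this sketch settles.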

          Activity

            adilger Andreas Dilger added a comment - Shows mode:0x40 == __GFP_IO, but missing __GFP_WAIT from LU-4357.

            adilger Andreas Dilger added a comment - Duplicate of LU-3772.

            niu Niu Yawei (Inactive) added a comment - Looks like a duplicate of LU-3772.
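
The first comment's reading of `mode:0x40` can be checked with a small decoder. This is a sketch only: the issue itself confirms that 0x40 is `__GFP_IO`, while the other bit values below are assumptions based on a 2.6.32-era `gfp.h` and differ on newer kernels. The `DEMO_`/`demo_` names are hypothetical. An allocation without the wait bit cannot sleep to reclaim memory, which would make an order:1 request more likely to fail under memory pressure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed 2.6.32-era bit values; only 0x40 == __GFP_IO is stated in the issue. */
#define DEMO_GFP_WAIT 0x10u
#define DEMO_GFP_HIGH 0x20u
#define DEMO_GFP_IO   0x40u
#define DEMO_GFP_FS   0x80u

static const struct {
	unsigned int bit;
	const char  *name;
} demo_gfp_names[] = {
	{ DEMO_GFP_WAIT, "__GFP_WAIT" },
	{ DEMO_GFP_HIGH, "__GFP_HIGH" },
	{ DEMO_GFP_IO,   "__GFP_IO"   },
	{ DEMO_GFP_FS,   "__GFP_FS"   },
};

/* Write the names of the known bits set in mode into buf, joined by '|'. */
static char *demo_gfp_decode(unsigned int mode, char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(demo_gfp_names) / sizeof(demo_gfp_names[0]); i++) {
		if (!(mode & demo_gfp_names[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, demo_gfp_names[i].name, len - strlen(buf) - 1);
	}
	return buf;
}

/* Non-zero if the allocation is allowed to sleep while memory is reclaimed. */
static int demo_gfp_can_wait(unsigned int mode)
{
	return (mode & DEMO_GFP_WAIT) != 0;
}
```

Under these assumed values, decoding 0x40 gives `__GFP_IO` alone, while 0x50 gives `__GFP_WAIT|__GFP_IO`.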

            People

              wc-triage WC Triage
              adilger Andreas Dilger