
wrong cpt malloc rotor handling leads to oops

Details

    Description

      [  891.249374] BUG: unable to handle kernel paging request at 0000000100002007
      [  891.256366] IP: [<ffffffff847c0da7>] __alloc_pages_nodemask+0x97/0x420
      [  891.262918] PGD 1fb43dd067 PUD 0 
      [  891.266272] Oops: 0000 [#1] SMP 
      [  891.269539] Modules linked in: lnet(OE+) libcfs(OE) ext4 mbcache jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_multiport iptable_filter xt_CT nf_conntrack libcrc32c iptable_raw mst_pciconf(OE) mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_umad(OE) ib_ipoib(OE) ib_cm(OE) mlx5_ib(OE) zfs(POE) zunicode(POE) zlua(POE) edac_mce_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr zcommon(POE) znvpair(POE) ib_uverbs(OE) zavl(POE) icp(POE) ib_core(OE) spl(OE) mlx5_core(OE) dm_mod mlx_compat(OE) mlxfw devlink i2c_piix4 i2c_designware_platform i2c_designware_core pinctrl_amd acpi_cpufreq ip_tables nfsv3 nfs_acl nfs lockd grace fscache team_mode_activebackup team crct10dif_pclmul crct10dif_common crc32c_intel igb i2c_algo_bit dca ptp pps_core nvme nvme_core nfit libnvdimm sunrpc ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: libcfs]
      [  891.353833] CPU: 9 PID: 81842 Comm: modprobe Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-957.1.3957.1.3.x4.1.17.x86_64 #1
      [  891.366249] Hardware name: None None/None, BIOS 5.14 01/21/2020
      [  891.372167] task: ffff8c874fde9040 ti: ffff8c874dbb8000 task.ti: ffff8c874dbb8000
      [  891.379638] RIP: 0010:[<ffffffff847c0da7>]  [<ffffffff847c0da7>] __alloc_pages_nodemask+0x97/0x420
      [  891.388602] RSP: 0018:ffff8c874dbbba40  EFLAGS: 00010246
      [  891.393911] RAX: 0000000100001fff RBX: 0000000000000000 RCX: ffff8c874dbbbfd8
      [  891.401037] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000201250
      [  891.408169] RBP: ffff8c874dbbbae0 R08: ffff8c974da36600 R09: 0000000100400010
      [  891.415292] R10: ffff8c978ff54d40 R11: ffffffffffffff88 R12: 0000000000201250
      [  891.422418] R13: 0000000000000400 R14: 0000000000000002 R15: 0000000000000000
      [  891.429549] FS:  00007fe9f08bf740(0000) GS:ffff8c974ee40000(0000) knlGS:0000000000000000
      [  891.437629] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  891.443373] CR2: 0000000100002007 CR3: 0000001fd6a8c000 CR4: 0000000000340fe0
      [  891.450497] Call Trace:
      [  891.452944]  [<ffffffff8481a2e2>] ? deactivate_slab+0x122/0x3c0
      [  891.458860]  [<ffffffff84818d51>] new_slab+0x91/0x390
      [  891.463907]  [<ffffffff8481a9fc>] ___slab_alloc+0x3ac/0x4f0
      [  891.469481]  [<ffffffff8488efde>] ? ep_poll_callback+0xee/0x210
      [  891.475415]  [<ffffffffc0a71ee5>] ? cfs_percpt_alloc+0xf5/0x480 [libcfs]
      [  891.482116]  [<ffffffff846cba9b>] ? __wake_up_common+0x5b/0x90
      [  891.487947]  [<ffffffffc09f2000>] ? 0xffffffffc09f1fff
      [  891.493088]  [<ffffffff84982634>] ? pointer.isra.19+0xd4/0x4d0
      [  891.498921]  [<ffffffffc0a71ee5>] ? cfs_percpt_alloc+0xf5/0x480 [libcfs]
      [  891.505620]  [<ffffffff84d6060c>] __slab_alloc+0x40/0x5c
      [  891.510930]  [<ffffffff8481ebaf>] __kmalloc_node+0xbf/0x2b0
      [  891.516506]  [<ffffffffc0a71ee5>] cfs_percpt_alloc+0xf5/0x480 [libcfs]
      [  891.523030]  [<ffffffffc0a72ce0>] cfs_percpt_lock_create+0x90/0x3d0 [libcfs]
      [  891.530073]  [<ffffffffc09f2000>] ? 0xffffffffc09f1fff
      [  891.535210]  [<ffffffffc0b7328f>] lnet_lib_init+0xef/0x340 [lnet]
      [  891.541303]  [<ffffffffc09f2081>] lnet_init+0x81/0x1000 [lnet]
      [  891.547131]  [<ffffffff8460210a>] do_one_initcall+0xba/0x240
      [  891.552791]  [<ffffffff8471907c>] load_module+0x272c/0x2bc0
      [  891.558365]  [<ffffffff849a3480>] ? ddebug_proc_write+0x100/0x100
      [  891.564453]  [<ffffffff84714c03>] ? copy_module_from_fd.isra.44+0x53/0x150
      [  891.571318]  [<ffffffff847196f6>] SyS_finit_module+0xa6/0xd0
      [  891.576981]  [<ffffffff84d76ddb>] system_call_fastpath+0x22/0x27
      [  891.582983] Code: c1 eb 02 c1 e8 13 83 e3 02 83 e0 01 09 c3 44 23 25 7f a5 b9 00 48 c7 45 c0 00 00 00 00 41 f6 c4 10 0f 85 3d 02 00 00 48 8b 45 b0 <48> 83 78 08 00 0f 84 93 01 00 00 66 66 66 66 90 48 8b 45 b0 44 
      [  891.602977] RIP  [<ffffffff847c0da7>] __alloc_pages_nodemask+0x97/0x420
      [  891.609597]  RSP <ffff8c874dbbba40>
      [  891.613080] CR2: 0000000100002007
      

        Activity

          [LU-13294] wrong cpt malloc rotor handling leads to oops

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38049/
          Subject: LU-13294 libcfs: incorrect rotor behaviour
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: 4f704583cd561a7b6ce38c032188a6b23d9faf38

          gerrit Gerrit Updater added a comment -

          Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38049
          Subject: LU-13294 libcfs: incorrect rotor behaviour
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 82a2789b3ae34348c9409834b29d03a849a16efd

          pjones Peter Jones added a comment -

          Landed for 2.14


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37709/
          Subject: LU-13294 libcfs: incorrect rotor behaviour
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: f8aa86dd1622804d81020a7dbb1116f276b340f3

          panda Andrew Perepechko added a comment (edited) -

          An alternate description of the bug is given in the comments of the code below:

          int cfs_cpt_spread_node(struct cfs_cpt_table *cptab, int cpt)
          {
                  nodemask_t *mask;
                  int weight;
                  unsigned int rotor;
                  int node = 0;
          
                  /* convert CPU partition ID to HW node id */
          
                  if (cpt < 0 || cpt >= cptab->ctb_nparts) {
                          mask = cptab->ctb_nodemask;
                          rotor = cptab->ctb_spread_rotor++;
                  } else {
                          mask = cptab->ctb_parts[cpt].cpt_nodemask;
                          // rotor picks up the default cpt_spread_rotor value, which is -1
                          rotor = cptab->ctb_parts[cpt].cpt_spread_rotor++;
                          node  = cptab->ctb_parts[cpt].cpt_node;
                  }
          
                  // with more than one NUMA node in the mask, weight is greater than 1
                  weight = nodes_weight(*mask);
                  if (weight > 0) {
                          // -1 % weight is -1 for any weight greater than 1
                          rotor %= weight;
          
                          for_each_node_mask(node, *mask) {
                                  // this check never succeeds, so for_each_node_mask() runs to
                                  // completion and leaves node == 1024, which is then passed
                                  // to kmalloc_node()
                                  if (!rotor--)
                                          return node;
                          }
                  }
          
                  return node;
          }
          EXPORT_SYMBOL(cfs_cpt_spread_node);
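
          The arithmetic described in the comments above can be reproduced outside the kernel. The following is a minimal user-space sketch, not libcfs code and not the merged patch: sketch_mask, sketch_weight, sketch_two_nodes, spread_signed, spread_unsigned and SKETCH_MAX_NODES are all hypothetical names, and the mask is a plain byte array standing in for nodemask_t. It assumes the rotor is held in a signed int when the modulo is taken, which is what the -1 arithmetic in the comments implies. The second function runs the same loop with an unsigned rotor, which is one way to keep the remainder in range; this ticket does not say whether the merged change does exactly that.

```c
#include <limits.h>

/* Hypothetical stand-in for the kernel's node limit; 1024 matches the
 * out-of-range node value mentioned in the comments above. */
#define SKETCH_MAX_NODES 1024

/* Hypothetical stand-in for nodemask_t: one flag per node. */
struct sketch_mask {
	unsigned char set[SKETCH_MAX_NODES];
};

/* Count the nodes present in the mask (plays the role of nodes_weight()). */
static int sketch_weight(const struct sketch_mask *m)
{
	int n, weight = 0;

	for (n = 0; n < SKETCH_MAX_NODES; n++)
		weight += m->set[n] ? 1 : 0;
	return weight;
}

/* Build a mask containing nodes 0 and 1, i.e. a two-node NUMA system. */
static struct sketch_mask sketch_two_nodes(void)
{
	struct sketch_mask m = { { 0 } };

	m.set[0] = 1;
	m.set[1] = 1;
	return m;
}

/* Signed rotor (assumption): -1 % weight stays -1 for weight > 1, so the
 * "!rotor--" test never fires and the loop index runs off the end. */
static int spread_signed(const struct sketch_mask *m, int rotor)
{
	int weight = sketch_weight(m);
	int node = 0;

	if (weight > 0) {
		rotor %= weight;
		for (node = 0; node < SKETCH_MAX_NODES; node++) {
			if (!m->set[node])
				continue;
			if (!rotor--)
				return node;
		}
	}
	return node; /* SKETCH_MAX_NODES when nothing matched */
}

/* Unsigned rotor: the remainder is always in [0, weight), so one of the
 * nodes in the mask is always selected. */
static int spread_unsigned(const struct sketch_mask *m, unsigned int rotor)
{
	int weight = sketch_weight(m);
	int node = 0;

	if (weight > 0) {
		rotor %= (unsigned int)weight;
		for (node = 0; node < SKETCH_MAX_NODES; node++) {
			if (!m->set[node])
				continue;
			if (!rotor--)
				return node;
		}
	}
	return node;
}
```

          With the two-node mask, spread_signed(&m, -1) returns SKETCH_MAX_NODES (1024), while spread_unsigned(&m, (unsigned int)-1) returns node 1, because UINT_MAX is odd and its remainder modulo 2 is 1.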
          
          

          gerrit Gerrit Updater added a comment -

          Andrew Perepechko (c17827@cray.com) uploaded a new patch: https://review.whamcloud.com/37709
          Subject: LU-13294 libcfs: incorrect rotor behaviour
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 79896752f0ee74d85650e5fb1d201085bfd55d46

          People

            panda Andrew Perepechko