lnetctl export corrupts memory on routers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Reproducer:

      [root@cent75build01 ~]# cat /etc/modprobe.d/lnet.conf
      options lnet networks=tcp
      options lnet forwarding=enabled
      [root@cent75build01 ~]# modprobe lnet
      [root@cent75build01 ~]# lctl net up
      LNET configured
      [root@cent75build01 ~]# while true; do lnetctl export > /dev/null; echo "still alive"; done
      still alive
      still alive
      still alive
      still alive
      still alive
      still alive
      still alive
      Write failed: Broken pipe
      [root@control01 ~]# ssh cent75build01
      Last login: Tue Apr  2 22:01:13 2019 from 192.168.1.10
      [root@cent75build01 ~]# cd /var/crash/127.0.0.1-2019-04-02-22:02:31
      [root@cent75build01 127.0.0.1-2019-04-02-22:02:31]# tail --lines 36 vmcore-dmesg.txt
      [  156.209529] BUG: unable to handle kernel paging request at 0000007a00000002
      [  156.209598] IP: [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
      [  156.209648] PGD 800000081d0c4067 PUD 0
      [  156.209672] Oops: 0000 [#1] SMP
      [  156.209695] Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) ptp pps_core mlx4_ib(OE) ib_core(OE) mlx4_core(OE) mlx_compat(OE) devlink sb_edac coretemp iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon pcspkr joydev sg vmw_vmci parport_pc parport shpchp i2c_piix4 binfmt_misc ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom crc_t10dif crct10dif_generic ata_generic pata_acpi vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix mptspi drm scsi_transport_spi crct10dif_pclmul crct10dif_common mptscsih crc32c_intel libata serio_raw mptbase vmxnet3
      [  156.210184]  i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod
      [  156.210239] CPU: 25 PID: 2169 Comm: lnetctl Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
      [  156.210289] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
      [  156.210335] task: ffff97c376ba0fd0 ti: ffff97c377424000 task.ti: ffff97c377424000
      [  156.210368] RIP: 0010:[<ffffffffa53fae34>]  [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
      [  156.210409] RSP: 0018:ffff97c377427d10  EFLAGS: 00010282
      [  156.210433] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000004993
      [  156.210465] RDX: 0000000000004992 RSI: 00000000000080d0 RDI: ffff97c0bfc03700
      [  156.210495] RBP: ffff97c377427d40 R08: 000000000001bb00 R09: ffffffffa54217ec
      [  156.210526] R10: 8080808080808080 R11: 0000000000000000 R12: 0000007a00000002
      [  156.210557] R13: 00000000000080d0 R14: ffff97c0bfc03700 R15: ffff97c0bfc03700
      [  156.210588] FS:  0000000000000000(0000) GS:ffff97c77fc40000(0000) knlGS:0000000000000000
      [  156.210623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  156.210649] CR2: 0000007a00000002 CR3: 00000008322c6000 CR4: 00000000000407e0
      [  156.210728] Call Trace:
      [  156.210756]  [<ffffffffa54217ec>] ? get_empty_filp+0x5c/0x1a0
      [  156.210786]  [<ffffffffa54217ec>] get_empty_filp+0x5c/0x1a0
      [  156.210817]  [<ffffffffa543019d>] path_openat+0x4d/0x640
      [  156.210846]  [<ffffffffa53c8544>] ? handle_pte_fault+0x2f4/0xd10
      [  156.211731]  [<ffffffffa5431dbd>] do_filp_open+0x4d/0xb0
      [  156.212562]  [<ffffffffa53caefd>] ? handle_mm_fault+0x39d/0x9b0
      [  156.213420]  [<ffffffffa543f167>] ? __alloc_fd+0x47/0x170
      [  156.214258]  [<ffffffffa541e0d7>] do_sys_open+0x137/0x240
      [  156.215070]  [<ffffffffa59256d5>] ? system_call_after_swapgs+0xa2/0x146
      [  156.215865]  [<ffffffffa541e1fe>] SyS_open+0x1e/0x20
      [  156.216661]  [<ffffffffa592579b>] system_call_fastpath+0x22/0x27
      [  156.217457]  [<ffffffffa59256e1>] ? system_call_after_swapgs+0xae/0x146
      [  156.218236] Code: 63 c1 5a 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 28 01 00 00 48 85 c0 0f 84 1f 01 00 00 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 ba 49 63
      [  156.220699] RIP  [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
      [  156.221468]  RSP <ffff97c377427d10>
      [  156.222214] CR2: 0000007a00000002
      

      The problem is in lustre_lnet_show_routing(). I verified this by applying the following patch:

      diff --git a/lnet/utils/lnetctl.c b/lnet/utils/lnetctl.c
      index c503223..62d34bb 100644
      --- a/lnet/utils/lnetctl.c
      +++ b/lnet/utils/lnetctl.c
      @@ -1550,13 +1550,6 @@ static int jt_export(int argc, char **argv)
                      err_rc = NULL;
              }
      
      -       rc = lustre_lnet_show_routing(-1, &show_rc, &err_rc, backup);
      -       if (rc != LUSTRE_CFG_RC_NO_ERR) {
      -               cYAML_print_tree2file(stderr, err_rc);
      -               cYAML_free_tree(err_rc);
      -               err_rc = NULL;
      -       }
      -
              rc = lustre_lnet_show_peer(NULL, 2, -1, &show_rc, &err_rc, backup);
              if (rc != LUSTRE_CFG_RC_NO_ERR) {
                      cYAML_print_tree2file(stderr, err_rc);
      

      With that patch applied, the node does not crash.

      I checked master, Lustre 2.12.0, and Lustre 2.11.0; the problem exists in all of those versions.

        Activity

          [LU-12152] lnetctl export corrupts memory on routers
          pjones Peter Jones made changes -
          Labels Original: LTS12
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.12.3 [ 14418 ]

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34922/
          Subject: LU-12152 lnet: Cleanup lnet_get_rtr_pool_cfg
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: b5cbe49a16b68ad60a8e7293d1b5450e0f97a430

          Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34922
          Subject: LU-12152 lnet: Cleanup lnet_get_rtr_pool_cfg
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 544877453fdf588de6e7c80a894cd541ccffd478

          pjones Peter Jones made changes -
          Labels New: LTS12
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.13.0 [ 14290 ]
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment -

          Landed for 2.13

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34591/
          Subject: LU-12152 lnet: Cleanup lnet_get_rtr_pool_cfg
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 187117fd94e4904c168de02fc439b41a1fcc3e48

          Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/34591
          Subject: LU-12152 lnet: Cleanup lnet_get_rtr_pool_cfg
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 51887ae69d8e59f2f4510b52c4e679b9b26e7165

          hornc Chris Horn added a comment -

          So, some clearer naming would have made this bug a lot more obvious. Reading the code carefully, I believe that the "idx" parameter to lnet_get_rtr_pool_cfg() is actually a CPT number. So what we actually want to do is make sure that "j" is equal to "idx" when we copy the buffer information.
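
A minimal user-space sketch of the pattern described in the comment above, not the actual Lustre source. Every name here (`NR_POOLS`, `NR_CPTS`, `struct pool_info`, `struct pool_cfg`, `pools_by_cpt`, `get_pool_cfg`) is hypothetical and exists only for illustration. It assumes the caller passes a CPT number, that the loop variable "j" walks the CPTs, and that the output structure holds a fixed number of pool slots, so copying only when "j" matches the requested CPT keeps every write inside the output array.

```c
#include <assert.h>
#include <stddef.h>

#define NR_POOLS 3	/* hypothetical: router buffer pools per CPT */
#define NR_CPTS  4	/* hypothetical: number of CPU partitions */

struct pool_info {
	int npages;
	int nbuffers;
};

/* Output structure with a fixed number of pool slots. */
struct pool_cfg {
	struct pool_info pools[NR_POOLS];
};

/* Stand-in for the per-CPT pools: pools_by_cpt[cpt][pool]. */
static struct pool_info pools_by_cpt[NR_CPTS][NR_POOLS];

/*
 * Copy the pool information for CPT number `cpt` into `out`.
 * Returns 0 on success, -1 if no CPT with that number exists.
 */
static int get_pool_cfg(int cpt, struct pool_cfg *out)
{
	int rc = -1;

	assert(out != NULL);

	for (int j = 0; j < NR_CPTS; j++) {
		/* Only copy when the CPT being visited is the one requested. */
		if (j != cpt)
			continue;

		/* The index into out->pools is bounded by NR_POOLS. */
		for (int i = 0; i < NR_POOLS; i++)
			out->pools[i] = pools_by_cpt[j][i];

		rc = 0;
		break;
	}

	return rc;
}
```

Naming the parameter `cpt` rather than `idx` makes the comparison against "j" self-explanatory, which is the readability point the comment raises.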

          People

            Assignee: Amir Shehata (Inactive)
            Reporter: Chris Horn
            Votes: 0
            Watchers: 3
