Lustre / LU-12152

lnetctl export corrupts memory on routers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix versions: Lustre 2.13.0, Lustre 2.12.3
    • Affects versions: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0

    Description

      Reproducer:

      [root@cent75build01 ~]# cat /etc/modprobe.d/lnet.conf
      options lnet networks=tcp
      options lnet forwarding=enabled
      [root@cent75build01 ~]# modprobe lnet
      [root@cent75build01 ~]# lctl net up
      LNET configured
      [root@cent75build01 ~]# while true; do lnetctl export > /dev/null; echo "still alive"; done
      still alive
      still alive
      still alive
      still alive
      still alive
      still alive
      still alive
      Write failed: Broken pipe
      [root@control01 ~]# ssh cent75build01
      Last login: Tue Apr  2 22:01:13 2019 from 192.168.1.10
      [root@cent75build01 ~]# cd /var/crash/127.0.0.1-2019-04-02-22:02:31
      [root@cent75build01 127.0.0.1-2019-04-02-22:02:31]# tail --lines 36 vmcore-dmesg.txt
      [  156.209529] BUG: unable to handle kernel paging request at 0000007a00000002
      [  156.209598] IP: [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
      [  156.209648] PGD 800000081d0c4067 PUD 0
      [  156.209672] Oops: 0000 [#1] SMP
      [  156.209695] Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) ptp pps_core mlx4_ib(OE) ib_core(OE) mlx4_core(OE) mlx_compat(OE) devlink sb_edac coretemp iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon pcspkr joydev sg vmw_vmci parport_pc parport shpchp i2c_piix4 binfmt_misc ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom crc_t10dif crct10dif_generic ata_generic pata_acpi vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix mptspi drm scsi_transport_spi crct10dif_pclmul crct10dif_common mptscsih crc32c_intel libata serio_raw mptbase vmxnet3
      [  156.210184]  i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod
      [  156.210239] CPU: 25 PID: 2169 Comm: lnetctl Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
      [  156.210289] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
      [  156.210335] task: ffff97c376ba0fd0 ti: ffff97c377424000 task.ti: ffff97c377424000
      [  156.210368] RIP: 0010:[<ffffffffa53fae34>]  [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
      [  156.210409] RSP: 0018:ffff97c377427d10  EFLAGS: 00010282
      [  156.210433] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000004993
      [  156.210465] RDX: 0000000000004992 RSI: 00000000000080d0 RDI: ffff97c0bfc03700
      [  156.210495] RBP: ffff97c377427d40 R08: 000000000001bb00 R09: ffffffffa54217ec
      [  156.210526] R10: 8080808080808080 R11: 0000000000000000 R12: 0000007a00000002
      [  156.210557] R13: 00000000000080d0 R14: ffff97c0bfc03700 R15: ffff97c0bfc03700
      [  156.210588] FS:  0000000000000000(0000) GS:ffff97c77fc40000(0000) knlGS:0000000000000000
      [  156.210623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  156.210649] CR2: 0000007a00000002 CR3: 00000008322c6000 CR4: 00000000000407e0
      [  156.210728] Call Trace:
      [  156.210756]  [<ffffffffa54217ec>] ? get_empty_filp+0x5c/0x1a0
      [  156.210786]  [<ffffffffa54217ec>] get_empty_filp+0x5c/0x1a0
      [  156.210817]  [<ffffffffa543019d>] path_openat+0x4d/0x640
      [  156.210846]  [<ffffffffa53c8544>] ? handle_pte_fault+0x2f4/0xd10
      [  156.211731]  [<ffffffffa5431dbd>] do_filp_open+0x4d/0xb0
      [  156.212562]  [<ffffffffa53caefd>] ? handle_mm_fault+0x39d/0x9b0
      [  156.213420]  [<ffffffffa543f167>] ? __alloc_fd+0x47/0x170
      [  156.214258]  [<ffffffffa541e0d7>] do_sys_open+0x137/0x240
      [  156.215070]  [<ffffffffa59256d5>] ? system_call_after_swapgs+0xa2/0x146
      [  156.215865]  [<ffffffffa541e1fe>] SyS_open+0x1e/0x20
      [  156.216661]  [<ffffffffa592579b>] system_call_fastpath+0x22/0x27
      [  156.217457]  [<ffffffffa59256e1>] ? system_call_after_swapgs+0xae/0x146
      [  156.218236] Code: 63 c1 5a 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 28 01 00 00 48 85 c0 0f 84 1f 01 00 00 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 ba 49 63
      [  156.220699] RIP  [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
      [  156.221468]  RSP <ffff97c377427d10>
      [  156.222214] CR2: 0000007a00000002
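
When triaging a crash directory like the one above, it can help to pull just the fault address and the faulting function out of vmcore-dmesg.txt. The following is a minimal sketch, not part of Lustre or kdump: `summarize_oops` is a hypothetical helper name, and it assumes the el7-style "BUG: unable to handle kernel paging request at ..." and "RIP  [<...>] symbol+off/len" line formats seen in this log.

```shell
#!/bin/sh
# summarize_oops FILE
# Hypothetical helper: print the fault address and the RIP symbol found in a
# vmcore-dmesg.txt. Assumes the oops line formats shown in the log above.
summarize_oops() {
    file=$1
    # First "BUG: unable to handle kernel paging request at <addr>" line.
    addr=$(sed -n 's/.*BUG: unable to handle kernel paging request at \([0-9a-f]*\).*/\1/p' "$file" | head -n 1)
    # Trailing "RIP  [<addr>] symbol+off/len" line; keep only symbol+off/len.
    rip=$(sed -n 's/.*\] RIP  \[<[0-9a-f]*>\] \(.*\)$/\1/p' "$file" | head -n 1)
    # Report failure if no paging-request oops was found.
    [ -n "$addr" ] || return 1
    echo "fault_addr=$addr rip=$rip"
}
```

It would be called as `summarize_oops /var/crash/<dump-dir>/vmcore-dmesg.txt`, where `<dump-dir>` is whichever kdump directory is being inspected.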
      

      The problem is in lustre_lnet_show_routing(). I verified this by applying the following patch, which removes that call from jt_export():

      diff --git a/lnet/utils/lnetctl.c b/lnet/utils/lnetctl.c
      index c503223..62d34bb 100644
      --- a/lnet/utils/lnetctl.c
      +++ b/lnet/utils/lnetctl.c
      @@ -1550,13 +1550,6 @@ static int jt_export(int argc, char **argv)
                      err_rc = NULL;
              }
      
      -       rc = lustre_lnet_show_routing(-1, &show_rc, &err_rc, backup);
      -       if (rc != LUSTRE_CFG_RC_NO_ERR) {
      -               cYAML_print_tree2file(stderr, err_rc);
      -               cYAML_free_tree(err_rc);
      -               err_rc = NULL;
      -       }
      -
              rc = lustre_lnet_show_peer(NULL, 2, -1, &show_rc, &err_rc, backup);
              if (rc != LUSTRE_CFG_RC_NO_ERR) {
                      cYAML_print_tree2file(stderr, err_rc);
      

      With that patch applied, the node does not crash.

      I checked master, Lustre 2.12.0, and Lustre 2.11.0; the problem exists in all of those versions.
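
To compare patched and unpatched builds without an endless loop, the reproducer can be bounded to a fixed number of iterations. This is a rough sketch under stated assumptions: `run_bounded` is a hypothetical wrapper, not an lnetctl feature, and the iteration count is arbitrary. On an affected node the machine may crash before the loop reports anything, as in the session above.

```shell
#!/bin/sh
# run_bounded MAX CMD [ARGS...]
# Hypothetical wrapper around the reproducer loop: run CMD up to MAX times and
# stop at the first non-zero exit status.
run_bounded() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        # Discard the command's stdout, as the original loop does.
        "$@" > /dev/null || { echo "command failed on iteration $((i + 1))"; return 1; }
        i=$((i + 1))
        echo "still alive ($i/$max)"
    done
}
```

For example, `run_bounded 100 lnetctl export` would run the export 100 times on a router configured as in the reproducer and return once all iterations complete.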

          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: hornc Chris Horn