Lustre / LU-7852

Missing lnetctl command in any recent daily-built package


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Environment: CentOS 7.2
    • Severity: 3

    Description

      This is easy to reproduce on CentOS 7.2; just follow these steps.
      1. Install all packages needed to compile Lustre except libyaml-devel.
      2. Build the Lustre-patched Linux kernel; my version is 3.10.0-327.el7.x86_64. You can refer to the old manual, Chapter 30 "Installing a Lustre File System from Source Code", although it contains some mistakes.
      3. git clone the latest lustre-release code from git://git.hpdd.intel.com/fs/lustre-release.git
      4. Run 'configure' in the root folder of the lustre-release tree with the path of the patched kernel source. It may look like the following:
      ./configure --with-linux=/root/rpmbuild/BUILD/kernel-3.10.0_327.el7_lustre.x86_64/ --with-o2ib=no
      5. Run 'make' in the root folder of the lustre-release tree.
      6. Once the build finishes, go to the folder lustre-release/lnet/utils and you will see that 'lnetctl' does not exist, while other tools such as 'lst' do.
      7. Check the Makefile under that folder; you will find that line 123 has been commented out with '#':
      line 121: sbin_PROGRAMS = routerstat$(EXEEXT) lst$(EXEEXT) \
      line 122: $(am_EXEEXT_1) $(am_EXEEXT_2)
      line 123: #am__append_1 = lnetctl
      line 124: am__append_2 = wirecheck
      line 125: subdir = lnet/utils
      Line 176 is likewise commented out: #am__EXEEXT_1 = lnetctl$(EXEEXT)
      Also, in the folder lustre-release/lnet/utils/lnetconfig there is no 'lnetconfig.la', which should exist because lnetctl needs it to build. (A quick way to confirm the missing dependency is sketched below.)
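      A quick way to confirm the missing dependency (a sketch; the exact configure test string may differ between Lustre versions):

      # From the root of the lustre-release tree, after './configure' has run.
      # Any yaml-related "no" result means configure silently skipped lnetctl.
      grep -i yaml config.log | head
      rpm -q libyaml-devel    # reports "package libyaml-devel is not installed"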

      Running 'configure' and 'make' additional times does not help. Now install libyaml-devel, either with yum or from source; I used yum, simply running: sudo yum install libyaml-devel.
      Run 'configure' and 'make' again, and you will find that 'lnetctl' has been successfully compiled and placed in lustre-release/lnet/utils. You can also run './lnetctl' in that folder and see that it seems to work.
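      A minimal sketch of the working rebuild sequence (same configure flags as above):

      sudo yum install -y libyaml-devel
      # Re-running configure regenerates the Makefiles with lnetctl enabled.
      ./configure --with-linux=/root/rpmbuild/BUILD/kernel-3.10.0_327.el7_lustre.x86_64/ --with-o2ib=no
      make
      ls -l lnet/utils/lnetctl    # the binary should now be present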

      Since lnetctl is an important tool for configuring LNET, and all of Chapter 9 of the manual describes how to use it to configure the network, I think that if the tool is dropped because a package is missing, there should be a message saying which package is lacking. Instead, 'configure' and 'make' run successfully without any message yet do not produce 'lnetctl'. That Makefile must be auto-generated by running 'configure', so I hope the developers can check this problem. Worse, none of the recent daily-built packages at https://build.hpdd.intel.com/job/lustre-master/ , such as '#3330', contain 'lnetctl'; you can just download, install and check. (Most likely you will get "bash: lnetctl: command not found".)

      I did not check whether 'configure' produces a warning about the missing 'libyaml-devel', but I think it should produce an error rather than a warning, because the llmount.sh script may need lnetctl in some situations.
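      For illustration only (this is a plain-shell stand-in, not the actual Lustre configure code), the dependency check could fail hard instead of silently dropping lnetctl:

      # Sketch: abort when the compiler cannot find the libyaml header.
      echo '#include <yaml.h>' | gcc -E - >/dev/null 2>&1 || {
          echo "configure: error: yaml.h not found; install libyaml-devel" >&2
          exit 1
      }

      You can simply reproduce the situation as follows on VirtualBox.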

      1. Create a virtual machine with two network interface cards, the first set to NAT networking and the second set to Host-Only networking.
      2. Install CentOS 7.2 on it.
      3. Run "ip addr"; you should see something like the output below:

      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
      ...
      inet 127.0.0.1/8 scope host lo
      ...
      2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
      ...
      inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
      ...
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
      ...
      inet 192.168.56.101/24 brd 192.168.56.255 scope global dynamic enp0s8
      ...

      As the output shows, the first NIC is on the NAT network and the second is on the host-only network.
      Now set /etc/hostname to a name of your choice (such as "node1") and add a line to /etc/hosts mapping the host-only IP address to that hostname. You may need to reboot the machine for the change to take effect.
      After modifying these two files, running 'cat' on them should show something like the following (a scripted version of these edits is sketched after the transcript):

      [eteced@node1 ~]$ cat /etc/hostname
      node1
      [eteced@node1 ~]$ cat /etc/hosts
      127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
      ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
      192.168.56.101 node1
      [eteced@node1 ~]$
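      A minimal sketch of the same edits from the shell (the values are the ones used above; adjust for your machine):

      sudo hostnamectl set-hostname node1                    # writes /etc/hostname
      echo '192.168.56.101 node1' | sudo tee -a /etc/hosts   # map the host-only IP to the name
      sudo reboot                                            # ensure everything picks up the change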

      4. Download the latest build rpms (#3330) from https://build.hpdd.intel.com/job/lustre-master/ and install them. (You may need to reboot to use the newly installed kernel.)
      5. Simply run "llmount.sh" from /lib64/lustre/tests; it fails as shown below:

      [root@node1 eteced]# /lib64/lustre/tests/llmount.sh
      Stopping clients: node1 /mnt/lustre (opts
      Stopping clients: node1 /mnt/lustre2 (opts
      Loading modules from /lib64/lustre/tests/..
      detected 1 online CPUs by sysfs
      libcfs will create CPU partition based on online CPUs
      debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck
      subsystem_debug=all -lnet -lnd -pinger
      quota/lquota options: 'hash_lqs_cur_bits=3'
      Formatting mgs, mds, osts
      Format mds1: /tmp/lustre-mdt1
      Format ost1: /tmp/lustre-ost1
      Format ost2: /tmp/lustre-ost2
      Checking servers environments
      Checking clients node1 environments
      Loading modules from /lib64/lustre/tests/..
      detected 1 online CPUs by sysfs
      libcfs will create CPU partition based on online CPUs
      debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck
      subsystem_debug=all -lnet -lnd -pinger
      Setup mgs, mdt, osts
      Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
      Started lustre-MDT0000
      Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1
      mount.lustre: mount /dev/loop1 at /mnt/ost1 failed: Connection timed out

      Then run 'dmesg'; it shows:

      ...
      [ 134.960367] LNetError: 120-3: Refusing connection from 192.168.56.101 for 192.168.56.101@tcp: No matching NI
      [ 134.960666] LNetError: 10438:0:(socklnd_cb.c:1723:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.56.101
      [ 134.961040] LNetError: 11b-b: Connection to 192.168.56.101@tcp at host 192.168.56.101 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.56.101@tcp one of its NIDs?
      [ 139.960163] Lustre: 10446:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1457156893/real 1457156893] req@ffff88020433a600 x1527939743088740/t0(0) o250->MGC192.168.56.101@tcp@192.168.56.101@tcp:26/25 lens 520/544 e 0 to 1 dl 1457156898 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 139.960500] Lustre: lustre-MDT0000: Connection restored to 10.0.2.15@tcp (at 0@lo)
      [ 139.960684] LNetError: 120-3: Refusing connection from 192.168.56.101 for 192.168.56.101@tcp: No matching NI
      [ 139.961892] LNetError: 10439:0:(socklnd_cb.c:1723:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.56.101
      [ 139.962902] LNetError: 11b-b: Connection to 192.168.56.101@tcp at host 192.168.56.101 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.56.101@tcp one of its NIDs?
      [ 144.971200] LustreError: 15f-b: lustre-OST0000: cannot register this server with the MGS: rc = -110. Is the MGS running?
      [ 144.972325] LustreError: 11686:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -110
      [ 144.974060] LustreError: 11686:0:(obd_mount_server.c:1512:server_put_super()) no obd lustre-OST0000
      [ 144.974866] LustreError: 11686:0:(obd_mount_server.c:140:server_deregister_mount()) lustre-OST0000 not registered
      [ 145.011302] Lustre: server umount lustre-OST0000 complete
      [ 145.011302] LustreError: 11686:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount (-110)

      Since there is no 'lnetctl' command-line tool, you have to work around this by adding a conf file under /etc/modprobe.d/ with the line "options lnet networks=tcp0(enp0s8)", as sketched below, and then running 'llmountcleanup.sh' before running 'llmount.sh' again, as in the transcript that follows:
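      A minimal sketch of the workaround (the conf file name is my own choice; any *.conf file under /etc/modprobe.d/ works):

      # Pin LNET to the host-only interface so the local NID matches /etc/hosts.
      echo 'options lnet networks=tcp0(enp0s8)' | sudo tee /etc/modprobe.d/lustre.conf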

      [root@node1 eteced]# /lib64/lustre/tests/llmountcleanup.sh
      Stopping clients: node1 /mnt/lustre (opts:-f)
      Stopping clients: node1 /mnt/lustre2 (opts:-f)
      Stopping /mnt/mds1 (opts:-f) on node1
      modules unloaded.
      [root@node1 eteced]# /lib64/lustre/tests/llmount.sh
      Stopping clients: node1 /mnt/lustre (opts
      Stopping clients: node1 /mnt/lustre2 (opts
      Loading modules from /lib64/lustre/tests/..
      detected 1 online CPUs by sysfs
      libcfs will create CPU partition based on online CPUs
      debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck
      subsystem_debug=all -lnet -lnd -pinger
      quota/lquota options: 'hash_lqs_cur_bits=3'
      Formatting mgs, mds, osts
      Format mds1: /tmp/lustre-mdt1
      Format ost1: /tmp/lustre-ost1
      Format ost2: /tmp/lustre-ost2
      Checking servers environments
      Checking clients node1 environments
      Loading modules from /lib64/lustre/tests/..
      detected 1 online CPUs by sysfs
      libcfs will create CPU partition based on online CPUs
      debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck
      subsystem_debug=all -lnet -lnd -pinger
      Setup mgs, mdt, osts
      Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
      Started lustre-MDT0000
      Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1
      Started lustre-OST0000
      Starting ost2: -o loop /tmp/lustre-ost2 /mnt/ost2
      Started lustre-OST0001
      Starting client: node1: -o user_xattr,flock node1@tcp:/lustre /mnt/lustre
      Using TIMEOUT=20
      seting jobstats to procname_uid
      Setting lustre.sys.jobid_var from disable to procname_uid
      Waiting 90 secs for update
      Updated after 3s: wanted 'procname_uid' got 'procname_uid'
      disable quota as required
      [root@node1 eteced]#

      Now "llmount.sh" appears to run successfully.
      Although these steps were performed on a virtual machine, I think the key to triggering the bug is having two network cards with the hostname resolving to the second rather than the first (via /etc/hosts or some other name-resolution setting).
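      For reference, a sketch of the equivalent configuration with lnetctl once the packages ship it again (syntax as described in Chapter 9; the interface name is the one from this setup):

      lnetctl lnet configure                    # load and initialize LNET
      lnetctl net add --net tcp0 --if enp0s8    # bind tcp0 to the host-only NIC
      lnetctl net show                          # verify the NID is 192.168.56.101@tcp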


            People

              Assignee: Doug Oucharek (Inactive)
              Reporter: Yingdi Guo (eteced)
