Lustre / LU-1326

Multihomed configuration with lustre 2.2.0


Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.2.0
    • None
    • Lustre 2.2.0 system
    • 3
    • 6413

    Description

      Hello,

      I'm trying to build a new multihomed (Infiniband and Ethernet) Lustre 2.2.0 system.

      Everything works fine as long as I'm using only one of the two networks, but I'm unable to use both at the same time
      (e.g. with Ethernet-only and Infiniband-only clients).

      My test setup looks like this:

      n-mds1: Combined MGS/MDT with Infiniband and Ethernet
      n-oss01: OSS - also with Infiniband and Ethernet
      a9115: Infiniband-only client
      a9116: Ethernet-only client

      The MDS has two NIDs, one IB and one Ethernet:

      [root@n-mds1 ~]# cat /etc/modprobe.d/lustre.conf
      options lnet networks=o2ib(ib0),tcp(eth0)

      [root@n-mds1 ~]# modprobe lustre
      [root@n-mds1 ~]# lctl list_nids
      10.201.62.13@o2ib
      10.201.30.13@tcp

      The MDS was set up like this:

      [root@n-mds1 ~]# mkfs.lustre --fsname=foobar --reformat --mdt --mgs --mgsnode=10.201.62.13@o2ib,10.201.30.13@tcp /dev/mapper/vd01

      # ..also tried without mgsnode and with --servicenode=10.201....

      [root@n-mds1 ~]# mount -t lustre /dev/mapper/vd01 /lustre/mds

      [root@n-mds1 ~]# lctl dl
      0 UP mgs MGS MGS 5
      1 UP mgc MGC10.201.62.13@o2ib 3aecba06-ec8b-aeab-2151-47d5a1c1bc47 5
      2 UP lov foobar-MDT0000-mdtlov foobar-MDT0000-mdtlov_UUID 4
      3 UP mdt foobar-MDT0000 foobar-MDT0000_UUID 3
      4 UP mds mdd_obd-foobar-MDT0000 mdd_obd_uuid-foobar-MDT0000 3

      The OSS also has two NIDs and is able to ping the MDS:

      [root@n-oss01 ~]# cat /etc/modprobe.d/lustre.conf
      options lnet networks=o2ib(ib0),tcp(eth0)
      [root@n-oss01 ~]# modprobe lustre
      [root@n-oss01 ~]# lctl list_nids
      10.201.62.31@o2ib
      10.201.30.31@tcp

      [root@n-oss01 ~]# lctl ping 10.201.62.13@o2ib # mds-ib
      12345-0@lo
      12345-10.201.62.13@o2ib
      12345-10.201.30.13@tcp
      [root@n-oss01 ~]# lctl ping 10.201.30.13@tcp # mds-eth
      12345-0@lo
      12345-10.201.62.13@o2ib
      12345-10.201.30.13@tcp

      The filesystem on the OSS was created via:

      [root@n-oss01 ~]# mkfs.lustre --reformat --fsname=foobar --ost --mgsnode=10.201.62.13@o2ib,10.201.30.13@tcp --index=0 /dev/mapper/vd01
      [root@n-oss01 ~]# mount -t lustre /dev/mapper/vd01 /lustre/vd01 && sleep 2 && lctl dl
      [root@n-oss01 ~]# lctl dl
      0 UP mgc MGC10.201.62.13@o2ib 7408e9c5-b92e-5423-fa52-497d0c540a43 5
      1 UP ost OSS OSS_uuid 3
      2 UP obdfilter foobar-OST0000 foobar-OST0000_UUID 5

      So the OSS seems to be happy, the MDS also looks fine:

      [root@n-mds1 ~]# lctl dl
      0 UP mgs MGS MGS 7
      1 UP mgc MGC10.201.62.13@o2ib 3aecba06-ec8b-aeab-2151-47d5a1c1bc47 5
      2 UP lov foobar-MDT0000-mdtlov foobar-MDT0000-mdtlov_UUID 4
      3 UP mdt foobar-MDT0000 foobar-MDT0000_UUID 3
      4 UP mds mdd_obd-foobar-MDT0000 mdd_obd_uuid-foobar-MDT0000 3
      5 UP osc foobar-OST0000-osc-MDT0000 foobar-MDT0000-mdtlov_UUID 5

      Mounting the filesystem on the IB-only client now works just fine:

      [root@a9115 ~]# cat /etc/modprobe.d/lustre.conf
      options lnet networks="o2ib(ib0)"
      [root@a9115 ~]# modprobe lustre
      [root@a9115 ~]# lctl list_nids
      10.201.36.34@o2ib
      [root@a9115 ~]# lctl ping 10.201.62.13@o2ib
      12345-0@lo
      12345-10.201.62.13@o2ib
      12345-10.201.30.13@tcp
      [root@a9115 ~]# mount -t lustre 10.201.62.13@o2ib:/foobar /cluster/scratch

      ..but the ethernet-only client fails:

      [root@a9116 ~]# cat /etc/modprobe.d/lustre.conf
      options lnet networks=tcp(eth0)
      [root@a9116 ~]# modprobe lustre
      [root@a9116 ~]# lctl list_nids
      10.201.4.35@tcp

      lctl ping seems to work:
      [root@a9116 ~]# lctl ping 10.201.30.13@tcp # mds
      12345-0@lo
      12345-10.201.62.13@o2ib
      12345-10.201.30.13@tcp
      [root@a9116 ~]# lctl ping 10.201.30.31@tcp # oss
      12345-0@lo
      12345-10.201.62.31@o2ib
      12345-10.201.30.31@tcp
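A quick way to double-check the ping replies above is to filter them for NIDs on the client's own network. The following is only a minimal sketch: `nids_on_net` is a hypothetical helper name (not part of lctl), and the sample reply is copied from the transcript above.

```shell
#!/bin/sh
# Hypothetical helper: read `lctl ping` output on stdin and print only
# the peer NIDs that sit on the given LNet network (e.g. "tcp", "o2ib").
nids_on_net() {
    net="$1"
    # Reply lines look like "12345-10.201.30.13@tcp"; drop the leading
    # "12345-" prefix, then keep the NIDs whose network suffix matches.
    sed -n 's/^[[:space:]]*[0-9][0-9]*-//p' | grep "@${net}\$"
}

# Sample reply copied from the transcript (ping of the MDS from a9116).
ping_reply='12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp'

# List the MDS NIDs that a tcp-only client shares a network with.
printf '%s\n' "$ping_reply" | nids_on_net tcp
```

For the reply above this leaves only the tcp NID of the MDS, so at the LNet level the Ethernet-only client does have a usable path to the server.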

      ..the mount operation fails with:

      [root@a9116 ~]# lctl clear
      [root@a9116 ~]# mount -t lustre 10.201.30.13@tcp:/foobar /cluster/scratch/
      mount.lustre: mount 10.201.30.13@tcp:/foobar at /cluster/scratch failed: No such file or directory

      Apr 12 14:10:21 a9116 kernel: Lustre: MGC10.201.30.13@tcp: Reactivating import
      Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(ldlm_lib.c:381:client_obd_setup()) can't add initial connection
      Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(obd_config.c:521:class_setup()) setup foobar-MDT0000-mdc-ffff880e3a01f400 failed (-2)
      Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(obd_config.c:1362:class_config_llog_handler()) Err -2 on cfg command:
      Apr 12 14:10:21 a9116 kernel: Lustre: cmd=cf003 0:foobar-MDT0000-mdc 1:foobar-MDT0000_UUID 2:10.201.62.13@o2ib
      Apr 12 14:10:21 a9116 kernel: LustreError: 15c-8: MGC10.201.30.13@tcp: The configuration from log 'foobar-client' failed (-2). This may be the result of communication errors between this node and the MGS,
      a bad configuration, or other errors. See the syslog for more information.
      Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(llite_lib.c:978:ll_fill_super()) Unable to process log: -2
      Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(obd_config.c:566:class_cleanup()) Device 3 not setup
      Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      Apr 12 14:10:22 a9116 kernel: LustreError: 9116:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      Is the MGS specification correct?
      Is the filesystem name correct?
      If upgrading, is the copied client log valid? (see upgrade docs)
      Apr 12 14:10:22 a9116 kernel: Lustre: client ffff880e3a01f400 umount complete
      Apr 12 14:10:22 a9116 kernel: LustreError: 9116:0:(obd_mount.c:2349:lustre_fill_super()) Unable to mount (-2)

      Why does the Ethernet client receive (or pick?) the Infiniband NID of the MDS (10.201.62.13@o2ib)?

      'lctl dk' reports the same:

      00000020:01000000:7.0:1334232621.765857:0:9130:0:(obd_config.c:1217:class_config_llog_handler()) Marker, inst_flg=0x0 mark_flg=0x1
      00000020:00000080:7.0:1334232621.765859:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf010
      00000020:00000080:7.0:1334232621.765860:0:9130:0:(obd_config.c:984:class_process_config()) marker 5 (0x1) foobar-MDT0000 add mdc
      00000020:00000080:7.0:1334232621.765861:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf005
      00000020:00000080:7.0:1334232621.765870:0:9130:0:(obd_config.c:926:class_process_config()) adding mapping from uuid 10.201.62.13@o2ib to nid 0x500000ac93e0d (10.201.62.13@o2ib)
      00000020:00000080:7.0:1334232621.765873:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf005
      00000020:00000080:7.0:1334232621.765874:0:9130:0:(obd_config.c:926:class_process_config()) adding mapping from uuid 10.201.62.13@o2ib to nid 0x200000ac91e0d (10.201.30.13@tcp)
      00000020:01000000:7.0:1334232621.765877:0:9130:0:(obd_config.c:1299:class_config_llog_handler()) cmd cf001, instance name: foobar-MDT0000-mdc-ffff880e3a01f400
      00000020:00000080:7.0:1334232621.765878:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf001
      00000020:00000080:7.0:1334232621.765879:0:9130:0:(obd_config.c:318:class_attach()) attach type mdc name: foobar-MDT0000-mdc-ffff880e3a01f400 uuid: 6a7aaf3a-bbcb-9abf-3516-4320e2718614
      00000020:00000080:7.0:1334232621.765936:0:9130:0:(genops.c:348:class_newdev()) Adding new device foobar-MDT0000-mdc-ffff880e3a01f400 (ffff8810358320b8)
      00000020:00000080:7.0:1334232621.765938:0:9130:0:(obd_config.c:392:class_attach()) OBD: dev 3 attached type mdc with refcount 1
      00000020:01000000:7.0:1334232621.765940:0:9130:0:(obd_config.c:1299:class_config_llog_handler()) cmd cf003, instance name: foobar-MDT0000-mdc-ffff880e3a01f400
      00000020:00000080:7.0:1334232621.765941:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf003
      00000100:00000100:7.0:1334232621.765957:0:9130:0:(client.c:80:ptlrpc_uuid_to_connection()) cannot find peer 10.201.62.13@o2ib!
      00010000:00080000:7.0:1334232621.765959:0:9130:0:(ldlm_lib.c:74:import_set_conn()) can't find connection 10.201.62.13@o2ib
      00010000:00020000:7.0:1334232621.765960:0:9130:0:(ldlm_lib.c:381:client_obd_setup()) can't add initial connection
      00000020:00000080:7.0:1334232621.793034:0:9130:0:(genops.c:786:class_export_put()) final put ffff88103af10400/6a7aaf3a-bbcb-9abf-3516-4320e2718614
      00000020:00020000:7.0:1334232621.793043:0:9130:0:(obd_config.c:521:class_setup()) setup foobar-MDT0000-mdc-ffff880e3a01f400 failed (-2)
      00000020:00000080:4.0:1334232621.793043:0:8930:0:(genops.c:915:class_import_destroy()) destroying import ffff881039698800 for foobar-MDT0000-mdc-ffff880e3a01f400
      00000020:00000080:4.0:1334232621.793049:0:8930:0:(genops.c:743:class_export_destroy()) destroying export ffff88103af10400/6a7aaf3a-bbcb-9abf-3516-4320e2718614 for foobar-MDT0000-mdc-ffff880e3a01f400
      00000020:00020000:7.0:1334232621.823020:0:9130:0:(obd_config.c:1362:class_config_llog_handler()) Err -2 on cfg command:
      00000020:02000400:7.0:1334232621.850778:0:9130:0:(obd_config.c:1456:class_config_dump_handler()) cmd=cf003 0:foobar-MDT0000-mdc 1:foobar-MDT0000_UUID 2:10.201.62.13@o2ib
      00000020:01000000:31.0:1334232621.850818:0:9116:0:(obd_config.c:1393:class_config_parse_llog()) Processed log foobar-client gen 1-13 (rc=-2)
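The hex values in the "adding mapping" lines above can be decoded by hand to confirm which network each NID belongs to. The sketch below assumes a 64-bit layout of 16 bits LND type, 16 bits network number and 32 bits IPv4 address, and assumes that type 2 means tcp and type 5 means o2ib; both assumptions are inferred from the hex/text pairs printed in the log itself. `decode_nid` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: decode a hex NID as printed by the debug log.
# Assumed layout: <LND type:16><network number:16><IPv4 address:32>.
# The body runs in a subshell so its variables stay local.
decode_nid() (
    nid=$(( $1 ))
    lnd=$(( (nid >> 48) & 0xffff ))
    num=$(( (nid >> 32) & 0xffff ))
    addr=$(( nid & 0xffffffff ))
    # Type numbers assumed from the log: 2 pairs with @tcp, 5 with @o2ib.
    case "$lnd" in
        2) name=tcp ;;
        5) name=o2ib ;;
        *) name="lnd${lnd}" ;;
    esac
    # Network number 0 is written without a suffix (tcp, not tcp0).
    if [ "$num" -ne 0 ]; then
        name="${name}${num}"
    fi
    printf '%d.%d.%d.%d@%s\n' \
        $(( (addr >> 24) & 255 )) $(( (addr >> 16) & 255 )) \
        $(( (addr >> 8) & 255 ))  $(( addr & 255 )) "$name"
)

decode_nid 0x500000ac93e0d   # first mapping in the log
decode_nid 0x200000ac91e0d   # second mapping in the log
```

Both mappings are registered under the same uuid string, 10.201.62.13@o2ib, so the decoded values only confirm that the tcp NID was added; they do not explain why the later lookup of that uuid fails on this client.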

      I'm also puzzled by the NIDs in CONFIGS/foobar-[client,MDT0000]:

      [root@n-mds1 ~]# llog_reader /lustre/mds/CONFIGS/foobar-client |grep uuid
      Target uuid : config_uuid
      uuid=foobar-clilov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
      uuid=foobar-clilmv_UUID stripe:cnt=0 size=0 offset=0 pattern=0
      #10 (088)add_uuid nid=10.201.62.13@o2ib(0x500000ac93e0d) 0: 1:10.201.62.13@o2ib
      #11 (088)add_uuid nid=10.201.30.13@tcp(0x200000ac91e0d) 0: 1:10.201.62.13@o2ib
      #20 (088)add_uuid nid=10.201.62.31@o2ib(0x500000ac93e1f) 0: 1:10.201.62.31@o2ib
      #21 (088)add_uuid nid=10.201.30.31@tcp(0x200000ac91e1f) 0: 1:10.201.62.31@o2ib
      [root@n-mds1 ~]# llog_reader /lustre/mds/CONFIGS/foobar-MDT0000 |grep uuid
      Target uuid : config_uuid
      uuid=foobar-MDT0000-mdtlov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
      #11 (088)add_uuid nid=10.201.62.31@o2ib(0x500000ac93e1f) 0: 1:10.201.62.31@o2ib
      #12 (088)add_uuid nid=10.201.30.31@tcp(0x200000ac91e1f) 0: 1:10.201.62.31@o2ib
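To make the pattern in these records easier to see, the add_uuid lines can be grouped by the uuid on the right-hand side. This is a minimal sketch that only reshapes the llog_reader output shown above; `group_add_uuid` is a hypothetical helper name, and the sample records are copied from the foobar-client log.

```shell
#!/bin/sh
# Hypothetical helper: read llog_reader output on stdin and print one
# line per uuid, listing every NID that an add_uuid record attaches to it.
group_add_uuid() {
    # Pull "uuid nid" pairs out of records such as
    #   #10 (088)add_uuid nid=10.201.62.13@o2ib(0x...) 0: 1:10.201.62.13@o2ib
    sed -n 's/.*add_uuid[[:space:]]*nid=\([^(]*\)(.*1:\([^[:space:]]*\).*/\2 \1/p' |
    awk '{
        if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
        nids[$1] = (nids[$1] ? nids[$1] "," : "") $2
    }
    END { for (i = 1; i <= n; i++) print order[i] " -> " nids[order[i]] }'
}

# Sample records copied from the foobar-client log above.
records='#10 (088)add_uuid nid=10.201.62.13@o2ib(0x500000ac93e0d) 0: 1:10.201.62.13@o2ib
#11 (088)add_uuid nid=10.201.30.13@tcp(0x200000ac91e0d) 0: 1:10.201.62.13@o2ib
#20 (088)add_uuid nid=10.201.62.31@o2ib(0x500000ac93e1f) 0: 1:10.201.62.31@o2ib
#21 (088)add_uuid nid=10.201.30.31@tcp(0x200000ac91e1f) 0: 1:10.201.62.31@o2ib'

printf '%s\n' "$records" | group_add_uuid
```

Read this way, each server has a single uuid that is named after its o2ib NID, with both the o2ib and the tcp NID attached to it. Whether that naming is expected is exactly the open question; the sketch only makes the grouping explicit.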

      Is it normal that there are no tcp NIDs on the right-hand side? And what stupid mistake did I make while setting up the system?

          People

            wc-triage WC Triage
            ethz.support ETHz Support (Inactive)