  LU-12073

conf-sanity test 123aa hangs


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.7
    • Fix Version/s: Lustre 2.10.7
    • Labels: None
    • Severity: 3

    Description

      conf-sanity test_123aa hangs. The test was added to b2_10 on 23 Feb 2019 with patch https://review.whamcloud.com/33863. Since then, we’ve seen about four test sessions time out during this test, and we only see this issue on the b2_10 branch.

      Looking at the suite_log for the hang at https://testing.whamcloud.com/test_sets/4159964c-4363-11e9-9646-52540065bddc (RHEL 6.10 client testing), the last thing we see is the file system coming up and quotas being set up:

      Total disk size: 451176  block-softlimit: 452200 block-hardlimit: 474810 inode-softlimit: 79992 inode-hardlimit: 83991
      Setting up quota on trevis-33vm1.trevis.whamcloud.com:/mnt/lustre for quota_usr...
      + /usr/bin/lfs setquota -u quota_usr -b 452200 -B 474810 -i 79992 -I 83991 /mnt/lustre
      + /usr/bin/lfs setquota -g quota_usr -b 452200 -B 474810 -i 79992 -I 83991 /mnt/lustre
      Quota settings for quota_usr : 
      Disk quotas for usr quota_usr (uid 60000):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
          /mnt/lustre     [0]  452200  474810       -       0   79992   83991       -
      lustre-MDT0000_UUID
                            0       -       0       -       0       -       0       -
      lustre-OST0000_UUID
                            0       -       0       -       -       -       -       -
      
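      The limits above line up with the reported total disk size plus fixed margins. Below is a minimal shell sketch of that arithmetic (illustrative variable names, not the actual test-framework.sh code) which reproduces the logged values:

      DISKSZ=451176                           # "Total disk size" reported above, in KB
      BLK_SOFT=$((DISKSZ + 1024))             # 452200 = block-softlimit
      BLK_HARD=$((BLK_SOFT + BLK_SOFT / 20))  # 474810 = block-hardlimit, 5% over the soft limit
      I_SOFT=79992                            # inode-softlimit as reported
      I_HARD=$((I_SOFT + I_SOFT / 20))        # 83991 = inode-hardlimit, same 5% margin
      /usr/bin/lfs setquota -u quota_usr -b $BLK_SOFT -B $BLK_HARD -i $I_SOFT -I $I_HARD /mnt/lustre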

      Looking at the console logs, some of the nodes complain that the MGS can’t be found. In the client (vm1) console log, the last errors we see before the call traces are:

      Lustre: DEBUG MARKER: mount | grep /mnt/lustre' '
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
      Lustre: DEBUG MARKER: Using TIMEOUT=20
      Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      LustreError: 4491:0:(lov_obd.c:1379:lov_quotactl()) ost 1 is inactive
      LustreError: 11-0: lustre-OST0001-osc-ffff88004ddac000: operation ost_connect to node 10.9.5.160@tcp failed: rc = -19
      LustreError: 11-0: lustre-OST0001-osc-ffff88004ddac000: operation ost_connect to node 10.9.5.160@tcp failed: rc = -19
      LustreError: Skipped 3 previous similar messages
      
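      rc = -19 is -ENODEV, i.e. the OST target the client is trying to reach is not set up on the OSS. As a hedged sketch, these standard client-side checks (plain lctl usage; the device name is the one from this run) would show which OSC devices never connected:

      lctl dl | grep ' IN osc '                        # inactive OSC devices, the same check the test framework runs
      lctl get_param osc.*.active                      # 1 = active, 0 = administratively deactivated
      lctl get_param osc.lustre-OST0001-osc-*.import   # import state for the target that failed ost_connect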

      Looking at the OSS (vm3) console log, we see the same errors for each OST after the first one when the OSTs are mounted:

      [48116.032988] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2; mount -t lustre   		                   /dev/lvm-Role_OSS/P2 /mnt/lustre-ost2
      [48116.396020] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
      [48116.398214] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      [48116.525199] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
      [48116.527059] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
      [48116.657800] LustreError: 15f-b: lustre-OST0001: cannot register this server with the MGS: rc = -17. Is the MGS running?
      [48116.659704] LustreError: 19668:0:(obd_mount_server.c:1882:server_fill_super()) Unable to start targets: -17
      [48116.661395] LustreError: 19668:0:(obd_mount_server.c:1592:server_put_super()) no obd lustre-OST0001
      [48116.663024] LustreError: 19668:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-OST0001 not registered
      [48116.738331] LustreError: 19668:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-17)
      [48117.926245] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [48117.929186] LustreError: Skipped 2 previous similar messages
      

      and then

      [48150.341541] Lustre: DEBUG MARKER: trevis-33vm2.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [48152.124460] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
      [48152.207544] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [48152.210571] LustreError: Skipped 26 previous similar messages
      [48152.298637] Lustre: DEBUG MARKER: Using TIMEOUT=20
      [48157.475237] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-ldiskfs.lustre-OST0000.quota_slave.enabled
      [48217.205900] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [48217.208963] LustreError: Skipped 155 previous similar messages
      [48347.205833] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [48347.208773] LustreError: Skipped 311 previous similar messages
      
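      rc = -17 is -EEXIST and is returned by the MGS during target registration (see the MGS/MDS log below) rather than generated on the OSS itself. Since the mount failed, the target is left unmounted and its registration data can be inspected directly; a hedged sketch using the device path from the mount command above:

      tunefs.lustre --dryrun /dev/lvm-Role_OSS/P2    # print the target name, index, flags and parameters without changing anything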

      In the MGS/MDS console log, we see the same errors for each OST after the first one:

      [48111.202961] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      [48111.503103] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      [48115.037113] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [48115.218187] Lustre: DEBUG MARKER: trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [48119.050786] LustreError: 30672:0:(llog.c:391:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
      [48119.052457] LustreError: 30672:0:(mgs_llog.c:1864:record_start_log()) MGS: can't start log lustre-MDT0000.1552165754.bak: rc = -17
      [48119.054392] LustreError: 30672:0:(mgs_llog.c:1961:mgs_write_log_direct_all()) MGS: writing log lustre-MDT0000.1552165754.bak: rc = -17
      [48119.056367] LustreError: 30672:0:(mgs_llog.c:4234:mgs_write_log_param()) err -17 on param 'sys.timeout=20'
      [48119.058095] LustreError: 30672:0:(mgs_handler.c:535:mgs_target_reg()) Failed to write lustre-OST0001 log (-17)
      
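      Here -17 (-EEXIST) comes from record_start_log() failing to start the backup configuration log lustre-MDT0000.1552165754.bak after an llog UUID mismatch. As a hedged sketch of general MGS llog inspection (not taken from this ticket's investigation), the configuration logs the MGS holds can be listed with lctl while it is running, or read-only from the on-disk CONFIGS directory of the MGS/MDT device seen in the log above:

      lctl --device MGS llog_catlist                       # list configuration llogs on the running MGS
      debugfs -c -R 'ls -l CONFIGS' /dev/lvm-Role_MDS/P1   # read-only listing of the on-disk config logs (ldiskfs)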

      Other test sessions hitting this hang:

      SLES clients - https://testing.whamcloud.com/test_sets/d89910e2-3890-11e9-8f69-52540065bddc
      Ubuntu clients - https://testing.whamcloud.com/test_sets/0abcfbbe-4300-11e9-9646-52540065bddc
      RHEL 6.10 clients - https://testing.whamcloud.com/test_sets/2a566de2-4324-11e9-92fe-52540065bddc

            People

              Assignee: Andreas Dilger (adilger)
              Reporter: James Nunez (jamesanunez) (Inactive)