  LU-12073

conf-sanity test 123aa hangs


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.7
    • Fix Version/s: Lustre 2.10.7
    • Labels: None
    • Severity: 3

    Description

      conf-sanity test_123aa hangs. The test was added to b2_10 on 23 Feb 2019 with patch https://review.whamcloud.com/33863. Since then, we’ve seen about four test sessions time out during this test, and we only see this issue on the b2_10 branch.

      Looking at the suite_log for the hang at https://testing.whamcloud.com/test_sets/4159964c-4363-11e9-9646-52540065bddc (RHEL 6.10 client testing), the last thing we see is the file system coming up and quotas being set up:

      Total disk size: 451176  block-softlimit: 452200 block-hardlimit: 474810 inode-softlimit: 79992 inode-hardlimit: 83991
      Setting up quota on trevis-33vm1.trevis.whamcloud.com:/mnt/lustre for quota_usr...
      + /usr/bin/lfs setquota -u quota_usr -b 452200 -B 474810 -i 79992 -I 83991 /mnt/lustre
      + /usr/bin/lfs setquota -g quota_usr -b 452200 -B 474810 -i 79992 -I 83991 /mnt/lustre
      Quota settings for quota_usr : 
      Disk quotas for usr quota_usr (uid 60000):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
          /mnt/lustre     [0]  452200  474810       -       0   79992   83991       -
      lustre-MDT0000_UUID
                            0       -       0       -       0       -       0       -
      lustre-OST0000_UUID
                            0       -       0       -       -       -       -       -
      
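      The limits above line up with the reported total disk size plus fixed margins. Below is a minimal shell sketch of that arithmetic (illustrative variable names, not the actual test-framework.sh code) which reproduces the logged values:

      DISKSZ=451176                           # "Total disk size" reported above, in KB
      BLK_SOFT=$((DISKSZ + 1024))             # 452200 = block-softlimit
      BLK_HARD=$((BLK_SOFT + BLK_SOFT / 20))  # 474810 = block-hardlimit, 5% over the soft limit
      I_SOFT=79992                            # inode-softlimit as reported
      I_HARD=$((I_SOFT + I_SOFT / 20))        # 83991 = inode-hardlimit, same 5% margin
      /usr/bin/lfs setquota -u quota_usr -b $BLK_SOFT -B $BLK_HARD -i $I_SOFT -I $I_HARD /mnt/lustre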

      Looking at the console logs, some of the nodes complain that the MGS can’t be found. In the client (vm1) console log, the last errors we see before the call traces are:

      Lustre: DEBUG MARKER: mount | grep /mnt/lustre' '
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
      Lustre: DEBUG MARKER: Using TIMEOUT=20
      Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
      LustreError: 4491:0:(lov_obd.c:1379:lov_quotactl()) ost 1 is inactive
      LustreError: 11-0: lustre-OST0001-osc-ffff88004ddac000: operation ost_connect to node 10.9.5.160@tcp failed: rc = -19
      LustreError: 11-0: lustre-OST0001-osc-ffff88004ddac000: operation ost_connect to node 10.9.5.160@tcp failed: rc = -19
      LustreError: Skipped 3 previous similar messages
      
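      rc = -19 is -ENODEV, i.e. the OST target the client is trying to reach is not set up on the OSS. As a hedged sketch, these standard client-side checks (plain lctl usage; the device name is the one from this run) would show which OSC devices never connected:

      lctl dl | grep ' IN osc '                        # inactive OSC devices, the same check the test framework runs
      lctl get_param osc.*.active                      # 1 = active, 0 = administratively deactivated
      lctl get_param osc.lustre-OST0001-osc-*.import   # import state for the target that failed ost_connect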

      Looking at the OSS (vm3) console log, we see the same errors for each OST after the first one when the OSTs are mounted:

      [48116.032988] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2; mount -t lustre   		                   /dev/lvm-Role_OSS/P2 /mnt/lustre-ost2
      [48116.396020] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
      [48116.398214] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      [48116.525199] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
      [48116.527059] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
      [48116.657800] LustreError: 15f-b: lustre-OST0001: cannot register this server with the MGS: rc = -17. Is the MGS running?
      [48116.659704] LustreError: 19668:0:(obd_mount_server.c:1882:server_fill_super()) Unable to start targets: -17
      [48116.661395] LustreError: 19668:0:(obd_mount_server.c:1592:server_put_super()) no obd lustre-OST0001
      [48116.663024] LustreError: 19668:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-OST0001 not registered
      [48116.738331] LustreError: 19668:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-17)
      [48117.926245] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [48117.929186] LustreError: Skipped 2 previous similar messages
      

      and then

      [48150.341541] Lustre: DEBUG MARKER: trevis-33vm2.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [48152.124460] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
      [48152.207544] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [48152.210571] LustreError: Skipped 26 previous similar messages
      [48152.298637] Lustre: DEBUG MARKER: Using TIMEOUT=20
      [48157.475237] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-ldiskfs.lustre-OST0000.quota_slave.enabled
      [48217.205900] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [48217.208963] LustreError: Skipped 155 previous similar messages
      [48347.205833] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [48347.208773] LustreError: Skipped 311 previous similar messages
      
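      rc = -17 is -EEXIST and is returned by the MGS during target registration (see the MGS/MDS log below) rather than generated on the OSS itself. Since the mount failed, the target is left unmounted and its registration data can be inspected directly; a hedged sketch using the device path from the mount command above:

      tunefs.lustre --dryrun /dev/lvm-Role_OSS/P2    # print the target name, index, flags and parameters without changing anything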

      In the MGS/MDS console log, we see the same errors for each OST after the first one:

      [48111.202961] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      [48111.503103] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      [48115.037113] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [48115.218187] Lustre: DEBUG MARKER: trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [48119.050786] LustreError: 30672:0:(llog.c:391:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
      [48119.052457] LustreError: 30672:0:(mgs_llog.c:1864:record_start_log()) MGS: can't start log lustre-MDT0000.1552165754.bak: rc = -17
      [48119.054392] LustreError: 30672:0:(mgs_llog.c:1961:mgs_write_log_direct_all()) MGS: writing log lustre-MDT0000.1552165754.bak: rc = -17
      [48119.056367] LustreError: 30672:0:(mgs_llog.c:4234:mgs_write_log_param()) err -17 on param 'sys.timeout=20'
      [48119.058095] LustreError: 30672:0:(mgs_handler.c:535:mgs_target_reg()) Failed to write lustre-OST0001 log (-17)
      
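      Here -17 (-EEXIST) comes from record_start_log() failing to start the backup configuration log lustre-MDT0000.1552165754.bak after an llog UUID mismatch. As a hedged sketch of general MGS llog inspection (not taken from this ticket's investigation), the configuration logs the MGS holds can be listed with lctl while it is running, or read-only from the on-disk CONFIGS directory of the MGS/MDT device seen in the log above:

      lctl --device MGS llog_catlist                       # list configuration llogs on the running MGS
      debugfs -c -R 'ls -l CONFIGS' /dev/lvm-Role_MDS/P1   # read-only listing of the on-disk config logs (ldiskfs)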

      Other test sessions hitting this hang:

      SLES clients - https://testing.whamcloud.com/test_sets/d89910e2-3890-11e9-8f69-52540065bddc
      Ubuntu clients - https://testing.whamcloud.com/test_sets/0abcfbbe-4300-11e9-9646-52540065bddc
      RHEL 6.10 clients - https://testing.whamcloud.com/test_sets/2a566de2-4324-11e9-92fe-52540065bddc

            People

              Assignee: Andreas Dilger (adilger)
              Reporter: James Nunez (jamesanunez) (Inactive)