Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version: Lustre 2.10.7
Description
conf-sanity test_123aa hangs. conf-sanity test 123aa was added to b2_10 on 23 Feb 2019 with patch https://review.whamcloud.com/33863. Since that time, we’ve seen about four test sessions time out during this test, and we only see this issue on the b2_10 branch.
Looking at the suite_log for the hang at https://testing.whamcloud.com/test_sets/4159964c-4363-11e9-9646-52540065bddc (RHEL 6.10 client testing), the last thing we see suggests the file system is coming up and quotas are being set up:
Total disk size: 451176 block-softlimit: 452200 block-hardlimit: 474810 inode-softlimit: 79992 inode-hardlimit: 83991
Setting up quota on trevis-33vm1.trevis.whamcloud.com:/mnt/lustre for quota_usr...
+ /usr/bin/lfs setquota -u quota_usr -b 452200 -B 474810 -i 79992 -I 83991 /mnt/lustre
+ /usr/bin/lfs setquota -g quota_usr -b 452200 -B 474810 -i 79992 -I 83991 /mnt/lustre
Quota settings for quota_usr :
Disk quotas for usr quota_usr (uid 60000):
Filesystem           kbytes  quota   limit  grace  files  quota  limit  grace
/mnt/lustre             [0] 452200  474810      -      0  79992  83991      -
lustre-MDT0000_UUID       0      -       0      -      0      -      0      -
lustre-OST0000_UUID       0      -       0      -      -      -      -      -
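As an aside, the hard limits in this output look like the soft limits plus 5%. The following is only an observation inferred from the logged values above; the exact formula the test script uses is an assumption:

```python
# Observation (assumed, inferred from the logged values): each hard limit
# equals its soft limit + 5%, truncated to an integer.
blk_soft, blk_hard = 452200, 474810
ino_soft, ino_hard = 79992, 83991

assert blk_soft * 105 // 100 == blk_hard
assert ino_soft * 105 // 100 == ino_hard
print("hard limits = soft limits + 5%")
```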
Looking at the console logs, some of the nodes are complaining that the MGS can’t be found. Looking at the client (vm1) console log, the last errors we see before the call traces are:
Lustre: DEBUG MARKER: mount | grep /mnt/lustre' '
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
LustreError: 4491:0:(lov_obd.c:1379:lov_quotactl()) ost 1 is inactive
LustreError: 11-0: lustre-OST0001-osc-ffff88004ddac000: operation ost_connect to node 10.9.5.160@tcp failed: rc = -19
LustreError: 11-0: lustre-OST0001-osc-ffff88004ddac000: operation ost_connect to node 10.9.5.160@tcp failed: rc = -19
LustreError: Skipped 3 previous similar messages
Looking at the OSS (vm3) console logs, we see the same errors for each OST after the first one when mounting the OSTs:
[48116.032988] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2; mount -t lustre /dev/lvm-Role_OSS/P2 /mnt/lustre-ost2
[48116.396020] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
[48116.398214] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[48116.525199] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
[48116.527059] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
[48116.657800] LustreError: 15f-b: lustre-OST0001: cannot register this server with the MGS: rc = -17. Is the MGS running?
[48116.659704] LustreError: 19668:0:(obd_mount_server.c:1882:server_fill_super()) Unable to start targets: -17
[48116.661395] LustreError: 19668:0:(obd_mount_server.c:1592:server_put_super()) no obd lustre-OST0001
[48116.663024] LustreError: 19668:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-OST0001 not registered
[48116.738331] LustreError: 19668:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-17)
[48117.926245] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[48117.929186] LustreError: Skipped 2 previous similar messages
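The negative return codes in these messages are standard Linux errno values. A quick way to decode the two seen here (a generic errno lookup, not Lustre-specific tooling):

```python
import errno
import os

# Lustre reports failures as negative errno values; decode the two above.
for rc in (-19, -17):
    print(f"rc = {rc}: {errno.errorcode[-rc]} ({os.strerror(-rc)})")
# rc = -19 is ENODEV and rc = -17 is EEXIST
```

So the client's ost_connect failure is "no such device", while the OST's MGS registration failure is "file exists", consistent with the MGS llog errors on the MDS below.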
and then:
[48150.341541] Lustre: DEBUG MARKER: trevis-33vm2.trevis.whamcloud.com: executing set_default_debug -1 all 4
[48152.124460] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
[48152.207544] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[48152.210571] LustreError: Skipped 26 previous similar messages
[48152.298637] Lustre: DEBUG MARKER: Using TIMEOUT=20
[48157.475237] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-ldiskfs.lustre-OST0000.quota_slave.enabled
[48217.205900] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[48217.208963] LustreError: Skipped 155 previous similar messages
[48347.205833] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[48347.208773] LustreError: Skipped 311 previous similar messages
On the MGS/MDS console log, we see the same errors for each OST after the first one:
[48111.202961] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
[48111.503103] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[48115.037113] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
[48115.218187] Lustre: DEBUG MARKER: trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
[48119.050786] LustreError: 30672:0:(llog.c:391:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
[48119.052457] LustreError: 30672:0:(mgs_llog.c:1864:record_start_log()) MGS: can't start log lustre-MDT0000.1552165754.bak: rc = -17
[48119.054392] LustreError: 30672:0:(mgs_llog.c:1961:mgs_write_log_direct_all()) MGS: writing log lustre-MDT0000.1552165754.bak: rc = -17
[48119.056367] LustreError: 30672:0:(mgs_llog.c:4234:mgs_write_log_param()) err -17 on param 'sys.timeout=20'
[48119.058095] LustreError: 30672:0:(mgs_handler.c:535:mgs_target_reg()) Failed to write lustre-OST0001 log (-17)
Other instances of this hang:
SLES clients - https://testing.whamcloud.com/test_sets/d89910e2-3890-11e9-8f69-52540065bddc
Ubuntu clients - https://testing.whamcloud.com/test_sets/0abcfbbe-4300-11e9-9646-52540065bddc
RHEL 6.10 clients - https://testing.whamcloud.com/test_sets/2a566de2-4324-11e9-92fe-52540065bddc