[LU-12073] conf-sanity test 123aa hangs Created: 14/Mar/19 Updated: 21/Mar/19 Resolved: 21/Mar/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.7 |
| Fix Version/s: | Lustre 2.10.7 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
conf-sanity test_123aa hangs. conf-sanity test 123aa was added to b2_10 on 23 Feb 2019 with patch https://review.whamcloud.com/33863. Since that time, we’ve seen about four test sessions time out during this test, and we only see this issue on the b2_10 branch. Looking at the suite_log for the hang at https://testing.whamcloud.com/test_sets/4159964c-4363-11e9-9646-52540065bddc (RHEL6.10 client testing), the last thing we see looks like the file system coming up and setting up quotas:
Total disk size: 451176 block-softlimit: 452200 block-hardlimit: 474810 inode-softlimit: 79992 inode-hardlimit: 83991
Setting up quota on trevis-33vm1.trevis.whamcloud.com:/mnt/lustre for quota_usr...
+ /usr/bin/lfs setquota -u quota_usr -b 452200 -B 474810 -i 79992 -I 83991 /mnt/lustre
+ /usr/bin/lfs setquota -g quota_usr -b 452200 -B 474810 -i 79992 -I 83991 /mnt/lustre
Quota settings for quota_usr :
Disk quotas for usr quota_usr (uid 60000):
Filesystem kbytes quota limit grace files quota limit grace
/mnt/lustre [0] 452200 474810 - 0 79992 83991 -
lustre-MDT0000_UUID
0 - 0 - 0 - 0 -
lustre-OST0000_UUID
0 - 0 - - - - -
Looking at console logs, some of the nodes are complaining that the MGS can’t be found. Looking at the client (vm1) console log, the last errors we see before the call traces are
Lustre: DEBUG MARKER: mount | grep /mnt/lustre' '
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
LustreError: 4491:0:(lov_obd.c:1379:lov_quotactl()) ost 1 is inactive
LustreError: 11-0: lustre-OST0001-osc-ffff88004ddac000: operation ost_connect to node 10.9.5.160@tcp failed: rc = -19
LustreError: 11-0: lustre-OST0001-osc-ffff88004ddac000: operation ost_connect to node 10.9.5.160@tcp failed: rc = -19
LustreError: Skipped 3 previous similar messages
Looking at the OSS (vm3) console logs, we see the same errors for each OST when formatting the OSTs after the first OST:
[48116.032988] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost2; mount -t lustre /dev/lvm-Role_OSS/P2 /mnt/lustre-ost2
[48116.396020] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
[48116.398214] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[48116.525199] LDISKFS-fs (dm-1): file extents enabled, maximum tree depth=5
[48116.527059] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
[48116.657800] LustreError: 15f-b: lustre-OST0001: cannot register this server with the MGS: rc = -17. Is the MGS running?
[48116.659704] LustreError: 19668:0:(obd_mount_server.c:1882:server_fill_super()) Unable to start targets: -17
[48116.661395] LustreError: 19668:0:(obd_mount_server.c:1592:server_put_super()) no obd lustre-OST0001
[48116.663024] LustreError: 19668:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-OST0001 not registered
[48116.738331] LustreError: 19668:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-17)
[48117.926245] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[48117.929186] LustreError: Skipped 2 previous similar messages
and then
[48150.341541] Lustre: DEBUG MARKER: trevis-33vm2.trevis.whamcloud.com: executing set_default_debug -1 all 4
[48152.124460] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
[48152.207544] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[48152.210571] LustreError: Skipped 26 previous similar messages
[48152.298637] Lustre: DEBUG MARKER: Using TIMEOUT=20
[48157.475237] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-ldiskfs.lustre-OST0000.quota_slave.enabled
[48217.205900] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[48217.208963] LustreError: Skipped 155 previous similar messages
[48347.205833] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.161@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[48347.208773] LustreError: Skipped 311 previous similar messages
On the MGS/MDS console log, we see the same errors for each OST after the first one:
[48111.202961] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
[48111.503103] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[48115.037113] Lustre: DEBUG MARKER: /sbin/lctl mark trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
[48115.218187] Lustre: DEBUG MARKER: trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
[48119.050786] LustreError: 30672:0:(llog.c:391:llog_init_handle()) MGS: llog uuid mismatch: config_uuid/
[48119.052457] LustreError: 30672:0:(mgs_llog.c:1864:record_start_log()) MGS: can't start log lustre-MDT0000.1552165754.bak: rc = -17
[48119.054392] LustreError: 30672:0:(mgs_llog.c:1961:mgs_write_log_direct_all()) MGS: writing log lustre-MDT0000.1552165754.bak: rc = -17
[48119.056367] LustreError: 30672:0:(mgs_llog.c:4234:mgs_write_log_param()) err -17 on param 'sys.timeout=20'
[48119.058095] LustreError: 30672:0:(mgs_handler.c:535:mgs_target_reg()) Failed to write lustre-OST0001 log (-17)
SLES clients - https://testing.whamcloud.com/test_sets/d89910e2-3890-11e9-8f69-52540065bddc |
| Comments |
| Comment by Andreas Dilger [ 14/Mar/19 ] |
|
It looks like this is some kind of problem with the current state of the system when setupall is run, since there are complaints during OST startup before the "test" is actually run:
trevis-33vm3: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P2 at /mnt/lustre-ost2 failed: File exists
trevis-33vm3: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P3 at /mnt/lustre-ost3 failed: File exists
trevis-33vm3: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P4 at /mnt/lustre-ost4 failed: File exists
trevis-33vm3: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P5 at /mnt/lustre-ost5 failed: File exists
It looks like test_120 and test_122 are skipped for non-DNE configs (which only happens for interop "full" sessions, not same-version "full" sessions), so the last test run before test_123aa on b2_10 is test_109b (b2_10 doesn't have test_110, test_111, and test_115, which also reformat the filesystem). test_109 looks like it only starts/stops the ost1 and mds1 facets, but setupall is trying to start all of the facets. So while it would appear that this is an interop issue, I think it is more of a test configuration issue.
It seems the tests are an unholy mixture of "setup" vs. "setupall", and it isn't always clear which one to use. On the one hand it would be faster to avoid a reformat for every subtest, but on the other hand, if there are a variety of different subtests being run, it might just make sense to have every conf-sanity test start with a reformat and end with a cleanup rather than trying to re-use an unknown filesystem configuration in the next subtest.
For now I think the solution is a b2_10-only patch that changes setupall to setup, but for master we may want a more complete reorganization of how the conf-sanity tests are run. It might make sense to split conf-sanity.sh into conf-sanity-reformat and conf-sanity-keep or something like that, since it is already one of the longest test sessions, but that is just a first guess and there may be better ways to do it. |
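As a rough illustration of the setupall-to-setup change being proposed, here is a hypothetical sketch only; it is not the content of the merged patch https://review.whamcloud.com/34428. It assumes the usual conf-sanity.sh helpers: setup()/cleanup() are the local helpers that start/stop just mds1 and ost1 and mount/unmount the client, while setupall (test-framework.sh) starts every configured facet.
test_123aa() {
	# setup() only starts mds1/ost1 and mounts the client, whereas setupall
	# tries to start every facet and fails with "File exists" when an
	# earlier subtest (e.g. test_109b) left the other OSTs in an unknown
	# state.
	setup		# was: setupall on b2_10

	# ... existing body of the subtest, unchanged ...

	cleanup		# stop mds1/ost1 and unmount the client again
}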
| Comment by Gerrit Updater [ 14/Mar/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34428 |
| Comment by Gerrit Updater [ 21/Mar/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34428/ |