[LU-12312] sanity-sec: test_31: 'network' mount option cannot be taken into account
| Created: | 17/May/19 | Updated: | 01/Jun/23 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1bf6a37a-7821-11e9-a028-52540065bddc

CMD: trevis-38vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4
trevis-38vm8: == rpc test complete, duration -o sec ================================================================ 19:32:45 (1558035165)
trevis-38vm8: trevis-38vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
CMD: trevis-38vm8 e2label /dev/mapper/ost8_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: trevis-38vm8 e2label /dev/mapper/ost8_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: trevis-38vm8 e2label /dev/mapper/ost8_flakey 2>/dev/null
Started lustre-OST0007
CMD: trevis-38vm9 /usr/sbin/lctl list_nids | grep tcp999
Starting client: trevis-38vm6.trevis.whamcloud.com: -o user_xattr,flock,network=tcp999 10.9.3.145@tcp999:/lustre /mnt/lustre
CMD: trevis-38vm6.trevis.whamcloud.com mkdir -p /mnt/lustre
CMD: trevis-38vm6.trevis.whamcloud.com mount -t lustre -o user_xattr,flock,network=tcp999 10.9.3.145@tcp999:/lustre /mnt/lustre
mount.lustre: mount 10.9.3.145@tcp999:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
unconfigure:
- lnet:
errno: -16
descr: "LNet unconfigure error: Device or resource busy"
[17996.736209] Lustre: DEBUG MARKER: == sanity-sec test 31: client mount option '-o network' ============================================== 19:30:04 (1558035004)
[17997.693592] Lustre: DEBUG MARKER: lctl get_param -n *.lustre*.exports.'10.9.5.215@tcp'.uuid 2>/dev/null | grep -q -
[17998.217952] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure && /usr/sbin/lnetctl net add --if eth0 --net tcp999
[17998.557153] LNet: Added LNI 10.9.3.146@tcp999 [8/256/0/180]
[18000.237970] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation mds_statfs to node 10.9.3.145@tcp failed: rc = -107
[18000.239925] LustreError: Skipped 9 previous similar messages
[18000.240888] Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 10.9.3.145@tcp) was lost; in progress operations using this service will wait for recovery to complete
[18000.243842] Lustre: Skipped 18 previous similar messages
[18007.616779] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
[18007.922846] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds2
[18009.125483] Lustre: lustre-MDT0001: Not available for connect from 10.9.3.145@tcp (stopping)
[18009.127096] Lustre: Skipped 42 previous similar messages
[18011.260546] LustreError: 17495:0:(client.c:1183:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff98fcd0168d80 x1633699486628912/t0(0) o41->lustre-MDT0003-osp-MDT0001@0@lo:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[18011.264135] LustreError: 17495:0:(client.c:1183:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
[18015.357668] Lustre: server umount lustre-MDT0001 complete
[18015.358716] Lustre: Skipped 1 previous similar message
[18016.092394] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. |
| Comments |
| Comment by Gerrit Updater [ 30/May/19 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34997 |
| Comment by James A Simmons [ 25/Feb/20 ] |
|
The bug is due to LNet's sysfs handling of peer creation, which is a bad idea. |
| Comment by Andreas Dilger [ 02/Apr/20 ] |
|
This is causing about 70% of test runs to fail over the past few days. Is this related to some other patch that landed? |
| Comment by Andreas Dilger [ 02/Apr/20 ] |
|
The test log shows:

mount.lustre: mount 10.9.6.211@tcp999:/lustre at /mnt/lustre failed: Invalid argument

and the client console log shows:

[19617.682714] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock,network=tcp999 10.9.6.211@tcp999:/lustre /mnt/lustre
[19617.693290] LustreError: 21537:0:(obd_mount.c:1487:lmd_parse()) LNet Dynamic Peer Discovery is enabled on this node. 'network' mount option cannot be taken into account.
[19617.695857] LustreError: 21537:0:(obd_mount.c:1586:lmd_parse()) Bad mount options user_xattr,flock,network=tcp999,device=10.9.6.211@tcp999:/lustre
[19617.698041] LustreError: 21537:0:(obd_mount.c:1681:lustre_fill_super()) Unable to mount (-22)
[19618.702375] LNet: Removed LNI 10.9.6.208@tcp999

It may be related to patch https://review.whamcloud.com/36919. |
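
For what it's worth, the lmd_parse() message above means the mount can only succeed once discovery is off, which is what the second half of test_31 exercises. A hedged sketch of that sequence, using standard lnetctl commands and the NID from the log above:

# Show global LNet settings; "discovery: 1" means Dynamic Peer Discovery is on
lnetctl global show

# Disable discovery so lmd_parse() will accept the 'network' option
lnetctl set discovery 0

# Retry the mount; with discovery disabled this is expected to succeed
mount -t lustre -o user_xattr,flock,network=tcp999 10.9.6.211@tcp999:/lustre /mnt/lustre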
| Comment by Gerrit Updater [ 02/Apr/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38126 |
| Comment by Andreas Dilger [ 02/Apr/20 ] |
|
It seems that the initial problem report from this ticket is unrelated to the failures currently being hit, despite the fact that they both cause the same subtest to fail in the same way. |
| Comment by Gerrit Updater [ 02/Apr/20 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38128 |
| Comment by Gerrit Updater [ 02/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38128/ |
| Comment by James A Simmons [ 03/Apr/20 ] |
|
Looks like the revert has been dropped. |
| Comment by Andreas Dilger [ 03/Apr/20 ] |
|
The test has been added to the ALWAYS_EXCEPT list, and this ticket has been marked with the always_except label, so we can't close it until the issue is fixed and the test is removed from the sanity-sec.sh ALWAYS_EXCEPT list. |
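
For illustration, the exclusion is just a space-separated list of subtest numbers near the top of the test script. A sketch of the relevant sanity-sec.sh line; the exact layout is an assumption, so check the script in your tree:

# lustre/tests/sanity-sec.sh: subtests listed here are always skipped.
# Closing this ticket requires dropping "31" from this list again.
ALWAYS_EXCEPT="31 $SANITY_SEC_EXCEPT"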
| Comment by Sebastien Buisson [ 08/Apr/20 ] |
|
As you can see from the logs of this successful run of sanity-sec test_31 at https://testing.whamcloud.com/test_logs/68aeef43-2eea-4566-95a7-bfa97456c2da/show_text, the error messages that appear when mounting with -o network=tcp999 while LNet Dynamic Discovery is enabled are normal, as the test expects that mount to fail. The test then disables LNet Dynamic Discovery, and the mount is expected to pass.

I was able to reproduce the error reported here, and it seems to be due to a stalled connection between the client and the MGS:

# lctl dl
  0 UP mgc MGC10.128.11.155@tcp999 702b3c5b-1df0-4 4
# lctl
lctl > cfg_device MGC10.128.11.155@tcp999
lctl > cleanup force
[never returns...]

I was also able to test the tip of the master branch (commit 742897a967). In fact, further testing shows that it breaks the whole -o network option, which is mandatory as soon as you want to implement multi-tenancy for Lustre. So I would prefer that this patch is reverted from master for now, and sanity-sec test_31 removed from the ALWAYS_EXCEPT list. |
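
If anyone wants to poke at the stalled device, here is a diagnostic sketch based on the transcript above; the import parameter path and passing the device name to --device are assumptions, so adjust for your version:

# List configured devices; the MGC for the tcp999 NID is the stuck one
lctl dl

# Inspect the MGC import state (parameter path is an assumption)
lctl get_param mgc.*.import

# Force-clean the stalled MGC device; this is the step that hangs here
lctl --device MGC10.128.11.155@tcp999 cleanup force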
| Comment by Amir Shehata (Inactive) [ 08/Apr/20 ] |
|
I've been working on debugging this issue. I'd like some time to narrow down the reason for the regression before reverting the patch. At least I'd like to have clarity on why this behaviour breaks the network option. |
| Comment by Amir Shehata (Inactive) [ 08/Apr/20 ] |
|
I'm currently working on a resolution for this problem. I'll push it as a separate patch from the original one. |
| Comment by Gerrit Updater [ 15/Apr/20 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38229 |
| Comment by Gerrit Updater [ 23/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38229/ |
| Comment by Andreas Dilger [ 22/May/20 ] |
|
I hit a timeout with sanity-sec test_31 now that this test is running again. |
| Comment by Andreas Dilger [ 29/May/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/2c04ca53-15d2-4ba3-bd71-e9171621fd6f |
| Comment by Sergey Cheremencev [ 30/May/23 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/5b24d1b2-ab42-4509-ad28-6ba724c38368 |
| Comment by Sergey Cheremencev [ 01/Jun/23 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/1bfa66fa-90ed-431e-8b30-4c4bf4ce2782 |