Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
None
-
None
-
3
-
9223372036854775807
Description
This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1bf6a37a-7821-11e9-a028-52540065bddc
CMD: trevis-38vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4
trevis-38vm8: == rpc test complete, duration -o sec ================================================================ 19:32:45 (1558035165)
trevis-38vm8: trevis-38vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
CMD: trevis-38vm8 e2label /dev/mapper/ost8_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: trevis-38vm8 e2label /dev/mapper/ost8_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: trevis-38vm8 e2label /dev/mapper/ost8_flakey 2>/dev/null
Started lustre-OST0007
CMD: trevis-38vm9 /usr/sbin/lctl list_nids | grep tcp999
Starting client: trevis-38vm6.trevis.whamcloud.com: -o user_xattr,flock,network=tcp999 10.9.3.145@tcp999:/lustre /mnt/lustre
CMD: trevis-38vm6.trevis.whamcloud.com mkdir -p /mnt/lustre
CMD: trevis-38vm6.trevis.whamcloud.com mount -t lustre -o user_xattr,flock,network=tcp999 10.9.3.145@tcp999:/lustre /mnt/lustre
mount.lustre: mount 10.9.3.145@tcp999:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
unconfigure:
    - lnet:
          errno: -16
          descr: "LNet unconfigure error: Device or resource busy"
[17996.736209] Lustre: DEBUG MARKER: == sanity-sec test 31: client mount option '-o network' ============================================== 19:30:04 (1558035004)
[17997.693592] Lustre: DEBUG MARKER: lctl get_param -n *.lustre*.exports.'10.9.5.215@tcp'.uuid 2>/dev/null | grep -q -
[17998.217952] Lustre: DEBUG MARKER: /usr/sbin/lnetctl lnet configure && /usr/sbin/lnetctl net add --if eth0 --net tcp999
[17998.557153] LNet: Added LNI 10.9.3.146@tcp999 [8/256/0/180]
[18000.237970] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation mds_statfs to node 10.9.3.145@tcp failed: rc = -107
[18000.239925] LustreError: Skipped 9 previous similar messages
[18000.240888] Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 10.9.3.145@tcp) was lost; in progress operations using this service will wait for recovery to complete
[18000.243842] Lustre: Skipped 18 previous similar messages
[18007.616779] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
[18007.922846] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds2
[18009.125483] Lustre: lustre-MDT0001: Not available for connect from 10.9.3.145@tcp (stopping)
[18009.127096] Lustre: Skipped 42 previous similar messages
[18011.260546] LustreError: 17495:0:(client.c:1183:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff98fcd0168d80 x1633699486628912/t0(0) o41->lustre-MDT0003-osp-MDT0001@0@lo:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[18011.264135] LustreError: 17495:0:(client.c:1183:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
[18015.357668] Lustre: server umount lustre-MDT0001 complete
[18015.358716] Lustre: Skipped 1 previous similar message
[18016.092394] LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_24v - Timeout occurred after 343 mins, last suite running was sanity-sec, restarting cluster to continue tests
Attachments
Issue Links
- is blocking
  LU-9667 LNet Kernel/Userspace Interface (Open)
- is related to
  LU-12688 sanity-sec test 31 fails with 'unable to configure NID o2ib999' (Resolved)
  LU-13028 LNet Discovery: toggling discovery on/off is not handled properly (Resolved)
- is related to
  LU-15675 Interop sanity-sec test_27a: fileset not taken into account (Resolved)
I've been working on debugging this issue. I'd like some time to narrow down the cause of the regression before reverting the patch. At the very least, I'd like to understand why this behaviour breaks the network option.