[LU-13065] runtests test 1 hangs when unmounting an OST Created: 11/Dec/19  Updated: 20/May/20  Resolved: 20/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.4
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The last lines seen in the suite_log are of successfully unmounting all MDTs and then trying to unmount the first OST

Stopping /mnt/lustre-mds4 (opts:-f) on trevis-41vm5
CMD: trevis-41vm5 umount -d -f /mnt/lustre-mds4
CMD: trevis-41vm5 lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
CMD: trevis-41vm3 grep -c /mnt/lustre-ost1' ' /proc/mounts || true
Stopping /mnt/lustre-ost1 (opts:-f) on trevis-41vm3
CMD: trevis-41vm3 umount -d -f /mnt/lustre-ost1

Looking at the OSS (vm3) console log, we see some errors

[ 2211.277406] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
[ 2211.641558] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
[ 2217.460836] Lustre: 19898:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1575564042/real 1575564042]  req@ffff9d55fbd71200 x1652096492205632/t0(0) o400->lustre-MDT0003-lwp-OST0000@10.9.5.241@tcp:12/10 lens 224/224 e 0 to 1 dl 1575564049 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 2217.465612] Lustre: lustre-MDT0003-lwp-OST0000: Connection to lustre-MDT0003 (at 10.9.5.241@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 2217.468256] Lustre: Skipped 7 previous similar messages
[ 2222.468838] Lustre: 19897:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1575564047/real 1575564047]  req@ffff9d55dc9a8d80 x1652096492207552/t0(0) o400->lustre-MDT0003-lwp-OST0001@10.9.5.241@tcp:12/10 lens 224/224 e 0 to 1 dl 1575564054 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 2222.473635] Lustre: 19897:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[ 2231.827828] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2231.829959] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2231.833886] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2251.835838] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2251.838023] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2251.841978] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2271.843826] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2271.845969] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2271.849853] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2291.851824] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2291.853972] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2291.857887] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2310.528031] LustreError: 16010:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-108
[ 2310.530727] LustreError: 16010:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 1 previous similar message
[ 2311.859831] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2311.862004] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2311.865920] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2331.867835] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2331.869992] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2331.873865] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2351.875834] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2351.877966] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2351.881870] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2391.883829] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2391.885979] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) Skipped 1 previous similar message
[ 2391.887591] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2391.891701] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) Skipped 1 previous similar message
[ 2391.893309] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2391.895855] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) Skipped 1 previous similar message
[ 2430.528010] LustreError: 16018:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1020000:0x0], rc:-108
[ 2430.528013] LustreError: 16017:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-108
[ 2471.896826] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: rc = -110 waiting for callback (1 != 0)
[ 2471.898998] LustreError: 16002:0:(import.c:361:ptlrpc_invalidate_import()) Skipped 3 previous similar messages
[ 2471.900645] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9d55f7eeed00 x1652096491324800/t0(0) o38->lustre-MDT0001-lwp-OST0000@10.9.5.241@tcp:12/10 lens 520/544 e 0 to 1 dl 1575561999 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
[ 2471.904667] LustreError: 16002:0:(import.c:387:ptlrpc_invalidate_import()) Skipped 3 previous similar messages
[ 2471.906340] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
[ 2471.908985] LustreError: 16002:0:(import.c:401:ptlrpc_invalidate_import()) Skipped 3 previous similar messages
[ 2550.528021] LustreError: 16027:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1020000:0x0], rc:-108

In the past four months, we’ve seen runtests test_1 hangs on OST unmount three times
2.12.4 - https://testing.whamcloud.com/test_sets/e7093746-18bf-11ea-9487-52540065bddc

The following two hangs do not have the “enqueue global quota lock” error messages in the OSS console log
2.12.57.50 - https://testing.whamcloud.com/test_sets/b9d44252-cf7c-11e9-9fc9-52540065bddc
2.12.56.88 - https://testing.whamcloud.com/test_sets/da06134a-c37c-11e9-90ad-52540065bddc



 Comments   
Comment by Gerrit Updater [ 12/Dec/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37004
Subject: LU-13065 tests: Reproducer
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 3c0e574be19816599a09b87a856cd6f103eee0b9

Comment by Gerrit Updater [ 13/Dec/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37016
Subject: LU-13065 tests: clean up runtests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f69763fcb0e8222c222dc9c29c6e6755630dabb3

Comment by Gerrit Updater [ 20/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37016/
Subject: LU-13065 tests: clean up runtests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 73e14f5d1015e2b8955f0e9f4b09323f61aa594f

Comment by Peter Jones [ 20/May/20 ]

Landed for 2.14

Generated at Sat Feb 10 02:58:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.