[LU-10242] parallel-scale no sub tests failed: test failed to respond and timed out Created: 14/Nov/17  Updated: 08/Dec/17  Resolved: 08/Dec/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Casper Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

onyx, full
servers: el7.4, zfs, branch master, v2.10.55, b3667
clients: el7.4, branch master, v2.10.55, b3667


Issue Links:
Duplicate
duplicates LU-10045 sanity-lfsck no sub tests failed Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

session: https://testing.hpdd.intel.com/test_sessions/b2776b74-2819-4ec6-ae6c-a178cd39927a
test set: https://testing.hpdd.intel.com/test_sets/9cc1000e-c58a-11e7-a066-52540065bddc

From suite_log:

Stopping /mnt/lustre-ost1 (opts:-f) on onyx-40vm12
CMD: onyx-40vm12 umount -d -f /mnt/lustre-ost1
(end of log)
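
For context, the suite log stops at the test framework's OST teardown step. A minimal sketch of what that step does, reconstructed from the commands visible in this log and in the console log below (onyx-40vm12 and /mnt/lustre-ost1 come from this run; the ssh wrapper and variable names are illustrative assumptions, not the actual test-framework code):

    # Illustrative reconstruction of the "Stopping ..." step; the host
    # and mount point are from this run, the wrapper is an assumption.
    facet_host=onyx-40vm12
    mntpt=/mnt/lustre-ost1
    # Check whether the OST is still mounted on the facet host.
    ssh "$facet_host" "grep -c '$mntpt ' /proc/mounts"
    # Force the unmount: -f forces it, -d frees a backing loop device.
    # The suite log ends during this command.
    ssh "$facet_host" "umount -d -f $mntpt"

In this run the umount never completed, which is why the suite log ends here and the test set was reported as "test failed to respond and timed out".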


 Comments   
Comment by James Nunez (Inactive) [ 20/Nov/17 ]

If we look at the console log for the OSS (vm12) for the failure at https://testing.hpdd.intel.com/test_sets/9cc1000e-c58a-11e7-a066-52540065bddc, we see the forced unmount of lustre-ost1 issued and then a stream of connection timeouts to the MGS and failed quota lock enqueues against MDT0000:

[34864.377659] Lustre: DEBUG MARKER: == parallel-scale test complete, duration 1835 sec =================================================== 15:12:43 (1510240363)
[34878.315981] LustreError: 11-0: lustre-MDT0000-lwp-OST0000: operation obd_ping to node 10.2.8.127@tcp failed: rc = -107
[34878.316013] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.2.8.127@tcp) was lost; in progress operations using this service will wait for recovery to complete
[34878.316028] Lustre: Skipped 6 previous similar messages
[34878.319455] LustreError: Skipped 6 previous similar messages
[34902.824881] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts
[34904.206537] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
[34905.315050] Lustre: 11761:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1510240398/real 1510240398]  req@ffff8800461c5800 x1583565439895808/t0(0) o400->MGC10.2.8.127@tcp@10.2.8.127@tcp:26/25 lens 224/224 e 0 to 1 dl 1510240405 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[34905.317780] Lustre: 11761:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[34905.318748] LustreError: 166-1: MGC10.2.8.127@tcp: Connection to MGS (at 10.2.8.127@tcp) was lost; in progress operations using this service will fail
[34926.320052] Lustre: 11759:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1510240415/real 1510240415]  req@ffff8800461c6a00 x1583565439896080/t0(0) o250->MGC10.2.8.127@tcp@10.2.8.127@tcp:26/25 lens 520/544 e 0 to 1 dl 1510240426 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[34926.322773] Lustre: 11759:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[34971.320054] Lustre: 11759:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1510240450/real 1510240450]  req@ffff8800461c4300 x1583565439896208/t0(0) o250->MGC10.2.8.127@tcp@10.2.8.127@tcp:26/25 lens 520/544 e 0 to 1 dl 1510240471 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[34971.322745] Lustre: 11759:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
[34984.556626] LustreError: 12355:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8800461c6100 x1583565439896336/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.2.8.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[34984.559025] LustreError: 12355:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 10 previous similar messages
[34984.560138] LustreError: 12355:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
[35044.585482] LustreError: 12360:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8800461c6700 x1583565439896592/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.2.8.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[35044.587888] LustreError: 12360:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 1 previous similar message
[35044.588910] LustreError: 12360:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
[35044.590441] LustreError: 12360:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 1 previous similar message
[35060.320042] Lustre: 11759:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1510240535/real 1510240535]  req@ffff8800461c5200 x1583565439896576/t0(0) o250->MGC10.2.8.127@tcp@10.2.8.127@tcp:26/25 lens 520/544 e 0 to 1 dl 1510240560 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[35060.322774] Lustre: 11759:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
[35104.615868] LustreError: 12365:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88005fd81e00 x1583565439896848/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.2.8.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[35104.618097] LustreError: 12365:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 1 previous similar message
[35104.619145] LustreError: 12365:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
[35104.620594] LustreError: 12365:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 1 previous similar message
[35164.647730] LustreError: 12371:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88005fd81e00 x1583565439897104/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.2.8.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[35164.650020] LustreError: 12371:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 1 previous similar message
[35164.651038] LustreError: 12371:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
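
A note on the return codes above (my reading, not from the ticket): Lustre logs negative errno values, so the rc = -107 on the obd_ping is -ENOTCONN and the rc:-5 from qsd_reint_completion is -EIO, i.e. the quota reintegration thread keeps failing to enqueue the global quota lock because its import to the MDT is already closed (IMP_CLOSED). The errno mapping can be checked on any Linux node:

    # Decode the negative errno values seen in the console log.
    python3 -c 'import os; print(os.strerror(107)); print(os.strerror(5))'
    # -> Transport endpoint is not connected   (ENOTCONN, rc = -107)
    # -> Input/output error                    (EIO, rc:-5)

So the OSS appears to be stuck retrying MGS and quota traffic against connections that were already torn down, which is consistent with the umount hanging in the suite log.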
Comment by nasf (Inactive) [ 08/Dec/17 ]

This is another failure instance of LU-10045.
