[LU-17123] sanityn test_77k: dd (write) failed (2) Created: 18/Sep/23  Updated: 18/Sep/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for eaujames <eaujames@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/10fb1a42-c7cc-4565-8770-45ec78b8d958

test_77k failed with the following error:

dd (write) failed (2)

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/98520 - 4.18.0-477.15.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/98520 - 4.18.0-477.15.1.el8_lustre.x86_64
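
For local reproduction, the failing subtest can normally be re-run in isolation through the Lustre test framework. The paths and the pre-mounted test filesystem below are assumptions about a default developer setup, not details taken from this session:

  # minimal sketch, assuming a lustre-release checkout and a test
  # filesystem already set up (e.g. via llmount.sh)
  cd lustre/tests
  ONLY=77k sh sanityn.sh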

The OSS seems to lose all connections, and the "ost" service threads hang waiting for locks (a debug-capture sketch follows the log excerpt):

[16115.251910] Lustre: 383999:0:(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1694831259/real 1694831259]  req@00000000d41b4382 x1777142665919616/t0(0) o104->lustre-OST0000@10.240.25.236@tcp:15/16 lens 328/224 e 0 to 1 dl 1694831275 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
[16119.347902] Lustre: lustre-MDT0001-lwp-OST0002: Connection to lustre-MDT0001 (at 10.240.25.245@tcp) was lost; in progress operations using this service will wait for recovery to complete
[16119.350544] Lustre: Skipped 7 previous similar messages
[16119.351565] LustreError: 166-1: MGC10.240.25.238@tcp: Connection to MGS (at 10.240.25.238@tcp) was lost; in progress operations using this service will fail
[16134.323951] Lustre: lustre-OST0002: Client lustre-MDT0000-mdtlov_UUID (at 10.240.25.238@tcp) reconnecting
[16134.325447] Lustre: lustre-OST0002: Export 00000000ed0e2564 already connecting from 10.240.25.238@tcp
[16134.325556] Lustre: Skipped 21 previous similar messages
[16139.827790] Lustre: ll_ost00_053: service thread pid 383999 was inactive for 40.114 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[16139.827809] Pid: 325900, comm: ll_ost00_011 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Fri Sep 1 20:56:46 UTC 2023
[16139.830938] Lustre: Skipped 1 previous similar message
[16139.832607] Call Trace TBD:
[16139.834146] [<0>] ptlrpc_set_wait+0x5c0/0x740 [ptlrpc]
[16139.835473] [<0>] ldlm_run_ast_work+0xda/0x3f0 [ptlrpc]
[16139.836444] [<0>] ldlm_handle_conflict_lock+0x6b/0x2f0 [ptlrpc]
[16139.837520] [<0>] ldlm_lock_enqueue+0x253/0xaf0 [ptlrpc]
[16139.838499] [<0>] ldlm_handle_enqueue+0x645/0x1870 [ptlrpc]
[16139.839528] [<0>] tgt_enqueue+0xa8/0x230 [ptlrpc]
[16139.840437] [<0>] tgt_request_handle+0xd20/0x19c0 [ptlrpc]
[16139.841444] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[16139.842562] [<0>] ptlrpc_main+0xc91/0x15a0 [ptlrpc]
[16139.843468] [<0>] kthread+0x134/0x150
[16139.844168] [<0>] ret_from_fork+0x35/0x40
[16139.844896] Pid: 383999, comm: ll_ost00_053 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Fri Sep 1 20:56:46 UTC 2023
[16139.846581] Call Trace TBD:
...
[16142.259860] LNetError: 17064:0:(socklnd.c:1521:ksocknal_destroy_conn()) Completing partial receive from 12345-10.240.25.236@tcp[2], ip 10.240.25.236:1021, with error, wanted: 48240, left: 48240, last alive is 0 secs ago
[16142.263196] LustreError: 17064:0:(events.c:474:server_bulk_callback()) event type 3, status -5, desc 00000000076efbf0
[16142.265041] LustreError: 17064:0:(events.c:315:request_in_callback()) event type 2, status -5, service ost
[16142.265276] LustreError: 307914:0:(ldlm_lib.c:3593:target_bulk_io()) @@@ network error on bulk WRITE  req@0000000067cf65dd x1777142715013440/t0(0) o4->50c0edd0-63d0-45e8-a85d-05f2f1de4c75@10.240.25.236@tcp:512/0 lens 488/448 e 0 to 0 dl 1694831307 ref 1 fl Interpret:/200/0 rc 0/0 job:'dd.500' uid:500 gid:500
[16142.271134] Lustre: lustre-OST0005: Bulk IO write error with 50c0edd0-63d0-45e8-a85d-05f2f1de4c75 (at 10.240.25.236@tcp), client will retry: rc = -110
[16142.272779] LustreError: 383997:0:(pack_generic.c:631:__lustre_unpack_msg()) message length 0 too small for magic/version check
[16142.273297] Lustre: Skipped 2 previous similar messages
[16142.276201] LustreError: 383997:0:(sec.c:2245:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.240.25.236@tcp x1777142715014848
[16145.015333] Lustre: Evicted from MGS (at 10.240.25.238@tcp) after server handle changed from 0xa2d887d3e928fa8 to 0xa2d887d3e9eff12
[16145.018407] Lustre: MGC10.240.25.238@tcp: Connection restored to  (at 10.240.25.238@tcp)
[16145.019799] Lustre: Skipped 8 previous similar messages
[16150.705145] Lustre: lustre-OST0001: haven't heard from client 50c0edd0-63d0-45e8-a85d-05f2f1de4c75 (at 10.240.25.236@tcp) in 51 seconds. I think it's dead, and I am evicting it. exp 00000000804080c8, cur 1694831310 expire 1694831280 last 1694831259
[16150.708849] LustreError: 273324:0:(client.c:1258:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@00000000c65bbbb1 x1777142665944064/t0(0) o105->lustre-OST0001@10.240.25.236@tcp:15/16 lens 392/224 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
[16150.712813] LustreError: 273324:0:(client.c:1258:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
[16152.179729] LustreError: 307904:0:(ldlm_lib.c:3581:target_bulk_io()) @@@ Eviction on bulk WRITE  req@0000000068165803 x1777142715013696/t0(0) o4->50c0edd0-63d0-45e8-a85d-05f2f1de4c75@10.240.25.236@tcp:538/0 lens 488/448 e 1 to 0 dl 1694831333 ref 1 fl Interpret:/200/0 rc 0/0 job:'dd.500' uid:500 gid:500
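
If the hang reproduces, capturing Lustre debug logs on the OSS while the ost threads are stuck would likely help. The debug flags, buffer size, and output path below are only an illustrative sketch, not what was collected in this run:

  # sketch: enlarge the debug buffer and enable RPC/LDLM/net tracing on the OSS
  lctl set_param debug_mb=256
  lctl set_param debug=+rpctrace
  lctl set_param debug=+dlmtrace
  lctl set_param debug=+net
  # once the "service thread ... was inactive" messages appear,
  # dump the kernel debug buffer to a file
  lctl dk /tmp/oss-debug.log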

Possible network issue?

LNetError: 17064:0:(socklnd.c:1521:ksocknal_destroy_conn()) Completing partial receive from 12345-10.240.25.236@tcp[2], ip 10.240.25.236:1021, with error, wanted: 48240, left: 48240, last alive is 0 secs ago
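
To confirm or rule out an LNet-level problem between the OSS and the client at 10.240.25.236@tcp, basic reachability and socket statistics could be checked from the OSS while the failure is happening. The NID is taken from the log above; the commands are generic LNet diagnostics, not output from this session:

  # sketch: LNet reachability and statistics from the OSS side
  lctl ping 10.240.25.236@tcp
  lnetctl ping 10.240.25.236@tcp
  lnetctl stats show
  lnetctl net show -v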

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanityn test_77k - dd (write) failed (2)

