LU-17123

sanityn test_77k: dd (write) failed (2)


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Severity: 3

    Description

      This issue was created by maloo for eaujames <eaujames@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/10fb1a42-c7cc-4565-8770-45ec78b8d958

      test_77k failed with the following error:

      dd (write) failed (2)
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/98520 - 4.18.0-477.15.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/98520 - 4.18.0-477.15.1.el8_lustre.x86_64
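
      For context on how the failing write is likely issued: the job id 'dd.500' and uid/gid 500 in the OSS logs below suggest the test runs dd as an unprivileged user through the Lustre runas wrapper. The sketch below is only an illustrative guess at that pattern, not the actual test_77k code; the file path, block size, count and conv flag are placeholders.

      # Hypothetical sketch of the failing write (assumed, not taken from sanityn):
      # run dd as uid/gid 500, matching job:'dd.500' uid:500 gid:500 in the logs.
      runas -u 500 -g 500 dd if=/dev/zero of=/mnt/lustre/f77k.sanityn \
          bs=1M count=100 conv=fsync ||
          error "dd (write) failed ($?)"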

      The OSS seems to lose all connections, and "ost" threads hang waiting for locks:

      [16115.251910] Lustre: 383999:0:(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1694831259/real 1694831259]  req@00000000d41b4382 x1777142665919616/t0(0) o104->lustre-OST0000@10.240.25.236@tcp:15/16 lens 328/224 e 0 to 1 dl 1694831275 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
      [16119.347902] Lustre: lustre-MDT0001-lwp-OST0002: Connection to lustre-MDT0001 (at 10.240.25.245@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [16119.350544] Lustre: Skipped 7 previous similar messages
      [16119.351565] LustreError: 166-1: MGC10.240.25.238@tcp: Connection to MGS (at 10.240.25.238@tcp) was lost; in progress operations using this service will fail
      [16134.323951] Lustre: lustre-OST0002: Client lustre-MDT0000-mdtlov_UUID (at 10.240.25.238@tcp) reconnecting
      [16134.325447] Lustre: lustre-OST0002: Export 00000000ed0e2564 already connecting from 10.240.25.238@tcp
      [16134.325556] Lustre: Skipped 21 previous similar messages
      [16139.827790] Lustre: ll_ost00_053: service thread pid 383999 was inactive for 40.114 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [16139.827809] Pid: 325900, comm: ll_ost00_011 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Fri Sep 1 20:56:46 UTC 2023
      [16139.830938] Lustre: Skipped 1 previous similar message
      [16139.832607] Call Trace TBD:
      [16139.834146] [<0>] ptlrpc_set_wait+0x5c0/0x740 [ptlrpc]
      [16139.835473] [<0>] ldlm_run_ast_work+0xda/0x3f0 [ptlrpc]
      [16139.836444] [<0>] ldlm_handle_conflict_lock+0x6b/0x2f0 [ptlrpc]
      [16139.837520] [<0>] ldlm_lock_enqueue+0x253/0xaf0 [ptlrpc]
      [16139.838499] [<0>] ldlm_handle_enqueue+0x645/0x1870 [ptlrpc]
      [16139.839528] [<0>] tgt_enqueue+0xa8/0x230 [ptlrpc]
      [16139.840437] [<0>] tgt_request_handle+0xd20/0x19c0 [ptlrpc]
      [16139.841444] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
      [16139.842562] [<0>] ptlrpc_main+0xc91/0x15a0 [ptlrpc]
      [16139.843468] [<0>] kthread+0x134/0x150
      [16139.844168] [<0>] ret_from_fork+0x35/0x40
      [16139.844896] Pid: 383999, comm: ll_ost00_053 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Fri Sep 1 20:56:46 UTC 2023
      [16139.846581] Call Trace TBD:
      ...
      [16142.259860] LNetError: 17064:0:(socklnd.c:1521:ksocknal_destroy_conn()) Completing partial receive from 12345-10.240.25.236@tcp[2], ip 10.240.25.236:1021, with error, wanted: 48240, left: 48240, last alive is 0 secs ago
      [16142.263196] LustreError: 17064:0:(events.c:474:server_bulk_callback()) event type 3, status -5, desc 00000000076efbf0
      [16142.265041] LustreError: 17064:0:(events.c:315:request_in_callback()) event type 2, status -5, service ost
      [16142.265276] LustreError: 307914:0:(ldlm_lib.c:3593:target_bulk_io()) @@@ network error on bulk WRITE  req@0000000067cf65dd x1777142715013440/t0(0) o4->50c0edd0-63d0-45e8-a85d-05f2f1de4c75@10.240.25.236@tcp:512/0 lens 488/448 e 0 to 0 dl 1694831307 ref 1 fl Interpret:/200/0 rc 0/0 job:'dd.500' uid:500 gid:500
      [16142.271134] Lustre: lustre-OST0005: Bulk IO write error with 50c0edd0-63d0-45e8-a85d-05f2f1de4c75 (at 10.240.25.236@tcp), client will retry: rc = -110
      [16142.272779] LustreError: 383997:0:(pack_generic.c:631:__lustre_unpack_msg()) message length 0 too small for magic/version check
      [16142.273297] Lustre: Skipped 2 previous similar messages
      [16142.276201] LustreError: 383997:0:(sec.c:2245:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.240.25.236@tcp x1777142715014848
      [16145.015333] Lustre: Evicted from MGS (at 10.240.25.238@tcp) after server handle changed from 0xa2d887d3e928fa8 to 0xa2d887d3e9eff12
      [16145.018407] Lustre: MGC10.240.25.238@tcp: Connection restored to  (at 10.240.25.238@tcp)
      [16145.019799] Lustre: Skipped 8 previous similar messages
      [16150.705145] Lustre: lustre-OST0001: haven't heard from client 50c0edd0-63d0-45e8-a85d-05f2f1de4c75 (at 10.240.25.236@tcp) in 51 seconds. I think it's dead, and I am evicting it. exp 00000000804080c8, cur 1694831310 expire 1694831280 last 1694831259
      [16150.708849] LustreError: 273324:0:(client.c:1258:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@00000000c65bbbb1 x1777142665944064/t0(0) o105->lustre-OST0001@10.240.25.236@tcp:15/16 lens 392/224 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
      [16150.712813] LustreError: 273324:0:(client.c:1258:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
      [16152.179729] LustreError: 307904:0:(ldlm_lib.c:3581:target_bulk_io()) @@@ Eviction on bulk WRITE  req@0000000068165803 x1777142715013696/t0(0) o4->50c0edd0-63d0-45e8-a85d-05f2f1de4c75@10.240.25.236@tcp:538/0 lens 488/448 e 1 to 0 dl 1694831333 ref 1 fl Interpret:/200/0 rc 0/0 job:'dd.500' uid:500 gid:500
      

      Possible network issue:

      LNetError: 17064:0:(socklnd.c:1521:ksocknal_destroy_conn()) Completing partial receive from 12345-10.240.25.236@tcp[2], ip 10.240.25.236:1021, with error, wanted: 48240, left: 48240, last alive is 0 secs ago
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanityn test_77k - dd (write) failed (2)


          People

            Assignee: WC Triage (wc-triage)
            Reporter: Maloo (maloo)
            Votes: 0
            Watchers: 2
