recovery-small: ll_ost00 - service thread hangs.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.15.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for Cliff White <cwhite@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4784602c-4d77-42e2-919b-a194a0137d91

      The test fails because the client times out waiting for the FULL state.
      This appears to be caused by a service thread hanging on one node:

      [ 1927.612591] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small test 26a: evict dead exports =========== 09:10:53 \(1649063453\)
      [ 1928.007926] Lustre: DEBUG MARKER: == recovery-small test 26a: evict dead exports =========== 09:10:53 (1649063453)
      [ 1974.360682] Lustre: ll_ost00_004: service thread pid 11008 was inactive for 43.187 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 1974.364718] Pid: 11008, comm: ll_ost00_004 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Sun Apr 3 16:16:31 UTC 2022
      [ 1974.366773] Call Trace TBD:
      [ 1974.367500] [<0>] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
      [ 1974.368739] [<0>] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
      [ 1974.369924] [<0>] ofd_destroy_by_fid+0x235/0x4a0 [ofd]
      [ 1974.370992] [<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
      [ 1974.372045] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
      [ 1974.373224] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
      [ 1974.374523] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
      [ 1974.375548] [<0>] kthread+0x116/0x130
      [ 1974.376336] [<0>] ret_from_fork+0x35/0x40
      [ 1974.664781] Lustre: lustre-OST0005: haven't heard from client 0b624cdc-fcec-4fec-b859-486e2bb9b84b (at 10.240.40.108@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp 000000003f573f19, cur 1649063501 expire 1649063471 last 1649063454
      [ 2020.832883] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_26a: @@@@@@ FAIL: lustre-OST0000-osc-ffff8f8645de7800 state is not FULL 
      [ 2021.200277] Lustre: DEBUG MARKER: recovery-small test_26a: @@@@@@ FAIL: lustre-OST0000-osc-ffff8f8645de7800 state is not FULL
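
      When triaging logs like the one above, it can help to pull out the watchdog report and the stack frames that follow it. The following is a minimal sketch, assuming the console log uses exactly the message format shown in this ticket; the function name find_hung_threads and the SAMPLE excerpt are illustrative only and are not part of any Lustre tool.

```python
import re

# Watchdog message as it appears in this ticket's console log.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+Lustre: (?P<thread>\S+): service thread pid "
    r"(?P<pid>\d+) was inactive for (?P<secs>\d+(?:\.\d+)?) seconds"
)
# Stack frame line such as "[<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]".
FRAME_RE = re.compile(
    r"\[<0>\]\s+(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)

# Excerpt copied from the log in this ticket (frames shortened to four).
SAMPLE = """\
[ 1974.360682] Lustre: ll_ost00_004: service thread pid 11008 was inactive for 43.187 seconds. The thread might be hung
[ 1974.364718] Pid: 11008, comm: ll_ost00_004 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Sun Apr 3 16:16:31 UTC 2022
[ 1974.366773] Call Trace TBD:
[ 1974.367500] [<0>] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[ 1974.368739] [<0>] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
[ 1974.369924] [<0>] ofd_destroy_by_fid+0x235/0x4a0 [ofd]
[ 1974.370992] [<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[ 1974.664781] Lustre: lustre-OST0005: haven't heard from client 0b624cdc-fcec-4fec-b859-486e2bb9b84b
"""


def find_hung_threads(lines):
    """Return one dict per watchdog report, with the stack symbols that follow it."""
    reports = []
    current = None
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "thread": hung["thread"],
                "pid": int(hung["pid"]),
                "inactive_s": float(hung["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append(frame["sym"])
        elif current is not None and current["frames"]:
            # First non-frame line after at least one frame ends the trace.
            current = None
    return reports


if __name__ == "__main__":
    for report in find_hung_threads(SAMPLE.splitlines()):
        print(report["thread"], report["pid"], " <- ".join(report["frames"]))
```

      In the excerpt, the top frames show the thread waiting in ldlm_completion_ast under ofd_destroy_by_fid, i.e. an OST destroy request waiting for a lock.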
      

          Activity

            [LU-15737] recovery-small: ll_ost00 - service thread hangs.
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-18392 [ LU-18392 ]

            aboyko Alexander Boyko added a comment - The question was about whether a blocking AST is sent or not. Never mind: an enqueue with LDLM_FL_BLOCK_NOWAIT will send a blocking AST.

            bzzz Alex Zhuravlev added a comment - Not sure why you mentioned LDLM_FL_SPECULATIVE; before the patch it was LDLM_FL_AST_DISCARD_DATA.

            aboyko Alexander Boyko added a comment - I think it is an enqueue with LDLM_FL_SPECULATIVE that does not send a blocking AST, not one with LDLM_FL_BLOCK_NOWAIT.

            bzzz Alex Zhuravlev added a comment - Hmm, will ldlm send a blocking AST to the client if LDLM_FL_BLOCK_NOWAIT is specified at enqueue? If not, will unlink-close keep orphaned data in the client's cache?
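
            The question in the comments above can be illustrated with a toy model. This is a sketch only, not Lustre's ldlm code: the class ToyResource, its methods, and the return strings are all hypothetical. The rule that conflicting holders are notified in both modes is taken from the conclusion of this comment thread and has not been checked against the source here.

```python
class ToyResource:
    """Toy stand-in for a lock resource; NOT Lustre's ldlm implementation."""

    def __init__(self):
        self.holders = []        # clients holding a conflicting lock
        self.blocking_asts = []  # clients that have been asked to cancel

    def grant(self, client):
        """Record that a client holds a conflicting lock."""
        self.holders.append(client)

    def cancel(self, client):
        """Record that a client dropped its lock (or was evicted)."""
        self.holders.remove(client)

    def enqueue(self, nowait):
        """Try to take an exclusive lock on behalf of the server.

        Returns "granted" when nothing conflicts. On conflict it returns
        "would_block" if nowait is True, else "waiting".
        """
        if not self.holders:
            return "granted"
        # Assumption from the thread: holders are notified in both modes.
        for client in self.holders:
            if client not in self.blocking_asts:
                self.blocking_asts.append(client)
        return "would_block" if nowait else "waiting"


if __name__ == "__main__":
    res = ToyResource()
    res.grant("client-A")
    print(res.enqueue(nowait=False))  # caller would sleep until cancel/eviction
    print(res.enqueue(nowait=True))   # caller returns immediately
    print(res.blocking_asts)
    res.cancel("client-A")
    print(res.enqueue(nowait=True))
```

            In this model the "waiting" case corresponds to the service thread in the stack trace, which sits in the enqueue until the unresponsive client is evicted, while the no-wait case lets the caller return at once and retry later.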
            bobijam Zhenyu Xu made changes -
            Link New: This issue is related to DDN-5430 [ DDN-5430 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]
            Assignee Original: WC Triage [ wc-triage ] New: Alexander Boyko [ aboyko ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment - Landed for 2.16

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55598/
            Subject: LU-15737 ofd: don't block destroys
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 27f787daa7f25f1f14f8e041582ef969f87cd77a

            gerrit Gerrit Updater added a comment - "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55598
            Subject: LU-15737 ofd: don't block destroys
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f1dd736c3e4b0828aa7f932dbcba83284cf07472

            People

              Assignee: aboyko Alexander Boyko
              Reporter: maloo Maloo