Lustre / LU-19340

tgt_checksum_niobuf_t10pi causes direct reclaim and MDT hangs


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Medium
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.14.0
    • Labels: None
    • Severity: 3

    Description

      [ 5411.609485] Lustre: mdt_io04_043: service thread pid 50444 was inactive for 202.321 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 5411.609485] Lustre: mdt_io04_019: service thread pid 50259 was inactive for 202.322 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 5411.609487] Pid: 50411, comm: mdt_io06_037 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1 SMP Thu Jun 26 09:41:51 UTC 2025
      [ 5411.609490] Lustre: Skipped 1 previous similar message
      [ 5411.613409] Lustre: Skipped 2 previous similar messages
      [ 5411.616096] Call Trace TBD:
      [ 5411.616112] [<0>] lu_cache_shrink_count+0x22/0x130 [obdclass]
      [ 5411.621324] [<0>] do_shrink_slab+0x3a/0x330
      [ 5411.622082] [<0>] shrink_slab+0xbe/0x2f0
      [ 5411.622824] [<0>] shrink_node+0x257/0x710
      [ 5411.623539] [<0>] do_try_to_free_pages+0xd8/0x4c0
      [ 5411.624336] [<0>] try_to_free_pages+0xf3/0x1d0
      [ 5411.625096] [<0>] __alloc_pages_slowpath+0x3e2/0xcd0
      [ 5411.625927] [<0>] __alloc_pages_nodemask+0x2e2/0x330
      [ 5411.626754] [<0>] tgt_checksum_niobuf_t10pi+0x7c/0xc80 [ptlrpc]
      [ 5411.627763] [<0>] tgt_checksum_niobuf_rw+0xae/0x7e0 [ptlrpc]
      [ 5411.628735] [<0>] tgt_brw_write+0x1456/0x17c0 [ptlrpc]
      [ 5411.629628] [<0>] tgt_request_handle+0xc9c/0x1970 [ptlrpc]
      [ 5411.630567] [<0>] ptlrpc_server_handle_request+0x346/0xc70 [ptlrpc]
      [ 5411.631610] [<0>] ptlrpc_main+0xb45/0x13a0 [ptlrpc]
      [ 5411.632476] [<0>] kthread+0x134/0x150
      [ 5411.633130] [<0>] ret_from_fork+0x1f/0x40
      [ 5411.633844] Pid: 50528, comm: mdt_io00_061 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1 SMP Thu Jun 26 09:41:51 UTC 2025
      [ 5411.635544] Call Trace TBD:
      [ 5411.636193] [<0>] lu_cache_shrink_count+0x22/0x130 [obdclass]
      [ 5411.637202] [<0>] do_shrink_slab+0x3a/0x330
      [ 5411.637950] [<0>] shrink_slab+0xbe/0x2f0
      [ 5411.638650] [<0>] shrink_node+0x257/0x710
      [ 5411.639350] [<0>] do_try_to_free_pages+0xd8/0x4c0
      [ 5411.640122] [<0>] try_to_free_pages+0xf3/0x1d0
      [ 5411.640855] [<0>] __alloc_pages_slowpath+0x3e2/0xcd0
      [ 5411.641648] [<0>] __alloc_pages_nodemask+0x2e2/0x330
      [ 5411.642423] [<0>] tgt_checksum_niobuf_t10pi+0x7c/0xc80 [ptlrpc]
      [ 5411.643374] [<0>] tgt_checksum_niobuf_rw+0xae/0x7e0 [ptlrpc]
      [ 5411.644283] [<0>] tgt_brw_write+0x1456/0x17c0 [ptlrpc]
      [ 5411.645123] [<0>] tgt_request_handle+0xc9c/0x1970 [ptlrpc]
      [ 5411.646007] [<0>] ptlrpc_server_handle_request+0x346/0xc70 [ptlrpc]
      [ 5411.646990] [<0>] ptlrpc_main+0xb45/0x13a0 [ptlrpc]
      [ 5411.647798] [<0>] kthread+0x134/0x150
      [ 5411.648410] [<0>] ret_from_fork+0x1f/0x40
      [ 5411.649075] Pid: 50444, comm: mdt_io04_043 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1 SMP Thu Jun 26 09:41:51 UTC 2025
      [ 5411.650813] Call Trace TBD:
      [ 5411.651402] [<0>] lu_cache_shrink_count+0x22/0x130 [obdclass]
      [ 5411.652384] [<0>] do_shrink_slab+0x3a/0x330
      [ 5411.653126] [<0>] shrink_slab+0xbe/0x2f0
      [ 5411.653835] [<0>] shrink_node+0x257/0x710
      [ 5411.654547] [<0>] do_try_to_free_pages+0xd8/0x4c0
      [ 5411.655327] [<0>] try_to_free_pages+0xf3/0x1d0
      [ 5411.656071] [<0>] __alloc_pages_slowpath+0x3e2/0xcd0
      [ 5411.656885] [<0>] __alloc_pages_nodemask+0x2e2/0x330
      [ 5411.657704] [<0>] tgt_checksum_niobuf_t10pi+0x7c/0xc80 [ptlrpc]
      [ 5411.658678] [<0>] tgt_checksum_niobuf_rw+0xae/0x7e0 [ptlrpc]
      [ 5411.659616] [<0>] tgt_brw_write+0x1456/0x17c0 [ptlrpc]
      [ 5411.660485] [<0>] tgt_request_handle+0xc9c/0x1970 [ptlrpc]
      [ 5411.661383] [<0>] ptlrpc_server_handle_request+0x346/0xc70 [ptlrpc]
      [ 5411.662406] [<0>] ptlrpc_main+0xb45/0x13a0 [ptlrpc]
      [ 5411.663254] [<0>] kthread+0x134/0x150
      [ 5411.663874] [<0>] ret_from_fork+0x1f/0x40
      [ 5423.897035] Lustre: mdt_io06_032: service thread pid 50400 was inactive for 202.377 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 5423.900155] Lustre: Skipped 134 previous similar messages
      [ 5432.088727] Lustre: mdt_io01_016: service thread pid 50248 was inactive for 200.313 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 5432.091220] Lustre: Skipped 12 previous similar messages
      [ 5436.184590] Lustre: mdt_io00_056: service thread pid 50519 was inactive for 200.031 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 5436.184590] Lustre: mdt_io00_022: service thread pid 50278 was inactive for 200.279 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 5436.184595] Lustre: Skipped 3 previous similar messages
      [ 5436.187269] Lustre: Skipped 28 previous similar messages
      [ 5440.280443] Lustre: mdt_io04_023: service thread pid 50302 was inactive for 203.455 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 5440.283892] Lustre: Skipped 55 previous similar messages
      [ 5452.568011] Lustre: mdt_io04_022: service thread pid 50296 was inactive for 201.636 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 5452.570546] Lustre: Skipped 220 previous similar messages
      [ 5469.991458] Lustre: mdt_io01_031: service thread pid 50330 completed after 233.335s. This likely indicates the system was overloaded (too many service threads, or not enough hardware resources).
      [ 5470.263510] Lustre: mdt_io02_010: service thread pid 50181 completed after 262.217s. This likely indicates the system was overloaded (too many service threads, or not enough hardware resources).
      [ 5470.832067] Lustre: mdt_io02_016: service thread pid 50260 completed after 234.679s. This likely indicates the system was overloaded (too many service threads, or not enough hardware resources).
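
      Every dumped thread shows the same pattern: tgt_checksum_niobuf_t10pi allocates a page
      (the __alloc_pages_nodemask frame directly beneath it), the allocation falls into the
      slow path under memory pressure, enters direct reclaim via try_to_free_pages(), and ends
      up in Lustre's own lu_object cache shrinker (lu_cache_shrink_count), where the mdt_io
      service threads stall for 200+ seconds. A minimal, hypothetical sketch of one way to keep
      such a hot-path allocation out of direct reclaim is below; it is not the actual LU-19340
      patch, and the pool size and helper names (t10pi_bounce_pool, t10pi_get_bounce_page) are
      assumptions for illustration only.

      /*
       * Hypothetical sketch, not the actual Lustre change: back the per-write
       * bounce-page allocation with a small mempool and a GFP mask that never
       * enters direct reclaim, so the checksum path cannot block in
       * try_to_free_pages()/shrink_slab().
       */
      #include <linux/errno.h>
      #include <linux/gfp.h>
      #include <linux/mempool.h>
      #include <linux/mm.h>

      #define T10PI_BOUNCE_POOL_SIZE 64   /* arbitrary reserve of order-0 pages */

      static mempool_t *t10pi_bounce_pool;

      static int t10pi_bounce_pool_init(void)
      {
              t10pi_bounce_pool = mempool_create_page_pool(T10PI_BOUNCE_POOL_SIZE, 0);
              return t10pi_bounce_pool ? 0 : -ENOMEM;
      }

      /*
       * GFP_NOWAIT carries no __GFP_DIRECT_RECLAIM, so the allocating thread
       * never calls try_to_free_pages() (and therefore never runs registered
       * shrinkers such as lu_cache_shrink_count). If the atomic allocation
       * fails, mempool_alloc() falls back to its preallocated reserve and
       * returns NULL only once that is exhausted as well; the caller must
       * handle NULL (e.g. fail the request with -ENOMEM).
       */
      static struct page *t10pi_get_bounce_page(void)
      {
              return mempool_alloc(t10pi_bounce_pool, GFP_NOWAIT | __GFP_NOWARN);
      }

      static void t10pi_put_bounce_page(struct page *pg)
      {
              mempool_free(pg, t10pi_bounce_pool);
      }

      static void t10pi_bounce_pool_fini(void)
      {
              mempool_destroy(t10pi_bounce_pool);
      }

      A mempool keeps the hot path non-blocking while still guaranteeing forward progress up to
      the reserve size. Merely switching the existing allocation to GFP_NOFS would still enter
      direct reclaim and shrink_slab(), so whether it avoids this particular stall depends on how
      the shrinker handles the gfp mask; avoiding direct reclaim altogether sidesteps that.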
        


          People

            Assignee: Sohei Koyama (skoyama)
            Reporter: Sohei Koyama (skoyama)
