Lustre / LU-14932

runtests: test_1 llog_cat_cleanup()) ASSERTION( index ) on MDS

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1ff8d9a5-c3da-4835-8739-9f790d3c2491

      test_1 crashed on the MDS with the following error:

      onyx-44vm9 crashed during runtests test_1
      
      LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) ASSERTION( index ) failed: 
      LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) LBUG
      Pid: 138526, comm: lod0001_rec0000 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Fri Jul 30 19:47:15 UTC 2021
      header
      Call Trace TBD:
      libcfs_call_trace+0x6f/0x90 [libcfs]
      lbug_with_loc+0x43/0x80 [libcfs]
      llog_cat_cleanup+0x391/0x3d0 [obdclass]
      llog_cat_close+0x193/0x210 [obdclass]
      lod_sub_recovery_th6+0x1e3/0xb40 [lod]
      kthread+0x112/0x130
      
      LustreError: 143361:0:(llog.c:1149:llog_write_rec()) lustre-MDT0000-osp-MDT0001: loghandle 0000000062d00541 with no 
      LustreError: 143361:0:(llog_cat.c:602:llog_cat_add_rec()) llog_write_rec -71: lh=0000000062d00541
      LustreError: 143361:0:(update_trans.c:1062:top_trans_stop()) lustre-MDT0000-osp-MDT0001: write updates failed: rc = -71
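
When triaging logs like the ones above, it can help to split each LustreError line into its fields so that messages can be grouped by thread. The following is a minimal sketch, assuming the line layout seen in this report (pid, a second numeric field, then source file, line number, and function in parentheses). The regular expression, the `parse_lustre_error` helper, and the `field2` name are illustrative assumptions, not part of any Lustre or Maloo tooling.

```python
import re

# Layout assumed from the lines quoted in this report:
#   LustreError: <pid>:<n>:(<file>:<line>:<function>()) <message>
# The meaning of the second numeric field is not stated here, so it is
# kept as an opaque string under the hypothetical name "field2".
LUSTRE_ERROR_RE = re.compile(
    r"LustreError: (?P<pid>\d+):(?P<field2>\d+):"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)


def parse_lustre_error(line):
    """Return a dict of fields for one LustreError line, or None if no match."""
    m = LUSTRE_ERROR_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    rec["line"] = int(rec["line"])
    rec["msg"] = rec["msg"].strip()
    return rec


# Lines copied from the first crash in this report.
SAMPLE = [
    "LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) ASSERTION( index ) failed: ",
    "LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) LBUG",
    "LustreError: 143361:0:(llog_cat.c:602:llog_cat_add_rec()) llog_write_rec -71: lh=0000000062d00541",
    "LustreError: 143361:0:(update_trans.c:1062:top_trans_stop()) lustre-MDT0000-osp-MDT0001: write updates failed: rc = -71",
]

# Group the reporting functions by pid to see which thread logged what.
funcs_by_pid = {}
for text in SAMPLE:
    rec = parse_lustre_error(text)
    if rec is not None:
        funcs_by_pid.setdefault(rec["pid"], []).append(rec["func"])

for pid, funcs in funcs_by_pid.items():
    print(pid, funcs)
```

Grouping by pid shows that the assertion and LBUG were logged by pid 138526, while the llog write failures with rc = -71 came from a different thread, pid 143361.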
      

      A second test had a similar MDS crash with a slightly different stack:
      https://testing.whamcloud.com/test_sets/366c2ba7-795e-4856-b4c4-9f2cce973618

      general protection fault: 0000 [#1] SMP PTI
      CPU: 0 PID: 139728 Comm: mdt00_002  4.18.0-240.22.1.el8_lustre.x86_64 #1
      RIP: 0010:__list_add_valid+0x10/0x50
      Call Trace:
       llog_cat_prep_log+0x311/0x3c0 [obdclass]
       llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
       llog_declare_add+0x187/0x1d0 [obdclass]
       top_trans_start+0x212/0x940 [ptlrpc]
       mdd_unlink+0x4a0/0xb30 [mdd]
       mdt_reint_unlink+0xb0c/0x12a0 [mdt]
       mdt_reint_rec+0x11f/0x250 [mdt]
       mdt_reint_internal+0x498/0x780 [mdt]
       mdt_reint+0x5e/0x100 [mdt]
       tgt_request_handle+0xc90/0x1940 [ptlrpc]
       ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
       ptlrpc_main+0xba2/0x1490 [ptlrpc]
      

      A third test crashed the MDS with a different operation, but also in llog list handling:
      https://testing.whamcloud.com/test_sets/b7099363-3b2c-4b7a-ad54-795ca4541ddc

      general protection fault: 0000 [#1] SMP PTI
      CPU: 0 PID: 138567 Comm: mdt00_002 4.18.0-240.22.1.el8_lustre.x86_64 #1
      RIP: 0010:__list_add_valid+0x10/0x50
      Call Trace:
       llog_cat_prep_log+0x311/0x3c0 [obdclass]
       llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
       llog_declare_add+0x187/0x1d0 [obdclass]
       top_trans_start+0x212/0x940 [ptlrpc]
       mdd_create+0xb42/0x1870 [mdd]
       mdt_create+0x7a7/0xc20 [mdt]
       mdt_reint_create+0x30b/0x3c0 [mdt]
       mdt_reint_rec+0x11f/0x250 [mdt]
       mdt_reint_internal+0x498/0x780 [mdt]
       mdt_reint+0x5e/0x100 [mdt]
       tgt_request_handle+0xc90/0x1940 [ptlrpc]
       ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
       ptlrpc_main+0xba2/0x1490 [ptlrpc]
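
To make the overlap between the second and third crashes concrete, the two call traces can be compared frame by frame from the innermost frame outward. This is a rough sketch, assuming the `function+0xoffset/0xsize [module]` frame layout shown above. The `parse_frames` and `common_innermost_frames` helpers are hypothetical names introduced for illustration.

```python
import re

# One frame per line: function+0xOFFSET/0xSIZE, optionally followed by [module].
FRAME_RE = re.compile(
    r"^\s*(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(trace_text):
    """Return a list of (function, module) tuples, innermost frame first."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((m.group("func"), m.group("module")))
    return frames


def common_innermost_frames(a, b):
    """Return the frames two traces share, starting at the innermost frame."""
    shared = []
    for frame_a, frame_b in zip(a, b):
        if frame_a != frame_b:
            break
        shared.append(frame_a)
    return shared


# Traces copied from the second (unlink) and third (create) crashes above.
UNLINK_TRACE = """
 llog_cat_prep_log+0x311/0x3c0 [obdclass]
 llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
 llog_declare_add+0x187/0x1d0 [obdclass]
 top_trans_start+0x212/0x940 [ptlrpc]
 mdd_unlink+0x4a0/0xb30 [mdd]
 mdt_reint_unlink+0xb0c/0x12a0 [mdt]
 mdt_reint_rec+0x11f/0x250 [mdt]
 mdt_reint_internal+0x498/0x780 [mdt]
 mdt_reint+0x5e/0x100 [mdt]
 tgt_request_handle+0xc90/0x1940 [ptlrpc]
 ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
 ptlrpc_main+0xba2/0x1490 [ptlrpc]
"""

CREATE_TRACE = """
 llog_cat_prep_log+0x311/0x3c0 [obdclass]
 llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
 llog_declare_add+0x187/0x1d0 [obdclass]
 top_trans_start+0x212/0x940 [ptlrpc]
 mdd_create+0xb42/0x1870 [mdd]
 mdt_create+0x7a7/0xc20 [mdt]
 mdt_reint_create+0x30b/0x3c0 [mdt]
 mdt_reint_rec+0x11f/0x250 [mdt]
 mdt_reint_internal+0x498/0x780 [mdt]
 mdt_reint+0x5e/0x100 [mdt]
 tgt_request_handle+0xc90/0x1940 [ptlrpc]
 ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
 ptlrpc_main+0xba2/0x1490 [ptlrpc]
"""

shared = common_innermost_frames(
    parse_frames(UNLINK_TRACE), parse_frames(CREATE_TRACE)
)
for func, module in shared:
    print(f"{func} [{module}]")
```

Both traces share the same four innermost frames, from `llog_cat_prep_log` through `top_trans_start`, and diverge only at the MDD operation (`mdd_unlink` versus `mdd_create`), which is consistent with the fault being in the llog list handling and not in the specific metadata operation.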
      

      A search back through the Maloo crash reports for runtests to the start of the year suggests that this ASSERTION first appeared on 2021-07-31 (there are other, unrelated crashes in runtests caused by bugs in under-development patches).

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Maloo (maloo)