  1. Lustre
  2. LU-19685

Crash in sanity-lfsck test_6b: "LFSCK resumes from last checkpoint (2)"

Details

    • Bug
    • Resolution: Incomplete
    • Medium

    Description

      Crash Details:

      LustreError: 225351:0:(lfsck_engine.c:1607:lfsck_assistant_engine()) ASSERTION( lad->lad_post_result > 0 ) failed: 
      LustreError: 225351:0:(lfsck_engine.c:1607:lfsck_assistant_engine()) LBUG
      CPU: 0 PID: 225351 Comm: lfsck_namespace Kdump: loaded Tainted: P           OE     -------- -  - 4.18.0-553.76.1.el8_lustre.x86_64 #1
      Call Trace:
       dump_stack+0x41/0x60
       lbug_with_loc.cold.6+0x5/0x43 [libcfs]
       lfsck_assistant_engine+0x19dd/0x1c10 [lfsck]
       kthread+0x134/0x150
       ret_from_fork+0x1f/0x40
      Kernel panic - not syncing: LBUG 

      Test Case:
      sanity-lfsck test_6b: "LFSCK resumes from last checkpoint (2)"

      Sets OBD_FAIL_LFSCK_FATAL2 (0x80001609) to force LFSCK failure
      Expects LFSCK to fail gracefully and save checkpoint
      Instead triggers kernel panic
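
As a reading aid for the fail_loc value above, here is a minimal sketch of how 0x80001609 can be split into a one-shot flag and an injection-site number. It assumes the usual libcfs fail_loc layout (top bit means "fail once", low 16 bits identify the site); the macro names, the struct, and the function are placeholders invented for this sketch, not Lustre identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, not copied from Lustre headers. */
#define MODEL_FAIL_ONCE     0x80000000u  /* assumed one-shot flag bit */
#define MODEL_FAIL_MASK_LOC 0x0000ffffu  /* assumed injection-site mask */

/* Hypothetical container for the two parts of a fail_loc value. */
struct fail_loc_parts {
    uint32_t site;  /* which injection point is armed */
    bool once;      /* whether the failure fires a single time */
};

/* Split a raw fail_loc value into its site number and one-shot flag. */
static struct fail_loc_parts split_fail_loc(uint32_t fail_loc)
{
    struct fail_loc_parts parts = {
        .site = fail_loc & MODEL_FAIL_MASK_LOC,
        .once = (fail_loc & MODEL_FAIL_ONCE) != 0,
    };
    return parts;
}
```

Under that assumption, split_fail_loc(0x80001609u) yields site 0x1609 with the one-shot flag set, which would match a test that forces a single fatal LFSCK failure.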

      Root Cause:
      Race condition between the master LFSCK thread and the assistant thread:

      1. The master thread encounters an error (OBD_FAIL_LFSCK_FATAL2) and calls lfsck_post_generic() with result = -EINVAL.
      2. lfsck_post_generic() sets lad->lad_post_result = -EINVAL (line 2580).
      3. Since result <= 0, it calls lfsck_stop_assistant() instead of setting the LAD_TO_POST flag.
      4. The assistant thread may have already passed the lfsck_should_stop() check at line 1594.
      5. The assistant sees the LAD_TO_POST flag still set from a previous operation.
      6. The assistant clears the LAD_TO_POST flag (line 1604).
      7. The assistant hits LASSERT(lad->lad_post_result > 0), cited here as line 1605; the crash log above reports lfsck_engine.c:1607.
      8. CRASH: lad_post_result is -EINVAL, not > 0.
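
The sequence above can be replayed in a small single-threaded user-space model. This is only a sketch of the interleaving described in this report, not the Lustre code: struct model_lad, the enum, and every model_* function are hypothetical stand-ins, and the assertion is represented by a return value instead of an actual LBUG.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in holding only the state named in the report. */
struct model_lad {
    bool to_post;     /* models the LAD_TO_POST flag */
    bool stopping;    /* models what lfsck_should_stop() would report */
    int post_result;  /* models lad->lad_post_result */
};

enum model_outcome {
    MODEL_IDLE,    /* no post request pending */
    MODEL_POSTED,  /* post handling ran with a positive result */
    MODEL_LBUG,    /* the assertion would have fired */
};

/* Master side, as described: record the result, then either request a
 * post (result > 0) or stop the assistant (result <= 0). */
static void model_master_post(struct model_lad *lad, int result)
{
    assert(lad != NULL);
    lad->post_result = result;
    if (result > 0)
        lad->to_post = true;
    else
        lad->stopping = true;  /* models lfsck_stop_assistant() */
}

/* Assistant side, entered after the early stop check has already passed. */
static enum model_outcome model_assistant_post(struct model_lad *lad)
{
    assert(lad != NULL);
    if (!lad->to_post)
        return MODEL_IDLE;
    lad->to_post = false;       /* clear the (possibly stale) flag */
    if (lad->post_result <= 0)  /* models LASSERT(lad_post_result > 0) */
        return MODEL_LBUG;
    return MODEL_POSTED;
}

/* Replay: a stale flag from an earlier operation, then a failing post. */
static enum model_outcome model_replay_race(void)
{
    struct model_lad lad = { .to_post = true, .stopping = false, .post_result = 1 };

    model_master_post(&lad, -EINVAL);
    return model_assistant_post(&lad);
}
```

In this model, model_replay_race() ends in MODEL_LBUG because the stale flag survives the failing post while the result has already been overwritten with -EINVAL.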

      Impact:

      Kernel panic during LFSCK error handling
      Affects error injection testing and potentially real error scenarios
      LFSCK cannot fail gracefully when master thread encounters fatal errors

      Fix:
      Add an lfsck_should_stop() check after clearing the LAD_TO_POST flag but before the assertion. The assistant thread can then detect that the master has signaled a stop with an error and exit cleanly instead of asserting.
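
A minimal user-space sketch of the proposed ordering follows: clear the flag, re-check for a stop, and only then evaluate the condition that the assertion guards. It is not the actual patch; struct fix_state, the enum, and the fix_* functions are hypothetical names used only to illustrate the control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the assistant state discussed in the report. */
struct fix_state {
    bool to_post;         /* models the LAD_TO_POST flag */
    bool stop_requested;  /* models what lfsck_should_stop() would report */
    int post_result;      /* models lad->lad_post_result */
};

enum fix_outcome {
    FIX_IDLE,          /* no post request pending */
    FIX_EXIT_CLEAN,    /* stop noticed after clearing the flag */
    FIX_POSTED,        /* post handling ran with a positive result */
    FIX_WOULD_ASSERT,  /* non-positive result without a stop request */
};

/* One pass of the post-handling step with the extra stop check. */
static enum fix_outcome fix_post_step(struct fix_state *st)
{
    assert(st != NULL);
    if (!st->to_post)
        return FIX_IDLE;
    st->to_post = false;     /* clear the flag first, as before */
    if (st->stop_requested)  /* new check, placed before the assertion */
        return FIX_EXIT_CLEAN;
    if (st->post_result <= 0)
        return FIX_WOULD_ASSERT;
    return FIX_POSTED;
}

/* Replay the failing case: stale flag, stop requested, result = -EINVAL. */
static enum fix_outcome fix_replay_race(void)
{
    struct fix_state st = { .to_post = true, .stop_requested = true, .post_result = -EINVAL };

    return fix_post_step(&st);
}
```

With the re-check in place, the replayed case returns FIX_EXIT_CLEAN, while a normal post with a positive result still proceeds to FIX_POSTED.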

          People

            paf0186 Patrick Farrell