Details
Type: Bug
Resolution: Incomplete
Priority: Medium
Severity: 3
Description
Crash Details:
LustreError: 225351:0:(lfsck_engine.c:1607:lfsck_assistant_engine()) ASSERTION( lad->lad_post_result > 0 ) failed:
LustreError: 225351:0:(lfsck_engine.c:1607:lfsck_assistant_engine()) LBUG
CPU: 0 PID: 225351 Comm: lfsck_namespace Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.76.1.el8_lustre.x86_64 #1
Call Trace:
 dump_stack+0x41/0x60
 lbug_with_loc.cold.6+0x5/0x43 [libcfs]
 lfsck_assistant_engine+0x19dd/0x1c10 [lfsck]
 kthread+0x134/0x150
 ret_from_fork+0x1f/0x40
Kernel panic - not syncing: LBUG
Test Case:
sanity-lfsck test_6b: "LFSCK resumes from last checkpoint (2)"
Sets OBD_FAIL_LFSCK_FATAL2 (0x80001609) to force LFSCK failure
Expects LFSCK to fail gracefully and save checkpoint
Instead triggers kernel panic
Root Cause:
Race condition between the master LFSCK thread and the assistant thread:
1. The master thread encounters an error (OBD_FAIL_LFSCK_FATAL2) and calls lfsck_post_generic() with result = -EINVAL.
2. lfsck_post_generic() sets lad->lad_post_result = -EINVAL (line 2580).
3. Since result <= 0, it calls lfsck_stop_assistant() instead of setting the LAD_TO_POST flag.
4. The assistant thread may have already passed the lfsck_should_stop() check at line 1594.
5. The assistant sees the LAD_TO_POST flag still set from a previous operation.
6. The assistant clears the LAD_TO_POST flag (line 1604).
7. The assistant hits LASSERT(lad->lad_post_result > 0) at line 1605 (reported as lfsck_engine.c:1607 in the crash log).
8. CRASH: lad_post_result is -EINVAL, not > 0.
Impact:
Kernel panic during LFSCK error handling
Affects error injection testing and potentially real error scenarios
LFSCK cannot fail gracefully when master thread encounters fatal errors
Fix:
Add an lfsck_should_stop() check after clearing the LAD_TO_POST flag but before the assertion. This lets the assistant thread detect that the master has signaled a stop with an error and exit cleanly instead of asserting.