Details
- Bug
- Resolution: Fixed
- Major
- None
- 3
- 9223372036854775807
Description
Llog allows records to be processed in parallel, and a record can be canceled while it is being processed. For a changelog, two threads can process and cancel records concurrently, so a race can occur when both handle the same record: the first thread cancels it, and the second gets ENOENT. Since this is a valid (expected) error, Lustre should hide it from the caller.
The following log shows the exact race: two threads (28074 and 11741) try to cancel record 35285 at the same time. One thread cancels it and the other gets -2 (ENOENT).
00000004:00000001:5.0:1614693066.498334:0:28074:0:(mdd_device.c:312:llog_changelog_cancel_cb()) Process entered
00000040:00100000:5.0:1614693066.498336:0:28074:0:(llog.c:220:llog_cancel_arr_rec()) Canceling 1 records, first 35284 in log [0x645e:0x1:0x0]
00000040:00001000:5.0:1614693066.498359:0:28074:0:(llog_osd.c:401:llog_osd_write_rec()) new record 10645539 to [0x1:0x645e:0x0]
00000004:00000001:5.0:1614693066.498365:0:28074:0:(mdd_device.c:348:llog_changelog_cancel_cb()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:5.0:1614693066.498368:0:28074:0:(mdd_device.c:312:llog_changelog_cancel_cb()) Process entered
00000040:00100000:5.0:1614693066.498369:0:28074:0:(llog.c:220:llog_cancel_arr_rec()) Canceling 1 records, first 35285 in log [0x645e:0x1:0x0]
00000004:00000001:3.0:1614693066.498383:0:11741:0:(mdd_device.c:312:llog_changelog_cancel_cb()) Process entered
00000040:00100000:3.0:1614693066.498385:0:11741:0:(llog.c:220:llog_cancel_arr_rec()) Canceling 1 records, first 35285 in log [0x645e:0x1:0x0]
00000040:00001000:5.0:1614693066.498393:0:28074:0:(llog_osd.c:401:llog_osd_write_rec()) new record 10645539 to [0x1:0x645e:0x0]
00000004:00000001:5.0:1614693066.498398:0:28074:0:(mdd_device.c:348:llog_changelog_cancel_cb()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:5.0:1614693066.498401:0:28074:0:(mdd_device.c:312:llog_changelog_cancel_cb()) Process entered
00000040:00100000:5.0:1614693066.498403:0:28074:0:(llog.c:220:llog_cancel_arr_rec()) Canceling 1 records, first 35286 in log [0x645e:0x1:0x0]
00000004:00000001:3.0:1614693066.498422:0:11741:0:(mdd_device.c:348:llog_changelog_cancel_cb()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00080000:3.0:1614693066.498423:0:11741:0:(llog.c:699:llog_process_thread()) stop processing plain 0x645e:1:0 index 35285 count 28959
00000040:00001000:5.0:1614693066.498433:0:28074:0:(llog_osd.c:401:llog_osd_write_rec()) new record 10645539 to [0x1:0x645e:0x0]
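The behavior described above can be illustrated with a small single-threaded C model. This is only a sketch, not the actual Lustre fix: `demo_log`, `demo_cancel_rec` and `demo_cancel_cb` are hypothetical names invented for illustration, and the model reproduces only the ordering of the two cancel calls, not real threading or locking. It assumes that hiding the error means mapping -ENOENT to 0 in the cancel callback.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a plain llog: one flag per record index. */
#define DEMO_NRECS 8

struct demo_log {
	bool live[DEMO_NRECS];	/* true while the record is not yet canceled */
};

static void demo_log_init(struct demo_log *log)
{
	for (int i = 0; i < DEMO_NRECS; i++)
		log->live[i] = true;
}

/*
 * Cancel one record.
 * Returns 0 on success, -EINVAL for a bad index,
 * and -ENOENT if the record was already canceled.
 */
static int demo_cancel_rec(struct demo_log *log, int idx)
{
	if (idx < 0 || idx >= DEMO_NRECS)
		return -EINVAL;
	if (!log->live[idx])
		return -ENOENT;
	log->live[idx] = false;
	return 0;
}

/*
 * Callback wrapper (hypothetical): losing the race to another
 * canceler is not a failure from the caller's point of view,
 * so -ENOENT is turned into success. Other errors pass through.
 */
static int demo_cancel_cb(struct demo_log *log, int idx)
{
	int rc = demo_cancel_rec(log, idx);

	if (rc == -ENOENT)
		rc = 0;	/* someone else already canceled this record */
	return rc;
}
```

In this model, the first `demo_cancel_rec()` call on a record returns 0 and a second call on the same record returns -ENOENT, which matches the rc=-2 seen by thread 11741 in the log. Going through `demo_cancel_cb()` instead, the second caller sees 0 and processing can continue.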
Issue Links
- is related to
- LU-14705 ASSERTION( llog_osd_exist(loghandle) ) failed: with concurent "lfs changelog_clear"
- Open