[LU-9714] Changelog consumer test reports 'Local llog found corrupted' - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.11.0
Affects Version/s: None
Labels:
None

Epic/Theme:
- patch
Severity:
3
Rank (Obsolete):
9223372036854775807

Description

[admin@snx11253n002 ~]$ lctl get_param mdd.snx11253-MDT0000.changelog_users;
mdd.snx11253-MDT0000.changelog_users=
current index: 8936329
ID index
cl2 7049068
cl31347 3348651
cl31349 3348651
cl31351 3348651
cl33628 6330946
cl33632 6335474
cl2 7049068
cl33962 7379382
cl33963 7379382
cl33964 7379382

TEST PROCEDURE:
For each rank:
(I) CONSUME_LOGS (STARTREC=<arg> or 1)
[1] Set endrec = startrec + params.chunksize
[2] Read CL in parallel with N threads with range (startrec, endrec), and extract
actual last record read (lastrec) .
a] If no new files are found ...
i] Retry until allowed (by refresh timeout, refresh retries).
Otherwise quit.
b] If new files are found ...
i] If clearing is enabled, clear logs up to lastrec.
ii] Increment record ranges
(startrec = lastrec, endrec = startrec + chunksize)
d] repeat step [2]
TEST EXECUTION:
The test was executed with the follow arguments:
$ changelog-consumer.py -v --read-clear --clusers cl2 cl2 cl2 cl2 --mdt snx11253-MDT0000 --chunksize 9999 --read-threads 8 --update-retries 6 --update-after 30
Basically this is reading 9999 Changelogs in 8 threads, then clearing 9999 Changelogs in 1 thread (implied, not show). These actions are being duplicated across 4 separate nodes, and against the "cl2" user.

The root cause is parallel processing llog with modification.
The llog has llog_handle which include

lgh_cur_idx; /* used during llog_process */
lgh_cur_offset; /* used during llog_process */

They both are modified by llog_process_thread every time we process record from llog.

Those fields are used at llog_osd_write_rec for modification/rewrite llog record. The race exist when two or more threads are processing the same llog and at least one of them do modification.
1) llog_process_thread lgh_cur_idx=10 lgh_cur_offset=2000,
2) llog_process_thread lgh_cur_idx=5 lgh_cur_offset=1000,

1) log_process_thread >lpi_cb>mdd_changelog_user_purge_cb->llog_write... llog_osd_write_rec and then

if (idx != loghandle->lgh_cur_idx) {                         
                                CERROR("%s: modify index mismatch %d %d\n",          
                                       o->do_lu.lo_dev->ld_obd->obd_name, idx,       
                                       loghandle->lgh_cur_idx);                      
                               RETURN(-EFAULT);                                     
          }

From logs

May 4 19:55:29 snx11253n002 kernel: LustreError: 29737:0:(llog_osd.c:441:llog_osd_write_rec()) snx11253-MDT0000-osd: modify index mismatch 2 34132

This was the first case.

(2) Lets imagine that second thread modify lgh_cur_idx/lgh_cur_offset right after this check

if (idx != loghandle->lgh_cur_idx)

then

lgi->lgi_off = loghandle->lgh_cur_offset;

lgi->lgi_off became 1000 instead of 2000. And the wrong modification will happen.

Attachments

Activity

People

Assignee:: Alexander Boyko

Reporter:: Alexander Boyko

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 27/Jun/17 9:50 AM

Updated:: 29/Jan/22 10:10 AM

Resolved:: 29/Jan/22 10:10 AM