[LU-9714] Changelog consumer test reports 'Local llog found corrupted' Created: 27/Jun/17  Updated: 29/Jan/22  Resolved: 29/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Epic/Theme: patch
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

[admin@snx11253n002 ~]$ lctl get_param mdd.snx11253-MDT0000.changelog_users;
mdd.snx11253-MDT0000.changelog_users=
current index: 8936329
ID index
cl2 7049068
cl31347 3348651
cl31349 3348651
cl31351 3348651
cl33628 6330946
cl33632 6335474
cl2 7049068
cl33962 7379382
cl33963 7379382
cl33964 7379382

TEST PROCEDURE:
For each rank:
(I) CONSUME_LOGS (STARTREC=<arg> or 1)
[1] Set endrec = startrec + params.chunksize
[2] Read CL in parallel with N threads with range (startrec, endrec), and extract
actual last record read (lastrec) .
a] If no new files are found ...
i] Retry until allowed (by refresh timeout, refresh retries).
Otherwise quit.
b] If new files are found ...
i] If clearing is enabled, clear logs up to lastrec.
ii] Increment record ranges
(startrec = lastrec, endrec = startrec + chunksize)
d] repeat step [2]
TEST EXECUTION:
The test was executed with the follow arguments:
$ changelog-consumer.py -v --read-clear --clusers cl2 cl2 cl2 cl2 --mdt snx11253-MDT0000 --chunksize 9999 --read-threads 8 --update-retries 6 --update-after 30
Basically this is reading 9999 Changelogs in 8 threads, then clearing 9999 Changelogs in 1 thread (implied, not show). These actions are being duplicated across 4 separate nodes, and against the "cl2" user.

The root cause is parallel processing llog with modification.
The llog has llog_handle which include

lgh_cur_idx; /* used during llog_process */
lgh_cur_offset; /* used during llog_process */

They both are modified by llog_process_thread every time we process record from llog.

Those fields are used at llog_osd_write_rec for modification/rewrite llog record. The race exist when two or more threads are processing the same llog and at least one of them do modification.
1) llog_process_thread lgh_cur_idx=10 lgh_cur_offset=2000,
2) llog_process_thread lgh_cur_idx=5 lgh_cur_offset=1000,

1) log_process_thread >lpi_cb>mdd_changelog_user_purge_cb->llog_write... llog_osd_write_rec and then

if (idx != loghandle->lgh_cur_idx) {                         
                                CERROR("%s: modify index mismatch %d %d\n",          
                                       o->do_lu.lo_dev->ld_obd->obd_name, idx,       
                                       loghandle->lgh_cur_idx);                      
                               RETURN(-EFAULT);                                     
          }  


From logs

May 4 19:55:29 snx11253n002 kernel: LustreError: 29737:0:(llog_osd.c:441:llog_osd_write_rec()) snx11253-MDT0000-osd: modify index mismatch 2 34132

This was the first case.

(2) Lets imagine that second thread modify lgh_cur_idx/lgh_cur_offset right after this check

if (idx != loghandle->lgh_cur_idx)

then

lgi->lgi_off = loghandle->lgh_cur_offset;

lgi->lgi_off became 1000 instead of 2000. And the wrong modification will happen.



 Comments   
Comment by Gerrit Updater [ 27/Jun/17 ]

Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/27838
Subject: LU-9714 llog: fix llog_process_thread race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8c8ffad1a2e01b371351bec82037958fe8b282b3

Comment by Mikhail Pershin [ 12/Jul/17 ]

what kind of modification is used in that test? You've mentioned only 'clearing' of changelog, is it done with llog_cancel or you are using llog_write() to wipe these records entirely?

Comment by Alexander Boyko [ 12/Jul/17 ]

user mode test, so it was

lfs changelog_clear

Comment by Gerrit Updater [ 18/Sep/17 ]

Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29035
Subject: LU-9714 test: checking llog_process_thread race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cd7ad26a3208ea69e393a632b3496ed77d767d52

Comment by Gerrit Updater [ 22/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27838/
Subject: LU-9714 llog: fix llog_process_thread race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 52b693c588555c55dd44fe3a27a1bf8c8cccac31

Comment by Peter Jones [ 22/Dec/17 ]

Alex

The functional change tracked here has landed to master. Do you still intend to land the accompanying test? If so, please could you rebase it so that it can complete testing and reviews?

Thanks

Peter

Generated at Sat Feb 10 02:28:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.