[LU-11581] Not all changelog entries are returned to userspace Created: 29/Oct/18  Updated: 06/Aug/19  Resolved: 26/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Olaf Weber Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: changelog
Environment:

Lustre 2.10 based virtual cluster


Issue Links:
Related
is related to LU-11426 2/2 Olafs agree: changelog entries ar... Resolved
Epic/Theme: changelog
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

In a Lustre 2.10+ based cluster I have observed a problem where some changelog entries are not returned to userspace. Which entries are dropped is not consistent across attempts to read them.

I can reproduce this by doing the following:

  1. Register a changelog reader to enable changelog
  2. On at least two client nodes, run a file creation/deletion loop - I use a recursive copy of /usr/include to a client-specific directory
  3. Wait until the changelog has grown to a couple million entries.
  4. Stop the file creation/deletion loops, and ensure the filesystem is idle.
  5. Run lfs changelog several times on a client and redirect the output to different files.
  6. Compare the files.

What I have observed is that I got different output files from lfs changelog every single time. Changelog records that are absent in one of the output files are present in another and vice versa. At no point were all entries that should be in the on-disk log returned.
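The comparison in step 6 can be sketched in Python. This is a minimal illustration, not the instrumented reader mentioned in the notes. It assumes that each line of lfs changelog output begins with the numeric record number; the names `record_numbers` and `missing_per_dump` are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def record_numbers(lines: Iterable[str]) -> Set[int]:
    """Collect the leading record number from each changelog line."""
    numbers = set()
    for line in lines:
        fields = line.split()
        # Assumption: the first whitespace-separated field is the record number.
        if fields and fields[0].isdigit():
            numbers.add(int(fields[0]))
    return numbers


def missing_per_dump(dumps: Dict[str, Iterable[str]]) -> Dict[str, List[int]]:
    """For each dump, list record numbers present in some other dump but not in it."""
    seen = {name: record_numbers(lines) for name, lines in dumps.items()}
    union = set().union(*seen.values())
    return {name: sorted(union - nums) for name, nums in seen.items()}
```

Open file objects can be passed as the values of `dumps`, since they iterate line by line. Because the comparison works on sets, the order of records within each file does not matter. A record that is dropped in every run will not show up here; only records present in at least one dump can be reported as missing from another.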

In my (admittedly CPU-starved) virtual cluster the drop rate was approximately 1 entry per 16000 records, but in a test like the one above, a few million on-disk records are required to see the problem consistently.
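As a rough back-of-the-envelope check of why millions of records are needed, the observed rate can be turned into an expected number of dropped records per read. The sketch below assumes drops are independent and uniformly likely, which has not been established for this bug; the function names are hypothetical.

```python
def expected_drops(total_records: int, records_per_drop: int = 16000) -> float:
    """Expected number of dropped records at one drop per `records_per_drop`."""
    return total_records / records_per_drop


def chance_of_any_drop(total_records: int, records_per_drop: int = 16000) -> float:
    """Probability that a single read drops at least one record.

    Assumption: each record is dropped independently with the same probability.
    """
    keep = 1.0 - 1.0 / records_per_drop
    return 1.0 - keep ** total_records
```

Under these assumptions a log of 16000 records would show a drop in only about two reads out of three, while a log of two million records would be expected to lose on the order of a hundred records on every read.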

Notes:

  • I originally observed this with a changelog reader that had been instrumented to detect this kind of issue. The description above shows how to reproduce it without relying on a proprietary tool.
  • I have not been able to reproduce this in a 2.7+ based cluster. Admittedly that one does have much more capable hardware as well.
  • To compare the output files with tools like cmp you need to sort them first using 'sort -n'. This is necessary because of LU-11426.
  • This issue may in fact be caused by LU-11426 interacting with the new (in 2.10) mechanism to return changelog entries to userspace.
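A reader can detect missing records on its own, without comparing several dumps, by looking for gaps in the record numbers. The following is a minimal sketch of that idea and not the proprietary tool mentioned above. It assumes that record numbers in the on-disk log are contiguous, and it sorts first because records may arrive out of order; `find_gaps` is a hypothetical name.

```python
from typing import Iterable, List, Tuple


def find_gaps(indices: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) ranges of record numbers that are absent."""
    # Sort and deduplicate so out-of-order delivery does not look like a gap.
    ordered = sorted(set(indices))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps
```

For example, `find_gaps([5, 3, 4, 8, 9, 12])` reports the ranges 6 to 7 and 10 to 11 as missing. Records missing before the lowest or after the highest number seen cannot be detected this way.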


 Comments   
Comment by John Hammond [ 29/Oct/18 ]

I agree that this is likely due to LU-11426.

Comment by Peter Jones [ 29/Oct/18 ]

John

Can you please advise?

Thanks

Peter

Comment by Olaf Weber [ 06/Aug/19 ]

We have now seen this issue on systems running 2.7 based code. Out of order records do seem to play a part, but the 2.10+ mechanism for returning records to userspace does not appear to be the culprit.
