[LU-14474] Oops in llog_cat_prep_log() in sanity-quota / recovery-small Created: 27/Feb/21  Updated: 15/Dec/22  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: zfs

Issue Links:
Related
is related to LU-14964 recovery-small: GPF in llog_exist aft... Resolved
is related to LU-14932 runtests: test_1 llog_cat_cleanup()) ... Resolved
is related to LU-14995 recovery-small: FAIL: remove sub-test... Resolved
is related to LU-16398 ost-pools: FAIL: remove sub-test dirs... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.whamcloud.com/test_sets/fd169bf5-481b-4906-8891-b5fc6738d14d

[28530.564627] general protection fault: 0000 [#1] SMP PTI
[28530.565704] CPU: 0 PID: 1366058 Comm: mdt00_002 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
[28530.567756] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[28530.568756] RIP: 0010:up_write+0x5/0x30
[28530.569406] Code: f8 02 74 01 c3 48 8b 47 20 a8 04 74 06 f0 48 83 67 20 fb e9 3d ff ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> c7 47 20 00 00 00 00 48 c7 c0 ff ff ff ff f0 48 0f c1 07 a8 02
[28530.572467] RSP: 0018:ffffab6000b93a38 EFLAGS: 00010212
[28530.573333] RAX: 00000000ffffffea RBX: ffff97eb10f06a00 RCX: 000000000293cb3b
[28530.574512] RDX: 000000000293cb3a RSI: 0000000000000002 RDI: 5f32303030646f6c
[28530.575677] RBP: ffff97eb10f06ac8 R08: 000000000002f0c0 R09: ffffffffc1a13f21
[28530.576846] R10: 00000000ffffffea R11: 0000000000000001 R12: 0000000000000001
[28530.578021] R13: ffff97eb0f0ee140 R14: ffff97eb2d2f7cb8 R15: 00000000ffffffea
[28530.579186] FS:  0000000000000000(0000) GS:ffff97eb3fc00000(0000) knlGS:0000000000000000
[28530.580510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28530.581457] CR2: 00007fcd6fb50000 CR3: 0000000042a0a005 CR4: 00000000000606f0
[28530.582640] Call Trace:
[28530.583369]  llog_cat_prep_log+0xb0/0x3c0 [obdclass]
[28530.584256]  llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
[28530.585211]  llog_declare_add+0x187/0x1d0 [obdclass]
[28530.586470]  top_trans_start+0x212/0x940 [ptlrpc]
[28530.587364]  mdd_create+0xba4/0x1850 [mdd]
[28530.588225]  mdt_create+0x7b1/0xc50 [mdt]
[28530.588936]  mdt_reint_create+0x30b/0x3c0 [mdt]
[28530.589721]  mdt_reint_rec+0x11f/0x250 [mdt]
[28530.590454]  mdt_reint_internal+0x498/0x780 [mdt]
[28530.591283]  mdt_reint+0x5e/0x100 [mdt]
[28530.591981]  tgt_request_handle+0xc78/0x1910 [ptlrpc]


 Comments   
Comment by Andreas Dilger [ 21/Aug/21 ]

+1 on master in recovery-small unmount https://testing.whamcloud.com/test_sets/e3bb9ef5-9a74-4037-9fd5-bd9c1e60e4e2

Comment by Andreas Dilger [ 31/Aug/21 ]

+1 on master in recovery-small after the test had finished:
https://testing.whamcloud.com/test_sets/6eb8a766-eea5-44d8-8fcb-0eac77b2be79

Lustre: DEBUG MARKER: == recovery-small test complete, duration 6212 sec ======= 13:58:23 (1629640703)
BUG: unable to handle kernel paging request at 000000000bd00c23
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 2618 Comm: mdt00_001 Kdump: loaded 4.18.0-240.22.1.el8_lustre.x86_64 #1
RIP: 0010:llog_exist+0xd9/0x180 [obdclass]
Call Trace:
llog_cat_prep_log+0x4f/0x3c0 [obdclass]
llog_cat_declare_add_rec+0x56/0x220 [obdclass]
llog_declare_add+0x187/0x1d0 [obdclass]
top_trans_start+0x212/0x940 [ptlrpc]
mdd_unlink+0x4a0/0xb30 [mdd]
mdt_reint_unlink+0xb0c/0x1290 [mdt]
mdt_reint_rec+0x11f/0x250 [mdt]
mdt_reint_internal+0x498/0x780 [mdt]
mdt_reint+0x5e/0x100 [mdt]
tgt_request_handle+0xc90/0x1940 [ptlrpc]
ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
ptlrpc_main+0xba2/0x1490 [ptlrpc]

Comment by Alex Zhuravlev [ 03/Sep/21 ]

A similar one: https://testing.whamcloud.com/test_sets/a3d67622-925a-4122-abf5-f599b59970ba

Comment by Alex Zhuravlev [ 14/Sep/21 ]

https://testing.whamcloud.com/test_sessions/66dd0fd7-ab82-4af5-9178-0a12fe5302d5

Comment by Alex Zhuravlev [ 14/Sep/21 ]

[  857.600721] LustreError: 167-0: lustre-MDT0001-osp-MDT0002: This client was evicted by lustre-MDT0001; in progress operations using this service will fail.
[  857.857771] LustreError: 64216:0:(llog.c:472:llog_verify_record()) lustre-MDT0001-osp-MDT0000: record is too large: 0 > 32768
[  857.859905] LustreError: 64216:0:(llog.c:656:llog_process_thread()) lustre-MDT0001-osp-MDT0000: invalid record in llog [0x1:0x40001b70:0x2] record for index 0/1: rc = -22

Comment by Gerrit Updater [ 14/Sep/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44910
Subject: LU-14474 obdclass: debugging bits
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fbbd7933cc997754f874ad8ec964da8ea1e7df72

Comment by Alex Zhuravlev [ 21/Sep/21 ]

The immediate cause is that lod_sub_recovery_thread() destroys the just-scanned llog while it is still referenced by chd_next_log; I have not yet understood how the destroyed log got into chd_next_log in the first place.
One thing I observed is that llog processing in lod_sub_recovery_thread() races with new requests from the clients; I suspect this is not quite correct.
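
As a minimal illustration of that pattern and of the fix that eventually landed ("don't destroy next llog"), here is a standalone C sketch: a catalog caches a "next" plain log, and the destroy path must clear that cached pointer before freeing the log, otherwise a later prep step dereferences freed memory (the GPF in llog_cat_prep_log() above). The struct and function names here (cat_handle, prep_log, destroy_log) are simplified stand-ins for illustration only, not the actual obdclass code.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-ins for struct llog_handle and the catalog's
 * handle data; these are not the real Lustre definitions. */
struct log_handle {
	int lh_index;	/* which plain log this is, for the demo printout */
};

struct cat_handle {
	pthread_mutex_t    ch_lock;
	int                ch_index;	/* demo counter for new logs */
	struct log_handle *ch_next_log;	/* plays the role of chd_next_log */
};

/* Recovery-thread side: a fully processed plain log gets destroyed.
 * The bug pattern: the catalog may still cache this handle as its
 * "next" log, so a later prep_log() would dereference freed memory. */
static void destroy_log(struct cat_handle *cat, struct log_handle *log)
{
	pthread_mutex_lock(&cat->ch_lock);
	/* The fix: drop the cached pointer so nobody picks up the
	 * about-to-be-destroyed log as the next one. */
	if (cat->ch_next_log == log)
		cat->ch_next_log = NULL;
	pthread_mutex_unlock(&cat->ch_lock);

	free(log);
}

/* Request-handling side: reuse the cached next log if there is one,
 * otherwise allocate and cache a fresh one. */
static struct log_handle *prep_log(struct cat_handle *cat)
{
	struct log_handle *log;

	pthread_mutex_lock(&cat->ch_lock);
	log = cat->ch_next_log;
	if (log == NULL) {
		log = calloc(1, sizeof(*log));
		log->lh_index = ++cat->ch_index;
		cat->ch_next_log = log;
	}
	pthread_mutex_unlock(&cat->ch_lock);
	return log;	/* without the fix this could be freed memory */
}

int main(void)
{
	struct cat_handle cat = { .ch_lock = PTHREAD_MUTEX_INITIALIZER };
	struct log_handle *log = prep_log(&cat);

	destroy_log(&cat, log);	/* recovery thread finishes with the log */
	log = prep_log(&cat);	/* safe: a fresh log is allocated */
	printf("next log #%d at %p\n", log->lh_index, (void *)log);
	free(log);
	return 0;
}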

Comment by Gerrit Updater [ 21/Sep/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44998
Subject: LU-14474 llog: reset pointer to the next llog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 447f84ff513e2ab3efdac5b32403ac06d3c61e7c

Comment by Gerrit Updater [ 10/Oct/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44998/
Subject: LU-14474 llog: don't destroy next llog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4521f6af35d1dc20b531b87ff3633d89dbac86ec

Comment by Peter Jones [ 10/Oct/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 23/Dec/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45928
Subject: LU-14474 llog: don't destroy next llog
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 61ca609907943a56e94e073ee0d71ead0074aeb5
