LU-18049

ost-pools test_25, sanity-sec test_31: crash in ext4_htree_store_dirent kmalloc

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.16.0
    • Fix Version: Lustre 2.16.0
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/3576290d-064a-4573-a087-75b59fff6df7

      test_25 failed with the following error:

      trevis-106vm10, trevis-106vm11 crashed during ost-pools test_25
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4542 - 5.14.0-362.24.1.el9_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-b_es6_0/666 - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64

      Both clients crashed in ext4_htree_store_dirent() (ext4, NOT ldiskfs) inside kmalloc, so it looks like some kind of client-side memory corruption?

      [27299.419062] Lustre: MGC10.240.44.44@tcp: Connection restored to  (at 10.240.44.44@tcp)
      [27299.448580] LustreError: 886364:0:(client.c:3288:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff902e1402e3c0 x1802432488249280/t691489734702(691489734702) o101->lustre-MDT0000-mdc-ffff902e04449800@10.240.44.44@tcp:12/10 lens 520/608 e 0 to 0 dl 1718935667 ref 2 fl Interpret:RPQU/204/0 rc 301/301 job:'lfs.0' uid:0 gid:0
      [27300.013294] Lustre: lustre-MDT0000-mdc-ffff902e04449800: Connection restored to  (at 10.240.44.44@tcp)
      [27305.358931] Lustre: 886365:0:(client.c:2334:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1718935641/real 1718935641]  req@ffff902e547d9380 x1802432536672704/t0(0) o400->lustre-MDT0000-mdc-ffff902e04449800@10.240.44.44@tcp:12/10 lens 224/224 e 0 to 1 dl 1718935657 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
      [27306.104959] BUG: unable to handle page fault for address: ffff902ee5338778
      [27306.105697] #PF: supervisor read access in kernel mode
      [27306.106204] #PF: error_code(0x0000) - not-present page
      [27306.107607] CPU: 1 PID: 1109213 Comm: bash Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-362.24.1.el9_3.x86_64 #1
      [27306.108653] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [27306.109209] RIP: 0010:__kmalloc+0x11b/0x370
      [27306.118475] Call Trace:
      [27306.118756]  <TASK>
      [27306.120625]  ? __die_body.cold+0x8/0xd
      [27306.121006]  ? page_fault_oops+0x134/0x170
      [27306.121437]  ? kernelmode_fixup_or_oops+0x84/0x110
      [27306.121944]  ? exc_page_fault+0xa8/0x150
      [27306.122371]  ? asm_exc_page_fault+0x22/0x30
      [27306.122806]  ? ext4_htree_store_dirent+0x36/0x100 [ext4]
      [27306.123359]  ? __kmalloc+0x11b/0x370
      [27306.123740]  ext4_htree_store_dirent+0x36/0x100 [ext4]
      [27306.124269]  htree_dirblock_to_tree+0x1ab/0x310 [ext4]
      [27306.124809]  ext4_htree_fill_tree+0x203/0x3b0 [ext4]
      [27306.125333]  ext4_dx_readdir+0x10d/0x360 [ext4]
      [27306.125817]  ext4_readdir+0x392/0x550 [ext4]
      [27306.126275]  iterate_dir+0x17c/0x1c0
      [27306.126711]  __x64_sys_getdents64+0x80/0x120
      [27306.128187]  do_syscall_64+0x5c/0x90
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      ost-pools test_25 - trevis-106vm10, trevis-106vm11 crashed during ost-pools test_25

      Attachments

        Issue Links

          Activity

            [LU-18049] ost-pools test_25, sanity-sec test_31: crash in ext4_htree_store_dirent kmalloc

            Patch v2_15_58-46-ge2539c0667 failed all sessions.
            Patch v2_15_58-45-ge4d2d4ff74 failed all sessions.
            Patch v2_15_58-44-g14ed4a6f8f passed all sessions.

            So it is clear the patch v2_15_58-45-ge4d2d4ff74 "LU-13306 mgc: handle large NID formats" is the source of the memory corruption.

            Sergey will see if he can reproduce this locally to improve debugging.

            I've added memory debug logging to the -45 test patch to see if this can help identify the source of the corruption.

            adilger Andreas Dilger added a comment

            While it is only a 50-50 chance right now, it looks likely that patch https://review.whamcloud.com/50750 "LU-13306 mgc: handle large NID formats" is the source of the memory corruption in this test.

            If this can be reproduced on a local test system (outside of autotest) with the slab debugging enabled, that is likely the most direct path to figuring out what is actually corrupting the memory.
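
            As a sketch of the local setup (hypothetical, not something autotest supports today): the boot arguments could be added with grubby and the node rebooted; kmalloc-64 is picked only because that is the slab seen corrupted in the vmcores below, and slub_debug is the parameter spelling on these RHEL kernels:

            grubby --update-kernel=ALL --args="slab_nomerge slub_debug=FZPU,kmalloc-64"
            reboot
            # after reboot, confirm the debug flags took effect on that cache
            cat /sys/kernel/slab/kmalloc-64/red_zone /sys/kernel/slab/kmalloc-64/poison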

            adilger Andreas Dilger added a comment

            scherementsev, I'm not sure if this is practical to do or not, but one option might be to crank up the debug log size (e.g. 1GB) and enable "+malloc" debugging to get a record of what memory is allocated/freed by Lustre, and then use this to compare against the memory address in the core dump that is corrupted.

            This would potentially allow seeing what Lustre allocation is immediately before/after the corrupted slab item, and this could tell us what code is overwriting its own allocation and clobbering the neighboring slab object.
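
            A minimal sketch of the client-side knobs for that (the buffer size is just a suggestion):

            lctl set_param debug_mb=1024    # grow the Lustre debug buffer to ~1GB
            lctl set_param debug=+malloc    # record Lustre allocations/frees with address and size
            # run the failing test; afterwards the logged addresses can be compared
            # against the corrupted slab address found in the core dump
            lctl dk /tmp/lustre-debug.log   # dump the debug log on a node that is still up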

            Note I also filed ATM-3293 to implement a mechanism to pass kernel command-line options to the system during boot, like:

            Test-Parameters: kernelopt="slab_nomerge slab_debug=FZPU,kmalloc-64"
            

            so that we can enable the slab redzone / poisoning functionality on the kmalloc-64 slab. But until that is implemented I don't think there is any slab debugging that can be enabled from within the running VM.

            adilger Andreas Dilger added a comment

            Patch v2_15_58-42-ga54a20665a has passed all sessions.
            Patch v2_15_58-44-g14ed4a6f8f is still running after 2h so it is likely to also pass.
            Patch v2_15_58-47-gaebb405e32 has failed all sessions.

            I will push the two finished patches to 45 and 46:
            v2_15_58-47-gaebb405e32 LU-10499 pcc: use foreign layout for PCCRO on server side FAIL
            v2_15_58-46-ge2539c0667 LU-17088 dom: don't create different size DOM component
            v2_15_58-45-ge4d2d4ff74 LU-13306 mgc: handle large NID formats
            v2_15_58-44-g14ed4a6f8f LU-16837 llite: handle unknown layout component PASS

            adilger Andreas Dilger added a comment

            Patch v2_15_58-57-g4ba4976f52 has crashed all sessions. I will move it to v2_15_58-44-g14ed4a6f8f (just before large NID patch).

            adilger Andreas Dilger added a comment
            adilger Andreas Dilger added a comment - edited

            Patch v2_15_58-57-g4ba4976f52 build finished; I have prioritized its test sessions in the queue, though it is unclear if this will tell us anything new (likely to crash).
            Patch v2_15_58-52-g58a1c2c321 crashed all 9 sessions after 55-65 min, which means the suspect patch is likely older. This mostly eliminates v2_15_58-60 (cgroups).
            Patch v2_15_58-31-g5f3bc2ce7c is still running all sessions after 70-85 min, which might be an indication that it will pass.

            Next is v2_15_58-47-gaebb405e32, a repurpose of the -52 patch.

            Between -42 and -47 is v2_15_58-45-ge4d2d4ff74 "LU-13306 mgc: handle large NID formats" which Sergey previously pointed out as a likely culprit for interop crashing.

            Note that there are two later patches with "Fixes: LU-13306" (v2_15_59-38-g8276ade19d and v2_15_60-5-g015f645651), but they reference a different patch. If necessary, we could probably merge those into the test patches, but they are far enough away from the recent test range that I'm not sure they are involved...

            It may be significant that none of the recent crashes I've looked at have shown ext4-related stacks, only mount/config crashes.


            Patch on v2_15_58-62-g1aec91ac64 crashed all sessions after 45-90 minutes (up to 6 iterations on one session) with stacks like LU-18230, but none with ext4 in them.
            Patch on v2_15_58-52-g58a1c2c321 has just started testing a few minutes ago.
            Patch on v2_15_58-42-ga54a20665a is still in the test queue, though I have prioritized those test sessions.
            Patch on v2_15_58-31-g5f3bc2ce7c just started testing 20 minutes ago as well, so too early to know.
            Patch on v2_15_58-57-g4ba4976f52 just rebased, but at least the build queue is now empty (test queue has over 180 sessions in it).

            adilger Andreas Dilger added a comment

            green, even though we are trying to find the issue by bisection, I've looked into several vmcores. As far as I can see, it always fails here (at least in the 3 vmcores I had a chance to look into):

            3238 static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
            3239                 gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
            3240 {
                 .... 
            3295         } else {
            3296                 void *next_object = get_freepointer_safe(s, object);
            3297

            It is clear that the object pointer here is broken. get_freepointer_safe() adds s->offset to the object, so we have a fast way to find which slab was affected just by looking at RAX and RCX.
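
            A simplified sketch of that read, for reference (hypothetical helper name, not the exact kernel source):

            /* SLUB's fast path follows the freelist by reading the "next free"
             * pointer stored inside the free object itself, at offset s->offset */
            static inline void *get_next_free(struct kmem_cache *s, void *object)
            {
                    /* RAX = object, RCX = object + s->offset; the oops is this
                     * dereference when a corrupted freelist hands back garbage */
                    return *(void **)((char *)object + s->offset);
            }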

            Example:

            https://testing.whamcloud.com/test_sets/d1dea7be-129c-4b3d-ba7e-e76a19e90d7a
            RAX: 2c7e5a86dcc759a8 RBX: 0000000000000dc0 RCX: 2c7e5a86dcc759c8
            0x2c7e5a86dcc759c8 - 0x2c7e5a86dcc759a8 = 32 // kmalloc-64
            
            https://testing.whamcloud.com/test_sets/1a9f2b90-4ad5-4124-aa7e-abdb6f7992df
            RAX: ffff968f67c95519 RBX: 0000000000000dc0 RCX: ffff968f67c95539
            0xffff968f67c95539 - 0xffff968f67c95519 = 32 // kmalloc-64
            
            https://testing.whamcloud.com/test_sets/4d16dcfd-a86b-407c-8cd4-2281bcc25872
            RAX: ffff8c0101f81c2b RBX: 0000000000000d40 RCX: ffff8c0101f81c4b
            0xffff8c0101f81c4b - 0xffff8c0101f81c2b = 32 // kmalloc-64
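
            If needed, the 32-byte free-pointer offset can be confirmed directly in crash (the cache address is a placeholder to be taken from the kmem -s output):

            crash> kmem -s kmalloc-64                        # first column is the kmem_cache address
            crash> struct kmem_cache.offset <cache_address>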

            This approach definitely works for at least the 2 back traces below.

            [  631.425995] RIP: 0010:__kmalloc+0x11b/0x370
            [  631.426441] Code: 8b 51 08 48 8b 01 48 83 79 10 00 48 89 04 24 0f 84 b8 01 00 00 48 85 c0 0f 84 af 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 8a 00 20 00 00 48
            [  631.427963] RSP: 0018:ffffba54812bbbe0 EFLAGS: 00010202
            [  631.428453] RAX: 2c7e5a86dcc759a8 RBX: 0000000000000dc0 RCX: 2c7e5a86dcc759c8
            [  631.429089] RDX: 000000004fbb8000 RSI: 0000000000000dc0 RDI: 00000000000370a0
            [  631.429714] RBP: 0000000000000000 R08: ffffba54812bbc78 R09: 00000000fd89642e
            [  631.430358] R10: 00000000636f6262 R11: 000000000b0b0b0b R12: ffff949b41042500
            [  631.431000] R13: ffffffffc0627316 R14: 0000000000000dc0 R15: 000000000000003c
            [  631.431622] FS:  00007f4c2cf3f740(0000) GS:ffff949bffc00000(0000) knlGS:0000000000000000
            [  631.432327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  631.432858] CR2: 000055c0dc40eaf8 CR3: 0000000006276001 CR4: 00000000000606f0
            [  631.433482] Call Trace:
            [  631.433755]  <TASK>
            [  631.438189]  ext4_htree_store_dirent+0x36/0x100 [ext4]
            [  631.438708]  htree_dirblock_to_tree+0x1ab/0x310 [ext4]
            [  631.439249]  ext4_htree_fill_tree+0xca/0x3b0 [ext4]
            [  631.440218]  ext4_dx_readdir+0x10d/0x360 [ext4]
            [  631.440692]  ext4_readdir+0x392/0x550 [ext4]
            [  631.441156]  iterate_dir+0x17c/0x1c0
            [  631.441534]  __x64_sys_getdents64+0x80/0x120
            [  631.442342]  do_syscall_64+0x5c/0x90
            [  631.443994]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
            
            [  439.455143] RIP: 0010:kmem_cache_alloc_trace+0xfe/0x300
            [  439.455749] Code: 8b 51 08 48 8b 01 48 83 79 10 00 48 89 04 24 0f 84 91 01 00 00 48 85 c0 0f 84 88 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 8a 00 20 00 00 48
            [  439.457506] RSP: 0018:ffff9cd7c073fd20 EFLAGS: 00010286
            [  439.458054] RAX: ffff8c0101f81c2b RBX: 0000000000000d40 RCX: ffff8c0101f81c4b
            [  439.458790] RDX: 000000005017c001 RSI: 0000000000000d40 RDI: 00000000000370a0
            [  439.459512] RBP: 0000000000000000 R08: ffff8c01c3370000 R09: ffff8c02c5f5378c
            [  439.460229] R10: ffffffffffffffff R11: 000000000000000f R12: ffff8c01c1042500
            [  439.460958] R13: ffffffffc0b8011f R14: 0000000000000d40 R15: 0000000000000038
            [  439.461693] FS:  0000000000000000(0000) GS:ffff8c027fd00000(0000) knlGS:0000000000000000
            [  439.462498] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  439.463091] CR2: ffff8c0101f81c4b CR3: 0000000004b3a001 CR4: 00000000001706e0
            [  439.463825] Call Trace:
            [  439.464122]  <TASK>
            [  439.469393]  ksocknal_send_hello_v2+0x2f/0x550 [ksocklnd]
            [  439.469957]  ksocknal_create_conn+0x2ad/0x1940 [ksocklnd]
            [  439.471100]  ksocknal_connect+0x18f/0x6d0 [ksocklnd]
            [  439.471632]  ksocknal_connd+0x51b/0x7a0 [ksocklnd]
            [  439.473271]  kthread+0xe0/0x100
            [  439.474056]  ret_from_fork+0x2c/0x50

            Some vmcores even report the problem with the free pointer when showing basic kmalloc information, but not always:

            crash> kmem -s | grep bad
            kmem: kmalloc-64: slab: fffff3624016b380 invalid freepointer: 2c7e5a86dcc759c8 

             

            However, there are still a number of sanity-sec test_31 failures with other back traces (I saw at least 4 different ones) without ext4 or kmalloc in them. It seems the problem randomly corrupts memory.

            scherementsev Sergey Cheremencev added a comment

            The two patches on 2.15.56 and 2.15.58 passed testing (or at least have not yet failed; it takes about 9h = 100 x 325s for the custom sanity-sec test_31 sessions to complete). The two patches on 2.15.63 and 2.15.60, and then on 2.15.59 (which is ~= 2.15.58-186) and 2.15.58-93, have both failed fairly quickly. I've moved these patches to 2.15.58-62 and 2.15.58-31.

            This is starting to converge toward the suspect patch v2_15_58-60-g8aa231a994 "LU-16713 llite: writeback/commit pages under memory pressure". Unfortunately, both the build and test queues are fairly busy right now, so the next cycle may take some time to finish. If the 2.15.56 and 2.15.58 patches finally complete testing, they should be moved to 2.15.58-42 and 2.15.58-52 on the expectation that 2.15.58-31 will pass.

            We might also consider reducing ONLY_REPEAT=100 to 15 or so (max 1.3h), since it appears the test session is failing almost immediately (1 or 2 iterations) and consistently: (almost?) every session that runs fails.
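
            For example (a sketch only; the serverversion and other options from the existing custom sessions would carry over), the trimmed-down sessions could be resubmitted with something like:

            Test-Parameters: testlist=sanity-sec env=ONLY=31,ONLY_REPEAT=15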

            adilger Andreas Dilger added a comment

            adilger, the first similar failure with ext4 in a backtrace that I found is https://testing.whamcloud.com/test_sets/2a8f520f-2c3d-49b6-9466-df9945d96017:

             [11857.512147] usercopy: Kernel memory exposure attempt detected from SLUB object 'kmalloc-96' (offset 78, size 22)!
            [11857.513942] ------------[ cut here ]------------
            [11857.514726] kernel BUG at mm/usercopy.c:103!
            [11857.515529] invalid opcode: 0000 [#1] SMP PTI 
            [11857.516276] CPU: 0 PID: 504970 Comm: modprobe Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.10.1.el8_7.x86_64 #1
            [11857.518162] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [11857.519102] RIP: 0010:usercopy_abort+0x74/0x76
            [11857.519939] Code: 0f 45 c6 51 48 89 f9 48 c7 c2 16 b4 30 93 41 52 48 c7 c6 03 43 2f 93 48 c7 c7 e0 b4 30 93 48 0f 45 f2 48 89 c2 e8 fd 10 e1 ff <0f> 0b 49 89 f0 48 89 f9 44 89 e2 31 f6 48 c7 c7 3d b4 30 93 e8 73
            [11857.522837] RSP: 0018:ffffa736c1a2bd78 EFLAGS: 00010246
            [11857.523694] RAX: 0000000000000065 RBX: ffff90c943feb0ae RCX: 0000000000000000
            [11857.524846] RDX: 0000000000000000 RSI: ffff90c9ffc16698 RDI: ffff90c9ffc16698
            [11857.525992] RBP: 0000000000000016 R08: 0000000000000000 R09: c0000000ffff7fff
            [11857.527137] R10: 0000000000000001 R11: ffffa736c1a2bb98 R12: 0000000000000001
            [11857.528284] R13: ffff90c943feb0c4 R14: 0000000000000016 R15: ffff90c943feb0ae
            [11857.529458] FS:  00007f79e436e740(0000) GS:ffff90c9ffc00000(0000) knlGS:0000000000000000
            [11857.530751] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [11857.531709] CR2: 0000562128291458 CR3: 0000000003da2003 CR4: 00000000001706f0
            [11857.532859] Call Trace:
            [11857.533377]  __check_heap_object+0xd7/0x110
            [11857.534117]  __check_object_size+0x107/0x173
            [11857.534843]  filldir64+0xbe/0x130
            [11857.535466]  call_filldir+0x7f/0x120 [ext4]
            [11857.536502]  ext4_readdir+0x48f/0x8c0 [ext4]
            [11857.537242]  iterate_dir+0x144/0x1a0
            [11857.538570]  ksys_getdents64+0x9c/0x130
            [11857.539875]  __x64_sys_getdents64+0x16/0x20
            [11857.540579]  do_syscall_64+0x5b/0x1b0
            [11857.541291]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
            [11857.542234] RIP: 0033:0x7f79e333f4bb 

            There were several kernel panics in sanity-sec_31 during https://review.whamcloud.com/#/c/fs/lustre-release/+/51209/ testing. That's why I tried to revert this patch, suspecting it might be guilty. Yes, there is still a chance that these failures are something different.

            scherementsev Sergey Cheremencev added a comment

            Looking back through the sanity-sec test_31 failures, the first client crash reported with interop testing was on 2023-09-26 (nothing before that back to 2023-06-01) running full testing with 2.15.58.62 client + 2.12.9 server:
            https://testing.whamcloud.com/sub_tests/8571bbbe-fd9d-41f5-a733-764f02f76bc7
            It crashed with an assertion during client unmount:

            lu_object.c:1805:lu_context_fini()) ASSERTION( list_empty(&ctx->lc_remember) ) failed
            [22405.272300] CPU: 1 PID: 1366321 Comm: ldlm_bl_02 4.18.0-477.21.1.el8_8.x86_64 #1
            [22405.275539] Call Trace:
            [22405.276112]  dump_stack+0x41/0x60
            [22405.276840]  panic+0xe7/0x2ac
            [22405.278236]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
            [22405.279190]  lu_context_fini+0x189/0x190 [obdclass]
            [22405.280165]  lu_env_fini+0x16/0x30 [obdclass]
            [22405.281044]  ldlm_bl_thread_main+0x4d1/0x920 [ptlrpc]
            [22405.284646]  kthread+0x134/0x150
            

            which doesn't look very similar to this issue but at a minimum represents the oldest possible failure. After 2023-09-26 there are crashes every 3-7 days for full-part-2 testing, probably whenever patches are landed to master.

            The next crash on 2023-09-29 running client 2.15.58.82 definitely appears to be the same:
            https://testing.whamcloud.com/test_sets/cb672876-171a-40d4-bc2e-c4615165f752

            [ 4931.621588] CPU: 1 PID: 242163 Comm: bash  5.14.0-284.30.1.el9_2.x86_64 #1
            [ 4931.623653] RIP: 0010:__kmalloc+0x116/0x360
            [ 4931.633766] Call Trace:
            [ 4931.634121]  <TASK>
            [ 4931.634447]  ext4_htree_store_dirent+0x36/0x100 [ext4]
            [ 4931.635272]  htree_dirblock_to_tree+0x1ab/0x310 [ext4]
            [ 4931.635962]  ext4_htree_fill_tree+0x243/0x3b0 [ext4]
            [ 4931.636622]  ext4_dx_readdir+0x134/0x390 [ext4]
            [ 4931.637229]  ext4_readdir+0x54/0x580 [ext4]
            [ 4931.637821]  iterate_dir+0x17c/0x1c0
            [ 4931.639479]  do_syscall_64+0x5c/0x90
            

            I did confirm that there was master client + 2.12.9/2.14.0 server interop testing for sanity-sec test_31 going back to at least 2.15.52.175 on 2022-12-06 that was not failing.

            There were two patches that landed on master on 2023-09-26:

            03a795efa4 LU-16713 llite: remove unused ccc_unstable_waitq (looks harmless)
            8aa231a994 LU-16713 llite: writeback/commit pages under memory pressure

            Note also that there are a number of patches that modified test_31 in this timeframe:

            7c45baeb21 LU-17996 mgs: add ability to clear exports (2024-07-13)
            e163883f76 LU-17666: configure lnet before add net in sanity-sec:31 (2024-04-02)
            28b4d02161 LU-16307 tests: fix sanity-sec test_31 (2024-03-04)
            8d828762d1 LU-17015 gss: support large kerberos token for rpc sec init (2023-10-25)
            f70821fc14 LU-16775 tests: cleanup target after sanity-sec test_31 (2023-05-19)
            c508c94268 LU-16557 client: -o network needs add_conn processing (2023-03-01)
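
            (For reference, a list like the one above can be regenerated with something along the lines of:)

            git log --oneline --since=2023-01-01 -G 'test_31' -- lustre/tests/sanity-sec.sh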

            adilger Andreas Dilger added a comment

            People

              scherementsev Sergey Cheremencev
              qian_wc Qian Yingjin
              Votes: 0
              Watchers: 8
