[LU-16014] sanity test_27M: crash in lod_qos_prep_create() Created: 15/Jul/22  Updated: 16/Jan/24  Resolved: 21/Aug/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Oleg Drokin
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-16872 sanity: test_27M Error: '(5) stripe c... Resolved
Related
is related to LU-15727 lod_get_default_lov_striping() misint... Resolved
is related to LU-16623 lod_statfs_and_check() does not skip ... Resolved
Severity: 3

 Description   

Noticed a recurring crash in boilpot that looks like this:

Lustre: DEBUG MARKER: == sanity test 27M: test O_APPEND striping ====== 21:09:25 (1657760965)
BUG: unable to handle kernel paging request at ffff8801466bccb0
IP: [<ffffffffa13f68f6>] lod_qos_prep_create+0xe96/0x1ab0 [lod]
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 3 PID: 2694 Comm: mdt01_002 Kdump: loaded  3.10.0-7.9-debug #2
Hardware name: Red Hat KVM, BIOS 1.15.0-1.module_el8.6.0+1087+b42c8331 04/01/2014
Call Trace:
 lod_prepare_create+0x23b/0x320 [lod]
 lod_declare_striped_create+0xf8/0xa50 [lod]
 lod_declare_create+0x1f5/0x600 [lod]
 mdd_declare_create_object_internal+0xd3/0x3b0 [mdd]
 mdd_declare_create_object.isra.35+0x51/0xb60 [mdd]
 mdd_declare_create+0x66/0x480 [mdd]
 mdd_create+0x9a9/0x1d30 [mdd]
 mdt_reint_open+0x2004/0x2c10 [mdt]
 mdt_reint_rec+0x87/0x240 [mdt]
 mdt_reint_internal+0x76c/0xb50 [mdt]
 mdt_intent_open+0x93/0x480 [mdt]
 mdt_intent_opc+0x1dd/0xc10 [mdt]
 mdt_intent_policy+0x1a1/0x360 [mdt]
 ldlm_lock_enqueue+0x3c2/0xb40 [ptlrpc]
 ldlm_handle_enqueue0+0x8c6/0x1780 [ptlrpc]
 tgt_enqueue+0x64/0x240 [ptlrpc]
 tgt_request_handle+0x93a/0x19c0 [ptlrpc]
 ptlrpc_server_handle_request+0x250/0xc30 [ptlrpc]
 ptlrpc_main+0xbd9/0x15f0 [ptlrpc]
 kthread+0xe4/0xf0
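
Each frame above has the form symbol+offset/size [module], where offset is the position inside the function and size is the function's total length. A minimal Python sketch for pulling those fields out of a pasted trace follows; the regular expression and the `parse_frame` helper are illustrative assumptions, not part of any Lustre or kernel tooling.

```python
import re

# Matches kernel stack frames such as "lod_prepare_create+0x23b/0x320 [lod]".
# The module suffix is optional because core-kernel frames (e.g. kthread) have none.
FRAME_RE = re.compile(
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frame(line):
    """Return (symbol, offset, size, module) for one frame, or None if no frame is found."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return (m["symbol"], int(m["offset"], 16), int(m["size"], 16), m["module"])


# The faulting instruction pointer line from the oops above.
frame = parse_frame("IP: [<ffffffffa13f68f6>] lod_qos_prep_create+0xe96/0x1ab0 [lod]")
print(frame)
```

The offset can then be resolved against a module built with debug info (for example with gdb's `list *(symbol+0xoffset)`) to find the source line of the faulting access.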

I think it came from the LU-15727 patch https://review.whamcloud.com/47014

It first hit on June 20th, and then it really intensified in the past few days for some reason.

Very first crash (has vmcore and all):

http://testing.linuxhacker.ru/lustre-reports/external/crashes/boilpot-bigmem-98-2022-06-20-10:50:27/

Most recent crash, with vmcore, out of current master-next:

http://testing.linuxhacker.ru/lustre-reports/external/crashes/boilpot-bigmem-28-2022-07-12-03:11:17/



 Comments   
Comment by Colin Faber [ 08/Aug/22 ]

Hi green, if you remove the LU-15727 patch, are you still seeing the issue?

Comment by Minh Diep [ 16/Nov/22 ]

+1 on master https://testing.whamcloud.com/test_sessions/e5e1efcb-2b2d-44ac-9573-69ffd456b050

Comment by Andreas Dilger [ 13/Jan/23 ]

+5 on master in the past 4 weeks, all ldiskfs review sessions (DNE and non-DNE, one aarch64):
https://testing.whamcloud.com/test_sets/c4f8dd3c-7514-40f9-84a1-34c6ccb54ae3
https://testing.whamcloud.com/test_sets/01ab0985-4735-42b2-aa62-e3589414d6be
https://testing.whamcloud.com/test_sets/7d0da5b9-281a-4d16-b738-947c50d04d64
https://testing.whamcloud.com/test_sets/b3157654-0dd5-4626-85e4-7d6046e1a28f
https://testing.whamcloud.com/test_sets/1aa360a8-e42a-4dae-b598-e857f23bc06c

Comment by Alex Zhuravlev [ 10/Apr/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/128b97d8-62fa-4844-a7f0-cd325ca58198

Comment by Andreas Dilger [ 07/Jul/23 ]

This bug was introduced by patch https://review.whamcloud.com/47014 "LU-15727 lod: honor append_pool with default composite layouts", which landed on master on 2022-07-11.

Comment by Andreas Dilger [ 07/Jul/23 ]

There is a patch, https://review.whamcloud.com/51559 "LU-16872 lod: reset llc_ostlist when using O_APPEND stripes", that should fix this.

Comment by Andreas Dilger [ 21/Aug/23 ]

Fixed via LU-16872
