[LU-15757] sanityn test_109: Oops in ll_md_blocking_ast() at umount Created: 19/Apr/22  Updated: 08/Nov/22  Resolved: 12/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream, Lustre 2.15.0
Fix Version/s: Lustre 2.15.0, Lustre 2.15.2

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15835 sanityn test_102: [20108.951589] WARN... Reopened
is related to LU-6142 Enforce Linux kernel coding style in ... Open
is related to LU-15305 sanityn test_109 crash: list_del corr... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description
Trace:
PID: 622621  TASK: ffff89d05b2117c0  CPU: 1   COMMAND: "ll_imp_inval"
 #0 [ffff89d04d307b70] panic at ffffffff860af786
    /tmp/kernel/kernel/panic.c: 299
 #1 [ffff89d04d307c00] ll_md_blocking_ast at ffffffffc1919052 [lustre]
    /home/lustre/master-mine/lustre/llite/namei.c: 388
 #2 [ffff89d04d307c60] ldlm_cancel_callback at ffffffffc0fc074c [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c: 2445
 #3 [ffff89d04d307cb0] ldlm_cli_cancel_local at ffffffffc0fd8ce6 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_request.c: 1244
 #4 [ffff89d04d307cd0] ldlm_cli_cancel at ffffffffc0fde6a8 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_request.c: 1569
 #5 [ffff89d04d307d30] cleanup_resource_queue at ffffffffc0fc278c [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1091
 #6 [ffff89d04d307d80] ldlm_resource_cleanup at ffffffffc0fc2914 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1106
 #7 [ffff89d04d307d98] ldlm_resource_clean_hash at ffffffffc0fc294c [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1119
 #8 [ffff89d04d307da8] cfs_hash_for_each_relax at ffffffffc0b39cc5 [libcfs]
    /home/lustre/master-mine/libcfs/libcfs/hash.c: 1644
 #9 [ffff89d04d307e20] cfs_hash_for_each_nolock at ffffffffc0b3d40f [libcfs]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_hash.h: 402
#10 [ffff89d04d307e48] ldlm_namespace_cleanup at ffffffffc0fc2ce6 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1163
#11 [ffff89d04d307e60] mdc_import_event at ffffffffc13456f6 [mdc]
    /home/lustre/master-mine/lustre/mdc/mdc_request.c: 2712
#12 [ffff89d04d307e90] ptlrpc_invalidate_import at ffffffffc102460d [ptlrpc]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 154

After adding an additional assertion:

        if ((bits & (MDS_INODELOCK_LOOKUP | MDS_INODELOCK_PERM))) {
                LASSERT(inode);
                LASSERT(inode->i_sb);
                LASSERT(inode->i_sb->s_root);
        }

the s_root assertion fails:

LustreError: 622621:0:(namei.c:389:ll_lock_cancel_bits()) ASSERTION( inode->i_sb->s_root ) failed
BUG: unable to handle kernel NULL pointer dereference at 0000000000000030


Comments
Comment by Gerrit Updater [ 19/Apr/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47086
Subject: LU-15757 llite: check s_root ll_md_blocking_ast()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dccac47b8b305b99fafaccad66994ac492b7e221

Comment by Alex Zhuravlev [ 19/Apr/22 ]

I'm not sure how llite's umount can be blocked until all of the import's invalidations are done, so the patch above just checks that s_root is not NULL.
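
A minimal userspace sketch of the kind of guard described above, not the actual patch. The struct definitions are stripped-down stand-ins for the kernel types, the bit values are placeholders chosen for this sketch, and should_prune_aliases() is a hypothetical helper name; only the s_root NULL check itself comes from the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct inode;
struct dentry { struct inode *d_inode; };
struct super_block { struct dentry *s_root; };
struct inode { struct super_block *i_sb; };

/* Placeholder bit values for this sketch (assumption, not taken from the ticket). */
#define MDS_INODELOCK_LOOKUP 0x01u
#define MDS_INODELOCK_PERM   0x10u

/* Models a root-inode test that dereferences s_root unconditionally. */
static bool is_root_inode(const struct inode *inode)
{
	return inode == inode->i_sb->s_root->d_inode;
}

/*
 * Hypothetical helper: decide whether a lock cancel with these bits should
 * touch the inode's aliases. The s_root check comes first so that a
 * superblock whose root dentry is already gone is never dereferenced.
 */
static bool should_prune_aliases(const struct inode *inode, unsigned int bits)
{
	assert(inode != NULL && inode->i_sb != NULL);

	if (!(bits & (MDS_INODELOCK_LOOKUP | MDS_INODELOCK_PERM)))
		return false;
	if (inode->i_sb->s_root == NULL)	/* umount in progress */
		return false;
	return !is_root_inode(inode);
}
```

With s_root set to NULL the helper returns false without calling is_root_inode(), which is the behavior the check is meant to guarantee. Whether skipping the work entirely is the right response during umount is the open question raised in the comment above.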

Comment by John Hammond [ 10/May/22 ]

Do you think this might be related to LU-15835?

I looked at several recent sanityn sessions this morning. It appears that whenever we hit the warning in test_102, the client then crashes during unmount in test_109. Conversely, when there is no warning, there is no crash.

See
https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&server_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&horizon=518400&test_set_script_id=570ba67a-4a46-11e0-a7f6-52540025f9af&sub_test_script_id=99ec82f0-f44d-4789-952d-ce0ac311b80e&source=sub_tests#redirect

Comment by Andreas Dilger [ 10/May/22 ]

+2 timeouts on master in sanityn test_109 in the past few days:
https://testing.whamcloud.com/test_sets/c8c1e14a-226f-4b99-8b4a-0f20ab910488
https://testing.whamcloud.com/test_sets/34da2ea2-433f-441b-85a3-b337237c5bfe

Not sure if this is a regression because of some recent patch landing, or just coincidence.

I hadn't seen John's comment that there are also crashes in this test. That would be more of a smoking gun: most crashes are from 2022-05-05 or later. Patch https://review.whamcloud.com/46693 "LU-14826 tests: fix sanityn.sh test_102 version check" landed on that day, and part of that test was previously always skipped due to a syntax error.

Comment by John Hammond [ 11/May/22 ]

Seems to be reproduced for fstype=zfs, mdtcount=4 (or > 1?) by sanityn test_102 followed by umount. See https://review.whamcloud.com/#/c/47277/ and https://testing.whamcloud.com/test_sets/74867bfb-0bd6-4bec-80f4-16c5f67d8d7c

Comment by Gerrit Updater [ 12/May/22 ]

"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47311
Subject: LU-15757 test: disable sanityn test_102() for zfs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e280988367b8adf4391402a46a7c90333a7d3a80

Comment by John Hammond [ 12/May/22 ]

> Seems to be reproduced for fstype=zfs, mdtcount=4 (or > 1?) by sanityn test_102 followed by umount.

Correction: ZFS is just more likely to produce the timing needed for this.

Comment by Gerrit Updater [ 12/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47086/
Subject: LU-15757 llite: check s_root ll_md_blocking_ast()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0095c0d05ca80a2494710e3b4afb1d1e4b5cdcfe

Comment by Andreas Dilger [ 12/May/22 ]

This was introduced by patch https://review.whamcloud.com/40293 "LU-6142 lustre: use is_root_inode()" landed as commit v2_14_50-100-gfca56be02b.
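
One possible reading of the faulting address 0x30 in the description, sketched in userspace: if the root-inode test reads s_root->d_inode while s_root is NULL, the access lands at the offset of d_inode within struct dentry. The mock layout below is an assumption modeled on the leading fields of a 64-bit kernel's struct dentry built without lock debugging; the mock_* names and null_s_root_fault_address() are hypothetical, and the real layout depends on kernel version and configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct inode;

/* Mock of the leading fields of struct dentry (assumed layout, LP64). */
struct mock_hlist_bl_node { void *next; void **pprev; };
struct mock_qstr { uint32_t hash; uint32_t len; const unsigned char *name; };
struct mock_dentry {
	unsigned int d_flags;
	unsigned int d_seq;			/* stands in for seqcount_t */
	struct mock_hlist_bl_node d_hash;
	struct mock_dentry *d_parent;
	struct mock_qstr d_name;
	struct inode *d_inode;
};

/*
 * Address that a read of ((struct mock_dentry *)NULL)->d_inode would touch,
 * i.e. the offset of d_inode from the start of the structure.
 */
static size_t null_s_root_fault_address(void)
{
	assert(sizeof(void *) == 8);	/* the assumed layout is 64-bit only */
	return offsetof(struct mock_dentry, d_inode);
}
```

Under these assumptions the offset matches the address reported in the NULL pointer dereference, which would be consistent with the s_root assertion failure shown in the description.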

Comment by Gerrit Updater [ 18/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47311/
Subject: LU-15757 test: disable sanityn test_102() for zfs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a8da1802236c4270dddb8fe88251de4b86d66084

Comment by Gerrit Updater [ 01/Sep/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48406
Subject: LU-15757 llite: check s_root ll_md_blocking_ast()
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: bede6b390c455c40a858de0a48d544970892aeb1

Comment by Gerrit Updater [ 08/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48406/
Subject: LU-15757 llite: check s_root ll_md_blocking_ast()
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: e6dd92e27af98b38ba5bd8e7e818efa82971a145
