[LU-15835] sanityn test_102: [20108.951589] WARNING: CPU: 1 PID: 722753 at fs/exportfs/expfs.c:96 reconnect_path+0x22b/0x2c0 Created: 10/May/22  Updated: 10/Nov/22

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: always_except

Issue Links:
Related
is related to LU-14826 enable striped directory as NFS export Resolved
is related to LU-15757 sanityn test_109: Oops in ll_md_block... Resolved
is related to LU-8585 All Lustre test suites should pass wi... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for John Hammond <jhammond@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/edf5d3ea-27eb-49cb-b702-feabd080fbfa

test_102 failed with the following error:

========================================================== 04:24:30 \(1652156670\)
[20105.929303] Lustre: DEBUG MARKER: == sanityn test 102: Test open by handle of unlinked file ========================================================== 04:24:30 (1652156670)
[20108.951589] WARNING: CPU: 1 PID: 722753 at fs/exportfs/expfs.c:96 reconnect_path+0x22b/0x2c0
[20108.953218] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon i2c_piix4 joydev pcspkr ip_tables ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net virtio_blk net_failover failover [last unloaded: libcfs]
[20108.963772] CPU: 1 PID: 722753 Comm: check_fhandle_s Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
[20108.965968] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[20108.967020] RIP: 0010:reconnect_path+0x22b/0x2c0
[20108.967879] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4d 85 f6 0f 85 91 00 00 00 48 89 df e8 93 06 f8 ff 48 85 ed 0f 85 43 ff ff ff e9 47 ff ff ff <0f> 0b e9 5f ff ff ff 48 89 df e8 76 06 f8 ff 48 3b 5b 18 74 47 4d
[20108.971094] RSP: 0018:ffffb7570144fd58 EFLAGS: 00010246
[20108.972054] RAX: ffff897bc8617240 RBX: ffff897bc8617058 RCX: 0000000300000000
[20108.973332] RDX: 0000000000000001 RSI: 0000000200000000 RDI: ffff897bc8617298
[20108.974610] RBP: ffff897bc8617240 R08: 9a28cbcbbe51ee3a R09: 0000000000000002
[20108.975917] R10: ffffb7570144fd38 R11: ffff897bbff28c4a R12: ffff897bc8617240
[20108.977188] R13: ffff897bdec693a0 R14: ffff897b971d8c00 R15: ffffb7570144fda0
[20108.978459] FS:  00007f3f59036740(0000) GS:ffff897bffd00000(0000) knlGS:0000000000000000
[20108.979893] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20108.980924] CR2: 0000000000602070 CR3: 000000005779c006 CR4: 00000000000606e0
[20108.982197] Call Trace:
[20108.982814]  ? drop_caches_sysctl_handler.cold.2+0x34/0x34
[20108.983817]  exportfs_decode_fh+0x111/0x28f
[20108.984646]  ? do_wp_page+0x215/0x310
[20108.985349]  ? __handle_mm_fault+0x79a/0x7c0
[20108.986203]  ? _cond_resched+0x15/0x30
[20108.986950]  ? __kmalloc+0x16f/0x210
[20108.987635]  do_handle_open+0x150/0x310
[20108.988450]  do_syscall_64+0x5b/0x1a0
[20108.989208]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[20108.990210] RIP: 0033:0x7f3f589534f8
[20108.990898] Code: ff eb b1 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 65 50 2c 00 8b 00 85 c0 75 17 b8 30 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 50 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55
[20108.994110] RSP: 002b:00007ffe5b418188 EFLAGS: 00000246 ORIG_RAX: 0000000000000130
[20108.995450] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f3f589534f8
[20108.996720] RDX: 0000000000010000 RSI: 00000000013472a0 RDI: 0000000000000004
[20108.998001] RBP: 00007ffe5b4182a0 R08: 0000000000000000 R09: 00007f3f589d3d40
[20108.999298] R10: 0000000000000005 R11: 0000000000000246 R12: 00000000013472a0
[20109.000563] R13: 00007ffe5b41a030 R14: 0000000000000020 R15: 00000000013472a0
[20109.001844] ---[ end trace 2f92765dad6425db ]---
The warning at fs/exportfs/expfs.c:96 is the WARN_ON_ONCE() in clear_disconnected():

static void clear_disconnected(struct dentry *dentry)
{
	dget(dentry);
	while (dentry->d_flags & DCACHE_DISCONNECTED) {
		struct dentry *parent = dget_parent(dentry);

		WARN_ON_ONCE(IS_ROOT(dentry)); /* fs/exportfs/expfs.c:96 */

		spin_lock(&dentry->d_lock);
		dentry->d_flags &= ~DCACHE_DISCONNECTED;
		spin_unlock(&dentry->d_lock);

		dput(dentry);
		dentry = parent;
	}
	dput(dentry);
}
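
For reference, a minimal user-space sketch of the scenario the test exercises: encode a file handle, unlink the file, then open it by handle. This is not the actual check_fhandle_syscalls source; the mount point and file name are made up, the file is assumed to already exist, and open_by_handle_at() requires CAP_DAC_READ_SEARCH.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/mnt/lustre/d102/f102";	/* hypothetical test file */
	struct file_handle *fh;
	int mount_id, mount_fd, fd;

	fh = malloc(sizeof(*fh) + 128);
	if (fh == NULL)
		return 1;
	fh->handle_bytes = 128;

	/* encode a file handle while the file still has a name */
	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}

	/* keep the mount pinned so open_by_handle_at() can resolve the handle */
	mount_fd = open("/mnt/lustre", O_RDONLY | O_DIRECTORY);
	if (mount_fd < 0) {
		perror("open");
		return 1;
	}

	/* unlink the file, then open it by handle; this is the code path that
	 * goes through exportfs_decode_fh() -> reconnect_path(), where the
	 * WARN_ON_ONCE() in this ticket fires */
	unlink(path);
	fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
	if (fd < 0)
		perror("open_by_handle_at");
	else
		close(fd);

	close(mount_fd);
	free(fh);
	return 0;
}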

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanityn test_102 -



 Comments   
Comment by Gerrit Updater [ 10/May/22 ]

"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47277
Subject: LU-15835 test: unmount clients in sanityn test_102
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cda9f10b4d84f76b87b5ed4df0a24b528f8f7848

Comment by James A Simmons [ 10/May/22 ]

I suspect this is due to Lustre's fh_to_dentry() handler finding the alias in the .lustre/fid path, and that special .lustre/fid dentry being turned into an anonymous dentry by the unlink. I need to push a debug patch to confirm this theory.
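
For context, a generic sketch of the pattern an fh_to_dentry handler follows. This is the common exportfs pattern, not Lustre's actual handler, and example_iget() is a hypothetical lookup helper standing in for the filesystem's own iget.

#include <linux/exportfs.h>
#include <linux/dcache.h>
#include <linux/err.h>

/* Hypothetical lookup helper standing in for the filesystem's own iget. */
struct inode *example_iget(struct super_block *sb, u32 ino);

static struct dentry *example_fh_to_dentry(struct super_block *sb,
					    struct fid *fid,
					    int fh_len, int fh_type)
{
	struct inode *inode = example_iget(sb, fid->i32.ino);

	if (IS_ERR(inode))
		return ERR_CAST(inode);

	/*
	 * d_obtain_alias() returns an existing attached alias if there is
	 * one, otherwise it allocates an anonymous IS_ROOT dentry with
	 * DCACHE_DISCONNECTED set.  reconnect_path() must re-attach that
	 * disconnected dentry to its parent before clear_disconnected()
	 * runs, which is why an IS_ROOT dentry there triggers the warning.
	 */
	return d_obtain_alias(inode);
}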

Comment by Andreas Dilger [ 10/May/22 ]

The part of sanityn test_102 that is failing was added in patch https://review.whamcloud.com/44168 "LU-14826 mdt: getattr_name("..") under striped directory" but was only enabled on 2022-05-05 when patch https://review.whamcloud.com/46693 "LU-14826 tests: fix sanityn.sh test_102 version check" fixed a syntax error in the version check.

The LU-14826 change matches James' theory that this relates to open-by-handle on a directory.

Comment by Gerrit Updater [ 10/May/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/47286
Subject: LU-15835 test: export bug in sanityn test_102
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5414607fcfa926353894b21026a332dd96f19a1d

Comment by Lai Siyao [ 11/May/22 ]

How come this WARN_ON_ONCE(IS_ROOT(dentry)) is triggered?

Comment by John Hammond [ 11/May/22 ]

This seems to be a ZFS-only issue.

Comment by Andreas Dilger [ 12/May/22 ]

Should be fixed by patch https://review.whamcloud.com/47086 "LU-15757 llite: check s_root ll_md_blocking_ast()".

Comment by John Hammond [ 12/May/22 ]

> Should be fixed by patch https://review.whamcloud.com/47086 "LU-15757 llite: check s_root ll_md_blocking_ast()".

How can that change fix the WARN_ON_ONCE(IS_ROOT(dentry)) ?

Comment by John Hammond [ 13/May/22 ]

The underlying issue here is not really fixed by the landing of https://review.whamcloud.com/47086 "LU-15757 llite: check s_root ll_md_blocking_ast()".

See https://testing.whamcloud.com/test_sets/801ce875-26f5-4d8f-9418-ceb6558d8564

test_102 produces the same WARN_ON_ONCE(), and then when unmounting /mnt/lustre2 in test_109 we get
"VFS: Busy inodes after unmount of lustre. Self-destruct in 5 seconds. Have a nice day...".

[17438.002991] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 102: Test open by handle of unlinked file ========================================================== 10:08:11 \(1652436491\)
[17438.505282] Lustre: DEBUG MARKER: == sanityn test 102: Test open by handle of unlinked file ========================================================== 10:08:11 (1652436491)
[17439.364004] WARNING: CPU: 0 PID: 740619 at fs/exportfs/expfs.c:96 reconnect_path+0x22b/0x2c0
...
[17439.376227] CPU: 0 PID: 740619 Comm: check_fhandle_s Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
[17439.378445] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[17439.379507] RIP: 0010:reconnect_path+0x22b/0x2c0
[17439.380367] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4d 85 f6 0f 85 91 00 00 00 48 89 df e8 93 06 f8 ff 48 85 ed 0f 85 43 ff ff ff e9 47 ff ff ff <0f> 0b e9 5f ff ff ff 48 89 df e8 76 06 f8 ff 48 3b 5b 18 74 47 4d
...
[17439.394751] Call Trace:
[17439.395290]  ? drop_caches_sysctl_handler.cold.2+0x34/0x34
[17439.396306]  exportfs_decode_fh+0x111/0x28f
[17439.397105]  ? do_wp_page+0x215/0x310
[17439.397812]  ? __handle_mm_fault+0x79a/0x7c0
[17439.398638]  ? _cond_resched+0x15/0x30
[17439.399362]  ? __kmalloc+0x16f/0x210
[17439.400046]  do_handle_open+0x150/0x310
[17439.400801]  do_syscall_64+0x5b/0x1a0
[17439.401521]  entry_SYSCALL_64_after_hwframe+0x65/0xca
...

[17554.085330] Lustre: DEBUG MARKER: == sanityn test 109: Race with several mount instances on 1 node ========================================================== 10:10:06 (1652436606)
...
[17555.052891] Lustre: Unmounted lustre-client
...
[17555.259102] Lustre: DEBUG MARKER: umount /mnt/lustre2 2>&1
[17555.485451] LustreError: 747603:0:(file.c:243:ll_close_inode_openhandle()) lustre-clilmv-ffff9bb15ef19000: inode [0x280000404:0x2462:0x0] mdc close failed: rc = -108
[17555.488148] LustreError: 747603:0:(file.c:243:ll_close_inode_openhandle()) Skipped 11 previous similar messages
[17555.530183] VFS: Busy inodes after unmount of lustre. Self-destruct in 5 seconds.  Have a nice day...
Comment by Andreas Dilger [ 19/May/22 ]

simmonsja any chance you could look into this issue? It is one of the last 2.15 blockers without a patch.

Comment by James A Simmons [ 22/May/22 ]

I have been debugging this issue. It's a race between unlink() and open(). The unlink() leaves a bare inode whose dentry is cleaned up at a later time. When the open by file handle is attempted, d_obtain_alias() tries to reattach the found dentry into the directory tree, but I think that dentry has not been properly cleaned up yet. At the same time a dentry lookup is in progress which calls d_alloc(); that in turn creates an IS_ROOT dentry which is found instead of the previous dentry that was attempting to reconnect to the directory tree.

Comment by James A Simmons [ 23/May/22 ]

So the source of the problem is the setting of MDS_OPEN_LOCK for file handles. I wonder if we would also have problems when opencache_threshold_count is exceeded?

Comment by Peter Jones [ 25/May/22 ]

As this issue has been around since 2.14, let's include it in a future 2.15.x maintenance release rather than delay 2.15 further waiting for a suitable fix to be ready.

Comment by Gerrit Updater [ 25/May/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/47450
Subject: LU-15835 llite: test opencache threshold set to 1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 356ec7286f496cc9efd32e6b0cd4616603430d57

Comment by James A Simmons [ 25/May/22 ]

I do have a workaround but I need to see what the impact of the workaround is.

Comment by Gerrit Updater [ 26/May/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/47458
Subject: LU-15835 llite: remove lld_nfs_dentry flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9057c363b1f04ba3e99c4e332db3aea5eeea622a

Comment by Gerrit Updater [ 25/Aug/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/48345
Subject: LU-15835 tests: see if sanityn test_102
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4875993724cd93c99943547cb64d3c107c9cf0b0
