[LU-10851] parallel-scale-nfsv4 hangs on unmount Created: 26/Mar/18 Updated: 03/Oct/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Qian Yingjin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
parallel-scale-nfsv4 hangs on unmount after all tests have run. In the suite_log, the last thing we see is:
== parallel-scale-nfsv4 test complete, duration 2088 sec ============================================= 22:07:24 (1521868044)
Unmounting NFS clients...
CMD: trevis-8vm1,trevis-8vm2 umount -f /mnt/lustre
Unexporting Lustre filesystem...
CMD: trevis-8vm1,trevis-8vm2 chkconfig --list rpcidmapd 2>/dev/null |
grep -q rpcidmapd && service rpcidmapd stop ||
true
CMD: trevis-8vm4 { [[ -e /etc/SuSE-release ]] &&
service nfsserver stop; } ||
service nfs stop
CMD: trevis-8vm4 sed -i '/^lustre/d' /etc/exports
CMD: trevis-8vm4 exportfs -v
CMD: trevis-8vm4 grep -c /mnt/lustre' ' /proc/mounts
Stopping client trevis-8vm4 /mnt/lustre (opts:-f)
CMD: trevis-8vm4 lsof -t /mnt/lustre
CMD: trevis-8vm4 umount -f /mnt/lustre 2>&1
Looking at the console logs for vm4, MDS1 and 3, we see:
[ 2216.385890] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2088 sec ============================================= 22:07:24 (1521868044)
[ 2216.698201] Lustre: DEBUG MARKER: { [[ -e /etc/SuSE-release ]] &&
[ 2216.698201] service nfsserver stop; } ||
[ 2216.698201] service nfs stop
[ 2216.805093] nfsd: last server has exited, flushing export cache
[ 2216.819487] Lustre: DEBUG MARKER: sed -i '/^lustre/d' /etc/exports
[ 2216.885266] Lustre: DEBUG MARKER: exportfs -v
[ 2216.945098] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
[ 2216.982526] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[ 2217.170422] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2>&1
[ 2217.192827] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[ 2217.193476] LustreError: 410:0:(file.c:205:ll_close_inode_openhandle()) lustre-clilmv-ffff880060b4e800: inode [0x200000406:0x3c1b:0x0] mdc close failed: rc = -108
[ 2217.218709] Lustre: 4066:0:(llite_lib.c:2676:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.84@tcp:/lustre/fid: [0x200000406:0x3e42:0x0]/ may get corrupted (rc -108)
[ 2217.218732] Lustre: 4066:0:(llite_lib.c:2676:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.84@tcp:/lustre/fid: [0x200000406:0x3e7b:0x0]/ may get corrupted (rc -108)
…
[ 5541.474664]
[ 5541.474667] umount D 0000000000000000 0 410 409 0x00000000
[ 5541.474669] ffff88004365fda8 ffff88004365fde0 ffff880048e5ce00 ffff880043660000
[ 5541.474670] ffff88004365fde0 000000010013feb9 ffff88007fc10840 0000000000000000
[ 5541.474671] ffff88004365fdc0 ffffffff81612a95 ffff88007fc10840 ffff88004365fe68
[ 5541.474672] Call Trace:
[ 5541.474674] [<ffffffff81612a95>] schedule+0x35/0x80
[ 5541.474677] [<ffffffff81615851>] schedule_timeout+0x161/0x2d0
[ 5541.474689] [<ffffffffa1457cc7>] ll_kill_super+0x77/0x150 [lustre]
[ 5541.474723] [<ffffffffa09f3a94>] lustre_kill_super+0x34/0x40 [obdclass]
[ 5541.474734] [<ffffffff8120cf5f>] deactivate_locked_super+0x3f/0x70
[ 5541.474742] [<ffffffff812283fb>] cleanup_mnt+0x3b/0x80
[ 5541.474745] [<ffffffff8109d198>] task_work_run+0x78/0x90
[ 5541.474748] [<ffffffff8107b5cf>] exit_to_usermode_loop+0x91/0xc2
[ 5541.474760] [<ffffffff81003ae5>] syscall_return_slowpath+0x85/0xa0
[ 5541.474768] [<ffffffff81616ca7>] int_ret_from_sys_call+0x25/0x9f
[ 5541.476903] DWARF2 unwinder stuck at int_ret_from_sys_call+0x25/0x9f
[ 5541.476904]
We see this problem with unmount on the master and b2_10 branches for SLES12 SP2 and SP3 testing only. Logs for the test suite failures are at:
https://testing.whamcloud.com/test_sets/4bce5a66-2f2f-11e8-9e0e-52540065bddc
https://testing.whamcloud.com/test_sets/103f280e-2fac-11e8-b3c6-52540065bddc
https://testing.whamcloud.com/test_sets/044a75f0-2eba-11e8-b6a0-52540065bddc |
| Comments |
| Comment by Minh Diep [ 26/Mar/18 ] |
|
Let's wait until after the LU-10566 fix. |
| Comment by Saurabh Tandan (Inactive) [ 08/May/18 ] |
|
+1 for 2.10.3_132 https://testing.whamcloud.com/test_sets/ea75ed40-5091-11e8-abc3-52540065bddc |
| Comment by James Nunez (Inactive) [ 15/Aug/18 ] |
|
We have a similar hang on umount at https://testing.whamcloud.com/test_sets/15040422-a0d1-11e8-87f3-52540065bddc for SLES12 SP3. |
| Comment by James Nunez (Inactive) [ 19/Jun/19 ] |
|
I'm seeing non-SLES parallel-scale-nfsv4 test suites hang on umount. On the MDS, I don't see umount in the 'D' state, but I do see the same error messages on the MDS. For example, https://testing.whamcloud.com/test_sets/9acd3074-9234-11e9-bebb-52540065bddc has the following in the MDS console:
[43887.752148] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2913 sec ============================================= 00:44:22 (1560905062)
[43898.780782] Lustre: DEBUG MARKER: { [[ -e /etc/SuSE-release ]] &&
[43898.780782] service nfsserver stop; } ||
[43898.780782] service nfs stop
[43899.020702] nfsd: last server has exited, flushing export cache
[43899.183891] Lustre: DEBUG MARKER: sed -i '/lustre/d' /etc/exports
[43899.529625] Lustre: DEBUG MARKER: exportfs -v
[43899.871591] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
[43900.207184] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[43900.668082] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2>&1
[43900.832871] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[43900.833789] Lustre: Skipped 7 previous similar messages
[43900.835183] LustreError: 12449:0:(file.c:233:ll_close_inode_openhandle()) lustre-clilmv-ffff8ce81e82e800: inode [0x2000255c6:0x37fc:0x0] mdc close failed: rc = -108
[43900.836647] LustreError: 12449:0:(file.c:233:ll_close_inode_openhandle()) Skipped 38 previous similar messages
[43900.847426] Lustre: 7507:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x3730:0x0]/ may get corrupted (rc -108)
[43900.847442] Lustre: 7506:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x37ee:0x0]/ may get corrupted (rc -108)
[43900.851357] Lustre: 7506:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x3820:0x0]/ may get corrupted (rc -108)
[43900.853652] Lustre: 7507:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x3811:0x0]/ may get corrupted (rc -108)
[43900.853743] Lustre: 7506:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x378a:0x0]/ may get corrupted (rc -108)
[43900.858826] Lustre: 7507:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x3824:0x0]/ may get corrupted (rc -108)
[43900.858892] Lustre: 7506:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x37fa:0x0]/ may get corrupted (rc -108)
[43900.868467] Lustre: 7506:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x381e:0x0]/ may get corrupted (rc -108)
[43900.872369] Lustre: 7507:0:(llite_lib.c:2842:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.144@tcp:/lustre/fid: [0x2000255c6:0x381f:0x0]/ may get corrupted (rc -108)
[43910.921368] Lustre: Unmounted lustre-client
[43951.682868] Lustre: lustre-MDT0000: haven't heard from client e8ec627f-7451-4 (at 10.9.4.144@tcp) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8ce7d95f0800, cur 1560905126 expire 1560905096 last 1560905074
[47228.231671] SysRq : Changing Loglevel
[47228.232247] Loglevel set to 8
|
| Comment by Jian Yu [ 28/Aug/19 ] |
|
+1 on Lustre b2_12 branch: https://testing.whamcloud.com/test_sets/1e8a4714-c997-11e9-a25b-52540065bddc |
| Comment by Jian Yu [ 10/May/20 ] |
|
+1 on master branch: |
| Comment by James Nunez (Inactive) [ 30/Sep/20 ] |
|
We see this hang on parallel-scale-nfsv3 also. |
| Comment by Sarah Liu [ 22/Feb/22 ] |
|
Similar error on master: it failed to stop the nfs service after all tests finished.
onyx-71vm13: Redirecting to /bin/systemctl stop nfs.service
onyx-71vm13: Failed to stop nfs.service: Unit nfs.service not loaded.
onyx-71vm13: Redirecting to /bin/systemctl stop nfs-server.service |
| Comment by Artem Blagodarenko (Inactive) [ 14/Mar/22 ] |
|
+1 on master |
| Comment by Alex Deiter [ 15/Dec/22 ] |
|
We hit the same issue on the master branch: https://testing.whamcloud.com/test_sessions/1442a73f-2793-4e28-9124-a81b4ae65262
All test sessions hang on:
CMD: trevis-70vm4 umount -f /mnt/lustre 2>&1
[ 5154.892582] task:umount state:D stack: 0 pid:37125 ppid: 37124 flags:0x00004080
[ 5154.894509] Call Trace:
[ 5154.895153] __schedule+0x2bd/0x760
[ 5154.896010] schedule+0x37/0xa0
[ 5154.896681] schedule_timeout+0x197/0x300
[ 5154.897504] ? __next_timer_interrupt+0xf0/0xf0
[ 5154.898422] ? __radix_tree_delete+0x92/0xa0
[ 5154.899286] ll_kill_super+0x63/0x130 [lustre]
[ 5154.900268] lustre_kill_super+0x28/0x40 [lustre]
[ 5154.901221] deactivate_locked_super+0x34/0x70
[ 5154.902133] cleanup_mnt+0x3b/0x70
[ 5154.902869] task_work_run+0x8a/0xb0
[ 5154.903640] exit_to_usermode_loop+0xeb/0xf0
[ 5154.904525] do_syscall_64+0x198/0x1a0
[ 5154.905288] entry_SYSCALL_64_after_hwframe+0x65/0xca
dmesg:
[ 1796.367975] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[ 1796.370695] LustreError: 37125:0:(file.c:242:ll_close_inode_openhandle()) lustre-clilmv-ffff95a92ffe3800: inode [0x200000403:0x153d:0x0] mdc close failed: rc = -108
[ 1796.411489] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28cf:0x0]/ may get corrupted (rc -108)
[ 1796.411662] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29ec:0x0]/ may get corrupted (rc -108)
[ 1796.414759] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x293d:0x0]/ may get corrupted (rc -108)
[ 1796.421120] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a1b:0x0]/ may get corrupted (rc -108)
[ 1796.436808] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29ed:0x0]/ may get corrupted (rc -108)
[ 1796.442761] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a05:0x0]/ may get corrupted (rc -108)
[ 1796.443103] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28f9:0x0]/ may get corrupted (rc -108)
[ 1796.445919] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2940:0x0]/ may get corrupted (rc -108)
[ 1796.449051] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28f0:0x0]/ may get corrupted (rc -108)
[ 1796.452119] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29b1:0x0]/ may get corrupted (rc -108)
[ 1796.458583] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x298d:0x0]/ may get corrupted (rc -108)
[ 1796.463828] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29bc:0x0]/ may get corrupted (rc -108)
[ 1796.466124] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2955:0x0]/ may get corrupted (rc -108)
[ 1796.466961] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x298e:0x0]/ may get corrupted (rc -108)
[ 1796.472725] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a0f:0x0]/ may get corrupted (rc -108)
[ 1796.473142] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x294a:0x0]/ may get corrupted (rc -108)
[ 1796.480757] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29e1:0x0]/ may get corrupted (rc -108)
[ 1796.481037] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28f3:0x0]/ may get corrupted (rc -108)
[ 1796.483887] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a10:0x0]/ may get corrupted (rc -108)
[ 1796.494909] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2901:0x0]/ may get corrupted (rc -108)
[ 1796.495139] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2945:0x0]/ may get corrupted (rc -108)
[ 1796.498049] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28d7:0x0]/ may get corrupted (rc -108)
[ 1796.501156] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a19:0x0]/ may get corrupted (rc -108)
[ 1796.504452] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29d2:0x0]/ may get corrupted (rc -108)
[ 1796.519318] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x296b:0x0]/ may get corrupted (rc -108)
[ 1796.525781] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a02:0x0]/ may get corrupted (rc -108)
[ 1826.914042] Lustre: lustre-MDT0000: haven't heard from client 309ad38c-8e48-430d-ae7a-a65d0196c74a (at 0@lo) in 31 seconds. I think it's dead, and I am evicting it. exp 00000000f489d0d3, cur 1671076587 expire 1671076557 last 1671076556 |
| Comment by Colin Faber [ 15/Dec/22 ] |
|
Is this dmesg from the MDS or a client? |
| Comment by Alex Deiter [ 15/Dec/22 ] |
|
Hello cfaber,
Is this dmesg from the MDS or a client?
For some reason, the Lustre test suite uses the MDS host as a Lustre client. The test flow is:
Note - 10.240.43.3 is a local address on the same host.
|
| Comment by Andreas Dilger [ 15/Dec/22 ] |
|
Colin, the llite* messages would definitely be from the client, regardless of where it is mounted. Alex, the reason for the NFS server to be mounted on the MDS is twofold:
I don't think that would be a contributor to the unmount problem being seen, but hard to say for sure. |
| Comment by Andreas Dilger [ 15/Dec/22 ] |
|
Looking at the messages here, it seems the client still has cached files at the time that it is being unmounted, but the connection to the MDS is stopped shortly thereafter and that gives the client problems.
[ 2217.170422] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2>&1
[ 2217.192827] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
The "-f" flag means "force unmount and don't wait gracefully" and should be able to clean up regardless of whether the MDS connection is available or not, but I wonder if it is part of the issue here? If this was just "umount /mnt/lustre" the client would flush the dirty cache (avoiding the "dirty page discard" errors) and probably work more reliably. That said, "umount -f" should also work reliably, but that likely needs some changes to the code on the client. According to the stack traces, the unmount is stuck on:
void ll_kill_super(struct super_block *sb)
{
        struct ll_sb_info *sbi = ll_s2sbi(sb);

        /* ... */
        /* wait running statahead threads to quit */
        while (atomic_read(&sbi->ll_sa_running) > 0)
                schedule_timeout_uninterruptible(
                        cfs_time_seconds(1) >> 3);
        /* ... */
}
so something is preventing the statahead threads from exiting. Likely the statahead threads need to be woken up here during the unmount process so that they can exit. It isn't even clear whether there is a list of statahead threads on the superblock that can be signaled to wake up/exit; it looks like only a count/limit of statahead threads is kept. I recall from Yingjin's recent statahead patches that these threads will exit on their own after a short time, but I'm not sure whether that is enough to fix this problem. Definitely, having a direct notification, with the threads checking that the filesystem is being unmounted, is better than a timeout.
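One way that direct notification could look, purely as an illustrative sketch: ll_sa_running is the real counter from the excerpt above, but the ll_sa_umounting flag, the ll_sa_waitq wait queue, and the sa_has_work()/sa_handle_one() helpers are hypothetical, not existing Lustre code.
        /* in ll_kill_super(): tell the statahead threads to stop, then
         * sleep until the last one has exited, instead of polling every
         * 1/8 second */
        sbi->ll_sa_umounting = 1;
        wake_up_all(&sbi->ll_sa_waitq);
        wait_event(sbi->ll_sa_waitq,
                   atomic_read(&sbi->ll_sa_running) == 0);

        /* in each statahead thread: check the flag in the main loop ... */
        while (!sbi->ll_sa_umounting && sa_has_work(sai))
                sa_handle_one(sai);

        /* ... and wake the unmounting task when the last thread exits */
        if (atomic_dec_and_test(&sbi->ll_sa_running))
                wake_up_all(&sbi->ll_sa_waitq);
Either way, the point is that unmount blocks on an event that the exiting threads signal, rather than spinning on a timeout. |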
| Comment by Andreas Dilger [ 15/Dec/22 ] |
|
Alex, can you please test whether removing the "-f" allows this testing to pass more reliably?
Yingjin, I think you are the most familiar with statahead these days. I'm thinking that the patch to fix this may be quite different before/after your statahead patches. For master, I would prefer that we prioritize landing the statahead patch series, so a patch as early in your series as possible that avoids conflicts with later patches, if the problem isn't already fixed by your series. Then a separate patch for backporting to earlier branches could keep the statahead threads on a list/waitqueue. |
| Comment by Alex Deiter [ 15/Dec/22 ] |
|
Hello adilger, Thank you very much for the detailed explanation! Thank you! |
| Comment by Alex Deiter [ 16/Dec/22 ] |
|
Hello adilger,
Alex, can you please test if removing the "-f" allows this testing to pass more reliably?
Done, but the result is the same: https://testing.whamcloud.com/test_sessions/related?job=lustre-reviews&build=91221#redirect
[ 1737.450139] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test complete, duration 1480 sec ========================================================== 23:08:24 (1671145704)
[ 1738.826052] Lustre: DEBUG MARKER: systemctl stop nfs-server
[ 1739.068162] nfsd: last server has exited, flushing export cache
[ 1739.311751] Lustre: DEBUG MARKER: sed -i '\|^/mnt/lustre|d' /etc/exports
[ 1739.702372] Lustre: DEBUG MARKER: exportfs -v
[ 1740.089608] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
[ 1740.448974] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[ 1740.993566] Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
[ 5093.425686] sysrq: SysRq : Changing Loglevel
...
[ 5096.722623] task:umount state:D stack: 0 pid:56280 ppid: 56279 flags:0x00004080
[ 5096.724108] Call Trace:
[ 5096.724610] __schedule+0x2bd/0x760
[ 5096.725284] schedule+0x37/0xa0
[ 5096.725903] schedule_timeout+0x197/0x300
[ 5096.726669] ? __next_timer_interrupt+0xf0/0xf0
[ 5096.727535] ? __radix_tree_delete+0x92/0xa0
[ 5096.728360] ll_kill_super+0x63/0x130 [lustre]
[ 5096.729239] lustre_kill_super+0x28/0x40 [lustre]
[ 5096.730133] deactivate_locked_super+0x34/0x70
[ 5096.730970] cleanup_mnt+0x3b/0x70
[ 5096.731645] task_work_run+0x8a/0xb0
[ 5096.732344] exit_to_usermode_loop+0xeb/0xf0
[ 5096.733163] do_syscall_64+0x198/0x1a0
[ 5096.733888] entry_SYSCALL_64_after_hwframe+0x65/0xca
Details: umount hangs only for NFSv3 and may be caused by tests using locks:
[ 1231.270056] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test connectathon: connectathon == 22:58:47 \(1671145127\)
[ 1231.633442] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test connectathon: connectathon == 22:58:47 (1671145127)
[ 1232.072645] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sh .\/runtests -N 2 -b -f \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.connectathon
[ 1232.437529] Lustre: DEBUG MARKER: sh ./runtests -N 2 -b -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
[ 1234.227996] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sh .\/runtests -N 2 -g -f \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.connectathon
[ 1234.581110] Lustre: DEBUG MARKER: sh ./runtests -N 2 -g -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
[ 1241.533359] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sh .\/runtests -N 2 -s -f \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.connectathon
[ 1241.914629] Lustre: DEBUG MARKER: sh ./runtests -N 2 -s -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
[ 1250.386938] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sh .\/runtests -N 2 -l -f \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.connectathon
[ 1250.784748] Lustre: DEBUG MARKER: sh ./runtests -N 2 -l -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
[ 1285.778542] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
[ 1325.200970] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
[ 1355.406386] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
[ 1395.339987] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
[ 1434.762512] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
[ 1464.967844] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
[ 1473.104321] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test iorssf: iorssf ============== 23:02:49 \(1671145369\)
[ 1473.474325] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test iorssf: iorssf ============== 23:02:49 (1671145369)
[ 1473.951047] Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv3 test_iorssf: @@@@@@ FAIL: ior failed! 1
[ 1474.334915] Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_iorssf: @@@@@@ FAIL: ior failed! 1
[ 1474.729606] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-2/2022-12-15/lustre-reviews_custom_91221_102_231e7dd2-7190-4e2c-8d24-93aa05c646b7//parallel-scale-nfsv3.test_iorssf.debug_log.$(hostname -s).1671145371.log; dmesg > /autotest/autotest-2/2022-12-15/lustre-reviews_cu
[ 1476.359538] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test iorfpp: iorfpp ============== 23:02:52 \(1671145372\)
[ 1476.726679] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test iorfpp: iorfpp ============== 23:02:52 (1671145372)
[ 1477.201082] Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv3 test_iorfpp: @@@@@@ FAIL: ior failed! 1
[ 1477.578628] Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_iorfpp: @@@@@@ FAIL: ior failed! 1
[ 1477.984261] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-2/2022-12-15/lustre-reviews_custom_91221_102_231e7dd2-7190-4e2c-8d24-93aa05c646b7//parallel-scale-nfsv3.test_iorfpp.debug_log.$(hostname -s).1671145374.log; dmesg > /autotest/autotest-2/2022-12-15/lustre-reviews_cu
[ 1479.340152] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ========================================================== 23:02:55 \(1671145375\)
[ 1479.702251] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ========================================================== 23:02:55 (1671145375)
[ 1787.824195] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test complete, duration 1466 sec ========================================================== 23:08:04 \(1671145684\)
[ 1788.199820] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test complete, duration 1466 sec ========================================================== 23:08:04 (1671145684)
[ 1789.526631] Lustre: DEBUG MARKER: systemctl stop nfs-server
[ 1789.944393] nfsd: last server has exited, flushing export cache
[ 1790.227067] Lustre: DEBUG MARKER: sed -i '\|^/mnt/lustre|d' /etc/exports
[ 1790.872269] Lustre: DEBUG MARKER: exportfs -v
[ 1791.514326] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
[ 1792.126821] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[ 1792.941836] Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
[ 5153.799694] sysrq: SysRq : Changing Loglevel
We can disable NLM locks on the server side and repeat the same tests to confirm this guess. What do you think?
lock / nolock Selects whether to use the NLM sideband protocol to lock files
on the server. If neither option is specified (or if lock is
specified), NLM locking is used for this mount point. When
using the nolock option, applications can lock files, but such
locks provide exclusion only against other applications running
on the same client. Remote applications are not affected by
these locks.
Thank you! |