  Lustre / LU-13500

Client gets evicted - nfsd non-standard errno -108

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      The client is getting evicted by the MDT as soon as the nfsd service is started on the client.

      client (golf1) kernel version : 3.10.0-1062.1.1.el7_lustre.x86_64
      client (golf1) lustre version : lustre-2.12.3-1.el7.x86_64

      mds (gmds1) kernel version : 3.10.0-1062.1.1.el7_lustre.x86_64
      mds (gmds1) lustre version : lustre-2.12.3-1.el7.x86_64

      oss (goss1-goss6) kernel version : 3.10.0-1062.1.1.el7_lustre.x86_64
      oss (goss1-goss6) lustre version : lustre-2.12.3-1.el7.x86_64

      /etc/exports on golf1 :

      /user_data       10.25.0.0/16(fsid=123456789,rw,anonuid=0,insecure,no_subtree_check,insecure_locks,async)
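
      For reference, re-applying and verifying this export after editing /etc/exports typically looks like the following sketch (standard nfs-utils commands; not output from this system):

      # Re-read /etc/exports and re-export everything
      exportfs -ra
      # List the active exports with their effective options
      exportfs -v
      # Confirm the export is advertised to clients
      showmount -e localhost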
      
      Apr 30 14:03:07 golf1 kernel: LustreError: 11-0: golf-MDT0000-mdc-ffff973bf6409800: operation ldlm_enqueue to node 10.25.22.90@tcp failed: rc = -107
      Apr 30 14:03:07 golf1 kernel: Lustre: golf-MDT0000-mdc-ffff973bf6409800: Connection to golf-MDT0000 (at 10.25.22.90@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Apr 30 14:03:07 golf1 kernel: LustreError: Skipped 8 previous similar messages
      Apr 30 14:03:07 golf1 kernel: LustreError: 167-0: golf-MDT0000-mdc-ffff973bf6409800: This client was evicted by golf-MDT0000; in progress operations using this service will fail.
      Apr 30 14:03:07 golf1 kernel: LustreError: 25491:0:(file.c:4339:ll_inode_revalidate_fini()) golf: revalidate FID [0x20004884e:0x16:0x0] error: rc = -5
      Apr 30 14:03:07 golf1 kernel: ------------[ cut here ]------------
      Apr 30 14:03:07 golf1 kernel: WARNING: CPU: 26 PID: 25600 at fs/nfsd/nfsproc.c:805 nfserrno+0x58/0x70 [nfsd]
      Apr 30 14:03:07 golf1 kernel: LustreError: 25579:0:(file.c:216:ll_close_inode_openhandle()) golf-clilmv-ffff973bf6409800: inode [0x20004884e:0x15:0x0] mdc close failed: rc = -108
      Apr 30 14:03:07 golf1 kernel: ------------[ cut here ]------------
      Apr 30 14:03:07 golf1 kernel: ------------[ cut here ]------------
      Apr 30 14:03:07 golf1 kernel: nfsd: non-standard errno: -108
      Apr 30 14:03:07 golf1 kernel: ------------[ cut here ]------------
      Apr 30 14:03:07 golf1 kernel: WARNING: CPU: 54 PID: 25602 at fs/nfsd/nfsproc.c:805 nfserrno+0x58/0x70 [nfsd]
      Apr 30 14:03:07 golf1 kernel: LustreError: 25579:0:(file.c:216:ll_close_inode_openhandle()) Skipped 2 previous similar messages
      Apr 30 14:03:07 golf1 kernel: WARNING: CPU: 24 PID: 25601 at fs/nfsd/nfsproc.c:805 nfserrno+0x58/0x70 [nfsd]
      Apr 30 14:03:07 golf1 kernel: WARNING: CPU: 9 PID: 25505 at fs/nfsd/nfsproc.c:805 nfserrno+0x58/0x70 [nfsd]
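
      For reference, errno 108 on Linux is ESHUTDOWN ("Cannot send after transport endpoint shutdown"), which nfsd has no standard NFS status mapping for, hence the "non-standard errno" warnings above. A quick way to confirm the mapping (errno(1) from moreutils, if installed):

      errno 108   # prints something like: ESHUTDOWN 108 Cannot send after transport endpoint shutdown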
      

      Attachments

        1. hmds1-log-write.png
          hmds1-log-write.png
          7 kB
        2. hmds1-timezone.png
          hmds1-timezone.png
          22 kB
        3. hotel1-logs-write.png
          hotel1-logs-write.png
          9 kB
        4. hotel1-timezone.png
          hotel1-timezone.png
          22 kB
        5. vmcore-dmesg-2021-05-11.txt
          156 kB
        6. vmcore-dmesg-2021-06-17.txt
          150 kB


          Activity

            [LU-13500] Client gets evicted - nfsd non-standard errno -108
            green Oleg Drokin added a comment -

            So I wonder if reducing the number of MDT locks would at least help you temporarily, since the problem is clearly related to parallel cancels.

            lctl set_param ldlm.namespaces.*MDT*mdc*.lru_size=100
            

            Can you please run this on your NFS export node(s)? The setting is not permanent and will reset on node reboot or Lustre client remount.
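
            For reference, a minimal sketch of checking current lock usage and applying that cap on an NFS export node (the lru_size path is as given above; lock_count is assumed to be the matching read-only counter):

            # How many MDT DLM locks the client currently caches
            lctl get_param ldlm.namespaces.*MDT*mdc*.lock_count
            # Current LRU setting (0 usually means the LRU is sized dynamically)
            lctl get_param ldlm.namespaces.*MDT*mdc*.lru_size
            # Cap the LRU at 100 locks; not persistent, resets on reboot/remount
            lctl set_param ldlm.namespaces.*MDT*mdc*.lru_size=100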

            green Oleg Drokin added a comment -

            Sorry that I have no immediate answers to this; I am still thinking about the whole thing.


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hello Team,

            It would be really good if you could provide a solution for this at the earliest, as it is affecting the production backup and putting us in a panic situation. Requesting you to check this on priority, please.


            cmcl Campbell Mcleay (Inactive) added a comment -

            Any findings here, please?


            cmcl Campbell Mcleay (Inactive) added a comment - edited

            After downgrading back to 2.12.4, the Lustre filesystem was hanging on the client (the other client we have was fine during this time). It would work intermittently, but mostly not. Is this due to lock recovery? I tried mounting with 'abort_recov', but saw the same issue. It eventually resolved itself, however.

            There's still a large number of stack traces in the MDS log, e.g.:

            Nov 16 02:58:40 hmds1 kernel: LNet: Service thread pid 16267 was inactive for 278.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Nov 16 02:58:40 hmds1 kernel: LNet: Skipped 4 previous similar messages
            Nov 16 02:58:40 hmds1 kernel: Pid: 16267, comm: mdt01_051 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
            Nov 16 02:58:40 hmds1 kernel: Call Trace:
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc12b8070>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc12ba0a1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc15a817b>] mdt_rename_lock+0x24b/0x4b0 [mdt]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc15aa350>] mdt_reint_rename+0x2c0/0x2900 [mdt]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc15b31b3>] mdt_reint_rec+0x83/0x210 [mdt]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc158f383>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc159b0f7>] mdt_reint+0x67/0x140 [mdt]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc1356e8a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc12fb83b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffc12ff1a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            Nov 16 02:58:40 hmds1 kernel: [<ffffffff8dac6321>] kthread+0xd1/0xe0
            Nov 16 02:58:40 hmds1 kernel: [<ffffffff8e18ed37>] ret_from_fork_nospec_end+0x0/0x39
            Nov 16 02:58:40 hmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
            Nov 16 02:58:40 hmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1605475720.16267
            

            We've uploaded a log (5.1 GB) from the MDS.
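
            Regarding the 'abort_recov' attempt above, a hypothetical client mount invocation with that option would look like this (the MGS NID and filesystem name are placeholders, not values from this system):

            # Remount the client, aborting recovery instead of waiting for it
            mount -t lustre -o abort_recov <mgs-nid>@tcp:/<fsname> /mnt/<fsname>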


            cmcl Campbell Mcleay (Inactive) added a comment - edited

            NFS exports stopped working, and we couldn't see anything in the client log. A restart of nfs did not fix the issue, so we had to downgrade back to 2.12.4.
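
            For reference, the kind of NFS restart and checks involved here, sketched for a systemd-managed EL7 server (service and tool names are assumptions about this setup):

            systemctl restart nfs-server
            # Verify the exports are still being served
            exportfs -v
            # Verify nfsd is registered with the portmapper
            rpcinfo -p localhost | grep -w nfs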


            cmcl Campbell Mcleay (Inactive) added a comment -

            Oleg,

            We have applied the provided patch, but unfortunately it does not seem to be helping much to reduce the evictions. Your thoughts on this, please.

            green Oleg Drokin added a comment -

            I think it makes sense to try the patch now; if it helps, great.

            Whether any additional patches will be needed is not yet clear; if they are, and reproducing the condition requires reverting a previous patch, that can be decided later.


            cmcl Campbell Mcleay (Inactive) added a comment -

            Looks like we are seeing some more things here to rectify. Should we try this patch now, or should we wait and dig more into the ongoing case?

            green Oleg Drokin added a comment -

            Thank you. This is a good log, but it's very big to sift through all of it quickly.

            Meanwhile, some suspicions I had from the previous logs seem to be confirmed: the "proactive LRU lock cancelling" we do when preparing to send a cancel anyway seems to be a bit overzealous and does some pretty heavy processing.

            I am going to post a patch that should help here while I dig into some other suspicious things I also see: https://review.whamcloud.com/#/c/40603/
            You need to apply this patch on the clients (at least the NFS-exporting ones), or just use our build from the builders: https://build.whamcloud.com/job/lustre-reviews/77636/ (select the suitable arch and distro type in the long list).


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40603
            Subject: LU-13500 ldlm: Do not LRU-cancel "expensive" locks for in bl-ast
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 3b5eb8eb50885dd690307d4ba4178ee55a5bf970
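
            For reference, a minimal sketch of pulling that patch set into a local b2_12 tree (the clone URL is the commonly published one for the Lustre tree; the refs/changes path follows the usual Gerrit convention for change 40603, patch set 1):

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git checkout b2_12
            # Fetch patch set 1 of change 40603 and apply it on top of b2_12
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/03/40603/1
            git cherry-pick FETCH_HEAD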


            People

              Assignee: green Oleg Drokin
              Reporter: cmcl Campbell Mcleay (Inactive)
              Votes: 0
              Watchers: 8
