[LU-14046] lov tgt 0 not cleaned! deathrow=0, lovrc=1 Created: 19/Oct/20  Updated: 23/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.5
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Stephane Thiell Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: None
Environment:

2.12.5 servers + 2.13 clients, CentOS 7.6


Attachments: Text File sh02-ln03.client.dk.log    
Rank (Obsolete): 9223372036854775807

 Description   

Today we upgraded Oak servers from 2.10.8 to 2.12.5, and now we have ~50 clients (2.13) out of ~1,500 that cannot mount Oak at all after reboot. Example with client 10.50.0.63@o2ib2:

Oct 19 13:31:26 sh02-ln03.stanford.edu kernel: LustreError: 94181:0:(lov_obd.c:828:lov_cleanup()) oak-clilov-ffffa0d562f8a800: lov tgt 0 not cleaned! deathrow=0, lovrc=1
Oct 19 13:31:26 sh02-ln03.stanford.edu kernel: LustreError: 94181:0:(lov_obd.c:828:lov_cleanup()) Skipped 291 previous similar messages
Oct 19 13:31:27 sh02-ln03.stanford.edu kernel: Lustre: Unmounted oak-client
Oct 19 13:31:27 sh02-ln03.stanford.edu kernel: LustreError: 94181:0:(obd_mount.c:1669:lustre_fill_super()) Unable to mount  (-5) 
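For reference, the attached client trace (sh02-ln03.client.dk.log) was captured with Lustre's kernel debug buffer. A minimal sketch of how a similar trace can be gathered on an affected client, run as root (the MGS NID and mount point below are taken from the logs in this ticket; the debug mask is just an example):

```
# Widen the debug mask and buffer before retrying the mount
lctl set_param debug=-1
lctl set_param debug_mb=512

# Retry the failing mount to reproduce the -5 (EIO) error
mount -t lustre 10.0.2.51@o2ib5:/oak /oak || echo "mount failed"

# Dump the kernel debug buffer to a file for attachment
lctl dk > /tmp/client.dk.log
```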

 

On the MGS side, I can see this:

/tmp/dk:00010000:02000400:7.0:1603137393.190601:0:7903:0:(ldlm_lib.c:1151:target_handle_connect()) MGS: Received new LWP connection from 10.50.0.63@o2ib2, removing former export from same NID
/tmp/dk:00010000:00080000:7.0:1603137393.190602:0:7903:0:(ldlm_lib.c:1227:target_handle_connect()) MGS: connection from f3832037-ce6f-4@10.50.0.63@o2ib2 t0 exp ffff88f2a4e59c00 cur 12765 last 1603137393 

2.10 servers with 2.13 clients worked fine; the problem only appears with 2.12 servers and 2.13 clients.

Please advise. Is it the same as in LU-13719?

Thanks!

Stephane



 Comments   
Comment by Stephane Thiell [ 19/Oct/20 ]

Sorry, I didn't mean to open this as an improvement; it's a bug we would like to report, with clients unable to mount the filesystem. Please let me know what you think. Thanks!

Comment by Stephane Thiell [ 19/Oct/20 ]

After reviewing the attached client logs (sh02-ln03.client.dk.log) again, it looks like this could be due to something else.

On the MGS/MDS, we can see endless disconnections:

Oct 19 13:06:17 oak-md1-s1 kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 19 13:06:17 oak-md1-s1 kernel: LustreError: Skipped 1 previous similar message
Oct 19 13:06:17 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1603137677, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock: ffff88f240925c40/0xe88ce0ce9dbc849f lrc: 4/1,0 mode: 
Oct 19 13:06:17 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Oct 19 13:09:26 oak-md1-s1 kernel: Lustre: MGS: haven't heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88f2a8b2b400, cur 1603138166 expire 1603138016 last 1603137939
Oct 19 13:11:17 oak-md1-s1 kernel: LustreError: 29618:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x736d61726170:0x3:0x0].0x0 (ffff88f1ef5e66c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 19 13:11:17 oak-md1-s1 kernel: LustreError: 7862:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Oct 19 13:12:47 oak-md1-s1 kernel: Lustre: MGS: Connection restored to 0a2acb4f-5e35-e84f-7137-12310b3b17d8 (at 10.12.4.25@o2ib)
Oct 19 13:12:47 oak-md1-s1 kernel: Lustre: Skipped 3213 previous similar messages
Oct 19 13:14:26 oak-md1-s1 kernel: Lustre: MGS: haven't heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88e2baae9400, cur 1603138466 expire 1603138316 last 1603138239
Oct 19 13:15:22 oak-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.2.102@o2ib5, removing former export from same NID
Oct 19 13:15:22 oak-md1-s1 kernel: Lustre: Skipped 3237 previous similar messages
Oct 19 13:16:17 oak-md1-s1 kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 19 13:16:17 oak-md1-s1 kernel: LustreError: Skipped 1 previous similar message
Oct 19 13:16:17 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1603138277, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock: ffff88e04ef26e40/0xe88ce0ce9f2b45ae lrc: 4/1,0 mode: 
Oct 19 13:16:17 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Oct 19 13:19:35 oak-md1-s1 kernel: Lustre: MGS: haven't heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88f2a7f22c00, cur 1603138775 expire 1603138625 last 1603138548
Oct 19 13:21:18 oak-md1-s1 kernel: LustreError: 29746:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x736d61726170:0x3:0x0].0x0 (ffff88e02528f680) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 19 13:21:18 oak-md1-s1 kernel: LustreError: 7862:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Oct 19 13:21:18 oak-md1-s1 kernel: LustreError: 29746:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Oct 19 13:22:47 oak-md1-s1 kernel: Lustre: MGS: Connection restored to bf3e962b-e521-22c4-b7d4-b2c82f971648 (at 10.12.4.86@o2ib)
Oct 19 13:22:47 oak-md1-s1 kernel: Lustre: Skipped 3211 previous similar messages
Oct 19 13:24:35 oak-md1-s1 kernel: Lustre: MGS: haven't heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88f2aaa09c00, cur 1603139075 expire 1603138925 last 1603138848
Oct 19 13:25:22 oak-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.2.102@o2ib5, removing former export from same NID
Oct 19 13:25:22 oak-md1-s1 kernel: Lustre: Skipped 3253 previous similar messages
Oct 19 13:26:26 oak-md1-s1 kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 19 13:26:26 oak-md1-s1 kernel: LustreError: Skipped 1 previous similar message
Oct 19 13:26:26 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1603138886, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock: ffff88e1ad4618c0/0xe88ce0cea0bd8bfc lrc: 4/1,0 mode: 
Oct 19 13:26:27 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Oct 19 13:29:41 oak-md1-s1 kernel: Lustre: MGS: haven't heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88f2a6e94400, cur 1603139381 expire 1603139231 last 1603139154
Oct 19 13:30:21 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client c50f4c48-a8d0-2d5f-ff90-7efef0b098e9 (at 10.210.9.195@tcp1) reconnecting
Oct 19 13:30:21 oak-md1-s1 kernel: Lustre: Skipped 21 previous similar messages
Oct 19 13:31:27 oak-md1-s1 kernel: LustreError: 29879:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x736d61726170:0x3:0x0].0x0 (ffff88dfc0b3fc80) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 19 13:31:27 oak-md1-s1 kernel: LustreError: 7862:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Oct 19 13:32:47 oak-md1-s1 kernel: Lustre: MGS: Connection restored to e3ecfc0a-db4b-4 (at 10.50.10.3@o2ib2)
Oct 19 13:32:47 oak-md1-s1 kernel: Lustre: Skipped 3241 previous similar messages
Oct 19 13:34:41 oak-md1-s1 kernel: Lustre: MGS: haven't heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88f2aaadf400, cur 1603139681 expire 1603139531 last 1603139454
Oct 19 13:35:22 oak-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.2.102@o2ib5, removing former export from same NID
Oct 19 13:35:22 oak-md1-s1 kernel: Lustre: Skipped 3238 previous similar messages
Oct 19 13:36:27 oak-md1-s1 kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 19 13:36:27 oak-md1-s1 kernel: LustreError: Skipped 1 previous similar message
Oct 19 13:36:27 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1603139487, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock: ffff88e15230e540/0xe88ce0cea2ae56e9 lrc: 4/1,0 mode: 
Oct 19 13:36:27 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Oct 19 13:39:46 oak-md1-s1 kernel: Lustre: MGS: haven't heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88f2a4c9b800, cur 1603139986 expire 1603139836 last 1603139759
Oct 19 13:41:27 oak-md1-s1 kernel: LustreError: 29993:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x736d61726170:0x3:0x0].0x0 (ffff88e130e895c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 19 13:41:27 oak-md1-s1 kernel: LustreError: 7862:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Oct 19 13:41:27 oak-md1-s1 kernel: LustreError: 29993:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Oct 19 13:42:48 oak-md1-s1 kernel: Lustre: MGS: Connection restored to 4151aadc-4857-e0bc-f1e5-8c97714919e5 (at 10.210.12.107@tcp1)
Oct 19 13:42:48 oak-md1-s1 kernel: Lustre: Skipped 3233 previous similar messages
Oct 19 13:45:22 oak-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.2.102@o2ib5, removing former export from same NID
Oct 19 13:45:22 oak-md1-s1 kernel: Lustre: Skipped 3203 previous similar messages 
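The same client (38b0ac60-6c23-4 at 10.49.27.12@o2ib1) is evicted roughly every five minutes, matching the 300 s MGC lock enqueue timeout. A quick triage sketch to confirm the repetition from a saved copy of the console log (the log path is hypothetical; adjust to where the messages are kept):

```shell
# Triage sketch for the MGS console log (path is hypothetical; adjust)
log=/var/log/messages

# How many times was each client UUID declared dead and evicted?
grep -o "haven't heard from client [^ ]*" "$log" 2>/dev/null | sort | uniq -c | sort -rn

# Epoch timestamps at which the MGC lock enqueue timed out (300s each)
grep 'lock timed out' "$log" 2>/dev/null | grep -o 'enqueued at [0-9]*' | awk '{print $3}'
```

In the excerpt above this would show the same UUID evicted repeatedly while the lock-timeout timestamps advance by roughly 600 s per requeue cycle.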

Do you think this issue is related to LU-13667 for which a patch has landed in b2_12?

Comment by Peter Jones [ 20/Oct/20 ]

Hongchao

Does this seem to be a duplicate of LU-13667?

Peter

Comment by Stephane Thiell [ 20/Oct/20 ]

We restarted the MGS, which had started to take quite a high load, and it crashed when we tried to stop it. This already used to happen with 2.10, and the bug is still present in 2.12.5. We have applied Hongchao's patch from LU-13667 ("ptlrpc: fix endless loop issue") and restarted the MGS/MDS. After that, our 2.13 clients could mount the filesystem again, and we haven't seen lock timeout issues on the MGS even after failing over more OSTs.
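After such a restart, the health of the client-side MGC connection can be spot-checked from any client; a sketch using the standard `lctl` parameter tree (output abbreviated, exact fields vary by version):

```
# On a client: the MGC import to the MGS should report state FULL
lctl get_param mgc.*.import | grep -E 'state|connection'
```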

Generated at Sat Feb 10 03:06:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.