[LU-2324] lock collide during recovery Created: 14/Nov/12  Updated: 07/Jun/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Alex Zhuravlev
Resolution: Unresolved Votes: 0
Labels: llnl

Attachments: Text File oss_grove403_console.txt    
Severity: 3
Rank (Obsolete): 5555

 Description   

After reboot of an OSS, we see the following on the console:

2012-11-12 16:25:54 Lustre: Lustre: Build Version: 2.3.54-4chaos-3surya1-3surya1--PRISTINE-2.6.32-220.23.1.2chaos.ch5.x86_64
2012-11-12 16:25:55 LustreError: 6447:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
2012-11-12 16:25:55 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:25:55 Lustre: lstest-OST0193: Will be in recovery for at least 5:00, or until 275 clients reconnect.
2012-11-12 16:25:56 LustreError: 6528:0:(ldlm_lockd.c:824:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: filter-ffff8807fef4c000 lock: ffff880ff005dcc0/0xbdf5847332a7d090 lrc: 3/0,0 mode: PW/PW res: 10773567/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80010020 nid: 172.20.4.149@o2ib500 remote: 0x9c0890bf799c59d0 expref: 4 pid: 6531 timeout 0
2012-11-12 16:25:56 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:02 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:06 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:20 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:20 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:26:31 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:45 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:26:51 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:27:01 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:27:01 LustreError: Skipped 2 previous similar messages
2012-11-12 16:27:10 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:27:35 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:27:51 Lustre: lstest-OST0193: Client 7877635e-a0c6-353f-51e9-47e6f0ef5fb2 (at 172.20.17.2@o2ib500) reconnecting, waiting for 275 clients in recovery for 3:04
2012-11-12 16:27:51 Lustre: lstest-OST0193: Client 7877635e-a0c6-353f-51e9-47e6f0ef5fb2 (at 172.20.17.2@o2ib500) refused reconnection, still busy with 1 active RPCs
2012-11-12 16:27:54 Lustre: lstest-OST0193: Client df4b9103-f4bf-8082-3f31-a1512a4dda76 (at 172.20.17.7@o2ib500) reconnecting, waiting for 275 clients in recovery for 3:01
2012-11-12 16:27:54 Lustre: lstest-OST0193: Client df4b9103-f4bf-8082-3f31-a1512a4dda76 (at 172.20.17.7@o2ib500) refused reconnection, still busy with 1 active RPCs
2012-11-12 16:27:57 Lustre: lstest-OST0193: Client 4028a636-dc0d-66a7-557b-f4d960ae30a7 (at 172.20.17.9@o2ib500) reconnecting, waiting for 275 clients in recovery for 2:58
2012-11-12 16:27:57 Lustre: lstest-OST0193: Client 4028a636-dc0d-66a7-557b-f4d960ae30a7 (at 172.20.17.9@o2ib500) refused reconnection, still busy with 1 active RPCs
2012-11-12 16:27:58 Lustre: lstest-OST0193: Client 6b5359f5-fbbd-23ea-2c3e-9f96a635e074 (at 172.20.17.12@o2ib500) reconnecting, waiting for 275 clients in recovery for 2:57
2012-11-12 16:27:58 Lustre: lstest-OST0193: Client 6b5359f5-fbbd-23ea-2c3e-9f96a635e074 (at 172.20.17.12@o2ib500) refused reconnection, still busy with 1 active RPCs
2012-11-12 16:28:00 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11

It actually goes on for a while, with recovery not going well at all. See attached oss_grove403_console.txt file with more console log output.



 Comments   
Comment by Peter Jones [ 15/Nov/12 ]

Alex will triage this one

Comment by Andreas Dilger [ 17/Nov/12 ]

This is something that should never happen. It means that two clients were granted conflicting locks for some reason, or at least that is what they believe.

The error message itself could probably be more informative, if it printed the lock and NID info for both clients. Even with that it may be difficult to trace back when the clients were granted those locks.

Comment by Carlos Thomaz [ 04/Apr/13 ]

Peter, Andreas.

Just to let you guys know that we are seeing this in one of the customer running 2.1.3.

The problem happens during the MDS failover. Snipped logs below.

[root@mds1 ~]# grep BUG /var/log/kern
Apr 2 14:43:43 mds1 kernel: : LustreError: 34818:0:(ldlm_lockd.c:764:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-ffff88082ec06000 lock: ffff880812ab2000/0xe7058f74dda5e62d lrc: 3/0,0 mode: PR/PR res: 8589935616/242 bits 0x3 rrc: 5 type: IBT flags: 0x20 remote: 0x30520774866dab1 expref: 15 pid: 34501 timeout: 0
Apr 2 14:43:45 mds1 kernel: : LustreError: 34819:0:(ldlm_lockd.c:764:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-ffff88082ec06000 lock: ffff8804091d1240/0xe7058f74dda63ed3 lrc: 3/0,0 mode: PR/PR res: 8589935616/242 bits 0x3 rrc: 22 type: IBT flags: 0x20 remote: 0x46b88df0e87b582 expref: 16 pid: 34871 timeout: 0
Apr 2 14:43:46 mds1 kernel: : LustreError: 34564:0:(ldlm_lockd.c:764:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-ffff88082ec06000 lock: ffff880810bc3b40/0xe7058f74dda65d3b lrc: 3/0,0 mode: PR/PR res: 8589935616/242 bits 0x3 rrc: 30 type: IBT flags: 0x20 remote: 0xbdf62edad09fbeec expref: 16 pid: 34868 timeout: 0
Apr 2 14:43:49 mds1 kernel: : LustreError: 34888:0:(ldlm_lockd.c:764:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-ffff88082ec06000 lock: ffff88042d826240/0xe7058f74dda66e60 lrc: 3/0,0 mode: PR/PR res: 8589935616/242 bits 0x3 rrc: 34 type: IBT flags: 0x20 remote: 0x18687bbe17443d36 expref: 16 pid: 34886 timeout: 0
Apr 2 14:43:55 mds1 kernel: : LustreError: 34881:0:(ldlm_lockd.c:764:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-ffff88082ec06000 lock: ffff8803dc2fe000/0xe7058f74dda7c942 lrc: 3/0,0 mode: PR/PR res: 8589935616/242 bits 0x3 rrc: 47 type: IBT flags: 0x20 remote: 0xc4dda54c93b299a0 expref: 16 pid: 34883 timeout: 0
Apr 2 14:43:59 mds1 kernel: : LustreError: 34875:0:(ldlm_lockd.c:764:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-ffff88082ec06000 lock: ffff8803d8614900/0xe7058f74dda8ec16 lrc: 3/0,0 mode: PR/PR res: 8589935616/242 bits 0x3 rrc: 53 type: IBT flags: 0x20 remote: 0x58b3a1b076ddc751 expref: 87 pid: 34887 timeout: 0
Apr 2 14:46:41 mds1 kernel: : LustreError: 34816:0:(ldlm_lockd.c:764:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-ffff88082ec06000 lock: ffff8803d1b65d80/0xe7058f74ddad2710 lrc: 3/0,0 mode: PR/PR res: 8589935616/242 bits 0x3 rrc: 18 type: IBT flags: 0x20 remote: 0x42794f9b33b3b48c expref: 16 pid: 34821 timeout: 0
Apr 3 13:54:43 mds1 kernel: : LustreError: 13214:0:(ldlm_lockd.c:764:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-ffff88081cbae000 lock: ffff880816574b40/0x841070928c4a7027 lrc: 3/0,0 mode: PR/PR res: 8589935766/95870 bits 0x3 rrc: 4 type: IBT flags: 0x20 remote: 0x7408621e86517011 expref: 16 pid: 13212 timeout: 0
[root@mds1 ~]# grep "not in open replay" /var/log/kern
Apr 2 14:43:38 mds1 kernel: : LustreError: 34500:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff880813681400 x1430864677861699/t0(4295286337) o101->ac0b2bee-e562-a70a-da92-9128aaf7132b@192.168.8.14@o2ib:0/0 lens 1480/3664 e 0 to 0 dl 1364935460 ref 1 fl Interpret:/4/0 rc 0/0
Apr 2 14:43:39 mds1 kernel: : LustreError: 34793:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff880812cd7000 x1430864784744294/t0(4297456459) o101->cb3d1005-d1e7-13ec-b7c1-aba83a0cc86f@192.168.9.6@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1364935461 ref 1 fl Interpret:/4/0 rc 0/0
Apr 2 14:43:40 mds1 kernel: : LustreError: 34500:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803fe3e2850 x1430864694892943/t0(4295286487) o101->b178f030-060a-2357-ef23-6e7eda2d0bb4@192.168.8.4@o2ib:0/0 lens 1480/3664 e 0 to 0 dl 1364935462 ref 1 fl Interpret:/4/0 rc 0/0
Apr 2 14:43:41 mds1 kernel: : LustreError: 34559:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803fdd0a850 x1430864694892947/t0(4295286509) o101->b178f030-060a-2357-ef23-6e7eda2d0bb4@192.168.8.4@o2ib:0/0 lens 1480/3664 e 0 to 0 dl 1364935463 ref 1 fl Interpret:/4/0 rc 0/0
Apr 2 14:43:43 mds1 kernel: : LustreError: 34500:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8808123b3c00 x1430864861868148/t0(4297227644) o101->8f207ecd-2ea3-ee51-eab1-8ab4e753a0c6@192.168.10.51@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1364935465 ref 1 fl Interpret:/4/0 rc 0/0
Apr 2 14:43:47 mds1 kernel: : LustreError: 34883:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff88081016a800 x1430864665003404/t0(4295010119) o101->619ceaf3-1675-49dd-68f9-3f5c6e59114c@192.168.7.64@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1364935469 ref 1 fl Interpret:/4/0 rc 0/0
Apr 2 14:43:55 mds1 kernel: : LustreError: 34883:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff88080e3b5000 x1430864673481703/t0(4295454205) o101->3ee45883-d858-17c9-1431-1c32021e9aea@192.168.7.43@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1364935477 ref 1 fl Interpret:/4/0 rc 0/0
Apr 2 14:44:11 mds1 kernel: : LustreError: 34869:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803fdcf6050 x1430864760313778/t0(4297459031) o101->6521c9e9-c1fb-275e-55af-a7ee85ba9740@192.168.9.10@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1364935493 ref 1 fl Interpret:/4/0 rc 0/0
Apr 2 14:46:40 mds1 kernel: : LustreError: 34816:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff88081181f000 x1430815080814568/t0(4294990598) o101->13550c65-a643-fba4-b82b-67221e0ef79e@192.168.1.37@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1364935642 ref 1 fl Interpret:/4/0 rc 0/0
Apr 3 13:54:25 mds1 kernel: : LustreError: 12912:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803dc465c00 x1430864834718601/t0(4297227760) o101->54d5562b-c5a1-cb65-4ab2-db7a1112be3b@192.168.10.7@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1365018907 ref 1 fl Interpret:/4/0 rc 0/0
Apr 3 13:54:25 mds1 kernel: : LustreError: 12995:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803e298b850 x1430864842616380/t0(4297227808) o101->8df03b02-8d6e-147d-d1c4-cbd8a466226d@192.168.10.8@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1365018907 ref 1 fl Interpret:/4/0 rc 0/0
Apr 3 13:54:26 mds1 kernel: : LustreError: 12994:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803e2a86850 x1430864843963546/t0(4297227840) o101->e0902939-c60b-b840-7e58-5b8b4a0b3ebb@192.168.10.25@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1365018908 ref 1 fl Interpret:/4/0 rc 0/0
Apr 3 13:54:27 mds1 kernel: : LustreError: 12995:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803dc12b400 x1430864861868148/t0(4297227644) o101->8f207ecd-2ea3-ee51-eab1-8ab4e753a0c6@192.168.10.51@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1365018909 ref 1 fl Interpret:/4/0 rc 0/0
Apr 3 13:54:29 mds1 kernel: : LustreError: 12995:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803dc22c000 x1430864838895860/t0(4297227615) o101->2d5af97f-d22f-03ba-c14e-c323eb1dd21b@192.168.10.34@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1365018911 ref 1 fl Interpret:/4/0 rc 0/0
Apr 3 13:54:35 mds1 kernel: : LustreError: 12995:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803dc3bc400 x1430864890918796/t0(4297227765) o101->16f1c50d-daae-9554-2ee0-3e90e9aab9d9@192.168.10.36@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1365018917 ref 1 fl Interpret:/4/0 rc 0/0
Apr 3 13:54:43 mds1 kernel: : LustreError: 13214:0:(mdt_open.c:1258:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff8803dbfb8400 x1430864821697515/t0(4297227679) o101->ce738d29-2974-7eb2-c2e9-fef9d5bf5076@192.168.10.19@o2ib:0/0 lens 1024/3664 e 0 to 0 dl 1365018925 ref 1 fl Interpret:/4/0 rc 0/0

Apparently we are also being hit by LU-1353.

I appreciate if you guys could keep us updated on this.

Thank you,
Carlos

Generated at Sat Feb 10 01:24:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.