[LU-2324] lock collide during recovery Created: 14/Nov/12 Updated: 07/Jun/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Alex Zhuravlev |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5555 |
| Description |
|
After reboot of an OSS, we see the following on the console:
2012-11-12 16:25:54 Lustre: Lustre: Build Version: 2.3.54-4chaos-3surya1-3surya1--PRISTINE-2.6.32-220.23.1.2chaos.ch5.x86_64
2012-11-12 16:25:55 LustreError: 6447:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
2012-11-12 16:25:55 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:25:55 Lustre: lstest-OST0193: Will be in recovery for at least 5:00, or until 275 clients reconnect.
2012-11-12 16:25:56 LustreError: 6528:0:(ldlm_lockd.c:824:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: filter-ffff8807fef4c000 lock: ffff880ff005dcc0/0xbdf5847332a7d090 lrc: 3/0,0 mode: PW/PW res: 10773567/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80010020 nid: 172.20.4.149@o2ib500 remote: 0x9c0890bf799c59d0 expref: 4 pid: 6531 timeout 0
2012-11-12 16:25:56 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:02 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:06 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:20 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:20 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:26:31 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:26:45 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:26:51 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:27:01 LustreError: 137-5: UUID 'lstest-OST0194_UUID' is not available for connect (no target)
2012-11-12 16:27:01 LustreError: Skipped 2 previous similar messages
2012-11-12 16:27:10 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:27:35 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
2012-11-12 16:27:51 Lustre: lstest-OST0193: Client 7877635e-a0c6-353f-51e9-47e6f0ef5fb2 (at 172.20.17.2@o2ib500) reconnecting, waiting for 275 clients in recovery for 3:04
2012-11-12 16:27:51 Lustre: lstest-OST0193: Client 7877635e-a0c6-353f-51e9-47e6f0ef5fb2 (at 172.20.17.2@o2ib500) refused reconnection, still busy with 1 active RPCs
2012-11-12 16:27:54 Lustre: lstest-OST0193: Client df4b9103-f4bf-8082-3f31-a1512a4dda76 (at 172.20.17.7@o2ib500) reconnecting, waiting for 275 clients in recovery for 3:01
2012-11-12 16:27:54 Lustre: lstest-OST0193: Client df4b9103-f4bf-8082-3f31-a1512a4dda76 (at 172.20.17.7@o2ib500) refused reconnection, still busy with 1 active RPCs
2012-11-12 16:27:57 Lustre: lstest-OST0193: Client 4028a636-dc0d-66a7-557b-f4d960ae30a7 (at 172.20.17.9@o2ib500) reconnecting, waiting for 275 clients in recovery for 2:58
2012-11-12 16:27:57 Lustre: lstest-OST0193: Client 4028a636-dc0d-66a7-557b-f4d960ae30a7 (at 172.20.17.9@o2ib500) refused reconnection, still busy with 1 active RPCs
2012-11-12 16:27:58 Lustre: lstest-OST0193: Client 6b5359f5-fbbd-23ea-2c3e-9f96a635e074 (at 172.20.17.12@o2ib500) reconnecting, waiting for 275 clients in recovery for 2:57
2012-11-12 16:27:58 Lustre: lstest-OST0193: Client 6b5359f5-fbbd-23ea-2c3e-9f96a635e074 (at 172.20.17.12@o2ib500) refused reconnection, still busy with 1 active RPCs
2012-11-12 16:28:00 LustreError: 11-0: lstest-MDT0000-osp-OST0193: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
It actually goes on for a while, with recovery not going well at all.
See attached oss_grove403_console.txt file with more console log output. |
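For triage, it can help to pull the resource, extent, and client NID out of each "lock collide during recovery" line in a long console log. The following is only a minimal sketch: the regular expression is fitted to the single sample line quoted above, and the function name `parse_lock_collision` is hypothetical, not part of any Lustre tool. Other Lustre versions may print the lock dump with different fields.

```python
import re

# Sample line copied from the console log above (timestamp prefix dropped).
SAMPLE = (
    "LustreError: 6528:0:(ldlm_lockd.c:824:ldlm_server_blocking_ast()) "
    "### BUG 6063: lock collide during recovery ns: filter-ffff8807fef4c000 "
    "lock: ffff880ff005dcc0/0xbdf5847332a7d090 lrc: 3/0,0 mode: PW/PW "
    "res: 10773567/0 rrc: 2 type: EXT [0->18446744073709551615] "
    "(req 0->18446744073709551615) flags: 0x80010020 "
    "nid: 172.20.4.149@o2ib500 remote: 0x9c0890bf799c59d0 expref: 4 "
    "pid: 6531 timeout 0"
)

# Pattern fitted to the sample line only; this field layout is an assumption
# for any other Lustre release.
LOCK_RE = re.compile(
    r"lock collide during recovery\s+"
    r"ns: (?P<ns>\S+) "
    r"lock: (?P<lock>\S+) .*?"
    r"mode: (?P<mode>\S+) "
    r"res: (?P<res>\S+) .*?"
    r"type: (?P<type>\w+) \[(?P<start>\d+)->(?P<end>\d+)\] .*?"
    r"nid: (?P<nid>\S+) "
    r"remote: (?P<remote>\S+)"
)


def parse_lock_collision(line):
    """Return a dict of lock fields, or None if the line is not a collision report."""
    m = LOCK_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Store the extent as an inclusive (start, end) pair of integers.
    fields["extent"] = (int(fields.pop("start")), int(fields.pop("end")))
    return fields


if __name__ == "__main__":
    info = parse_lock_collision(SAMPLE)
    print(info["nid"], info["res"], info["mode"], info["extent"])
```

Feeding every line of the attached console log through `parse_lock_collision` and grouping the results by `res` would show whether the same resource is reported from more than one NID.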
| Comments |
| Comment by Peter Jones [ 15/Nov/12 ] |
|
Alex will triage this one |
| Comment by Andreas Dilger [ 17/Nov/12 ] |
|
This is something that should never happen. It means that two clients were granted conflicting locks for some reason, or at least that is what they believe. The error message itself could probably be more informative, if it printed the lock and NID info for both clients. Even with that it may be difficult to trace back when the clients were granted those locks. |
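To make "conflicting locks" concrete, here is a small sketch of the check for extent locks: two locks conflict when they are on the same resource, their byte ranges overlap, and their modes are incompatible. The compatibility table follows classic DLM mode semantics and is an assumption for illustration, not copied from the Lustre source; `ExtentLock`, `locks_conflict`, and the second client's NID and extent are hypothetical.

```python
from dataclasses import dataclass

# Mode compatibility in the classic DLM style (assumed, not taken from Lustre):
# each mode maps to the set of modes it can coexist with.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


@dataclass(frozen=True)
class ExtentLock:
    nid: str    # client that holds the lock
    res: str    # resource name, as printed after "res:" in the log
    mode: str   # lock mode, e.g. "PW"
    start: int  # first byte covered (inclusive)
    end: int    # last byte covered (inclusive)


def extents_overlap(a, b):
    """True if the inclusive byte ranges of a and b share at least one byte."""
    return a.start <= b.end and b.start <= a.end


def locks_conflict(a, b):
    """True if a and b could not both be legitimately granted."""
    return (
        a.res == b.res
        and extents_overlap(a, b)
        and b.mode not in COMPAT[a.mode]
    )


if __name__ == "__main__":
    # First lock uses the values from the reported console line.
    first = ExtentLock("172.20.4.149@o2ib500", "10773567/0", "PW", 0, 2**64 - 1)
    # Second lock is invented purely for illustration.
    second = ExtentLock("hypothetical-client@o2ib500", "10773567/0", "PW", 0, 4095)
    print(locks_conflict(first, second))
```

Under these assumptions, the PW lock over the whole object from the console log would conflict with any other PW or PR extent lock on resource 10773567/0, which is why printing the lock and NID details of both clients would narrow down the cause.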
| Comment by Carlos Thomaz [ 04/Apr/13 ] |
|
Peter, Andreas,
Just to let you know that we are seeing this at one of our customers running 2.1.3. The problem happens during MDS failover. Snipped logs below.
[root@mds1 ~]# grep BUG /var/log/kern
Apparently we are also being hit by
I would appreciate it if you could keep us updated on this.
Thank you, |