[LU-6994] MDT recovery timer goes negative, recovery never ends Created: 12/Aug/15 Updated: 05/Aug/16 Resolved: 05/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Attempting to mount a client, the recovery timer counts down and then apparently rolls over to a negative value; recovery never ends.

Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:24
LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.116@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 13 previous similar messages
Lustre: Skipped 2 previous similar messages
LustreError: 11-0: soaked-MDT0003-osp-MDT0000: operation mds_connect to node 0@lo failed: rc = -19
Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 7:55
Lustre: Skipped 4 previous similar messages
LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1439394552/real 1439394552] req@ffff880815c0dcc0 x1509313907525360/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1439394607 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
LustreError: Skipped 23 previous similar messages
Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:20
Lustre: Skipped 10 previous similar messages
LustreError: 137-5: soaked-MDT0002_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1439395077/real 1439395077] req@ffff880812ded9c0 x1509313907526388/t0(0) o38->soaked-MDT0001-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1439395088 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 183 previous similar messages
LustreError: Skipped 46 previous similar messages
LustreError: 11-0: soaked-MDT0003-osp-MDT0000: operation mds_connect to node 0@lo failed: rc = -19
LustreError: Skipped 1 previous similar message
Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 21188499:54 |
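Editor's note: the jump from a few minutes remaining to "21188499:54" is what an expired deadline looks like when it is still formatted as time-to-go. The following is a minimal, self-contained sketch (hypothetical, not the actual Lustre recovery-timer code) of how an unclamped timestamp difference can produce that kind of output:

    /*
     * Sketch only: illustrates how subtracting timestamps without
     * checking for expiry can print an absurd "remaining" time once
     * the deadline has passed, instead of 0:00 or "timed out".
     */
    #include <stdio.h>
    #include <time.h>

    static void print_remaining(time_t deadline, time_t now)
    {
            /* BUG: no check for deadline < now; the negative result
             * wraps around when stored in an unsigned type. */
            unsigned long left = (unsigned long)(deadline - now);

            printf("to recover in %lu:%02lu\n", left / 60, left % 60);
    }

    int main(void)
    {
            time_t now = time(NULL);

            print_remaining(now + 144, now);  /* sane output: "2:24"        */
            print_remaining(now - 30, now);   /* expired: huge minute count */

            /* A clamped version would report expiry explicitly instead:
             * left = deadline > now ? deadline - now : 0;                  */
            return 0;
    }

Clamping the difference to zero, or switching the message to an explicit "recovery timed out", would avoid advertising a bogus countdown.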
| Comments |
| Comment by Cliff White (Inactive) [ 12/Aug/15 ] |
|
Lustre log from the failed mount is attached. |
| Comment by Cliff White (Inactive) [ 14/Aug/15 ] |
|
I checked some previous versions |
| Comment by Joseph Gmitter (Inactive) [ 29/Oct/15 ] |
|
Hi Mike, |
| Comment by Andreas Dilger [ 29/Oct/15 ] |
|
Chris, have you hit this problem in your 2.7.x testing? Cliff, have you hit this problem again in testing since 2.7.57, assuming you have re-run similar tests since then? |
| Comment by Giuseppe Di Natale (Inactive) [ 03/Nov/15 ] |
|
I am currently experiencing this issue on a test cluster running 2.7.62. Let me know if there is any info I can provide which will help. |
| Comment by Peter Jones [ 03/Nov/15 ] |
|
Giuseppe, do you have any idea what you were doing prior to getting into this state? Peter |
| Comment by Giuseppe Di Natale (Inactive) [ 03/Nov/15 ] |
|
Peter, I am currently running a test cluster with DNE that has 4 MDSs and 2 OSTs. No failover is set up currently. I had an MDS crash while running a multi-node job that was writing to the Lustre filesystem. I rebooted the MDS and it managed to re-establish a connection with the other MDSs. I believe I powered off the clients for some reason. The recovering MDS's timer went negative and it displayed messages similar to the ones in this ticket. The clients were never evicted by the hard timeout that was supposed to occur. Attempting to abort the recovery against the MDT on the recovering MDS would evict the clients, but the abort would eventually hang and never finish. Unfortunately, I had to reformat the filesystem and am currently trying to reproduce the above. Giuseppe |
| Comment by Giuseppe Di Natale (Inactive) [ 11/Nov/15 ] |
|
I managed to reproduce this issue. Using the setup in my previous comment, I determined that a kernel panic occurred while initializing a new llog catalog record (details below) and an assertion was hit that was attempting to ensure that the llog_handle had a NULL pointer to a llog header struct. The panic occurred while I was running an mdtest job which was writing to a striped directory from 32 client nodes running 4 threads each. The panic caused an llog file to become corrupt. I manually repaired the llog file and restarted my MDSs, and recovery now completes. Perhaps a negative timer is a symptom of an "unrecoverable" error? If so, is it reflected somewhere that recovery cannot complete?

The details:

2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) ASSERTION( handle->lgh_hdr == ((void *)0) ) failed:
2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) LBUG
2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005
2015-11-04 13:36:26 Nov 4 13:36:26 ...
2015-11-04 13:36:26 Kernel panic - not syncing: LBUG
2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005 Tainted: P --------------- 2.6.32-504.16.2.1chaos.ch5.3.x86_64 #1
2015-11-04 13:36:26 Call Trace:
2015-11-04 13:36:26 [<ffffffff8152d471>] ? panic+0xa7/0x16f
2015-11-04 13:36:26 [<ffffffffa0847f2b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2015-11-04 13:36:26 [<ffffffffa09a62cf>] ? llog_init_handle+0x86f/0xb10 [obdclass]
2015-11-04 13:36:26 [<ffffffffa09ac809>] ? llog_cat_new_log+0x3d9/0xdc0 [obdclass]
2015-11-04 13:36:26 [<ffffffffa09a4663>] ? llog_declare_write_rec+0x93/0x210 [obdclass]
2015-11-04 13:36:26 [<ffffffffa09ad616>] ? llog_cat_declare_add_rec+0x426/0x430 [obdclass]
2015-11-04 13:36:26 [<ffffffffa09a406f>] ? llog_declare_add+0x7f/0x1b0 [obdclass]
2015-11-04 13:36:26 [<ffffffffa0c9c19c>] ? top_trans_start+0x17c/0x960 [ptlrpc]
2015-11-04 13:36:26 [<ffffffffa127cc11>] ? lod_trans_start+0x61/0x70 [lod]
2015-11-04 13:36:26 [<ffffffffa13248b4>] ? mdd_trans_start+0x14/0x20 [mdd]
2015-11-04 13:36:26 [<ffffffffa1313333>] ? mdd_create+0xe53/0x1aa0 [mdd]
2015-11-04 13:36:26 [<ffffffffa11c6784>] ? mdt_version_save+0x84/0x1a0 [mdt]
2015-11-04 13:36:26 [<ffffffffa11c8f46>] ? mdt_reint_create+0xbb6/0xcc0 [mdt]
2015-11-04 13:36:26 [<ffffffffa0a13230>] ? lu_ucred+0x20/0x30 [obdclass]
2015-11-04 13:36:26 [<ffffffffa11a8675>] ? mdt_ucred+0x15/0x20 [mdt]
2015-11-04 13:36:26 [<ffffffffa11c183c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
2015-11-04 13:36:26 [<ffffffffa0c43d32>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
2015-11-04 13:36:26 [<ffffffffa11c597d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
2015-11-04 13:36:26 [<ffffffffa11b177b>] ? mdt_reint_internal+0x62b/0xb80 [mdt]
2015-11-04 13:36:26 [<ffffffffa11b216b>] ? mdt_reint+0x6b/0x120 [mdt]
2015-11-04 13:36:26 [<ffffffffa0c8621c>] ? tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
2015-11-04 13:36:26 [<ffffffffa0c2da21>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
2015-11-04 13:36:26 [<ffffffff8106d740>] ? pick_next_task_fair+0xd0/0x130
2015-11-04 13:36:26 [<ffffffff8152d8f6>] ? schedule+0x176/0x3a0
2015-11-04 13:36:26 [<ffffffffa0c2cbe0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
2015-11-04 13:36:26 [<ffffffff8109fffe>] ? kthread+0x9e/0xc0
2015-11-04 13:36:27 [<ffffffff8100c24a>] ? child_rip+0xa/0x20
2015-11-04 13:36:27 [<ffffffff8109ff60>] ? kthread+0x0/0xc0
2015-11-04 13:36:27 [<ffffffff8100c240>] ? child_rip+0x0/0x20

After rebooting MDS1, I started to see llog corruption messages for an llog file that was on MDS4 (remember the panic was on MDS1), shown below:

2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted

Eventually, the recovery timer went negative and displayed messages similar to what is in the ticket. I manually fixed the llog file on MDS4 and recovery now completes. I think that covers it. If necessary, I can put all this information in another ticket, and I should be able to provide the corrupted and fixed llog file for diagnosis as well. |
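Editor's note: for context on the assertion above, llog_init_handle() expects to be handed an llog_handle whose header pointer is still NULL, so initializing a handle that already has a header attached (for example, initializing the same handle twice) trips the LASSERT and panics the node. A simplified sketch of that failure mode follows; it is an assumption for illustration, not the actual obdclass source:

    /*
     * Simplified sketch (not the real Lustre code) of the reported
     * class of failure: the init routine asserts the handle has no
     * header yet, so a double initialization fires the assertion.
     */
    #include <assert.h>
    #include <stdlib.h>

    struct llog_log_hdr { int dummy; };

    struct llog_handle {
            struct llog_log_hdr *lgh_hdr;   /* NULL until initialized */
    };

    static void llog_init_handle_sketch(struct llog_handle *handle)
    {
            /* Mirrors: ASSERTION( handle->lgh_hdr == ((void *)0) ) failed */
            assert(handle->lgh_hdr == NULL);
            handle->lgh_hdr = calloc(1, sizeof(*handle->lgh_hdr));
    }

    int main(void)
    {
            struct llog_handle h = { .lgh_hdr = NULL };

            llog_init_handle_sketch(&h);    /* first init: fine           */
            llog_init_handle_sketch(&h);    /* second init: assert fires  */
            return 0;
    }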
| Comment by Peter Jones [ 11/Nov/15 ] |
|
Excellent work Giuseppe (or Joe if you prefer)! |
| Comment by Christopher Morrone [ 11/Nov/15 ] |
|
Yes, that is good work! One way or another, we probably do need another ticket. I suspect that fixing the assertion and llog corruption will be the blocker that needs fixing for 2.8. At that point, fixing the fact that the recovery timer can go negative will drop in priority. The recovery timer really needs saner behavior in the face of errors, and we don't want to lose track of that issue. |
| Comment by Giuseppe Di Natale (Inactive) [ 11/Nov/15 ] |
|
Thanks! I went ahead and created a new ticket with the details, and I attached the llog files I mentioned. |
| Comment by Di Wang [ 12/Nov/15 ] |
2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted

should be fixed by the patch http://review.whamcloud.com/#/c/16969/ in
| Comment by Mikhail Pershin [ 13/Nov/15 ] |
|
Yes, that looks like |
| Comment by Peter Jones [ 13/Nov/15 ] |
|
Giuseppe, does the recommended fix solve the issue for your reproducer? Peter |
| Comment by Di Wang [ 19/Nov/15 ] |
|
Just found another problem which might contribute to this issue |
| Comment by Peter Jones [ 24/Nov/15 ] |
|
The current belief is that this is a duplicate of |
| Comment by Giuseppe Di Natale (Inactive) [ 10/Dec/15 ] |
|
Peter, I still believe this is a minor issue, nothing critical. I think the underlying issue is that recovery can in fact fail (or enter an unrecoverable state) and that this is not being reported properly. I suggest using this ticket to implement reporting the recovery status as failed/unrecoverable when the timer expires. Thoughts? Giuseppe |
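Editor's note: as a purely illustrative sketch of that suggestion, the state names, struct, and fields below are hypothetical and do not correspond to an existing Lustre interface. The idea is simply to surface an explicit terminal state once the recovery window expires without all clients reconnecting, rather than letting the countdown run past zero:

    /*
     * Hypothetical sketch of the proposed reporting: names and fields
     * are invented for illustration only.
     */
    #include <stdio.h>
    #include <time.h>

    enum recovery_state { RECOVERING, COMPLETE, FAILED };

    struct recovery_info {
            enum recovery_state state;
            time_t              deadline;
            int                 clients_expected;
            int                 clients_recovered;
    };

    static void recovery_update(struct recovery_info *ri, time_t now)
    {
            if (ri->clients_recovered == ri->clients_expected)
                    ri->state = COMPLETE;
            else if (now > ri->deadline)
                    ri->state = FAILED;     /* report failure instead of looping */
    }

    static const char *recovery_status(const struct recovery_info *ri)
    {
            switch (ri->state) {
            case COMPLETE: return "COMPLETE";
            case FAILED:   return "FAILED";
            default:       return "RECOVERING";
            }
    }

    int main(void)
    {
            /* deadline already in the past, 0 of 7 clients recovered */
            struct recovery_info ri = { RECOVERING, time(NULL) - 1, 7, 0 };

            recovery_update(&ri, time(NULL));
            printf("status: %s\n", recovery_status(&ri));   /* prints FAILED */
            return 0;
    }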
| Comment by Peter Jones [ 11/Dec/15 ] |
|
OK Giuseppe, I'll reopen the ticket and defer to Mike to comment. For now, I'll drop the priority and move this to 2.9 to reflect the reduced criticality of the issue. |
| Comment by Di Wang [ 05/Aug/16 ] |
|
This recovery status reporting issue will be resolved by |
| Comment by Peter Jones [ 05/Aug/16 ] |
|
Thanks for the tipoff, Di |