
MDT recovery timer goes negative, recovery never ends

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • Lustre 2.8.0
    • 3

    Description

      When attempting to mount a client, the recovery timer counts down and then apparently rolls over to a negative value; recovery never ends.

      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:24
      LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.116@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 13 previous similar messages
      Lustre: Skipped 2 previous similar messages
      LustreError: 11-0: soaked-MDT0003-osp-MDT0000: operation mds_connect to node 0@lo failed: rc = -19
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 7:55
      Lustre: Skipped 4 previous similar messages
      LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1439394552/real 1439394552]  req@ffff880815c0dcc0 x1509313907525360/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1439394607 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
      LustreError: Skipped 23 previous similar messages
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:20
      Lustre: Skipped 10 previous similar messages
      LustreError: 137-5: soaked-MDT0002_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1439395077/real 1439395077]  req@ffff880812ded9c0 x1509313907526388/t0(0) o38->soaked-MDT0001-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1439395088 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 183 previous similar messages
      LustreError: Skipped 46 previous similar messages
      LustreError: 11-0: soaked-MDT0003-osp-MDT0000: operation mds_connect to node 0@lo failed: rc = -19
      LustreError: Skipped 1 previous similar message
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 21188499:54
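
      The bogus countdown in the last line above ("21188499:54") is consistent with the remaining recovery time being computed as a raw difference and printed without being clamped once the deadline is missed or reset. The sketch below is illustrative only (hypothetical names, not Lustre's actual recovery-timer code) and shows how formatting an unclamped difference as minutes:seconds can print a negative or absurdly large countdown:

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      /* Illustrative only: names and layout are hypothetical, not Lustre's code. */
      static void print_remaining(time_t deadline, time_t now)
      {
              long remaining = (long)(deadline - now); /* negative once the deadline passes */

              /* Without a clamp such as "if (remaining < 0) remaining = 0;",
               * a missed or re-initialized deadline prints nonsense. */
              printf("to recover in %ld:%02ld\n", remaining / 60, labs(remaining % 60));
      }

      int main(void)
      {
              time_t now = time(NULL);

              print_remaining(now + 475, now); /* healthy case: prints "7:55" */
              print_remaining(0, now);         /* cleared deadline: huge negative value */
              return 0;
      }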
      

      Attachments

        Issue Links

          Activity

            [LU-6994] MDT recovery timer goes negative, recovery never ends

            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Peter,

            I still believe this is a minor issue, nothing critical. I think the issue is that recovery can in fact fail (or enter an unrecoverable state) and that is not being reported properly. I suggest this ticket be used to implement reporting the recovery status as failed/unrecoverable when the timer expires. Thoughts?

            Giuseppe
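
            A minimal sketch of the kind of clamp-and-report behavior being suggested here (hypothetical names and status values, not Lustre's actual recovery code):

            #include <time.h>

            /* Hypothetical status values and field names; not Lustre's actual API. */
            enum recovery_status { RECOVERING, RECOVERY_COMPLETE, RECOVERY_FAILED };

            struct recovery_state {
                    time_t               rs_deadline;
                    enum recovery_status rs_status;
            };

            /* Clamp the countdown at zero and flip to a terminal status instead of
             * letting the printed remaining time go negative forever. */
            static long recovery_time_left(struct recovery_state *rs, time_t now)
            {
                    if (now >= rs->rs_deadline) {
                            rs->rs_status = RECOVERY_FAILED; /* report failure, don't keep counting */
                            return 0;
                    }
                    return (long)(rs->rs_deadline - now);
            }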

            pjones Peter Jones added a comment -

            The current belief is that this is a duplicate of LU-7039 and/or LU-7450. We can reopen if evidence comes to light that contradicts this.

            di.wang Di Wang (Inactive) added a comment - edited

            Just found another problem which might contribute to this issue LU-7450.

            pjones Peter Jones added a comment -

            Giuseppe

            Does the recommended fix solve the issue for your reproducer?

            Peter


            tappro Mikhail Pershin added a comment -

            Yes, that looks like LU-7039.

            di.wang Di Wang (Inactive) added a comment -

            2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
            2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted
            

            This should be fixed by the patch http://review.whamcloud.com/#/c/16969/ in LU-7039.


            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Thanks! I went ahead and created a new ticket with the details, and I attached the llog files I mentioned.

            https://jira.hpdd.intel.com/browse/LU-7419


            morrone Christopher Morrone (Inactive) added a comment -

            Yes, that is good work!

            One way or another we probably do need another ticket. I suspect that fixing the assertion and the llog corruption will be the blocker for 2.8; at that point, the fact that the recovery timer can go negative will drop in priority. The recovery timer really needs more sane behavior in the face of errors, and we don't want to lose track of that issue.

            pjones Peter Jones added a comment -

            Excellent work Giuseppe (or Joe if you prefer)!


            dinatale2 Giuseppe Di Natale (Inactive) added a comment - edited

            I managed to reproduce this issue. Using the setup in my previous comment, an assertion was hit while a new llog catalog record was being initialized: the check that the llog_handle's header pointer (lgh_hdr) is NULL failed, which triggered an LBUG and a kernel panic (details below). The panic occurred while I was running an mdtest job writing to a striped directory from 32 client nodes with 4 threads each, and it left an llog file corrupted. I manually repaired the llog file, restarted my MDSs, and recovery now completes. Perhaps a negative timer is a symptom of an "unrecoverable" error? If so, is it reflected anywhere that recovery cannot complete?

            The details:
            To summarize my setup: I am running a test cluster with 4 MDSs, 2 OSTs, and 32 client nodes with the filesystem mounted, and no failover. The Lustre version is 2.7.62. The error messages and call stack from MDS1 are below:

            2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) ASSERTION( handle->lgh_hdr == ((void *)0) ) failed:
            2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) LBUG
            2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005
            2015-11-04 13:36:26 Nov  4 13:36:26
            ...
            2015-11-04 13:36:26 Kernel panic - not syncing: LBUG
            2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005 Tainted: P           ---------------    2.6.32-504.16.2.1chaos.ch5.3.x86_64 #1
            2015-11-04 13:36:26 Call Trace:
            2015-11-04 13:36:26  [<ffffffff8152d471>] ? panic+0xa7/0x16f
            2015-11-04 13:36:26  [<ffffffffa0847f2b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
            2015-11-04 13:36:26  [<ffffffffa09a62cf>] ? llog_init_handle+0x86f/0xb10 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa09ac809>] ? llog_cat_new_log+0x3d9/0xdc0 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa09a4663>] ? llog_declare_write_rec+0x93/0x210 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa09ad616>] ? llog_cat_declare_add_rec+0x426/0x430 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa09a406f>] ? llog_declare_add+0x7f/0x1b0 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa0c9c19c>] ? top_trans_start+0x17c/0x960 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffffa127cc11>] ? lod_trans_start+0x61/0x70 [lod]
            2015-11-04 13:36:26  [<ffffffffa13248b4>] ? mdd_trans_start+0x14/0x20 [mdd]
            2015-11-04 13:36:26  [<ffffffffa1313333>] ? mdd_create+0xe53/0x1aa0 [mdd]
            2015-11-04 13:36:26  [<ffffffffa11c6784>] ? mdt_version_save+0x84/0x1a0 [mdt]
            2015-11-04 13:36:26  [<ffffffffa11c8f46>] ? mdt_reint_create+0xbb6/0xcc0 [mdt]
            2015-11-04 13:36:26  [<ffffffffa0a13230>] ? lu_ucred+0x20/0x30 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa11a8675>] ? mdt_ucred+0x15/0x20 [mdt]
            2015-11-04 13:36:26  [<ffffffffa11c183c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
            2015-11-04 13:36:26  [<ffffffffa0c43d32>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffffa11c597d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
            2015-11-04 13:36:26  [<ffffffffa11b177b>] ? mdt_reint_internal+0x62b/0xb80 [mdt]
            2015-11-04 13:36:26  [<ffffffffa11b216b>] ? mdt_reint+0x6b/0x120 [mdt]
            2015-11-04 13:36:26  [<ffffffffa0c8621c>] ? tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffffa0c2da21>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffff8106d740>] ? pick_next_task_fair+0xd0/0x130
            2015-11-04 13:36:26  [<ffffffff8152d8f6>] ? schedule+0x176/0x3a0
            2015-11-04 13:36:26  [<ffffffffa0c2cbe0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffff8109fffe>] ? kthread+0x9e/0xc0
            2015-11-04 13:36:27  [<ffffffff8100c24a>] ? child_rip+0xa/0x20
            2015-11-04 13:36:27  [<ffffffff8109ff60>] ? kthread+0x0/0xc0
            2015-11-04 13:36:27  [<ffffffff8100c240>] ? child_rip+0x0/0x20
            

            After rebooting MDS1, I started to see llog corruption messages for an llog file that was on MDS4 (remember the panic was on MDS1) shown below:

            2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
            2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted
            

            Eventually, the recovery timer went negative and the MDS logged messages similar to those in the ticket description. I manually fixed the llog file on MDS4 and recovery now completes. I think that covers it. If necessary, I can put all this information in another ticket, and I should be able to provide the corrupted and fixed llog files for diagnosis as well.
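
            For reference, the failed assertion in the trace above is a guard against initializing a handle that already has a header attached. A generic sketch of that pattern (hypothetical stand-in types, not the actual obdclass code):

            #include <assert.h>
            #include <stddef.h>
            #include <stdlib.h>

            /* Hypothetical stand-ins for the llog handle/header; not Lustre's structs. */
            struct log_header { int lh_flags; };
            struct log_handle { struct log_header *lgh_hdr; };

            /* A handle that already carries a header is being initialized twice,
             * so the init path asserts that the header pointer is still NULL
             * before allocating one. */
            static int log_init_handle(struct log_handle *h)
            {
                    assert(h->lgh_hdr == NULL); /* analogous to the LASSERT that fired */
                    h->lgh_hdr = calloc(1, sizeof(*h->lgh_hdr));
                    return h->lgh_hdr != NULL ? 0 : -1;
            }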


            dinatale2 Giuseppe Di Natale (Inactive) added a comment - edited

            Peter,

            I am currently running a test cluster with DNE that has 4 MDSs and 2 OSTs; no failover is set up. I had an MDS crash while running a multi-node job that was writing to the Lustre filesystem. I rebooted the MDS and it managed to re-establish a connection with the other MDSs. I believe I powered off the clients for some reason. The recovering MDS's timer went negative and it displayed messages similar to those in the ticket. The clients were never evicted by the hard timeout that was supposed to occur. Attempting to abort the recovery against the MDT on the recovering MDS would evict the clients, but the abort would eventually hang and never finish. Unfortunately, I had to reformat the filesystem and am currently trying to reproduce the above.

            Giuseppe


            People

              Assignee: tappro Mikhail Pershin
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: