[LU-5457] lvbo_init failed for resource Created: 06/Aug/14 Updated: 16/Oct/15 Resolved: 16/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Mahmoud Hanafi | Assignee: | Oleg Drokin |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre 2.4.3 server |
| Severity: | 3 |
| Rank (Obsolete): | 15197 |
| Description |
|
OST goes into disconn state on the MDS. We see the following error on the OSS:
00010000:00020000:9.0:1407349941.597616:0:10879:0:(ldlm_resource.c:1165:ldlm_resource_get()) nbp8-OST001d: lvbo_init failed for resource 0x8b3e13:0x0: rc = -2
Will upload debug logs to the ftp site.
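(For reference, a minimal sketch of checking the disconn state from the MDS, assuming the device name from the log above; the osc parameter path is an assumption and may differ between Lustre versions:)
# Inspect this OST's import from the MDS; the connection state reads
# "DISCONN" while the import is down. Parameter path is an assumption.
lctl get_param osc.nbp8-OST001d-osc-MDT0000.import | grep -i state |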
| Comments |
| Comment by Mahmoud Hanafi [ 06/Aug/14 ] |
|
We tried to unmount and remount the OST but it just hung. Took a crash dump, rebooted, and remounted. It was OK for a few minutes after recovery but went into disconn state again. Uploaded to ftp.whamcloud.com |
| Comment by Mahmoud Hanafi [ 06/Aug/14 ] |
|
Recovery is stuck:
nbp8-oss4 /proc/fs/lustre/obdfilter/nbp8-OST001d # cat recovery_status
status: RECOVERING
recovery_start: 1407348997
time_remaining: 0
connected_clients: 5792/6929
req_replay_clients: 0
lock_repay_clients: 17
completed_clients: 5775
evicted_clients: 1137
replayed_requests: 0
queued_requests: 0
next_transno: 30065297379
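(As an aside, a small sketch for keeping an eye on this, reusing the proc path above:)
# e.g., re-read the recovery state every 10 seconds until it leaves RECOVERING
watch -n 10 cat /proc/fs/lustre/obdfilter/nbp8-OST001d/recovery_status |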
| Comment by Oleg Drokin [ 06/Aug/14 ] |
|
Can you please also attach kernel logs from the OST and MDS? I would suspect the preallocated object state between the OST and MDT is out of alignment, which would make the MDT reject this OST, similar to some other tickets that I cannot remember right now; I will update this when I find them, or based on your other logs. The -2 lvbo_init error does not really have much to do with the disconnect problem you are having. It basically means that an object referenced from the MDT does not exist on the OST, which might be a valid race or a result of aborted recovery, though it might also mean some slight damage to the filesystem if the OST did not properly flush creates to disk the last time around.
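(To illustrate what rc = -2 (-ENOENT) means here, a hedged sketch of checking whether the object from the error exists on the OST backing device, assuming an ldiskfs backend; the device name and mount point are hypothetical:)
# 0x8b3e13 = 9125395; ldiskfs OSTs store objects under O/<seq>/d<objid % 32>/<objid>.
# Mount read-only while the OST is offline. /dev/ost001d is a hypothetical device.
mount -t ldiskfs -o ro /dev/ost001d /mnt/ost
ls -l /mnt/ost/O/0/d$((9125395 % 32))/9125395   # "No such file" matches rc = -2
umount /mnt/ost |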
| Comment by Mahmoud Hanafi [ 06/Aug/14 ] |
|
uploaded to ftp site. |
| Comment by Mahmoud Hanafi [ 06/Aug/14 ] |
|
I unmounted and remounted using abort_recovery, and that cleared up the issue for now. Will watch to see if it happens again.
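(For the record, a sketch of what the remount with recovery aborted looks like; the device and mount point are assumptions for this system:)
# Remount the OST with recovery aborted; clients with state to replay get
# evicted instead of replayed. Device/mount point are hypothetical.
umount /mnt/ost001d
mount -t lustre -o abort_recov /dev/ost001d /mnt/ost001d |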
| Comment by Oleg Drokin [ 06/Aug/14 ] |
|
In the OSS log there appears to be a stuck thread that is trying to sync the disk. |
| Comment by John Fuchs-Chesney (Inactive) [ 06/Aug/14 ] |
|
Thank you Oleg. |
| Comment by Mahmoud Hanafi [ 07/Aug/14 ] |
|
Priority on this can be lowered. |
| Comment by Andreas Dilger [ 08/May/15 ] |
|
Mahmoud, it looks like this issue could be closed at this point? |
| Comment by Mahmoud Hanafi [ 10/Jul/15 ] |
|
We have hit this issue again running with 2.5.3. From the client we get [...] and on the server: |
| Comment by Mahmoud Hanafi [ 10/Jul/15 ] |
|
looks like there was a network error between the MDS and OSS that may have triggered this. Why wouldn't the MDS resend its request?
Jul 10 11:50:00 nbp8-mds1 kernel: LNet: 5824:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.151.27.75@o2ib: error -110(sending)
Jul 10 11:50:00 nbp8-mds1 kernel: Lustre: 6270:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1436554008/real 1436554200] req@ffff8801de9f3c00 x1503598763749580/t0(0) o6->nbp8-OST00c3-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554238 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:50:00 nbp8-mds1 kernel: Lustre: nbp8-OST012b-osc-MDT0000: Connection to nbp8-OST012b (at 10.151.27.75@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 10 11:50:00 nbp8-mds1 kernel: Lustre: 6270:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Jul 10 11:50:30 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: haven't heard from client nbp8-MDT0000-lwp-OST000d_UUID (at 10.151.27.75@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881d30834400, cur 1436554230 expire 1436554080 last 1436554003
Jul 10 11:50:34 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Client a32f8183-5ffd-56e1-9cce-c045e3ed6606 (at 10.153.12.159@o2ib233) reconnecting
Jul 10 11:50:39 nbp8-mds1 kernel: Lustre: 6277:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554008/real 0] req@ffff8820df0b3c00 x1503598763750488/t0(0) o13->nbp8-OST008f-osc-MDT0000@10.151.27.75@o2ib:7/4 lens 224/368 e 0 to 1 dl 1436554238 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:50:39 nbp8-mds1 kernel: Lustre: nbp8-OST0111-osc-MDT0000: Connection to nbp8-OST0111 (at 10.151.27.75@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 10 11:50:39 nbp8-mds1 kernel: Lustre: Skipped 6 previous similar messages
Jul 10 11:50:39 nbp8-mds1 kernel: Lustre: 6277:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Jul 10 11:50:40 nbp8-mds1 kernel: Lustre: 6295:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436554007/real 1436554007] req@ffff880119fce400 x1503598763748812/t0(0) o13->nbp8-OST0041-osc-MDT0000@10.151.27.75@o2ib:7/4 lens 224/368 e 0 to 1 dl 1436554237 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:50:40 nbp8-mds1 kernel: Lustre: nbp8-OST00dd-osc-MDT0000: Connection to nbp8-OST00dd (at 10.151.27.75@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 10 11:50:41 nbp8-mds1 kernel: Lustre: Skipped 1 previous similar message
Jul 10 11:50:43 nbp8-mds1 kernel: Lustre: 6268:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554013/real 0] req@ffff8820e9604c00 x1503598763751740/t0(0) o6->nbp8-OST00dd-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554243 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:50:43 nbp8-mds1 kernel: Lustre: 6268:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Jul 10 11:50:55 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Client f3b17551-baab-0ace-17ef-34be763ba268 (at 10.153.10.161@o2ib233) reconnecting
Jul 10 11:50:55 nbp8-mds1 kernel: Lustre: Skipped 1 previous similar message
Jul 10 11:51:04 nbp8-mds1 kernel: Lustre: 6285:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554032/real 0] req@ffff8820e1a31c00 x1503598763756756/t0(0) o6->nbp8-OST00dd-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554262 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:51:04 nbp8-mds1 kernel: Lustre: 6285:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jul 10 11:51:13 nbp8-mds1 kernel: Lustre: 6292:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554042/real 0] req@ffff883244b63400 x1503598763759424/t0(0) o6->nbp8-OST000d-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554272 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:51:13 nbp8-mds1 kernel: Lustre: 6292:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Jul 10 11:51:30 nbp8-mds1 kernel: Lustre: 6272:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554058/real 0] req@ffff8820e3c28c00 x1503598763763336/t0(0) o6->nbp8-OST00a9-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554288 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:51:30 nbp8-mds1 kernel: Lustre: 6272:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jul 10 11:51:35 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Client 6910b65a-5f75-621c-ce1c-d96b21b50225 (at 10.153.11.198@o2ib233) reconnecting
Jul 10 11:51:35 nbp8-mds1 kernel: Lustre: Skipped 5 previous similar messages
Jul 10 11:52:04 nbp8-mds1 kernel: Lustre: 6285:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554094/real 0] req@ffff882392070c00 x1503598763774164/t0(0) o6->nbp8-OST00c3-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554324 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:52:04 nbp8-mds1 kernel: Lustre: 6285:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 28 previous similar messages
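(A quick sketch for pulling the timeline for the affected OSS NID out of syslog; the log path is an assumption:)
# e.g., extract the timeout/eviction timeline for the OSS NID seen above
grep '10.151.27.75@o2ib' /var/log/messages | grep -E 'expire_one_request|evicting|Connection to' |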
| Comment by Oleg Drokin [ 14/Jul/15 ] |
|
The MDS does resend requests; the error you see is just informing us that the RPC failed to send, then we retry. None of these should result in missing objects on OSTs, even if the RPCs never succeeded in resending. Now, if you have older pre-2.4 clients, those could initiate their own destroys that might lead to this. Or this could be fallout from earlier sync failures, where the OST announced it created some objects, failed to sync that to disk, and then after dying and restarting, the objects that the MDT had handed out from this pool are no longer there. You can probably verify this with some sort of creation-date check on the file that is missing objects.
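(A sketch of that check once the file name is known; the path is hypothetical, and `last reboot` on the OSS is just one way to find the restart time:)
# Compare the suspect file's timestamps against the OSS restart time; a file
# created just before an unclean OST shutdown fits the "unflushed creates"
# theory. /nbp8/path/to/file is hypothetical.
stat /nbp8/path/to/file     # on a client: look at the ctime/mtime
last reboot | head -3       # on the OSS: when did it restart? |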
| Comment by Mahmoud Hanafi [ 22/Jul/15 ] |
|
Is there a way we can map this error message to the specific file?
Jul 22 15:27:38 nbp8-oss6 kernel: LustreError: 44478:0:(ldlm_resource.c:1188:ldlm_resource_get()) nbp8-OST0039: lvbo_init failed for resource 0x17d2529:0x0: rc = -2 |
| Comment by Oleg Drokin [ 23/Jul/15 ] |
|
Hm, that's a bit of a tough one. If the object were really there, it's somewhat simple - you mount your OST as ldiskfs and then:
[root@centos6-9 tests]# ../utils/ll_decode_filter_fid /mnt/nfs/O/0/d1/4897
/mnt/nfs/O/0/d1/4897: parent=[0x200002342:0xb1:0x0] stripe=0
[root@centos6-9 tests]# ../utils/lfs fid2path /mnt/lustre/ '[0x200002342:0xb1:0x0]'
/mnt/lustre/d102d/d102d.sanity/file2-1-2
So ll_decode_filter_fid gives you the parent file FID, and then you use that with lfs fid2path to get the file name. With no object, you cannot get the parent file name that way. (The object itself seems to be in the old format - is this an old filesystem that was upgraded multiple times in the past?) Instead, scan the namespace from a client:
for i in `../utils/lfs find /mnt/lustre --ost lustre-OST0000_UUID` ; do ../utils/lfs getstripe $i ; done
In this output you need to look for your object id (0x17d2529 = 24978729, so that is what you are looking for):
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
        obdidx           objid          objid            group
             0             232          0xe8                 0
Note that the obdidx needs to match the OST you are looking for (for multi-striped files); in your case 0x39 converted to decimal is 57. So you are looking for '57\s+24978729' in the output, essentially. This will only work if the file referencing this object still exists and is visible in the namespace.
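(Tying those steps together, a hedged sketch of a scan for this particular object; the object id and OST index come from the log above, but the /nbp8 client mount point and the script itself are illustrative:)
#!/bin/bash
# Find the file owning object 0x17d2529 (24978729) on OST index 0x39 (57).
# /nbp8 as the client mount point is an assumption for this site.
OBJID=$((0x17d2529))    # 24978729
OSTIDX=$((0x39))        # 57
for f in $(lfs find /nbp8 --ost nbp8-OST0039_UUID); do
    # match an "obdidx objid" data row in the getstripe output
    lfs getstripe "$f" | awk -v o="$OSTIDX" -v id="$OBJID" \
        '$1 == o && $2 == id { found = 1 } END { exit !found }' && echo "$f"
done |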
| Comment by Mahmoud Hanafi [ 15/Oct/15 ] |
|
This can be closed |
| Comment by Peter Jones [ 16/Oct/15 ] |
|
ok Mahmoud |