[LU-5457] lvbo_init failed for resource Created: 06/Aug/14  Updated: 16/Oct/15  Resolved: 16/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Mahmoud Hanafi Assignee: Oleg Drokin
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

lustre2.4.3 server


Severity: 3
Rank (Obsolete): 15197

 Description   

OST goes into disconnected state on the MDS. We see the following error on the OSS:

00010000:00020000:9.0:1407349941.597616:0:10879:0:(ldlm_resource.c:1165:ldlm_resource_get()) nbp8-OST001d: lvbo_init failed for resource 0x8b3e13:0x0: rc = -2

Will upload debug logs to the FTP site.



 Comments   
Comment by Mahmoud Hanafi [ 06/Aug/14 ]

We tried to unmount and remount the OST but it just hung. Took a crash dump, rebooted, and remounted. It was OK for a few minutes after recovery but went into the disconnected state again.

uploaded to ftp.whamcloud.com
uploads/LU-5457/LU-5457.debug.beforereboot.gz
uploads/LU-5457/LU-5457.debug.afterrecovery.gz

Comment by Mahmoud Hanafi [ 06/Aug/14 ]

Recovery stuck

nbp8-oss4 /proc/fs/lustre/obdfilter/nbp8-OST001d # cat recovery_status 
status: RECOVERING
recovery_start: 1407348997
time_remaining: 0
connected_clients: 5792/6929
req_replay_clients: 0
lock_repay_clients: 17
completed_clients: 5775
evicted_clients: 1137
replayed_requests: 0
queued_requests: 0
next_transno: 30065297379

Comment by Oleg Drokin [ 06/Aug/14 ]

Can you please also attach kernel logs from the OST and MDS?

I suspect the preallocated object state between the OST and MDT is out of alignment, which would make the MDT reject this OST, similar to some other tickets that I cannot remember right now; I will update this when I find them, or based on your other logs.

The -2 lvbo_init error does not really have much to do with the disconnect problem you are having. It basically means that an object referenced from the MDT does not exist on the OST, which might be a valid race or a result of an aborted recovery, though it might also mean some slight damage to the filesystem if the OST did not properly flush object creates the last time around.
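For reference, rc = -2 is the standard Linux errno 2 (ENOENT, "No such file or directory"); a quick way to confirm this, assuming standard errno numbering:

perl -e '$! = 2; print "$!\n"'    # prints: No such file or directory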

Comment by Mahmoud Hanafi [ 06/Aug/14 ]

Uploaded to the FTP site:
oss.messages.gz
mds.messages.gz

Comment by Mahmoud Hanafi [ 06/Aug/14 ]

I unmounted and remounted using abort_recovery, and that cleared up the issue for now. Will watch to see if it happens again.

Comment by Oleg Drokin [ 06/Aug/14 ]

In the OSS log there appears to be a stuck thread that is trying to sync the disk.
That also prevents the MDS from connecting; I wonder if it ever completed. The thread is 11603, and the stack trace is stamped "Aug 5 11:20:35": INFO: task tgt_recov:11603 blocked for more than 120 seconds.

Comment by John Fuchs-Chesney (Inactive) [ 06/Aug/14 ]

Thank you Oleg.

Comment by Mahmoud Hanafi [ 07/Aug/14 ]

The priority on this can be lowered.

Comment by Andreas Dilger [ 08/May/15 ]

Mahmoud, it looks like this issue could be closed at this point?

Comment by Mahmoud Hanafi [ 10/Jul/15 ]

We have hit this issue again, running 2.5.3. From the client we get:
pfe21 /nobackupp8/tmrogers/nobackupp30/tmrogers/dougwaves/nv4-lk # ls -l ntrace.431035
ls: cannot access ntrace.431035: Cannot allocate memory

and on the server:
LustreError: 44987:0:(ldlm_resource.c:1188:ldlm_resource_get()) nbp8-OST008f: lvbo_init failed for resource 0xf1a94b:0x0: rc = -2

Comment by Mahmoud Hanafi [ 10/Jul/15 ]

It looks like there was a network error between the MDS and OSS that may have triggered this. Why wouldn't the MDS resend its request?

Jul 10 11:50:00 nbp8-mds1 kernel: LNet: 5824:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.151.27.75@o2ib: error -110(sending)
Jul 10 11:50:00 nbp8-mds1 kernel: Lustre: 6270:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1436554008/real 1436554200]  req@ffff8801de9f3c00 x1503598763749580/t0(0) o6->nbp8-OST00c3-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554238 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:50:00 nbp8-mds1 kernel: Lustre: nbp8-OST012b-osc-MDT0000: Connection to nbp8-OST012b (at 10.151.27.75@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 10 11:50:00 nbp8-mds1 kernel: Lustre: 6270:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Jul 10 11:50:30 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: haven't heard from client nbp8-MDT0000-lwp-OST000d_UUID (at 10.151.27.75@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881d30834400, cur 1436554230 expire 1436554080 last 1436554003
Jul 10 11:50:34 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Client a32f8183-5ffd-56e1-9cce-c045e3ed6606 (at 10.153.12.159@o2ib233) reconnecting
Jul 10 11:50:39 nbp8-mds1 kernel: Lustre: 6277:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554008/real 0]  req@ffff8820df0b3c00 x1503598763750488/t0(0) o13->nbp8-OST008f-osc-MDT0000@10.151.27.75@o2ib:7/4 lens 224/368 e 0 to 1 dl 1436554238 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:50:39 nbp8-mds1 kernel: Lustre: nbp8-OST0111-osc-MDT0000: Connection to nbp8-OST0111 (at 10.151.27.75@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 10 11:50:39 nbp8-mds1 kernel: Lustre: Skipped 6 previous similar messages
Jul 10 11:50:39 nbp8-mds1 kernel: Lustre: 6277:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Jul 10 11:50:40 nbp8-mds1 kernel: Lustre: 6295:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436554007/real 1436554007]  req@ffff880119fce400 x1503598763748812/t0(0) o13->nbp8-OST0041-osc-MDT0000@10.151.27.75@o2ib:7/4 lens 224/368 e 0 to 1 dl 1436554237 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:50:40 nbp8-mds1 kernel: Lustre: nbp8-OST00dd-osc-MDT0000: Connection to nbp8-OST00dd (at 10.151.27.75@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jul 10 11:50:41 nbp8-mds1 kernel: Lustre: Skipped 1 previous similar message
Jul 10 11:50:43 nbp8-mds1 kernel: Lustre: 6268:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554013/real 0]  req@ffff8820e9604c00 x1503598763751740/t0(0) o6->nbp8-OST00dd-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554243 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:50:43 nbp8-mds1 kernel: Lustre: 6268:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Jul 10 11:50:55 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Client f3b17551-baab-0ace-17ef-34be763ba268 (at 10.153.10.161@o2ib233) reconnecting
Jul 10 11:50:55 nbp8-mds1 kernel: Lustre: Skipped 1 previous similar message
Jul 10 11:51:04 nbp8-mds1 kernel: Lustre: 6285:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554032/real 0]  req@ffff8820e1a31c00 x1503598763756756/t0(0) o6->nbp8-OST00dd-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554262 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:51:04 nbp8-mds1 kernel: Lustre: 6285:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jul 10 11:51:13 nbp8-mds1 kernel: Lustre: 6292:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554042/real 0]  req@ffff883244b63400 x1503598763759424/t0(0) o6->nbp8-OST000d-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554272 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:51:13 nbp8-mds1 kernel: Lustre: 6292:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Jul 10 11:51:30 nbp8-mds1 kernel: Lustre: 6272:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554058/real 0]  req@ffff8820e3c28c00 x1503598763763336/t0(0) o6->nbp8-OST00a9-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554288 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:51:30 nbp8-mds1 kernel: Lustre: 6272:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jul 10 11:51:35 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Client 6910b65a-5f75-621c-ce1c-d96b21b50225 (at 10.153.11.198@o2ib233) reconnecting
Jul 10 11:51:35 nbp8-mds1 kernel: Lustre: Skipped 5 previous similar messages
Jul 10 11:52:04 nbp8-mds1 kernel: Lustre: 6285:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1436554094/real 0]  req@ffff882392070c00 x1503598763774164/t0(0) o6->nbp8-OST00c3-osc-MDT0000@10.151.27.75@o2ib:28/4 lens 664/432 e 0 to 1 dl 1436554324 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jul 10 11:52:04 nbp8-mds1 kernel: Lustre: 6285:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 28 previous similar messages
Comment by Oleg Drokin [ 14/Jul/15 ]

The MDS does resend requests; the error you see is just informing us that the RPC failed to send, and then we retry.
In the log I see that statfs and destroy RPCs are affected.

None of them should result in missing objects on OSTs, even if the RPCs never succeeded in resending.
In fact, no single RPC send failure from the MDS should result in a problem like this, at least none that I can think of.

Now, if you have older pre-2.4 clients, those could initiate their own destroys that might lead to this. Or this could be fallout from earlier sync failures: the OST announced it had created some objects, failed to sync that to disk, and then after dying and restarting, the objects that the MDT had already handed out from this pool were no longer there. You can probably verify this with some sort of creation-date check on the file that is missing objects.

Comment by Mahmoud Hanafi [ 22/Jul/15 ]

Is there a way we can map this error message to the specific file?

Jul 22 15:27:38 nbp8-oss6 kernel: LustreError: 44478:0:(ldlm_resource.c:1188:ldlm_resource_get()) nbp8-OST0039: lvbo_init failed for resource 0x17d2529:0x0: rc = -2

Comment by Oleg Drokin [ 23/Jul/15 ]

Hm, that's a bit of a tough one.

If the object were really there, it would be somewhat simple. You mount your OST as ldiskfs:
mount /device /mnt/somewhere -t ldiskfs -o ro # (OK to do even while the OST is up, if done on the same node)
then you find the object ID in the /mnt/somewhere/O directory (it's hashed into subdirectories),
and then you do something like:

[root@centos6-9 tests]# ../utils/ll_decode_filter_fid  /mnt/nfs/O/0/d1/4897
/mnt/nfs/O/0/d1/4897: parent=[0x200002342:0xb1:0x0] stripe=0
[root@centos6-9 tests]# ../utils/lfs fid2path /mnt/lustre/ '[0x200002342:0xb1:0x0]'
/mnt/lustre/d102d/d102d.sanity/file2-1-2

So ll_decode_filter_fid gives you the parent file's FID, and then you use that with lfs fid2path to get the file name.
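For convenience, a minimal sketch that chains the two steps (the object path below is hypothetical; it assumes the ldiskfs mount from above, the usual O/<seq>/d<objid mod 32>/<objid> object layout, and the parent=[...] output format shown in the example):

OBJ=/mnt/somewhere/O/0/d9/24978729     # hypothetical object path for objid 0x17d2529
PARENT=$(ll_decode_filter_fid "$OBJ" | sed -n 's/.*parent=\[\([^]]*\)\].*/\1/p')
lfs fid2path /mnt/lustre "[$PARENT]"   # prints the path under the client mount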

Now, with no object you cannot get the parent file name, but the object itself seems to be in the old format (is this an old filesystem that was upgraded multiple times in the past?).
There is certainly a more labor-intensive way of finding which file it belongs to.
Basically, you print out the striping for all files on the OST in question with:

for i in `../utils/lfs  find /mnt/lustre --ost lustre-OST0000_UUID ` ; do ../utils/lfs getstripe $i ; done

and then in this output you need to look for your object ID (0x17d2529 = 24978729, so that is the number you are looking for) in a block like:

lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		 objid		 group
	     0	           232	         0xe8	             0

Note that obdidx needs to match the OST you are looking for (for multi-striped files); in your case 0x39 converted to decimal is 57. So you are essentially looking for '57\s+24978729' in the output.
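As a rough sketch, assuming the getstripe output format shown above, that lfs is in the PATH, and substituting your client mount point for /mnt/lustre, the conversion and the search could be done in one go (grep -B 10 keeps enough context to show the file name the matching objid belongs to):

printf '%d\n' 0x17d2529          # -> 24978729
for f in $(lfs find /mnt/lustre --ost nbp8-OST0039_UUID); do
    lfs getstripe "$f"
done | grep -B 10 -E '57[[:space:]]+24978729'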

This will only work if the file referencing this object still exists and is visible in the namespace.
Typically you can spot such files easily: when you do "ls" in a directory you get a bunch of errors from ls about being unable to stat some files, followed by the listing of the ones it was able to stat (ls -l will output the problematic ones with question marks in the permission-bits columns).
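A hypothetical example of what such a listing can look like (the exact error text varies; compare the "Cannot allocate memory" seen from the client earlier in this ticket):

$ ls -l
ls: cannot access broken_file: Cannot allocate memory
total 1024
-?????????  ? ?    ?        ?            ? broken_file
-rw-r--r--  1 user group 1048576 Jul 22 12:00 good_file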

Comment by Mahmoud Hanafi [ 15/Oct/15 ]

This can be closed

Comment by Peter Jones [ 16/Oct/15 ]

ok Mahmoud
