[LU-11574] LustreError: 157-3: Trying to start OBD nbp13-OST000b_UUID using the wrong disk Created: 26/Oct/18  Updated: 04/Sep/20  Resolved: 30/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Mahmoud Hanafi Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 1
Rank (Obsolete): 9223372036854775807

 Description   

Filesystem is down.

We had RAID backend issues and an OSS crashed. We ran fsck after the crash. When trying to remount the OST we get this error. I have tried running fsck several times.

 


 [ 4993.782431] LustreError: 157-3: Trying to start OBD nbp13-OST000b_UUID using the wrong disk . Were the /dev/ assignments rearranged?
[ 4993.825963] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882dcd3a2f00[0x0, 1, [0x1:0x0:0x0] hash exist]{
[ 4993.825963] 
[ 4993.864146] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882dcd3a2f50
[ 4993.864146] 
[ 4993.896057] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882df011ba00osd-ldiskfs-object@ffff882df011ba00(i:ffff882b7af88958:78/2138703796)[plain]
[ 4993.896057] 
[ 4993.947314] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff882dcd3a2f00
[ 4993.947314] 
[ 4993.978176] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882dcd3a3140[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
[ 4993.978176] 
[ 4994.018450] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882dcd3a3190
[ 4994.018450] 
[ 4994.050361] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882eee0e8d00osd-ldiskfs-object@ffff882eee0e8d00(i:ffff882b7af6f2d0:77/2138703762)[plain]
[ 4994.050361] 
[ 4994.101617] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff882dcd3a3140
[ 4994.101617] 
[ 4994.137076] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882dcd3a2c00[0x0, 1, [0xa:0x0:0x0] hash exist]{
[ 4994.137076] 
[ 4994.175261] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882dcd3a2c50
[ 4994.175261] 
[ 4994.207171] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882eee0e9100osd-ldiskfs-object@ffff882eee0e9100(i:ffff882b7af90d90:79/2138703830)[plain]
[ 4994.207171] 
[ 4994.258427] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff882dcd3a2c00
[ 4994.258427] 
[ 4994.299979] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882dcd3a2e40[0x0, 1, [0x200000001:0x1017:0x0] hash exist]{
[ 4994.299979] 
[ 4994.341038] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882dcd3a2e90
[ 4994.341038] 
[ 4994.372949] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882df011a600osd-ldiskfs-object@ffff882df011a600(i:ffff882c098426e0:1090561/1004891530)[plain]
[ 4994.372949] 
[ 4994.425511] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff882dcd3a2e40
[ 4994.425511] 
[ 4994.464017] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882bfaae7380[0x0, 1, [0xa:0x18:0x0] hash exist]{
[ 4994.464017] 
[ 4994.502461] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882bfaae73d0
[ 4994.502461] 
[ 4994.534372] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882edbe28200osd-ldiskfs-object@ffff882edbe28200(i:ffff882b7af91600:80/2939569139)[plain]
[ 4994.534372] 
[ 4994.571667] Lustre: nbp13-OST000b: Not available for connect from 10.151.25.231@o2ib (not set up)


 Comments   
Comment by Joseph Gmitter (Inactive) [ 26/Oct/18 ]

We have engineers looking into the issue and will provide an update ASAP.

Comment by Alex Zhuravlev [ 26/Oct/18 ]

last_rcvd seems to be corrupted. Please wait a few minutes; I'll try to reproduce locally and figure out a solution.

Comment by Zhenyu Xu [ 26/Oct/18 ]

Looks like the last_rcvd file of the OST target got corrupted. Can you back up the target as a fail-safe, mount the target as ldiskfs, manually delete the last_rcvd file, unmount it, and then try to remount the OST normally?

Comment by Alex Zhuravlev [ 26/Oct/18 ]

I'd suggest making a copy of last_rcvd first.
At least locally that worked fine: mount as ldiskfs, save a copy of last_rcvd, rm the original one, umount,
and mount as Lustre.
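
As a rough sketch of that sequence (the device path and scratch mount point below are examples, not the actual ones on this system):

# Workaround sketch: back up and remove last_rcvd, then remount as Lustre.
# Substitute the real device for the affected OST and a scratch mount point.
mount -t ldiskfs /dev/mapper/nbp13-OST000b /mnt/ldiskfs

# Keep the saved copy outside the target.
cp -a /mnt/ldiskfs/last_rcvd /root/last_rcvd.nbp13-OST000b.bak

# Remove the corrupted last_rcvd; the target recreates it on the next Lustre mount.
rm /mnt/ldiskfs/last_rcvd

umount /mnt/ldiskfs

# Mount as Lustre again.
mount -t lustre /dev/mapper/nbp13-OST000b /mnt/lustre/nbp13-OST000b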

Comment by Mahmoud Hanafi [ 26/Oct/18 ]

What do you mean by

"at least locally that worked fine"?

Just to verify:

1. Mount the OSTs that are having the issue as ldiskfs.

2. Copy last_rcvd.

3. rm last_rcvd from each OST.

4. umount the OST.

5. Remount as Lustre.

 

Comment by Mahmoud Hanafi [ 26/Oct/18 ]

A related issue is that some of the OSTs report the wrong free space.

Here nbp15_1-OST13 says it has 11TB used, but this was an unused OST.

 

nbp15-srv2 /mnt/lustre/nbp15_1-OST13 # du -sk
4311660 .

nbp15-srv2 /mnt/lustre/nbp15_1-OST13 # df -h
/dev/mapper/nbp15_1-OST13 72T 11T 62T 15% /mnt/lustre/nbp15_1-OST13

 

 nbp15-srv2 ~ # e2fsck -vf /dev/mapper/nbp15_1-OST13
e2fsck 1.42.13.wc6 (05-Feb-2017)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

       80857 inodes used (0.87%, out of 9337344)
           8 non-contiguous files (0.0%)
           0 non-contiguous directories (0.0%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 80847/2
  2703993403 blocks used (14.14%, out of 19122880512)
           0 bad blocks
           0 large files

       80709 regular files
         139 directories
           0 character device files
           0 block device files
           0 fifos
           0 links
           0 symbolic links (0 fast symbolic links)
           0 sockets
------------
       80848 files

 

 

Comment by Nathaniel Clark [ 26/Oct/18 ]

1. Mount the OSTs that are having the issue as ldiskfs.

2. Copy last_rcvd.

3. rm last_rcvd from each OST.

Just from the mounted OST that's having the issue.

4. umount the OST.

5. Remount as Lustre.

Correct.

 

With regard to free space: I would check lfs df from a client to see where Lustre shows the free space. If that doesn't clear things up, please open a separate ticket.
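
A minimal sketch of that check, assuming the filesystem is mounted on a client at /mnt/nbp15 (the mount point and OST index here are examples; adjust to the real ones):

# Per-OST space usage as Lustre sees it, from a client.
lfs df -h /mnt/nbp15
# Narrow it down to the OST in question by name.
lfs df -h /mnt/nbp15 | grep OST0013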

Comment by Mahmoud Hanafi [ 26/Oct/18 ]

This workaround worked. The priority can be lowered.

 

Comment by Andreas Dilger [ 26/Oct/18 ]

Also, have you run a full e2fsck after the RAID problems? If not, it would be good to do so and save the output (i.e. run it under "script" or similar).
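
One way to capture that, as a sketch (the device path and log file name are examples; script -c from util-linux runs a single command and records its output):

# Run the check under script so the full e2fsck output is saved to a log file.
script -c "e2fsck -fv /dev/mapper/nbp13-OST000b" /root/e2fsck-nbp13-OST000b.log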

Comment by Mahmoud Hanafi [ 30/Oct/18 ]

Yes, we did run a full fsck.

 

Please close the case.

 
