[LU-11574] LustreError: 157-3: Trying to start OBD nbp13-OST000b_UUID using the wrong disk Created: 26/Oct/18 Updated: 04/Sep/20 Resolved: 30/Oct/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Mahmoud Hanafi | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The filesystem is down. We had RAID backend issues and the OSS crashed. We ran fsck after the crash, but when trying to remount the OST we get the error below. I have tried running fsck several times.
[ 4993.782431] LustreError: 157-3: Trying to start OBD nbp13-OST000b_UUID using the wrong disk . Were the /dev/ assignments rearranged?
[ 4993.825963] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882dcd3a2f00[0x0, 1, [0x1:0x0:0x0] hash exist]{
[ 4993.864146] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882dcd3a2f50
[ 4993.896057] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882df011ba00osd-ldiskfs-object@ffff882df011ba00(i:ffff882b7af88958:78/2138703796)[plain]
[ 4993.947314] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff882dcd3a2f00
[ 4993.978176] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882dcd3a3140[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
[ 4994.018450] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882dcd3a3190
[ 4994.050361] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882eee0e8d00osd-ldiskfs-object@ffff882eee0e8d00(i:ffff882b7af6f2d0:77/2138703762)[plain]
[ 4994.101617] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff882dcd3a3140
[ 4994.137076] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882dcd3a2c00[0x0, 1, [0xa:0x0:0x0] hash exist]{
[ 4994.175261] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882dcd3a2c50
[ 4994.207171] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882eee0e9100osd-ldiskfs-object@ffff882eee0e9100(i:ffff882b7af90d90:79/2138703830)[plain]
[ 4994.258427] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff882dcd3a2c00
[ 4994.299979] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882dcd3a2e40[0x0, 1, [0x200000001:0x1017:0x0] hash exist]{
[ 4994.341038] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882dcd3a2e90
[ 4994.372949] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882df011a600osd-ldiskfs-object@ffff882df011a600(i:ffff882c098426e0:1090561/1004891530)[plain]
[ 4994.425511] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff882dcd3a2e40
[ 4994.464017] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff882bfaae7380[0x0, 1, [0xa:0x18:0x0] hash exist]{
[ 4994.502461] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff882bfaae73d0
[ 4994.534372] LustreError: 17122:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff882edbe28200osd-ldiskfs-object@ffff882edbe28200(i:ffff882b7af91600:80/2939569139)[plain]
[ 4994.571667] Lustre: nbp13-OST000b: Not available for connect from 10.151.25.231@o2ib (not set up)
|
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 26/Oct/18 ] |
|
We have engineers looking into the issue and will provide an update ASAP. |
| Comment by Alex Zhuravlev [ 26/Oct/18 ] |
|
last_rcvd seems to be corrupted. Please wait a few minutes; I'll try to reproduce this locally and figure out a solution. |
| Comment by Zhenyu Xu [ 26/Oct/18 ] |
|
It looks like the last_rcvd file of the OST target got corrupted. Can you back up the target as a fail-safe, mount the target as ldiskfs, manually delete the last_rcvd file, unmount it, and then try to remount the OST normally again? |
| Comment by Alex Zhuravlev [ 26/Oct/18 ] |
|
I'd suggest making a copy of last_rcvd .. |
| Comment by Mahmoud Hanafi [ 26/Oct/18 ] |
|
What do you mean by "at least locally that worked fine"? Just to verify:
1. Mount the OSTs that are having the issue as ldiskfs.
2. Copy last_rcvd.
3. rm last_rcvd from each OST.
4. Unmount the OST.
5. Remount as Lustre.
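A minimal sketch of those steps in shell, assuming /dev/mapper/nbp13-OST000b as the OST device and /mnt/ldiskfs as a scratch mount point (both placeholders, not paths taken from this ticket); repeat per affected OST and keep the backup until the filesystem is confirmed healthy:
# 1. mount the OST backend as ldiskfs
mkdir -p /mnt/ldiskfs
mount -t ldiskfs /dev/mapper/nbp13-OST000b /mnt/ldiskfs
# 2. keep a copy of last_rcvd before touching it
cp -a /mnt/ldiskfs/last_rcvd /root/last_rcvd.nbp13-OST000b.bak
# 3. remove the (presumed corrupted) last_rcvd
rm /mnt/ldiskfs/last_rcvd
# 4. unmount the ldiskfs mount
umount /mnt/ldiskfs
# 5. remount as Lustre; the target should recreate last_rcvd on mount
mount -t lustre /dev/mapper/nbp13-OST000b /mnt/nbp13-OST000b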
|
| Comment by Mahmoud Hanafi [ 26/Oct/18 ] |
|
A related issue is that some of the OSTs report the wrong free space. For example, nbp15_1-OST13 shows 11 TB used, but this was an unused OST.
nbp15-srv2 /mnt/lustre/nbp15_1-OST13 # du -sk
nbp15-srv2 /mnt/lustre/nbp15_1-OST13 # df -h
nbp15-srv2 ~ # e2fsck -vf /dev/mapper/nbp15_1-OST13
e2fsck 1.42.13.wc6 (05-Feb-2017)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
80857 inodes used (0.87%, out of 9337344)
8 non-contiguous files (0.0%)
0 non-contiguous directories (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 80847/2
2703993403 blocks used (14.14%, out of 19122880512)
0 bad blocks
0 large files
80709 regular files
139 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
------------
80848 files
|
| Comment by Nathaniel Clark [ 26/Oct/18 ] |
|
"Just from the mounted OST that's having the issue"
Correct.
With regard to free space, I would check lfs df from a client to see where Lustre shows the free space. If that doesn't clear things up, please open a separate ticket. |
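A minimal sketch of that check from a client, assuming the filesystem is mounted at /nbp15 and that the target shows up as OST0013 in lfs output (both assumptions):
# per-OST space usage as Lustre reports it
lfs df -h /nbp15
# per-OST inode usage
lfs df -i -h /nbp15
# just the OST in question
lfs df -h /nbp15 | grep -i ost0013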
| Comment by Mahmoud Hanafi [ 26/Oct/18 ] |
|
This workaround worked. The priority can be lowered.
|
| Comment by Andreas Dilger [ 26/Oct/18 ] |
|
Also, have you run a full e2fsck after the RAID problems? If not, it would be good to run one and save the output (e.g. run it under "script" or similar). |
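A minimal sketch of capturing such a run, assuming /dev/mapper/nbp13-OST000b as a placeholder device and that the Lustre-patched e2fsck is in PATH; a read-only pass first, then the repair run, each logged via script:
# dry run, no changes, output captured to a log file
script -c "e2fsck -fn /dev/mapper/nbp13-OST000b" /root/e2fsck-OST000b-readonly.log
# actual repair run, full output saved for later review
script -c "e2fsck -fy /dev/mapper/nbp13-OST000b" /root/e2fsck-OST000b-repair.log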
| Comment by Mahmoud Hanafi [ 30/Oct/18 ] |
|
Yes, we did run a full fsck.
Please close the case.
|