[LU-8228] replay-single test 87 fails with 'Restart of ost1 failed!' Created: 01/Jun/16 Updated: 22/Jul/18 Resolved: 22/Jul/18 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | autotest review-dne-part-2 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test_87 fails with 'Restart of ost1 failed!'

From the test_log, we see that after the failover of the OST, the mount has problems:

Failover ost1 to trevis-4vm3
10:12:00 (1464603120) waiting for trevis-4vm3 network 900 secs ...
10:12:00 (1464603120) network interface is UP
CMD: trevis-4vm3 hostname
mount facets: ost1
CMD: trevis-4vm3 test -b /dev/lvm-Role_OSS/P1
CMD: trevis-4vm3 e2label /dev/lvm-Role_OSS/P1
Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: trevis-4vm3 mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-Role_OSS/P1 /mnt/ost1
trevis-4vm3: mount.lustre: mount /dev/mapper/lvm--Role_OSS-P1 at /mnt/ost1 failed: No such file or directory
trevis-4vm3: Is the MGS specification correct?
trevis-4vm3: Is the filesystem name correct?
trevis-4vm3: If upgrading, is the copied client log valid? (see upgrade docs)
Start of /dev/lvm-Role_OSS/P1 on ost1 failed 2

This issue has occurred 4 times in the past two weeks. The logs are not consistent across the failures, but on two of them the OST console has:

18:17:53:[ 5498.685473] Lustre: Failing over lustre-OST0000
18:17:53:[ 5498.697452] Removing read-only on unknown block (0xfc00000)
18:17:53:[ 5498.707991] Lustre: server umount lustre-OST0000 complete
18:17:53:[ 5498.869364] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
18:17:53:[ 5499.538728] LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.2.4.216@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
18:17:53:[ 5499.548875] LustreError: Skipped 7 previous similar messages
18:17:53:[ 5509.256355] Lustre: DEBUG MARKER: hostname
18:17:53:[ 5509.645443] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_OSS/P1
18:17:53:[ 5509.955019] Lustre: DEBUG MARKER: mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-Role_OSS/P1 /mnt/ost1
18:17:53:[ 5510.316293] LDISKFS-fs (dm-0): recovery complete
18:17:53:[ 5510.334090] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache
18:17:53:[ 5510.375257] LustreError: 26044:0:(llog_osd.c:256:llog_osd_read_header()) lustre-OST0000-osd: bad log lustre-client [0xa:0x74:0x0] header magic: 0x2083300 (expected 0x10645539)
18:17:53:[ 5510.385449] LustreError: 26044:0:(mgc_request.c:1741:mgc_llog_local_copy()) MGC10.2.4.216@tcp: failed to copy remote log lustre-client: rc = -5
18:17:53:[ 5510.389682] LustreError: 13a-8: Failed to get MGS log lustre-client and no local copy.
18:17:53:[ 5510.391675] LustreError: 15c-8: MGC10.2.4.216@tcp: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
18:17:53:[ 5510.397556] LustreError: 26044:0:(obd_mount_server.c:1326:server_start_targets()) lustre-OST0000: failed to start LWP: -2
18:17:53:[ 5510.399960] LustreError: 26044:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -2
18:17:53:[ 5510.402275] Lustre: Failing over lustre-OST0000
18:17:53:[ 5510.447063] Lustre: server umount lustre-OST0000 complete
18:17:53:[ 5510.452435] LustreError: 26044:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount (-2)
18:17:53:[ 5510.807253] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_87: @@@@@@ FAIL: Restart of ost1 failed!

All failures are on review-dne-part-2. The following links to logs are all the failures seen to date: |
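For context on the console errors above: the "bad log ... header magic" message means the on-disk copy of the configuration llog failed its magic-number check, so the MGC abandons the local copy with rc = -5 (-EIO); with the MGS log also unavailable, the target start then fails with -2. Below is a minimal C sketch of that kind of validation, not the actual Lustre code: the rec_hdr struct and check_llog_header() helper are hypothetical simplifications, and only the two magic values are taken from the log above.

    #include <errno.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Magic expected in an llog header; value taken from the
     * "expected 0x10645539" text in the console log above. */
    #define LLOG_HDR_MAGIC 0x10645539

    /* Hypothetical, simplified record header; the real Lustre
     * structure (struct llog_rec_hdr) has a different layout. */
    struct rec_hdr {
            uint32_t lrh_len;
            uint32_t lrh_index;
            uint32_t lrh_type;  /* holds the magic for the header record */
    };

    /* Sketch of the magic check that produces the "bad log ... header
     * magic" error: a mismatch means the local copy of the config log
     * is unusable, and the caller sees -EIO (-5), matching the rc
     * reported by mgc_llog_local_copy() in the log above. */
    static int check_llog_header(const struct rec_hdr *hdr, const char *name)
    {
            if (hdr->lrh_type != LLOG_HDR_MAGIC) {
                    fprintf(stderr,
                            "%s: bad log header magic: %#" PRIx32
                            " (expected %#x)\n",
                            name, hdr->lrh_type, LLOG_HDR_MAGIC);
                    return -EIO;
            }
            return 0;
    }

    int main(void)
    {
            /* The corrupt magic actually reported in this failure. */
            struct rec_hdr bad = { 8192, 0, 0x2083300 };

            return check_llog_header(&bad, "lustre-client") == 0 ? 0 : 1;
    }

This matches the sequence in the console log: once the local copy is rejected, 13a-8 reports no usable log at all, server_start_targets() cannot start the LWP, and the mount aborts with -2, which is why mount.lustre reports "No such file or directory". |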
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 02/Jun/16 ] |
|
Hi Mike, can you please look at this failure when you have some time? Thanks. |
| Comment by Mikhail Pershin [ 22/Jul/18 ] |
|
outdated issue |