[LU-5347] replay-ost-single test_3: tee: standard output: Input/output error Created: 15/Jul/14 Updated: 27/Feb/17 Resolved: 27/Feb/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jian Yu | Assignee: | Jian Yu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre Build: https://build.hpdd.intel.com/job/lustre-master/2048/ |
| Severity: | 3 |
| Rank (Obsolete): | 14913 |
| Description |
|
replay-ost-single test 3 failed as follows:
== replay-ost-single test 3: Fail OST during write, with verification == 20:09:46 (1400123386)
Failing ost1 on shadow-49vm8
CMD: shadow-49vm8 grep -c /mnt/ost1' ' /proc/mounts
Stopping /mnt/ost1 (opts:) on shadow-49vm8
CMD: shadow-49vm8 umount -d /mnt/ost1
CMD: shadow-49vm8 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
reboot facets: ost1
Failover ost1 to shadow-49vm8
20:10:04 (1400123404) waiting for shadow-49vm8 network 900 secs ...
20:10:04 (1400123404) network interface is UP
CMD: shadow-49vm8 hostname
mount facets: ost1
CMD: shadow-49vm8 test -b /dev/lvm-Role_OSS/P1
Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: shadow-49vm8 mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: shadow-49vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super\" \"all -lnet -lnd -pinger\" 4
CMD: shadow-49vm8 e2label /dev/lvm-Role_OSS/P1 2>/dev/null
Started lustre-OST0000
CMD: shadow-49vm1,shadow-49vm2.shadow.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state_mount FULL osc.lustre-OST0000-osc-*.ost_server_uuid
shadow-49vm1: CMD: shadow-49vm1.shadow.whamcloud.com lctl get_param -n at_max
shadow-49vm2: CMD: shadow-49vm2.shadow.whamcloud.com lctl get_param -n at_max
shadow-49vm1: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 95 sec
tee: standard output: Input/output error
shadow-49vm2: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 105 sec
1280+0 records in
1280+0 records out
5242880 bytes (5.2 MB) copied, 125.153 s, 41.9 kB/s
tee: write error
replay-ost-single test_3: @@@@@@ FAIL: test_3 failed with 1

Debug log on client:
00000080:00200000:0.0:1400123511.721003:0:28098:0:(file.c:1229:ll_file_io_generic()) iot: 1, result: -5
00000100:00100000:1.0:1400123511.726097:0:20485:0:(client.c:1851:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_0:c890eb56-964e-2a0e-1f85-b5461794e3dc:0:1468135597178740:10.1.6.62@tcp:4
00000008:00080000:1.0:1400123511.726121:0:20486:0:(osc_request.c:1769:brw_interpret()) lustre-OST0000-osc-ffff88006b95d000: resend cross eviction for object: 0x0:15338, rc = -5.
00000100:00100000:1.0:1400123511.726335:0:20486:0:(client.c:1851:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:c890eb56-964e-2a0e-1f85-b5461794e3dc:0:1468135597149296:10.1.6.62@tcp:4
00000008:00080000:1.0:1400123511.726341:0:20486:0:(osc_request.c:1769:brw_interpret()) lustre-OST0000-osc-ffff88006b95d000: resend cross eviction for object: 0x0:15338, rc = -5.

Maloo report: http://testing.hpdd.intel.com/test_sets/1786067c-dc5a-11e3-8b61-52540035b04c |
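For context, the failing step boils down to writing through tee into a Lustre file while ost1 is failed over, then comparing the result against a local copy. The following is a minimal sketch of that scenario, not the actual replay-ost-single.sh test code; the /mnt/lustre mount point and the out-of-band failover step are assumptions (the real test uses its own test-framework helpers).

# Minimal sketch of the "fail OST during write, with verification" scenario.
# Assumptions: a Lustre client mounted at /mnt/lustre; ost1 is failed over out
# of band while the write is in flight (the real test uses framework helpers).
VERIFY=/tmp/verify-$$
FILE=/mnt/lustre/tfile-$$

# Write random data through tee so a local copy is kept for later verification.
dd if=/dev/urandom bs=4096 count=1280 | tee "$VERIFY" > "$FILE" &
ddpid=$!

# ... fail over ost1 here, while the dd/tee pipeline is still writing ...

# If the client is evicted during the failover, the write to $FILE returns
# -EIO (-5); that is the "tee: standard output: Input/output error" above and
# matches the "ll_file_io_generic()) iot: 1, result: -5" client debug line.
wait $ddpid || echo "write failed"
cmp "$VERIFY" "$FILE" || echo "verification failed"
rm -f "$VERIFY"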
| Comments |
| Comment by Jian Yu [ 15/Jul/14 ] |
|
More instances on Lustre master and b2_5 branches: The failures were vetted as |
| Comment by Andreas Dilger [ 15/Jul/14 ] |
|
I see in https://testing.hpdd.intel.com/test_logs/a1c7f374-07ff-11e4-8e5a-5254006e85c2/show_text
19:34:11:LustreError: 2177:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x2cb:0x0: rc = -2
19:34:11:LustreError: 2177:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 583 previous similar messages
19:34:11:LustreError: 2177:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x2cb:0x0: rc = -2
19:34:11:LustreError: 2177:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 303 previous similar messages

Di, I think this bug was already fixed on master? It had to do with incorrect decoding of osd_id or similar. I suspect that patch needs to be backported to b2_5. |
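As a side note, the repeated lvbo_init failures can be tallied per resource ID to see whether a single missing object accounts for all of them; a rough sketch, where the saved console.log path is an assumption:

# Count lvbo_init failures per resource in a saved console log (path is an assumption).
grep 'lvbo_init failed for resource' console.log |
	sed 's/.*resource \(0x[0-9a-fA-F]*:0x[0-9a-fA-F]*\).*/\1/' |
	sort | uniq -c | sort -rn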
| Comment by Di Wang [ 14/Aug/14 ] |
|
Andreas: I am not sure whether you mean this patch, http://review.whamcloud.com/#/c/5820/, but it has already landed on 2.4, so you may mean something else. Besides, this problem also exists on b2_6 and probably needs further investigation. |
| Comment by Jian Yu [ 15/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_6/2/ (2.6.0 RC2)
The same failure occurred: |
| Comment by Jian Yu [ 16/Aug/14 ] |
|
For https://testing.hpdd.intel.com/test_sets/48921180-22b8-11e4-94dd-5254006e85c2, after the OST was failed over, the console log showed:
00:38:44:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_OSS/P1 2>/dev/null
00:38:44:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x3aaf:0x0: rc = -2
00:40:19:Lustre: lustre-OST0000: Client ad3fb862-29c2-d5fd-4885-211ab6e3d1bd (at 10.2.4.94@tcp) reconnecting, waiting for 3 clients in recovery for 1:22
00:40:19:Lustre: Skipped 2 previous similar messages
00:40:20:Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
00:40:21:Lustre: Skipped 2 previous similar messages
00:40:21:Lustre: 13466:0:(service.c:1511:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=21, delay=0(jiff)
00:40:21:Lustre: 13466:0:(service.c:1511:ptlrpc_at_check_timed()) Skipped 2 previous similar messages
00:40:21:Lustre: 13466:0:(service.c:1308:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-1s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff8800410a4800 x1476204951412392/t0(0) o400->lustre-MDT0000-mdtlov_UUID@10.2.4.99@tcp:0/0 lens 224/0 e 283854 to 0 dl 1407829125 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
00:40:22:Lustre: 13466:0:(service.c:1308:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
00:40:24:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.2.4.99@tcp) reconnecting, waiting for 3 clients in recovery for 0:47
00:40:24:Lustre: Skipped 9 previous similar messages
00:40:24:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x3ab0:0x0: rc = -2
00:40:24:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 46 previous similar messages
00:40:25:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x3ab0:0x0: rc = -2
00:40:25:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 239 previous similar messages
00:40:25:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x3aaf:0x0: rc = -2
00:40:25:Lustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.2.4.99@tcp) reconnecting
00:40:26:Lustre: Skipped 19 previous similar messages
00:40:26:Lustre: 13028:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1407829163/real 0] req@ffff880019d12800 x1476204961866504/t0(0) o400->lustre-MDT0000-lwp-OST0002@10.2.4.99@tcp:12/10 lens 224/224 e 0 to 1 dl 1407829172 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
00:40:26:Lustre: 13028:0:(client.c:1926:ptlrpc_expire_one_request()) Skipped 65 previous similar messages
00:40:27:Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.2.4.99@tcp) was lost; in progress operations using this service will wait for recovery to complete
00:40:27:Lustre: Skipped 48 previous similar messages
00:40:27:LustreError: 166-1: MGC10.2.4.99@tcp: Connection to MGS (at 10.2.4.99@tcp) was lost; in progress operations using this service will fail
00:40:27:LustreError: Skipped 6 previous similar messages
00:40:28:Lustre: lustre-OST0001: deleting orphan objects from 0x0:14116 to 0x0:14209
00:40:28:Lustre: lustre-OST0003: deleting orphan objects from 0x0:14148 to 0x0:14273
00:40:29:Lustre: lustre-OST0004: deleting orphan objects from 0x0:14148 to 0x0:14273
00:40:29:Lustre: lustre-OST0002: deleting orphan objects from 0x0:14212 to 0x0:14305
00:40:29:Lustre: lustre-OST0005: deleting orphan objects from 0x0:14180 to 0x0:14305
00:40:30:Lustre: lustre-OST0006: deleting orphan objects from 0x0:14179 to 0x0:14305
00:40:30:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.2.4.99@tcp) reconnecting, waiting for 3 clients in recovery for 0:04
00:40:30:Lustre: Skipped 60 previous similar messages
00:40:31:Lustre: lustre-OST0000: recovery is timed out, evict stale exports
00:40:31:Lustre: lustre-OST0000: disconnecting 1 stale clients
00:40:31:LustreError: 17775:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 2097152 != fo_tot_granted 2523136
00:40:32:Lustre: 17775:0:(ldlm_lib.c:2071:target_recovery_thread()) too long recovery - read logs
00:40:32:LustreError: dumping log to /tmp/lustre-log.1407829195.17775

Unfortunately, the debug log /tmp/lustre-log.1407829195.17775 was not gathered. Still investigating. |
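For future occurrences, the binary dump named in the last console line can be converted to text with lctl debug_file and copied off the OSS before autotest cleanup removes it; a hedged sketch, where the destination host is an assumption:

# Convert the binary Lustre debug dump to readable text (lctl debug_file).
lctl debug_file /tmp/lustre-log.1407829195.17775 /tmp/lustre-log.1407829195.txt
# Copy it somewhere persistent before the node is cleaned up (destination is an assumption).
scp /tmp/lustre-log.1407829195.txt collector:/tmp/LU-5347/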
| Comment by Jian Yu [ 21/Aug/14 ] |
|
The failure has not occurred on the Lustre b2_5 branch since build #76. |
| Comment by Andreas Dilger [ 27/Feb/17 ] |
|
Close old bug. |