[LU-5347] replay-ost-single test_3: tee: standard output: Input/output error Created: 15/Jul/14  Updated: 27/Feb/17  Resolved: 27/Feb/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Jian Yu
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre Build: https://build.hpdd.intel.com/job/lustre-master/2048/
Distro/Arch: RHEL6.5/x86_64


Severity: 3
Rank (Obsolete): 14913

 Description   

replay-ost-single test 3 failed as follows:

== replay-ost-single test 3: Fail OST during write, with verification == 20:09:46 (1400123386)
Failing ost1 on shadow-49vm8
CMD: shadow-49vm8 grep -c /mnt/ost1' ' /proc/mounts
Stopping /mnt/ost1 (opts:) on shadow-49vm8
CMD: shadow-49vm8 umount -d /mnt/ost1
CMD: shadow-49vm8 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
reboot facets: ost1
Failover ost1 to shadow-49vm8
20:10:04 (1400123404) waiting for shadow-49vm8 network 900 secs ...
20:10:04 (1400123404) network interface is UP
CMD: shadow-49vm8 hostname
mount facets: ost1
CMD: shadow-49vm8 test -b /dev/lvm-Role_OSS/P1
Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: shadow-49vm8 mkdir -p /mnt/ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/ost1
CMD: shadow-49vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super\" \"all -lnet -lnd -pinger\" 4 
CMD: shadow-49vm8 e2label /dev/lvm-Role_OSS/P1 2>/dev/null
Started lustre-OST0000
CMD: shadow-49vm1,shadow-49vm2.shadow.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state_mount FULL osc.lustre-OST0000-osc-*.ost_server_uuid 
shadow-49vm1: CMD: shadow-49vm1.shadow.whamcloud.com lctl get_param -n at_max
shadow-49vm2: CMD: shadow-49vm2.shadow.whamcloud.com lctl get_param -n at_max
shadow-49vm1: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 95 sec
tee: standard output: Input/output error
shadow-49vm2: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 105 sec
1280+0 records in
1280+0 records out
5242880 bytes (5.2 MB) copied, 125.153 s, 41.9 kB/s
tee: write error
 replay-ost-single test_3: @@@@@@ FAIL: test_3 failed with 1 
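
For context, test_3 drives a dd | tee pipeline into the Lustre file system while ost1 is failed over, then verifies the written data after recovery. The "tee: standard output: Input/output error" above is that pipeline surfacing an -EIO returned by the write to the Lustre file. A minimal sketch of the scenario (MOUNT, the file names and the use of the fail helper are illustrative, not the actual replay-ost-single.sh code):

#!/bin/bash
# Sketch: fail an OST in the middle of a write, then verify the data.
MOUNT=${MOUNT:-/mnt/lustre}
verify=/tmp/verify-$$

# Write 5 MB through tee: one copy to a local verification file,
# the other (tee's stdout) to the Lustre file under test.
dd if=/dev/urandom bs=4096 count=1280 | tee $verify > $MOUNT/f3.$$ &
write_pid=$!

fail ost1   # test-framework helper: unmount, "reboot" and remount ost1

# If the client is evicted during recovery, the write to $MOUNT/f3.$$
# returns EIO and tee prints "standard output: Input/output error".
wait $write_pid || echo "write failed: $?"
cmp $verify $MOUNT/f3.$$ || echo "verification failed"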

Debug log on client:

00000080:00200000:0.0:1400123511.721003:0:28098:0:(file.c:1229:ll_file_io_generic()) iot: 1, result: -5
00000100:00100000:1.0:1400123511.726097:0:20485:0:(client.c:1851:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_0:c890eb56-964e-2a0e-1f85-b5461794e3dc:0:1468135597178740:10.1.6.62@tcp:4
00000008:00080000:1.0:1400123511.726121:0:20486:0:(osc_request.c:1769:brw_interpret()) lustre-OST0000-osc-ffff88006b95d000: resend cross eviction for object: 0x0:15338, rc = -5.
00000100:00100000:1.0:1400123511.726335:0:20486:0:(client.c:1851:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_1:c890eb56-964e-2a0e-1f85-b5461794e3dc:0:1468135597149296:10.1.6.62@tcp:4
00000008:00080000:1.0:1400123511.726341:0:20486:0:(osc_request.c:1769:brw_interpret()) lustre-OST0000-osc-ffff88006b95d000: resend cross eviction for object: 0x0:15338, rc = -5.
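
The result -5 in ll_file_io_generic() is -EIO, and the "resend cross eviction" messages indicate that the OSC was evicted by the recovering OST, so the resent cached pages were discarded and the write completed with -EIO; that is the error tee reported. Assuming the standard lctl parameters (output formats vary between Lustre versions), one way to confirm such an eviction on the client would be:

# import state history for the OST in question
lctl get_param osc.lustre-OST0000-osc-*.state
# connection target, as used by wait_import_state_mount in the log above
lctl get_param -n osc.lustre-OST0000-osc-*.ost_server_uuid
# client console messages about the eviction
dmesg | grep -i evict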

Maloo report: http://testing.hpdd.intel.com/test_sets/1786067c-dc5a-11e3-8b61-52540035b04c



 Comments   
Comment by Jian Yu [ 15/Jul/14 ]

More instances on Lustre master and b2_5 branches:
https://testing.hpdd.intel.com/test_sets/53a5c16c-07ff-11e4-8e5a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/1576f466-edaf-11e3-a130-5254006e85c2
https://testing.hpdd.intel.com/test_sets/128130c2-f86e-11e3-8846-5254006e85c2

The failures were vetted as LU-2285; however, they do not appear to be the same issue.

Comment by Andreas Dilger [ 15/Jul/14 ]

I see in https://testing.hpdd.intel.com/test_logs/a1c7f374-07ff-11e4-8e5a-5254006e85c2/show_text

19:34:11:LustreError: 2177:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x2cb:0x0: rc = -2
19:34:11:LustreError: 2177:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 583 previous similar messages
19:34:11:LustreError: 2177:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x2cb:0x0: rc = -2
19:34:11:LustreError: 2177:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 303 previous similar messages

Di, I think this bug was already fixed on master? It had to do with incorrect decoding of osd_id or similar. I suspect that patch needs to be backported to b2_5.

Comment by Di Wang [ 14/Aug/14 ]

Andreas: I am not sure whether you mean this patch http://review.whamcloud.com/#/c/5820/, but it already landed in 2.4, so you may mean something else. Besides, this problem also exists on b2_6, so it probably needs further investigation.

Comment by Jian Yu [ 15/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_6/2/ (2.6.0 RC2)
Distro/Arch: SLES11SP3/x86_64 (client), RHEL6.5/x86_64 (server)

The same failure occurred:
https://testing.hpdd.intel.com/test_sets/48921180-22b8-11e4-94dd-5254006e85c2
https://testing.hpdd.intel.com/test_sets/11666c4e-2063-11e4-9d86-5254006e85c2
https://testing.hpdd.intel.com/test_sets/58d0aa7e-1422-11e4-88ed-5254006e85c2

Comment by Jian Yu [ 16/Aug/14 ]

For https://testing.hpdd.intel.com/test_sets/48921180-22b8-11e4-94dd-5254006e85c2, after the OST was failed over, the console log showed:

00:38:44:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_OSS/P1 2>/dev/null
00:38:44:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x3aaf:0x0: rc = -2
00:40:19:Lustre: lustre-OST0000: Client ad3fb862-29c2-d5fd-4885-211ab6e3d1bd (at 10.2.4.94@tcp) reconnecting, waiting for 3 clients in recovery for 1:22
00:40:19:Lustre: Skipped 2 previous similar messages
00:40:20:Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
00:40:21:Lustre: Skipped 2 previous similar messages
00:40:21:Lustre: 13466:0:(service.c:1511:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=21, delay=0(jiff)
00:40:21:Lustre: 13466:0:(service.c:1511:ptlrpc_at_check_timed()) Skipped 2 previous similar messages
00:40:21:Lustre: 13466:0:(service.c:1308:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-1s), not sending early reply. Consider increasing at_early_margin (5)?  req@ffff8800410a4800 x1476204951412392/t0(0) o400->lustre-MDT0000-mdtlov_UUID@10.2.4.99@tcp:0/0 lens 224/0 e 283854 to 0 dl 1407829125 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
00:40:22:Lustre: 13466:0:(service.c:1308:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
00:40:24:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.2.4.99@tcp) reconnecting, waiting for 3 clients in recovery for 0:47
00:40:24:Lustre: Skipped 9 previous similar messages
00:40:24:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x3ab0:0x0: rc = -2
00:40:24:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 46 previous similar messages
00:40:25:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x3ab0:0x0: rc = -2
00:40:25:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) Skipped 239 previous similar messages
00:40:25:LustreError: 17775:0:(ldlm_resource.c:1150:ldlm_resource_get()) lustre-OST0000: lvbo_init failed for resource 0x3aaf:0x0: rc = -2
00:40:25:Lustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.2.4.99@tcp) reconnecting
00:40:26:Lustre: Skipped 19 previous similar messages
00:40:26:Lustre: 13028:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1407829163/real 0]  req@ffff880019d12800 x1476204961866504/t0(0) o400->lustre-MDT0000-lwp-OST0002@10.2.4.99@tcp:12/10 lens 224/224 e 0 to 1 dl 1407829172 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
00:40:26:Lustre: 13028:0:(client.c:1926:ptlrpc_expire_one_request()) Skipped 65 previous similar messages
00:40:27:Lustre: lustre-MDT0000-lwp-OST0002: Connection to lustre-MDT0000 (at 10.2.4.99@tcp) was lost; in progress operations using this service will wait for recovery to complete
00:40:27:Lustre: Skipped 48 previous similar messages
00:40:27:LustreError: 166-1: MGC10.2.4.99@tcp: Connection to MGS (at 10.2.4.99@tcp) was lost; in progress operations using this service will fail
00:40:27:LustreError: Skipped 6 previous similar messages
00:40:28:Lustre: lustre-OST0001: deleting orphan objects from 0x0:14116 to 0x0:14209
00:40:28:Lustre: lustre-OST0003: deleting orphan objects from 0x0:14148 to 0x0:14273
00:40:29:Lustre: lustre-OST0004: deleting orphan objects from 0x0:14148 to 0x0:14273
00:40:29:Lustre: lustre-OST0002: deleting orphan objects from 0x0:14212 to 0x0:14305
00:40:29:Lustre: lustre-OST0005: deleting orphan objects from 0x0:14180 to 0x0:14305
00:40:30:Lustre: lustre-OST0006: deleting orphan objects from 0x0:14179 to 0x0:14305
00:40:30:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.2.4.99@tcp) reconnecting, waiting for 3 clients in recovery for 0:04
00:40:30:Lustre: Skipped 60 previous similar messages
00:40:31:Lustre: lustre-OST0000: recovery is timed out, evict stale exports
00:40:31:Lustre: lustre-OST0000: disconnecting 1 stale clients
00:40:31:LustreError: 17775:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 2097152 != fo_tot_granted 2523136
00:40:32:Lustre: 17775:0:(ldlm_lib.c:2071:target_recovery_thread()) too long recovery - read logs
00:40:32:LustreError: dumping log to /tmp/lustre-log.1407829195.17775

Unfortunately, the debug log /tmp/lustre-log.1407829195.17775 was not gathered. Still investigating.
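
Had the dump been collected, the binary file named in the console message could have been converted to text with lctl debug_file (a standard lctl subcommand; the path below is taken from the message above):

# on the OSS that produced the dump
lctl debug_file /tmp/lustre-log.1407829195.17775 /tmp/lustre-log.1407829195.17775.txt
grep -E 'lvbo_init|ofd_grant_sanity_check' /tmp/lustre-log.1407829195.17775.txt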

Comment by Jian Yu [ 21/Aug/14 ]

The failure has not occurred on the Lustre b2_5 branch since build #76.

Comment by Andreas Dilger [ 27/Feb/17 ]

Close old bug.
