[LU-2554] replay-single test 80b: dd: opening `/mnt/lustre/f80b': Input/output error Created: 31/Dec/12  Updated: 14/Aug/16  Resolved: 14/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Lustre Branch: b1_8
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/236/
Distro/Arch: RHEL5.8/x86_64 (kernel version: 2.6.18-308.11.1.el5)
Network: TCP (1GigE)
Test Group: failover

ENABLE_QUOTA=yes
FAILURE_MODE=HARD

MGS/MDS Nodes: client-27vm3 (active), client-27vm7 (passive)
               1 combined MGS/MDT shared by the failover pair

OSS Nodes: client-27vm4 (active), client-27vm8 (active)


Severity: 3
Rank (Obsolete): 5978

 Description   

replay-single test 80b failed as follows:

== replay-single test 80b: write replay with changed data (checksum resend) ========================== 03:49:14 (1356781754)
CMD: client-27vm4 lctl get_param obdfilter.lustre-OST0000.sync_journal
obdfilter.lustre-OST0000.sync_journal=1
CMD: client-27vm4 lctl set_param -n obdfilter.lustre-OST0000.sync_journal 0
CMD: client-27vm4 sync
Filesystem           1K-blocks      Used Available Use% Mounted on
client-27vm3:client-27vm7:/lustre
                      36535940   1731756  32947784   5% /mnt/lustre
CMD: client-27vm4 /usr/sbin/lctl --device %lustre-OST0000 notransno
CMD: client-27vm4 /usr/sbin/lctl --device %lustre-OST0000 readonly
CMD: client-27vm4 /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
error on ioctl 0x4008669a for '/mnt/lustre/f80b' (3): Input/output error
error: setstripe: create stripe file '/mnt/lustre/f80b' failed
dd: opening `/mnt/lustre/f80b': Input/output error
 replay-single test_80b: @@@@@@ FAIL: Cannot write

Syslog on MDS node client-27vm7 showed:

Dec 29 03:49:16 client-27vm7 kernel: Lustre: DEBUG MARKER: == replay-single test 80b: write replay with changed data (checksum resend) ========================== 03:49:14 (1356781754)
Dec 29 03:49:16 client-27vm7 rshd[3181]: pam_unix(rsh:session): session closed for user root
Dec 29 03:49:16 client-27vm7 xinetd[2057]: EXIT: shell status=0 pid=3181 duration=0(sec)
Dec 29 03:49:32 client-27vm7 kernel: LustreError: 2442:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x30003c sub-object on OST idx 0/1: rc = -11
Dec 29 03:50:18 client-27vm7 kernel: Lustre: Service thread pid 2526 was inactive for 56.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 29 03:50:19 client-27vm7 kernel: Pid: 2526, comm: ll_mdt_02
Dec 29 03:50:19 client-27vm7 kernel:
Dec 29 03:50:19 client-27vm7 kernel: Call Trace:
Dec 29 03:50:19 client-27vm7 kernel:  [<ffffffff88921220>] lustre_pack_request+0x630/0x6f0 [ptlrpc]
Dec 29 03:50:19 client-27vm7 kernel:  [<ffffffff8006389f>] schedule_timeout+0x8a/0xad
Dec 29 03:50:19 client-27vm7 kernel:  [<ffffffff8009a41d>] process_timeout+0x0/0x5 
Dec 29 03:50:19 client-27vm7 kernel:  [<ffffffff889e7695>] osc_create+0xc75/0x13d0 [osc]
Dec 29 03:50:19 client-27vm7 kernel:  [<ffffffff8008ee84>] default_wake_function+0x0/0xe 
Dec 29 03:50:19 client-27vm7 kernel:  [<ffffffff88a96edb>] qos_remedy_create+0x45b/0x570 [lov]
Dec 29 03:50:19 client-27vm7 kernel:  [<ffffffff8002deea>] __wake_up+0x38/0x4f
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff8008e67d>] dequeue_task+0x18/0x37
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88a90df3>] lov_fini_create_set+0x243/0x11e0 [lov]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88a84b72>] lov_create+0x1552/0x1860 [lov]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88a857a8>] lov_iocontrol+0x928/0xf0f [lov]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff8008ee84>] default_wake_function+0x0/0xe 
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88c72b21>] mds_finish_open+0x1fa1/0x4370 [mds]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff80009860>] __d_lookup+0xb0/0xff
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff8000d543>] dput+0x2c/0x114
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88c52fad>] mds_verify_child+0x2dd/0x870 [mds]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff888f59a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88c79d41>] mds_open+0x2f01/0x386b [mds]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff887bacfd>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff888d886c>] _ldlm_lock_debug+0x57c/0x6e0 [ptlrpc]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff8891f5f1>] lustre_swab_buf+0x81/0x170 [ptlrpc]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff8000d543>] dput+0x2c/0x114
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88c500a5>] mds_reint_rec+0x365/0x550 [mds]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88c7ac6e>] mds_update_unpack+0x1fe/0x280 [mds]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88c42eda>] mds_reint+0x35a/0x420 [mds]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88c41dea>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
Dec 29 03:50:20 client-27vm7 kernel:  [<ffffffff88c4cbee>] mds_intent_policy+0x49e/0xc10 [mds]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff888e0270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff888ddeb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff888da7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff88902870>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff888ffb39>] ldlm_handle_enqueue+0xc09/0x1210 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff88c4bb2e>] mds_handle+0x40ce/0x4cf0 [mds]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff887b7868>] libcfs_ip_addr2str+0x38/0x40 [libcfs]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff887b7c7e>] libcfs_nid2str+0xbe/0x110 [libcfs]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff8892aaf5>] ptlrpc_server_log_handling_request+0x105/0x130 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff8892d874>] ptlrpc_server_handle_request+0x984/0xe00 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff8892dfd5>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff8008d2a9>] __wake_up_common+0x3e/0x68
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff8892ef16>] ptlrpc_main+0xf16/0x10e0 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff8892e000>] ptlrpc_main+0x0/0x10e0 [ptlrpc]
Dec 29 03:50:21 client-27vm7 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 29 03:50:21 client-27vm7 kernel:
Dec 29 03:50:21 client-27vm7 kernel: LustreError: dumping log to /tmp/lustre-log.1356781818.2526
Dec 29 03:50:32 client-27vm7 kernel: LustreError: 2526:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x30003c sub-object on OST idx 0/1: rc = -5
Dec 29 03:50:32 client-27vm7 kernel: LustreError: 2526:0:(mds_open.c:440:mds_create_objects()) error creating objects for inode 3145788: rc = -5
Dec 29 03:50:32 client-27vm7 kernel: LustreError: 2526:0:(mds_open.c:825:mds_finish_open()) mds_create_objects: rc = -5
Dec 29 03:50:32 client-27vm7 kernel: Lustre: Service thread pid 2526 completed after 69.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Dec 29 03:50:42 client-27vm7 kernel: LustreError: 2442:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x30003c sub-object on OST idx 1/1: rc = -11
Dec 29 03:51:42 client-27vm7 kernel: LustreError: 2532:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x30003c sub-object on OST idx 1/1: rc = -5
Dec 29 03:51:42 client-27vm7 kernel: LustreError: 2532:0:(mds_open.c:440:mds_create_objects()) error creating objects for inode 3145788: rc = -5
Dec 29 03:51:42 client-27vm7 kernel: LustreError: 2532:0:(mds_open.c:825:mds_finish_open()) mds_create_objects: rc = -5
Dec 29 03:51:42 client-27vm7 rshd[3223]: root@client-27vm1.lab.whamcloud.com as root: cmd='/usr/sbin/lctl mark "/usr/sbin/lctl mark  replay-single test_80b: @@@@@@ FAIL: Cannot write ";echo XXRETCODE:$?'
Dec 29 03:51:42 client-27vm7 kernel: Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_80b: @@@@@@ FAIL: Cannot write
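For reference, the rc values in the LustreError messages above are negated Linux errno codes; a minimal decode using only the standard library (nothing Lustre-specific is assumed):

```python
import errno
import os

# "rc = -11" and "rc = -5" in the Lustre log messages are negated
# Linux errno codes; map them to their symbolic names and descriptions.
for rc in (-11, -5):
    print(f"rc = {rc}: {errno.errorcode[-rc]} ({os.strerror(-rc)})")
```

rc = -11 (EAGAIN) is the transient failure on the first create attempts, and rc = -5 (EIO) is the final error that surfaced on the client as the `dd: opening `/mnt/lustre/f80b': Input/output error` above.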

Maloo report: https://maloo.whamcloud.com/test_sets/56e3c084-51b9-11e2-a904-52540035b04c

Tests 81a, 81b, 82, and 83 also failed with the same issue.



 Comments   
Comment by Jian Yu [ 03/Jan/13 ]

replay-single tests 80b, 81a, 81b, 82, and 83 passed in another failover test run on the same Lustre b1_8 build #236:
https://maloo.whamcloud.com/test_sets/f9b78b36-5503-11e2-9b6a-52540035b04c

Comment by Jian Yu [ 09/Jan/13 ]

One more instance:

Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/238
https://maloo.whamcloud.com/test_sets/e19843b4-5a2c-11e2-bcf5-52540035b04c

Comment by James A Simmons [ 14/Aug/16 ]

Old blocker against an unsupported version.

Generated at Sat Feb 10 01:26:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.