[LU-1751] Test failure on test suite ost-pools, subtest test_23b Created: 15/Aug/12  Updated: 20/Aug/12  Resolved: 20/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5399

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b5cb7efc-e4ea-11e1-9681-52540035b04c.

The sub-test test_23b failed with the following error:

test failed to respond and timed out

00:23:57:Lustre: DEBUG MARKER: == ost-pools test 23b: OST pools and OOS == 00:23:56 (1344756236)
00:24:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark OSTCOUNT=7, OSTSIZE=2147483648
00:24:15:Lustre: DEBUG MARKER: OSTCOUNT=7, OSTSIZE=2147483648
00:24:15:Lustre: DEBUG MARKER: /usr/sbin/lctl mark MAXFREE=14000000, AVAIL=320004, SLOW=yes
00:24:15:Lustre: DEBUG MARKER: MAXFREE=14000000, AVAIL=320004, SLOW=yes
00:31:48:INFO: task jbd2/dm-3-8:18075 blocked for more than 120 seconds.
00:31:48:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00:31:48:jbd2/dm-3-8   D 0000000000000000     0 18075      2 0x00000080
00:31:48: ffff88005b6cdd20 0000000000000046 ffff88006c2e1540 0000000000000000
00:31:48: ffff88005aa0e400 ffff880002216728 00000000000002f3 ffff88005f9f0ae0
00:31:48: ffff88005f9f1098 ffff88005b6cdfd8 000000000000fb88 ffff88005f9f1098
00:31:48:Call Trace:
00:31:48: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
00:31:48: [<ffffffffa03a493f>] jbd2_journal_commit_transaction+0x19f/0x1530 [jbd2]
00:31:48: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
00:31:48: [<ffffffff8107e00c>] ? lock_timer_base+0x3c/0x70
00:31:48: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
00:31:48: [<ffffffffa03ab128>] kjournald2+0xb8/0x220 [jbd2]
00:31:48: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
00:31:48: [<ffffffffa03ab070>] ? kjournald2+0x0/0x220 [jbd2]
00:31:48: [<ffffffff81091d66>] kthread+0x96/0xa0
00:31:48: [<ffffffff8100c14a>] child_rip+0xa/0x20
00:31:48: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
00:31:49: [<ffffffff8100c140>] ? child_rip+0x0/0x20
00:31:49:INFO: task ll_ost_io00_004:25634 blocked for more than 120 seconds.
00:31:49:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00:31:49:ll_ost_io00_0 D 0000000000000000     0 25634      2 0x00000080
00:31:49: ffff8800403f97a0 0000000000000046 ffff8800403f9710 ffffffff810920e6
00:31:49: ffff8800403f97c0 ffff880073d3f8b8 ffff8800403f9760 ffffffff8104e309
00:31:49: ffff88006c2e1af8 ffff8800403f9fd8 000000000000fb88 ffff88006c2e1af8
00:31:49:Call Trace:
00:31:49: [<ffffffff810920e6>] ? autoremove_wake_function+0x16/0x40
00:31:49: [<ffffffff8104e309>] ? __wake_up_common+0x59/0x90
00:31:49: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
00:31:49: [<ffffffffa03aad85>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
00:31:49: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
00:31:49: [<ffffffffa061f48d>] fsfilt_ldiskfs_commit_wait+0x6d/0xf0 [fsfilt_ldiskfs]
00:31:49: [<ffffffffa0aac23d>] filter_preprw_write+0xc4d/0x22f0 [obdfilter]
00:31:49: [<ffffffffa03ccedb>] ? lnet_ni_send+0x4b/0x110 [lnet]
00:31:49: [<ffffffffa06cea3b>] ? null_alloc_rs+0x1ab/0x3b0 [ptlrpc]
00:31:50: [<ffffffffa06bbbf4>] ? sptlrpc_svc_alloc_rs+0x74/0x2d0 [ptlrpc]
00:31:50: [<ffffffffa0aae6e0>] filter_preprw+0x80/0xa0 [obdfilter]
00:31:50: [<ffffffffa04c681c>] obd_preprw+0x12c/0x3d0 [ost]
00:31:50: [<ffffffffa04cd98a>] ost_brw_write+0x87a/0x1600 [ost]
00:31:50: [<ffffffff8127cea6>] ? vsnprintf+0x2b6/0x5f0
00:31:50: [<ffffffffa068d02c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
00:31:50: [<ffffffffa068d188>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
00:31:50: [<ffffffffa04d402c>] ost_handle+0x360c/0x4850 [ost]
00:31:50: [<ffffffffa0db3521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
00:31:50: [<ffffffffa0daf324>] ? libcfs_id2str+0x74/0xb0 [libcfs]
00:31:50: [<ffffffffa069c82d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
00:31:50: [<ffffffffa0da365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
00:31:50: [<ffffffffa0693cb7>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
00:31:50: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
00:31:50: [<ffffffffa069de19>] ptlrpc_main+0xb59/0x1860 [ptlrpc]
00:31:51: [<ffffffffa069d2c0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
00:31:51: [<ffffffff8100c14a>] child_rip+0xa/0x20
00:31:51: [<ffffffffa069d2c0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
00:31:52: [<ffffffffa069d2c0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
00:31:52: [<ffffffff8100c140>] ? child_rip+0x0/0x20


 Comments   
Comment by Keith Mannthey (Inactive) [ 15/Aug/12 ]

The following is from the client dmesg during the previous test, "replay-vbr test 4f: setattr of permission checks versions":

LustreError: 11-0: an error occurred while communicating with 10.10.2.226@tcp. The obd_ping operation failed with -107
LustreError: Skipped 2 previous similar messages
LustreError: 166-1: MGC10.10.2.226@tcp: Connection to MGS (at 10.10.2.226@tcp) was lost; in progress operations using this service will fail
LustreError: Skipped 2 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff88007a1fcc00: Connection to lustre-MDT0000 (at 10.10.2.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 5 previous similar messages
Lustre: Evicted from MGS (at 10.10.2.226@tcp) after server handle changed from 0xccf4b5b0e3d0559d to 0xccf4b5b0e3d059da
Lustre: Skipped 2 previous similar messages
Lustre: MGC10.10.2.226@tcp: Reactivating import
Lustre: 24498:0:(client.c:2708:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
  req@ffff880074e52000 x1410038284518012/t30064771088(30064771088) o36->lustre-MDT0000-mdc-ffff88007a1fcc00@10.10.2.226@tcp:12/10 lens 488/424 e 6 to 0 dl 1344732209 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: 24498:0:(import.c:1174:completed_replay_interpret()) lustre-MDT0000-mdc-ffff88007a1fcc00: version recovery fails, reconnecting
LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.

It appears the MDS reset during replay-vbr test 4f, but the client was not quite able to make a second good connection and was evicted by lustre-MDT0000.

I am a little confused by the logs. The client dmesg ends while running replay-vbr test 4[fg], but this bug and many of the other logs are about sub-test test_23b (the Lustre pool tests). Can we verify we have all the right logs?

Comment by Peter Jones [ 19/Aug/12 ]

Hongchao

Could you please look into this one? Recent testing on master suggests that something has broken recently for this test: https://maloo.whamcloud.com/reports/show_pass_rate_report?in_last_n_days=28&source_code_branch=d776127c-4096-11e1-9cbd-5254004bbbd3

Thanks

Peter

Comment by Hongchao Zhang [ 20/Aug/12 ]

This does look like another side effect of the bug in the patch for LU-657, which uses the wrong transaction id (tid) to wait for the journal commit. The corresponding patch is tracked at http://review.whamcloud.com/#change,3692
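
For reference, here is a minimal sketch of the commit-wait pattern that fsfilt_ldiskfs_commit_wait() relies on, assuming the jbd2 API of 2.6.32-era kernels; wait_for_commit() is a hypothetical helper, not Lustre code. The caller records the tid of the transaction its update ran in, then blocks in jbd2_log_wait_commit() until kjournald2 has committed that tid. If the recorded tid is wrong (e.g. one that will never become the committing transaction), the waiter sleeps indefinitely, which matches the ll_ost_io hang in the stack trace above.

    /* Minimal sketch, assuming the jbd2 API of 2.6.32-era kernels;
     * wait_for_commit() is a hypothetical helper, not Lustre code. */
    #include <linux/jbd2.h>

    static int wait_for_commit(journal_t *journal, handle_t *handle)
    {
            /* tid of the running transaction this handle is part of;
             * it must be read before the handle is released */
            tid_t tid = handle->h_transaction->t_tid;
            int rc;

            rc = jbd2_journal_stop(handle);
            if (rc)
                    return rc;

            /* ask kjournald2 to commit that transaction ... */
            jbd2_log_start_commit(journal, tid);

            /* ... and sleep until it reaches disk.  Waiting on a wrong
             * tid here is the failure mode described above: the caller
             * never wakes up and the hung-task watchdog fires. */
            return jbd2_log_wait_commit(journal, tid);
    }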

Comment by Peter Jones [ 20/Aug/12 ]

OK, then let's close this as a duplicate of LU-657 for now and reopen it if it reoccurs after that issue is fixed.
