[LU-1751] Test failure on test suite ost-pools, subtest test_23b Created: 15/Aug/12 Updated: 20/Aug/12 Resolved: 20/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 5399 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b5cb7efc-e4ea-11e1-9681-52540035b04c. The sub-test test_23b failed with the following error:
00:23:57:Lustre: DEBUG MARKER: == ost-pools test 23b: OST pools and OOS == 00:23:56 (1344756236)
00:24:14:Lustre: DEBUG MARKER: /usr/sbin/lctl mark OSTCOUNT=7, OSTSIZE=2147483648
00:24:15:Lustre: DEBUG MARKER: OSTCOUNT=7, OSTSIZE=2147483648
00:24:15:Lustre: DEBUG MARKER: /usr/sbin/lctl mark MAXFREE=14000000, AVAIL=320004, SLOW=yes
00:24:15:Lustre: DEBUG MARKER: MAXFREE=14000000, AVAIL=320004, SLOW=yes
00:31:48:INFO: task jbd2/dm-3-8:18075 blocked for more than 120 seconds.
00:31:48:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00:31:48:jbd2/dm-3-8 D 0000000000000000 0 18075 2 0x00000080
00:31:48: ffff88005b6cdd20 0000000000000046 ffff88006c2e1540 0000000000000000
00:31:48: ffff88005aa0e400 ffff880002216728 00000000000002f3 ffff88005f9f0ae0
00:31:48: ffff88005f9f1098 ffff88005b6cdfd8 000000000000fb88 ffff88005f9f1098
00:31:48:Call Trace:
00:31:48: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
00:31:48: [<ffffffffa03a493f>] jbd2_journal_commit_transaction+0x19f/0x1530 [jbd2]
00:31:48: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
00:31:48: [<ffffffff8107e00c>] ? lock_timer_base+0x3c/0x70
00:31:48: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
00:31:48: [<ffffffffa03ab128>] kjournald2+0xb8/0x220 [jbd2]
00:31:48: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
00:31:48: [<ffffffffa03ab070>] ? kjournald2+0x0/0x220 [jbd2]
00:31:48: [<ffffffff81091d66>] kthread+0x96/0xa0
00:31:48: [<ffffffff8100c14a>] child_rip+0xa/0x20
00:31:48: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
00:31:49: [<ffffffff8100c140>] ? child_rip+0x0/0x20
00:31:49:INFO: task ll_ost_io00_004:25634 blocked for more than 120 seconds.
00:31:49:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00:31:49:ll_ost_io00_0 D 0000000000000000 0 25634 2 0x00000080
00:31:49: ffff8800403f97a0 0000000000000046 ffff8800403f9710 ffffffff810920e6
00:31:49: ffff8800403f97c0 ffff880073d3f8b8 ffff8800403f9760 ffffffff8104e309
00:31:49: ffff88006c2e1af8 ffff8800403f9fd8 000000000000fb88 ffff88006c2e1af8
00:31:49:Call Trace:
00:31:49: [<ffffffff810920e6>] ? autoremove_wake_function+0x16/0x40
00:31:49: [<ffffffff8104e309>] ? __wake_up_common+0x59/0x90
00:31:49: [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
00:31:49: [<ffffffffa03aad85>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
00:31:49: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
00:31:49: [<ffffffffa061f48d>] fsfilt_ldiskfs_commit_wait+0x6d/0xf0 [fsfilt_ldiskfs]
00:31:49: [<ffffffffa0aac23d>] filter_preprw_write+0xc4d/0x22f0 [obdfilter]
00:31:49: [<ffffffffa03ccedb>] ? lnet_ni_send+0x4b/0x110 [lnet]
00:31:49: [<ffffffffa06cea3b>] ? null_alloc_rs+0x1ab/0x3b0 [ptlrpc]
00:31:50: [<ffffffffa06bbbf4>] ? sptlrpc_svc_alloc_rs+0x74/0x2d0 [ptlrpc]
00:31:50: [<ffffffffa0aae6e0>] filter_preprw+0x80/0xa0 [obdfilter]
00:31:50: [<ffffffffa04c681c>] obd_preprw+0x12c/0x3d0 [ost]
00:31:50: [<ffffffffa04cd98a>] ost_brw_write+0x87a/0x1600 [ost]
00:31:50: [<ffffffff8127cea6>] ? vsnprintf+0x2b6/0x5f0
00:31:50: [<ffffffffa068d02c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
00:31:50: [<ffffffffa068d188>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
00:31:50: [<ffffffffa04d402c>] ost_handle+0x360c/0x4850 [ost]
00:31:50: [<ffffffffa0db3521>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
00:31:50: [<ffffffffa0daf324>] ? libcfs_id2str+0x74/0xb0 [libcfs]
00:31:50: [<ffffffffa069c82d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
00:31:50: [<ffffffffa0da365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
00:31:50: [<ffffffffa0693cb7>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
00:31:50: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
00:31:50: [<ffffffffa069de19>] ptlrpc_main+0xb59/0x1860 [ptlrpc]
00:31:51: [<ffffffffa069d2c0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
00:31:51: [<ffffffff8100c14a>] child_rip+0xa/0x20
00:31:51: [<ffffffffa069d2c0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
00:31:52: [<ffffffffa069d2c0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
00:31:52: [<ffffffff8100c140>] ? child_rip+0x0/0x20
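For reference, the failing subtest can normally be re-run on its own through the Lustre test framework. Below is a minimal sketch, assuming an already configured test cluster and that this Lustre version honours the usual ONLY/SLOW/OSTCOUNT environment variables; the values simply mirror the DEBUG MARKER lines above and are illustrative, not taken from the actual Maloo run configuration:

    # Re-run only ost-pools test_23b ("OST pools and OOS") against an existing
    # test setup; paths and variable handling may differ between Lustre versions.
    cd /usr/lib64/lustre/tests        # or lustre/tests in a source tree
    OSTCOUNT=7 SLOW=yes ONLY=23b sh ./ost-pools.sh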
| Comments |
| Comment by Keith Mannthey (Inactive) [ 15/Aug/12 ] |
|
The following is from the client dmesg during the previous test, "replay-vbr test 4f: setattr of permission checks versions":
LustreError: 11-0: an error occurred while communicating with 10.10.2.226@tcp. The obd_ping operation failed with -107
LustreError: Skipped 2 previous similar messages
LustreError: 166-1: MGC10.10.2.226@tcp: Connection to MGS (at 10.10.2.226@tcp) was lost; in progress operations using this service will fail
LustreError: Skipped 2 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff88007a1fcc00: Connection to lustre-MDT0000 (at 10.10.2.226@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 5 previous similar messages
Lustre: Evicted from MGS (at 10.10.2.226@tcp) after server handle changed from 0xccf4b5b0e3d0559d to 0xccf4b5b0e3d059da
Lustre: Skipped 2 previous similar messages
Lustre: MGC10.10.2.226@tcp: Reactivating import
Lustre: 24498:0:(client.c:2708:ptlrpc_replay_interpret()) @@@ Version mismatch during replay req@ffff880074e52000 x1410038284518012/t30064771088(30064771088) o36->lustre-MDT0000-mdc-ffff88007a1fcc00@10.10.2.226@tcp:12/10 lens 488/424 e 6 to 0 dl 1344732209 ref 2 fl Interpret:R/4/0 rc -75/-75
Lustre: 24498:0:(import.c:1174:completed_replay_interpret()) lustre-MDT0000-mdc-ffff88007a1fcc00: version recovery fails, reconnecting
LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
It appears the MDS reset during replay-vbr test 4f but was not quite able to make a second good connection, and the client was evicted from lustre-MDT0000. I am a little confused by the logs: the client dmesg ends while running replay-vbr test 4[fg], but this bug and many of the other logs are about sub-test test_23b (the Lustre pool tests). Can we verify we have all the right logs?
| Comment by Peter Jones [ 19/Aug/12 ] |
|
Hongchao, could you please look into this one? Looking at recent testing on master, it looks like something has broken recently for this test: https://maloo.whamcloud.com/reports/show_pass_rate_report?in_last_n_days=28&source_code_branch=d776127c-4096-11e1-9cbd-5254004bbbd3 Thanks, Peter
| Comment by Hongchao Zhang [ 20/Aug/12 ] |
|
This does look like another side effect of the bug in the patch at
| Comment by Peter Jones [ 20/Aug/12 ] |
|
OK, then let's close this as a duplicate of