LU-4132: Test failure on test suite sanity, subtest test_46

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/557b9ede-3a38-11e3-aede-52540035b04c.

      The sub-test test_46 failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 46

    Activity

            adilger Andreas Dilger added a comment -

            Close old issue as duplicate of LU-4106.

            yong.fan nasf (Inactive) added a comment -

            This is probably caused by the OI cache. I have made a patch for LU-4106 that contains the fix for the OI cache cleanup:

            http://review.whamcloud.com/#/c/8002/
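            For context, a minimal C sketch of the kind of cleanup being discussed here - invalidating a stale cached FID-to-inode mapping so that the next lookup re-reads the OI tables - might look like the following. The names (struct fid_map_entry, fid_cache_invalidate()) are hypothetical and are not taken from the Lustre tree or from the patch above.

            #include <stdbool.h>
            #include <string.h>

            /* Hypothetical FID type; the real one is struct lu_fid in Lustre. */
            struct fid_key {
                    unsigned long long f_seq;
                    unsigned int       f_oid;
                    unsigned int       f_ver;
            };

            /* Hypothetical cache entry mapping a FID to an inode number. */
            struct fid_map_entry {
                    struct fid_key fme_fid;    /* FID used as the cache key */
                    unsigned long  fme_ino;    /* cached inode number       */
                    bool           fme_valid;  /* false once known stale    */
            };

            /* Drop a cached mapping that no longer matches the on-disk object,
             * so the next lookup for this FID goes back to the OI tables. */
            void fid_cache_invalidate(struct fid_map_entry *e,
                                      const struct fid_key *fid)
            {
                    if (e->fme_valid &&
                        memcmp(&e->fme_fid, fid, sizeof(*fid)) == 0)
                            e->fme_valid = false;
            }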

            di.wang Di Wang (Inactive) added a comment -

            Hmm, it seems the client is trying to get a layout lock from MDT4 during sync (there are a few sync calls in sanity test_46), but MDT4 is somehow busy with OI scrub. I checked the debug log of MDT4:

            80000000:00000001:0.0:1381970194.705176:0:10443:0:(fld_cache.c:559:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
            80000000:00000001:0.0:1381970194.705177:0:10443:0:(fld_handler.c:138:fld_server_lookup()) Process leaving (rc=0 : 0 : 0)
            00000004:00000040:0.0:1381970194.705179:0:10443:0:(lod_dev.c:88:lod_fld_lookup()) LOD: got tgt 3 for sequence: 0x6c0000400
            00000004:00000001:0.0:1381970194.705181:0:10443:0:(lod_dev.c:90:lod_fld_lookup()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:0.0:1381970194.705183:0:10443:0:(lod_dev.c:143:lod_object_alloc()) Process leaving (rc=18446612134391954456 : -131939317597160 : ffff88007bd7bc18)
            00000004:00000001:0.0:1381970194.705185:0:10443:0:(mdd_object.c:181:mdd_object_init()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:0.0:1381970194.705188:0:10443:0:(lod_object.c:1203:lod_object_init()) Process entered
            00000004:00000010:0.0:1381970194.705193:0:10443:0:(osd_handler.c:172:osd_object_alloc()) kmalloced 'mo': 176 at ffff88006752c9c0.
            00000004:00000001:0.0:1381970194.705195:0:10443:0:(lod_object.c:1215:lod_object_init()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:0.0:1381970194.705198:0:10443:0:(osd_handler.c:394:osd_fid_lookup()) Process entered
            00000004:00000001:0.0:1381970194.705204:0:10443:0:(osd_handler.c:327:osd_check_lma()) Process entered
            00000004:00000002:0.0:1381970194.705221:0:10443:0:(osd_handler.c:371:osd_check_lma()) lustre-MDT0003: FID [0x6c0000400:0xd56:0x0] != self_fid [0x6c0000400:0xd57:0x0]
            00000004:00000001:0.0:1381970194.705224:0:10443:0:(osd_handler.c:375:osd_check_lma()) Process leaving (rc=18446744073709551538 : -78 : ffffffffffffffb2)
            00000004:00000001:0.0:1381970194.705228:0:10443:0:(osd_scrub.c:1991:osd_scrub_start()) Process entered
            

            It looks like an OI cache problem (note: this object is a normal local object; see the result from lod_fld_lookup). That might also explain why LFSCK cannot fix it, since the on-disk LMA might be correct. Fan Yong, could you please comment? Thanks.

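            For reference, rc = -78 above is -EREMCHG. A simplified, self-contained sketch of the comparison that osd_check_lma appears to be making - the FID used for the lookup versus the self-FID stored in the object's LMA - is shown below; read_lma_self_fid() is only a stand-in that returns the value seen in the debug log, not a real Lustre function.

            #include <stdio.h>
            #include <string.h>

            #define EREMCHG 78    /* "Remote address changed"; rc = -78 in the log above */

            /* Simplified FID; the real struct lu_fid has the same three fields. */
            struct fid {
                    unsigned long long f_seq;
                    unsigned int       f_oid;
                    unsigned int       f_ver;
            };

            /* Stand-in for reading the self-FID out of the LMA xattr; here it
             * simply returns the value reported in the debug log. */
            static void read_lma_self_fid(struct fid *self)
            {
                    self->f_seq = 0x6c0000400ULL;
                    self->f_oid = 0xd57;
                    self->f_ver = 0;
            }

            /* The check: if the FID used for the lookup does not match the
             * self-FID in the LMA, return -EREMCHG so the caller starts OI scrub. */
            static int check_lma(const struct fid *fid)
            {
                    struct fid self;

                    read_lma_self_fid(&self);
                    if (memcmp(fid, &self, sizeof(self)) != 0)
                            return -EREMCHG;
                    return 0;
            }

            int main(void)
            {
                    /* FID from the lookup in the log: [0x6c0000400:0xd56:0x0] */
                    struct fid looked_up = { 0x6c0000400ULL, 0xd56, 0 };

                    printf("check_lma rc = %d\n", check_lma(&looked_up));  /* prints -78 */
                    return 0;
            }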

            adilger Andreas Dilger added a comment -

            It looks like there IS some kind of problem specific to DNE and LFSCK. In the MDT4 console log, problems with a FID on that MDT are reported repeatedly, but it seems that LFSCK is not fixing it.

            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 21 previous similar messages
            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 42 previous similar messages
            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 85 previous similar messages
            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 99 previous similar messages
            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 97 previous similar messages
            

            This is happening hundreds of times per second, which is the first problem. LFSCK shouldn't be restarting so quickly.

            Secondly, DNE shouldn't be creating filesystem state during normal operation that LFSCK isn't happy about. This would be a bug in either DNE or LFSCK.

            Finally, whatever problem is being found by LFSCK is not being fixed by LFSCK, since it is being reported for the same FID repeatedly.

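            On the first point, one simple way to suppress back-to-back scrub triggers for the same FID is sketched below. This is purely illustrative and not how Lustre implements it; scrub_trigger_allowed() and the scrub_throttle fields are hypothetical.

            #include <stdbool.h>
            #include <string.h>
            #include <time.h>

            struct fid {
                    unsigned long long f_seq;
                    unsigned int       f_oid;
                    unsigned int       f_ver;
            };

            /* Hypothetical per-target state remembering the last scrub trigger. */
            struct scrub_throttle {
                    struct fid st_last_fid;      /* FID that last triggered a scrub */
                    time_t     st_last_time;     /* when it was triggered           */
                    int        st_min_interval;  /* minimum seconds between repeats */
            };

            /* Return true only if a scrub should really be started now; repeated
             * triggers for the same FID within the interval are suppressed. */
            bool scrub_trigger_allowed(struct scrub_throttle *st, const struct fid *fid)
            {
                    time_t now = time(NULL);

                    if (memcmp(&st->st_last_fid, fid, sizeof(*fid)) == 0 &&
                        now - st->st_last_time < st->st_min_interval)
                            return false;

                    st->st_last_fid  = *fid;
                    st->st_last_time = now;
                    return true;
            }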
            green Oleg Drokin added a comment -

            The client is stuck waiting on a lock enqueue as part of a sync (trying to refresh a layout lock?):

            flush-lustre- S 0000000000000000     0 10803      2 0x00000080
             ffff88005328b4a0 0000000000000046 0000000000000000 0a64657265746e65
             0000000000000000 0000000000000000 ffff88005571e9c0 ffff88005571e9f0
             ffff88007c13fab8 ffff88005328bfd8 000000000000fb88 ffff88007c13fab8
            Call Trace:
             [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffff8150ef22>] schedule_timeout+0x192/0x2e0
             [<ffffffff810811e0>] ? process_timeout+0x0/0x10
             [<ffffffffa06c295a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
             [<ffffffff81063410>] ? default_wake_function+0x0/0x20
             [<ffffffffa06cc056>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
             [<ffffffffa06c2f67>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
             [<ffffffffa06a5855>] ldlm_cli_enqueue+0x365/0x790 [ptlrpc]
             [<ffffffffa06aa520>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
             [<ffffffffa0a81740>] ? ll_md_blocking_ast+0x0/0x790 [lustre]
             [<ffffffffa08a9b0e>] mdc_enqueue+0x2be/0x1a10 [mdc]
             [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa085db34>] lmv_enqueue+0x2f4/0xfc0 [lmv]
             [<ffffffffa0a5c155>] ll_layout_refresh+0x515/0xfe0 [lustre]
             [<ffffffffa0a81740>] ? ll_md_blocking_ast+0x0/0x790 [lustre]
             [<ffffffffa06aa520>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
             [<ffffffffa05958b2>] ? cl_io_slice_add+0x132/0x190 [obdclass]
             [<ffffffffa0aaa7c0>] vvp_io_init+0x340/0x490 [lustre]
             [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa0594ae8>] cl_io_init0+0x98/0x160 [obdclass]
             [<ffffffffa0597874>] cl_io_init+0x64/0xe0 [obdclass]
             [<ffffffffa0a50f4d>] cl_sync_file_range+0x12d/0x500 [lustre]
             [<ffffffffa0a7861b>] ll_writepages+0x8b/0x1c0 [lustre]
             [<ffffffff8112e181>] do_writepages+0x21/0x40
             [<ffffffff811aca0d>] writeback_single_inode+0xdd/0x290
             [<ffffffff811ace1e>] writeback_sb_inodes+0xce/0x180
             [<ffffffff811acf7b>] writeback_inodes_wb+0xab/0x1b0
             [<ffffffff811ad31b>] wb_writeback+0x29b/0x3f0
             [<ffffffff8150e130>] ? thread_return+0x4e/0x76e
             [<ffffffff81081be2>] ? del_timer_sync+0x22/0x30
             [<ffffffff811ad52b>] wb_do_writeback+0xbb/0x240
             [<ffffffff811ad713>] bdi_writeback_task+0x63/0x1b0
             [<ffffffff81096c67>] ? bit_waitqueue+0x17/0xd0
             [<ffffffff8113cc20>] ? bdi_start_fn+0x0/0x100
             [<ffffffff8113cca6>] bdi_start_fn+0x86/0x100
             [<ffffffff8113cc20>] ? bdi_start_fn+0x0/0x100
             [<ffffffff81096a36>] kthread+0x96/0xa0
             [<ffffffff8100c0ca>] child_rip+0xa/0x20
             [<ffffffff810969a0>] ? kthread+0x0/0xa0
             [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            

            di.wang Di Wang (Inactive) added a comment -

            Though this failure only happens in the DNE test, I suspect it is related to the 16 OSTs. We usually use 8 OSTs in a single-MDT run, but 16 OSTs (and 4 MDTs) in a DNE run.

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 5