[LU-4132] Test failure on test suite sanity, subtest test_46 Created: 22/Oct/13  Updated: 17/Apr/17  Resolved: 17/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-4106 racer test hang Resolved
Severity: 3
Rank (Obsolete): 11189

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/557b9ede-3a38-11e3-aede-52540035b04c.

The sub-test test_46 failed with the following error:

test failed to respond and timed out

Info required for matching: sanity 46



 Comments   
Comment by Di Wang [ 22/Oct/13 ]

Though this failure only happens in the DNE test, I suspect it is related to having 16 OSTs. Usually we use 8 OSTs in a single-MDT run, but 16 OSTs (and 4 MDTs) in a DNE run.

Comment by Oleg Drokin [ 22/Oct/13 ]

The client is stuck waiting on a lock enqueue as part of a sync (trying to refresh a layout lock?):

flush-lustre- S 0000000000000000     0 10803      2 0x00000080
 ffff88005328b4a0 0000000000000046 0000000000000000 0a64657265746e65
 0000000000000000 0000000000000000 ffff88005571e9c0 ffff88005571e9f0
 ffff88007c13fab8 ffff88005328bfd8 000000000000fb88 ffff88007c13fab8
Call Trace:
 [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff8150ef22>] schedule_timeout+0x192/0x2e0
 [<ffffffff810811e0>] ? process_timeout+0x0/0x10
 [<ffffffffa06c295a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
 [<ffffffff81063410>] ? default_wake_function+0x0/0x20
 [<ffffffffa06cc056>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
 [<ffffffffa06c2f67>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
 [<ffffffffa06a5855>] ldlm_cli_enqueue+0x365/0x790 [ptlrpc]
 [<ffffffffa06aa520>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
 [<ffffffffa0a81740>] ? ll_md_blocking_ast+0x0/0x790 [lustre]
 [<ffffffffa08a9b0e>] mdc_enqueue+0x2be/0x1a10 [mdc]
 [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa085db34>] lmv_enqueue+0x2f4/0xfc0 [lmv]
 [<ffffffffa0a5c155>] ll_layout_refresh+0x515/0xfe0 [lustre]
 [<ffffffffa0a81740>] ? ll_md_blocking_ast+0x0/0x790 [lustre]
 [<ffffffffa06aa520>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
 [<ffffffffa05958b2>] ? cl_io_slice_add+0x132/0x190 [obdclass]
 [<ffffffffa0aaa7c0>] vvp_io_init+0x340/0x490 [lustre]
 [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0594ae8>] cl_io_init0+0x98/0x160 [obdclass]
 [<ffffffffa0597874>] cl_io_init+0x64/0xe0 [obdclass]
 [<ffffffffa0a50f4d>] cl_sync_file_range+0x12d/0x500 [lustre]
 [<ffffffffa0a7861b>] ll_writepages+0x8b/0x1c0 [lustre]
 [<ffffffff8112e181>] do_writepages+0x21/0x40
 [<ffffffff811aca0d>] writeback_single_inode+0xdd/0x290
 [<ffffffff811ace1e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811acf7b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811ad31b>] wb_writeback+0x29b/0x3f0
 [<ffffffff8150e130>] ? thread_return+0x4e/0x76e
 [<ffffffff81081be2>] ? del_timer_sync+0x22/0x30
 [<ffffffff811ad52b>] wb_do_writeback+0xbb/0x240
 [<ffffffff811ad713>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81096c67>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff8113cc20>] ? bdi_start_fn+0x0/0x100
 [<ffffffff8113cca6>] bdi_start_fn+0x86/0x100
 [<ffffffff8113cc20>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
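
Reading the trace bottom-up: writeback enters Lustre via ll_writepages(), which builds an I/O through cl_sync_file_range(); initializing that I/O in vvp_io_init() calls ll_layout_refresh(), which enqueues a layout lock from the MDT with a synchronous RPC (ldlm_cli_enqueue -> ptlrpc_queue_wait). A minimal userspace sketch of that dependency follows; all helper names are made up for illustration, only the ordering mirrors the trace. The point is simply that writeback cannot make progress until the enqueue RPC completes, so an unresponsive MDT stalls the flush thread.

/*
 * Toy userspace model (hypothetical names throughout) of the call path
 * in the trace above: writeback cannot finish until the layout lock
 * enqueue RPC to the MDT completes.
 */
#include <stdio.h>
#include <stdbool.h>

/* Stand-in for the synchronous enqueue RPC (ldlm_cli_enqueue ->
 * ptlrpc_queue_wait): returns only once the MDT replies. */
static bool mdt_enqueue_layout_lock(void)
{
    /* In the failing run this effectively never returns, because
     * MDT0003 is busy with OI scrub; here we just pretend it worked. */
    return true;
}

/* Rough equivalent of ll_layout_refresh(): blocking RPC before I/O. */
static bool layout_refresh(void)
{
    return mdt_enqueue_layout_lock();
}

/* Rough equivalent of cl_sync_file_range()/vvp_io_init(): I/O setup
 * requires a valid layout lock before any pages can be written back. */
static int sync_file_range_model(const char *path)
{
    if (!layout_refresh())
        return -1;    /* the flush thread would stay blocked here */
    printf("layout valid, writing back dirty pages of %s\n", path);
    return 0;
}

int main(void)
{
    /* The flush-lustre- thread equivalent: one sync per dirty inode. */
    return sync_file_range_model("dir46/file");
}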
Comment by Andreas Dilger [ 22/Oct/13 ]

It looks like there IS some kind of problem specific to DNE and LFSCK. The MDT4 console log repeatedly reports a problem with a FID on that MDT, but it seems that LFSCK is not fixing it.

10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
10:43:43:LustreError: Skipped 21 previous similar messages
10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
10:43:43:LustreError: Skipped 42 previous similar messages
10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
10:43:43:LustreError: Skipped 85 previous similar messages
10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
10:43:43:LustreError: Skipped 99 previous similar messages
10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
10:43:43:LustreError: Skipped 97 previous similar messages

This is happening hundreds of times per second, which is the first problem. LFSCK shouldn't be restarting so quickly.

Secondly, DNE shouldn't be creating anything in the filesystem during normal operation that LFSCK isn't happy about. This would be a bug in DNE or LFSCK.

Finally, whatever problem is being found by LFSCK is not being fixed by it, since the same FID is being reported repeatedly.

Comment by Di Wang [ 22/Oct/13 ]

Hmm, it seems the client is trying to get a layout lock from MDT4 during sync (there are a few syncs in sanity test_46), but MDT4 is busy with OI scrub somehow. I checked the debug log of MDT4:

80000000:00000001:0.0:1381970194.705176:0:10443:0:(fld_cache.c:559:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1381970194.705177:0:10443:0:(fld_handler.c:138:fld_server_lookup()) Process leaving (rc=0 : 0 : 0)
00000004:00000040:0.0:1381970194.705179:0:10443:0:(lod_dev.c:88:lod_fld_lookup()) LOD: got tgt 3 for sequence: 0x6c0000400
00000004:00000001:0.0:1381970194.705181:0:10443:0:(lod_dev.c:90:lod_fld_lookup()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1381970194.705183:0:10443:0:(lod_dev.c:143:lod_object_alloc()) Process leaving (rc=18446612134391954456 : -131939317597160 : ffff88007bd7bc18)
00000004:00000001:0.0:1381970194.705185:0:10443:0:(mdd_object.c:181:mdd_object_init()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1381970194.705188:0:10443:0:(lod_object.c:1203:lod_object_init()) Process entered
00000004:00000010:0.0:1381970194.705193:0:10443:0:(osd_handler.c:172:osd_object_alloc()) kmalloced 'mo': 176 at ffff88006752c9c0.
00000004:00000001:0.0:1381970194.705195:0:10443:0:(lod_object.c:1215:lod_object_init()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1381970194.705198:0:10443:0:(osd_handler.c:394:osd_fid_lookup()) Process entered
00000004:00000001:0.0:1381970194.705204:0:10443:0:(osd_handler.c:327:osd_check_lma()) Process entered
00000004:00000002:0.0:1381970194.705221:0:10443:0:(osd_handler.c:371:osd_check_lma()) lustre-MDT0003: FID [0x6c0000400:0xd56:0x0] != self_fid [0x6c0000400:0xd57:0x0]
00000004:00000001:0.0:1381970194.705224:0:10443:0:(osd_handler.c:375:osd_check_lma()) Process leaving (rc=18446744073709551538 : -78 : ffffffffffffffb2)
00000004:00000001:0.0:1381970194.705228:0:10443:0:(osd_scrub.c:1991:osd_scrub_start()) Process entered

It seems to be an OI cache problem (note: this is a normal local object, see the result from lod_fld_lookup), which might also explain why LFSCK cannot fix it, since the on-disk LMA might be correct. Fan Yong, could you please comment? Thanks.
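
A minimal userspace sketch of the failing check (names are simplified; only the comparison and the -78/-EREMCHG outcome mirror the log): the FID resolved through the OI mapping is compared against the self-FID stored in the object's LMA; on mismatch the lookup fails and OI scrub is kicked off, but if the on-disk data is actually correct and only the cached mapping is stale, scrub finds nothing to repair and every retry hits the same mismatch.

/*
 * Simplified model of the check that fails in the log above:
 * FID [0x6c0000400:0xd56:0x0] != self_fid [0x6c0000400:0xd57:0x0], rc = -78.
 * Structures and helpers are illustrative, not the real osd code.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>

struct fid_model { unsigned long long seq; unsigned int oid, ver; };

/* FID the lookup resolved through the (possibly stale) OI cache mapping. */
static const struct fid_model looked_up = { 0x6c0000400ULL, 0xd56, 0 };
/* Self-FID recorded in the object's on-disk LMA xattr (authoritative). */
static const struct fid_model lma_self  = { 0x6c0000400ULL, 0xd57, 0 };

static int check_lma_model(const struct fid_model *want,
                           const struct fid_model *lma)
{
    /* Mirrors the osd_check_lma() outcome in the log: mismatch -> -78. */
    return memcmp(want, lma, sizeof(*want)) ? -EREMCHG : 0;
}

int main(void)
{
    int rc = check_lma_model(&looked_up, &lma_self);

    if (rc == -EREMCHG)
        /* osd_fid_lookup() reacts by starting OI scrub, but if the
         * on-disk OI table is already correct and only the in-memory
         * cache is stale, scrub has nothing to repair and every retry
         * hits the same mismatch -- matching the repeating console
         * messages in the previous comment. */
        printf("FID mismatch, rc = %d: would trigger OI scrub\n", rc);
    return 0;
}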

Comment by nasf (Inactive) [ 23/Oct/13 ]

It is probably caused by the OI cache. I have made a patch for LU-4106 which contains a fix for OI cache cleanup:

http://review.whamcloud.com/#/c/8002/
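
For context on why cache cleanup matters here: if a cached OI entry keeps mapping a FID to an inode that has since been deleted and reused by a new object, every later lookup of the old FID lands on the wrong inode and trips the LMA self-FID check above. Below is a minimal sketch of that idea only; it is not the content of change 8002, and all names in it are made up.

/*
 * Toy OI cache model (not Lustre code): shows why a stale FID->inode
 * mapping has to be dropped when the object behind it goes away.
 */
#include <stdio.h>

#define SLOTS 4

struct oi_entry { unsigned int oid; unsigned long ino; int valid; };
static struct oi_entry cache[SLOTS];

static void oi_insert(unsigned int oid, unsigned long ino)
{
    struct oi_entry *e = &cache[oid % SLOTS];
    e->oid = oid; e->ino = ino; e->valid = 1;
}

/* The cleanup being discussed: invalidate the mapping on object destroy. */
static void oi_invalidate(unsigned int oid)
{
    struct oi_entry *e = &cache[oid % SLOTS];
    if (e->valid && e->oid == oid)
        e->valid = 0;
}

static long oi_lookup(unsigned int oid)
{
    struct oi_entry *e = &cache[oid % SLOTS];
    return (e->valid && e->oid == oid) ? (long)e->ino : -1;
}

int main(void)
{
    oi_insert(0xd56, 1001);   /* object created, FID->inode mapping cached */
    oi_invalidate(0xd56);     /* object destroyed: mapping must be dropped */
    oi_insert(0xd57, 1001);   /* inode reused by a new object (new FID)    */

    /* Without the invalidate above, a lookup of 0xd56 would still return
     * inode 1001, and that inode's LMA would carry self-FID 0xd57:
     * exactly the mismatch seen in the debug log. */
    printf("lookup 0xd56 -> %ld\n", oi_lookup(0xd56));
    return 0;
}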

Comment by Andreas Dilger [ 17/Apr/17 ]

Closing this old issue as a duplicate of LU-4106.
