LU-4132: Test failure on test suite sanity, subtest test_46

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/557b9ede-3a38-11e3-aede-52540035b04c.

      The sub-test test_46 failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 46

    Activity

            adilger Andreas Dilger added a comment -

            Close old issue as duplicate of LU-4106.

            yong.fan nasf (Inactive) added a comment -

            This is probably caused by the OI cache. I have made a patch for LU-4106 that contains the fix for the OI cache cleanup:

            http://review.whamcloud.com/#/c/8002/
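            For context, a minimal C sketch of the kind of cleanup being discussed here - invalidating a stale cached FID-to-inode mapping so that the next lookup re-reads the OI tables - might look like the following. The names (struct fid_map_entry, fid_cache_invalidate()) are hypothetical and are not taken from the Lustre tree or from the patch above.

            #include <stdbool.h>
            #include <string.h>

            /* Hypothetical FID type; the real one is struct lu_fid in Lustre. */
            struct fid_key {
                    unsigned long long f_seq;
                    unsigned int       f_oid;
                    unsigned int       f_ver;
            };

            /* Hypothetical cache entry mapping a FID to an inode number. */
            struct fid_map_entry {
                    struct fid_key fme_fid;    /* FID used as the cache key */
                    unsigned long  fme_ino;    /* cached inode number       */
                    bool           fme_valid;  /* false once known stale    */
            };

            /* Drop a cached mapping that no longer matches the on-disk object,
             * so the next lookup for this FID goes back to the OI tables. */
            void fid_cache_invalidate(struct fid_map_entry *e,
                                      const struct fid_key *fid)
            {
                    if (e->fme_valid &&
                        memcmp(&e->fme_fid, fid, sizeof(*fid)) == 0)
                            e->fme_valid = false;
            }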

            di.wang Di Wang (Inactive) added a comment -

            Hmm, it seems the client is trying to get a layout lock from MDT4 during sync (there are a few sync calls in sanity test_46), but MDT4 is somehow busy with OI scrub. I checked the debug log of MDT4:

            80000000:00000001:0.0:1381970194.705176:0:10443:0:(fld_cache.c:559:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
            80000000:00000001:0.0:1381970194.705177:0:10443:0:(fld_handler.c:138:fld_server_lookup()) Process leaving (rc=0 : 0 : 0)
            00000004:00000040:0.0:1381970194.705179:0:10443:0:(lod_dev.c:88:lod_fld_lookup()) LOD: got tgt 3 for sequence: 0x6c0000400
            00000004:00000001:0.0:1381970194.705181:0:10443:0:(lod_dev.c:90:lod_fld_lookup()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:0.0:1381970194.705183:0:10443:0:(lod_dev.c:143:lod_object_alloc()) Process leaving (rc=18446612134391954456 : -131939317597160 : ffff88007bd7bc18)
            00000004:00000001:0.0:1381970194.705185:0:10443:0:(mdd_object.c:181:mdd_object_init()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:0.0:1381970194.705188:0:10443:0:(lod_object.c:1203:lod_object_init()) Process entered
            00000004:00000010:0.0:1381970194.705193:0:10443:0:(osd_handler.c:172:osd_object_alloc()) kmalloced 'mo': 176 at ffff88006752c9c0.
            00000004:00000001:0.0:1381970194.705195:0:10443:0:(lod_object.c:1215:lod_object_init()) Process leaving (rc=0 : 0 : 0)
            00000004:00000001:0.0:1381970194.705198:0:10443:0:(osd_handler.c:394:osd_fid_lookup()) Process entered
            00000004:00000001:0.0:1381970194.705204:0:10443:0:(osd_handler.c:327:osd_check_lma()) Process entered
            00000004:00000002:0.0:1381970194.705221:0:10443:0:(osd_handler.c:371:osd_check_lma()) lustre-MDT0003: FID [0x6c0000400:0xd56:0x0] != self_fid [0x6c0000400:0xd57:0x0]
            00000004:00000001:0.0:1381970194.705224:0:10443:0:(osd_handler.c:375:osd_check_lma()) Process leaving (rc=18446744073709551538 : -78 : ffffffffffffffb2)
            00000004:00000001:0.0:1381970194.705228:0:10443:0:(osd_scrub.c:1991:osd_scrub_start()) Process entered
            

            It looks like an OI cache problem (note: this object is a normal local object; see the result from lod_fld_lookup). That might also explain why LFSCK cannot fix it, since the on-disk LMA might be correct. Fan Yong, could you please comment? Thanks.

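            For reference, rc = -78 above is -EREMCHG. A simplified, self-contained sketch of the comparison that osd_check_lma appears to be making - the FID used for the lookup versus the self-FID stored in the object's LMA - is shown below; read_lma_self_fid() is only a stand-in that returns the value seen in the debug log, not a real Lustre function.

            #include <stdio.h>
            #include <string.h>

            #define EREMCHG 78    /* "Remote address changed"; rc = -78 in the log above */

            /* Simplified FID; the real struct lu_fid has the same three fields. */
            struct fid {
                    unsigned long long f_seq;
                    unsigned int       f_oid;
                    unsigned int       f_ver;
            };

            /* Stand-in for reading the self-FID out of the LMA xattr; here it
             * simply returns the value reported in the debug log. */
            static void read_lma_self_fid(struct fid *self)
            {
                    self->f_seq = 0x6c0000400ULL;
                    self->f_oid = 0xd57;
                    self->f_ver = 0;
            }

            /* The check: if the FID used for the lookup does not match the
             * self-FID in the LMA, return -EREMCHG so the caller starts OI scrub. */
            static int check_lma(const struct fid *fid)
            {
                    struct fid self;

                    read_lma_self_fid(&self);
                    if (memcmp(fid, &self, sizeof(self)) != 0)
                            return -EREMCHG;
                    return 0;
            }

            int main(void)
            {
                    /* FID from the lookup in the log: [0x6c0000400:0xd56:0x0] */
                    struct fid looked_up = { 0x6c0000400ULL, 0xd56, 0 };

                    printf("check_lma rc = %d\n", check_lma(&looked_up));  /* prints -78 */
                    return 0;
            }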

            adilger Andreas Dilger added a comment -

            It looks like there IS some kind of problem specific to DNE and LFSCK. In the MDT4 console log, problems with a FID on that MDT are reported repeatedly, but it seems that LFSCK is not fixing it.

            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 21 previous similar messages
            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 42 previous similar messages
            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 85 previous similar messages
            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 99 previous similar messages
            10:43:43:LustreError: 0-0: lustre-MDT0003: trigger OI scrub by RPC for [0x6c0000400:0xd36:0x0], rc = 0 [1]
            10:43:43:LustreError: Skipped 97 previous similar messages
            

            This is happening hundreds of times per second, which is the first problem. LFSCK shouldn't be restarting so quickly.

            Secondly, DNE shouldn't be creating filesystem state during normal operation that LFSCK isn't happy about. This would be a bug in either DNE or LFSCK.

            Finally, whatever problem is being found by LFSCK is not being fixed by LFSCK, since it is being reported for the same FID repeatedly.

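            On the first point, one simple way to suppress back-to-back scrub triggers for the same FID is sketched below. This is purely illustrative and not how Lustre implements it; scrub_trigger_allowed() and the scrub_throttle fields are hypothetical.

            #include <stdbool.h>
            #include <string.h>
            #include <time.h>

            struct fid {
                    unsigned long long f_seq;
                    unsigned int       f_oid;
                    unsigned int       f_ver;
            };

            /* Hypothetical per-target state remembering the last scrub trigger. */
            struct scrub_throttle {
                    struct fid st_last_fid;      /* FID that last triggered a scrub */
                    time_t     st_last_time;     /* when it was triggered           */
                    int        st_min_interval;  /* minimum seconds between repeats */
            };

            /* Return true only if a scrub should really be started now; repeated
             * triggers for the same FID within the interval are suppressed. */
            bool scrub_trigger_allowed(struct scrub_throttle *st, const struct fid *fid)
            {
                    time_t now = time(NULL);

                    if (memcmp(&st->st_last_fid, fid, sizeof(*fid)) == 0 &&
                        now - st->st_last_time < st->st_min_interval)
                            return false;

                    st->st_last_fid  = *fid;
                    st->st_last_time = now;
                    return true;
            }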
            green Oleg Drokin added a comment -

            The client is stuck waiting on a lock enqueue as part of a sync (trying to refresh a layout lock?):

            flush-lustre- S 0000000000000000     0 10803      2 0x00000080
             ffff88005328b4a0 0000000000000046 0000000000000000 0a64657265746e65
             0000000000000000 0000000000000000 ffff88005571e9c0 ffff88005571e9f0
             ffff88007c13fab8 ffff88005328bfd8 000000000000fb88 ffff88007c13fab8
            Call Trace:
             [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffff8150ef22>] schedule_timeout+0x192/0x2e0
             [<ffffffff810811e0>] ? process_timeout+0x0/0x10
             [<ffffffffa06c295a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
             [<ffffffff81063410>] ? default_wake_function+0x0/0x20
             [<ffffffffa06cc056>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
             [<ffffffffa06c2f67>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
             [<ffffffffa06a5855>] ldlm_cli_enqueue+0x365/0x790 [ptlrpc]
             [<ffffffffa06aa520>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
             [<ffffffffa0a81740>] ? ll_md_blocking_ast+0x0/0x790 [lustre]
             [<ffffffffa08a9b0e>] mdc_enqueue+0x2be/0x1a10 [mdc]
             [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa085db34>] lmv_enqueue+0x2f4/0xfc0 [lmv]
             [<ffffffffa0a5c155>] ll_layout_refresh+0x515/0xfe0 [lustre]
             [<ffffffffa0a81740>] ? ll_md_blocking_ast+0x0/0x790 [lustre]
             [<ffffffffa06aa520>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
             [<ffffffffa05958b2>] ? cl_io_slice_add+0x132/0x190 [obdclass]
             [<ffffffffa0aaa7c0>] vvp_io_init+0x340/0x490 [lustre]
             [<ffffffffa03e77b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa0594ae8>] cl_io_init0+0x98/0x160 [obdclass]
             [<ffffffffa0597874>] cl_io_init+0x64/0xe0 [obdclass]
             [<ffffffffa0a50f4d>] cl_sync_file_range+0x12d/0x500 [lustre]
             [<ffffffffa0a7861b>] ll_writepages+0x8b/0x1c0 [lustre]
             [<ffffffff8112e181>] do_writepages+0x21/0x40
             [<ffffffff811aca0d>] writeback_single_inode+0xdd/0x290
             [<ffffffff811ace1e>] writeback_sb_inodes+0xce/0x180
             [<ffffffff811acf7b>] writeback_inodes_wb+0xab/0x1b0
             [<ffffffff811ad31b>] wb_writeback+0x29b/0x3f0
             [<ffffffff8150e130>] ? thread_return+0x4e/0x76e
             [<ffffffff81081be2>] ? del_timer_sync+0x22/0x30
             [<ffffffff811ad52b>] wb_do_writeback+0xbb/0x240
             [<ffffffff811ad713>] bdi_writeback_task+0x63/0x1b0
             [<ffffffff81096c67>] ? bit_waitqueue+0x17/0xd0
             [<ffffffff8113cc20>] ? bdi_start_fn+0x0/0x100
             [<ffffffff8113cca6>] bdi_start_fn+0x86/0x100
             [<ffffffff8113cc20>] ? bdi_start_fn+0x0/0x100
             [<ffffffff81096a36>] kthread+0x96/0xa0
             [<ffffffff8100c0ca>] child_rip+0xa/0x20
             [<ffffffff810969a0>] ? kthread+0x0/0xa0
             [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            

            di.wang Di Wang (Inactive) added a comment -

            Though this failure only happens in the DNE test, I suspect it is related to the 16 OSTs. We usually use 8 OSTs in a single-MDT run, but 16 OSTs (and 4 MDTs) in a DNE run.

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 5