[LU-8824] sanity-sec test_9: ASSERTION( config->nmc_default_nodemap )

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/abdb13dc-a627-11e6-964e-5254006e85c2.

      The sub-test test_9 failed with the following error:

      trevis-34vm4:LBUG/LASSERT detected
      
      02:04:11:[15785.994958] Lustre: DEBUG MARKER: == sanity-sec test 9: nodemap range add ============================================================== 02:02:49 (1478656969)
      02:04:11:[15792.826885] Lustre: 10421:0:(nodemap_handler.c:1020:nodemap_create()) adding nodemap '27295_7' to config without default nodemap
      02:04:11:[15792.830823] Lustre: 10421:0:(nodemap_handler.c:1020:nodemap_create()) Skipped 3 previous similar messages
      02:04:11:[15800.705743] Lustre: 10421:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC10.9.5.176@tcp: error processing nodemap log nodemap: rc = -2
      02:04:11:[15800.709914] LustreError: 10421:0:(nodemap_handler.c:1428:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed: 
      02:04:11:[15800.714076] LustreError: 10421:0:(nodemap_handler.c:1428:nodemap_config_set_active()) LBUG
      02:04:11:[15800.716317] Pid: 10421, comm: ll_cfg_requeue
      02:04:11:[15800.718308] 
      02:04:11:[15800.718308] Call Trace:
      02:04:11:[15800.721741]  [<ffffffffa09387d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      02:04:11:[15800.723818]  [<ffffffffa0938d75>] lbug_with_loc+0x45/0xc0 [libcfs]
      02:04:11:[15800.725837]  [<ffffffffa0d34a17>] nodemap_config_set_active+0x2a7/0x2e0 [ptlrpc]
      02:04:11:[15800.727873]  [<ffffffffa0d3d908>] nodemap_config_set_active_mgc+0x38/0x1e0 [ptlrpc]
      02:04:11:[15800.729985]  [<ffffffffa0ca28f0>] ? ptlrpc_request_cache_free+0x90/0x1d0 [ptlrpc]
      02:04:11:[15800.732071]  [<ffffffffa0ca35d5>] ? __ptlrpc_req_finished+0x475/0x690 [ptlrpc]
      02:04:11:[15800.734162]  [<ffffffffa0c43e6b>] mgc_process_recover_nodemap_log+0x34b/0xe10 [mgc]
      02:04:11:[15800.736195]  [<ffffffffa0c46894>] mgc_process_log+0x754/0x880 [mgc]
      02:04:11:[15800.738132]  [<ffffffff816399cd>] ? schedule_timeout+0x17d/0x2d0
      02:04:11:[15800.740126]  [<ffffffffa09439d7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      02:04:11:[15800.742013]  [<ffffffffa0c48908>] mgc_requeue_thread+0x2b8/0x880 [mgc]
      02:04:11:[15800.744113]  [<ffffffff810b8940>] ? default_wake_function+0x0/0x20
      02:04:11:[15800.746313]  [<ffffffffa0c48650>] ? mgc_requeue_thread+0x0/0x880 [mgc]
      02:04:11:[15800.748437]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      02:04:11:[15800.750331]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
      02:04:11:[15800.752203]  [<ffffffff81646c98>] ret_from_fork+0x58/0x90
      02:04:11:[15800.754097]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
      02:04:11:[15800.755897] 
      


      Info required for matching: sanity-sec 9

          Activity

            pjones Peter Jones added a comment -

            Kit

            This is indeed good news. How are things progressing on making the changes necessary with the error handling?

            Peter


            utopiabound Nathaniel Clark added a comment -

            Kit,

            Awesome find.

            EXCEPTing test_9 just delays the ASSERTION to test_15:
            https://testing.hpdd.intel.com/sub_tests/aaedadbe-a888-11e6-b6bd-5254006e85c2

            I think getting a real fix is necessary for sanity-sec to pass with ZFS.

            kit.westneat Kit Westneat (Inactive) added a comment -

            I think I've figured out what's going on. The config load code expects the index file to return the key/values in key-sorted order, which the ldiskfs index files do. The ZFS index files, however, appear to return the keys in hash-sorted order, at least according to this comment:

            /*
             * XXX: implement support for fixed-size keys sorted with natural
             * numerical way (not using internal hash value)
             */

            We currently embed the config record type in the key so that create records are processed before update records, so not having the records sent in key order breaks this.

            I'm going to investigate how easy it would be to modify the config load/send operation to do a two-pass load, where the create records are loaded first and the other records are loaded afterwards.
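The two-pass load proposed in the comment above could look roughly like the sketch below. It is a minimal, user-space illustration and not the Lustre implementation: `struct rec`, `struct config`, `apply_rec()`, `load_one_pass()` and `load_two_pass()` are all hypothetical names, and the record type is stored as a plain field rather than embedded in an index key. It only shows why a single pass fails when an update record arrives before its create record, and how filtering by record type in two passes removes the dependency on iteration order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record types; the real nodemap index uses its own encoding. */
enum rec_type { REC_CREATE = 0, REC_UPDATE = 1 };

struct rec {
	enum rec_type type;	/* stands in for the type embedded in the key */
	int nodemap_id;
};

#define MAX_NODEMAPS 16

struct config {
	int created[MAX_NODEMAPS];	/* 1 once a create record was applied */
	int updated[MAX_NODEMAPS];	/* 1 once an update record was applied */
};

/* Apply one record; an update for a nodemap that does not exist yet fails. */
static int apply_rec(struct config *cfg, const struct rec *r)
{
	assert(cfg != NULL && r != NULL);
	if (r->nodemap_id < 0 || r->nodemap_id >= MAX_NODEMAPS)
		return -1;
	if (r->type == REC_CREATE) {
		cfg->created[r->nodemap_id] = 1;
		return 0;
	}
	if (!cfg->created[r->nodemap_id])
		return -2;	/* ENOENT-style error: nothing to update */
	cfg->updated[r->nodemap_id] = 1;
	return 0;
}

/* Single pass: only correct when records arrive in key-sorted order. */
static int load_one_pass(struct config *cfg, const struct rec *recs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int rc = apply_rec(cfg, &recs[i]);

		if (rc)
			return rc;
	}
	return 0;
}

/* Two passes: create records first, every other record second. */
static int load_two_pass(struct config *cfg, const struct rec *recs, size_t n)
{
	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < n; i++) {
			int is_create = recs[i].type == REC_CREATE;
			int rc;

			/* pass 0 takes only creates, pass 1 only the rest */
			if ((pass == 0) != is_create)
				continue;
			rc = apply_rec(cfg, &recs[i]);
			if (rc)
				return rc;
		}
	}
	return 0;
}
```

With the input `{ {REC_UPDATE, 3}, {REC_CREATE, 3} }`, which imitates an iterator that returns the update before the create, `load_one_pass()` stops with an error while `load_two_pass()` applies both records. The cost is a second walk over the records, which is a design trade-off of this sketch rather than a statement about the eventual fix.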

            utopiabound Nathaniel Clark added a comment -

            It seems to happen any time sanity-sec is run on ZFS, but not when run on ldiskfs.

            kit.westneat Kit Westneat (Inactive) added a comment -

            Thanks for gathering the logs on this, Nathaniel.

            It looks like there is an error handling issue in mgc_process_recover_nodemap_log. I can fix up the error handling for the nodemap portion, but the handling for the recovery log part is also missing, and I don't feel confident adding it there. Is there someone who can look at that portion?

            I'm not sure what the root cause is yet, though it looks like something to do with the default nodemap not getting transferred correctly. Does this LBUG happen on all ZFS full group tests, or is it more random?

            Fixing up the error handling should be enough for 2.9, though it means that nodemap will be only partially functional on ZFS systems.
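The error handling gap described above can be illustrated with a small user-space sketch. This is not the Lustre code: `struct nodemap_config`, `process_nodemap_log()`, `config_set_active()` and `recover_nodemap_log()` are simplified, hypothetical stand-ins, and `simulated_rc` only imitates a log processing failure such as the `rc = -2` in the console log. The idea shown is that a freshly built config is activated only when log processing succeeded, and that an incomplete config is rejected with an error instead of tripping an assertion.

```c
#include <errno.h>
#include <stdlib.h>

/* Simplified stand-in; the real Lustre structure carries far more state. */
struct nodemap_config {
	int has_default_nodemap;
};

static struct nodemap_config *active_config;

/* Hypothetical log processor: simulates success or a failure like -ENOENT. */
static int process_nodemap_log(struct nodemap_config *cfg, int simulated_rc)
{
	if (simulated_rc == 0)
		cfg->has_default_nodemap = 1;
	return simulated_rc;
}

/* Activate only a complete config; reject an incomplete one with an error. */
static int config_set_active(struct nodemap_config *cfg)
{
	if (cfg == NULL || !cfg->has_default_nodemap)
		return -EINVAL;
	free(active_config);
	active_config = cfg;
	return 0;
}

/* Build a new config and swap it in only if the whole log was processed. */
static int recover_nodemap_log(int simulated_rc)
{
	struct nodemap_config *cfg = calloc(1, sizeof(*cfg));
	int rc;

	if (cfg == NULL)
		return -ENOMEM;
	rc = process_nodemap_log(cfg, simulated_rc);
	if (rc == 0)
		rc = config_set_active(cfg);
	if (rc != 0)
		free(cfg);	/* the previously active config stays in place */
	return rc;
}
```

In this sketch a failed load leaves `active_config` pointing at the last good config, which corresponds to the "only partially functional" outcome mentioned in the comment: the node keeps running, but with stale nodemap data.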
            pjones Peter Jones added a comment -

            Kit

            Could you please advise on this issue and whether we could live with this in 2.9?

            Peter


            gerrit Gerrit Updater added a comment -

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/23706
            Subject: LU-8824 test: EXCEPT test_9 till fixed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a469c81df1a8fea2798c8fe7866456c53a53a00d

            utopiabound Nathaniel Clark added a comment -

            Uncaught failures in past 4 weeks (console log of OST vm shows LBUG):
            https://testing.hpdd.intel.com/test_sets/969f1d60-9a60-11e6-a546-5254006e85c2
            https://testing.hpdd.intel.com/sub_tests/d0606342-93df-11e6-91aa-5254006e85c2
            https://testing.hpdd.intel.com/sub_tests/ff975cc6-9ed7-11e6-b8c4-5254006e85c2
            https://testing.hpdd.intel.com/sub_tests/b3b42986-9f06-11e6-a747-5254006e85c2
            https://testing.hpdd.intel.com/sub_tests/619feb44-a63b-11e6-bf77-5254006e85c2

            utopiabound Nathaniel Clark added a comment - https://testing.hpdd.intel.com/test_sets/b06c944a-9a63-11e6-a5e5-5254006e85c2

            People

              kit.westneat Kit Westneat (Inactive)
              maloo Maloo