Lustre / LU-8824

sanity-sec test_9: ASSERTION( config->nmc_default_nodemap )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 2.9.0
    • Severity: 3

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/abdb13dc-a627-11e6-964e-5254006e85c2.

      The sub-test test_9 failed with the following error:

      trevis-34vm4:LBUG/LASSERT detected
      
      02:04:11:[15785.994958] Lustre: DEBUG MARKER: == sanity-sec test 9: nodemap range add ============================================================== 02:02:49 (1478656969)
      02:04:11:[15792.826885] Lustre: 10421:0:(nodemap_handler.c:1020:nodemap_create()) adding nodemap '27295_7' to config without default nodemap
      02:04:11:[15792.830823] Lustre: 10421:0:(nodemap_handler.c:1020:nodemap_create()) Skipped 3 previous similar messages
      02:04:11:[15800.705743] Lustre: 10421:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC10.9.5.176@tcp: error processing nodemap log nodemap: rc = -2
      02:04:11:[15800.709914] LustreError: 10421:0:(nodemap_handler.c:1428:nodemap_config_set_active()) ASSERTION( config->nmc_default_nodemap ) failed: 
      02:04:11:[15800.714076] LustreError: 10421:0:(nodemap_handler.c:1428:nodemap_config_set_active()) LBUG
      02:04:11:[15800.716317] Pid: 10421, comm: ll_cfg_requeue
      02:04:11:[15800.718308] 
      02:04:11:[15800.718308] Call Trace:
      02:04:11:[15800.721741]  [<ffffffffa09387d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      02:04:11:[15800.723818]  [<ffffffffa0938d75>] lbug_with_loc+0x45/0xc0 [libcfs]
      02:04:11:[15800.725837]  [<ffffffffa0d34a17>] nodemap_config_set_active+0x2a7/0x2e0 [ptlrpc]
      02:04:11:[15800.727873]  [<ffffffffa0d3d908>] nodemap_config_set_active_mgc+0x38/0x1e0 [ptlrpc]
      02:04:11:[15800.729985]  [<ffffffffa0ca28f0>] ? ptlrpc_request_cache_free+0x90/0x1d0 [ptlrpc]
      02:04:11:[15800.732071]  [<ffffffffa0ca35d5>] ? __ptlrpc_req_finished+0x475/0x690 [ptlrpc]
      02:04:11:[15800.734162]  [<ffffffffa0c43e6b>] mgc_process_recover_nodemap_log+0x34b/0xe10 [mgc]
      02:04:11:[15800.736195]  [<ffffffffa0c46894>] mgc_process_log+0x754/0x880 [mgc]
      02:04:11:[15800.738132]  [<ffffffff816399cd>] ? schedule_timeout+0x17d/0x2d0
      02:04:11:[15800.740126]  [<ffffffffa09439d7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      02:04:11:[15800.742013]  [<ffffffffa0c48908>] mgc_requeue_thread+0x2b8/0x880 [mgc]
      02:04:11:[15800.744113]  [<ffffffff810b8940>] ? default_wake_function+0x0/0x20
      02:04:11:[15800.746313]  [<ffffffffa0c48650>] ? mgc_requeue_thread+0x0/0x880 [mgc]
      02:04:11:[15800.748437]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      02:04:11:[15800.750331]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
      02:04:11:[15800.752203]  [<ffffffff81646c98>] ret_from_fork+0x58/0x90
      02:04:11:[15800.754097]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
      02:04:11:[15800.755897] 
      


      Info required for matching: sanity-sec 9


          Activity


            Gerrit Updater added a comment:

            Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23849
            Subject: LU-8824 nodemap: load nodemap definitions first
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7d5800455161e0d2fca47a1754b7fc734d4a2999

            Peter Jones added a comment:

            Thanks Kit! This is encouraging news.

            Gerrit Updater added a comment:

            Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23778
            Subject: LU-8824 nodemap: properly handle errors loading nodemap conf
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0ae8e3db5cd16acc4f3bde47a896b05a01383c9b

            Kit Westneat added a comment:

            Hi Peter,

            I can get a patch up for the error handling tonight or tomorrow. Fixing the config loading and unloading will take a bit longer, but I'll try to get a patch up by the end of the week.

            Kit

            Peter Jones added a comment:

            Kit,

            This is indeed good news. How are things progressing on the error-handling changes?

            Peter

            Nathaniel Clark added a comment:

            Kit,

            Awesome find.

            EXCEPTing test_9 just delays the ASSERTION until test_15:
            https://testing.hpdd.intel.com/sub_tests/aaedadbe-a888-11e6-b6bd-5254006e85c2

            I think a real fix is necessary for sanity-sec to pass on ZFS.

            Kit Westneat added a comment:

            I think I've figured out what's going on. The config load code expects the index file to return the key/value pairs in key-sorted order, which the ldiskfs index files do. The ZFS index files, however, appear to return the keys in hash-sorted order, at least according to this comment:

            /*
             * XXX: implement support for fixed-size keys sorted with natural
             * numerical way (not using internal hash value)
             */

            We currently embed the config record type in the key so that create records are processed before update records; when the records are not returned in key order, that guarantee breaks.

            I'm going to investigate how easy it would be to modify the config load/send operation to do a two-pass load, in which the create records are loaded first and all other records afterwards.

            Nathaniel Clark added a comment:

            It seems to happen any time sanity-sec is run on ZFS, but not when it is run on ldiskfs.

            Kit Westneat added a comment:

            Thanks for gathering the logs on this, Nathaniel.

            It looks like there is an error-handling issue in mgc_process_recover_nodemap_log. I can fix up the error handling for the nodemap portion, but the handling for the recovery log part is also missing, and I don't feel confident adding it there. Is there someone who can look at that portion?

            I'm not sure what the root cause is yet, though it looks like something to do with the default nodemap not being transferred correctly. Does this LBUG happen on all ZFS full group test runs, or is it more random?

            Fixing up the error handling should be enough for 2.9, though it means that nodemap will be only partially functional on ZFS systems.

            Peter Jones added a comment:

            Kit,

            Could you please advise on this issue and whether we could live with it in 2.9?

            Peter

            Gerrit Updater added a comment:

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/23706
            Subject: LU-8824 test: EXCEPT test_9 till fixed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a469c81df1a8fea2798c8fe7866456c53a53a00d

            People

              Assignee: Kit Westneat (Inactive)
              Reporter: Maloo
              Votes: 0
              Watchers: 6
