Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5361

sanity test_77i: algo set to crc32 instead of adler

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.10.0
    • Lustre 2.9.0
    • None
    • 3
    • 14946

    Description

      This issue was created by maloo for John Hammond <john.hammond@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/307abe62-0cd8-11e4-8ce3-5254006e85c2.

      The sub-test test_77i failed with the following error:

      algo set to crc32 instead of adler

      Info required for matching: sanity 77i

      Attachments

        Issue Links

          Activity

            [LU-5361] sanity test_77i: algo set to crc32 instead of adler

            Ashish Purkar (ashish.purkar@seagate.com) uploaded a new patch: http://review.whamcloud.com/23644
            Subject: LU-5361 tests: Incorrect checksum algorithm reported
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a476729a9ca737bd00d611c870837d2b7977317c

            gerrit Gerrit Updater added a comment - Ashish Purkar (ashish.purkar@seagate.com) uploaded a new patch: http://review.whamcloud.com/23644 Subject: LU-5361 tests: Incorrect checksum algorithm reported Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a476729a9ca737bd00d611c870837d2b7977317c

            The algorithm used depends on the CPU hardware support. It may be that CRC32 is a lot faster than Adler, so Adler is not available. That probably needs a fix to the test.

            adilger Andreas Dilger added a comment - The algorithm used depends on the CPU hardware support. It may be that CRC32 is a lot faster than Adler, so Adler is not available. That probably needs a fix to the test.
            emoly.liu Emoly Liu added a comment - Another failure at https://testing.hpdd.intel.com/test_sets/1654529c-7e46-11e5-965a-5254006e85c2
            jamesanunez James Nunez (Inactive) added a comment - - edited Another failure on master at https://testing.hpdd.intel.com/test_sets/1654529c-7e46-11e5-965a-5254006e85c2 2015-11-13 21:15:03 - https://testing.hpdd.intel.com/test_sets/aae313cc-8a66-11e5-ba42-5254006e85c2 2016-01-21 15:34:52 - https://testing.hpdd.intel.com/test_sets/70bbba40-c06d-11e5-acc7-5254006e85c2 2016-01-21 23:10:38 - https://testing.hpdd.intel.com/test_sets/cb1f99b2-c0ad-11e5-b0fe-5254006e85c2 2016-02-01 15:26:49 - https://testing.hpdd.intel.com/test_sets/11b856f0-c911-11e5-bedf-5254006e85c2 2016-02-08 08:55:14 - https://testing.hpdd.intel.com/test_sets/2c566f30-ce8c-11e5-af15-5254006e85c2 2016-02-08 16:43:35 - https://testing.hpdd.intel.com/test_sets/2a62fb4c-ce9d-11e5-bfa1-5254006e85c2
            bfaccini Bruno Faccini (Inactive) added a comment - +1 again on master : https://testing.hpdd.intel.com/test_sets/aa999490-764d-11e5-86b7-5254006e85c2
            yong.fan nasf (Inactive) added a comment - Another failure instance on master: https://testing.hpdd.intel.com/test_sets/269d217c-738a-11e5-ae2e-5254006e85c2
            jamesanunez James Nunez (Inactive) added a comment - - edited Another instance of this error: 2015-07-16 03:17:56 - https://testing.hpdd.intel.com/test_sets/156e74ae-2b83-11e5-b0b2-5254006e85c2 2015-07-20 15:59:05 - https://testing.hpdd.intel.com/test_sets/03de714c-2f13-11e5-bc70-5254006e85c2 2015-08-21 10:53:05 - https://testing.hpdd.intel.com/test_sets/7461086c-483b-11e5-813b-5254006e85c2

            Note also that this test failure appears to have been around a long time and is unlikely to be (directly) related to the checksum algorithm changes. LU-3243 predates it, and was closed when the debug patch was landed, even though the comment said:

            I submitted http://review.whamcloud.com/6205 to improve the error message, but that is not fixing the bug here.

            I also see several cases of this test failing if there is an error reading from /proc, but I think those predate the 6205 patch, which prints out the actual checksum algorithm used and doesn't just fail if '[adler]' is not found.

            adilger Andreas Dilger added a comment - Note also that this test failure appears to have been around a long time and is unlikely to be (directly) related to the checksum algorithm changes. LU-3243 predates it, and was closed when the debug patch was landed, even though the comment said: I submitted http://review.whamcloud.com/6205 to improve the error message, but that is not fixing the bug here. I also see several cases of this test failing if there is an error reading from /proc, but I think those predate the 6205 patch, which prints out the actual checksum algorithm used and doesn't just fail if ' [adler] ' is not found.
            adilger Andreas Dilger added a comment - - edited

            When fixing this, the test_77i() code could be cleaned up a bit. It uses cut -d= when it could be using lctl get_param -N to just print the parameter names. It should also use $(...) for subshells instead of `...`, and it should print out the OSC device on which the failure(s) happened.

            Actually, the whole check could be simplified to:

                    lctl get_param osc.*osc-[^mM]*.checksum_type | grep -v '\[adler\]' &&
                             error "checksum algorithm not set to adler"
            
            adilger Andreas Dilger added a comment - - edited When fixing this, the test_77i() code could be cleaned up a bit. It uses cut -d= when it could be using lctl get_param -N to just print the parameter names. It should also use $(...) for subshells instead of `...`, and it should print out the OSC device on which the failure(s) happened. Actually, the whole check could be simplified to: lctl get_param osc.*osc-[^mM]*.checksum_type | grep -v '\[adler\]' && error "checksum algorithm not set to adler"

            When the test was originally written, the "adler" algorithm was always available since the availability had nothing to do with performance. The change to make only a subset of algorithms available to the client was done in 2.2.58 so that clients would not choose the locally-fastest algorithm at the expense of selecting a slow algorithm on the server.

            That said, this test should be exercising the client checksum selection if OBD_CONNECT_CKSUM is not available, in which case it should select adler:

            test_77i() { # bug 13805
                    #define OBD_FAIL_OSC_CONNECT_CKSUM       0x40b
                    lctl set_param fail_loc=0x40b
                    remount_client $MOUNT
                    lctl set_param fail_loc=0
                    for VALUE in `lctl get_param osc.*osc-[^mM]*.checksum_type`; do
                            PARAM=`echo ${VALUE[0]} | cut -d "=" -f1`
                            algo=`lctl get_param -n $PARAM | sed 's/.*\[\(.*\)\].*/\1/g'`
                            [ "$algo" = "adler" ] || error "algo set to $algo instead of adler"
                    done
                    remount_client $MOUNT
            }
            run_test 77i "client not supporting OSD_CONNECT_CKSUM"
            
                    if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_CKSUM)) {
                            /* OBD_CONNECT_CKSUM should always be set, even if checksums are
                             * disabled by default, because it can still be enabled on the
                             * fly via /proc. As a consequence, we still need to come to an
                             * agreement on the supported algorithms at connect time */
                            data->ocd_connect_flags |= OBD_CONNECT_CKSUM;
                    
                            if (OBD_FAIL_CHECK(OBD_FAIL_OSC_CKSUM_ADLER_ONLY))
                                    data->ocd_cksum_types = OBD_CKSUM_ADLER;
                            else
                                    data->ocd_cksum_types = cksum_types_supported_client();
                    }
            
                            if (ocd->ocd_connect_flags & OBD_CONNECT_CKSUM) {
                                    /* We sent to the server ocd_cksum_types with bits set
                                     * for algorithms we understand. The server masked off
                                     * the checksum types it doesn't support */
                                    :
                                    :
                            } else {
                                    /* The server does not support OBD_CONNECT_CKSUM.
                                     * Enforce ADLER for backward compatibility*/
                                    cli->cl_supp_cksum_types = OBD_CKSUM_ADLER;
                            }
            

            though the old comments and debug message in ofd_parse_connect_data() would have you think otherwise (they were not updated as part of http://review.whamcloud.com/3098 when the fallback was changed from crc32 to adler), but that should not actually have any effect on operation:

                    } else {             
                            /* This client does not support OBD_CONNECT_CKSUM
                             * fall back to CRC32 */      
                            CDEBUG(D_RPCTRACE, "%s: cli %s does not support "
                                   "OBD_CONNECT_CKSUM, CRC32 will be used\n",
                                   exp->exp_obd->obd_name, obd_export_nid2str(exp));
                    }
            

            So this test shouldn't have anything to do with the relative speed of the algorithms.

            It may be that the test is racy and the fail_loc is cleared before all of the OSC connections have been made? It might make sense to move the fail_loc=0 line to just before the second remount to avoid this problem.

            adilger Andreas Dilger added a comment - When the test was originally written, the "adler" algorithm was always available since the availability had nothing to do with performance. The change to make only a subset of algorithms available to the client was done in 2.2.58 so that clients would not choose the locally-fastest algorithm at the expense of selecting a slow algorithm on the server. That said, this test should be exercising the client checksum selection if OBD_CONNECT_CKSUM is not available, in which case it should select adler: test_77i() { # bug 13805 #define OBD_FAIL_OSC_CONNECT_CKSUM 0x40b lctl set_param fail_loc=0x40b remount_client $MOUNT lctl set_param fail_loc=0 for VALUE in `lctl get_param osc.*osc-[^mM]*.checksum_type`; do PARAM=`echo ${VALUE[0]} | cut -d "=" -f1` algo=`lctl get_param -n $PARAM | sed 's/.*\[\(.*\)\].*/\1/g' ` [ "$algo" = "adler" ] || error "algo set to $algo instead of adler" done remount_client $MOUNT } run_test 77i "client not supporting OSD_CONNECT_CKSUM" if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_CKSUM)) { /* OBD_CONNECT_CKSUM should always be set, even if checksums are * disabled by default , because it can still be enabled on the * fly via /proc. As a consequence, we still need to come to an * agreement on the supported algorithms at connect time */ data->ocd_connect_flags |= OBD_CONNECT_CKSUM; if (OBD_FAIL_CHECK(OBD_FAIL_OSC_CKSUM_ADLER_ONLY)) data->ocd_cksum_types = OBD_CKSUM_ADLER; else data->ocd_cksum_types = cksum_types_supported_client(); } if (ocd->ocd_connect_flags & OBD_CONNECT_CKSUM) { /* We sent to the server ocd_cksum_types with bits set * for algorithms we understand. The server masked off * the checksum types it doesn't support */ : : } else { /* The server does not support OBD_CONNECT_CKSUM. * Enforce ADLER for backward compatibility*/ cli->cl_supp_cksum_types = OBD_CKSUM_ADLER; } though the old comments and debug message in ofd_parse_connect_data() would have you think otherwise (they were not updated as part of http://review.whamcloud.com/3098 when the fallback was changed from crc32 to adler), but that should not actually have any effect on operation: } else { /* This client does not support OBD_CONNECT_CKSUM * fall back to CRC32 */ CDEBUG(D_RPCTRACE, "%s: cli %s does not support " "OBD_CONNECT_CKSUM, CRC32 will be used\n" , exp->exp_obd->obd_name, obd_export_nid2str(exp)); } So this test shouldn't have anything to do with the relative speed of the algorithms. It may be that the test is racy and the fail_loc is cleared before all of the OSC connections have been made? It might make sense to move the fail_loc=0 line to just before the second remount to avoid this problem.
            jhammond John Hammond added a comment -

            Note also that we have recently changed the crypto speed test code. See LU-5063 and LU-5279.

            jhammond John Hammond added a comment - Note also that we have recently changed the crypto speed test code. See LU-5063 and LU-5279 .

            People

              wc-triage WC Triage
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: