[LU-5361] sanity test_77i: algo set to crc32 instead of adler - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.10.0
Affects Version/s: Lustre 2.9.0
Labels:
None

Severity:
3
Rank (Obsolete):
14946

Description

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/307abe62-0cd8-11e4-8ce3-5254006e85c2.

The sub-test test_77i failed with the following error:

algo set to crc32 instead of adler

Info required for matching: sanity 77i

Attachments

Issue Links

is related to

LU-3243 sanity.sh test_77i: algo set to crc32 instead of adler

Resolved

mentioned in: Page No Confluence page found with the given URL.; Page No Confluence page found with the given URL.; Page No Confluence page found with the given URL.; Page No Confluence page found with the given URL.; Page Loading...

(1 mentioned in)

Activity

[LU-5361] sanity test_77i: algo set to crc32 instead of adler

Gerrit Updater added a comment - 08/Nov/16 10:03 AM

Ashish Purkar (ashish.purkar@seagate.com) uploaded a new patch: http://review.whamcloud.com/23644
Subject: ~~LU-5361~~ tests: Incorrect checksum algorithm reported
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a476729a9ca737bd00d611c870837d2b7977317c

Gerrit Updater added a comment - 08/Nov/16 10:03 AM Ashish Purkar (ashish.purkar@seagate.com) uploaded a new patch: http://review.whamcloud.com/23644 Subject: LU-5361 tests: Incorrect checksum algorithm reported Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a476729a9ca737bd00d611c870837d2b7977317c

Andreas Dilger added a comment - 28/Jun/16 10:16 PM

The algorithm used depends on the CPU hardware support. It may be that CRC32 is a lot faster than Adler, so Adler is not available. That probably needs a fix to the test.

Andreas Dilger added a comment - 28/Jun/16 10:16 PM The algorithm used depends on the CPU hardware support. It may be that CRC32 is a lot faster than Adler, so Adler is not available. That probably needs a fix to the test.

Emoly Liu added a comment - 02/Nov/15 2:42 AM

Another failure at https://testing.hpdd.intel.com/test_sets/1654529c-7e46-11e5-965a-5254006e85c2

Emoly Liu added a comment - 02/Nov/15 2:42 AM Another failure at https://testing.hpdd.intel.com/test_sets/1654529c-7e46-11e5-965a-5254006e85c2

James Nunez (Inactive) added a comment - 30/Oct/15 2:48 PM - edited

Another failure on master at https://testing.hpdd.intel.com/test_sets/1654529c-7e46-11e5-965a-5254006e85c2
2015-11-13 21:15:03 - https://testing.hpdd.intel.com/test_sets/aae313cc-8a66-11e5-ba42-5254006e85c2
2016-01-21 15:34:52 - https://testing.hpdd.intel.com/test_sets/70bbba40-c06d-11e5-acc7-5254006e85c2
2016-01-21 23:10:38 - https://testing.hpdd.intel.com/test_sets/cb1f99b2-c0ad-11e5-b0fe-5254006e85c2
2016-02-01 15:26:49 - https://testing.hpdd.intel.com/test_sets/11b856f0-c911-11e5-bedf-5254006e85c2
2016-02-08 08:55:14 - https://testing.hpdd.intel.com/test_sets/2c566f30-ce8c-11e5-af15-5254006e85c2
2016-02-08 16:43:35 - https://testing.hpdd.intel.com/test_sets/2a62fb4c-ce9d-11e5-bfa1-5254006e85c2

James Nunez (Inactive) added a comment - 30/Oct/15 2:48 PM - edited Another failure on master at https://testing.hpdd.intel.com/test_sets/1654529c-7e46-11e5-965a-5254006e85c2 2015-11-13 21:15:03 - https://testing.hpdd.intel.com/test_sets/aae313cc-8a66-11e5-ba42-5254006e85c2 2016-01-21 15:34:52 - https://testing.hpdd.intel.com/test_sets/70bbba40-c06d-11e5-acc7-5254006e85c2 2016-01-21 23:10:38 - https://testing.hpdd.intel.com/test_sets/cb1f99b2-c0ad-11e5-b0fe-5254006e85c2 2016-02-01 15:26:49 - https://testing.hpdd.intel.com/test_sets/11b856f0-c911-11e5-bedf-5254006e85c2 2016-02-08 08:55:14 - https://testing.hpdd.intel.com/test_sets/2c566f30-ce8c-11e5-af15-5254006e85c2 2016-02-08 16:43:35 - https://testing.hpdd.intel.com/test_sets/2a62fb4c-ce9d-11e5-bfa1-5254006e85c2

Bruno Faccini (Inactive) added a comment - 19/Oct/15 1:30 PM

+1 again on master : https://testing.hpdd.intel.com/test_sets/aa999490-764d-11e5-86b7-5254006e85c2

Bruno Faccini (Inactive) added a comment - 19/Oct/15 1:30 PM +1 again on master : https://testing.hpdd.intel.com/test_sets/aa999490-764d-11e5-86b7-5254006e85c2

nasf (Inactive) added a comment - 16/Oct/15 3:00 AM

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/269d217c-738a-11e5-ae2e-5254006e85c2

nasf (Inactive) added a comment - 16/Oct/15 3:00 AM Another failure instance on master: https://testing.hpdd.intel.com/test_sets/269d217c-738a-11e5-ae2e-5254006e85c2

James Nunez (Inactive) added a comment - 16/Jul/15 3:22 PM - edited

Another instance of this error:
2015-07-16 03:17:56 - https://testing.hpdd.intel.com/test_sets/156e74ae-2b83-11e5-b0b2-5254006e85c2
2015-07-20 15:59:05 - https://testing.hpdd.intel.com/test_sets/03de714c-2f13-11e5-bc70-5254006e85c2
2015-08-21 10:53:05 - https://testing.hpdd.intel.com/test_sets/7461086c-483b-11e5-813b-5254006e85c2

James Nunez (Inactive) added a comment - 16/Jul/15 3:22 PM - edited Another instance of this error: 2015-07-16 03:17:56 - https://testing.hpdd.intel.com/test_sets/156e74ae-2b83-11e5-b0b2-5254006e85c2 2015-07-20 15:59:05 - https://testing.hpdd.intel.com/test_sets/03de714c-2f13-11e5-bc70-5254006e85c2 2015-08-21 10:53:05 - https://testing.hpdd.intel.com/test_sets/7461086c-483b-11e5-813b-5254006e85c2

Andreas Dilger added a comment - 16/Jul/14 11:51 PM

Note also that this test failure appears to have been around a long time and is unlikely to be (directly) related to the checksum algorithm changes. ~~LU-3243~~ predates it, and was closed when the debug patch was landed, even though the comment said:

I submitted http://review.whamcloud.com/6205 to improve the error message, but that is not fixing the bug here.

I also see several cases of this test failing if there is an error reading from /proc, but I think those predate the 6205 patch, which prints out the actual checksum algorithm used and doesn't just fail if '[adler]' is not found.

Andreas Dilger added a comment - 16/Jul/14 11:51 PM Note also that this test failure appears to have been around a long time and is unlikely to be (directly) related to the checksum algorithm changes. LU-3243 predates it, and was closed when the debug patch was landed, even though the comment said: I submitted http://review.whamcloud.com/6205 to improve the error message, but that is not fixing the bug here. I also see several cases of this test failing if there is an error reading from /proc, but I think those predate the 6205 patch, which prints out the actual checksum algorithm used and doesn't just fail if ' [adler] ' is not found.

Andreas Dilger added a comment - 16/Jul/14 11:33 PM - edited

When fixing this, the test_77i() code could be cleaned up a bit. It uses cut -d= when it could be using lctl get_param -N to just print the parameter names. It should also use $(...) for subshells instead of `...`, and it should print out the OSC device on which the failure(s) happened.

Actually, the whole check could be simplified to:

        lctl get_param osc.*osc-[^mM]*.checksum_type | grep -v '\[adler\]' &&
                 error "checksum algorithm not set to adler"

Andreas Dilger added a comment - 16/Jul/14 11:33 PM - edited When fixing this, the test_77i() code could be cleaned up a bit. It uses cut -d= when it could be using lctl get_param -N to just print the parameter names. It should also use $(...) for subshells instead of `...`, and it should print out the OSC device on which the failure(s) happened. Actually, the whole check could be simplified to: lctl get_param osc.*osc-[^mM]*.checksum_type | grep -v '\[adler\]' && error "checksum algorithm not set to adler"

Andreas Dilger added a comment - 16/Jul/14 11:24 PM

When the test was originally written, the "adler" algorithm was always available since the availability had nothing to do with performance. The change to make only a subset of algorithms available to the client was done in 2.2.58 so that clients would not choose the locally-fastest algorithm at the expense of selecting a slow algorithm on the server.

That said, this test should be exercising the client checksum selection if OBD_CONNECT_CKSUM is not available, in which case it should select adler:

test_77i() { # bug 13805
        #define OBD_FAIL_OSC_CONNECT_CKSUM       0x40b
        lctl set_param fail_loc=0x40b
        remount_client $MOUNT
        lctl set_param fail_loc=0
        for VALUE in `lctl get_param osc.*osc-[^mM]*.checksum_type`; do
                PARAM=`echo ${VALUE[0]} | cut -d "=" -f1`
                algo=`lctl get_param -n $PARAM | sed 's/.*\[\(.*\)\].*/\1/g'`
                [ "$algo" = "adler" ] || error "algo set to $algo instead of adler"
        done
        remount_client $MOUNT
}
run_test 77i "client not supporting OSD_CONNECT_CKSUM"

        if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_CKSUM)) {
                /* OBD_CONNECT_CKSUM should always be set, even if checksums are
                 * disabled by default, because it can still be enabled on the
                 * fly via /proc. As a consequence, we still need to come to an
                 * agreement on the supported algorithms at connect time */
                data->ocd_connect_flags |= OBD_CONNECT_CKSUM;
        
                if (OBD_FAIL_CHECK(OBD_FAIL_OSC_CKSUM_ADLER_ONLY))
                        data->ocd_cksum_types = OBD_CKSUM_ADLER;
                else
                        data->ocd_cksum_types = cksum_types_supported_client();
        }

                if (ocd->ocd_connect_flags & OBD_CONNECT_CKSUM) {
                        /* We sent to the server ocd_cksum_types with bits set
                         * for algorithms we understand. The server masked off
                         * the checksum types it doesn't support */
                        :
                        :
                } else {
                        /* The server does not support OBD_CONNECT_CKSUM.
                         * Enforce ADLER for backward compatibility*/
                        cli->cl_supp_cksum_types = OBD_CKSUM_ADLER;
                }

though the old comments and debug message in ofd_parse_connect_data() would have you think otherwise (they were not updated as part of http://review.whamcloud.com/3098 when the fallback was changed from crc32 to adler), but that should not actually have any effect on operation:

        } else {             
                /* This client does not support OBD_CONNECT_CKSUM
                 * fall back to CRC32 */      
                CDEBUG(D_RPCTRACE, "%s: cli %s does not support "
                       "OBD_CONNECT_CKSUM, CRC32 will be used\n",
                       exp->exp_obd->obd_name, obd_export_nid2str(exp));
        }

So this test shouldn't have anything to do with the relative speed of the algorithms.

It may be that the test is racy and the fail_loc is cleared before all of the OSC connections have been made? It might make sense to move the fail_loc=0 line to just before the second remount to avoid this problem.

Andreas Dilger added a comment - 16/Jul/14 11:24 PM When the test was originally written, the "adler" algorithm was always available since the availability had nothing to do with performance. The change to make only a subset of algorithms available to the client was done in 2.2.58 so that clients would not choose the locally-fastest algorithm at the expense of selecting a slow algorithm on the server. That said, this test should be exercising the client checksum selection if OBD_CONNECT_CKSUM is not available, in which case it should select adler: test_77i() { # bug 13805 #define OBD_FAIL_OSC_CONNECT_CKSUM 0x40b lctl set_param fail_loc=0x40b remount_client $MOUNT lctl set_param fail_loc=0 for VALUE in `lctl get_param osc.*osc-[^mM]*.checksum_type`; do PARAM=`echo ${VALUE[0]} | cut -d "=" -f1` algo=`lctl get_param -n $PARAM | sed 's/.*\[$.*$\].*/\1/g' ` [ "$algo" = "adler" ] || error "algo set to $algo instead of adler" done remount_client $MOUNT } run_test 77i "client not supporting OSD_CONNECT_CKSUM" if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_CKSUM)) { /* OBD_CONNECT_CKSUM should always be set, even if checksums are * disabled by default , because it can still be enabled on the * fly via /proc. As a consequence, we still need to come to an * agreement on the supported algorithms at connect time */ data->ocd_connect_flags |= OBD_CONNECT_CKSUM; if (OBD_FAIL_CHECK(OBD_FAIL_OSC_CKSUM_ADLER_ONLY)) data->ocd_cksum_types = OBD_CKSUM_ADLER; else data->ocd_cksum_types = cksum_types_supported_client(); } if (ocd->ocd_connect_flags & OBD_CONNECT_CKSUM) { /* We sent to the server ocd_cksum_types with bits set * for algorithms we understand. The server masked off * the checksum types it doesn't support */ : : } else { /* The server does not support OBD_CONNECT_CKSUM. * Enforce ADLER for backward compatibility*/ cli->cl_supp_cksum_types = OBD_CKSUM_ADLER; } though the old comments and debug message in ofd_parse_connect_data() would have you think otherwise (they were not updated as part of http://review.whamcloud.com/3098 when the fallback was changed from crc32 to adler), but that should not actually have any effect on operation: } else { /* This client does not support OBD_CONNECT_CKSUM * fall back to CRC32 */ CDEBUG(D_RPCTRACE, "%s: cli %s does not support " "OBD_CONNECT_CKSUM, CRC32 will be used\n" , exp->exp_obd->obd_name, obd_export_nid2str(exp)); } So this test shouldn't have anything to do with the relative speed of the algorithms. It may be that the test is racy and the fail_loc is cleared before all of the OSC connections have been made? It might make sense to move the fail_loc=0 line to just before the second remount to avoid this problem.

John Hammond added a comment - 16/Jul/14 2:36 PM

Note also that we have recently changed the crypto speed test code. See LU-5063 and ~~LU-5279~~.

John Hammond added a comment - 16/Jul/14 2:36 PM Note also that we have recently changed the crypto speed test code. See LU-5063 and LU-5279 .

People

Assignee:: WC Triage

Reporter:: Maloo

Votes:: 0 Vote for this issue

Watchers:: 14 Start watching this issue

Dates

Created:: 16/Jul/14 2:30 PM

Updated:: 22/May/18 6:01 AM

Resolved:: 14/Mar/17 6:05 AM