[LU-5361] sanity test_77i: algo set to crc32 instead of adler Created: 16/Jul/14  Updated: 22/May/18  Resolved: 14/Mar/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3243 sanity.sh test_77i: algo set to crc32... Resolved
Severity: 3
Rank (Obsolete): 14946

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/307abe62-0cd8-11e4-8ce3-5254006e85c2.

The sub-test test_77i failed with the following error:

algo set to crc32 instead of adler

Info required for matching: sanity 77i



 Comments   
Comment by John Hammond [ 16/Jul/14 ]

Seems like a bogus test to me. There's no guarantee that adler will be faster than crc32 all the time.

Comment by James A Simmons [ 16/Jul/14 ]

Completely agree. On systems with hardware accelerated crc32 it will beat alder32 hands down.

Comment by John Hammond [ 16/Jul/14 ]

Note also that we have recently changed the crypto speed test code. See LU-5063 and LU-5279.

Comment by Andreas Dilger [ 16/Jul/14 ]

When the test was originally written, the "adler" algorithm was always available since the availability had nothing to do with performance. The change to make only a subset of algorithms available to the client was done in 2.2.58 so that clients would not choose the locally-fastest algorithm at the expense of selecting a slow algorithm on the server.

That said, this test should be exercising the client checksum selection if OBD_CONNECT_CKSUM is not available, in which case it should select adler:

test_77i() { # bug 13805
        #define OBD_FAIL_OSC_CONNECT_CKSUM       0x40b
        lctl set_param fail_loc=0x40b
        remount_client $MOUNT
        lctl set_param fail_loc=0
        for VALUE in `lctl get_param osc.*osc-[^mM]*.checksum_type`; do
                PARAM=`echo ${VALUE[0]} | cut -d "=" -f1`
                algo=`lctl get_param -n $PARAM | sed 's/.*\[\(.*\)\].*/\1/g'`
                [ "$algo" = "adler" ] || error "algo set to $algo instead of adler"
        done
        remount_client $MOUNT
}
run_test 77i "client not supporting OSD_CONNECT_CKSUM"
        if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_CKSUM)) {
                /* OBD_CONNECT_CKSUM should always be set, even if checksums are
                 * disabled by default, because it can still be enabled on the
                 * fly via /proc. As a consequence, we still need to come to an
                 * agreement on the supported algorithms at connect time */
                data->ocd_connect_flags |= OBD_CONNECT_CKSUM;
        
                if (OBD_FAIL_CHECK(OBD_FAIL_OSC_CKSUM_ADLER_ONLY))
                        data->ocd_cksum_types = OBD_CKSUM_ADLER;
                else
                        data->ocd_cksum_types = cksum_types_supported_client();
        }
                if (ocd->ocd_connect_flags & OBD_CONNECT_CKSUM) {
                        /* We sent to the server ocd_cksum_types with bits set
                         * for algorithms we understand. The server masked off
                         * the checksum types it doesn't support */
                        :
                        :
                } else {
                        /* The server does not support OBD_CONNECT_CKSUM.
                         * Enforce ADLER for backward compatibility*/
                        cli->cl_supp_cksum_types = OBD_CKSUM_ADLER;
                }

though the old comments and debug message in ofd_parse_connect_data() would have you think otherwise (they were not updated as part of http://review.whamcloud.com/3098 when the fallback was changed from crc32 to adler), but that should not actually have any effect on operation:

        } else {             
                /* This client does not support OBD_CONNECT_CKSUM
                 * fall back to CRC32 */      
                CDEBUG(D_RPCTRACE, "%s: cli %s does not support "
                       "OBD_CONNECT_CKSUM, CRC32 will be used\n",
                       exp->exp_obd->obd_name, obd_export_nid2str(exp));
        }

So this test shouldn't have anything to do with the relative speed of the algorithms.

It may be that the test is racy and the fail_loc is cleared before all of the OSC connections have been made? It might make sense to move the fail_loc=0 line to just before the second remount to avoid this problem.

Comment by Andreas Dilger [ 16/Jul/14 ]

When fixing this, the test_77i() code could be cleaned up a bit. It uses cut -d= when it could be using lctl get_param -N to just print the parameter names. It should also use $(...) for subshells instead of `...`, and it should print out the OSC device on which the failure(s) happened.

Actually, the whole check could be simplified to:

        lctl get_param osc.*osc-[^mM]*.checksum_type | grep -v '\[adler\]' &&
                 error "checksum algorithm not set to adler"
Comment by Andreas Dilger [ 16/Jul/14 ]

Note also that this test failure appears to have been around a long time and is unlikely to be (directly) related to the checksum algorithm changes. LU-3243 predates it, and was closed when the debug patch was landed, even though the comment said:

I submitted http://review.whamcloud.com/6205 to improve the error message, but that is not fixing the bug here.

I also see several cases of this test failing if there is an error reading from /proc, but I think those predate the 6205 patch, which prints out the actual checksum algorithm used and doesn't just fail if '[adler]' is not found.

Comment by James Nunez (Inactive) [ 16/Jul/15 ]

Another instance of this error:
2015-07-16 03:17:56 - https://testing.hpdd.intel.com/test_sets/156e74ae-2b83-11e5-b0b2-5254006e85c2
2015-07-20 15:59:05 - https://testing.hpdd.intel.com/test_sets/03de714c-2f13-11e5-bc70-5254006e85c2
2015-08-21 10:53:05 - https://testing.hpdd.intel.com/test_sets/7461086c-483b-11e5-813b-5254006e85c2

Comment by nasf (Inactive) [ 16/Oct/15 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/269d217c-738a-11e5-ae2e-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 19/Oct/15 ]

+1 again on master : https://testing.hpdd.intel.com/test_sets/aa999490-764d-11e5-86b7-5254006e85c2

Comment by James Nunez (Inactive) [ 30/Oct/15 ]

Another failure on master at https://testing.hpdd.intel.com/test_sets/1654529c-7e46-11e5-965a-5254006e85c2
2015-11-13 21:15:03 - https://testing.hpdd.intel.com/test_sets/aae313cc-8a66-11e5-ba42-5254006e85c2
2016-01-21 15:34:52 - https://testing.hpdd.intel.com/test_sets/70bbba40-c06d-11e5-acc7-5254006e85c2
2016-01-21 23:10:38 - https://testing.hpdd.intel.com/test_sets/cb1f99b2-c0ad-11e5-b0fe-5254006e85c2
2016-02-01 15:26:49 - https://testing.hpdd.intel.com/test_sets/11b856f0-c911-11e5-bedf-5254006e85c2
2016-02-08 08:55:14 - https://testing.hpdd.intel.com/test_sets/2c566f30-ce8c-11e5-af15-5254006e85c2
2016-02-08 16:43:35 - https://testing.hpdd.intel.com/test_sets/2a62fb4c-ce9d-11e5-bfa1-5254006e85c2

Comment by Emoly Liu [ 02/Nov/15 ]

Another failure at https://testing.hpdd.intel.com/test_sets/1654529c-7e46-11e5-965a-5254006e85c2

Comment by Andreas Dilger [ 28/Jun/16 ]

The algorithm used depends on the CPU hardware support. It may be that CRC32 is a lot faster than Adler, so Adler is not available. That probably needs a fix to the test.

Comment by Gerrit Updater [ 08/Nov/16 ]

Ashish Purkar (ashish.purkar@seagate.com) uploaded a new patch: http://review.whamcloud.com/23644
Subject: LU-5361 tests: Incorrect checksum algorithm reported
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a476729a9ca737bd00d611c870837d2b7977317c

Comment by Jian Yu [ 05/Dec/16 ]

One more failure instance on master branch:

https://testing.hpdd.intel.com/test_sets/a136d632-b36a-11e6-a0b8-5254006e85c2

Comment by Gerrit Updater [ 03/Mar/17 ]

Abrarahmed Momin (kais_abrar@yahoo.co.in) uploaded a new patch: https://review.whamcloud.com/25745
Subject: LU-5361 llite: Remove OBD_FAIL_OSC_CONNECT_CKSUM
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b3e0a5be1edf02b8facb8928b746c23e17a376c3

Comment by Gerrit Updater [ 14/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23644/
Subject: LU-5361 llite: Remove OBD_FAIL_OSC_CONNECT_CKSUM
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5c3151841520001b75e908af0fc3446d80853969

Comment by Peter Jones [ 14/Mar/17 ]

Landed for 2.10

Generated at Sat Feb 10 01:50:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.