
LU-9234: replay-single test_70f: checksum doesn't match

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
    • Labels: None
    • Environment: onyx-35vm3 through onyx-35vm6, interop test,
      RHEL7.3/ldiskfs, branch master, v2.9.54, b3541, 2.9 Lustre servers,
      2.10 Lustre clients
    • Severity: 3

    Description

      https://testing.hpdd.intel.com/test_sessions/1a6bc6e8-0a05-11e7-9053-5254006e85c2

      After unmounting and remounting an OST, the client detects a checksum mismatch:

      test_log:

      CMD: onyx-35vm4 umount -d /mnt/lustre-ost1
      CMD: onyx-35vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      reboot facets: ost1
      Failover ost1 to onyx-35vm4
      03:10:44 (1489572644) waiting for onyx-35vm4 network 900 secs ...
      03:10:44 (1489572644) network interface is UP
      CMD: onyx-35vm4 hostname
      mount facets: ost1
      CMD: onyx-35vm4 test -b /dev/lvm-Role_OSS/P1
      CMD: onyx-35vm4 e2label /dev/lvm-Role_OSS/P1
      Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
      CMD: onyx-35vm4 mkdir -p /mnt/lustre-ost1; mount -t lustre
      

      followed by:

      CMD: onyx-35vm6 md5sum /mnt/lustre/d70f.replay-single/f70f.replay-single.onyx-35vm5.onyx.hpdd.intel.com
      onyx-35vm5: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 3 sec
      onyx-35vm6: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 3 sec
       replay-single test_70f: @@@@@@ FAIL: /mnt/lustre/d70f.replay-single/f70f.replay-single.onyx-35vm5.onyx.hpdd.intel.com: checksum doesn't match on onyx-35vm6 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4841:error()
        = /usr/lib64/lustre/tests/replay-single.sh:2334:test_70f_write_and_read()
        = /usr/lib64/lustre/tests/replay-single.sh:2350:test_70f_loop()
        = /usr/lib64/lustre/tests/replay-single.sh:2394:test_70f()
        = /usr/lib64/lustre/tests/test-framework.sh:5117:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5156:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5003:run_test()
        = /usr/lib64/lustre/tests/replay-single.sh:2415:main()
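
      For context, test_70f is a simple write/fail-over/read round trip. A simplified sketch of the pattern (the real loop lives in replay-single.sh; fail and error are test-framework.sh helpers, and the real test writes on one client and verifies on another):

        # Simplified sketch of the test_70f write-and-read pattern; assumes
        # test-framework.sh is sourced so that fail/error are defined.
        tfile=/mnt/lustre/d70f.replay-single/f70f.replay-single.$(hostname)
        dd if=/dev/urandom of=$tfile bs=1M count=1     # write test data
        sum_before=$(md5sum $tfile | cut -d' ' -f1)    # record checksum
        fail ost1                                      # umount/remount the OST, wait for recovery
        sum_after=$(md5sum $tfile | cut -d' ' -f1)     # re-read after replay
        [ "$sum_before" = "$sum_after" ] ||
            error "$tfile: checksum doesn't match"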
      

    Activity


            jamesanunez James Nunez (Inactive) added a comment -

            There seems to still be an issue with checksums that shows up in replay-single test 87a, but I'll open a new ticket for that issue.
            jcasper James Casper (Inactive) added a comment -

            Seen again in 2.10.51 (b3620): https://testing.hpdd.intel.com/test_sessions/c4874eda-04c9-40c5-9e92-b8e7574bd5fe
            pjones Peter Jones added a comment -

            Landed for 2.10

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26739/
            Subject: LU-9234 test: Skip test_70f if OSS version is older than 2.9.53
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 00db1ffe72bc1f4504adfaba539a1ec4f0fde74b

            gerrit Gerrit Updater added a comment -

            Wei Liu (wei3.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/26739
            Subject: LU-9234 test: Skip test_70f if OSS version is older than 2.9.53
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 70d77276b26a9a92621bc4aef4f3f04cb6da310f
            bobijam Zhenyu Xu added a comment -

            replay-single test 70f only works with OSS server versions after 2.9.52.60 (the issue is LU-1573, fixed by patch #16680, git commit 1d2fbade1b658db4386091e7938d9483f7aa4a05), so 2.8/2.9 servers do not contain this fix.

            As for the master server failure case, I checked the Maloo report; its OSS version is 2.9.52.54.gc6f5e81, which does not have this fix either.
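
            The landed patch (https://review.whamcloud.com/26739) gates the test on the OSS version. A minimal sketch of such a guard, assuming the standard test-framework.sh helpers lustre_version_code, version_code and skip (the exact code in the patch may differ):

                # Skip test_70f on OSS versions that predate the LU-1573 fix.
                # lustre_version_code, version_code and skip come from test-framework.sh.
                [ $(lustre_version_code ost1) -lt $(version_code 2.9.53) ] &&
                        skip "Need OSS version at least 2.9.53" && return 0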
            pjones Peter Jones added a comment -

            Bobijam

            Could you please advise on this one?

            Thanks

            Peter
            jamesanunez James Nunez (Inactive) added a comment - edited

            In the console for the OSTs, we see:

            03:11:42:[18142.619725] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 3 clients reconnect
            03:11:42:[18144.679276] LustreError: 32511:0:(ofd_grant.c:686:ofd_grant_check()) lustre-OST0000: cli df47be9e-3378-d13b-7555-c154ed48e9ba is replaying OST_WRITE while one rnb hasn't OBD_BRW_FROM_GRANT set (0x108)
            03:11:42:[18144.711883] LustreError: 168-f: BAD WRITE CHECKSUM: lustre-OST0000 from 12345-10.2.4.142@tcp inode [0x20001a213:0xb88d:0x0] object 0x0:9508 extent [0-1048575]: client csum 32a1611e, server csum 66771e72
            03:11:42:[18144.873216] Lustre: lustre-OST0000: Recovery over after 0:03, of 3 clients 3 recovered and 0 were evicted.
            
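            The 168-f line means the OST recomputed the checksum of the replayed OST_WRITE and got a value different from the one the client sent with the original request. The client-side wire-checksum settings involved can be inspected with the standard lctl parameters, for example:

                # Query wire-checksum settings on a Lustre client (output varies by setup):
                lctl get_param osc.*.checksums       # 1 = wire checksums enabled
                lctl get_param osc.*.checksum_type   # algorithm in use, e.g. adler or crc32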

            Looking through Maloo, replay-single test 70f started failing at the beginning of February 2017 and has failed 60 times for the full test group since then. All failures I've seen are during interop testing.

            Some early failure logs are at:
            2017-02-03 - (upstream client + master servers) https://testing.hpdd.intel.com/test_sets/766efa26-eaf4-11e6-af25-5254006e85c2
            2017-02-04 - (master clients + b2_8 servers) https://testing.hpdd.intel.com/test_sets/2122fcde-eba8-11e6-848c-5254006e85c2
            2017-02-04 - (master clients + b2_9 servers) https://testing.hpdd.intel.com/test_sets/3e6136de-eba9-11e6-9bb9-5254006e85c2
            2017-02-07 - (master clients + b2_8 servers) https://testing.hpdd.intel.com/test_sets/deae1082-ee27-11e6-bbfe-5254006e85c2
            2017-02-07 - (master clients + b2_8 servers) https://testing.hpdd.intel.com/test_sets/bf86630e-edf9-11e6-8f6d-5254006e85c2


            People

              Assignee: sarah Sarah Liu
              Reporter: jcasper James Casper (Inactive)
              Votes: 0
              Watchers: 5
