Lustre / LU-14963

sanity-flr test_33a: timeout with no errors


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c3779d8f-6a21-4801-b14b-6f81a72baade

      test_33a failed with the following error:

      Timeout occurred after 163 mins, last suite running was sanity-flr
      

      The client does not show any errors/timeouts in this case (from client dmesg log):

      [ 6322.413011] Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec
      [ 6322.507218] bash (262976): drop_caches: 3
      [ 9912.891960] sysrq: SysRq : Changing Loglevel
      [ 9912.892752] sysrq: Loglevel set to 8
      [ 9912.894583] sysrq: SysRq : Show State
      :
      :
      [ 9915.051898] cat             I    0 264902 264901 0x00004080
      [ 9915.052798] Call Trace:
      [ 9915.053236]  __schedule+0x2c4/0x700
      [ 9915.053828]  schedule+0x38/0xa0
      [ 9915.054418]  cl_sync_io_wait+0x211/0x2f0 [obdclass]
      [ 9915.055886]  ll_io_read_page+0x443/0x820 [lustre]
      [ 9915.056681]  ll_readpage+0xf8/0x820 [lustre]
      [ 9915.057391]  generic_file_buffered_read+0x623/0xb00
      [ 9915.058908]  vvp_io_read_start+0x4c6/0x6e0 [lustre]
      [ 9915.059739]  cl_io_start+0x59/0x110 [obdclass]
      [ 9915.060500]  cl_io_loop+0x9a/0x1e0 [obdclass]
      [ 9915.061231]  ll_file_io_generic+0x447/0xc40 [lustre]
      [ 9915.062884]  ll_file_read_iter+0x246/0x2e0 [lustre]
      [ 9915.063693]  new_sync_read+0x121/0x170
      [ 9915.064321]  vfs_read+0x91/0x140
      [ 9915.064875]  ksys_read+0x4f/0xb0
      [ 9915.065429]  do_syscall_64+0x5b/0x1a0
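For reference, a task-state dump like the one above can be triggered on a hung client via SysRq. This is only a sketch assuming a standard Linux /proc layout; it requires root, and the grep pattern is just one way to find the blocked reader.

```shell
# Raise the console loglevel so task dumps reach the console/dmesg
# (matches the "Loglevel set to 8" line in the log above).
echo 8 > /proc/sys/kernel/printk

# SysRq 't' ("Show State"): dump the stack of every task to dmesg.
echo t > /proc/sysrq-trigger

# Locate the blocked reader's stack in the output.
dmesg | grep -B 1 -A 15 'cl_sync_io_wait'
```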
      

      It looks like the OST0000 remount did not finish recovery before OST0001 was unmounted, and OST0001 was never remounted because the cat did not complete (from OSS dmesg log):

      [ 6219.486214] Lustre: lustre-OST0000: deleting orphan objects from 0x0:1027 to 0x0:1089
      [ 6219.605486] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts || true
      [ 6220.265117] Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost2
      [ 6220.605623] Lustre: Failing over lustre-OST0001
      [ 6220.606433] Lustre: Skipped 3 previous similar messages
      [ 6220.618067] Lustre: server umount lustre-OST0001 complete
      [ 6224.603047] Lustre: lustre-OST0000: already connected client lustre-MDT0002-mdtlov_UUID (at 10.9.7.34@tcp) with handle 0x6b4562569fa55212. Rejecting client with the same UUID trying to reconnect with handle 0xd5828fc44237bcac
      [ 6224.606376] Lustre: Skipped 2 previous similar messages
      [ 6224.832608] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-207vm1.trevis.whamcloud.com: executing wait_import_state \(DISCONN\|IDLE\) osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid 40
      [ 6225.231972] Lustre: DEBUG MARKER: trevis-207vm1.trevis.whamcloud.com: executing wait_import_state (DISCONN|IDLE) osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid 40
      [ 6226.704321] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec
      [ 6227.113264] Lustre: DEBUG MARKER: osc.lustre-OST0001-osc-ffff95d5a6dd8000.ost_server_uuid in DISCONN state after 1 sec
      [ 6229.721530] Lustre: lustre-OST0000: already connected client lustre-MDT0002-mdtlov_UUID (at 10.9.7.34@tcp) with handle 0x6b4562569fa55212. Rejecting client with the same UUID trying to reconnect with handle 0xd5828fc44237bcac
      [ 6229.729916] Lustre: lustre-OST0000: deleting orphan objects from 0x300000401:642 to 0x300000401:1089
      [ 6239.965362] Lustre: lustre-OST0000: already connected client lustre-MDT0003-mdtlov_UUID (at 10.9.7.39@tcp) with handle 0xb40b5df2a03e11e5. Rejecting client with the same UUID trying to reconnect with handle 0x5296d9516634a720
      [ 6239.965824] Lustre: lustre-OST0000: deleting orphan objects from 0x300000402:1187 to 0x300000402:1313
      [ 6239.968458] Lustre: Skipped 3 previous similar messages
      [ 6250.208413] Lustre: lustre-OST0000: deleting orphan objects from 0x300000400:1188 to 0x300000400:1313
      [ 6357.721301] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.7.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 6357.724292] LustreError: Skipped 142 previous similar messages
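If that hypothesis is right, a guard along these lines would avoid the race: wait for the remounted OST to finish recovery before failing over the next target. This is only a sketch using the standard lctl interface; the obdfilter.lustre-OST0000.recovery_status parameter name matches the target in this report and would need adjusting for other setups.

```shell
# Poll the remounted OST's recovery status on the OSS and block until
# it leaves RECOVERING before unmounting the next target.
target=lustre-OST0000
while :; do
    status=$(lctl get_param -n obdfilter.${target}.recovery_status |
             awk '/^status:/ {print $2}')
    [ "$status" != "RECOVERING" ] && break
    sleep 1
done
echo "${target} recovery status: ${status}"
```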
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-flr test_33a - Timeout occurred after 163 mins, last suite running was sanity-flr


          People

            Assignee: WC Triage (wc-triage)
            Reporter: Maloo (maloo)
