Lustre / LU-10600

sanity test 66 hangs when run after tests 64d and 65k


Details


    Description

      To reproduce this hang, run the following:

      # ./llmount.sh
      #  ./auster -v -k sanity --only "64d 65k 66"
      

      The above commands will set up Lustre with one combined MGS/MDS and two OSTs backed by loopback devices on a single node. The auster command then runs only sanity tests 64d, 65k, and 66.
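
      If a run does hang, the setup can be torn down and recreated before the next attempt. A minimal sketch, assuming the stock scripts in /usr/lib64/lustre/tests are being used (llmountcleanup.sh ships alongside llmount.sh):

      # tear down the clients, targets and loopback devices created by llmount.sh
      ./llmountcleanup.sh
      # recreate the single-node filesystem and rerun only the three tests
      ./llmount.sh
      ./auster -v -k sanity --only "64d 65k 66"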

      The output from the run is:

      == sanity test 64d: check grant limit exceed ========================================================= 19:10:43 (1517512243)
      debug=super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
      dd: error writing '/mnt/lustre/f64d.sanity': No space left on device
      278+0 records in
      277+0 records out
      290684928 bytes (291 MB) copied, 13.1767 s, 22.1 MB/s
      /usr/lib64/lustre/tests/sanity.sh: line 6121: kill: (25805) - No such process
      debug=trace inode super ext2 malloc cache info ioctl neterror net warning buffs other dentry nettrace page dlmtrace error emerg ha rpctrace vfstrace reada mmap config console quota sec lfsck hsm snapshot layout
      Resetting fail_loc on all nodes...done.
      PASS 64d (15s)
      
      
      == sanity test 65k: validate manual striping works properly with deactivated OSCs ==================== 19:10:58 (1517512258)
      Check OST status: 
      lustre-OST0000-osc-MDT0000 is active
      lustre-OST0001-osc-MDT0000 is active
      total: 1000 open/close in 1.49 seconds: 672.84 ops/second
      Deactivate:  lustre-OST0000-osc-MDT0000
      /usr/bin/lfs setstripe -i 0 -c 1 /mnt/lustre/d65k.sanity/0
      /usr/bin/lfs setstripe -i 1 -c 1 /mnt/lustre/d65k.sanity/1
       - unlinked 0 (time 1517512260 ; total 0 ; last 0)
      total: 1000 unlinks in 2 seconds: 500.000000 unlinks/second
      lustre-OST0000-osc-MDT0000 is Activate
      trevis-58vm6.trevis.hpdd.intel.com: executing wait_import_state FULL osc.*OST*-osc-MDT0000.ost_server_uuid 40
      osc.*OST*-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
      total: 1000 open/close in 1.62 seconds: 615.67 ops/second
      Deactivate:  lustre-OST0001-osc-MDT0000
      /usr/bin/lfs setstripe -i 0 -c 1 /mnt/lustre/d65k.sanity/0
      lfs setstripe: error on ioctl 0x4008669a for '/mnt/lustre/d65k.sanity/0' (3): No space left on device
       sanity test_65k: @@@@@@ FAIL: setstripe 0 should succeed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5130:error()
        = /usr/lib64/lustre/tests/sanity.sh:6310:test_65k()
        = /usr/lib64/lustre/tests/test-framework.sh:5406:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5445:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5244:run_test()
        = /usr/lib64/lustre/tests/sanity.sh:6326:main()
      Dumping lctl log to /tmp/test_logs/2018-02-01/191040/sanity.test_65k.*.1517512264.log
      Dumping logs only on local client.
      Resetting fail_loc on all nodes...done.
      FAIL 65k (6s)
      
      
      == sanity test 66: update inode blocks count on client =============================================== 19:11:04 (1517512264)
      

      Looking at the test output, we see that test 64d fills an OST and does not remove the file. Test 65k then deactivates one OST at a time; once it has deactivated the empty OST and tries to stripe a file on the full one, the ‘lfs setstripe’ command fails with ENOSPC. Test 65k exits when ‘lfs setstripe’ fails and leaves that OST deactivated.

      The last thing I see in dmesg from test 66 is:

       [797721.192593] Lustre: DEBUG MARKER: == sanity test 66: update inode blocks count on client =============================================== 23:33:24 (1517614404)
      [797773.923382] Lustre: lustre-OST0001: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff88000c752c00, cur 1517614457 expire 1517614427 last 1517614404
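
      The eviction of the MDT's export by lustre-OST0001 is consistent with lustre-OST0001-osc-MDT0000 still being administratively deactivated on the MDT while the other OST is still full. Purely as a hedged illustration (not part of the proposed patch), a stuck setup like this can be recovered by hand on the MGS/MDS node; the device name below is taken from the log above:

      # a deactivated MDT-side OSC shows state "IN" rather than "UP"
      lctl dl | grep osc
      # reactivate the OSC that test 65k left deactivated
      lctl --device %lustre-OST0001-osc-MDT0000 activate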
      

      We need to remove the file created in test 64d and, on failure in test 65k, reactivate the deactivated OST and remove the files that test created.

      I’ll upload a patch for this.
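
      As a rough sketch of the test-side cleanup described above (written against the usual sanity.sh / test-framework.sh names such as $DIR, $tfile, $tdir, $LCTL, $SINGLEMDS, do_facet and wait_delete_completed; the helper names and hook points here are illustrative, not the actual patch):

      # Illustrative only: at the end of test 64d, remove the file that filled
      # the OST and wait until the freed space is visible again.
      cleanup_64d() {
              rm -f $DIR/$tfile
              wait_delete_completed
      }

      # Illustrative only: on any exit from test 65k, reactivate the OSC that
      # was deactivated on the MDS and remove the test directory, so a failure
      # never leaves an OST deactivated for later tests.
      # $1 is the full OSC device name, e.g. lustre-OST0001-osc-MDT0000.
      cleanup_65k() {
              local osc=$1
              do_facet $SINGLEMDS "$LCTL --device %$osc activate"
              rm -rf $DIR/$tdir
      }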

    People

      Assignee: James Nunez (Inactive)
      Reporter: James Nunez (Inactive)
      Votes: 0
      Watchers: 3
