Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • Environment: kernel 2.6.32-358.6.2, lustre 2.4.0

    Description

      I encountered several bugs while running mds-survey tests.

      1)
      mds-survey displays an error message at the directory creation step (lctl test_mkdir):
      "ostid_set_id() Bad 18446744073709551615 to set 0:0"

      Here are the step-by-step commands executed by mds-survey.

      # lctl dl
        0 UP osd-ldiskfs fs2-MDT0000-osd fs2-MDT0000-osd_UUID 11
        1 UP mgc MGC30.1.0.95@o2ib 21007e3d-b2b2-494c-8c9e-a536f037ee6d 5
        2 UP mds MDS MDS_uuid 3
        3 UP lod fs2-MDT0000-mdtlov fs2-MDT0000-mdtlov_UUID 4
        4 UP mdt fs2-MDT0000 fs2-MDT0000_UUID 15
        5 UP mdd fs2-MDD0000 fs2-MDD0000_UUID 4
        6 UP qmt fs2-QMT0000 fs2-QMT0000_UUID 4
        7 UP lwp fs2-MDT0000-lwp-MDT0000 fs2-MDT0000-lwp-MDT0000_UUID 5
        8 UP osp fs2-OST0003-osc-MDT0000 fs2-MDT0000-mdtlov_UUID 5
        9 UP osp fs2-OST0002-osc-MDT0000 fs2-MDT0000-mdtlov_UUID 5
       10 UP osp fs2-OST0001-osc-MDT0000 fs2-MDT0000-mdtlov_UUID 5
       11 UP osp fs2-OST0000-osc-MDT0000 fs2-MDT0000-mdtlov_UUID 5
      # modprobe obdecho
      # lctl << EOF
      > attach echo_client fs2-MDT0000_ecc fs2-MDT0000_ecc_UUID
      > setup fs2-MDT0000 mdd
      > EOF
      # lctl --device 12 test_mkdir /test0
      (/homes/pichong/SB/AE4_kernel38/obj/x86_64_bullxlinux6.3/topdir/BUILD/lustre-2.4.0/lustre/include/lustre/lustre_idl.h:683:ostid_set_id()) Bad 18446744073709551615 to set 0:0
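The device number 12 passed to `lctl --device` above is presumably the index the new echo_client received after devices 0 to 11. A minimal sketch of looking that index up by name instead of hard-coding it, assuming the `lctl dl` column layout shown above (index, state, type, name, UUID, refcount). The helper name `echo_client_devno` is hypothetical and not part of mds-survey.

```shell
# Hypothetical helper: print the index of the echo_client device named $1,
# reading "lctl dl"-style output on stdin
# (assumed columns: index state type name uuid refcount).
echo_client_devno() {
    awk -v name="$1" '$3 == "echo_client" && $4 == name { print $1 }'
}

# Usage on a live MDS (not run here):
#   devno=$(lctl dl | echo_client_devno fs2-MDT0000_ecc)
#   lctl --device "$devno" test_mkdir /test0
```

If no matching device is listed, the helper prints nothing, so a caller would need to check for an empty result before invoking `lctl --device`.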
      

      2)
      The get_global_stats() subroutine of mds-survey does not correctly aggregate results from each MDT.

      get_global_stats () {
          local rfile=$1
          awk < $rfile                                               \
          'BEGIN {n = 0;}                                            \
          {    n++;                                                  \
               if (n == 1) { err = $1; ave = $2; min = $3; max = $4} \
               else                                                  \
               { if ($1 < err) err = $1;                             \
                 if ($2 < min) min = $2;                             \
                 if ($3 > max) max = $3;                             \
               }                                                     \
          }                                                          \
          END { if (n == 0) err = 0;                                 \
                printf "%d %f %f %f\n", err, ave, min, max}'
      }
      

      should be

      get_global_stats () {
          local rfile=$1
          awk < $rfile                                               \
          'BEGIN {n = 0;}                                            \
          {    n++;                                                  \
               if (n == 1) { err = $1; ave = $2; min = $3; max = $4} \
               else                                                  \
               { if ($1 < err) err = $1;                             \
                 ave += $2;                                          \
                 if ($3 < min) min = $3;                             \
                 if ($4 > max) max = $4;                             \
               }                                                     \
          }                                                          \
          END { if (n == 0) err = 0;                                 \
                printf "%d %f %f %f\n", err, ave/n, min, max}'
      }
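To illustrate the corrected aggregation on concrete numbers, here is a self-contained sketch that reads the per-MDT lines from stdin instead of a file. The input layout (err, ave, min, max per line) comes from the function above. The function name `aggregate_stats`, the sample values, and the guard against empty input are assumptions added for illustration.

```shell
# Sketch of the corrected aggregation; each input line is "err ave min max".
aggregate_stats() {
    awk '
    NR == 1 { err = $1; sum = $2; min = $3; max = $4; next }
    {
        if ($1 < err) err = $1      # keep the smallest err value
        sum += $2                   # accumulate averages for the final mean
        if ($3 < min) min = $3      # min is column 3, not column 2
        if ($4 > max) max = $4      # max is column 4, not column 3
    }
    END {
        # Assumed guard: avoid dividing by zero when there is no input
        if (NR == 0) { print "0 0.000000 0.000000 0.000000"; exit }
        printf "%d %f %f %f\n", err, sum / NR, min, max
    }'
}

# Two hypothetical MDT result lines
printf '%s\n' '0 100.0 90.0 110.0' '0 200.0 150.0 260.0' | aggregate_stats
# prints: 0 150.000000 90.000000 260.000000
```

With the same two lines, the original column indexing would report an average of 100 and a maximum of 150, because it never accumulates the average and compares the wrong fields.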
      

      I am going to provide a patch.

      3)
      When stripe_count is positive, the mds-survey "destroy" action does not free the objects that were created on the OSTs. As a result, after several runs of mds-survey, the OSTs return ENOSPC at file creation and "lctl test_create" hangs with the following stack.

      [<ffffffffa0171731>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [<ffffffffa0e74154>] osp_precreate_reserve+0x5c4/0x1ee0 [osp]
      [<ffffffffa0e6dc55>] osp_declare_object_create+0x155/0x4f0 [osp]
      [<ffffffffa0b7438d>] lod_qos_declare_object_on+0xed/0x480 [lod]
      [<ffffffffa0b75f0f>] lod_alloc_rr.clone.2+0x66f/0xde0 [lod]
      [<ffffffffa0b77b69>] lod_qos_prep_create+0xfa9/0x1b14 [lod]
      [<ffffffffa0b71cab>] lod_declare_striped_object+0x14b/0x880 [lod]
      [<ffffffffa0b72df3>] lod_declare_xattr_set+0x273/0x410 [lod]
      [<ffffffffa0547700>] mdo_declare_xattr_set.clone.4+0x40/0xe0 [mdd]
      [<ffffffffa054a470>] mdd_declare_create+0x4b0/0x860 [mdd]
      [<ffffffffa054afb1>] mdd_create+0x791/0x1740 [mdd]
      [<ffffffffa0ea8fef>] echo_md_create_internal+0x1cf/0x640 [obdecho]
      [<ffffffffa0eb2b43>] echo_md_handler+0x1333/0x1ac0 [obdecho]
      [<ffffffffa0eb7257>] echo_client_iocontrol+0x2dc7/0x3b40 [obdecho]
      [<ffffffffa060849f>] class_handle_ioctl+0x12ff/0x1ec0 [obdclass]
      [<ffffffffa05f02ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
      [<ffffffff81181372>] vfs_ioctl+0x22/0xa0
      [<ffffffff81181514>] do_vfs_ioctl+0x84/0x580
      [<ffffffff81181a91>] sys_ioctl+0x81/0xa0
      [<ffffffff81003072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
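One way to avoid starting a run that will hit ENOSPC is to check the free object count on each OST first. The sketch below is an illustration under stated assumptions, not part of mds-survey: it takes a directory laid out like /proc/fs/lustre/obdfilter (one subdirectory per OST, each containing a `filesfree` file) and a required number of free objects per OST, and lists every OST below that threshold. The function name `osts_short_on_objects` and the threshold argument are hypothetical.

```shell
# Hypothetical pre-check: list OSTs whose filesfree is below $2.
# $1 is a directory laid out like /proc/fs/lustre/obdfilter.
osts_short_on_objects() {
    root=$1
    need=$2
    for f in "$root"/*/filesfree; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        free=$(cat "$f")
        if [ "$free" -lt "$need" ]; then
            echo "$(basename "$(dirname "$f")") filesfree=$free need=$need"
        fi
    done
}

# Usage on an OSS (not run here):
#   osts_short_on_objects /proc/fs/lustre/obdfilter 100000
```

An empty result means every OST found has at least the requested number of free objects.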
      

        Activity

          [LU-3624] mds-survey has several bugs
          mdiep Minh Diep added a comment -

          Hi Gregoire,

          If you still have your system, could you provide the output of mds-survey run after you apply patch 2? - thanks

          pjones Peter Jones added a comment -

          Minh will help with landing the remaining patch


          bobbielind Bobbie Lind (Inactive) added a comment -

          Patch 1 has landed to master as 5ae2c575ff234c7b1189d2f71d8e5a73509591f3

          pichong Gregoire Pichon added a comment -

          I have posted a patch for item 2) and other issues related to multiple MDT support in mds-survey.
          Here it is: http://review.whamcloud.com/7558
          pjones Peter Jones added a comment -

          Bobbie

          Could you please take care of this one?

          Thanks

          Peter


          pichong Gregoire Pichon added a comment -

          Here is a patch for item 1)
          http://review.whamcloud.com/7101

          pichong Gregoire Pichon added a comment -

          About point 3), the issue might not be caused by the "destroy" action. The free/used object counts on the OSTs show that freeing is delayed and takes several minutes to complete.

          I ran mds-survey with stripe_count=4 and file_count=100000 and, in the meantime, displayed the number of objects in use on one of the OSTs, computed as /proc/fs/lustre/obdfilter/fs2-OST0000/filestotal minus /proc/fs/lustre/obdfilter/fs2-OST0000/filesfree.

          15:53:42 fs2-OST0000 filesused=16543
          15:53:52 fs2-OST0000 filesused=16543
          15:54:02 fs2-OST0000 filesused=46543
          15:54:12 fs2-OST0000 filesused=112447
          15:54:22 fs2-OST0000 filesused=108351
          15:54:32 fs2-OST0000 filesused=104255
          15:54:42 fs2-OST0000 filesused=100159
          15:54:52 fs2-OST0000 filesused=96063
          15:55:02 fs2-OST0000 filesused=91967
          15:55:12 fs2-OST0000 filesused=87871
          15:55:22 fs2-OST0000 filesused=83775
          15:55:32 fs2-OST0000 filesused=79679
          15:55:42 fs2-OST0000 filesused=75583
          15:55:52 fs2-OST0000 filesused=71487
          15:56:02 fs2-OST0000 filesused=67391
          15:56:12 fs2-OST0000 filesused=63295
          15:56:22 fs2-OST0000 filesused=59199
          15:56:32 fs2-OST0000 filesused=55103
          15:56:42 fs2-OST0000 filesused=51007
          15:56:52 fs2-OST0000 filesused=46911
          15:57:02 fs2-OST0000 filesused=42815
          15:57:12 fs2-OST0000 filesused=38719
          15:57:22 fs2-OST0000 filesused=34623
          15:57:32 fs2-OST0000 filesused=30527
          15:57:42 fs2-OST0000 filesused=26431
          15:57:52 fs2-OST0000 filesused=22335
          15:58:02 fs2-OST0000 filesused=18239
          15:58:12 fs2-OST0000 filesused=16543
          15:58:22 fs2-OST0000 filesused=16543
          15:58:32 fs2-OST0000 filesused=16543
          15:58:42 fs2-OST0000 filesused=16543

          This behavior seems to be a problem when launching several mds-survey runs in a row.
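The filesused figures in this comment are filestotal minus filesfree, sampled every 10 seconds. Once destruction starts, the count in that trace falls by 4096 per sample (roughly 410 objects per second) until it returns to 16543. A minimal sketch of that computation, and of waiting for the count to return to a baseline before starting the next run, follows. The function names, the baseline and polling arguments, and the parameterized directory are assumptions for illustration; only the two /proc file names come from the comment.

```shell
# Print objects in use on one OST: filestotal - filesfree.
# $1 is the OST directory, e.g. /proc/fs/lustre/obdfilter/fs2-OST0000.
files_used() {
    echo $(( $(cat "$1/filestotal") - $(cat "$1/filesfree") ))
}

# Hypothetical helper: poll until files_used is at or below baseline $2.
# $3 is the poll interval in seconds (default 10),
# $4 the maximum number of polls (default 60).
# Returns 0 once drained, 1 if the baseline was never reached.
wait_for_drain() {
    interval=${3:-10}
    tries=${4:-60}
    while [ "$tries" -gt 0 ]; do
        used=$(files_used "$1")
        echo "$(date +%H:%M:%S) $(basename "$1") filesused=$used"
        [ "$used" -le "$2" ] && return 0
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}

# Usage on an OSS (not run here):
#   wait_for_drain /proc/fs/lustre/obdfilter/fs2-OST0000 16543
```

Calling such a helper between consecutive mds-survey runs would be one way to keep a new run from starting while objects from the previous run are still being freed.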

          People

            mdiep Minh Diep
            pichong Gregoire Pichon