
[LU-4601] Failure on test suite parallel-scale-nfsv3 test_compilebench

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.6.0
    • Environment: client and server: lustre-master build # 1876 RHEL6 ldiskfs
    • Severity: 3
    • 12586

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/cada8aee-8e18-11e3-9383-52540035b04c.

      The sub-test test_compilebench failed with the following error:

      compilebench failed: 1

      == parallel-scale-nfsv3 test compilebench: compilebench == 14:45:47 (1391553947)
      OPTIONS:
      cbench_DIR=/usr/bin
      cbench_IDIRS=2
      cbench_RUNS=2
      client-16vm1
      client-16vm2.lab.whamcloud.com
      ./compilebench -D /mnt/lustre/d0.compilebench -i 2         -r 2 --makej
      using working directory /mnt/lustre/d0.compilebench, 2 intial dirs 2 runs
      native unpatched native-0 222MB in 189.77 seconds (1.17 MB/s)
      Traceback (most recent call last):
        File "./compilebench", line 567, in <module>
          dset = dataset(options.sources, rnd)
        File "./compilebench", line 319, in __init__
          self.unpatched = native_order(self.unpatched, "unpatched")
        File "./compilebench", line 104, in native_order
          os.rmdir(fullpath)
      OSError: [Errno 39] Directory not empty: '/mnt/lustre/d0.compilebench/native-0'
       parallel-scale-nfsv3 test_compilebench: @@@@@@ FAIL: compilebench failed: 1 
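
      Errno 39 is ENOTEMPTY on Linux: os.rmdir() maps directly to rmdir(2), which refuses to remove a directory that still contains entries. Below is a minimal illustration of the failure mode shown in the traceback, using a throwaway path rather than the native-0 directory from the log:

      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main(void)
      {
              /* Create a directory that still holds an entry, then try to
               * remove it -- the same condition native_order() hit on
               * /mnt/lustre/d0.compilebench/native-0. */
              mkdir("/tmp/demo-not-empty", 0755);
              close(open("/tmp/demo-not-empty/leftover", O_CREAT | O_WRONLY, 0644));

              if (rmdir("/tmp/demo-not-empty") != 0)
                      printf("rmdir: errno %d (%s)\n", errno, strerror(errno));
              /* Prints: rmdir: errno 39 (Directory not empty) */

              /* Clean up the demo directory. */
              unlink("/tmp/demo-not-empty/leftover");
              rmdir("/tmp/demo-not-empty");
              return 0;
      }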
      

          Activity

            sarah Sarah Liu added a comment -

            Closing this bug since I didn't see it in the latest tests. The tests failed due to LU-5109.

            simmonsja James A Simmons added a comment - - edited

            Is this bug still seen? Now that patch 9038 has landed, it should have gone away. If this bug is no longer showing up, please close this ticket.


            simmonsja James A Simmons added a comment -

            Better yet, I will push a separate 9038 patch to see if it addresses this problem with the proper test suite string. What is the Maloo test string I should push to cover the above failures?


            cliffw Cliff White (Inactive) added a comment -

            bobijam, can you look at the Maloo issues?

            simmonsja James A Simmons added a comment - - edited

            The problem is that one module can load before another when the server/client stack is on the same node. In this case it is osc/osp. A workaround could be for the test framework to load osc.ko first and then osp.ko. Also, please try my 9038 patch to make sure on your end that it passes the failing test. My testing is going okay, but you never know.


            cliffw Cliff White (Inactive) added a comment -

            James, is there anything you can do to address the Maloo failures?

            yujian Jian Yu added a comment - - edited

            Lustre Build: http://build.whamcloud.com/job/lustre-master/1890/

            The same failure occurred:
            https://maloo.whamcloud.com/test_sets/4ab871b0-9687-11e3-bc3b-52540035b04c
            https://maloo.whamcloud.com/test_sets/68f2c766-9687-11e3-bc3b-52540035b04c
            https://maloo.whamcloud.com/test_sets/56a578e4-9680-11e3-a009-52540035b04c
            https://maloo.whamcloud.com/test_sets/68db57f4-9680-11e3-a009-52540035b04c

            simmonsja James A Simmons added a comment - - edited

            Lustre manages its own internal proc_dir_entry list, which has the benefit of being searchable with lprocfs_srch. As we move to using the Linux kernel's own internal list, we lose the ability to check whether a directory is already registered, and worse yet, the kernel will allow more than one directory to be created with the same name in the same parent directory. That is why you see two osc directories when the client and server run on the same node. You also see this problem with lod/lov.

            I was already seeing module loading race conditions with symlinks so I folded the fix into patch

            http://review.whamcloud.com/#/c/9038

            from LU-3319. To handle this problem I created a new proc entry in struct obd_type called typ_procsym. So, for example, when the osp module is loaded first, both the osp and osc proc entries are registered up front. When the osc module is loaded, we look up the obd_type for OSP, if it exists, and check whether typ_procsym is set; if it is, we use it as the proc root for osc. The basic idea is to make the proc root registration conditional in class_register_type.
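
            A rough sketch of that conditional registration, to make the idea concrete. This is not the code from the 9038 patch: proc_mkdir() is the real kernel API and typ_procroot/typ_procsym are the fields named above, but the struct name, helper name, and partner-lookup argument are illustrative assumptions.

            #include <linux/errno.h>
            #include <linux/proc_fs.h>

            /* Sketch of an obd type carrying the typ_procsym entry described
             * above; the real structure lives in Lustre's obd headers. */
            struct obd_type_sketch {
                    const char            *typ_name;      /* "osc", "osp", ... */
                    struct proc_dir_entry *typ_procroot;  /* this type's proc root */
                    struct proc_dir_entry *typ_procsym;   /* entry pre-registered on
                                                           * behalf of a partner type */
            };

            /* Conditional proc root registration: if a partner module (e.g. osp)
             * already registered our directory, reuse that entry instead of
             * creating a second "lustre/osc". */
            static int register_type_procroot(struct obd_type_sketch *type,
                                              struct obd_type_sketch *partner,
                                              struct proc_dir_entry *lustre_root)
            {
                    if (partner && partner->typ_procsym) {
                            type->typ_procroot = partner->typ_procsym;
                            return 0;
                    }

                    type->typ_procroot = proc_mkdir(type->typ_name, lustre_root);
                    return type->typ_procroot ? 0 : -ENOMEM;
            }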

            bergwolf Peng Tao added a comment -

            I think it is the proc race, as dmesg says:

            proc_dir_entry 'lustre/osc' already registered
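
            For reference, a minimal kernel-module sketch of that race; the module and directory names below are made up, and only proc_mkdir()/remove_proc_entry() are real kernel calls. Two services racing to register the same proc directory boil down to two proc_mkdir() calls with the same name and parent; on the RHEL6-era kernels in this report, the second call logs the "already registered" warning quoted above and, as described in the comment above, still leaves a second entry behind.

            #include <linux/module.h>
            #include <linux/proc_fs.h>

            static int __init proc_dup_init(void)
            {
                    /* Stand-in for the osp/osc modules both registering "osc". */
                    proc_mkdir("lustre_dup_demo", NULL);
                    proc_mkdir("lustre_dup_demo", NULL);  /* triggers the dmesg warning */
                    return 0;
            }

            static void __exit proc_dup_exit(void)
            {
                    /* Each call removes one entry matching the name. */
                    remove_proc_entry("lustre_dup_demo", NULL);
                    remove_proc_entry("lustre_dup_demo", NULL);
            }

            module_init(proc_dup_init);
            module_exit(proc_dup_exit);
            MODULE_LICENSE("GPL");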

            simmonsja James A Simmons added a comment -

            Can you tell what snapshot of master this occurred on? What is the file system setup? Peng pointed out the potential for a race in osc/osp module loading. If osp gets loaded first, it conflicts with the osc loading. Is this the case?


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 6
