Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 1.8.9
    • Lustre 1.8.6, Lustre 1.8.x (1.8.0 - 1.8.5)
    • CentOS 5 with 1.8.6-WC1 clients and 1.8.4 servers.

    Description

      While running some tests, we found that the client hangs. Further investigation shows that the client is looping forever in ll_readdir_page(). The first ll_dir_entry record is correct, but some of the following ll_dir_entry records are entirely zero.

      crash> struct ll_dir_entry 0xffff8105f5818000
      struct ll_dir_entry {
      lde_inode = 748257583,
      lde_rec_len = 12,
      lde_name_len = 1 '\001',
      lde_file_type = 2 '\002',
      lde_name = ".\000\000\000\200\202\230,\f\000\002\002..\000\0000\201\231,\024\000\n\001ssciohb.nrat1\201\231,\020\000\005\001krsni8552\201\231,\020\000\005\001nticpemc3\201\231,\020\000\b\001crn.wole4\201\231,\f\000\003\001lsrt5\201\231,\024\000\n\001loita.hdal2.6\201\231,\024\000\n\001feekg.vsri9\0007\201\231,\024\000\f\001lgaumt.ggesd8\201\231,\024\000\v\001eilltn.ncsr.9\201\231,\020\000\b\001sai.blol:\201\231,\024\000\t\001rnbtmru.sing;\201\231,\020\000\006\001eta.fo64<\201\231,\f\000\003\001rkr.=\201\231,\020\000\005\001aco.d13"
      }

      crash> struct ll_dir_entry 0xffff8105f5818a19
      struct ll_dir_entry {
      lde_inode = 0,
      lde_rec_len = 0,
      lde_name_len = 0 '\0',
      lde_file_type = 0 '\0',
      lde_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
      }
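      To illustrate why an all-zero record hangs the client, here is a minimal user-space sketch of the walk. It assumes the ext2-style layout suggested by the crash output (4-byte inode, 2-byte rec_len, 1-byte name length, 1-byte file type) and a 4096-byte page. The names walk_unguarded, rec_len_at, PAGE_SZ and MIN_REC are hypothetical and not taken from the Lustre source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u
#define MIN_REC 12u   /* assumed value of ll_dir_rec_len(1) */

/* Read the 16-bit rec_len field of the entry at byte offset `off`. */
static uint16_t rec_len_at(const unsigned char *page, unsigned off)
{
    uint16_t len;

    memcpy(&len, page + off + 4, sizeof(len));
    return len;
}

/*
 * Advance from `off` by rec_len on every step, as the loop in
 * ll_readdir_page() does. Returns the number of steps needed to leave
 * the page, or -1 if the walk is still inside the page after
 * `max_steps` iterations.
 */
static int walk_unguarded(const unsigned char *page, unsigned off,
                          int max_steps)
{
    int steps = 0;

    assert(max_steps > 0);
    while (off <= PAGE_SZ - MIN_REC) {
        if (steps == max_steps)
            return -1;                  /* no way out of the page */
        off += rec_len_at(page, off);   /* rec_len == 0: off never moves */
        steps++;
    }
    return steps;
}
```

      With a zeroed record under the cursor, `off` never changes, so the real loop (which has no step limit) spins forever.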

      After applying the changes below, the tests pass smoothly and the debug message is printed many times.

      diff --git a/lustre/llite/dir.c b/lustre/llite/dir.c
      index 3154d32..3b9779b 100644
      --- a/lustre/llite/dir.c
      +++ b/lustre/llite/dir.c
      @@ -327,6 +327,12 @@ static int ll_readdir_page(char *addr, __u64 base, unsigned *offset,
               de = ll_entry_at(addr, *offset);
               end = addr + CFS_PAGE_SIZE - ll_dir_rec_len(1);
               for (nr = 0 ;(char*)de <= end; de = ll_dir_next_entry(de)) {
      +                if (de->lde_rec_len == 0) {
      +                        printk("bergwolf debug\n");
      +                        printk("de %p lde_inode %d lde_rec_len %d lde_name_len %d lde_file_type %d\n",
      +                               de, de->lde_inode, de->lde_rec_len, de->lde_name_len, de->lde_file_type);
      +                        break;
      +                }
                       if (de->lde_inode != 0) {
                               nr++;
                               *offset = (char *)de - addr;

      It may not be the right fix as I didn't figure out why the page is partially zeroed.
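      A zero check is the narrowest possible guard. As a sketch of a broader one, in the spirit of the sanity checks ext2 applies to directory pages, the walk could reject any implausible record length. This is an illustration only, not the patch that was merged. It assumes the 8-byte entry header seen in the crash output and the rounding rule rec_len = (name_len + 8 + 3) & ~3, which matches the dumped values (name length 1 gives 12, name length 13 gives 24). The names entry_ok, count_entries, min_rec_len, PAGE_SZ and HDR_SZ are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u
#define HDR_SZ  8u   /* inode (4) + rec_len (2) + name_len (1) + file_type (1) */

/* Smallest record that can hold a name of `name_len` bytes (assumed rule). */
static unsigned min_rec_len(unsigned name_len)
{
    return (name_len + HDR_SZ + 3u) & ~3u;
}

/* Return 1 if the entry at `off` has a plausible record length, else 0. */
static int entry_ok(const unsigned char *page, unsigned off)
{
    uint16_t rec_len;
    uint8_t name_len;

    if (off > PAGE_SZ - HDR_SZ)
        return 0;                        /* header would cross the page end */
    memcpy(&rec_len, page + off + 4, sizeof(rec_len));
    name_len = page[off + 6];
    if (rec_len < min_rec_len(1))
        return 0;                        /* also catches rec_len == 0 */
    if (rec_len & 3u)
        return 0;                        /* not 4-byte aligned */
    if (rec_len < min_rec_len(name_len))
        return 0;                        /* too short for its own name */
    if (off + rec_len > PAGE_SZ)
        return 0;                        /* runs past the page end */
    return 1;
}

/*
 * Count in-use entries (inode != 0) from `off` to the end of the page.
 * Returns -1 as soon as a corrupt entry is found, so it always terminates.
 */
static int count_entries(const unsigned char *page, unsigned off)
{
    int nr = 0;

    assert(off < PAGE_SZ);
    while (off < PAGE_SZ) {
        uint32_t inode;
        uint16_t rec_len;

        if (!entry_ok(page, off))
            return -1;
        memcpy(&inode, page + off, sizeof(inode));
        memcpy(&rec_len, page + off + 4, sizeof(rec_len));
        if (inode != 0)
            nr++;
        off += rec_len;                  /* at least 12, so progress is made */
    }
    return nr;
}
```

      Returning an error instead of silently breaking out of the loop would let the caller tell a corrupt page apart from the normal end of the page.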

        Activity

          [LU-1322] 1.8 client hang with 1.8.4 server

          keith Keith Mannthey (Inactive) added a comment - This code has been merged.

          keith Keith Mannthey (Inactive) added a comment - This patch has been properly acked for acceptance, but not much code is being taken into 1.8 at this point. As this is a Major bug, I am sure it is still under consideration.
          bergwolf Peng Tao added a comment -

          The patch has been uploaded to http://review.whamcloud.com/#change,3181

          bergwolf Peng Tao added a comment -

          This is also reproduced with 1.8.8-WC1 clients.

          A generated kernel dump shows only three valid dentries in the page, at offsets 0, 12, and 24. However, in ll_readdir() the kernel asks to read a dentry at page offset 90, which is not the start of any entry, and therefore reads garbage data.

          crash> struct ll_dir_entry 0xffff88051bd73000
          struct ll_dir_entry {
          lde_inode = 200449640,
          lde_rec_len = 12,
          lde_name_len = 1 '\001',
          lde_file_type = 2 '\002',
          lde_name = ".\000\000\000\001\000\362\v\f\000\002\002..\000\000i\236\362\v\350\017\r\001validate_file\000\000\000j\236\362\v\320\017\001\001\061alidate_file_2\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
          }
          crash> struct ll_dir_entry 0xffff88051bd7300c
          struct ll_dir_entry {
          lde_inode = 200409089,
          lde_rec_len = 12,
          lde_name_len = 2 '\002',
          lde_file_type = 2 '\002',
          lde_name = "..\000\000i\236\362\v\350\017\r\001validate_file\000\000\000j\236\362\v\320\017\001\001\061alidate_file_2\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
          }
          crash> struct ll_dir_entry 0xffff88051bd73018
          struct ll_dir_entry {
          lde_inode = 200449641,
          lde_rec_len = 4072,
          lde_name_len = 13 '\r',
          lde_file_type = 1 '\001',
          lde_name = "validate_file\000\000\000j\236\362\v\320\017\001\001\061alidate_file_2\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
          }

          Although it makes little sense for an application to seek to an arbitrary offset within a directory page, Lustre should handle this situation instead of hanging.
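          One possible way to cope with an arbitrary offset, sketched below under assumptions, is to walk the rec_len chain from the start of the page and snap the requested offset to the entry that contains it. This is not necessarily what the merged patch does. The layout (rec_len stored 4 bytes into an 8-byte header, 4096-byte page, minimum record of 12 bytes) is inferred from the dumps above, and the names entry_start_for, PAGE_SZ and MIN_REC are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u
#define MIN_REC 12u   /* assumed value of ll_dir_rec_len(1) */

/*
 * Return the start offset of the entry that contains byte offset `want`,
 * found by following rec_len links from the start of the page. Returns -1
 * if a record length is implausible (shorter than MIN_REC, which includes
 * zero, or running past the end of the page).
 */
static int entry_start_for(const unsigned char *page, unsigned want)
{
    unsigned off = 0;

    assert(want < PAGE_SZ);
    while (off + MIN_REC <= PAGE_SZ) {
        uint16_t rec_len;

        memcpy(&rec_len, page + off + 4, sizeof(rec_len));
        if (rec_len < MIN_REC || off + rec_len > PAGE_SZ)
            return -1;                  /* corrupt or zeroed record */
        if (want < off + rec_len)
            return (int)off;            /* `want` lies inside this entry */
        off += rec_len;
    }
    return -1;
}
```

          For the page dumped above (records of length 12, 12 and 4072), offsets 0, 12 and 24 map to themselves, while offset 90 maps back to 24, the start of the "validate_file" entry.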


          People

            keith Keith Mannthey (Inactive)
            bergwolf Peng Tao