Lustre / LU-1322

1.8 client hang with 1.8.4 server

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 1.8.9
    • Affects Versions: Lustre 1.8.6, Lustre 1.8.x (1.8.0 - 1.8.5)
    • Environment: CentOS 5 with 1.8.6-WC1 clients and 1.8.4 servers.

    Description

      While running some tests, we found that the client hangs. Further investigation shows that the client is looping forever in ll_readdir_page(). The first ll_dir_entry in the page is correct, but some of the following ll_dir_entry records are entirely zero.

      crash> struct ll_dir_entry 0xffff8105f5818000
      struct ll_dir_entry {
      lde_inode = 748257583,
      lde_rec_len = 12,
      lde_name_len = 1 '\001',
      lde_file_type = 2 '\002',
      lde_name = ".\000\000\000\200\202\230,\f\000\002\002..\000\0000\201\231,\024\000\n\001ssciohb.nrat1\201\231,\020\000\005\001krsni8552\201\231,\020\000\005\001nticpemc3\201\231,\020\000\b\001crn.wole4\201\231,\f\000\003\001lsrt5\201\231,\024\000\n\001loita.hdal2.6\201\231,\024\000\n\001feekg.vsri9\0007\201\231,\024\000\f\001lgaumt.ggesd8\201\231,\024\000\v\001eilltn.ncsr.9\201\231,\020\000\b\001sai.blol:\201\231,\024\000\t\001rnbtmru.sing;\201\231,\020\000\006\001eta.fo64<\201\231,\f\000\003\001rkr.=\201\231,\020\000\005\001aco.d13"
      }

      crash> struct ll_dir_entry 0xffff8105f5818a19
      struct ll_dir_entry {
      lde_inode = 0,
      lde_rec_len = 0,
      lde_name_len = 0 '\0',
      lde_file_type = 0 '\0',
      lde_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
      }
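The two dumps above suggest why the loop never terminates. The following is a minimal userspace sketch, not the kernel code: it assumes that ll_dir_next_entry() steps forward by lde_rec_len bytes (the ext2-style layout that the field names suggest), and it ignores the little-endian conversion the kernel would perform. The names dir_entry_hdr and next_entry are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the fixed part of the record shown in the crash
 * output. Field names follow the dump; the exact kernel definition is an
 * assumption. */
struct dir_entry_hdr {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;
    uint8_t  file_type;
};

/* Assumed behaviour of ll_dir_next_entry(): advance by rec_len bytes.
 * The header is copied out with memcpy so unaligned offsets are safe. */
const char *next_entry(const char *de)
{
    struct dir_entry_hdr h;

    assert(de != NULL);
    memcpy(&h, de, sizeof(h));
    return de + h.rec_len;
}
```

Under these assumptions, calling next_entry() on a zeroed record returns the same address it was given, so a loop of the form `for (...; (char *)de <= end; de = ll_dir_next_entry(de))` keeps revisiting that record and never reaches `end`.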

      After applying the change below, the tests pass smoothly and the debug message is printed many times.

      diff --git a/lustre/llite/dir.c b/lustre/llite/dir.c
      index 3154d32..3b9779b 100644
      --- a/lustre/llite/dir.c
      +++ b/lustre/llite/dir.c
      @@ -327,6 +327,12 @@ static int ll_readdir_page(char *addr, __u64 base, unsigned *offset,
               de = ll_entry_at(addr, *offset);
               end = addr + CFS_PAGE_SIZE - ll_dir_rec_len(1);
               for (nr = 0 ;(char*)de <= end; de = ll_dir_next_entry(de)) {
      +                if (de->lde_rec_len == 0) {
      +                        printk("bergwolf debug\n");
      +                        printk("de %p lde_inode %d lde_rec_len %d lde_name_len %d lde_file_type %d\n",
      +                               de, de->lde_inode, de->lde_rec_len, de->lde_name_len, de->lde_file_type);
      +                        break;
      +                }
                       if (de->lde_inode != 0) {
                               nr++;
                               *offset = (char *)de - addr;

      It may not be the right fix as I didn't figure out why the page is partially zeroed.
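For illustration only, a somewhat stricter variant of the same guard can be sketched in userspace. This is not the fix that was eventually applied, and it does not explain the zeroed region. It assumes an 8-byte fixed header (inode, rec_len, name_len, file_type) in host byte order, and the function name count_entries_checked is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_SIZE 8u  /* assumed: inode (4) + rec_len (2) + name_len (1) + file_type (1) */

/* Count live entries (inode != 0) in one directory page.
 * Returns -1 when a record length cannot be valid, instead of looping. */
int count_entries_checked(const char *page, size_t page_size)
{
    size_t off = 0;
    int nr = 0;

    assert(page != NULL);
    while (off + HDR_SIZE <= page_size) {
        uint32_t inode;
        uint16_t rec_len;

        memcpy(&inode, page + off, sizeof(inode));
        memcpy(&rec_len, page + off + 4, sizeof(rec_len));

        /* Reject records shorter than the header (this covers rec_len == 0)
         * and records that would run past the end of the page. */
        if (rec_len < HDR_SIZE || rec_len > page_size - off)
            return -1;

        if (inode != 0)
            nr++;
        off += rec_len;
    }
    return nr;
}
```

A caller could treat the -1 return as a corrupt or incomplete page and report an error rather than silently stopping, which would make the partially zeroed page visible to whoever is debugging the server side.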

          People

            keith Keith Mannthey (Inactive)
            bergwolf Peng Tao