Client hangs when listing big directory with ls -la

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Labels: None
    • Environment: Client: 1.8.5
      Server: 2xMDS, 8xOSS, 24xOST, Lustre 2.0.59, RHEL 5.6

    Description

      We have noticed an interoperability issue between 1.8.5 clients and a 2.0.59 server (no other versions were tested).
      Clients running 2.0.59 are not affected by the problem.

      How to reproduce the problem:

      On a client node, run:
      cd /mnt/lustre
      mkdir somebigdir
      cd somebigdir
      for i in $(seq 1 10000); do touch file.$i; done
      ls -la
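
The reproduction steps can also be wrapped in a small script that reports a suspected hang instead of blocking the terminal. This is only a sketch: the function names `make_files` and `list_with_deadline`, the positional arguments, and the 60-second default are illustrative assumptions and are not part of the original report. The listing runs in the background and is polled, because a process stuck inside the kernel may not react to signals.

```shell
#!/bin/sh
# Hypothetical helpers; names and defaults are assumptions, not from the report.

# Create $2 empty files named file.1 .. file.N inside directory $1.
make_files() {
    mkdir -p "$1" || return 1
    i=1
    while [ "$i" -le "$2" ]; do
        : > "$1/file.$i" || return 1    # same effect as touch on a new file
        i=$((i + 1))
    done
}

# Run "ls -la" in directory $1 in the background and poll it for up to $2 seconds.
list_with_deadline() {
    ( cd "$1" && exec ls -la ) > /dev/null 2>&1 &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$2" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        echo "ls still running after ${2}s (possible hang)"
        return 1
    fi
    # The process is gone; collect its exit status.
    if ! wait "$pid"; then
        echo "ls failed"
        return 1
    fi
    echo "ls finished within about ${waited}s"
}

# Usage: script BASE_DIR [FILE_COUNT] [DEADLINE_SECONDS]
if [ "$#" -ge 1 ]; then
    make_files "$1/somebigdir" "${2:-10000}" &&
        list_with_deadline "$1/somebigdir" "${3:-60}"
fi
```

Saved under a name of your choice and run with `/mnt/lustre` as the first argument, this repeats the steps above. The one-second polling interval means the reported duration is only approximate.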

      The symptom is simple: the client hangs. With a 2.0.59 client, the same listing takes about 4 s.

      The problem is independent of the interconnect: it was tested with @tcp as well as with @o2ib.

      Log messages possibly related to the issue:

      00010000:00010000:10:1306772139.242230:0:3591:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: scratch-MDT0000-mdc-ffff81041677b800 lock: ffff8103f56ec200/0xf6a4fad9013fdffb lrc: 3/1,0 mode: PR/PR res: 8589937616/1 bits 0x3 rrc: 2 type: IBT flags: 0x0 remote: 0x3b122fd677c9380d expref: -99 pid: 1905 timeout: 0
      00010000:00010000:10:1306772139.242239:0:3591:0:(ldlm_lock.c:580:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: scratch-MDT0000-mdc-ffff81041677b800 lock: ffff8103f56ec200/0xf6a4fad9013fdffb lrc: 2/1,0 mode: PR/PR res: 8589937616/1 bits 0x3 rrc: 3 type: IBT flags: 0x0 remote: 0x3b122fd677c9380d expref: -99 pid: 1905 timeout: 0
      00010000:00010000:10:1306772139.242244:0:3591:0:(ldlm_lock.c:1088:ldlm_lock_match()) ### matched (0 0) ns: scratch-MDT0000-mdc-ffff81041677b800 lock: ffff8103f56ec200/0xf6a4fad9013fdffb lrc: 2/1,0 mode: PR/PR res: 8589937616/1 bits 0x3 rrc: 2 type: IBT flags: 0x0 remote: 0x3b122fd677c9380d expref: -99 pid: 1905 timeout: 0
      00000080:00200000:10:1306772139.242252:0:3591:0:(dir.c:594:ll_dir_readpage_20()) VFS Op:inode=144115238810157057/0(ffff8103f56ef920) off 3590582044
      00000100:00100000:10:1306772139.242259:0:3591:0:(client.c:2084:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc ls:9a637513-e3b6-abe7-b530-d8d413e552d9:3591:x1370249573210902:172.16.193.1@o2ib:37
      00000100:00100000:10:1306772139.242811:0:3591:0:(client.c:2189:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc ls:9a637513-e3b6-abe7-b530-d8d413e552d9:3591:x1370249573210902:172.16.193.1@o2ib:37

      I can provide more information and run tests when needed.
      Best regards,

      Lukasz Flis

        Activity

          [LU-376] Client hangs when listing big directory with ls -la
          pjones Peter Jones made changes -
          Affects Version/s New: Lustre 1.8.6 [ 10022 ]
          Affects Version/s Original: Lustre 1.8.x [ 10010 ]
          yong.fan nasf (Inactive) made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
          adilger Andreas Dilger made changes -
          Fix Version/s New: Lustre 1.8.6 [ 10022 ]
          Affects Version/s New: Lustre 2.1.0 [ 10021 ]
          Affects Version/s Original: Lustre 2.0.0 [ 10011 ]
          m.magrys Marek Magrys made changes -
          Comment [ We still get I/O errors with the patched server from build #785. The following clients were tested:
          - unpatched 1.8.5
          - patched 1.8.6 (from build #787) ]
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.1.0 [ 10021 ]
          pjones Peter Jones made changes -
          Description: Original and New text both match the Description section above
          Priority Original: Minor [ 4 ] New: Blocker [ 1 ]
          yong.fan nasf (Inactive) made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]
          pjones Peter Jones made changes -
          Assignee Original: Robert Read [ rread ] New: nasf [ yong.fan ]
          lflis Lukasz Flis created issue -

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: lflis Lukasz Flis
            Votes: 0
            Watchers: 5
