Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.4.0
    • Environment: 2.4.0-19chaos clients and servers
    • Severity: 3
    • 12564

    Description

      We have received reports of Lustre clients incorrectly reporting files with 0 length; on a second attempt the correct non-zero length is reported. This is reminiscent of LU-274. Before digging too far into this, I notice that the fix for LU-274 does not seem to have survived the conversion from obdfilter to ofd. Do we need a similar fix in ofd_intent_policy()?

      Here is the reproducer reported by our user:

      1. On cslic3 using lscratch3, I created 20 files of size 1024 bytes.
      2. Waited 27 minutes (with a wait of around 15 minutes I didn't see incorrect zero sizes).
      3. On cslic8, listed the lscratch3 directory from step 1; I saw 3 files with size 0.
      4. On cslic8, listed the lscratch3 directory again (immediately following step 3) and all files were listed as 1024 bytes. (A scripted form of these steps is sketched below.)
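
      A minimal shell sketch of the four steps above, assuming passwordless ssh from cslic3 to cslic8; the directory path is only a placeholder for a location on the lscratch3 mount visible to both nodes:

          # Run on cslic3; DIR is a hypothetical directory on lscratch3.
          DIR=/p/lscratch3/$USER/lu4597-test
          mkdir -p $DIR
          for i in $(seq 1 20); do dd if=/dev/urandom of=$DIR/file$i bs=1024 count=1; done
          sleep $((27 * 60))        # ~15 minutes was reportedly not long enough
          ssh cslic8 ls -l $DIR     # first listing may show some files with size 0
          ssh cslic8 ls -l $DIR     # a second listing shows the correct 1024 bytes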

      Attachments

        Activity

          [LU-4597] inconsistent file size
          pjones Peter Jones added a comment -

          Landed for 2.5.1 and 2.6

          jamesanunez James Nunez (Inactive) added a comment -

          Patch for b2_5 at http://review.whamcloud.com/#/c/9328/

          nedbass Ned Bass (Inactive) added a comment -

          We set statahead_agl=0 on all our clients to work around this issue until the patch can be deployed. This seemed to work; however, I just learned that the 'size changed from 0' error was reported for 9 files during a run of the "htar" archival storage utility. So there may be another (less frequent) bug that can cause this behavior.
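
          A hedged example of applying and checking the workaround on a single client; note that a plain set_param does not persist across a remount, so it needs to be reapplied or scripted on each client:

          lctl set_param llite.*.statahead_agl=0    # disable AGL statahead on all mounts
          lctl get_param llite.*.statahead_agl      # confirm it now reads back as 0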

          niu Niu Yawei (Inactive) added a comment -

          Patch for master: http://review.whamcloud.com/9249

          nedbass Ned Bass (Inactive) added a comment -

          I confirmed that disabling statahead_agl seems to prevent the bug here as well:

          lctl set_param llite.*.statahead_agl=0
          

          niu Niu Yawei (Inactive) added a comment -

          I can reproduce it with two mounts, and it looks like a race between AGL and the normal getattr (it can no longer be reproduced once AGL is turned off). I will look into it further.
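
          A hedged sketch of the two-mount setup described above; the MGS NID, filesystem name, and mount points are placeholders:

          # Mount the same filesystem twice on one node.
          mkdir -p /mnt/l1 /mnt/l2
          mount -t lustre mgsnode@o2ib:/lscratch3 /mnt/l1
          mount -t lustre mgsnode@o2ib:/lscratch3 /mnt/l2
          # Create small files through the first mount, then list them through the second.
          for i in $(seq 0 19); do dd if=/dev/urandom of=/mnt/l1/file$i bs=1k count=1; done
          ls -l /mnt/l2    # with AGL enabled, some files may be listed with size 0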


          nedbass Ned Bass (Inactive) added a comment -

          FWIW, having full debugging enabled on the servers seems to make this bug much easier to reproduce.


          nedbass Ned Bass (Inactive) added a comment -

          In case it helps interpret the debug logs, here are the NIDs of the nodes involved.

          sierra654: 192.168.114.155@o2ib5     # created files
          sierra330: 192.168.113.81@o2ib5      # got zero length for file4
          porter44: 172.19.1.213@o2ib100       # OSS owning object for file4
          porter-mds1: 172.19.1.165@o2ib100    # MDS
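
          If it helps when scanning the dumped logs, a hedged example of filtering the debug text for lines that mention the client that saw the zero length and the OSS owning the object for file4 (the log file names are placeholders):

          grep -E '192\.168\.113\.81@o2ib5|172\.19\.1\.213@o2ib100' \
              sierra330-debug.log porter44-debug.log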
          

          nedbass Ned Bass (Inactive) added a comment -

          I managed to reproduce the bug on an unclassified system and get debug logs from the clients, MDS, and OSS. I uploaded them in a tarball to ftp.whamcloud.com. Email me privately if you need the file name. There is a README in the tarball with a few notes of relevance.


          nedbass Ned Bass (Inactive) added a comment -

          We found a pretty reliable reproducer for this bug. Unfortunately it only works on one of our classified filesystems, so I can't send debug logs. The server is running 2.4.0-24chaos (see https://github.com/chaos/lustre) with ZFS and the clients are 2.4.0-19chaos. The reproducer is pretty simple.

          # Create files locally then list them on remote node e8.
          for ((i=0;i<20;i++)) ; do dd if=/dev/urandom of=file$i bs=1k count=1 ; done ; rsh e8 ls -l `pwd`
          

          Create several 1k files, then immediately list them on another node. Some of the files are listed with 0 length, then show the correct lengths when listed again. Usually between 0 and 3 files are affected, but which files, and how many, varies between attempts.
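
          A hedged one-liner to count how many files the remote node sees with size 0 on the first listing (field 5 of ls -l is the size; e8 is the remote node from the reproducer above):

          rsh e8 ls -l `pwd` | awk '$NF ~ /^file/ && $5 == 0 { n++ } END { print n+0, "files listed with size 0" }'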

          I captured debug logs from the client and MDS for one successful attempt. The client logs had +dlmtrace +rpctrace enabled and the MDS log had -1 (full debugging). The bug wouldn't reproduce with -1 debugging on the clients, and I haven't been able to find the cause in the logs yet. Please let me know if you have any tips on how to debug this. Meanwhile I'll keep trying to reproduce it on an unclassified system so we can send debug logs.
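
          For reference, a hedged sketch of setting those debug masks and dumping the logs with standard lctl commands; the output file name is only a placeholder:

          # On the clients: add DLM and RPC tracing to the existing debug mask.
          lctl set_param debug=+dlmtrace
          lctl set_param debug=+rpctrace
          # On the MDS: enable all debug flags (-1).
          lctl set_param debug=-1
          # After running the reproducer, dump and clear the kernel debug log on each node.
          lctl dk /tmp/lustre-debug.$(hostname).log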


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 10
