Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.6
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Labels: None
    • 3
    • 7394

    Description

      We are using lustre 2.1.4-3chaos on our server clusters.

      Running a test application, one of our archive storage folks discovered that Lustre's directory listings are rather unreliable. The first thing she noticed is that directory entries can appear multiple times:

      > cd /p/lscratchrza/apotts/divt_rzstagg0/htar_1st_27475
      > find . -type f > ../test.lst0 ; echo $? ; wc -l ../test.lst0
      0
      34339 ../test.lst0
      > find . -type f > ../test.lst1 ; echo $? ; wc -l ../test.lst1
      0
      35006 ../test.lst1
      

      When the two directory listings are sorted and run through uniq, there are only 34339 unique entries.

      One of our sysadmins investigated and further found that sometimes entries are missing from the listing altogether. But when the missing files are checked with an ls, they are present.

      This has been noticed with the above find command, and also using "/bin/ls -laR .". Both files and subdirectories have appeared twice in the directory listing.

      The Lustre clients that have reproduced this behaviour are running 2.1.2-4chaos and 1.8.5.0-6chaos.

      Attachments

        Activity

          [LU-3029] Directory listings are unreliable

          morrone Christopher Morrone (Inactive) added a comment -

          Actually, the BG/P systems are 32-bit (at least the I/O nodes are)! BG/Q (Sequoia and Vulcan) are the first PPC systems that are fully 64-bit.

          yong.fan nasf (Inactive) added a comment -

          So please apply the patch:

          http://review.whamcloud.com/#change,6138

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

          We do have PPC64 clients on our BG/P systems that are stuck at 1.8. The udawn front end (login) nodes are running 1.8.5.0-5chaos. The IONs are running 1.8.5-8chaos.

          prakash Prakash Surya (Inactive) added a comment -

          > It is NOT true that all the lustre-1.8.5 releases support 64bithash. I have checked your branches and found that the oldest branch which supports 64bithash is lustre-1.8.5.0-6chaos. The earlier versions, lustre-1.8.5.0-{1/2/3/4/5}chaos, do NOT support 64bithash.

          Nice catch! We do have some clients running a 1.8.5-5chaos tag (old BG/P systems), and interestingly enough, they are only mounting the two filesystems that we see this issue on. So it all seems to add up, IMO. The 1.8.5-5chaos clients appear to have tainted a subset of the MDS threads, causing 64bit-enabled clients to see this issue when the readdir takes more than one RPC and the readdir is serviced by a mix of tainted and untainted MDS threads.

          yong.fan nasf (Inactive) added a comment - edited

          It is NOT true that all the lustre-1.8.5 releases support 64bithash. I have checked your branches and found that the oldest branch which supports 64bithash is lustre-1.8.5.0-6chaos. The earlier versions, lustre-1.8.5.0-{1/2/3/4/5}chaos, do NOT support 64bithash.

          nedbass Ned Bass (Inactive) added a comment -

          No, I don't believe so. The PPC clients are running 2.3.58; all other clients should be 2.1.x or 1.8.5.

          yong.fan nasf (Inactive) added a comment - edited

          Have any liblustre clients or old b2_0 clients accessed your system since the last MDS reboot? They do not support 64-bit dirhash.

          nedbass Ned Bass (Inactive) added a comment -

          > By "old client" I mean a client that was ever old, for example one upgraded from lustre-1.8.x to lustre-1.8.5.0-6chaos. If the client ever issued a readdir RPC before the upgrade, that would cause the above issues.

          The 1.8 clients have been running the same version since long before the MDS was last rebooted.

          yong.fan nasf (Inactive) added a comment -

          By "old client" I mean a client that was ever old, for example one upgraded from lustre-1.8.x to lustre-1.8.5.0-6chaos. If the client ever issued a readdir RPC before the upgrade, that would cause the above issues.

          As for the race checking, I am not sure. Since it is a race, it should be rare, which means most RPC service threads should NOT have "FMODE_32BITHASH" set. But the results do not support that conclusion. On the other hand, if the race existed, it should be possible for me to reproduce it locally, yet I have tested with your branches thousands of times and never reproduced it. So the race would have to be very rare...

          nedbass Ned Bass (Inactive) added a comment -

          nasf, thanks for the analysis. It does seem plausible, except that our oldest client is lustre-1.8.5.0-6chaos, and it seems to claim OBD_CONNECT_64BITHASH:

          https://github.com/chaos/lustre/blob/1.8.5.0-6chaos/lustre/llite/llite_lib.c#L284

          Another unsubstantiated theory is that a swabbing bug could make a PPC client look like 32-bit. But I checked the import file in /proc for a PPC client and it said 64bithash.

          The only other idea I have is that maybe there is some race checking the exp_connect_flags here:

          https://github.com/chaos/lustre/blob/2.1.4-4chaos/lustre/mdt/mdt_handler.c#L1488
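
          To make the check being discussed easier to follow, here is a minimal, hypothetical C sketch of the pattern at the line Ned links to - it is not the actual Lustre source. The flag names OBD_CONNECT_64BITHASH and FMODE_32BITHASH and the field names exp_connect_flags and f_mode come from the thread above; the struct names and bit values are illustrative only. The idea is that the server looks at the connect flags the client claimed at connect time and falls back to 32-bit directory hashes when OBD_CONNECT_64BITHASH was not claimed:

            #include <stdio.h>
            #include <stdint.h>

            #define OBD_CONNECT_64BITHASH  (1ULL << 30)  /* illustrative bit value, not the real one */
            #define FMODE_32BITHASH        0x1u          /* illustrative bit value, not the real one */

            struct export_sketch { uint64_t exp_connect_flags; };
            struct file_sketch   { unsigned int f_mode; };

            /* If the client never claimed 64-bit hash support at connect time,
             * fall back to 32-bit directory hashes for this readdir. */
            static void prep_readdir(const struct export_sketch *exp,
                                     struct file_sketch *file)
            {
                    if (!(exp->exp_connect_flags & OBD_CONNECT_64BITHASH))
                            file->f_mode |= FMODE_32BITHASH;
            }

            int main(void)
            {
                    struct export_sketch old_client = { 0 };
                    struct export_sketch new_client = { OBD_CONNECT_64BITHASH };
                    struct file_sketch f_old = { 0 }, f_new = { 0 };

                    prep_readdir(&old_client, &f_old);
                    prep_readdir(&new_client, &f_new);
                    printf("old client f_mode=%#x, new client f_mode=%#x\n",
                           f_old.f_mode, f_new.f_mode);
                    return 0;
            }

          If this decision were racy, or if the f_mode it sets outlived the request (as described in the next comment), a 64-bit client could intermittently be handed 32-bit hashes.
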
          yong.fan nasf (Inactive) added a comment - edited

          The situation should be like this:

          1) When you upgraded your MDS with the patch http://review.whamcloud.com/#patch,sidebyside,3467,7,lustre/osd-ldiskfs/osd_handler.c, the 64bithash/32bithash issue was introduced on your system, because the "osd_thread_info" is reused without being fully reset when a thread switches from processing one RPC to processing another.

          2) For an old client, whether 32-bit or 64-bit, as long as it did NOT claim the OBD_CONNECT_64BITHASH flag when it connected to the MDS, a readdir RPC from that client would cause "osd_thread_info::oti_it_ea::oie_file::f_mode" to be set to "FMODE_32BITHASH". Once such a readdir RPC happened, the "FMODE_32BITHASH" flag on the related "osd_thread_info" would not be cleared until the RPC service was restarted on the MDS.

          3) As long as "FMODE_32BITHASH" was set, the dir hashes produced by such an RPC service thread would use the major hash only - 32 bits. That is why we saw 32-bit dirhashes returned.

          4) Not all RPC service threads had "FMODE_32BITHASH" set in their "osd_thread_info::oti_it_ea::oie_file::f_mode"; it depended on whether old clients had triggered those threads to serve readdir RPCs. If an RPC service thread did not have "FMODE_32BITHASH" set, it would generate 64-bit hashes, which is why we also saw some 64-bit dirhashes returned.

          5) A readdir RPC from a client can be served by any RPC (readpage) service thread. So sometimes a readdir RPC was served by an RPC service thread with "FMODE_32BITHASH" set, and sometimes by one without it. For a large directory, one "ls -l dir" command may trigger several readdir RPCs; if those RPCs were handled by different RPC service threads, some with "FMODE_32BITHASH" set and some without, then when the client sent a 32-bit hash back to an RPC service thread that did NOT have "FMODE_32BITHASH" set, that thread would interpret the 32-bit hash from the client as "major = 0, minor = 32bithash", which is wrong, so it could not locate the right position.

          6) For a 2.x client, one readdir RPC fetches back at most 256 pages, but a 1.8 client gets only a single page per RPC. So the number of readdir RPCs for a directory of the same size differs, and more readdir RPCs mean more chances of failure. That is why the failure is easier to reproduce on a 1.8 client than on a 2.x client.
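
          To make the failure mode above concrete, the following is a minimal stand-alone sketch, not Lustre code: the struct and function names, the flag value, and the exact cookie layout are simplifications assumed from the description above. It models a per-thread flag that gets set while serving an old client and is never cleared, and the resulting "major = 0, minor = 32bithash" misreading when the resume cookie comes back to a thread that does not have the flag set:

            #include <stdint.h>
            #include <stdio.h>

            #define FMODE_32BITHASH  0x1u   /* illustrative bit value */

            /* Per-service-thread context, reused across RPCs without a full
             * reset (the heart of the bug described above). */
            struct thread_ctx {
                    unsigned int f_mode;    /* sticks across requests once set */
            };

            /* A 64-bit readdir cookie, conceptually: 32-bit "major" hash in the
             * high word, "minor" hash in the low word (layout simplified here). */
            static uint64_t make_cookie(uint32_t major, uint32_t minor)
            {
                    return ((uint64_t)major << 32) | minor;
            }

            /* The resume cookie a service thread hands back to the client. */
            static uint64_t readdir_cookie(const struct thread_ctx *t,
                                           uint32_t major, uint32_t minor)
            {
                    if (t->f_mode & FMODE_32BITHASH)
                            return major;                   /* 32-bit hash only */
                    return make_cookie(major, minor);       /* full 64-bit hash */
            }

            int main(void)
            {
                    struct thread_ctx tainted = { 0 };      /* once served an old client */
                    uint32_t major = 0x12345678, minor = 0x9abcdef0;

                    /* Step 2 above: an old client that never claimed
                     * OBD_CONNECT_64BITHASH is served by this thread once;
                     * the flag is set and never cleared. */
                    tainted.f_mode |= FMODE_32BITHASH;

                    /* A 64-bit client's readdir RPC happens to hit the tainted
                     * thread, so its resume cookie is a bare 32-bit major hash. */
                    uint64_t cookie = readdir_cookie(&tainted, major, minor);

                    /* Step 5 above: the next readdir RPC hits an untainted thread,
                     * which decodes the cookie as major = cookie >> 32 (i.e. 0!)
                     * and minor = low word, a position near the start of the hash
                     * space, so entries get repeated or skipped. */
                    printf("cookie handed to client:     %#018llx\n",
                           (unsigned long long)cookie);
                    printf("untainted thread sees major: %#010llx, minor: %#010llx\n",
                           (unsigned long long)(cookie >> 32),
                           (unsigned long long)(cookie & 0xffffffffu));
                    return 0;
            }

          Running it shows the high 32 bits of the returned cookie are zero, so an untainted thread would resume the directory walk at the wrong hash position, which matches the duplicated and missing entries reported above.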

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 15

            Dates

              Created:
              Updated:
              Resolved: