
Missing MDTs in /proc/fs/lustre/lmv/lustre-clilmv-.../target_obd

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0
    • Environment: lustre-2.8.0_14_gd0cbf68-1.x86_64
      2.6.32-573.22.1.1chaos.ch5.4.x86_64
    • Severity: 3

    Description

      Our test setup contains 16 MDTs, but our clients only ever see 10 of them. All output below is from a client with the file system mounted. All MDTs were formatted at the same time with the same script and the same MGS NID.

      Attempts to use any MDT that is not listed fail (e.g. lfs mkdir --index=10 fails).

      [root@catalyst320:mdc]# lfs mkdir --index=12 /p/lustre/dinatale/testdir
      error on LL_IOC_LMV_SETSTRIPE '/p/lustre/dinatale/testdir' (3): No such device
      error: mkdir: create stripe dir '/p/lustre/dinatale/testdir' failed
      [root@catalyst320:mdc]# lfs mkdir --index=8 /p/lustre/dinatale/testdir
      [root@catalyst320:mdc]# lfs getdirstripe /p/lustre/dinatale/testdir/        
      /p/lustre/dinatale/testdir/
      lmv_stripe_count: 0 lmv_stripe_offset: 8
      
      [root@catalyst320:mdc]# cat /proc/fs/lustre/lmv/lustre-clilmv-ffff881003e14400/target_obd
      0: lustre-MDT0000_UUID ACTIVE
      1: lustre-MDT0001_UUID ACTIVE
      2: lustre-MDT0002_UUID ACTIVE
      3: lustre-MDT0003_UUID ACTIVE
      4: lustre-MDT0004_UUID ACTIVE
      5: lustre-MDT0005_UUID ACTIVE
      6: lustre-MDT0006_UUID ACTIVE
      7: lustre-MDT0007_UUID ACTIVE
      8: lustre-MDT0008_UUID ACTIVE
      9: lustre-MDT0009_UUID ACTIVE
      
      ls /proc/fs/lustre/lmv/lustre-clilmv-ffff881003e14400/target_obds/
      lustre-MDT0000-mdc-ffff881003e14400  lustre-MDT0006-mdc-ffff881003e14400  lustre-MDT0012-mdc-ffff881003e14400
      lustre-MDT0001-mdc-ffff881003e14400  lustre-MDT0007-mdc-ffff881003e14400  lustre-MDT0013-mdc-ffff881003e14400
      lustre-MDT0002-mdc-ffff881003e14400  lustre-MDT0008-mdc-ffff881003e14400  lustre-MDT0014-mdc-ffff881003e14400
      lustre-MDT0003-mdc-ffff881003e14400  lustre-MDT0009-mdc-ffff881003e14400  lustre-MDT0015-mdc-ffff881003e14400
      lustre-MDT0004-mdc-ffff881003e14400  lustre-MDT0010-mdc-ffff881003e14400
      lustre-MDT0005-mdc-ffff881003e14400  lustre-MDT0011-mdc-ffff881003e14400
      
      [root@catalyst320:mdc]# grep current_state */state
      lustre-MDT0000-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0001-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0002-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0003-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0004-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0005-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0006-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0007-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0008-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0009-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0010-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0011-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0012-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0013-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0014-mdc-ffff881003e14400/state:current_state: FULL
      lustre-MDT0015-mdc-ffff881003e14400/state:current_state: FULL
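
The mismatch between the two proc views above can be checked mechanically. The following is a minimal sketch, not part of the original report. It assumes the caller has already read the lines of target_obd and the entry names under target_obds/. The helper names (mdt_names, missing_from_target_obd) are hypothetical.

```python
import re

# Matches the target name fragment, e.g. "MDT0010" in "lustre-MDT0010_UUID".
MDT_NAME = re.compile(r"MDT[0-9a-fA-F]{4}")


def mdt_names(lines):
    """Collect the MDTxxxx fragment from every line that contains one."""
    names = set()
    for line in lines:
        match = MDT_NAME.search(line)
        if match:
            names.add(match.group(0))
    return names


def missing_from_target_obd(target_obd_lines, target_obds_entries):
    """Return MDT names present under target_obds/ but absent from target_obd."""
    return sorted(mdt_names(target_obds_entries) - mdt_names(target_obd_lines))


if __name__ == "__main__":
    # Sample data shaped like the output in the description.
    listed = [f"{i}: lustre-MDT000{i}_UUID ACTIVE" for i in range(10)]
    entries = [f"lustre-MDT00{n:02d}-mdc-ffff881003e14400" for n in range(16)]
    print(missing_from_target_obd(listed, entries))
```

With the data from the description, the result would be the six names MDT0010 through MDT0015, which are the targets that appear under target_obds/ but not in target_obd.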
      

      Attachments

        Activity

          [LU-8100] Missing MDTs in /proc/fs/lustre/lmv/lustre-clilmv-.../target_obd

          gerrit Gerrit Updater added a comment -

          Giuseppe Di Natale (dinatale2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/20336
          Subject: LU-8100 lmv: Correctly generate target_obd
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: d8c2209167efb0fcbba5a1a390e07e800e938fec

          laisiyao Lai Siyao added a comment -

          No, thanks for your work!


          dinatale2 Giuseppe Di Natale (Inactive) added a comment -

          Just to make sure no one else is working on this, I'll be submitting a patch soon.

          dinatale2 Giuseppe Di Natale (Inactive) added a comment -

          I found the portion of the code you are talking about. I might already have a patch.

          laisiyao Lai Siyao added a comment -

          Yes, the current LMV code stores targets in an array, and it assumes a target's index cannot exceed the total target count, so only the targets whose index is below the total count are listed.

          I'll make a fix later.
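
The effect described above can be reproduced with a toy model. This is only an illustrative sketch under stated assumptions, not the LMV kernel code and not the patch that was later submitted: targets are modelled as a dict keyed by index, and both function names are hypothetical.

```python
def list_targets_bounded_by_count(targets):
    """Walk indices 0..count-1 only, as if indices never exceeded the count."""
    count = len(targets)
    return [f"{i}: {targets[i]} ACTIVE" for i in range(count) if i in targets]


def list_targets_all(targets):
    """Walk every index that is actually populated, regardless of the count."""
    return [f"{i}: {targets[i]} ACTIVE" for i in sorted(targets)]


# 16 targets whose indices are 0-9 and 16-21 (names MDT0000-MDT0009 and
# MDT0010-MDT0015), matching the setup in the description.
example_targets = {
    i: f"lustre-MDT{i:04x}_UUID"
    for i in list(range(10)) + list(range(16, 22))
}

if __name__ == "__main__":
    print(len(list_targets_bounded_by_count(example_targets)))
    print(len(list_targets_all(example_targets)))
```

In this model the count-bounded walk stops at index 15, so the six targets at indices 16 to 21 never appear, which matches the ten entries seen in target_obd.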


          dinatale2 Giuseppe Di Natale (Inactive) added a comment -

          Lai, I just launched a new test setup to do more testing. lfs mkdir was successful like you suggested it would be. Would this mean that there is a bug somewhere in the proc handler for /proc/fs/lustre/lmv/lustre-clilmv-.../target_obd since it doesn't contain all active MDTs in this case?

          ofaaland Olaf Faaland added a comment -

          Lai,

          I see what you mean. That explains why our mkdir failed. However, the proc files do not seem to be consistent, which looks like a separate problem.

          On the client, MDTs with indexes 0x10-0x15 are missing from the listing in /proc/fs/lustre/lmv/lustre-clilmv-ffff881003e14400/target_obd, even though they are all present in /proc/fs/lustre/lmv/lustre-clilmv-ffff881003e14400/target_obds/. You can see this in the description, above. Why would that be?

          thanks,
          Olaf

          laisiyao Lai Siyao added a comment -

          This looks to be just as designed, because during format "--index" specifies the target index in the system. So in your original setup, `lfs mkdir --index 10 ...` will fail, but `lfs mkdir --index 16 ...` should succeed, because no MDT with index 10 exists, whereas index 16 does (it is the target named MDT0010, since the name suffix is the index in hexadecimal).
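
The relationship between target names and indices can be sketched as follows. This is a minimal illustration that assumes the four-character suffix after "MDT" is the index in hexadecimal, as the outputs in this ticket show. The helper names mdt_index and mdt_name are hypothetical and are not Lustre APIs.

```python
def mdt_index(target_name):
    """Return the decimal index encoded in a name such as 'lustre-MDT0010_UUID'."""
    head, sep, suffix = target_name.rpartition("MDT")
    if not sep:
        raise ValueError(f"no MDT suffix in {target_name!r}")
    # The four characters after "MDT" are read as a hexadecimal number.
    return int(suffix[:4], 16)


def mdt_name(fsname, index):
    """Build the target name for a decimal index, e.g. 12 gives 'lustre-MDT000c'."""
    return f"{fsname}-MDT{index:04x}"


if __name__ == "__main__":
    print(mdt_index("lustre-MDT0010_UUID"))
    print(mdt_name("lustre", 12))
```

Read this way, a file system whose targets are named MDT0000-MDT0009 and MDT0010-MDT0015 has indices 0-9 and 16-21, so --index=12 has no target behind it while --index=16 does.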


          dinatale2 Giuseppe Di Natale (Inactive) added a comment - edited

          I was able to confirm Olaf's speculation on the naming convention. It appears the naming may be the source of the problem. I went ahead and redeployed a test file system where the MDT names ranged from 0000-000F and the client was able to connect to all MDTs. For completeness, lots of info below.

          [root@catalyst100:~]# lfs mkdir --index=12 /p/lustre/dinatale/testdir
          [root@catalyst100:~]# lfs getdirstripe /p/lustre/dinatale/testdir
          /p/lustre/dinatale/testdir
          lmv_stripe_count: 0 lmv_stripe_offset: 12

          [root@catalyst100:mdc]# grep current_state */state
          lustre-MDT0000-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0001-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0002-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0003-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0004-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0005-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0006-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0007-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0008-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT0009-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT000a-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT000b-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT000c-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT000d-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT000e-mdc-ffff880fbb04f400/state:current_state: FULL
          lustre-MDT000f-mdc-ffff880fbb04f400/state:current_state: FULL

          [root@catalyst100:~]# ls /proc/fs/lustre/lmv/lustre-clilmv-ffff880fbb04f400/target_obds/
          lustre-MDT0000-mdc-ffff880fbb04f400  lustre-MDT0004-mdc-ffff880fbb04f400  lustre-MDT0008-mdc-ffff880fbb04f400  lustre-MDT000c-mdc-ffff880fbb04f400
          lustre-MDT0001-mdc-ffff880fbb04f400  lustre-MDT0005-mdc-ffff880fbb04f400  lustre-MDT0009-mdc-ffff880fbb04f400  lustre-MDT000d-mdc-ffff880fbb04f400
          lustre-MDT0002-mdc-ffff880fbb04f400  lustre-MDT0006-mdc-ffff880fbb04f400  lustre-MDT000a-mdc-ffff880fbb04f400  lustre-MDT000e-mdc-ffff880fbb04f400
          lustre-MDT0003-mdc-ffff880fbb04f400  lustre-MDT0007-mdc-ffff880fbb04f400  lustre-MDT000b-mdc-ffff880fbb04f400  lustre-MDT000f-mdc-ffff880fbb04f400

          [root@catalyst100:~]# lfs mdts
          MDTS:
          0: lustre-MDT0000_UUID ACTIVE
          1: lustre-MDT0001_UUID ACTIVE
          2: lustre-MDT0002_UUID ACTIVE
          3: lustre-MDT0003_UUID ACTIVE
          4: lustre-MDT0004_UUID ACTIVE
          5: lustre-MDT0005_UUID ACTIVE
          6: lustre-MDT0006_UUID ACTIVE
          7: lustre-MDT0007_UUID ACTIVE
          8: lustre-MDT0008_UUID ACTIVE
          9: lustre-MDT0009_UUID ACTIVE
          10: lustre-MDT000a_UUID ACTIVE
          11: lustre-MDT000b_UUID ACTIVE
          12: lustre-MDT000c_UUID ACTIVE
          13: lustre-MDT000d_UUID ACTIVE
          14: lustre-MDT000e_UUID ACTIVE
          15: lustre-MDT000f_UUID ACTIVE

          dinatale2 Giuseppe Di Natale (Inactive) added a comment -

          I was able to reproduce the issue and collected a log from a client covering a mount and the lfs command you requested. Let me know if you need anything else. The log file is called "debug_client_missing_mdts.log".

          dinatale2 Giuseppe Di Natale (Inactive) added a comment -

          Unfortunately, the file system we were testing with was only a temporary setup on one of our clusters. It no longer exists, so we can't collect any client logs for that specific file system. I may be able to reproduce the issue today and get you logs if I am successful.

          People

            Assignee: laisiyao Lai Siyao
            Reporter: dinatale2 Giuseppe Di Natale (Inactive)