LU-4554: OI scrub always runs on ldiskfs MDS start up

Details


    Description

      We are running Lustre 2.4.0-21chaos (see http://github.com/chaos/lustre); of particular interest are most likely these two patches that we are carrying:

      • LU-3934 scrub: detect upgraded from 1.8 correctly
      • LU-3420 scrub: trigger OI scrub properly

      We now find that, at least on the ldiskfs MDS nodes, OI scrub runs on every start up of the MDS. The console message looks something like this:

      2014-01-28 09:27:28 sumom-mds1 login: LustreError: 0-0: lsc-MDT0000: trigger OI scrub by RPC for [0x7e4d2310f09:0x2ddf:0x0], rc = 0 [1]

      Given the frequency of MDS reboots (i.e. often) required lately for other bugs, OI scrub is running far too much.


        Activity

          nedbass Ned Bass (Inactive) added a comment - Patch for master: http://review.whamcloud.com/#/c/9067/

          nedbass Ned Bass (Inactive) added a comment -

          I peeked at the /OI_scrub file while an auto-scrub was running. It showed that bit 0 was set in sf->sf_oi_bitmap. This is wrong, because the OI already exists and OI_scrub has already run to completion several times.

          I think I see the problem in osd_oi_table_open(). Note the format string assumes the OI containers have names like oi.16.0, oi.16.1, and so on. However, for our upgraded filesystems we have only one OI container named oi.16. So osd_oi_open() returns ENOENT and we proceed to set the "recreated" bit in the bitmap.
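
          The mismatch is easy to demonstrate outside the kernel. Below is a user-space sketch with hypothetical helper names (not the actual osd_oi_table_open()/osd_oi_open() code): it probes only the numbered oi.16.N names, so on an upgraded MDT that carries a single oi.16 file every probe fails with ENOENT and the caller would mark the slot as "recreated", unless it also falls back to the legacy single-file name.

          /*
           * Illustrative user-space sketch, not the osd-ldiskfs source.  It probes
           * only the numbered "oi.16.N" names, the way the format string in
           * osd_oi_table_open() does, and shows why an upgraded MDT that carries a
           * single "oi.16" file looks as if every OI container were missing.  The
           * helper names and the count of 64 are assumptions for illustration.
           */
          #include <errno.h>
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>

          #define OI_NAME_BASE "oi.16"

          /* Probe one OI container by name: 0 on success, -errno on failure. */
          static int probe_oi(const char *mntpt, const char *name)
          {
                  char path[256];
                  int fd;

                  snprintf(path, sizeof(path), "%s/%s", mntpt, name);
                  fd = open(path, O_RDONLY);
                  if (fd < 0)
                          return -errno;
                  close(fd);
                  return 0;
          }

          int main(int argc, char **argv)
          {
                  const char *mntpt = argc > 1 ? argv[1] : ".";
                  char name[32];
                  int i, missing = 0;

                  for (i = 0; i < 64; i++) {
                          /* Only the multi-file layout "oi.16.N" is tried here... */
                          snprintf(name, sizeof(name), "%s.%d", OI_NAME_BASE, i);
                          if (probe_oi(mntpt, name) == -ENOENT) {
                                  /* ...so on an upgraded MDT with only "oi.16" every
                                   * probe hits ENOENT and the caller would set the
                                   * "recreated" bit, although the OI data is present
                                   * under the legacy name. */
                                  missing++;
                          }
                  }

                  /* A fallback probe of the legacy single-file name avoids that. */
                  if (missing > 0 && probe_oi(mntpt, OI_NAME_BASE) == 0)
                          printf("legacy %s exists; nothing was really recreated\n",
                                 OI_NAME_BASE);
                  else
                          printf("%d of 64 numbered OI containers missing\n", missing);
                  return 0;
          }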


          nedbass Ned Bass (Inactive) added a comment -

          I also notice osd_fid_lookup() starts the scrub using osd_scrub_start(dev), which only enables the flag SS_AUTO. So even though (ldiskfs_test_bit(osd_oi_fid2idx(dev,fid), sf->sf_oi_bitmap)) is true, (unless I misunderstand something) we would not see the "recreated" flag in oi_scrub.
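
          To make the observation concrete, here is a tiny illustrative sketch of how a "flags:" field like the one in oi_scrub could be rendered from a bitmask. The flag names and values are hypothetical stand-ins for the real ones in the osd-ldiskfs scrub code; the point is simply that a start request carrying nothing but the auto bit can never make "recreated" appear, whatever sf_oi_bitmap says.

          /*
           * Illustrative sketch only; flag names and values are hypothetical
           * stand-ins for the ones used by the osd-ldiskfs scrub code.
           */
          #include <stdio.h>

          enum scrub_flags {
                  SCRUB_F_RECREATED    = 1 << 0,  /* an OI file had to be re-created */
                  SCRUB_F_INCONSISTENT = 1 << 1,  /* a bad OI mapping was found */
                  SCRUB_F_AUTO         = 1 << 2,  /* scrub was triggered internally */
          };

          /* Render a "flags:" line the way the proc file presents it. */
          static void print_flags(unsigned int flags)
          {
                  static const char *names[] = { "recreated", "inconsistent", "auto" };
                  const char *sep = "";
                  unsigned int i;

                  printf("flags:");
                  for (i = 0; i < 3; i++) {
                          if (flags & (1u << i)) {
                                  printf("%s %s", sep, names[i]);
                                  sep = ",";
                          }
                  }
                  printf("\n");
          }

          int main(void)
          {
                  /* What starting the scrub with only the auto bit produces: */
                  print_flags(SCRUB_F_AUTO);                     /* flags: auto */

                  /* What one would expect if an OI file had really been re-created: */
                  print_flags(SCRUB_F_AUTO | SCRUB_F_RECREATED); /* flags: recreated, auto */
                  return 0;
          }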


          nedbass Ned Bass (Inactive) added a comment -

          Fan Yong, the -1 debug log is from a classified system, so I can't send it, but if you have specific questions about it I can look for you. While the scrub was in progress, the flags field only had 'auto'. An example FID from that system that followed the "trigger" path was [0x1a89082ad98:0x4d:0x0].


          yong.fan nasf (Inactive) added a comment -

          Then it is strange; the code should not jump from such a place. Is there any (-1 level) debug log from when the OI scrub was triggered?


          morrone Christopher Morrone (Inactive) added a comment -

          Here is the oi_scrub output from one of the MDS nodes that was rebooted on our open network this morning:

          > cat osd-ldiskfs/lsc-MDT0000/oi_scrub
          name: OI_scrub
          magic: 0x4c5fd252
          oi_files: 1
          status: completed
          flags:
          param:
          time_since_last_completed: 28001 seconds
          time_since_latest_start: 36680 seconds
          time_since_last_checkpoint: 28001 seconds
          latest_start_position: 12
          last_checkpoint_position: 991133697
          first_failure_position: N/A
          checked: 170974399
          updated: 0
          failed: 0
          prior_updated: 0
          noscrub: 2284
          igif: 12381456
          success_count: 12
          run_time: 8678 seconds
          average_speed: 19702 objects/sec
          real-time_speed: N/A
          current_position: N/A
          

          yong.fan nasf (Inactive) added a comment -

          The FID [0x7e4d2310f09:0x2ddf:0x0] is a normal FID, so according to the logic above, the condition

          "(ldiskfs_test_bit(osd_oi_fid2idx(dev,fid), sf->sf_oi_bitmap))" should be true.

          That means the OSD thinks the OI files are being re-created, so it cannot tell whether the object really does not exist or whether the related OI mapping simply has not been re-inserted yet.

          So would you please show the OI_scrub file under /proc so we can check whether the system really is re-creating the OI files or not? Thanks!
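
          For context, below is a minimal sketch of the FID-to-OI-slot mapping that this condition relies on; the mask-by-(count - 1) form is an assumption for illustration (the real computation is osd_oi_fid2idx() in osd-ldiskfs). With oi_files: 1, every FID maps to slot 0, so a single stale bit in sf_oi_bitmap is enough to make every failed OI lookup satisfy the test.

          /*
           * Minimal sketch of the FID-to-OI-slot mapping; the mask form below is
           * an assumption, the real computation is osd_oi_fid2idx() in osd-ldiskfs.
           */
          #include <stdint.h>
          #include <stdio.h>

          struct lu_fid {
                  uint64_t f_seq;
                  uint32_t f_oid;
                  uint32_t f_ver;
          };

          /* Map a FID to one of oi_count OI containers (oi_count a power of two). */
          static unsigned int fid2idx(const struct lu_fid *fid, unsigned int oi_count)
          {
                  return (unsigned int)(fid->f_seq & (oi_count - 1));
          }

          int main(void)
          {
                  /* The FID from the console message quoted in the description. */
                  struct lu_fid fid = { .f_seq = 0x7e4d2310f09ULL, .f_oid = 0x2ddf };
                  unsigned int oi_count = 1;   /* "oi_files: 1" on this MDT */
                  uint64_t oi_bitmap = 0x1;    /* bit 0 marked "recreated" */
                  unsigned int idx = fid2idx(&fid, oi_count);

                  /* With a single OI file every FID maps to slot 0, so one stale bit
                   * makes every failed OI lookup take the "goto trigger" branch. */
                  printf("idx = %u, recreated bit set = %d\n",
                         idx, (int)((oi_bitmap >> idx) & 1));
                  return 0;
          }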


          nedbass Ned Bass (Inactive) added a comment -

          I'll note that in the oi_scrub file, the Updated count is 0 and the Success count is 5. From the current scan position relative to the number of inodes used, I think it is nearing completion. So it seems the OI scrub is not finding anything to repair, yet certain client requests trigger the scrub. From server-side debug logs, it seems that the trigger is happening here in osd_fid_lookup():

           /* Search order: 3. OI files. */
           result = osd_oi_lookup(info, dev, fid, id, true);
           if (result == -ENOENT) {
                   if (!fid_is_norm(fid) || fid_is_on_ost(info, dev, fid) ||
                       !ldiskfs_test_bit(osd_oi_fid2idx(dev,fid),
                                         sf->sf_oi_bitmap))
                           GOTO(out, result = 0);

                   goto trigger;
           }
          

          There are no debug log entries from osd_iget() so it must be jumping past the iget: goto label, which means it must take the path above.
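
          For readers following the code, here is a minimal user-space sketch (hypothetical names, not the osd-ldiskfs source) of what the trigger path boils down to: start or join the automatic scrub and return -EINPROGRESS to the caller, which matches the behaviour Christopher Morrone describes in the comment below.

          /*
           * Illustrative sketch with hypothetical names: the "trigger" label
           * starts (or joins) the automatic scrub and hands -EINPROGRESS back
           * up the stack, which is what clients end up seeing while it runs.
           */
          #include <errno.h>
          #include <stdbool.h>
          #include <stdio.h>

          struct scrub_state {
                  bool running;   /* is the OI scrub thread already active? */
          };

          /* Kick off the scrub if needed; the caller gets -EINPROGRESS either way. */
          static int oi_lookup_trigger(struct scrub_state *scrub)
          {
                  if (!scrub->running) {
                          scrub->running = true;   /* osd_scrub_start(dev) analogue */
                          printf("auto scrub started\n");
                  }
                  /*
                   * The lookup cannot be answered until the scrub has verified or
                   * rebuilt the OI mapping, so the caller sees -EINPROGRESS and the
                   * client has to retry until the scrub completes.
                   */
                  return -EINPROGRESS;
          }

          int main(void)
          {
                  struct scrub_state scrub = { .running = false };

                  printf("lookup rc = %d\n", oi_lookup_trigger(&scrub));
                  printf("lookup rc = %d\n", oi_lookup_trigger(&scrub));
                  return 0;
          }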


          morrone Christopher Morrone (Inactive) added a comment -

          We need to increase the priority on this ticket. The OI scrub is introducing a lot of production problems for us.

          First of all, while OI scrub is running, many client nodes hang while holding the mdc lock. Apparently they are doing an operation that triggers osd_fid_lookup() on the server. The FID lookup fails and the server decides to start OI scrub (or OI scrub is already running) and returns EINPROGRESS to the client.

          Unfortunately, the client is now pretty much unusable to anyone until OI scrub completes.

          Furthermore, it would appear that OI scrub will not respond to the stop command. I'm told that running "lctl lfsck_stop -M ls5-MDT0000" returns "operation already in progress".

          pjones Peter Jones added a comment -

          Lai

          Could you please consult with Fan Yong and respond on this one?

          Thanks

          Peter


          People

            Assignee: jamesanunez James Nunez (Inactive)
            Reporter: morrone Christopher Morrone (Inactive)
