
LU-8018: lov_init_raid0() ASSERTION( subdev != NULL ) failed: not init ost 0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.9.0
    • Affects Version: Lustre 2.8.0
    • Environment: Lustre 2.8.0 + llnl patches, both client and server
      TOSS 2 (RHEL 6.7 based)
      Kernel 2.6.32-573.22.1.1chaos.ch5.4.x86_64
    • Severity: 3

    Description

Client crashes with an LBUG during a sys_creat() call.

      LustreError: 12025:0:(lov_object.c:278:lov_init_raid0()) ASSERTION( subdev != NULL ) failed: not init ost 0
      LustreError: 12025:0:(lov_object.c:278:lov_init_raid0()) LBUG                                              
      Pid: 12025, comm: mdtest                                                                                   
      
      Call Trace:
       [<ffffffffa04a48d5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa04a4ed7>] lbug_with_loc+0x47/0xb0 [libcfs]         
       [<ffffffffa0dcc729>] lov_init_raid0+0xde9/0x1140 [lov]        
       [<ffffffffa0dc9394>] lov_object_init+0x124/0x300 [lov]        
       [<ffffffffa09347ec>] ? lu_object_add+0x2c/0x30 [obdclass]     
       [<ffffffffa09371a8>] lu_object_alloc+0xd8/0x320 [obdclass]    
       [<ffffffffa093851d>] lu_object_find_try+0xcd/0x260 [obdclass] 
       [<ffffffffa04b48d2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
       [<ffffffffa0938761>] lu_object_find_at+0xb1/0xe0 [obdclass]     
       [<ffffffffa04b7f7a>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs] 
       [<ffffffffa09387cf>] lu_object_find_slice+0x1f/0x80 [obdclass]  
       [<ffffffffa093e455>] cl_object_find+0x55/0xc0 [obdclass]        
       [<ffffffffa0e6929e>] cl_file_inode_init+0x22e/0x340 [lustre]    
       [<ffffffffa0e368b4>] ll_update_inode+0x474/0x1d90 [lustre]      
       [<ffffffffa0915765>] ? lprocfs_counter_add+0x165/0x1c0 [obdclass]
       [<ffffffffa0b01f35>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]         
       [<ffffffffa0b29372>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]    
       [<ffffffffa0b29372>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]    
       [<ffffffff812352a0>] ? security_inode_alloc+0x40/0x60            
       [<ffffffffa0e32cc9>] ? ll_lli_init+0x209/0x2a0 [lustre]          
       [<ffffffffa0e3823d>] ll_read_inode2+0x6d/0x410 [lustre]          
       [<ffffffffa0e4eceb>] ll_iget+0x12b/0x330 [lustre]                
       [<ffffffffa0e3a0ed>] ll_prep_inode+0x5bd/0xc40 [lustre]          
       [<ffffffffa0e536de>] ? ll_lookup_it+0x63e/0xda0 [lustre]         
       [<ffffffffa0e546e8>] ll_create_nd+0x3b8/0xf20 [lustre]           
       [<ffffffffa0e247c6>] ? ll_inode_permission+0x136/0x3e0 [lustre]  
       [<ffffffff811a2b76>] vfs_create+0xe6/0x110                       
       [<ffffffff811a626e>] do_filp_open+0xa8e/0xd20                    
       [<ffffffff810f3a85>] ? call_rcu_sched+0x15/0x20                  
       [<ffffffff812a07cd>] ? strncpy_from_user+0x5d/0xa0               
       [<ffffffff811b3572>] ? alloc_fd+0x92/0x160                       
       [<ffffffff81190007>] do_sys_open+0x67/0x130                      
       [<ffffffff8118fe6d>] ? filp_close+0x5d/0x90                      
       [<ffffffff81190110>] sys_open+0x20/0x30                          
       [<ffffffff81190135>] sys_creat+0x15/0x20                         
       [<ffffffff8100b112>] system_call_fastpath+0x16/0x1b
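
      The failed assertion is straightforward to model. Below is a simplified
      userspace sketch (illustrative only; the real code lives in
      lustre/lov/lov_object.c): the LOV layer keeps a per-OST array of
      client-side target devices, an OST that never connected leaves a NULL
      slot in that array, and initializing a file layout whose stripe points
      at such a slot trips the assertion and LBUGs the client.

      /* Simplified userspace model of the LU-8018 crash; the names echo the
       * Lustre sources but this is not the actual kernel code. */
      #include <assert.h>
      #include <stddef.h>
      #include <stdio.h>

      struct subdev { int ost_idx; };

      int main(void)
      {
              struct subdev connected = { .ost_idx = 1 };
              /* ld_target analog: slot 0 is NULL because OST0000 never connected */
              struct subdev *ld_target[2] = { NULL, &connected };

              int stripe_ost_idx = 0;  /* the layout places stripe 0 on OST 0 */
              struct subdev *subdev = ld_target[stripe_ost_idx];

              assert(subdev != NULL);  /* the LASSERT that fires as the LBUG above */
              printf("stripe mapped to OST %d\n", subdev->ost_idx);
              return 0;
      }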
      

      Attachments

        Activity

          [LU-8018] lov_init_raid0() ASSERTION( subdev != NULL ) failed: not init ost 0
          pjones Peter Jones added a comment -

          Landed for 2.9


gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21411/
          Subject: LU-8018 lov: ld_target could be NULL
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 916b5f7f4672c1070e27a2fe0bfae371b0a729d6
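
          The diff itself is not quoted in this ticket, but the subject line
          ("ld_target could be NULL") points at the shape of the fix: treat a
          NULL target slot as a recoverable error rather than asserting. A
          minimal userspace sketch of that direction (hypothetical names, not
          the landed patch):

          #include <errno.h>
          #include <stddef.h>
          #include <stdio.h>

          struct subdev { int ost_idx; };

          /* Hypothetical resolve step: fail with -EIO on a NULL slot instead
           * of crashing the client. */
          static int resolve_stripe(struct subdev **ld_target, int ost_idx,
                                    struct subdev **out)
          {
                  if (ld_target[ost_idx] == NULL) {
                          fprintf(stderr, "ost %d is not initialized\n", ost_idx);
                          return -EIO;
                  }
                  *out = ld_target[ost_idx];
                  return 0;
          }

          int main(void)
          {
                  struct subdev connected = { .ost_idx = 1 };
                  struct subdev *ld_target[2] = { NULL, &connected };
                  struct subdev *sub;

                  if (resolve_stripe(ld_target, 0, &sub) < 0)
                          fprintf(stderr, "create fails with EIO; client stays up\n");
                  return 0;
          }

          With that shape, a create that touches an uninitialized OST fails
          cleanly instead of LBUGging, which matches Andreas's point below
          that a disconnected OST should simply be skipped.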

gerrit Gerrit Updater added a comment (edited) -

          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/21411
          Subject: LU-8018 lov: ld_target could be NULL
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 6
          Commit: d3dceaca9964e65f0da4d05bbb3e93ee31a8a752

          ofaaland Olaf Faaland added a comment -

          Zhenyu,

          Those "Failing over" messages are a result of us umount-ing the server targets and then mounting them again. The first time at 8:34 as I recall was to see if OSTs would make the missing connections. In retrospect I should have looked at the config logs. The second time I don't recall the reason for.

          -Olaf

          bobijam Zhenyu Xu added a comment -

From the logs, the system was not stable, and several failovers happened between 8:19 and 9:01:

          [~/tmp/logs/lu8012/con-logs/] $ grep Failing *
          199.log:16:2016-04-12 08:34:22 Lustre: Failing over lustre-OST0000
          199.log:26:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0000
          200.log:19:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0001
          201.log:19:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0002
          202.log:23:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0003
          203.log:19:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0004
          204.log:16:2016-04-12 08:34:22 Lustre: Failing over lustre-OST0005
          204.log:26:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0005
          205.log:23:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0006
          206.log:23:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0007
          207.log:22:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0008
          208.log:23:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0009
          

          Can't tell what caused the failover though.


jgmitter Joseph Gmitter (Inactive) added a comment -

          Hi Bobijam,
          Can you also follow up with this ticket?
          Thanks.
          Joe

          ofaaland Olaf Faaland added a comment -

          Attached the console logs as con-logs.tgz


adilger Andreas Dilger added a comment -

          Olaf, any chance of posting the server console logs, so we can see why the initial MDS-OSS connections failed?


ofaaland Olaf Faaland added a comment -

          Andreas,

          The mdtest options were probably this:
          mdtest -F -u -i 3 -n 16000 -p 60 -v -d /p/lustre/faaland1/count10_index0

Yes, the directory was striped across all 10 MDTs in the file system, with the --default_stripe option set to match.

          The MDTs and OSTs were all started at 08:19:48, and the client mounted at 08:20:40. The job wasn't run until about 40 minutes later, which is when the clients hit the LBUG at 09:01:40. I would think that object precreation was long since finished.

          I realize now that I do have the console logs for the servers, which may have clues. So I will go through and find what is there between 8:19 and 9:01 and report back.

Yes, based on the output of dmesg in crash, the lov_ea.c:195:lsm_unpackmd_common() error message appears first, and the LBUG follows. As you say, I saw that the two OSTs were not properly connected, but I expected that no objects would be allocated on them and that they would just sit idle.


adilger Andreas Dilger added a comment -

          Olaf, what options were used to run mdtest? Was the test run inside a striped directory, and if yes, presumably with a stripe count of 10? Had the filesystem _just_ been mounted? I recall one problem during pre-release DNE testing with 128 MDTs where it took some time for all the MDTs to connect to each other, and I wonder if there was still MDT-OST object precreation underway when the test started?

The LBUG and lov console messages do appear related. Presumably the lsm_unpackmd() error was first, and the LBUG second? The LBUG could just be bad error handling in the cleanup path, but it doesn't explain why the client was given a bad layout by the MDS. It is not unusual for an OST to be disconnected from the MDS, and it should just be skipped in this case.


People

             Assignee: bobijam Zhenyu Xu
             Reporter: ofaaland Olaf Faaland
             Votes: 0
             Watchers: 8
