[LU-8018] lov_init_raid0() ASSERTION( subdev != NULL ) failed: not init ost 0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0
    • Environment: Lustre 2.8.0 + llnl patches, both client and server
      TOSS 2 (RHEL 6.7 based)
      Kernel 2.6.32-573.22.1.1chaos.ch5.4.x86_64
    • Severity: 3

    Description

      Client crashes with an LBUG during a sys_creat() call.

      LustreError: 12025:0:(lov_object.c:278:lov_init_raid0()) ASSERTION( subdev != NULL ) failed: not init ost 0
      LustreError: 12025:0:(lov_object.c:278:lov_init_raid0()) LBUG                                              
      Pid: 12025, comm: mdtest                                                                                   
      
      Call Trace:
       [<ffffffffa04a48d5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa04a4ed7>] lbug_with_loc+0x47/0xb0 [libcfs]         
       [<ffffffffa0dcc729>] lov_init_raid0+0xde9/0x1140 [lov]        
       [<ffffffffa0dc9394>] lov_object_init+0x124/0x300 [lov]        
       [<ffffffffa09347ec>] ? lu_object_add+0x2c/0x30 [obdclass]     
       [<ffffffffa09371a8>] lu_object_alloc+0xd8/0x320 [obdclass]    
       [<ffffffffa093851d>] lu_object_find_try+0xcd/0x260 [obdclass] 
       [<ffffffffa04b48d2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
       [<ffffffffa0938761>] lu_object_find_at+0xb1/0xe0 [obdclass]     
       [<ffffffffa04b7f7a>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs] 
       [<ffffffffa09387cf>] lu_object_find_slice+0x1f/0x80 [obdclass]  
       [<ffffffffa093e455>] cl_object_find+0x55/0xc0 [obdclass]        
       [<ffffffffa0e6929e>] cl_file_inode_init+0x22e/0x340 [lustre]    
       [<ffffffffa0e368b4>] ll_update_inode+0x474/0x1d90 [lustre]      
       [<ffffffffa0915765>] ? lprocfs_counter_add+0x165/0x1c0 [obdclass]
       [<ffffffffa0b01f35>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]         
       [<ffffffffa0b29372>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]    
       [<ffffffffa0b29372>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]    
       [<ffffffff812352a0>] ? security_inode_alloc+0x40/0x60            
       [<ffffffffa0e32cc9>] ? ll_lli_init+0x209/0x2a0 [lustre]          
       [<ffffffffa0e3823d>] ll_read_inode2+0x6d/0x410 [lustre]          
       [<ffffffffa0e4eceb>] ll_iget+0x12b/0x330 [lustre]                
       [<ffffffffa0e3a0ed>] ll_prep_inode+0x5bd/0xc40 [lustre]          
       [<ffffffffa0e536de>] ? ll_lookup_it+0x63e/0xda0 [lustre]         
       [<ffffffffa0e546e8>] ll_create_nd+0x3b8/0xf20 [lustre]           
       [<ffffffffa0e247c6>] ? ll_inode_permission+0x136/0x3e0 [lustre]  
       [<ffffffff811a2b76>] vfs_create+0xe6/0x110                       
       [<ffffffff811a626e>] do_filp_open+0xa8e/0xd20                    
       [<ffffffff810f3a85>] ? call_rcu_sched+0x15/0x20                  
       [<ffffffff812a07cd>] ? strncpy_from_user+0x5d/0xa0               
       [<ffffffff811b3572>] ? alloc_fd+0x92/0x160                       
       [<ffffffff81190007>] do_sys_open+0x67/0x130                      
       [<ffffffff8118fe6d>] ? filp_close+0x5d/0x90                      
       [<ffffffff81190110>] sys_open+0x20/0x30                          
       [<ffffffff81190135>] sys_creat+0x15/0x20                         
       [<ffffffff8100b112>] system_call_fastpath+0x16/0x1b
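
      For reference, the assertion appears to correspond to the stripe-to-target mapping done while the client instantiates the file's LOV object: each stripe's OST index is looked up in the client-side table of OST sub-devices, and the 2.8 code asserts that the slot is non-NULL. The following is only a minimal userspace model of that lookup, not the actual lov_object.c code; the names oinfo, loi_ost_idx and ld_target are loose stand-ins for the kernel structures:

       #include <assert.h>
       #include <stdio.h>

       /* Loose stand-ins for the kernel structures; not the real Lustre API. */
       struct oinfo   { int loi_ost_idx; };   /* stripe -> OST index        */
       struct ost_dev { const char *name; };  /* per-OST client sub-device  */

       int main(void)
       {
               /* OST0000 never connected, so its client-side slot stays NULL. */
               struct ost_dev ost1 = { "OST0001" };
               struct ost_dev *ld_target[2] = { NULL, &ost1 };

               /* A one-stripe layout placed on OST index 0 (cf. the
                * "stripe 0 idx 0" LMM dump quoted later in the comments). */
               struct oinfo lsm_oinfo[1] = { { 0 } };
               int stripe_count = 1;

               for (int i = 0; i < stripe_count; i++) {
                       int ost_idx = lsm_oinfo[i].loi_ost_idx;
                       struct ost_dev *subdev = ld_target[ost_idx];

                       /* The kernel code asserts here ("not init ost %d"), so a
                        * missing target becomes an LBUG instead of an error. */
                       assert(subdev != NULL);
                       printf("stripe %d -> %s\n", i, subdev->name);
               }
               return 0;
       }

      Run as-is, the model aborts on the assert at stripe 0, which is the userspace analogue of the LBUG above.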
      

      Attachments

        con-logs.tgz
        Activity


          jgmitter Joseph Gmitter (Inactive) added a comment -

          Hi Bobijam,
          Can you also follow up with this ticket?
          Thanks.
          Joe
          ofaaland Olaf Faaland added a comment -

          Attached the console logs as con-logs.tgz


          adilger Andreas Dilger added a comment -

          Olaf, any chance of posting the server console logs, so we can see why the initial MDS-OSS connections failed?

          ofaaland Olaf Faaland added a comment -

          Andreas,

          The mdtest options were probably this:
          mdtest -F -u -i 3 -n 16000 -p 60 -v -d /p/lustre/faaland1/count10_index0

          Yes, the directory was striped across all 10 MDTs in the file system, with --default_stripe option set to match.

          The MDTs and OSTs were all started at 08:19:48, and the client mounted at 08:20:40. The job wasn't run until about 40 minutes later, which is when the clients hit the LBUG at 09:01:40. I would think that object precreation was long since finished.

          I realize now that I do have the console logs for the servers, which may have clues. So I will go through and find what is there between 8:19 and 9:01 and report back.

          Yes, based on the output of dmesg in crash, the lov_ea.c:195:lsm_unpackmd_common() error message appears first, and the LBUG follows. As you say, I saw that the two OSTs were not properly connected, but I expected that no objects would be allocated on them and that they would just sit idle.


          adilger Andreas Dilger added a comment -

          Olaf, what options were used to run mdtest? Was the test run inside a striped directory, and if yes, presumably with a stripe count of 10? Had the filesystem _just_ been mounted? I recall one problem during pre-release DNE testing with 128 MDTs where it took some time for all the MDTs to connect to each other, and I wonder if there was still MDT-OST object precreation underway when the test started.

          The LBUG and lov console messages do appear related. Presumably the lsm_unpackmd() error came first, and the LBUG second? The LBUG could just be bad error handling in the cleanup path, but that doesn't explain why the client was given a bad layout by the MDS. It is not unusual for an OST to be disconnected from the MDS, and it should just be skipped in this case.

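
          For illustration only, the "just skip it" handling Andreas describes might look like the sketch below, in the same toy model as above: check the target slot and return an error (or mark the stripe inactive) instead of asserting. This is not the actual patch for this ticket, just a sketch of the general approach; the error code and cleanup details are assumptions.

           #include <errno.h>
           #include <stddef.h>
           #include <stdio.h>

           struct ost_dev { const char *name; };

           /* Sketch: treat a missing OST target as a recoverable condition and
            * let the error propagate back out of the object-init path, rather
            * than LBUGging the whole client. */
           static int init_stripe(struct ost_dev **ld_target, int ost_idx)
           {
                   struct ost_dev *subdev = ld_target[ost_idx];

                   if (subdev == NULL) {
                           fprintf(stderr, "ost %d not initialized, skipping\n", ost_idx);
                           return -EIO;    /* assumed error code, for illustration */
                   }
                   printf("stripe on %s\n", subdev->name);
                   return 0;
           }

           int main(void)
           {
                   struct ost_dev ost1 = { "OST0001" };
                   struct ost_dev *ld_target[2] = { NULL, &ost1 };

                   int rc = init_stripe(ld_target, 0);    /* OST0000 missing -> -EIO */

                   if (rc == 0)
                           rc = init_stripe(ld_target, 1);
                   return rc ? 1 : 0;
           }
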
          ofaaland Olaf Faaland added a comment - edited

          The filesystem existed only temporarily during a DAT on Catalyst, and it's gone now. However, I have crash dumps from the two client nodes that crashed. Both had just started to run mdtest.

          I do not have any records from the two OSTs which were missing connections. I should have crashed one and/or gathered more information. My mistake.

          ofaaland Olaf Faaland added a comment -

          I don't have a record to prove it, but I believe that OST 0 was one of the two OSTs which did not report connections to all the MDTs.

          ofaaland Olaf Faaland added a comment -

          Our local patches are mostly build-related and I do not believe any of them are related to this issue.

          ofaaland Olaf Faaland added a comment -

          The console log contains the following messages, which occurred within the same second as the LBUG:

          LustreError: 12027:0:(lov_ea.c:195:lsm_unpackmd_common()) lustre-clilov_UUID: OST index 0 missing   
          Lustre: 12027:0:(lov_pack.c:65:lov_dump_lmm_common()) objid 0x4:2147484700, magic 0x0bd10bd0, pattern 0x1
          Lustre: 12025:0:(lov_pack.c:65:lov_dump_lmm_common()) objid 0x8:1073742858, magic 0x0bd10bd0, pattern 0x1
          Lustre: 12025:0:(lov_pack.c:69:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 1, layout_gen 0  
          Lustre: 12025:0:(lov_pack.c:88:lov_dump_lmm_objects()) stripe 0 idx 0 subobj 0x940000400:34
          
          ofaaland Olaf Faaland added a comment -

          There were 10 MDTs, 20 OSTs at the time. Of the 20 OSTs, 2 reported only 5 exports in /proc/fs/lustre/obdfilter/lustre-OST0000/num_exports. The remaining 18 OSTs reported 11 exports.

          ofaaland Olaf Faaland added a comment -

          For the specific patch set, see the 2.8-llnl-preview1 branch.


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: ofaaland Olaf Faaland
            Votes: 0
            Watchers: 8
