[LU-8018] lov_init_raid0() ASSERTION( subdev != NULL ) failed: not init ost 0 Created: 13/Apr/16 Updated: 29/Nov/16 Resolved: 26/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl, llnlfixready | ||
| Environment: | Lustre 2.8.0 + llnl patches, both client and server |
| Attachments: | con-logs.tgz |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Client crashes with an LBUG during a sys_creat() call:

LustreError: 12025:0:(lov_object.c:278:lov_init_raid0()) ASSERTION( subdev != NULL ) failed: not init ost 0
LustreError: 12025:0:(lov_object.c:278:lov_init_raid0()) LBUG
Pid: 12025, comm: mdtest

Call Trace:
[<ffffffffa04a48d5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa04a4ed7>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0dcc729>] lov_init_raid0+0xde9/0x1140 [lov]
[<ffffffffa0dc9394>] lov_object_init+0x124/0x300 [lov]
[<ffffffffa09347ec>] ? lu_object_add+0x2c/0x30 [obdclass]
[<ffffffffa09371a8>] lu_object_alloc+0xd8/0x320 [obdclass]
[<ffffffffa093851d>] lu_object_find_try+0xcd/0x260 [obdclass]
[<ffffffffa04b48d2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
[<ffffffffa0938761>] lu_object_find_at+0xb1/0xe0 [obdclass]
[<ffffffffa04b7f7a>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
[<ffffffffa09387cf>] lu_object_find_slice+0x1f/0x80 [obdclass]
[<ffffffffa093e455>] cl_object_find+0x55/0xc0 [obdclass]
[<ffffffffa0e6929e>] cl_file_inode_init+0x22e/0x340 [lustre]
[<ffffffffa0e368b4>] ll_update_inode+0x474/0x1d90 [lustre]
[<ffffffffa0915765>] ? lprocfs_counter_add+0x165/0x1c0 [obdclass]
[<ffffffffa0b01f35>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa0b29372>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
[<ffffffffa0b29372>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
[<ffffffff812352a0>] ? security_inode_alloc+0x40/0x60
[<ffffffffa0e32cc9>] ? ll_lli_init+0x209/0x2a0 [lustre]
[<ffffffffa0e3823d>] ll_read_inode2+0x6d/0x410 [lustre]
[<ffffffffa0e4eceb>] ll_iget+0x12b/0x330 [lustre]
[<ffffffffa0e3a0ed>] ll_prep_inode+0x5bd/0xc40 [lustre]
[<ffffffffa0e536de>] ? ll_lookup_it+0x63e/0xda0 [lustre]
[<ffffffffa0e546e8>] ll_create_nd+0x3b8/0xf20 [lustre]
[<ffffffffa0e247c6>] ? ll_inode_permission+0x136/0x3e0 [lustre]
[<ffffffff811a2b76>] vfs_create+0xe6/0x110
[<ffffffff811a626e>] do_filp_open+0xa8e/0xd20
[<ffffffff810f3a85>] ? call_rcu_sched+0x15/0x20
[<ffffffff812a07cd>] ? strncpy_from_user+0x5d/0xa0
[<ffffffff811b3572>] ? alloc_fd+0x92/0x160
[<ffffffff81190007>] do_sys_open+0x67/0x130
[<ffffffff8118fe6d>] ? filp_close+0x5d/0x90
[<ffffffff81190110>] sys_open+0x20/0x30
[<ffffffff81190135>] sys_creat+0x15/0x20
[<ffffffff8100b112>] system_call_fastpath+0x16/0x1b |
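For context, here is a simplified sketch of the stripe-initialization loop in lov_init_raid0() that trips this assertion. It is an illustration only, not a verbatim excerpt from the 2.8 source; field and helper names are approximate, and it relies on Lustre-internal headers.

/* Illustrative sketch only: approximates the lov_init_raid0() stripe loop.
 * Names are simplified and may not match lov_object.c exactly. */
static int lov_init_raid0_sketch(struct lov_device *dev, struct lov_stripe_md *lsm)
{
	int i;

	for (i = 0; i < lsm->lsm_stripe_count; i++) {
		struct lov_oinfo *oinfo = lsm->lsm_oinfo[i];
		int ost_idx = oinfo->loi_ost_idx;
		struct cl_device *subdev;

		/* Map the stripe's OST index to the client-side sub-device.
		 * If that OST was never initialized on this client (e.g. it
		 * never connected), the lookup yields NULL ... */
		subdev = lovsub2cl_dev(dev->ld_target[ost_idx]);

		/* ... and the 2.8 code asserts rather than failing gracefully,
		 * producing "ASSERTION( subdev != NULL ) failed: not init ost 0". */
		LASSERTF(subdev != NULL, "not init ost %d\n", ost_idx);

		/* normal path: initialize the cl sub-object for this stripe */
	}
	return 0;
}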
| Comments |
| Comment by Olaf Faaland [ 13/Apr/16 ] |
|
For the specific patch set, see branch:2.8-llnl-preview1 |
| Comment by Olaf Faaland [ 13/Apr/16 ] |
|
There were 10 MDTs, 20 OSTs at the time. Of the 20 OSTs, 2 reported only 5 exports in /proc/fs/lustre/obdfilter/lustre-OST0000/num_exports. The remaining 18 OSTs reported 11 exports. |
| Comment by Olaf Faaland [ 13/Apr/16 ] |
|
The console log contains the following messages, which occurred within the same second as the LBUG:

LustreError: 12027:0:(lov_ea.c:195:lsm_unpackmd_common()) lustre-clilov_UUID: OST index 0 missing
Lustre: 12027:0:(lov_pack.c:65:lov_dump_lmm_common()) objid 0x4:2147484700, magic 0x0bd10bd0, pattern 0x1
Lustre: 12025:0:(lov_pack.c:65:lov_dump_lmm_common()) objid 0x8:1073742858, magic 0x0bd10bd0, pattern 0x1
Lustre: 12025:0:(lov_pack.c:69:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 1, layout_gen 0
Lustre: 12025:0:(lov_pack.c:88:lov_dump_lmm_objects()) stripe 0 idx 0 subobj 0x940000400:34 |
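For reference, a hedged sketch of the kind of per-stripe check behind the "OST index 0 missing" message from lsm_unpackmd_common(); names are simplified and this is not a verbatim excerpt of lov_ea.c.

/* Illustrative sketch only: approximates the layout-unpack check in lov_ea.c.
 * While unpacking a layout received from the MDS, each stripe's OST index is
 * validated against the client's configured targets. */
static void lsm_unpack_stripe_check_sketch(struct lov_obd *lov,
					   struct lov_oinfo *loi,
					   const char *obd_name)
{
	/* If the target slot for this stripe's OST is empty (the client never
	 * set up that OST), log "OST index %d missing" -- the message seen
	 * above.  The stripe is still left referencing an uninitialized
	 * target, which lov_init_raid0() later trips over. */
	if (lov->lov_tgts[loi->loi_ost_idx] == NULL)
		CERROR("%s: OST index %d missing\n",
		       obd_name, loi->loi_ost_idx);
}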
| Comment by Olaf Faaland [ 13/Apr/16 ] |
|
Our local patches are mostly build-related and I do not believe any of them are related to this issue. |
| Comment by Olaf Faaland [ 13/Apr/16 ] |
|
I don't have a record to prove it, but I believe that OST 0 was one of the two OSTs which did not report connections to all the MDTs. |
| Comment by Olaf Faaland [ 13/Apr/16 ] |
|
The filesystem existed only temporarily during a DAT on Catalyst, and it's gone now. However I have crash dumps from the two client nodes that crashed. Both had just started to run mdtest. I do not have any records from the two OSTs which were missing connections. I should have crashed one and/or gathered more information. My mistake. |
| Comment by Andreas Dilger [ 13/Apr/16 ] |
|
Olaf, what options were used to run mdtest? Was the test run inside a striped directory, and if so, presumably with a stripe count of 10? Had the filesystem _just_ been mounted? I recall one problem during pre-release DNE testing with 128 MDTs where it took some time for all the MDTs to connect to each other, and I wonder if there was still MDT-OST object precreation underway when the test started?

The LBUG and lov console messages do appear related. Presumably the lsm_unpackmd() error came first, and the LBUG second? The LBUG could just be bad error handling in the cleanup path, but that doesn't explain why the client was given a bad layout by the MDS. It is not unusual for an OST to be disconnected from the MDS, and it should just be skipped in this case. |
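To illustrate the "should just be skipped" behaviour Andreas describes, here is a hedged sketch of how the raid0 init loop could tolerate an uninitialized OST instead of asserting. This shows only the general shape of such a fix, not the patch that later landed (http://review.whamcloud.com/21411); names are approximate.

/* Illustrative sketch only: one way to handle an uninitialized OST target
 * without an LBUG -- return an error the caller can propagate instead of
 * asserting.  Not the actual landed change. */
static int lov_init_raid0_tolerant_sketch(struct lov_device *dev,
					  struct lov_stripe_md *lsm)
{
	int i;

	for (i = 0; i < lsm->lsm_stripe_count; i++) {
		int ost_idx = lsm->lsm_oinfo[i]->loi_ost_idx;

		if (dev->ld_target[ost_idx] == NULL) {
			/* OST never initialized on this client: report it and
			 * fail this object's initialization gracefully. */
			CERROR("stripe %d references uninitialized OST %d\n",
			       i, ost_idx);
			return -EIO;
		}

		/* normal path: set up the sub-object for this stripe */
	}
	return 0;
}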
| Comment by Olaf Faaland [ 13/Apr/16 ] |
|
Andreas,

The mdtest options were probably this:

Yes, the directory was striped across all 10 MDTs in the file system, with the --default_stripe option set to match.

The MDTs and OSTs were all started at 08:19:48, and the client mounted at 08:20:40. The job wasn't run until about 40 minutes later, which is when the clients hit the LBUG at 09:01:40. I would think that object precreation was long since finished.

I realize now that I do have the console logs for the servers, which may have clues, so I will go through what is there between 8:19 and 9:01 and report back.

Yes, based on the output of dmesg in crash, the lov_ea.c:195:lsm_unpackmd_common() error message appears first, and the LBUG follows.

As you say, I saw that the two OSTs were not properly connected, but I expected that no objects would be allocated on them and they would just sit idle. |
| Comment by Andreas Dilger [ 15/Apr/16 ] |
|
Olaf, any chance of posting the server console logs, so we can see why the initial MDS-OSS connections failed? |
| Comment by Olaf Faaland [ 15/Apr/16 ] |
|
Attached the console logs as con-logs.tgz |
| Comment by Joseph Gmitter (Inactive) [ 18/Apr/16 ] |
|
Hi Bobijam, |
| Comment by Zhenyu Xu [ 20/Apr/16 ] |
|
From the log, the system was not stable and several failovers happened between 8:19 and 9:01:

[~/tmp/logs/lu8012/con-logs/] $ grep Failing *
199.log:16:2016-04-12 08:34:22 Lustre: Failing over lustre-OST0000
199.log:26:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0000
200.log:19:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0001
201.log:19:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0002
202.log:23:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0003
203.log:19:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0004
204.log:16:2016-04-12 08:34:22 Lustre: Failing over lustre-OST0005
204.log:26:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0005
205.log:23:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0006
206.log:23:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0007
207.log:22:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0008
208.log:23:2016-04-12 09:05:47 Lustre: Failing over lustre-OST0009

Can't tell what caused the failovers though. |
| Comment by Olaf Faaland [ 20/Apr/16 ] |
|
Zhenyu,

Those "Failing over" messages are a result of us umount-ing the server targets and then mounting them again. The first time, at 8:34, was (as I recall) to see if the OSTs would make the missing connections; in retrospect I should have looked at the config logs. I don't recall the reason for the second time.

-Olaf |
| Comment by Gerrit Updater [ 19/Jul/16 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/21411 |
| Comment by Gerrit Updater [ 26/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21411/ |
| Comment by Peter Jones [ 26/Sep/16 ] |
|
Landed for 2.9 |