[LU-8857] sptlrpc_target_local_read_conf() missing llog context Created: 22/Nov/16 Updated: 14/Mar/17 Resolved: 14/Mar/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
lustre-2.8.0_5.chaos |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Booting the MDTs at the same time, MDT0001 (on zinc2) failed to exit recovery. The console log reports 2016-11-15 16:29:26 [ 2960.562708] LustreError: 144169:0 (sec_config.c:1103:sptlrpc_target_local_read_conf()) missing llog context about 20 seconds after Lustre reported Build Version in log. No intervening errors. Dumped the debug logs from that MDT. The thread that reported the error recorded the following in the debug log: 1 2016-11-15 16:29:26.845220 00000020:01000000:24.0::0:144169:0:(obd_config.c:1487:class_config_llog_handler()) Marker, inst_flg=0x8 mark_flg=0x1 2 2016-11-15 16:29:26.845227 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf010 3 2016-11-15 16:29:26.845228 00000020:00000080:24.0::0:144169:0:(obd_config.c:1218:class_process_config()) marker 8 (0x1) lsh-MDT0001-mdtl lov setup 4 2016-11-15 16:29:26.845230 00000020:01000000:24.0::0:144169:0:(obd_config.c:1562:class_config_llog_handler()) For 2.x interoperability, rename obd type from lov to lod (lsh-MDT0001) 5 2016-11-15 16:29:26.845231 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf001 6 2016-11-15 16:29:26.845233 00000020:00000080:24.0::0:144169:0:(obd_config.c:362:class_attach()) attach type lod name: lsh-MDT0001-mdtlov uuid: lsh-MDT0001-mdtlov_UUID 7 2016-11-15 16:29:26.857650 00000020:00000080:24.0::0:144169:0:(genops.c:371:class_newdev()) Adding new device lsh-MDT0001-mdtlov (ffff887f01a68000) 8 2016-11-15 16:29:26.857653 00000020:00000080:24.0::0:144169:0:(obd_config.c:432:class_attach()) OBD: dev 3 attached type lod with refcount 1 9 2016-11-15 16:29:26.857656 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf003 10 2016-11-15 16:29:26.857701 00080000:01000000:24.0::0:144169:0:(osd_handler.c:1248:osd_obd_connect()) connect #1 11 2016-11-15 16:29:26.857703 00000020:00000080:24.0::0:144169:0:(genops.c:1167:class_connect()) connect: client lsh-MDT0001-osd_UUID, cookie 0xd50fdc70539570e9 12 2016-11-15 16:29:26.857727 00000020:00000080:24.0::0:144169:0:(obd_config.c:552:class_setup()) finished setup of obd lsh-MDT0001-mdtlov (uuid lsh-MDT0001-mdtlov_UUID) 13 2016-11-15 16:29:26.857730 00000020:01000000:24.0::0:144169:0:(obd_config.c:1487:class_config_llog_handler()) Marker, inst_flg=0x2 mark_flg=0x2 14 2016-11-15 16:29:26.857731 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf010 15 2016-11-15 16:29:26.857731 00000020:00000080:24.0::0:144169:0:(obd_config.c:1218:class_process_config()) marker 8 (0x2) lsh-MDT0001-mdtl lov setup 16 2016-11-15 16:29:26.857732 00000020:01000000:24.0::0:144169:0:(obd_config.c:1487:class_config_llog_handler()) Marker, inst_flg=0x0 mark_flg=0x1 17 2016-11-15 16:29:26.857733 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf010 18 2016-11-15 16:29:26.857734 00000020:00000080:24.0::0:144169:0:(obd_config.c:1218:class_process_config()) marker 9 (0x1) lsh-MDT0001 add mdt 19 2016-11-15 16:29:26.857735 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf001 20 2016-11-15 16:29:26.857735 00000020:00000080:24.0::0:144169:0:(obd_config.c:362:class_attach()) attach type mdt name: lsh-MDT0001 uuid: lsh-MDT0001_UUID 21 2016-11-15 16:29:26.857754 00000020:00000080:24.0::0:144169:0:(genops.c:371:class_newdev()) Adding new device lsh-MDT0001 (ffff887f01a68f28) 22 2016-11-15 16:29:26.857755 00000020:00000080:24.0::0:144169:0:(obd_config.c:432:class_attach()) OBD: dev 4 attached type mdt with refcount 1 23 2016-11-15 16:29:26.857756 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf007 24 2016-11-15 16:29:26.857757 00000020:00000080:24.0::0:144169:0:(obd_config.c:1176:class_process_config()) mountopt: profile lsh-MDT0001 osc lsh-MDT0001-mdtlov mdc (null) 25 2016-11-15 16:29:26.857758 00000020:01000000:24.0::0:144169:0:(obd_config.c:873:class_add_profile()) Add profile lsh-MDT0001 26 2016-11-15 16:29:26.857760 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf003 27 2016-11-15 16:29:26.857772 00000020:01000004:24.0::0:144169:0:(obd_mount_server.c:175:server_get_mount()) get mount ffff883f70083800 from lsh-MDT0001, refs=2 28 2016-11-15 16:29:26.857776 00000020:00000080:24.0::0:144169:0:(obd_config.c:362:class_attach()) attach type mdd name: lsh-MDD0001 uuid: lsh-MDD0001_UUID 29 2016-11-15 16:29:26.868373 00000020:00000080:24.0::0:144169:0:(genops.c:371:class_newdev()) Adding new device lsh-MDD0001 (ffff887f01a69e50) 30 2016-11-15 16:29:26.868376 00000020:00000080:24.0::0:144169:0:(obd_config.c:432:class_attach()) OBD: dev 5 attached type mdd with refcount 1 31 2016-11-15 16:29:26.868397 00000004:01000000:24.0::0:144169:0:(lod_dev.c:1687:lod_obd_connect()) connect #0 32 2016-11-15 16:29:26.868399 00000020:00000080:24.0::0:144169:0:(genops.c:1167:class_connect()) connect: client lsh-MDT0001-mdtlov_UUID, cookie 0xd50fdc70539570fe 33 2016-11-15 16:29:26.868412 00000020:00000080:24.0::0:144169:0:(obd_config.c:552:class_setup()) finished setup of obd lsh-MDD0001 (uuid lsh-MDD0001_UUID) 34 2016-11-15 16:29:26.868420 00000004:01000000:24.0::0:144169:0:(mdd_device.c:1178:mdd_obd_connect()) connect #0 35 2016-11-15 16:29:26.868422 00000020:00000080:24.0::0:144169:0:(genops.c:1167:class_connect()) connect: client lsh-MDD0001_UUID, cookie 0xd50fdc7053957105 36 2016-11-15 16:29:26.868423 00080000:01000000:24.0::0:144169:0:(osd_handler.c:1248:osd_obd_connect()) connect #2 37 2016-11-15 16:29:26.868425 00000020:00000080:24.0::0:144169:0:(genops.c:1167:class_connect()) connect: client lsh-MDT0001-osd_UUID, cookie 0xd50fdc705395710c 38 2016-11-15 16:29:26.920911 02000000:00020000:24.0::0:144169:0:(sec_config.c:1103:sptlrpc_target_local_read_conf()) missing llog context 39 2016-11-15 16:29:26.931932 00000020:00000080:24.0::0:144169:0:(obd_config.c:552:class_setup()) finished setup of obd lsh-MDT0001 (uuid lsh-MDT0001_UUID) |
| Comments |
| Comment by Olaf Faaland [ 22/Nov/16 ] |
|
Attached complete debug logs: dk.zinc2.1479316998.gz |
| Comment by Olaf Faaland [ 22/Nov/16 ] |
|
We saw the same error message earlier today on a newly formatted DNE file system (lquake), before the file system had been mounted by clients. |
| Comment by Peter Jones [ 22/Nov/16 ] |
|
Lai Could you please assist with this one? Peter |
| Comment by Olaf Faaland [ 22/Nov/16 ] |
|
Just realized that 11 of the 16 MDTs reported this error. MDT0001 was just one of them. |
| Comment by Olaf Faaland [ 28/Nov/16 ] |
|
As indicated on 22 Nov, the "missing llog context" error message appeared on many nodes within the cluster. Most of those nodes successfully exited recovery, but MDT0001 did not. So we have two symptoms:
Are they related or independent? |
| Comment by Olaf Faaland [ 28/Nov/16 ] |
|
Increased priority and marked topllnl because this file system is down until this is resolved. It is one of only 2 DNE test clusters available to us. |
| Comment by Di Wang [ 28/Nov/16 ] |
|
Olaf: You can ignore "missing llog context", that is noise, which should be un-related with the endless recovery. According to the debug log, it looks like MDT0001 is keep waiting for all of clients to connect. Olaf, all clients and other MDTs are up at that moment? and also it seems recovery debug mask (D_HA 0x00080000) is not enabled? 0010000:02000400:0.0:1479257111.133403:0:96245:0:(ldlm_lib.c:779:target_handle_reconnect()) lsh-MDT0001: Client lsh-MDT0008-mdtlov_UUID (at 172.19.3.9@o2ib600) reconnecting, waiting for 1187 clients in recovery for 0:00 00000100:02000000:0.0:1479257111.133417:0:96245:0:(import.c:1539:ptlrpc_import_recovery_state_machine()) lsh-MDT0001: Connection restored to (at 172.19.3.9@o2ib600) 00010000:02000400:6.0:1479257111.467229:0:143823:0:(ldlm_lib.c:779:target_handle_reconnect()) lsh-MDT0001: Client lsh-MDT0005-mdtlov_UUID (at 172.19.3.6@o2ib600) reconnecting, waiting for 1187 clients in recovery for 0:00 Olaf: could you please also post the stack trace for "tgt_recover_1"? thanks. |
| Comment by Olaf Faaland [ 28/Nov/16 ] |
|
Di, |
| Comment by Olaf Faaland [ 29/Nov/16 ] |
|
Removed topllnl label because the log message does not reflect an actual error. |
| Comment by Lai Siyao [ 30/Nov/16 ] |
|
“missing log context" looks to be a bug introduced in server code refactor long ago, and after I tweaked the code a bit it's reproduced, but I still need to understand the refactor more to fix it. |
| Comment by Olaf Faaland [ 30/Nov/16 ] |
|
Lai, |
| Comment by Gerrit Updater [ 04/Dec/16 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/24119 |
| Comment by Olaf Faaland [ 15/Dec/16 ] |
|
Hi all, |
| Comment by Olaf Faaland [ 15/Dec/16 ] |
|
Updated priority which had been set to Critical when I thought the error message was related to recovery problems, and not updated after I learned otherwise. |
| Comment by Gerrit Updater [ 14/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24119/ |
| Comment by Peter Jones [ 14/Mar/17 ] |
|
Landed for 2.10 |