[LU-8857] sptlrpc_target_local_read_conf() missing llog context Created: 22/Nov/16  Updated: 14/Mar/17  Resolved: 14/Mar/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-2.8.0_5.chaos
kernel-3.10.0-510.0.0.2chaos.ch6.x86_64
zfs-0.7.0-0.5llnl.ch6.x86_64
DNE file system with 16 MDTs


Attachments: File dk.zinc2.1479316998.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Booting the MDTs at the same time, MDT0001 (on zinc2) failed to exit recovery. The console log reports
Console log shows:

2016-11-15 16:29:26 [ 2960.562708] LustreError: 144169:0 (sec_config.c:1103:sptlrpc_target_local_read_conf()) missing llog context

about 20 seconds after Lustre reported Build Version in log. No intervening errors.

Dumped the debug logs from that MDT. The thread that reported the error recorded the following in the debug log:

   1 2016-11-15 16:29:26.845220 00000020:01000000:24.0::0:144169:0:(obd_config.c:1487:class_config_llog_handler()) Marker, inst_flg=0x8 mark_flg=0x1
   2 2016-11-15 16:29:26.845227 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf010
   3 2016-11-15 16:29:26.845228 00000020:00000080:24.0::0:144169:0:(obd_config.c:1218:class_process_config()) marker 8 (0x1) lsh-MDT0001-mdtl lov setup
   4 2016-11-15 16:29:26.845230 00000020:01000000:24.0::0:144169:0:(obd_config.c:1562:class_config_llog_handler()) For 2.x interoperability, rename obd type from lov to lod (lsh-MDT0001)
   5 2016-11-15 16:29:26.845231 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf001
   6 2016-11-15 16:29:26.845233 00000020:00000080:24.0::0:144169:0:(obd_config.c:362:class_attach()) attach type lod name: lsh-MDT0001-mdtlov uuid: lsh-MDT0001-mdtlov_UUID
   7 2016-11-15 16:29:26.857650 00000020:00000080:24.0::0:144169:0:(genops.c:371:class_newdev()) Adding new device lsh-MDT0001-mdtlov (ffff887f01a68000)
   8 2016-11-15 16:29:26.857653 00000020:00000080:24.0::0:144169:0:(obd_config.c:432:class_attach()) OBD: dev 3 attached type lod with refcount 1
   9 2016-11-15 16:29:26.857656 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf003
  10 2016-11-15 16:29:26.857701 00080000:01000000:24.0::0:144169:0:(osd_handler.c:1248:osd_obd_connect()) connect #1
  11 2016-11-15 16:29:26.857703 00000020:00000080:24.0::0:144169:0:(genops.c:1167:class_connect()) connect: client lsh-MDT0001-osd_UUID, cookie 0xd50fdc70539570e9
  12 2016-11-15 16:29:26.857727 00000020:00000080:24.0::0:144169:0:(obd_config.c:552:class_setup()) finished setup of obd lsh-MDT0001-mdtlov (uuid lsh-MDT0001-mdtlov_UUID)
  13 2016-11-15 16:29:26.857730 00000020:01000000:24.0::0:144169:0:(obd_config.c:1487:class_config_llog_handler()) Marker, inst_flg=0x2 mark_flg=0x2
  14 2016-11-15 16:29:26.857731 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf010
  15 2016-11-15 16:29:26.857731 00000020:00000080:24.0::0:144169:0:(obd_config.c:1218:class_process_config()) marker 8 (0x2) lsh-MDT0001-mdtl lov setup
  16 2016-11-15 16:29:26.857732 00000020:01000000:24.0::0:144169:0:(obd_config.c:1487:class_config_llog_handler()) Marker, inst_flg=0x0 mark_flg=0x1
  17 2016-11-15 16:29:26.857733 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf010
  18 2016-11-15 16:29:26.857734 00000020:00000080:24.0::0:144169:0:(obd_config.c:1218:class_process_config()) marker 9 (0x1) lsh-MDT0001 add mdt
  19 2016-11-15 16:29:26.857735 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf001
  20 2016-11-15 16:29:26.857735 00000020:00000080:24.0::0:144169:0:(obd_config.c:362:class_attach()) attach type mdt name: lsh-MDT0001 uuid: lsh-MDT0001_UUID
  21 2016-11-15 16:29:26.857754 00000020:00000080:24.0::0:144169:0:(genops.c:371:class_newdev()) Adding new device lsh-MDT0001 (ffff887f01a68f28)
  22 2016-11-15 16:29:26.857755 00000020:00000080:24.0::0:144169:0:(obd_config.c:432:class_attach()) OBD: dev 4 attached type mdt with refcount 1
  23 2016-11-15 16:29:26.857756 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf007
  24 2016-11-15 16:29:26.857757 00000020:00000080:24.0::0:144169:0:(obd_config.c:1176:class_process_config()) mountopt: profile lsh-MDT0001 osc lsh-MDT0001-mdtlov mdc (null)
  25 2016-11-15 16:29:26.857758 00000020:01000000:24.0::0:144169:0:(obd_config.c:873:class_add_profile()) Add profile lsh-MDT0001
  26 2016-11-15 16:29:26.857760 00000020:00000080:24.0::0:144169:0:(obd_config.c:1148:class_process_config()) processing cmd: cf003
  27 2016-11-15 16:29:26.857772 00000020:01000004:24.0::0:144169:0:(obd_mount_server.c:175:server_get_mount()) get mount ffff883f70083800 from lsh-MDT0001, refs=2
  28 2016-11-15 16:29:26.857776 00000020:00000080:24.0::0:144169:0:(obd_config.c:362:class_attach()) attach type mdd name: lsh-MDD0001 uuid: lsh-MDD0001_UUID
  29 2016-11-15 16:29:26.868373 00000020:00000080:24.0::0:144169:0:(genops.c:371:class_newdev()) Adding new device lsh-MDD0001 (ffff887f01a69e50)
  30 2016-11-15 16:29:26.868376 00000020:00000080:24.0::0:144169:0:(obd_config.c:432:class_attach()) OBD: dev 5 attached type mdd with refcount 1
  31 2016-11-15 16:29:26.868397 00000004:01000000:24.0::0:144169:0:(lod_dev.c:1687:lod_obd_connect()) connect #0
  32 2016-11-15 16:29:26.868399 00000020:00000080:24.0::0:144169:0:(genops.c:1167:class_connect()) connect: client lsh-MDT0001-mdtlov_UUID, cookie 0xd50fdc70539570fe
  33 2016-11-15 16:29:26.868412 00000020:00000080:24.0::0:144169:0:(obd_config.c:552:class_setup()) finished setup of obd lsh-MDD0001 (uuid lsh-MDD0001_UUID)
  34 2016-11-15 16:29:26.868420 00000004:01000000:24.0::0:144169:0:(mdd_device.c:1178:mdd_obd_connect()) connect #0
  35 2016-11-15 16:29:26.868422 00000020:00000080:24.0::0:144169:0:(genops.c:1167:class_connect()) connect: client lsh-MDD0001_UUID, cookie 0xd50fdc7053957105
  36 2016-11-15 16:29:26.868423 00080000:01000000:24.0::0:144169:0:(osd_handler.c:1248:osd_obd_connect()) connect #2
  37 2016-11-15 16:29:26.868425 00000020:00000080:24.0::0:144169:0:(genops.c:1167:class_connect()) connect: client lsh-MDT0001-osd_UUID, cookie 0xd50fdc705395710c
  38 2016-11-15 16:29:26.920911 02000000:00020000:24.0::0:144169:0:(sec_config.c:1103:sptlrpc_target_local_read_conf()) missing llog context
  39 2016-11-15 16:29:26.931932 00000020:00000080:24.0::0:144169:0:(obd_config.c:552:class_setup()) finished setup of obd lsh-MDT0001 (uuid lsh-MDT0001_UUID)


 Comments   
Comment by Olaf Faaland [ 22/Nov/16 ]

Attached complete debug logs: dk.zinc2.1479316998.gz

Comment by Olaf Faaland [ 22/Nov/16 ]

We saw the same error message earlier today on a newly formatted DNE file system (lquake), before the file system had been mounted by clients.
We have debug logs and can attach them if that would be helpful.

Comment by Peter Jones [ 22/Nov/16 ]

Lai

Could you please assist with this one?

Peter

Comment by Olaf Faaland [ 22/Nov/16 ]

Just realized that 11 of the 16 MDTs reported this error. MDT0001 was just one of them.
zinc[3-6,8,10-11,13-16]

Comment by Olaf Faaland [ 28/Nov/16 ]

As indicated on 22 Nov, the "missing llog context" error message appeared on many nodes within the cluster. Most of those nodes successfully exited recovery, but MDT0001 did not.

So we have two symptoms:

  • MDT fails to exit recovery
  • MDT reports "missing llog context" error message

Are they related or independent?

Comment by Olaf Faaland [ 28/Nov/16 ]

Increased priority and marked topllnl because this file system is down until this is resolved. It is one of only 2 DNE test clusters available to us.

Comment by Di Wang [ 28/Nov/16 ]

Olaf: You can ignore "missing llog context", that is noise, which should be un-related with the endless recovery.

According to the debug log, it looks like MDT0001 is keep waiting for all of clients to connect. Olaf, all clients and other MDTs are up at that moment? and also it seems recovery debug mask (D_HA 0x00080000) is not enabled?

0010000:02000400:0.0:1479257111.133403:0:96245:0:(ldlm_lib.c:779:target_handle_reconnect()) lsh-MDT0001: Client lsh-MDT0008-mdtlov_UUID (at 172.19.3.9@o2ib600) reconnecting, waiting for 1187 clients in recovery for 0:00
00000100:02000000:0.0:1479257111.133417:0:96245:0:(import.c:1539:ptlrpc_import_recovery_state_machine()) lsh-MDT0001: Connection restored to  (at 172.19.3.9@o2ib600)
00010000:02000400:6.0:1479257111.467229:0:143823:0:(ldlm_lib.c:779:target_handle_reconnect()) lsh-MDT0001: Client lsh-MDT0005-mdtlov_UUID (at 172.19.3.6@o2ib600) reconnecting, waiting for 1187 clients in recovery for 0:00

Olaf: could you please also post the stack trace for "tgt_recover_1"? thanks.

Comment by Olaf Faaland [ 28/Nov/16 ]

Di,
Thanks. The servers have been restarted while I looked into the message, so I lost the state you are looking for. I'll use this ticket for a patch to change the debug mask on the "missing llog context" message.

Comment by Olaf Faaland [ 29/Nov/16 ]

Removed topllnl label because the log message does not reflect an actual error.

Comment by Lai Siyao [ 30/Nov/16 ]

“missing log context" looks to be a bug introduced in server code refactor long ago, and after I tweaked the code a bit it's reproduced, but I still need to understand the refactor more to fix it.

Comment by Olaf Faaland [ 30/Nov/16 ]

Lai,
OK, thanks for the update.

Comment by Gerrit Updater [ 04/Dec/16 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/24119
Subject: LU-8857 llog: init llog context for target
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2cd184c00ed99b250ce4ee636b78716738f29bdd

Comment by Olaf Faaland [ 15/Dec/16 ]

Hi all,
Can we get some review of Lai's patch?
thanks,
Olaf

Comment by Olaf Faaland [ 15/Dec/16 ]

Updated priority which had been set to Critical when I thought the error message was related to recovery problems, and not updated after I learned otherwise.

Comment by Gerrit Updater [ 14/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24119/
Subject: LU-8857 config: refactor sptlrpc config process
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: df00ee15534419070c559acdb913898fc9dc3a00

Comment by Peter Jones [ 14/Mar/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:21:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.