[LU-4851] Lustre kernel panic when using Intel Vtune Amplier XE 2013 Created: 02/Apr/14  Updated: 16/May/14  Resolved: 16/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Rajeshwaran Ganesan Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File AWE_sprig3_dmesg.txt     Text File sprig_vtune_messages.txt    
Severity: 2
Rank (Obsolete): 13382

 Description   

How to trigger the kernel panic:

1. Start up amplxe-gui and create a new project, e.g. "test".

2. Then set the executable to use to, say, /bin/ls.

3. Now create a new analysis, select any of the ones available for your processor architecture, and run it.

4. Once the simulation has completed, exit amplxe-gui.

5. Load amplxe-gui again and re-run the simulation in step 3.

6. Exit amplxe-gui once the simulation completes.

7. Repeat steps 5 and 6 until the node kernel panics (normally takes three or four attempts).

It should be noted that users' home directories are stored on Lustre, and the environment variable $TMPDIR is set to a directory within users' home directories on Lustre. The VTune "project files" are therefore stored on Lustre. I suspect that the reading or writing of these files by VTune could be the cause of the kernel panic.
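
For reference, a quick shell check along these lines (the commands and output are illustrative only, not taken from the site) can confirm that both the home directory and $TMPDIR resolve to Lustre:

# verify that $HOME and $TMPDIR live on Lustre (illustrative commands)
echo "TMPDIR=$TMPDIR"
df -hT "$HOME" "$TMPDIR"        # the filesystem type column should show "lustre"
lfs getstripe -d "$HOME"        # shows the home directory's default striping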



 Comments   
Comment by Rajeshwaran Ganesan [ 02/Apr/14 ]

This issue occurs only on the client: the kernel panic happens on the client node where VTune was being executed.

The MDS and OSS nodes do not suffer any issues.

Here are the Lustre versions.

The clients are running SLES 11 SP3:

lustre-client-2.4.3-3.0.93_0.8_default_gfc544a1
lustre-client-modules-2.4.3-3.0.93_0.8_default_gfc544a1
lustre-client-tests-2.4.3-3.0.93_0.8_default_gfc544a1
lustre-iokit-1.4.0-1

Kernel: 3.0.93-0.8-default

The MDS and OSS nodes are running CentOS:

kernel-2.6.32-358.18.1.el6_lustre.es50.x86_64
kernel-devel-2.6.32-358.18.1.el6_lustre.es50.x86_64
kernel-firmware-2.6.32-358.18.1.el6_lustre.es50.x86_64
kernel-headers-2.6.32-358.18.1.el6_lustre.es50.x86_64
kernel-ib-1.5.3-2.6.32_358.18.1.el6_lustre.es50.x86_64_2.6.32_358.18.1.el6_lustre.es50.x86_64
kernel-ib-devel-1.5.3-2.6.32_358.18.1.el6_lustre.es50.x86_64_2.6.32_358.18.1.el6_lustre.es50.x86_64
lustre-2.4.1-ddn1.0_2.6.32_358.18.1.el6_lustre.es50.x86_64_ES.x86_64
lustre-ldiskfs-4.1.0-2.6.32_358.18.1.el6_lustre.es50.x86_64.x86_64
lustre-modules-2.4.1-ddn1.0_2.6.32_358.18.1.el6_lustre.es50.x86_64_ES.x86_64
lustre-osd-ldiskfs-2.4.1-ddn1.0_2.6.32_358.18.1.el6_lustre.es50.x86_64_ES.x86_64
lustre-source-2.4.1-ddn1.0_2.6.32_358.18.1.el6_lustre.es50.x86_64_ES.x86_64

Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ]

Hello Rajeshwaran,
Is there a crash-dump available for one of the occurrences, and if so, can you provide it?
Also, was there any Lustre debug (like rpctrace, dlmtrace, ...) enabled at the time of the crashes?

Comment by Oleg Drokin [ 02/Apr/14 ]

This actually might be a dup of LU-4403

Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ]

Oleg, even though LU-4403 occurs on the server/MDS side?

Comment by Oleg Drokin [ 02/Apr/14 ]

Actually, with a client crash, it might be something else. At the very least, we definitely need a full backtrace from the crash here.

Comment by Bruno Faccini (Inactive) [ 03/Apr/14 ]

Hello Rajeshwaran,
As we agreed during the conf-call, I am trying to reproduce the issue in-house. I am currently unable to reproduce the LBUG ("(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed:") you encountered.
BTW, you seem to use VTune's GUI, but did you try running with the command line only ("amplxe-cl -[collect,report] ..."), just to see if we can simplify the reproducer?

Comment by John Fuchs-Chesney (Inactive) [ 09/Apr/14 ]

Hello Rajeshwaran,
Could you please have a shot at running VTune using just the command-line interface, as requested above?
If you are able to provide a reproducer from a command-line run, it will improve our chances of reproducing this issue on our in-house platform.

Many thanks,
~ jfc.

Comment by Bruno Faccini (Inactive) [ 09/Apr/14 ]

Yes, it would be helpful, and easier to reproduce in-house, if you could confirm that you can also reproduce by running VTune via its command-line interface. So, if I try to mimic your GUI actions with the command line, it should be something like:

for i in `seq 1 100` ; do
    /opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -collect hotspots -result-dir=/mnt/lustre/vtune/intel/amplxe/projects/ls_lustre2/r${i}hs -app-working-dir=/mnt/lustre/vtune/ ls -laR /mnt/lustre
    /opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -report hotspots -r /mnt/lustre/vtune/intel/amplxe/projects/ls_lustre2/r${i}hs
done

On the other hand, I am still unable to reproduce even using the GUI interface as you reported. So I am beginning to think that this is more configuration dependent (number of OSTs, default/used striping, ...) and not purely a consequence of VTune's behavior.

BTW, I forgot to ask which mount options are used on your client? In particular, do you mount Lustre with flock/localflock/noflock?

Comment by Rajeshwaran Ganesan [ 09/Apr/14 ]

Hello,

Lustre is mounted on our login nodes with the following options: “rw”, “_netdev” and “flock”

We mount it under /scratch

The mount definitions for Lustre are stored in /etc/fstab and we manually mount the file system once the node has been booted, e.g.:

mount /scratch

No output from the mount command is produced. In dmesg we see:

Lustre: Lustre: Build Version: jenkins-arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel-22597-gfc544a1-PRISTINE-../lustre/scripts

LNet: Added LNI XXX.XXX.XXX.XXX@o2ib [YYY/YYY/YYY/YYY]

Lustre: Layout lock feature supported.

Lustre: Mounted scratch-client

Note: IP info redacted from the output above.

The default stripe count is 12 (all OSTs).
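
For illustration, the corresponding /etc/fstab entry would look roughly like the following (the MGS NID is a placeholder, and noauto is only implied by the manual mount described above):

# illustrative /etc/fstab entry; the MGS NID is a placeholder
XXX.XXX.XXX.XXX@o2ib:/scratch   /scratch   lustre   rw,_netdev,flock,noauto   0 0
# then, once the node is booted:
mount /scratch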

Comment by Rajeshwaran Ganesan [ 09/Apr/14 ]

Please find attached a dmesg trace that is produced when the LBUG kernel panic occurs.

I've not been able to recreate the issue with the command-line utility.

However, I have found a quicker way to trigger the bug by merely opening and closing the VTune GUI (i.e. no need to create any projects or analyses).

E.g.

cd ~
rm -rf .intel intel tmp/*
amplxe-gui

  1. Now close the GUI (if it loads) and rerun the "amplxe-gui" command.
  2. Repeat until the crash occurs (normally by the 4th attempt).

The crash is triggered while the splash screen for VTune is visible but before the actual main VTune window is displayed. Therefore, I assume the bug is being triggered by some of the VTune GUI's start-up code.
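
A rough way to automate this reproducer, assuming the crash can be hit by repeatedly launching the GUI and killing it shortly after the splash screen appears (a sketch only, not what was actually run on site):

# crude reproducer loop: start the GUI, give the splash screen / start-up
# code a few seconds, then kill it and start over; the node normally
# panics within a handful of iterations
cd ~
rm -rf .intel intel tmp/*
for i in `seq 1 10` ; do
    amplxe-gui &
    GUI_PID=$!
    sleep 10
    kill $GUI_PID 2>/dev/null
    wait $GUI_PID 2>/dev/null
done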

As suspected, I cannot recreate the issue if I move my homespace to a non-Lustre file system.

Hope this helps,

Comment by Rajeshwaran Ganesan [ 09/Apr/14 ]

Hello Bruno,

Please let me know, if you need any other logs or any other commands to try.

I can send them and get the results.

Thanks,
Rajesh

Comment by Bruno Faccini (Inactive) [ 10/Apr/14 ]

Hello Rajeshwaran,
Thanks for all this additional info!
BTW, with the new dmesg you provided, and the panic/LBUG stack now available, it is clear that the problem occurs during FLock operations.
This leads me to suspect that you may be triggering the problem/race I already worked on as part of LU-3684, where I pushed a patch to fix it, but which does not seem to be in b2_4 ...
I will also try again to reproduce using your new, simplified instructions.

Comment by Rajeshwaran Ganesan [ 10/Apr/14 ]

Hello Bruno,

The customer tried running the CLI (as shown above), but even after 100 iterations the login node refused to crash.

Yet if they switch back to the GUI, they can get the login node to crash after a few attempted launches as described previously.
Thanks,
Rajesh

Comment by Rajeshwaran Ganesan [ 10/Apr/14 ]

Hello Bruno,

Some good news: with localflock, the crash does not occur.

If they mount using flock, it crashes as usual.

Is there any harm in mounting our Lustre file system with “localflock” on our clients rather than “flock”?

What is your suggestion on localflock vs. flock?

Thanks,
Rajesh

Comment by Bruno Faccini (Inactive) [ 10/Apr/14 ]

After I found that the issue is FLock related, I also had in mind to ask you to try the localflock and/or noflock mount options, but decided not to do so right away, since I assumed that your customer is likely to use applications that require cluster-wide/multi-node FLock support (guaranteed with the flock option), and not only local/single-node FLock support (the localflock scope). But since you asked/tried, maybe you can check with your customer whether localflock can fit their production applications?

Comment by Rajeshwaran Ganesan [ 16/Apr/14 ]

Are we getting any fix for the flock option?

Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ]

First of all, I would like to add an update about my in-house reproduction efforts: unfortunately I am still unable to reproduce, even after using your latest instructions and configuration details.

Concerning a possible fix to cover the flock option usage, I made a b2_4 back-port of my previous patch for LU-1126/LU-3684; it is available at http://review.whamcloud.com/9968 and is running through our test suites.

Last, since I am unable to reproduce, did you pursue (and succeed with) the process of getting information out from the site, as we discussed during the conf-call with the customer? As we already insisted, the full debug log from the client side, taken during a reproducer run, would be more than helpful to understand the issue, and also to confirm that my patch is the fix.

Comment by Rajeshwaran Ganesan [ 16/Apr/14 ]

Hello Bruno,

Could you please provide the source RPM with the patch? I can ask them to rebuild and install it, and we can verify it on the client.

Thanks,
Rajesh

Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ]

Builds with my patch are available under http://build.whamcloud.com/job/lustre-reviews/23081/. Can you access it? If so, you can select the target OS from the build matrix and then follow the "Build Artifacts" link, where you can find the corresponding source RPM.

But I suggest you wait for our test suite run to succeed before applying it on-site.
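
In case it helps, rebuilding and installing from the source RPM on the SLES client would roughly follow the usual rpmbuild workflow (the file names and output directory below are placeholders; on SLES the resulting RPMs may land under /usr/src/packages/RPMS rather than ~/rpmbuild/RPMS):

# rough sketch: rebuild the patched source RPM and install the result
rpmbuild --rebuild lustre-client-2.4.3-*.src.rpm
rpm -Uvh ~/rpmbuild/RPMS/x86_64/lustre-client-2.4.3-*.rpm \
         ~/rpmbuild/RPMS/x86_64/lustre-client-modules-2.4.3-*.rpm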

Comment by Rajeshwaran Ganesan [ 16/Apr/14 ]

Thanks for your help. Sure, I can wait; please let me know once it passes the tests.

Comment by Bruno Faccini (Inactive) [ 17/Apr/14 ]

#9968 has successfully passed almost all Maloo tests; the only failure was in lustre-rsync-test/test_8, a known and unrelated failure already tracked in LU-3573. So it is safe for on-site exposure!

Concerning how to get the Lustre debug log upon LBUG, here are my instructions, to be applied on the client node where the reproducer will run (a consolidated command sketch follows the list):

_ As already requested, ensure /proc/sys/lnet/debug and /proc/sys/lnet/debug_mb are respectively set to at least rpctrace+dlmtrace in addition to the default value (-1 would be best!) for the trace mask, and to a reasonable value (2048 or even 4096 MB) for the debug-buffer size.

_ Unset (set to 0) /proc/sys/lnet/panic_on_lbug.

_ Run the reproducer/VTune to get to the LBUG.

_ The Lustre debug log should be dumped automatically upon LBUG to a file named /tmp/lustre-log.<seconds-since-the-Epoch.milliseconds>; if not, you can force the dump with the "lctl dk <path/file>" command.

_ If for any reason panic_on_lbug cannot be unset, I can also provide you with the information necessary to extract the Lustre debug log from a crash-dump!

_ Once you have ensured the Lustre debug log has been collected/saved, you will need to reboot the client node to get Lustre functional again.
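
A consolidated sketch of the above, run on the client before reproducing (the values and the dump file name are just the ones suggested above; adjust as needed):

# enable a generous trace mask and debug buffer (-1 = everything, 2048 MB buffer)
echo -1 > /proc/sys/lnet/debug
echo 2048 > /proc/sys/lnet/debug_mb
# keep the node alive after the LBUG so the log can be collected
echo 0 > /proc/sys/lnet/panic_on_lbug
# ... run the VTune reproducer until the LBUG fires ...
# the debug log is normally dumped to /tmp/lustre-log.<timestamp>;
# if it is not, force a dump to a file of your choice:
lctl dk /tmp/lustre-log.manual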

Comment by Rajeshwaran Ganesan [ 23/Apr/14 ]

Hello Bruno,

What would be the best mount option for the VTune application, flock or localflock? What is the best practice for the VTune application? Could you please check with the VTune team?

Comment by Dmitry Eremin (Inactive) [ 23/Apr/14 ]

The VTune application uses flock to protect database changes across multiple users. If you don't share results between different computers, the "localflock" option will be enough. If you have concurrent access to the same result directory from different computers, you definitely need the "flock" option.
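
For illustration, the difference is just the client mount option (the NID and fsname below are placeholders, not the site's real values):

# cluster-wide flock semantics: needed when the same result directory is
# accessed concurrently from several nodes
mount -t lustre -o rw,flock XXX.XXX.XXX.XXX@o2ib:/scratch /scratch
# node-local flock semantics only: sufficient when results are not shared
mount -t lustre -o rw,localflock XXX.XXX.XXX.XXX@o2ib:/scratch /scratch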

Comment by Rajeshwaran Ganesan [ 24/Apr/14 ]

Hello Bruno - The customer is applying the patch this week; once I have an update, I will let you know.

Thanks,
Rajesh

Comment by Bruno Faccini (Inactive) [ 05/May/14 ]

Hello Rajesh,
Do you have any news/update from the site ?

Comment by Rajeshwaran Ganesan [ 06/May/14 ]

Hello Bruno,

The customer has applied the patch and is no longer seeing the issue; they are in the process of updating the remaining clients.

Thanks,
Rajesh

Comment by John Fuchs-Chesney (Inactive) [ 16/May/14 ]

Rajesh, how are we doing?
Can we mark this as resolved?
Thanks!
~ jfc.

Comment by Rajeshwaran Ganesan [ 16/May/14 ]

Hello John -

Please go ahead and close this LU.

Thanks for your help,
Rajesh

Comment by Peter Jones [ 16/May/14 ]

Thanks Rajesh
