[LU-4851] Lustre kernel panic when using Intel VTune Amplifier XE 2013 Created: 02/Apr/14 Updated: 16/May/14 Resolved: 16/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Rajeshwaran Ganesan | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 13382 |
| Description |
|
How to trigger the kernel panic:

1. Start amplxe-gui and create a new project, e.g. "test".
2. Set the executable to use to, say, /bin/ls.
3. Create a new analysis, select any of the ones available for your processor architecture, and run it.
4. Once the simulation has completed, exit amplxe-gui.
5. Load amplxe-gui again and re-run the simulation from step 3.
6. Exit amplxe-gui once the simulation completes.
7. Repeat steps 5 and 6 until the node kernel panics (normally takes three or four attempts).

It should be noted that users' home directories are stored on Lustre and that the environment variable $TMPDIR is set to a directory within users' home spaces on Lustre. The VTune "project files" are therefore stored on Lustre. I suspect that the reading or writing of these files by VTune could be the cause of the kernel panic. |
| Comments |
| Comment by Rajeshwaran Ganesan [ 02/Apr/14 ] |
|
The kernel panic only occurs on the client where VTune was being executed; the MDS and OSS nodes do not suffer any issues. Here are the Lustre versions in use.

The clients are running SLES 11 SP3:
  lustre-client-2.4.3-3.0.93_0.8_default_gfc544a1
  Kernel: 3.0.93-0.8-default

The MDS and OSS nodes are running CentOS:
  kernel-2.6.32-358.18.1.el6_lustre.es50.x86_64 |
| Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ] |
|
Hello Rajeshwaran, |
| Comment by Oleg Drokin [ 02/Apr/14 ] |
|
This actually might be a dup of |
| Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ] |
|
Oleg, even if |
| Comment by Oleg Drokin [ 02/Apr/14 ] |
|
actually, with a client crash, it might be something else. We definitely need a full backtrace from the crash here at the very least. |
| Comment by Bruno Faccini (Inactive) [ 03/Apr/14 ] |
|
Hello Rajeshwaran, |
| Comment by John Fuchs-Chesney (Inactive) [ 09/Apr/14 ] |
|
Hello Rajeshwaran, Many thanks, |
| Comment by Bruno Faccini (Inactive) [ 09/Apr/14 ] |
|
Yes, it would be helpful, and easier to reproduce in-house, if you could confirm that you can also reproduce by running VTune via its command-line interface. So, if I try to mimic your GUI's actions to reproduce with the command line, it should be something like:

for i in `seq 1 100` ; do
/opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -collect hotspots -result-dir=/mnt/lustre/vtune/intel/amplxe/projects/ls_lustre2/r${i}hs -app-working-dir=/mnt/lustre/vtune/ ls -laR /mnt/lustre
/opt/intel/vtune_amplifier_xe_2013/bin64/amplxe-cl -report hotspots -r /mnt/lustre/vtune/intel/amplxe/projects/ls_lustre2/r${i}hs
done
On the other hand, I am still trying, and so far unable, to reproduce even using the GUI as you reported. So I am beginning to think that this is more configuration dependent (number of OSTs, default/used striping, ...) and not purely a consequence of VTune's behavior. By the way, I forgot to ask which mount options are used on your client. In particular, do you mount Lustre with flock, localflock, or noflock?
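For reference, the options actually in effect on a client can be checked as follows (a minimal sketch; no particular mount point is assumed):

# show all Lustre mounts and their options as seen by the kernel
mount -t lustre
# or read /proc/mounts directly; flock/localflock (or neither) appears in the options field
grep lustre /proc/mounts |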
| Comment by Rajeshwaran Ganesan [ 09/Apr/14 ] |
|
Hello, Lustre is mounted on our login nodes with the following options: "rw", "_netdev" and "flock". We mount it under /scratch. The mount definitions for Lustre are stored in /etc/fstab and we manually mount the file system once the node has been booted, e.g.:

mount /scratch

No output from the mount command is produced. In dmesg we see:

Lustre: Lustre: Build Version: jenkins-arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel-22597-gfc544a1-PRISTINE-../lustre/scripts
LNet: Added LNI XXX.XXX.XXX.XXX@o2ib [YYY/YYY/YYY/YYY]
Lustre: Layout lock feature supported.
Lustre: Mounted scratch-client

Note: IP info redacted from the output above. The default stripe count is 12 (all OSTs).
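For illustration, a hypothetical /etc/fstab entry matching this description (the MGS NID below is a placeholder, not our real configuration) would look something like:

# <MGS NID>:/<fsname>   <mount point>   <fstype>   <options>           <dump> <pass>
mgs@o2ib:/scratch       /scratch        lustre     rw,_netdev,flock    0      0 |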
| Comment by Rajeshwaran Ganesan [ 09/Apr/14 ] |
|
Please find attached a dmesg trace that is produced when the LBUG kernel panic occurs. I've not been able to recreate the issue with the command-line utility. However, I have found a quicker way to trigger the bug by merely opening and closing the VTune GUI (i.e. no need to create any projects or analyses), e.g.:

cd ~

The crash is triggered whilst the splash screen for VTune is visible but before the actual main VTune window is displayed. Therefore, I assume the bug is being triggered by some of the VTune GUI's start-up code. As suspected, I cannot recreate the issue if I move my homespace to a non-Lustre file system. Hope this helps,
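For completeness, a hypothetical scripted version of this GUI reproducer (assuming amplxe-gui is on $PATH and that killing the process shortly after launch is an acceptable stand-in for closing it by hand) would be something like:

cd ~
for i in `seq 1 10` ; do
    amplxe-gui &                 # the crash reportedly occurs while the splash screen is shown
    GUI_PID=$!
    sleep 15                     # give the start-up code time to touch the project files on Lustre
    kill $GUI_PID 2>/dev/null
    wait $GUI_PID 2>/dev/null
done |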
| Comment by Rajeshwaran Ganesan [ 09/Apr/14 ] |
|
Hello Bruno, please let me know if you need any other logs or would like me to try any other commands; I can run them and send the results. Thanks, |
| Comment by Bruno Faccini (Inactive) [ 10/Apr/14 ] |
|
Hello Rajeshwaran, |
| Comment by Rajeshwaran Ganesan [ 10/Apr/14 ] |
|
Hello Bruno, the customer tried running the CLI commands you provided, but after 100 iterations the login node refuses to crash. Yet if they switch back to the GUI, they can get the login node to crash after a few attempted launches, as described previously. |
| Comment by Rajeshwaran Ganesan [ 10/Apr/14 ] |
|
Hello Bruno, some good news: with localflock the crash does not occur, whereas if they mount using flock it crashes as usual. Is there any harm in mounting our Lustre file system with "localflock" on our clients rather than "flock"? What is your suggestion on localflock versus flock? Thanks, |
| Comment by Bruno Faccini (Inactive) [ 10/Apr/14 ] |
|
After I found that the issue is flock related, I also had in mind to ask you to try the localflock and/or noflock mount options, but decided not to do so right away since I assumed that your customer is likely to use applications that require cluster-wide/multi-node flock support (guaranteed with the flock option), and not only local/single-node flock support (the localflock scope). But since you have already asked and tried it, maybe you can check with your customer whether localflock can fit their production applications?
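For reference, the three modes on a client would be selected like this (a minimal sketch; the MGS NID and mount point are placeholders based on the /scratch setup described above):

# flock: cluster-wide coherent locks across all clients (extra locking traffic)
mount -t lustre -o rw,flock mgs@o2ib:/scratch /scratch
# localflock: lock calls succeed, but locks are only enforced within a single node
mount -t lustre -o rw,localflock mgs@o2ib:/scratch /scratch
# noflock: userspace file-locking calls on this client are refused
mount -t lustre -o rw,noflock mgs@o2ib:/scratch /scratch |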
| Comment by Rajeshwaran Ganesan [ 16/Apr/14 ] |
|
Are we getting any fix for the flock option? |
| Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ] |
|
First of all, I would like to add an update about my in-house reproduction efforts: unfortunately I am still unable to reproduce, even after using your latest instructions and configuration details. Concerning a possible fix to cover the flock option usage, I have made a b2_4 back-port of my previous patch for

Lastly, since I am unable to reproduce, did you pursue (and succeed with) the process of getting information out from the site, as we discussed during the conf-call with the customer? As we have already insisted, the full debug log from the client side, taken during a reproducer run, would be more than helpful to understand the issue and also to confirm that my patch is the fix. |
| Comment by Rajeshwaran Ganesan [ 16/Apr/14 ] |
|
Hello Bruno, could you please provide the source RPM with the patch? I can ask them to rebuild and install it, and we can verify it on the client. Thanks, |
| Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ] |
|
Builds with my patch are available under http://build.whamcloud.com/job/lustre-reviews/23081/. Can you access it? If so, you can select the target OS from the build matrix and then follow the "Build Artifacts" link, where you can find the corresponding source RPM. But I suggest you wait for our test suite run to pass before applying it on-site.
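Rebuilding from the source RPM would typically look something like the following (a sketch only; the exact file name depends on the build selected from the matrix, and the matching kernel-devel packages must already be installed):

# download the client source RPM from the "Build Artifacts" page, then:
rpmbuild --rebuild lustre-client-*.src.rpm
# the resulting binary RPMs land in the rpmbuild output tree
# (~/rpmbuild/RPMS/x86_64 on RHEL, /usr/src/packages/RPMS/x86_64 on SLES)
rpm -Uvh <path-to-RPMS>/lustre-client-*.x86_64.rpm |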
| Comment by Rajeshwaran Ganesan [ 16/Apr/14 ] |
|
Thanks for your help. Sure, I can wait; please let me know once it passes the tests. |
| Comment by Bruno Faccini (Inactive) [ 17/Apr/14 ] |
|
#9968 has successfully passed almost all Maloo tests; the only failure, in lustre-rsync-test/test_8, is a known and unrelated one already tracked in

Concerning how to get the Lustre debug log upon LBUG, here are my instructions, to be applied on the client node where the reproducer will run (see the command sketch after this list):

- As already requested, ensure /proc/sys/lnet/debug and /proc/sys/lnet/debug_mb are set, respectively, to at least rpctrace+dlmtrace in addition to the default value (-1 would be best!) for the trace mask, and to a reasonable value (2048 or even 4096 MB) for the debug-buffer size.
- Unset (set to 0) /proc/sys/lnet/panic_on_lbug.
- Run the reproducer/VTune to get to the LBUG.
- The Lustre debug log should be dumped automatically upon LBUG to a file named /tmp/lustre-log.<seconds-since-the-Epoch.milliseconds>, but if not you can force the dump with the "lctl dk <path/file>" command.
- If for any reason panic_on_lbug cannot be unset, I can also provide you with the information necessary to extract the Lustre debug log from a crash dump.
- Once you have ensured the Lustre debug log has been collected/saved, you will need to reboot the client node to get Lustre functional again.
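A command-level sketch of these steps (assuming a root shell on the client; the dump file name at the end is just an example):

# widen the trace mask (rpctrace+dlmtrace at least, -1 for everything) and grow the debug buffer
lctl set_param debug=-1
echo 4096 > /proc/sys/lnet/debug_mb

# keep the node alive on LBUG so that the log can be dumped
echo 0 > /proc/sys/lnet/panic_on_lbug

# ... run the VTune reproducer until the LBUG fires ...

# if no /tmp/lustre-log.* file appears automatically, force a dump
lctl dk /tmp/lustre-log.manual |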
| Comment by Rajeshwaran Ganesan [ 23/Apr/14 ] |
|
Hello Bruno, what would be the best mount option for the VTune application: flock or localflock? What is the best practice for the VTune application? Could you please check with the VTune team? |
| Comment by Dmitry Eremin (Inactive) [ 23/Apr/14 ] |
|
The VTune application uses flock to protect database changes across multiple users. If you don't share results between different computers, the "localflock" option will be enough. If you have concurrent access to the same result directory from different computers, you definitely need the "flock" option.
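As a small illustration of the difference (a hypothetical test using the flock(1) utility, with placeholder paths under the /scratch mount, not anything VTune does itself):

# on node A: take an exclusive lock on a file in the shared result directory and hold it
touch /scratch/vtune/results/.lock
flock -x /scratch/vtune/results/.lock -c 'sleep 60'

# on node B, while the above is still running:
#   with the "flock" mount option this blocks until node A releases the lock;
#   with "localflock" it succeeds immediately, because node A's lock is not visible here.
flock -x /scratch/vtune/results/.lock -c 'echo got the lock' |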
| Comment by Rajeshwaran Ganesan [ 24/Apr/14 ] |
|
Hello Bruno - the customer is applying the patch this week; once I have an update, I will let you know. Thanks, |
| Comment by Bruno Faccini (Inactive) [ 05/May/14 ] |
|
Hello Rajesh, |
| Comment by Rajeshwaran Ganesan [ 06/May/14 ] |
|
Hello Bruno, the customer has applied the patch and is no longer seeing the issue; they are in the process of updating the remaining clients. Thanks, |
| Comment by John Fuchs-Chesney (Inactive) [ 16/May/14 ] |
|
Rajesh, how are we doing? |
| Comment by Rajeshwaran Ganesan [ 16/May/14 ] |
|
Hello John - Please go ahead and close this LU. Thanks for your help, |
| Comment by Peter Jones [ 16/May/14 ] |
|
Thanks Rajesh |