There are many potential reasons why tests running in LAVA might fail, or produce unexpected behaviour. Some of them can be easy to track down, but others may be more difficult. The devices, software and test suites can vary massively from one test job to the next, but nonetheless a few common ideas may help you to work out what’s going wrong.
This may seem obvious, but it is all too easy to miss real problems in the test logs! For people not used to diagnosing failures, it is worth reading all the way from deployment, through test device boot, to the end of the logfile. If a test job fails to complete successfully, the cause is often a problem much earlier in the test - don't assume that the final few lines of the logfile will tell the whole story.
When writing tests, make things verbose to give yourself more useful logs in case they fail.
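For example, a wrapper script can log some basic context before doing the real work. A minimal sketch; exactly what is worth printing depends on what your test relies on:

#!/bin/sh
# Log some context up front so that later failures are easier to interpret.
echo "Kernel version:"
uname -a
echo "Network configuration:"
ip a show
echo "Mounted filesystems:"
mount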
If the test system does not (seem to) boot at all, there are a few things worth checking:
Did the kernel boot OK but then fail to find the root filesystem? This is a common failure mode, and there are quite a few possible causes. Here are some of the more common failure cases.
* Check that the kernel has the drivers needed to reach the rootfs: storage devices (e.g. sd_mod), filesystems (e.g. ext4) or network interfaces (e.g. e1000e) if you're using NFS for the rootfs. You should be able to see what devices are found by the kernel by reading the boot messages; check that the device you are expecting to use does show up there.
* Check that you are passing the correct root= parameter (an example command line is shown after this list).
* Check that the rootfs contains a valid init program, in the correct location. In an initramfs, the default location is /init; this can be over-ridden on the kernel command line using the init= parameter.
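As an illustration, a kernel command line for an NFS rootfs might look something like the following; the server address, export path and init location here are hypothetical and will differ for your setup:

console=ttyS0,115200 root=/dev/nfs nfsroot=192.168.1.10:/srv/nfs/rootfs,tcp rw ip=dhcp init=/sbin/init

For a local block device the equivalent might be root=/dev/mmcblk0p2 rootwait. Comparing the command line the job actually used (shown in the boot messages, or in /proc/cmdline if the system gets that far) against what you expected is a quick way to spot a wrong root= or init= value.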
parameter.This is a common theme throughout the suggested workflow for developing tests in LAVA. Start with simple test jobs and verify they work as expected. Add complexity one step at a time, ensuring that each new option or test suite added behaves as expected. It’s much easier to work out what has broken in a test job if you’ve made just one small change to a previous test job that worked fine.
Similarly, if you have a complex test job that’s not working correctly then often the easiest way to find the problem is to simplify the job - remove some of the complexity and re-test. By removing the complex setup in the test, it should be possible to identify the cause of the failure.
If there are standard test jobs available for the device type in question, it might be useful to compare your test job to one of those standard jobs, or even start with one and append your test definitions.
When developing a test, resist the urge to make too many changes at once - test one element at a time. Avoid changing the deployed files and the test definition in the same job. When the deployed files change, use an older test definition and an inline definition to explicitly check for any new support your test will want to use from those new files. If you change too many variables at once, it may become impossible to work out what change caused things to break.
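For example, if only the deployed files have changed, a short check run from the old, known-good test definition (or an inline definition) can confirm that the new files provide what later tests will rely on, before any new test logic is added. A minimal sketch; the binary name and path below are hypothetical:

#!/bin/sh
# Fail early if the newly deployed files are missing something the
# real tests will depend on.
set -ex
command -v iperf3
test -e /etc/myapp/config
uname -r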
Especially when developing a new test, add plenty of output to explain what is going on. If you are starting with a new test device or new boot files, make it easy to diagnose problems later by adding diagnostics early in the process. In general, it is much easier to debug a failed test when it is clear about what it expects to be happening than one which just stops or says “error” in the middle of a test.
* If your test sets up networking, run ifconfig or ip a show afterwards to show that it worked.
* If your test relies on storage or mounted filesystems, run df or mount to show what devices and filesystems are available.
* If you are writing shell scripts to wrap tests, try using set -x - this will tell the shell to log all lines of your script as it runs them. For example:
#!/bin/sh
set -e
set -x
echo "foo"
a=1
if [ $a -eq 1 ]; then
echo "yes"
fi
will give the following output:
+ echo foo
foo
+ a=1
+ [ 1 -eq 1 ]
+ echo yes
yes
There are some common mistakes when using LAVA which can cause issues. If you are experiencing strange problems with your test job, it is worth checking whether any of these apply.
Pipes, redirects and nested sub-shells will not work reliably when put directly into the YAML. Use a wrapper script (with set -x) instead, for safety:
#!/bin/sh
set -e
set -x
ifconfig|grep "inet addr"|grep -v "127.0.0.1"|cut -d: -f2|cut -d' ' -f1
Un-nested sub-shells do work, though:
- lava-test-case multinode-send-network --shell lava-send network hostname=$(hostname) fqdn=$(hostname -f)
If you use a custom result parser, make one of your YAML files print the full output of the test to stdout so that you can reliably capture a representative block of output. Test your proposed result parser against that block using your favourite language.
Comment out the parser from the YAML if there are particular problems, just to see what the default LAVA parsers can provide.
Note
Parsers can be difficult to debug after being parsed from YAML into shell. LAVA developers used to recommend the use of custom parsers, but experience has shown this to be a mistake. Instead, it is suggested that new test definitions should use custom scripts. This allows the parsing to be debugged outside LAVA, as well as making the test itself more portable.
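As a rough sketch of the custom-script approach (the test command, output file and test case name here are hypothetical), the wrapper runs the real test, does its own parsing in plain shell, and reports each result explicitly with lava-test-case:

#!/bin/sh
set -x
# Run the real test and keep its full output in the job log.
./run-my-test > my-test-output.txt 2>&1
cat my-test-output.txt
# Parse the output here, in the script, rather than relying on a
# pattern-based parser defined in the YAML.
if grep -q "RESULT: PASS" my-test-output.txt; then
    lava-test-case my-test-smoke --result pass
else
    lava-test-case my-test-smoke --result fail
fi

Because the parsing lives in the script, the same logic can be exercised and debugged on a development machine, away from LAVA.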
* If you use cd in your YAML, always store where you were and where you end up using pwd (see the sketch after this list).
* Run realpath on the paths you depend on and use that to debug your directory structure.
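A small sketch of what this can look like in a wrapper script; the directory and path names are hypothetical:

#!/bin/sh
set -x
start_dir=$(pwd)         # record where the test started
cd artifacts/results     # hypothetical directory
pwd                      # log where we ended up
realpath ./latest        # resolve a path before relying on it
cd "$start_dir"          # return to where we started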
MultiNode tests are necessarily more complex than jobs running on single test devices, and so there are extra places where errors can creep in and cause unexpected failures.
This may seem obvious, but one of the most common causes of MultiNode test failure has nothing to do with MultiNode. If your MultiNode tests are failing to boot correctly, check that the basics of each of the desired roles work independently. Remove the MultiNode pieces and just check that the specified deploy and boot actions work alone in a single-node test with the right device-type. Then add back the MultiNode configuration, changing one thing at a time and ensuring that things still work as you build up complexity.
A lava-wait must be preceded by a lava-send from at least one other device in the group, or the waiting device will time out.
This can be a particular problem if you remove test definitions or edit a YAML file without checking other uses of the same file. The simplest (and hence recommended) way to use the MultiNode synchronisation calls is to use inline definitions.
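As a rough illustration (the message ID and role names here are hypothetical), the two sides of a synchronisation point might look like this, each running in the test definition for its own role:

# On the "server" role: tell the rest of the group that this device is ready.
lava-send server-ready hostname=$(hostname)

# On the "client" role: block until some device has sent "server-ready".
lava-wait server-ready

The client will only continue once a device in the group has sent the server-ready message; if no device ever sends it, the wait will eventually hit its timeout.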
Always check whether the test result came back as a failure due to some cause other than the test definition itself. Particularly with MultiNode test jobs, a test can fail for other reasons like an unrelated failure on a different board within the group.