Normal good programming practice is to signal errors visibly
so that they can be fixed. From this point of view, attempts
to hide errors are counterproductive: bugs stay unfixed and
the system ends up worse than one where errors are signalled
and bugs are promptly fixed. However, some errors appear
randomly and are hard to reproduce. In particular,
hardware errors tend to be random. In such cases running
the program a second time will frequently succeed. If errors
are rare, it may be quite hard to reproduce one, and
consequently debugging is hard. This means that during
operation of an embedded system we should expect some errors.
Consequently, it makes sense to design systems in such
a way that the effects of errors are minimized (possibly
completely masked). There are two ingredients to this:
early detection of errors and some way of recovery.
Simple techniques for detection include:
- duplicating a computation and comparing the results
- checking that results are in a prescribed range
- checking various integrity constraints, for
example a motor cannot be simultaneously on and off
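
The three detection techniques listed above can be sketched in C
as follows. This is only an illustration: the temperature
conversion, its limits and all function names are made up for
the example and do not come from any particular library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits for a temperature, in tenths of a degree C. */
#define TEMP_MIN_DECI_C (-400)
#define TEMP_MAX_DECI_C 1250

/* Hypothetical conversion from a 12-bit ADC count to tenths of a degree. */
static int32_t raw_to_deci_c(uint16_t raw)
{
    return ((int32_t)raw * 1650) / 4095 - 400;
}

/* Duplicated computation: compute twice, report failure on disagreement. */
static bool convert_checked(uint16_t raw, int32_t *out)
{
    /* The volatile copy keeps the compiler from assuming both inputs
       are equal and folding the two computations into one. */
    volatile uint16_t raw_copy = raw;
    int32_t first = raw_to_deci_c(raw);
    int32_t second = raw_to_deci_c(raw_copy);

    if (first != second) {
        return false;
    }
    *out = first;
    return true;
}

/* Range check: the result must lie inside the prescribed interval. */
static bool temp_in_range(int32_t deci_c)
{
    return deci_c >= TEMP_MIN_DECI_C && deci_c <= TEMP_MAX_DECI_C;
}

/* Integrity constraint: the "on" and "off" outputs of the motor
   must never be active at the same time. */
static bool motor_outputs_consistent(bool drive_on, bool drive_off)
{
    return !(drive_on && drive_off);
}
```

A caller would treat a false result from any of these checks as
a detected error and start recovery.
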

One kind of error which happens relatively frequently
is deadlock. There is a simple technique which can
detect deadlocks: a watchdog. A watchdog is a timer which
has to be accessed at regular intervals. If there
is no access during the prescribed time, the watchdog
resets the system (presumably resolving the deadlock).
Access to the watchdog should be put in a low priority
task or in the main task. In a multitasking system
with fixed priorities, if a higher priority task
gets into an infinite loop, then the low priority task
can no longer run, so the watchdog is not accessed
and the system is reset. There are some traps when
using a watchdog. First, a watchdog does not help
in case of a deterministic error: after reset
the system will fail in the same way and we will
get a continuous sequence of resets. Second,
in an otherwise correct system a missing
(or too late) access to the watchdog can cause
a spurious reboot. Third, access to the watchdog
should be done from a place which gives reasonable
assurance that the system is working correctly.
For example, accessing the watchdog from a high
priority interrupt handler only confirms that the
interrupt handler is called, while all lower
priority code may be deadlocked. If high
or medium priority tasks may block, it is
advisable to combine the hardware watchdog with
a software one: higher priority tasks should
communicate with a low priority watchdog task,
and the watchdog task should access the hardware
watchdog only when all of them have reported in.
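
A minimal sketch of combining a software watchdog with the
hardware one is given below. It assumes three monitored tasks
and uses a counter as a stand-in for the real hardware access;
all names (task_checkin, watchdog_task_step, hw_watchdog_kick)
are hypothetical and not part of any vendor API.

```c
#include <stdint.h>

#define NUM_MONITORED_TASKS 3u
#define ALL_TASKS_MASK ((1u << NUM_MONITORED_TASKS) - 1u)

/* Bit i is set once task i has reported progress. */
static volatile uint32_t checkin_flags;

/* Stand-in for the hardware access. On a real target this would
   reload the hardware watchdog; here it only counts calls. */
static unsigned hw_kick_count;

static void hw_watchdog_kick(void)
{
    hw_kick_count++;
}

/* Called by each monitored task at a point which shows that the
   task is making progress. On a real system this read-modify-write
   must be made atomic, for example inside a critical section. */
static void task_checkin(unsigned task_id)
{
    if (task_id < NUM_MONITORED_TASKS) {
        checkin_flags |= (1u << task_id);
    }
}

/* Body of the low priority watchdog task, run once per period. */
static void watchdog_task_step(void)
{
    if ((checkin_flags & ALL_TASKS_MASK) == ALL_TASKS_MASK) {
        checkin_flags = 0;
        hw_watchdog_kick();
    }
    /* Otherwise the hardware watchdog is left alone. If the missing
       task does not report before the hardware timeout, the
       hardware watchdog resets the system. */
}
```

The period of the watchdog task and the hardware timeout have to
be chosen together, otherwise a slow but correct task can cause
a spurious reset.
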

At the hardware level, some STM32 microcontrollers offer
parity checking of RAM. There exist processors with
duplicated processor cores which signal an error
when results from the cores disagree. It makes
sense to use such hardware features when available.
Otherwise one can use software techniques, like
duplicating one logical variable into two
places in memory and checking for agreement.
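
Duplicating a variable can look like the sketch below. Note one
deliberate choice which is not in the text above: the second copy
is stored bit-inverted, a common variant which also catches faults
that force both words to the same value. The type and function
names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* One logical variable kept in two places in memory. */
typedef struct {
    uint32_t value;
    uint32_t inverted;   /* equals ~value while the pair is intact */
} guarded_u32;

/* Write both copies. */
static void guarded_write(volatile guarded_u32 *g, uint32_t v)
{
    g->value = v;
    g->inverted = ~v;
}

/* Read the variable; returns false if the two copies disagree. */
static bool guarded_read(const volatile guarded_u32 *g, uint32_t *out)
{
    uint32_t v = g->value;
    uint32_t inv = g->inverted;

    if (v != (uint32_t)~inv) {
        return false;
    }
    *out = v;
    return true;
}
```

The check only detects corruption. What to do next (use a default,
recompute, reset) is a separate decision.
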

Another aspect is recovering from errors and
minimizing the effects of fatal errors. A typical
method of recovery from random hardware
errors is to repeat the operation. At the level
of the whole system this means resetting the whole
system. Let us add that to increase reliability
it makes sense to initialize hardware to the
desired state instead of relying on the default
state after reset. Also, instead of depending
on the state remaining unchanged, it makes sense to
reinitialize hardware from time to time.
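
Repeating an operation can be sketched as a bounded retry loop.
The function names are hypothetical, and flaky_operation is only
a stand-in for a hardware access which fails transiently.

```c
#include <stdbool.h>

/* An operation reports success (true) or failure (false). */
typedef bool (*operation_fn)(void *ctx);

/* Try the operation up to max_attempts times. A false result means
   the failure is persistent and the caller should escalate, for
   example by entering a safe state or resetting the system. */
static bool retry_operation(operation_fn op, void *ctx,
                            unsigned max_attempts)
{
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        if (op(ctx)) {
            return true;
        }
    }
    return false;
}

/* Stand-in operation for demonstration: fails as many times as the
   counter pointed to by ctx says, then succeeds. */
static bool flaky_operation(void *ctx)
{
    unsigned *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    return true;
}
```

The bound matters: retrying forever would turn a persistent
failure into exactly the kind of hang the watchdog has to resolve.
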

To avoid bad effects it is important to have
designated safe states. For example, in a machine
having moving parts, the state with no
movement (stopped) is typically a safe state. The system
should be designed so that the most probable errors
leave the system in a safe state.
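
One way to make errors lead to the safe state is to treat every
unexpected input as a request to stop. The sketch below assumes
a motor with three commands; the names are invented for the
example.

```c
#include <stdbool.h>

/* CMD_STOP is zero, so zero-initialized memory also means "stopped". */
typedef enum {
    CMD_STOP = 0,
    CMD_FORWARD,
    CMD_REVERSE
} motor_command;

/* Decide which command to apply. A detected fault or an unknown
   (possibly corrupted) request always results in CMD_STOP. */
static motor_command select_command(int requested, bool fault_detected)
{
    if (fault_detected) {
        return CMD_STOP;
    }
    switch (requested) {
    case CMD_FORWARD:
        return CMD_FORWARD;
    case CMD_REVERSE:
        return CMD_REVERSE;
    default:
        return CMD_STOP;
    }
}
```
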

ST application note about errors:
AN4750 Handling of soft errors in STM32 applications