Hiding errors

Good programming practice is normally to signal errors visibly so that they can be fixed. From that point of view, attempts to hide errors are counterproductive: bugs stay unfixed, and the system ends up worse than one where errors are signalled and bugs are promptly fixed. However, some errors appear randomly and are hard to reproduce. Hardware errors in particular tend to be random, and in such cases running the program a second time will frequently succeed. If errors are rare, reproducing one may be quite hard, and consequently debugging is hard. This means that during operation of an embedded system we should expect some errors. It therefore makes sense to design systems so that the effects of errors are minimized (possibly completely masked). There are two ingredients to this: early detection of errors and some way of recovery. Simple techniques for detection include:

duplicating computation and comparing results
checking if results are in prescribed range
checking various integrity constraints, for example a motor cannot be simultaneously on and off
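
As an illustration, the three techniques above might look as follows in C. This is only a sketch under assumptions: the function names, the speed limits and the motor_state structure are invented for the example and do not come from any library or from ST code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits for a speed setpoint, chosen only for illustration. */
enum { SPEED_MIN_RPM = 0, SPEED_MAX_RPM = 3000 };
static_assert(SPEED_MIN_RPM <= SPEED_MAX_RPM, "speed range must not be empty");

/* Hypothetical computation whose result we want to protect. */
static int32_t scale_setpoint(int32_t raw)
{
    return (int32_t)(((int64_t)raw * 3) / 4);
}

/* Technique 1: compute twice and compare.
   Reading the input through a volatile object forces two separate loads,
   so the compiler cannot simply reuse the first result. */
bool compute_checked(int32_t raw, int32_t *out)
{
    volatile int32_t input = raw;
    int32_t first = scale_setpoint(input);
    int32_t second = scale_setpoint(input);

    if (first != second) {
        return false;           /* results disagree: signal an error */
    }
    *out = first;
    return true;
}

/* Technique 2: check that a result lies in the prescribed range. */
bool speed_in_range(int32_t rpm)
{
    return rpm >= SPEED_MIN_RPM && rpm <= SPEED_MAX_RPM;
}

/* Technique 3: integrity constraint. The two flags are kept separately,
   and exactly one of them must be set at any time. */
struct motor_state {
    bool on;
    bool off;
};

bool motor_state_consistent(const struct motor_state *m)
{
    return m->on != m->off;
}
```

What to do when one of these checks fails is left to the caller; typically the operation is repeated or the system moves to a safe state.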

One kind of error that happens relatively frequently is deadlock. There is a simple technique which can detect deadlocks: a watchdog. A watchdog is a timer which has to be accessed at regular intervals. If there is no access during the prescribed time, the watchdog resets the system (presumably resolving the deadlock). Access to the watchdog should be put in a low priority task or in the main task: in a multitasking system with fixed priorities, if a higher priority task gets into an infinite loop, then the low priority task can no longer run, so the watchdog is not accessed and the system is reset.

There are some traps when using a watchdog. First, a watchdog does not help in case of a deterministic error: after reset the system will fail in the same way and we will get a continuous sequence of resets. Second, in an otherwise correct system, a missing (or too late) access to the watchdog can cause a spurious reboot. Third, the watchdog should be accessed from a place which gives reasonable assurance that the system is working correctly. For example, accessing the watchdog from a high priority interrupt handler only makes sure that the interrupt handler is called, while all lower priority code may be deadlocked. If high or medium priority tasks may block, it is advisable to combine the hardware watchdog with a software one: higher priority tasks should communicate with a low priority watchdog task, and the watchdog task should access the hardware watchdog.

At the hardware level, some STM32 processors offer parity checking of RAM. There exist processors with duplicated cores which signal an error when results from the cores disagree. It makes sense to use such hardware features when available. Otherwise one can use software techniques, like storing one logical variable in two places in memory and checking for agreement.

Another aspect is recovering from errors and minimizing the effects of fatal errors. The typical method of recovery from random hardware errors is to repeat the operation. At the level of the whole system this means a reset of the whole system.
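
The combination of a hardware watchdog with a software one, mentioned above, can be sketched roughly as follows. This is a minimal sketch under assumptions: the task numbering, the names task_checkin and watchdog_task_step, and the counter standing in for the hardware are all hypothetical. On a real system hw_watchdog_refresh would write to the watchdog peripheral, and the update of alive_mask would have to be atomic or done with interrupts disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of monitored tasks; each task gets one bit. */
enum { TASK_COUNT = 3 };
static_assert(TASK_COUNT < 32, "one bit per task must fit in a 32-bit mask");

#define ALL_TASKS_MASK ((UINT32_C(1) << TASK_COUNT) - 1u)

static volatile uint32_t alive_mask;   /* bit i set: task i has checked in */
static uint32_t hw_refresh_count;      /* stand-in for the hardware watchdog */

/* Placeholder for the real hardware access (assumption: on a real target
   this would reload the hardware watchdog counter). */
static void hw_watchdog_refresh(void)
{
    hw_refresh_count++;
}

/* Called by every monitored task once per cycle of its main loop.
   NOTE: this read-modify-write is not atomic; a preemptive system
   needs a critical section or an atomic operation here. */
void task_checkin(unsigned task_id)
{
    if (task_id < TASK_COUNT) {
        alive_mask = alive_mask | (UINT32_C(1) << task_id);
    }
}

/* Called periodically from the low priority watchdog task.
   The hardware watchdog is refreshed only when every task has checked in
   since the last refresh. Otherwise nothing is done, and the hardware
   watchdog eventually resets the system. */
bool watchdog_task_step(void)
{
    if (alive_mask != ALL_TASKS_MASK) {
        return false;
    }
    alive_mask = 0;
    hw_watchdog_refresh();
    return true;
}

/* Number of refreshes so far, exposed only so the sketch can be tested. */
uint32_t hw_refresh_count_get(void)
{
    return hw_refresh_count;
}
```

With this arrangement a single blocked task is enough to stop the refreshes, even if the watchdog task itself keeps running.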

Let us add that, to increase reliability, it makes sense to initialize hardware to the desired state instead of relying on the default state after reset. Also, instead of depending on the state remaining unchanged, it makes sense to reinitialize hardware from time to time. To avoid harmful effects it is important to have designated safe states. For example, in a machine with moving parts, the state with no movement (stopped) is typically a safe state. The system should be designed so that the most probable errors leave the system in a safe state.

ST text about errors: AN4750, Handling of soft errors in STM32 applications.
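
Repeating a failed operation and falling back to a safe state can be sketched as below. Again this is only an illustration under assumptions: the retry limit, the names run_with_retry and enter_safe_state, and the flag standing in for the motor output are hypothetical, and flaky_operation exists only to demonstrate the behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical retry limit. */
enum { MAX_ATTEMPTS = 3 };
static_assert(MAX_ATTEMPTS > 0, "at least one attempt is needed");

/* An operation returns true on success, false on a detected error. */
typedef bool (*operation_fn)(void *ctx);

static bool motor_enabled = true;   /* stand-in for a real output */

/* Designated safe state: no movement. */
void enter_safe_state(void)
{
    motor_enabled = false;
}

bool motor_is_enabled(void)
{
    return motor_enabled;
}

/* Try the operation up to MAX_ATTEMPTS times. A random error is likely
   to disappear on a later attempt; if the error persists, the system is
   put into the safe state and the failure is reported to the caller. */
bool run_with_retry(operation_fn op, void *ctx)
{
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        if (op(ctx)) {
            return true;
        }
    }
    enter_safe_state();
    return false;
}

/* Demonstration operation: ctx points to an int holding the number of
   failures still to come; the operation fails until it reaches zero. */
bool flaky_operation(void *ctx)
{
    int *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return false;
    }
    return true;
}
```

An error that goes away within the retry limit is masked completely, while a persistent one stops the motor instead of leaving it in an unknown state.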