OS/2 Crash Protection

Chapter: Chapter 02 – Inside OS/2 Warp
Subsection: 01. Base Operating System Architecture
Document Number: 03
Topic: OS/2 Crash Protection
Date Composed: 10-04-96 10:33:11 AM
Date Modified: 01-13-97 01:09:19 PM

This section discusses the term Crash Protection and what it means within the context of the OS/2 operating system. It also discusses why applications crash, the causes of OS/2 crashes, and methods which can be used to minimize crashes of the operating system.

Crash Protection

IBM began talking about OS/2 and its crash protection in early 1992, while OS/2 2.0 was in beta testing. IBM states on the OS/2 Warp package that “[OS/2’s] Crash Protection helps prevent a single, wayward program from affecting the rest of your system.” Notice that the term Crash Protection is a trademark of the IBM Corporation. IBM thinks that this is so important that they want no one else to use the term. If Crash Protection is so important, and if OS/2 has it, why does OS/2 still crash?

Application crashes

Application programs can and do crash. When an application crashes, the cause is usually a programming error of some type. The short C++ program in Figure 1 is an example of a program which may crash under some circumstances. This program is designed to request a name as input. The variable “Name” is defined as a three character array. Strictly speaking, such an array holds only two letters plus the terminating null character, so even a three letter name writes one byte past the end of the array; that small overrun simply happens not to cause a visible failure. Most users will type their own name, and names vary in length. The program overruns the array and crashes, often with an access violation and a SYS3175 error, if the user enters a name with more than three characters. Thus it fails for long names and not for short ones. The SYS3175 error message is displayed in a PM dialog box. It is also possible to get SYS3170, SYS3171, or SYS1808 errors depending upon the length of the name entered.

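#include <iostream.h>

void main()
{
    char Name[3];    // room for only two letters plus the terminating null

    cout << endl << "Enter a name: ";
    cin >> Name;     // no length check: a long name overruns the array
    cout << endl << "The name you entered is: " << Name << "." << endl;
}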

Figure 1: This C++ program can crash under the right circumstances.

In the case of a three letter name like “Don”, the program displays a text line which says, “The name you entered is: Don.” No error occurs. With a four letter name, like “Dave”, the program crashes and OS/2 displays a dialog box which says that a SYS3171 error has occurred. If the user chooses to view the register display from the dialog box choices, he or she will find that the error detail is “exception c0000005… insufficient stack space”.

Now assume that the user enters a fifteen letter name. The program crashes again and the user is presented with a dialog box which says that a SYS3175 error has occurred. This time viewing the register display shows that the error is an “…access violation…”. This program has also generated a SYS3186 error, which is a “privileged instruction” error, when used with the name “Jennifer”, for example.

In all cases, when the error occurs, the program crashes. In no case does OS/2 crash, and all of the other applications running on the system continue to run as if nothing has happened. As far as the other programs running on your system are concerned, nothing has happened, because they were protected from the crash of the defective program. This is what crash protection means, and OS/2 fulfills this promise quite nicely. This program would also crash in a DOS or Windows environment, but there the entire system would likely crash with it, and recovery would usually necessitate rebooting the system.

There are some very interesting conclusions which can be drawn from this little experiment. The first is that a single bug in a program can cause a number of different symptoms. In this case, at least five different errors have been presented.

The second conclusion we can make is that a bug may not necessarily present itself and cause a symptom; at least not until the proper circumstances are present. This situation can occur in a real program in which a programmer did not fully consider the size of a data field. The name field in a payroll program may work fine for years – until a new employee with a very long name is hired. An error occurs while entering the new employee into the payroll program, and the payroll clerk calls the company support person with this “new” bug. In many cases, the assumption is that the program is still fine, but the hardware is at fault, or that OS/2, which was installed last week, is now the problem.
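
One common way to avoid this class of bug is to tell the input routine how large the buffer is. The following is a minimal sketch in portable C, not a corrected version of the program in Figure 1; the function name read_name and its return convention are assumptions made for illustration. It never stores more than size bytes, always terminates the string, and reports whether the name had to be cut short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read one line into buf without ever writing more than size bytes.
   Returns 1 if the whole line fit, 0 if it was truncated,
   and -1 if nothing could be read (end of file or error). */
int read_name(FILE *in, char *buf, size_t size)
{
    size_t len;
    int ch;

    assert(in != NULL && buf != NULL && size > 0);
    if (fgets(buf, (int)size, in) == NULL) {
        buf[0] = '\0';
        return -1;
    }
    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';          /* complete line: drop the newline */
        return 1;
    }
    /* The buffer filled up (or input ended) before a newline was seen. */
    ch = fgetc(in);
    if (ch == '\n' || ch == EOF)
        return 1;                     /* the name fit exactly */
    while (ch != '\n' && ch != EOF)   /* discard the rest of the long line */
        ch = fgetc(in);
    return 0;
}
```

With a four byte buffer, the input “Jennifer” is stored as “Jen” and the function returns 0, so the caller can warn the user or ask again instead of overwriting adjacent memory.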

Application lockups

OS/2’s multitasking uses a priority-based, round-robin, preemptive algorithm. This means that OS/2 gives the CPU to the ready task with the highest priority, and tasks of equal priority take turns. If a task with a lower priority is running when a higher priority task becomes ready, OS/2 preempts the running task. A lower priority task does not run again until the higher priority task has completed or is blocked, for example while waiting for input or a timer.
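
The selection rule described above can be sketched in portable C. This is a simplified model and not OS/2's actual dispatcher: the task structure, the numeric priority scale, and the function pick_next are hypothetical, and details such as time slices and priority classes are ignored.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int priority;   /* larger value = more urgent (hypothetical scale) */
    int ready;      /* nonzero when the task can use the CPU */
};

/* Return the index of the task to run next, or -1 if no task is ready.
   last is the index of the task that ran most recently (-1 if none). */
int pick_next(const struct task *tasks, int n, int last)
{
    int best = -1;
    int step;

    assert(tasks != NULL && n > 0 && last >= -1 && last < n);
    for (step = 1; step <= n; step++) {
        int i = (last + step) % n;   /* start just after the last runner */

        if (!tasks[i].ready)
            continue;
        /* Strictly greater: among equal priorities the first one found
           after "last" wins, which gives the round-robin behavior. */
        if (best < 0 || tasks[i].priority > tasks[best].priority)
            best = i;
    }
    return best;
}
```

With one task at priority 1 and two ready tasks at priority 3, repeated calls alternate between the two high priority tasks, and the low priority task is chosen only when both of the others are not ready.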

A poorly written OS/2 application can take up all the CPU time and cause the entire system to lock up. Figure 2 contains a C program which operates entirely within the rules of OS/2 and its programming APIs, but which will lock up the system as soon as it begins to run. This program boosts itself to the highest priority in the system and loops in a math operation. This prevents OS/2 from allowing any other application CPU time.

#define INCL_DOSPROCESS
#include <os2.h>
#include <stdio.h>
#include <math.h>

int main(void)
{
    double      value, result;
    LONG        PriorityDelta;
    ULONG       ID;
    APIRET      rc;

    PriorityDelta = 31;      /* delta to process priority             */
    ID = 0;                  /* Current process                       */
    value = 236723;

    /* Raise this process to the time-critical priority class. */
    rc = DosSetPriority(PRTYS_PROCESS, PRTYC_TIMECRITICAL, PriorityDelta, ID);
    if (rc != 0) {
        printf("DosSetPriority error: Return Code = %lu\n", rc);
        return 1;
    }

    /* Loop forever without blocking, so no other task gets the CPU. */
    for (; ; ) {
        result = sqrt(value);
    } /* endfor */
}

Figure 2: This program will lock up OS/2 by hogging all the CPU cycles.

On rare occasions, a program may enter a high priority section of code and get stuck there. This can cause a complete system lockup. In other cases, the program will stay in the high priority section for several seconds or even minutes. This is not good programming, but it happens. In those cases the system returns to normal once the program causing the problem reduces its priority or completes execution of the high priority thread.

It is important to note that code can be written which can cause any operating system to lock up or crash.

In most real world applications, the priority of a task is boosted temporarily and other applications and tasks will not even be affected. Other applications use a separate thread for high priority activities. These threads only need to execute for a very short period of time and the effect on the rest of the programs running in the system is negligible.

The Single Input Queue Dilemma

Another cause of OS/2 crashes is the single input queue (SIQ) design of the Workplace Shell (WPS). The single input queue simply means that all of the mouse input and keystrokes sent to the WPS are sent to a single queue to wait until they can be processed by the application for which they were intended. This means that an application which was not written to respond to the queue properly and release it immediately can cause the entire desktop to lock up. Memory overcommitment of 100% or more (that is, when total system memory requirements are 200% or more of installed RAM) also seems to contribute to the single input queue lockup problem.

In reality, OS/2 is not locked up; only the desktop is frozen. The processing of other applications continues so that if you have a print job spooling or a download from a BBS in progress, those tasks continue to run. You can see this if you start the OS/2 System Clock and configure it to show the second hand. If your system locks because of the single input queue, the second hand on the clock will continue to run. You can wait until critical apps or downloads have finished before you reboot the system, if it actually comes to that.

The single input queue was designed into OS/2 by Microsoft back in the days of OS/2 1.1, the first version to have the Presentation Manager desktop environment. This was done because – so the story goes – the programmer responsible for the Presentation Manager did not know how to deal with multiple input queues. With multiple input queues, each application has a separate queue for mouse and keyboard activity, so that a single misbehaved application does not lock up the entire WPS. Microsoft has used a multiple input queue strategy in Windows NT and Windows 95 to overcome this problem.
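
The difference between the two designs can be illustrated with a toy model in portable C. This is only a sketch of the idea, not the Presentation Manager's message queue implementation; the event structure and both function names are hypothetical. Each event is addressed to an application, and an application either is or is not reading its input.

```c
#include <assert.h>
#include <stddef.h>

/* One input event addressed to an application (hypothetical model). */
struct event {
    int app;   /* index of the destination application */
};

/* Single shared queue: events are handed out strictly in order, so
   delivery stops at the first event whose application is not reading.
   Returns the number of events delivered. */
int deliver_single_queue(const struct event *q, int n, const int *responding)
{
    int i;

    assert(q != NULL && responding != NULL && n >= 0);
    for (i = 0; i < n; i++) {
        if (!responding[q[i].app])
            break;               /* everything behind this event waits */
    }
    return i;
}

/* One queue per application: a stuck application holds up only its own
   events. Returns the number of events delivered. */
int deliver_per_app_queues(const struct event *q, int n, const int *responding)
{
    int i, delivered = 0;

    assert(q != NULL && responding != NULL && n >= 0);
    for (i = 0; i < n; i++) {
        if (responding[q[i].app])
            delivered++;
    }
    return delivered;
}
```

With five queued events and one unresponsive application whose event sits second in line, the single queue model delivers only the first event, while the per-application model delivers every event meant for the applications that are still reading.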

IBM has not modified OS/2 to provide multiple input queues because they contend that doing so would break existing OS/2 applications which depend upon the single input queue model. IBM has said, however, that they intend to fix this problem, and appears to have included the circumvention in Fixpak 17, which became available in late January 1996. The fix is not multiple queues, but rather a watchdog timer on the single input queue which will release the queue automatically when one application refuses to respond.

There is a great deal of controversy over whether this approach to the solution results in a true fix or merely a workaround. If it works, however, the specific approach used is irrelevant. Fixpak 16 is also supposed to have had the SIQ fix, but has been withdrawn because it had some problems. If you have a copy of it, you should not install it. Fixpak 17 does contain the SIQ fix. The fix is described in the text file READ17_1.TXT which is included with the Fixpak. The following line must be added to the CONFIG.SYS file to activate the fix.

SET PM_ASYNC_FOCUS_CHANGE=ON x

The x parameter specifies, in milliseconds, the amount of time which OS/2 waits before determining that the application is not responding. The default with no parameter is 2000 milliseconds (2 seconds). The suggested range is from 2 to 5 seconds, that is, values from 2000 to 5000. Once it has determined that a program is not responding to the queue, OS/2 flags the queue as bad and switches to the next application you are trying to use.

OS/2 continues to monitor the queue to see whether the application begins to respond by reading messages. If this occurs, OS/2 marks the queue as good and continues to operate normally. If the application does not respond to the queue, you can try to terminate the program, or just ignore it.
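
The behavior described above amounts to a small state machine, which can be sketched in portable C. This is an illustration only and not IBM's implementation: the structure, the function names, and the choice to measure the timeout from the last message the application read are all assumptions, and time is passed in as a plain millisecond counter.

```c
#include <assert.h>
#include <stddef.h>

enum queue_state { QUEUE_GOOD, QUEUE_BAD };

struct watchdog {
    unsigned long timeout_ms;    /* plays the role of the x parameter */
    unsigned long last_read_ms;  /* when the application last read a message */
    enum queue_state state;
};

/* Start watching. A timeout of 0 selects the 2000 ms default from the text. */
void watchdog_init(struct watchdog *w, unsigned long timeout_ms,
                   unsigned long now_ms)
{
    assert(w != NULL);
    w->timeout_ms = (timeout_ms == 0) ? 2000UL : timeout_ms;
    w->last_read_ms = now_ms;
    w->state = QUEUE_GOOD;
}

/* The application read a message, so the queue is marked good again. */
void watchdog_on_read(struct watchdog *w, unsigned long now_ms)
{
    assert(w != NULL);
    w->last_read_ms = now_ms;
    w->state = QUEUE_GOOD;
}

/* Called periodically while input is waiting. Flags the queue as bad once
   the application has been silent for the whole timeout. */
enum queue_state watchdog_check(struct watchdog *w, unsigned long now_ms)
{
    assert(w != NULL);
    if (now_ms - w->last_read_ms >= w->timeout_ms)
        w->state = QUEUE_BAD;
    return w->state;
}
```

In this model a queue that has been flagged bad stays bad until the application reads a message again, which mirrors the recovery behavior described in the text.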

Recovering from hangs and lockups

Once you realize a lockup has occurred, you should spend a moment before taking any action to observe what is actually happening. Some apparent lockups are caused by normal activities like disk swapping. When launching a program with memory already significantly overcommitted, a large amount of swapper activity will occur. Since this activity must take priority over all else except communications, the system may seem to come to a halt. Heavy disk activity might indicate that the system is engaged in serious swapping rather than being locked up.
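
To judge how heavily a system is overcommitted, the definition used earlier in this section (100% overcommitment means that memory requirements are 200% of installed RAM) can be turned into a small calculation. The following C sketch is only an illustration; the function name and the use of megabytes are assumptions.

```c
#include <assert.h>

/* Percentage by which memory demand exceeds installed RAM.
   Returns 0 when everything fits in RAM. Both values are in megabytes. */
unsigned long overcommit_percent(unsigned long required_mb,
                                 unsigned long installed_mb)
{
    assert(installed_mb > 0);
    if (required_mb <= installed_mb)
        return 0;
    return (required_mb - installed_mb) * 100UL / installed_mb;
}
```

For a hypothetical machine with 16 MB of RAM, a workload needing 32 MB is 100% overcommitted and one needing 24 MB is 50% overcommitted.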

There are a couple of ways of recovering from OS/2 system hangs and lockups caused by the software types of problems we have discussed so far. First, press the Ctrl-Esc key combination repeatedly. This will normally result in a dialog box which prompts you to terminate the application. Doing so may terminate the hung program and remove the need for a reboot. Be sure to wait for several minutes before proceeding with any more drastic actions because OS/2 can take a very long time to respond to Ctrl-Esc. Ten or fifteen minutes would not be unreasonable to wait on a slow system (slower than a 33 MHz 486); five or six minutes would be appropriate on fast computers. Warp’s response to Ctrl-Esc seems much faster with Fixpak 16.

WatchCat

WatchCat is a shareware program written by a pair of programmers in Germany which can help in the recovery from some OS/2 lockup situations. When a computer with WatchCat installed locks up, the user generates a wake-up signal using one of several methods. A device driver is installed which allows WatchCat to respond to a user selected keyboard combination, or it can be configured during or after installation to respond to mouse or joystick input, or to switches connected to the serial or parallel ports, or other custom devices. Once WatchCat has control of the system in a full screen or windowed text mode session, the locked application can be terminated. By their nature, some types of applications cannot be terminated. The registered version can terminate more applications more “brutally”, and has more activation methods and other features.

WatchCat is intended to kill programs which are not responding to input, and therefore, to free the input queue. It seems to do a good job of this. WatchCat cannot, however, help with programs which hog the CPU by running at high priority. The program in Figure 2, for example, cannot be terminated because WatchCat cannot get any CPU cycles. I also found it necessary to use the key combinations several times to get WatchCat’s attention. The documentation suggests an alternative activation method when this occurs.

If terminating the locked program does not work, you can reboot using the Ctrl-Alt-Del key combination. The OS/2 kernel intercepts this soft reboot and flushes the cache and buffers to disk before terminating OS/2 and performing the reboot. This is not a wonderful “recovery” but you should not lose any data and all files will be closed properly.

If a soft reboot does not work, the chances are very high that you do not have a software problem, but rather a hardware problem such as a lost interrupt. You can also tell when the system has entered a “hard lock” because the second hand of the OS/2 system clock will stop. Hardware causes of OS/2 crashes are discussed in the following sections. When other attempts at recovery don’t work, you will have to reboot with the Big Red Switch.

Other causes of crashes

There are a number of other causes for system crashes. These are not unique to OS/2, however, and can affect any computer running under any operating system. The bus design of your system is important and the environment in which your system operates is also consequential.

System design

One area which can cause problems is the design of your system, particularly the system data bus. The original IBM PC data bus is called the ISA bus. ISA stands for Industry Standard Architecture – even though it is not an industry standard. The ISA bus was developed, along with the original IBM PC and DOS, based on the assumption that the PC would be used in a stand-alone, single tasking environment. As a result, the nature of the ISA bus is not conducive to a multitasking operating system.

In a true multitasking environment many different tasks can be under way at any given time. This can result in a large number of interrupts being generated. It is possible with the ISA bus for an interrupt to be lost if it happens to occur at the same time as another interrupt. A lost interrupt causes the system to hang or lock up. Many times, OS/2 hangs on an ISA bus system are symptomatic of old hardware technology.

IBM developed the Micro Channel bus for its PS/2 line of computers with a multitasking environment in mind. It is designed so that an interrupt cannot be lost, even if it coincides with another interrupt. Far fewer hangs occur on systems with Micro Channel than on ISA bus systems. The biggest problem with the Micro Channel bus is that IBM did not and does not know how to market personal computer products. The PCI bus which is becoming widespread in systems today is also designed to provide a better multitasking hardware platform than the ISA bus. Not only is it significantly faster than the ISA bus, it also helps to prevent lost interrupts.
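
The way an interrupt can be lost is easier to see in a toy model. The C sketch below assumes the commonly given explanation, which is not spelled out in this section: ISA interrupt lines are edge-triggered, so the controller latches a single pending flag per line, while Micro Channel and PCI use level-sensitive lines that stay asserted until every requesting device has been serviced. The function names and the rule that the handler services one device per invocation are hypothetical simplifications.

```c
#include <assert.h>

/* Edge-triggered line: one latch per line, so several requests that arrive
   before the handler runs collapse into a single pending interrupt.
   Returns how many of the simultaneous requests get serviced. */
int serviced_edge_triggered(int simultaneous_requests)
{
    int pending;

    assert(simultaneous_requests >= 0);
    pending = (simultaneous_requests > 0);   /* one latch, however many edges */
    return pending ? 1 : 0;
}

/* Level-sensitive line: the line stays asserted while any device still
   wants service, so the handler is invoked again until none remain. */
int serviced_level_sensitive(int simultaneous_requests)
{
    int outstanding;
    int serviced = 0;

    assert(simultaneous_requests >= 0);
    outstanding = simultaneous_requests;
    while (outstanding > 0) {                /* line is still asserted */
        outstanding--;
        serviced++;
    }
    return serviced;
}
```

In this model, two requests that coincide on an edge-triggered line result in only one being serviced, which corresponds to the lost interrupt described above, while the level-sensitive model services both.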

If you have a choice, you should purchase a system with a Micro Channel bus or a PCI bus to eliminate lost interrupts. These two busses are also designed to reduce problems caused by Electromagnetic Interference (EMI).

Electromagnetic Interference

Environmental problems for computers are similar to environmental pollution for human beings and other living things. Electronic pollution is called Electromagnetic Interference, or EMI. Electromagnetic interference is caused by two types of electromagnetic fields: radio frequency fields and magnetic fields. This class of phenomena affects the system hardware, but many of the symptoms can appear to be the result of problems in the operating system. Any operating system can be affected, including OS/2, DOS, Windows NT, AIX, Unix, and Windows 95.

There are a number of different electromagnetic phenomena which can cause problems for computers and other electronic equipment. Electrostatic discharge (ESD) can occur in the autumn and winter. Radio frequency interference (RFI) can occur near radio and TV stations, radar installations, airports, and other locations as well. Poor grounding can cause problems of its own and it can aggravate other problems like ESD and RFI.

Electrostatic Discharge

Electrostatic discharge (ESD) begins to show up in the autumn as the moisture content of the air – relative humidity (RH) – decreases. During the summertime the high relative humidity keeps ESD at bay by draining the electrostatic charges almost as quickly as they accumulate.

Electrostatic charges are created when two dissimilar materials are separated. The most commonly recognized method for people to accumulate a static charge is to walk across a carpet on a dry autumn or winter day. The static charge accumulated on such a day is not noticeable until it is discharged to another object – usually a door knob or the computer – with the crackle of a spark which causes an unpleasant jolt.

ESD can cause a computer to crash in many ways. You may find that your computer just hangs. You may experience parity errors in DOS or Trap errors in OS/2. Windows may present a general protection fault (GPF) as the result of ESD. The symptoms will vary and the true source of the problem will be very difficult to determine.

The case in which static is discharged from your body is probably the least common cause of problems for your computer. The charges accumulated are just not that high. The real culprit is your chair. A charge of up to 10,000 volts is generated when you get up out of the chair – remember the separation of dissimilar materials. The charge is retained by the chair because the casters on most chairs are rubber or plastic – both of which are nearly perfect insulators. When the chair touches or comes in close proximity to the desk or cart on which your computer sits, the resultant electrostatic discharge can and frequently does disrupt its operation.

Simple ways to prevent ESD

There are only a couple of things you can do to prevent ESD. You can also take steps to ensure that when ESD does occur, the results are as harmless as possible.

The primary and least expensive thing that anyone can do to reduce the occurrence of ESD is to prevent the buildup of static charges. The best way to do this is to keep the relative humidity in the computer room between 45% and 70%; a range of 50% to 60% RH is ideal. Within the 45% to 70% range, the moisture in the air drains away the static charge quickly enough that it does not build to a level high enough to cause a discharge.
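
If humidity readings are logged, the ranges given above can be checked mechanically. The following C sketch is only an illustration of those thresholds; the enumeration and function names are hypothetical, and the reading is assumed to come from a hygrometer as a percentage.

```c
#include <assert.h>

enum rh_rating { RH_BELOW_RANGE, RH_ACCEPTABLE, RH_IDEAL, RH_ABOVE_RANGE };

/* Classify a relative humidity reading (in percent) against the 45% to 70%
   working range and the 50% to 60% ideal range given in the text. */
enum rh_rating rate_humidity(double rh_percent)
{
    assert(rh_percent >= 0.0 && rh_percent <= 100.0);
    if (rh_percent < 45.0)
        return RH_BELOW_RANGE;    /* dry air: static charges can build up */
    if (rh_percent > 70.0)
        return RH_ABOVE_RANGE;    /* outside the recommended range */
    if (rh_percent >= 50.0 && rh_percent <= 60.0)
        return RH_IDEAL;
    return RH_ACCEPTABLE;
}
```

A reading of 30% on a dry winter day would be rated below range, which is the condition under which the static problems described in this section become likely.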

Another way to prevent static buildup, particularly on your chair, is to use special static draining carpet and chairs with casters which are designed to drain static to the floor and from there to the ground. This is obviously a more expensive solution than keeping the relative humidity in the correct range. It may be necessary, however, to use this approach in buildings or offices in which there is no control over the relative humidity.

There is another very simple way to keep the air in your computer room moist as well as filtered. Green, growing plants add moisture to the air and filter it to remove harmful pollutants and toxins. Plants also remove undesirable ions from the air. These attributes of plants are good for people as well as computers.

Magnetic fields

The magnetic field created by your computer monitor can cause problems when the monitor is placed on top of or in close proximity to your system. Most desktop computer monitors use a CRT (cathode ray tube) to generate an image using a beam of electrons. This electron beam is swept across the face of the CRT by a powerful magnetic field which can interfere with the ability of the system to read data from the diskette drive or from any device which has a cable passing too close to the CRT. This can cause CRC – cyclic redundancy check – or other disk media errors.

Almost any type of electrical device can generate a magnetic field. Many of these devices are not computer related. The best way to prevent problems due to magnetic fields is to remove the source of the magnetic field. Move the system monitor away from other devices like the system unit. Keep the entire system away from large electric motors or CRT devices like television sets.

Radio Frequency Interference

Radio Frequency Interference – RFI – is any unwanted electronic signal that is transmitted or received by an electronic device. A computer can generate RFI that interferes with the operation of other electronic devices, just as other devices generate RFI that disrupts the computer. RFI can propagate through the air, as with radio waves, or through the power lines to the power plug of your computer. These unwanted and undesirable signals, whatever their source and however they are propagated, can crash your computer unexpectedly or initiate any number of unusual symptoms.

Symptoms of RFI can be lockups and hangs, trap and SYSxxxx errors of various types, repeated booting, CRC and disk media errors, and internal processing errors. In other words, many of the errors that can be caused by real hardware or software problems can also be caused by RFI problems that are just as real but harder to find, prove, and fix. Problems that cannot be traced or explained in any other way should make you suspicious of RFI.

Sources of RFI

Nearby radio and TV stations can generate powerful electromagnetic signals. These signals propagate through the air and can be picked up by a computer. The cables attached to a computer and to peripheral devices can act as excellent antennae. The keyboard cable, the printer cable, the cables to modems and other external devices all pick up the radiated signals from radio and TV stations. If your system is located close enough to one, you may experience problems.

Powerful radar sources can also affect your system. An airport or air station can be the source of ground-based radar as well as aircraft radars and other radio signals. These radar signals can be powerful and can cause problems similar to those caused by radio and TV signals. The system entry points are cables, which act as antennae for radar signals. Microwave relay towers and cellular phones and their relay towers can cause RFI problems with computers, too.

Minimizing RFI problems

RFI problems cannot be prevented entirely, but they can be minimized by taking certain simple precautions. If you are located next to a radar installation, for example, even proper grounding and all other measures may not be enough to prevent high power radar pulses and radio emissions from interfering with your computer. The following suggestions should help to minimize problems with RFI.

One very common set of entry points for radiated RFI is loose system covers and missing card slot covers. Replace any missing card slot covers. These covers may have been left off after removing a card from the system. Be sure to always replace them when a card is removed from a system. Be sure to install the covers on your system unit or attached peripheral devices if you currently have them off. Fasten them down securely with the latches or screws provided. A missing or improperly installed system cover allows significant RFI entry into the system.

Another common entry point for RFI is the set of cables that connect the various external peripherals to the system. These cables act as antennae and can pick up radio frequency signals and transmit them inside the system where they can cause problems. They include the keyboard cable, mouse cable, serial communications and serial printer cables, parallel printer cables, audio cables if you have a sound card, data cables to external devices such as external SCSI hard drives or CD-ROM drives, and even the system’s own power cord.

To reduce RFI pickup on cables, ensure that each cable connector is seated properly in its receptacle at both ends of the cable and that the fasteners are properly installed and in use. If screws are used to hold the connector in place, ensure that they are tightened snugly. Where wire retaining clips are used, ensure that they are properly seated and latched. For printer cables that have separate ground wires, you should connect the ground wire to the screw or fastener provided on one end only. Connecting the ground wires on both ends can cause ground loop currents to flow that defeat the purpose of the ground wire. In the case of cables like this, the ground wire is used for shielding, and it makes its connection to the ground reference through the frame of the printer.

Check the ground

Most new homes and office buildings have adequate grounding for the proper operation of a personal computer system, but older homes and offices may not. Be aware, however, that even though a building is relatively new, there still may be problems such as loose connections and missing connections that increase the susceptibility of the system to RFI problems. If you even suspect an EMI problem, check the ground! Computers are much more susceptible to the effects of ESD and RFI when they are improperly grounded. A quick check of your computer’s ground can be accomplished with a simple electrical outlet ground checker available at most hardware or electrical supply stores. Even if the ground checks good, however, you could still have a grounding problem which can aggravate the effects of ESD and RFI.

The ideal ground wire installation is an insulated green wire at least the same size as the wires which supply the power to the outlet. The wire should connect only to a one inch diameter copper stake driven at least 10 feet into moist earth or to an equivalent copper water pipe. It should not connect to any other wire or bus at any point along its length. The connection to the ground stake should be made at a point no greater than twelve inches from the entry into the earth, and should be as close as possible to the earth.

Proper grounding is not difficult to achieve, but it can be expensive. You should definitely call a trained electrician to deal with this type of problem. Do NOT under any circumstances attempt to work on the electrical system of your home, office, or building yourself. It can kill you.

If you are having problems which no one can seem to fix, the important things to check are the relative humidity and the grounding of your computer. It would also be wise to consult with someone who specializes in resolving electromagnetic environmental problems.

OS/2 is crash protected

OS/2 does live up to the crash protected claims which IBM makes for it, but it is not crash-proof. Many of the causes of OS/2 crashes are really not problems with OS/2, but rather are the result of external factors which affect all operating systems equally. A true crash – as opposed to a single session lockup – which can be truly attributed to a problem in OS/2 is very rare. OS/2 is a solid platform which gets better with each release.