Recovering from I/O and Media Errors
There you are, computing along, when suddenly your application freezes. Other applications seem to be working, until you try to access a file, then they freeze too. Eventually something coughs up the dreaded "I/O Error" or "Media Error". Precious files are now beyond your reach, maybe your whole drive. Rebooting might help for a while, or it doesn't help at all. What do you do?
This article is about how to recover from I/O and media errors occurring on a magnetic hard disk drive. It does not apply to CDs, DVDs, SSDs, or flash drives. With modern hard drives, you can often rehabilitate the device, recovering your data and sometimes even restoring the drive to full working health. There are also things you can do which will destroy your data and the hardware, so pay attention.
What Not To Do
Never use filesystem recovery software to "fix" your drive if you suspect a hardware problem. Doing so will most likely destroy your data and may render the hardware unrecoverable.
There are a number of file system repair utilities out there, but they are designed strictly to fix software related data corruption issues. If the hardware is failing, not only will these programs not help, but having them muck around with your data while the hardware is unstable will almost certainly make the situation much worse.
Most utilities claim to be able to detect and diagnose hardware problems through the use of a technology standard called S.M.A.R.T. Nearly all modern drives support this standard which is supposed to monitor and report errors. In my experience, it simply does not work. I have never seen a drive report a S.M.A.R.T. condition other than "healthy", including drives that were so fried the platter wasn't even spinning.
So do NOT believe any software that claims your hardware is fine. If you are seeing "I/O" or "Media" errors, then you have a hardware problem.
Symptoms to Look For
There are four types of failures to look for: motor failure, media failure, cable failure, and low power.
All four may result in applications or the system freezing for a while during drive access, or a failure to boot. Cable and power problems can also cause the drive to unmount unexpectedly. Applications may report "I/O" or "Media" errors. If you are experiencing freezes, check your system or console logs for such errors. In Mac OS X, you can run
tail -f /var/log/system.log
in the terminal. Other unix systems may log such information to /var/adm/messages or similar locations. Here's an example:
Aug 30 13:48:13 localhost kernel: disk4s3: I/O error.
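To review errors that have already been logged, rather than watching the log live, a filter along these lines may help. It is only a minimal sketch: the function name find_disk_errors is made up for illustration, and the pattern assumes the messages contain the phrases "I/O error" or "media error" as in the example above. Your system may word them differently.

```shell
# Hypothetical helper: list logged lines that mention I/O or media errors.
# $1 is the log file to search (defaults to the Mac OS X system log).
find_disk_errors() {
    log=${1:-/var/log/system.log}
    # -i ignores case; -E enables the "this|that" alternation.
    grep -i -E 'I/O error|media error' "$log"
}
```

For example, find_disk_errors /var/log/system.log prints only the matching lines, which usually include the device name (such as disk4s3) so you can tell which drive is complaining.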
You may also hear a loud clacking noise from the drive: much louder and harsher than the usual clicking you may hear when files are being accessed. That clacking is the drive head being thrown back to its stop as the hardware controller tries to reset its position relative to the media tracks. This used to be the best indication of problems, but many modern drives are too quiet to hear this noise distinctly.
Not covered here is a head crash. That's where the drive heads physically contact the media. The result is a truly horrific screeching sound accompanied by a complete inability to access the device. This is extremely rare in modern hardware and is unrecoverable by normal means. Unless you are willing to pay a hardware recovery service a ton of money, all you can do about a head crash is hope you have a solid backup.
A motor failure is the most serious condition (after a head crash). Either the motor that spins the disks or the motor that moves the read/write head is no longer working properly and it will only get worse.
Spindle motor failure causes the drive to not start spinning during boot-up. It results in the drive being completely inaccessible. This may be related to a failure of the spindle ball-bearings (if your drive isn't one of the new liquid-bearing types), in which case it may be preceded by a loud whining noise. Repeated startup attempts may get it going, at which point it will probably behave normally until you restart. But repeated attempts will also further damage the motor, so if you do get it going, do not shut it off. Disable any "energy saving" settings that might spin it down, and proceed to the "What to Do" section below.
Head motor failure causes sporadic problems accessing all files, not just specific ones. It is most often heat related, so the drive may work fine when you first turn on your computer, but then begin seizing up as the system warms with use. For a head motor failure, you will need to cool the drive, which may include shutting it down for a while. So it's important to distinguish between a spindle motor and head motor failure.
If a motor is failing, then the hardware is doomed. Your goal is to get as much data off as you can before it dies completely.
Media failure is the most common condition. In fact, even a working drive experiences media failures all the time. Modern drive controllers watch for media blocks that are having problems and simply avoid using them in the future. This fact is the key to recovery, as I'll explain later.
Media failures are characterized by problems only occurring when certain files are read or only when files are being written. They are usually (but not always) independent of how long the system has been running. They most commonly occur with files that have not been accessed in months or years.
It is possible to completely recover from most media failures; it just takes time and patience. It is also possible to prevent most media failures, as I'll explain later.
Problems with data or power cables are tricky. There may be nothing wrong with your data (yet), but they can lead to data corruption and can exhibit the symptoms of other problems. The big danger with a cabling issue is that data being written to your drive can become corrupted, resulting in escalating software problems that could eventually wipe out your drive.
If the drive has been working up until now and you have not made any recent cabling adjustments, then cable failure is unlikely. On the other hand, if you just installed the drive, changed ports, or moved a cable, then that should be your first suspect.
Problems with the power supply can cause any of the above symptoms. Sudden unmounting of the drive is the most common symptom, especially for USB or Firewire bus powered devices. But low power can also cause more subtle symptoms like I/O errors or head clicking. Low power issues are most common when running on bus power, but they can also occur with wall power if the power supply is failing or if the power cable is damaged. Even internal drives can experience power problems if the system power supply is failing or internal power cables are damaged.
What to Do
Your next steps depend on the type of hardware failure diagnosed above. But since that diagnosis may not be perfect, you should read through all of the steps below.
If it is an external drive, make sure it is connected to wall power. If you are using a laptop, make sure it is running on wall power as well.
If you are having problems with your boot drive, boot off of another drive if you can. Macintosh systems with firewire can be placed into "Target Disk Mode" by booting up with the "T" key held down. (If you have a wireless keyboard, you may need to borrow a USB one for boot keys to work.) This turns the whole system into an external firewire drive that can be accessed from another computer. Otherwise consider moving the drive to an external enclosure so you can access it from another system. Booting off a failing drive is very risky for motor and cable failures. Booting off a drive experiencing media failures is not as bad, but still best avoided.
If you are dealing with a motor failure, then you have only a limited number of minutes of runtime or a limited number of reboots before the drive fails completely. So you have to act carefully and efficiently. Media and cable failures are less time sensitive.
If your system is mostly operational and you have a fairly recent backup, then try to read off your most critical data as quickly as possible. If you already have software to control your cooling fan speeds, turn them up to maximum. But if the system is becoming unstable, shut it down and leave it off for at least half an hour to let things cool down.
Regardless of what the problem may be, both the computer and drive should be powered by a stable wall socket before you do anything else. If an external drive only misbehaves when it's powered by USB or Firewire, then you can be pretty sure that its power needs are not being met. That could mean that the drive itself needs more power than it should, which in turn could indicate that the drive motor is failing. It could also indicate problems with the computer's power supply, USB or Firewire ports, or just a software glitch.
Wall power supplies can also fail over time. The easiest way to test this is to substitute a known good power supply. That can be tricky if you don't happen to have a spare. If you do have a spare that was purchased at the same time, it may be hard to know whether or not it is also failing. Substituting power supplies from a different model drive is dangerous: you must make sure that the cables not only match the plug but that the pins and voltages are exactly right. If you have a voltmeter and a steady hand, test the output voltage of each power supply to make sure it matches the values printed on it. A low reading indicates a failing power supply.
Internal drives are unlikely to have power problems unless you've recently moved the power cables around or are experiencing other power issues with the system. If you have a voltmeter and can find the pin voltage specifications, you can test the power leads to each drive.
If you are dealing with an external drive or have recently done work on or near an internal drive, check all the data and power cables that you might have touched or moved. Make sure they are plugged in securely and not obviously damaged. Swap out USB, firewire, or eSATA cables and try other ports if you can. Internal ribbon cables should not be messed with if they are original and the drive has been working up until now: you are more likely to cause damage than to find it.
Don't forget to check your external power supply cables as well. For external drives, try swapping out the power brick if you have another one of the exact same kind. Power supply failures can sometimes mimic cable or other failure modes. For USB or Firewire bus powered drives, switch them to wall power or plug them directly into the computer (not through a hub).
If the drive is having trouble spinning up, you may need to let it cool for an hour or more. Spinning up places the most stress on the drive motor and requires the most power. So your goal is to get it going and keep it going long enough to make a full backup. Disable any energy saving settings that might cause it to spin down. Plan ahead: have a backup plan with enough space ready to go.
Cooling can be critical for a motor failure. If you have or can very quickly obtain software for controlling system fan speed, turn the fans up to maximum. For a laptop or external hard drive, get a cold pack and place it under the drive enclosure or under the location in the laptop where the hard drive is installed. Change it frequently. I once recovered data from a failing laptop by placing moist towels under it while the backup ran and blowing canned air into the vent slots next to the drive. I could tell when the towels needed changing because the backup would slow down and I/O errors would start showing up in the system log!
Obviously you want to be very careful with any water near your system, and avoid it if you can. Also take care that any canned air does not spray liquid into your system, as that can damage components. For a desktop system, note that opening the case can actually make it run hotter, so close the case while the system is running.
Your choice of backup software is critical at this point. As discussed in my article on the subject, some software will abort the entire backup if it encounters any problems reading from the drive. Make sure to use software that can keep going in the face of errors. Save any error logs so that you can go back and try again on any files that were skipped due to errors. Since the clock is ticking toward complete failure, make sure to copy off your most important files first. Then go back and do a full system backup if you can.
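If you have no such software at hand, the "keep going and log the failures" behavior can be approximated in the shell. The following is a rough sketch, not a replacement for proper backup software: the function name rescue_copy and its three arguments are invented for illustration, and it assumes file names do not contain newline characters.

```shell
# Hypothetical helper: copy every regular file under SRC into DEST.
# A file that cannot be read is recorded in LOG and skipped, so one
# bad file never aborts the whole run.
rescue_copy() {
    src=$1
    dest=$2
    log=$3
    # List paths relative to SRC so the same tree can be rebuilt under DEST.
    ( cd "$src" && find . -type f ) | while IFS= read -r f; do
        mkdir -p "$dest/$(dirname "$f")"
        # -p preserves modification times and permissions where possible.
        if ! cp -p "$src/$f" "$dest/$f" 2>/dev/null; then
            printf '%s\n' "$f" >> "$log"
        fi
    done
}
```

Run it first on the folder holding your most important files, then on the whole drive. Afterward the log lists exactly which files were skipped and need another attempt.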
With aggressive cooling and rapid backup action, you should be able to get all your data off before the drive motors fail completely.
Media failure is actually the most common hard drive problem these days. Media failures can be caused by imperfections in the magnetic surface of the drive platters, or simply by the signal fading over time.
The key to recovery is the block reallocation algorithm of your drive's hardware controller. When the drive reads a block, it checks the data against a forward error correction (FEC) checksum. If the data is corrupted, it attempts to use FEC to recover the block. This triggers the drive's reallocation algorithm.
The drive will try many times to access the problem block. If it succeeds, it will quietly mark that block as "bad" and rewrite the data to another location on the disk (called a "spare block"). From that point on, any attempt to access the bad block will instead be redirected to the good one. Thus the bad block is "repaired" and the data saved!
If the drive fails to read the data after some number of tries, it will give up and pass an "I/O" or "media" error back up to the operating system. But that does not mean the data isn't there, just that the signal is weak. Repeated tries may yet acquire the data and allow the block to be reallocated.
The first step to dealing with media failure is to treat it like a head motor failure. Cool the drive and back up what you can as described above. Save the rehabilitation for after you've secured all your other data.
Try to isolate exactly which files are causing the error. If you can find a specific file that always causes a hardware error, try reading it repeatedly. The simplest is to just use the command:
cat filename > /dev/null
You may need to repeat the read attempt ten times or more, but eventually the data should be read successfully. At that point the block reallocation will fix the file and it should cause no more errors, or at least any new errors will occur further along in the file. Monitoring your system log or console output may reveal specific block numbers, allowing you to see your progress.
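Rather than retyping the command, the repeated attempts can be wrapped in a loop. This is a minimal sketch: the function name retry_read, the default of 20 attempts, and the one second pause between attempts are arbitrary choices of mine, not values the drive requires.

```shell
# Hypothetical helper: read FILE again and again until one complete read
# succeeds or MAX attempts (default 20) have been used up.
retry_read() {
    file=$1
    max=${2:-20}
    i=1
    while [ "$i" -le "$max" ]; do
        if cat "$file" > /dev/null 2>&1; then
            echo "read succeeded on attempt $i"
            return 0
        fi
        i=$((i + 1))
        sleep 1   # brief pause before the next attempt
    done
    echo "still failing after $max attempts" >&2
    return 1
}
```

For example, retry_read filename 50 gives a stubborn file up to fifty chances. Keep an eye on the system log while it runs to see whether the failing block numbers are moving forward.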
If a file is particularly stubborn, you may want to simply quarantine it yourself. Create a directory for "BadBlocks" on the same filesystem and move (do not copy) the file into there. Then remove all access permissions. You've effectively marked all the blocks in the file as "bad" and prevented them from being used in the future.
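As a sketch, the quarantine steps might look like this in the shell. The function name quarantine_file is made up, and placing the BadBlocks directory next to the file is my own assumption: it is simply an easy way to stay on the same filesystem, so that the move is a rename and the file's blocks are never read or rewritten.

```shell
# Hypothetical helper: move FILE into a BadBlocks directory beside it
# and strip every permission so nothing touches those blocks again.
quarantine_file() {
    file=$1
    # A sibling directory is normally on the same filesystem as the file.
    qdir=$(dirname "$file")/BadBlocks
    mkdir -p "$qdir" || return 1
    mv "$file" "$qdir/" || return 1
    chmod 000 "$qdir/$(basename "$file")"
}
```

Just remember never to delete the quarantined file, since that would release its blocks for reuse.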
Occasionally, you may experience media failures while writing. This is usually not associated with any particular file. Some writes will fail with media errors, while others may succeed, depending on where the system happens to try to place the data. Recovering from write errors is much like recovering from read errors: repeatedly write data to the drive until there are no more errors. The simplest method is to use a command like:
cat /dev/zero >> zerofile
This will attempt to fill up all the free space on your drive with a file full of null bytes. Keep repeating the process until there are no more media errors, only a disk full error. Note the double '>' character: this appends data to the file so you don't have to start over every time. Once the only error left is the disk full error, delete zerofile to get your free space back.
With enough persistence and patience, you will likely be able to eliminate all the media errors and recover most if not all of your data. You can probably continue using your hard drive; just make sure to follow the prevention measures below.
The steps described above for recovering from failure can also prevent those failures from occurring in the first place.
For long term use of external drives, wall power is usually the best choice. Bus power should only be used as a temporary convenience, such as when traveling with a laptop.
Handle your cables with care: do not over bend them, always grip the plug and not the cable when inserting or removing, and replace any cable that looks worn or frayed. Remember that internal ribbon cables are not designed to be frequently moved, so handle them with extra care and consider replacing them if you've had to move them more than a few times.
Most hardware failures are caused by excessive heat. For desktop towers, you should periodically open them (while off!) and use canned air to blow out accumulated dust. For laptop and compact systems (like iMacs), install fan control software to turn up the cooling fans a little. This is especially important if you play video games, as this tends to stress components to (and sometimes beyond) their capacities. Running Windows under Boot Camp on a Mac seems to cause particularly bad heat problems. If a fan ever fails, have it repaired immediately.
Use It or Lose It
The magnetic media of a hard drive is not permanent. As soon as data is written, it begins to fade. Eventually, the signal will become too weak to read, resulting in media errors even if the physical media is perfect. This can occur in as little as a few months, though usually it takes a few years.
The best way to protect your data from media failures is to occasionally read it, for example by making a backup! That's right, the simple act of reading a file to back it up makes it much less likely that you'll ever need to use that backup.
But beware: if you perform incremental backups, such as with Time Machine, then files which have not changed will not be reread by the backup software. Only a full, fresh backup will get every file on your disk. If there are a few files that you do not normally change or back up, make sure to occasionally read them in some manner so that the drive's reallocation algorithm can find and correct any problems before they become severe. A simple way to read a big file is to use a command like:
cat bigfile > /dev/null
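To exercise a whole folder or drive rather than one file, the same idea can be applied to every file under a directory. This is a sketch only: the function name exercise_tree is invented for illustration, and it assumes file names do not contain newline characters.

```shell
# Hypothetical helper: read every regular file under DIR once, discarding
# the data, and report how many files could not be read.
exercise_tree() {
    dir=$1
    failed=0
    list=$(mktemp) || return 1
    find "$dir" -type f > "$list"
    while IFS= read -r f; do
        if ! cat "$f" > /dev/null 2>&1; then
            echo "could not read: $f" >&2
            failed=$((failed + 1))
        fi
    done < "$list"
    rm -f "$list"
    echo "unreadable files: $failed"
    [ "$failed" -eq 0 ]
}
```

Any file it reports as unreadable is a candidate for the repeated read attempts described earlier.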
If you have hard drives or files that you are storing offline, be sure to exercise them at least twice per year. This includes your backups themselves!!
A hardware failure in a disk drive is a scary thing: your precious data is at risk. Cool it, back up critical files, back up the whole thing, then try to rehabilitate.
And keep making those backups!