I/O & Media Failures

An archive can recover from unanticipated interruptions, such as a power failure or process crash, through the auto-repair feature or by repairing the archive. It can also tolerate limited data loss or corruption if data redundancy was enabled for the archive.

If an action is running exceptionally slowly, or suddenly stops for any of these reasons:

it could indicate a hardware problem with either the data pathway or the storage media. I/O failures of this type are broadly categorized as either transient or permanent failures.

But First…

Start by sending a diagnostic report. We may be able to assist you in identifying the problem or offer advice in diagnosing it. (Plus, there's a teeny tiny chance it could be a bug in QRecall, and we want to know about that too.)

Before beginning the process of diagnosing a hardware failure, consider repairing the volume that contains the archive using macOS's Disk Utility or your favorite volume repair tool. I/O (and many other strange) errors can be caused by a corrupted volume structure. If the volume needs repairing, repair the volume and then repair the archive. This will usually resolve the problem.

Transient Failures

A transient failure corrupts data as it's being read from the storage media, or en route from the storage device to the CPU. Data can become corrupted at any point along this path:

These are classified as transient errors because they are highly unlikely to occur exactly the same way twice. Read the same block of data again, and the second time the data is correct.
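The "read it again" idea can be sketched in a few lines of Python. This is only an illustration and not QRecall's actual code: the function name `read_with_retry`, the use of CRC-32 as the checksum, and the simulated reader are all assumptions made for the example.

```python
import zlib

def read_with_retry(read_block, expected_crc, attempts=3):
    """Call read_block() until the CRC-32 of its result matches expected_crc.

    Returns (data, tries_used). Raises IOError if every attempt fails.
    read_block is any zero-argument callable returning bytes (hypothetical).
    """
    for attempt in range(1, attempts + 1):
        data = read_block()
        if zlib.crc32(data) == expected_crc:
            return data, attempt
    raise IOError(f"checksum mismatch after {attempts} attempts")

# Simulate a transient error: the first read is corrupted, the second is clean.
good = b"archive record"
responses = iter([b"archive recorc", good])
data, tries = read_with_retry(lambda: next(responses), zlib.crc32(good))
```

Here the second read succeeds, which is the signature of a transient error. If every attempt failed, the data on the media itself would be suspect.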

Built-in Retry

Transient errors can also be corrected by data redundancy. After detecting a corrupted record, QRecall will try to use the correction code blocks to reconstruct the original data. If successful, a warning will be logged and the action will continue with the corrected data.
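To illustrate how redundant data lets a damaged record be rebuilt, here is a minimal Python sketch using simple XOR parity. This is a stand-in technique chosen for brevity; the actual format of QRecall's correction code blocks is not described here, and the function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """Compute one parity block that protects all of the data blocks."""
    return xor_blocks(data_blocks)

def reconstruct(data_blocks, parity, bad_index):
    """Rebuild the block at bad_index from the surviving blocks plus parity."""
    survivors = [b for i, b in enumerate(data_blocks) if i != bad_index]
    return xor_blocks(survivors + [parity])

# Three data blocks and their parity, as written to storage.
blocks = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(blocks)

# Block 1 is later read back corrupted; rebuild it from the others.
damaged = [blocks[0], b"\x00\x00\x00\x00", blocks[2]]
recovered = reconstruct(damaged, parity, bad_index=1)
```

A single parity block can only repair one damaged block per group, which mirrors the "limited data loss" caveat: redundancy tolerates some corruption, not an unlimited amount.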

When and how is data corrected?

Diagnosing Transient Failures

One way to hunt for transient errors is to verify the archive multiple times. If the error reported by the verify doesn't consistently occur every time, the problem is transient.

For example, a capture reports a checksum failure at file offset 123,512,345,680. Verifying the archive reports a checksum failure at file offset 160,162,733,274, and running the verify again reports no problems at all.

This pattern indicates transient data errors. The data in the archive is recorded correctly, but perhaps one in a million read operations returns corrupted data. Culprits to consider are RAM (rare), the bus (most likely), and drive controllers or enclosures (sometimes).

Media Failures

Permanent media failures are characterized by a region of the storage device that will not reliably store data. The data may be initially written correctly and become corrupted over time, or may have been mis-written to begin with.

Diagnosing Permanent Media Failures

In many respects, these are the easiest to diagnose. An action, such as a verify, will repeatedly return the same data error (or log the same data correction warning) every time it is run. For example, if three verify operations in a row all report a bad checksum at file offset 123,512,345,680, you have a permanent data failure at that location.
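The comparison across repeated runs can be written down as a small Python sketch. It assumes you have copied the failing file offsets from the log by hand, one set per run; `classify_failures` is a hypothetical helper, not part of QRecall.

```python
def classify_failures(runs):
    """Classify failing offsets seen across repeated runs of an action.

    runs: list of sets of failing file offsets, one set per run.
    Returns {offset: "permanent" or "transient"}: an offset is "permanent"
    only if it failed in every run.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    all_offsets = set().union(*runs)
    return {
        offset: "permanent" if all(offset in run for run in runs) else "transient"
        for offset in sorted(all_offsets)
    }

# Errors that move around or vanish between runs: transient.
moving = classify_failures([{123_512_345_680}, {160_162_733_274}, set()])

# The same offset failing three runs in a row: permanent.
fixed = classify_failures([{123_512_345_680}] * 3)
```

With these inputs, both offsets in `moving` are classified as transient, and the single offset in `fixed` as permanent. A "permanent" result only means the error repeats at that location; it says nothing about whether the media is failing or the data was already corrupted when it was written.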

Dealing with Permanent Failures

It may be possible to correct the problem simply by repairing the archive. During the repair, all correctable data will be rewritten. Uncorrectable records will be erased and overwritten.

Modern magnetic hard drives employ a technique called sparing that will automatically detect a defective sector of the media and relocate the data to a different (presumably good) area of the drive. In other words, simply rewriting the data can permanently fix it.

If the repair is successful, verify the archive. If the verify is successful too, then the archive and drive are probably sound.

If the problem can't be fixed by overwriting the data, recover the archive to a new drive.

Recovering an archive on failing or read-only media

Transient Failures that Cause Permanent Failures

As if things weren't complicated enough, keep in mind that transient failures can act like permanent failures if they occur while writing to the storage device.

A perfectly good block of data, if corrupted by flaky RAM or a transmission error on its way to the drive, will be written to the archive in its corrupted form. The drive media is sound, but the data will appear to be corrupted every time it is read.

Data redundancy is the best defense against transient write errors. When data redundancy is turned on, new data is (literally) written more than once. When the corrupted data is read back, error correction will fix it.

Repairing the archive will permanently fix the problem, at least until it occurs again. One way to track down the source is to test the same drive on different systems, or via different connections (USB instead of eSATA, Ethernet instead of WiFi, and so on), to determine whether the errors come from the drive or from the system it's connected to.

Living with transient failures