An archive can recover from unanticipated interruptions, such as a power failure or process crash, through the auto-repair feature or by repairing the archive. It can also tolerate limited data loss or corruption if data redundancy was enabled for the archive.
If an action is running exceptionally slowly, or suddenly stops for any of these reasons:
it could indicate a hardware problem with either the data pathway or the storage media. I/O failures of this type are broadly categorized as either transient or permanent failures.
Start by sending a diagnostic report. We may be able to assist you in identifying the problem or offer advice in diagnosing it. (Plus, there's a teeny tiny chance it could be a bug in QRecall, and we want to know about that too.)
Before beginning the process of diagnosing a hardware failure, consider repairing the volume that contains the archive using macOS's Disk Utility or your favorite volume repair tool. I/O (and many other strange) errors can be caused by a corrupted volume structure. If the volume needs repairing, repair the volume and then repair the archive. This will usually resolve the problem.
A transient failure corrupts data as it's being read from the storage media, or en route from the storage device to the CPU. Data can become corrupted at any point along this path:
These are classified as transient errors because they are highly unlikely to occur exactly the same way twice. Read the same block of data again, and the second time the data is correct.
QRecall uses a retry mechanism that will often recover from a transient error. If any archive record contains corrupted data, QRecall immediately flushes the RAM buffers and asks the operating system to re-read the record from the physical media†.
If this second read contains the correct data, a "Transient read failure" message will be recorded in the log and the action will continue.
† Requesting a re-read of physical media is not universally supported. Networked volumes, NAS devices, and so on may ignore this request and return a cached copy of the bad data. Try restarting the server to verify transient read problems.
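The read, check, and re-read pattern described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not QRecall's implementation: the function name `read_with_retry`, the `read_record` callback, the `ChecksumError` exception, and the use of CRC-32 as the checksum are all assumptions made for the example.

```python
import zlib
from typing import Callable


class ChecksumError(Exception):
    """Raised when a record still fails its checksum after every attempt."""


def read_with_retry(
    read_record: Callable[[], bytes],   # hypothetical: fetches the record from the media
    expected_crc: int,                  # checksum stored when the record was written
    max_attempts: int = 2,              # one initial read plus one re-read
    log: Callable[[str], None] = print,
) -> bytes:
    for attempt in range(1, max_attempts + 1):
        data = read_record()
        if zlib.crc32(data) == expected_crc:
            if attempt > 1:
                # An earlier read was bad but this one is good: a transient error.
                log("Transient read failure")
            return data
    # Every attempt returned bad data, so the fault is not a one-off read glitch.
    raise ChecksumError("record failed its checksum on every attempt")
```

A caller would pass a `read_record` function that bypasses any cache it can. As the footnote explains, a network volume that keeps returning its cached copy makes every attempt fail in the same way, which looks identical to a permanent error.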
Transient errors can also be corrected by data redundancy. After detecting a corrupted record, QRecall will try to use the correction code blocks to reconstruct the original data. If successful, a warning will be logged and the action will continue with the corrected data.
When error correction occurs during regular operation, the problem is logged but the archive is not modified. If it was truly a transient error, it's unlikely to happen again. If the corrupted data is permanently stored on the media, the same correction will occur every time the data is read, until the archive is repaired.
The only action that attempts to permanently correct invalid data by rewriting it is the repair command.
One way to hunt for transient errors is to verify the archive multiple times. If the error reported by the verify doesn't consistently occur every time, the problem is transient.
For example, a capture reports a checksum failure at file offset 123,512,345,680. Verifying the archive reports a checksum failure at file offset 160,162,733,274, and running the verify again reports no problems at all.
This would definitely indicate transient data errors. The data in the archive is recorded correctly, but the occasional (one-in-a-million) read operation returns corrupted data. Culprits to consider would be RAM (rare), the data bus (most likely), and drive controllers or enclosures (sometimes).
Permanent media failures are characterized by a region of the storage device that will not reliably store data. The data may be initially written correctly and become corrupted over time, or may have been mis-written to begin with.
In many respects, these are the easiest to diagnose. An action, such as a verify, will repeatedly return the same data error (or log the same data correction warning) every time it is run. For example, if three verify operations in a row all report a bad checksum at file offset 123,512,345,680, you have a permanent data failure at that location.
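If you note the failing file offsets from each verify run, the comparison can be done mechanically. The sketch below assumes you have copied the offsets out of the log by hand; the function name `classify_failures` and the input format (one list of offsets per verify run) are invented for this example and are not part of QRecall.

```python
from collections import Counter
from typing import Dict, Iterable, List


def classify_failures(runs: List[Iterable[int]]) -> Dict[int, str]:
    """Label each failing offset as 'permanent' or 'transient'.

    An offset that fails in every run is treated as permanent; one that
    fails in only some runs is treated as transient.
    """
    if len(runs) < 2:
        raise ValueError("at least two verify runs are needed to compare")
    # Count the number of runs in which each offset appeared (once per run).
    counts = Counter(offset for run in runs for offset in set(run))
    return {
        offset: "permanent" if seen == len(runs) else "transient"
        for offset, seen in counts.items()
    }
```

With the transient example from earlier (failures at 123,512,345,680 and 160,162,733,274 in separate runs, then a clean run), `classify_failures([[123512345680], [160162733274], []])` labels both offsets as transient. Three runs that all report 123,512,345,680 label that offset as permanent.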
It may be possible to correct the problem simply by repairing the archive. During the repair, all correctable data will be rewritten. Uncorrectable records will be erased and overwritten.
Modern magnetic hard drives employ a technique called sparing that will automatically detect a defective sector of the media and relocate the data to a different (presumably good) area of the drive. In other words, simply rewriting the data can permanently fix it.
If the repair is successful, verify the archive. If the verify is successful too, then the archive and drive are probably sound.
If the problem can't be fixed by overwriting the data, recover the archive to a new drive.
On a failing device, attempting to overwrite the existing archive could cause more damage.
The repair command's recovery option is designed just for this situation; it doesn't write any new data to the existing volume, it just reads what it can and uses that to construct a new archive someplace else. The recovery option is also suitable for use on archives on read-only media (like a DVD) or drives that have been installed into a read-only (forensic) drive enclosure.
To recover an archive on suspect media, follow these steps:
Most magnetic drives have a built-in mechanism that will retry failed I/O operations in the hope that they will succeed, and sometimes it works! These attempts are typically accompanied by a head recalibration procedure, which you can hear as a faint "chirping" noise.
These retry attempts are often very time consuming—more than a hundred times slower than normal I/O—so be patient during the recovery process.
As if things weren't complicated enough, keep in mind that transient failures can act like permanent failures if they occur while writing to the storage device.
A perfectly good block of data, if corrupted by flaky RAM or a transmission error on its way to the drive, will be written to the archive in its corrupted form. The drive media is sound, but the data will appear to be corrupted every time it is read.
Data redundancy is the best defense against transient write errors. When data redundancy is turned on, new data is (literally) written more than once. When the corrupted data is read back, error correction will fix it.
Repairing the archive will permanently fix the problem, at least until it occurs again. One way to diagnose this is to test the same drive on different systems, or via different connections (USB instead of eSATA, Ethernet instead of Wi-Fi, and so on), in order to determine whether the source of the errors is the drive or the system it's connected to.
Some transient failures are just a fact of life.
Wi-Fi networks have notoriously higher error rates than hard-wired data buses. If you capture over a Wi-Fi network long enough, you'll eventually encounter a transient data error.
It's inevitable. Just repair the archive and move on.