Using parity files to protect data on optical media


blr_p

Looking to see how much interest there is in this topic.

The idea is that every time you save data to optical discs, you also create parity data, say 10-20% of the total disc data, and store it elsewhere. If a disc ever develops reading problems in the future then, depending on how much redundancy was chosen, you can reconstitute the missing files.

It involves some extra work and requires a reasonably fast CPU to make it practical, but I think this is a pretty good way to protect data that might be impossible to obtain again in the future.

To give a simple example, people say make a backup copy on yet another disc to be safe: burn two copies. But if the first disc develops read errors then you are hoping that the second one will be good. What if it's not? Then you are sunk. Whereas if you made parity data at the time of writing, you can lose up to the chosen redundancy percentage of the disc and still reconstitute it.

The idea is simple; think of it this way.

1+2+3+4+5=15

If you lose any one number on the left-hand side but know the total, then you can figure out the missing number.
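A minimal sketch of that idea in Python (a toy sum-based parity, purely for illustration):

# Store the numbers plus their total as "parity".
data = [1, 2, 3, 4, 5]
parity = sum(data)                      # 15

# Later, one value goes missing (None marks the lost one).
damaged = [1, 2, None, 4, 5]

# The missing value is the total minus whatever survived.
missing = parity - sum(x for x in damaged if x is not None)
print(missing)                          # -> 3

PAR2 does the same kind of thing, just with Reed-Solomon maths over thousands of blocks instead of a single sum.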
 
I've been playing with this for weeks and wanted to have a discussion if anybody is interested.

It takes some effort, but if your data is valuable then it's worth it, IMO.

I used to be crazy about quality scans. They have their place, but I recently did a scan that got a good score, then did a read error test and found some red spots. What?!

See, what matters is whether you can get your data off or not. You might have a very pretty scan, but if the data does not come off then you've had it. The question then becomes: what are the repair options? If the dye has problems then forget it. You can try re-reading your sectors a million times, killing your drive in the process, and still not get anything off.
 
Rather than going with this, it might be better to make 2 or more copies of the same disc and store them safely and separately. Did the same for my cousin, rather than giving a single disc and finding out later that it might not work.

That assumes the backup discs will not develop errors themselves, which is the weakness pointed out earlier. It's better to have parity data with sufficient redundancy so any read errors can be fixed.

What if I told you I do not want to lose anything from the discs I have already written? Do you expect me to make a duplicate copy of every single one? And that still isn't as good as parity, is it?


Right, QuickPar is the app I wanted to start with. It uses the PAR2 standard, which is supported by other apps. The app itself isn't maintained any more, as the author has moved on, but the standard remains.

There is a lot of flexibility, but one needs to understand the rationale behind the various options before being able to make proper use of it. You need to understand what you are doing before repeating the process on loads of discs, so that you have a reasonable chance of error recovery. This in itself is a tradeoff of time vs. recoverability and is up to the user.

Parity compute time, as well as repair time, is a function of the preferred block size, the number of source blocks, the number of recovery blocks and lastly the redundancy. If you have a fast machine the compute time is less than with a slower machine, and this is where compromises have to be made.

It would be good if people download the QuickPar app and start with a simple text file to get a feel for it. It takes only a few seconds to generate parity files, so it is easy to do lots of tests.

Make a small text file with a line of text like '12345678'; the CR+LF on Windows adds 2 more bytes, making it 10 bytes per line, whereas *nix adds only an LF, or one byte.

Then copy and paste it until the file size is 10 kB (10k characters), or 100 kB (100k characters) if you want. If you arrange them in lines of 10 or 100 characters then it is easy to manipulate the file and know how much is missing and needs to be corrected.
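A minimal sketch of generating such a test file in Python (the filename is just an example):

# 1000 lines of '12345678' + CR+LF = 10 bytes per line, about 10 kB in total.
with open("partest.txt", "w", newline="\r\n") as f:
    for _ in range(1000):
        f.write("12345678\n")      # newline="\r\n" turns this into CR+LF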

The next thing to do is set the preferred block size in the options. This is the size of the pieces the file will be virtually partitioned into in order to create parity data.

Then there are the number of source blocks & recovery blocks to choose and finally the redundancy (in percent).

Say we have a 10 kB file and choose 100 bytes as the preferred block size. That means the 10 kB file will consist of 100 source blocks. Out of those 100 source blocks, if we choose a redundancy of 10% there will be 10 recovery blocks.

If you lose 10% of the file (edit it so that 1 kB is missing or has been modified) then you can expect to be able to repair it. The loss can be a contiguous 1 kB chunk, or smaller errors scattered through the file, so long as they touch no more than 10 source blocks. If more than that is damaged then you will not be able to repair the file.
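A rough sketch of that arithmetic in Python (illustration only, not tied to any particular tool):

import math

file_size  = 10 * 1000     # 10 kB test file
block_size = 100           # preferred block size in bytes
redundancy = 0.10          # 10%

source_blocks   = math.ceil(file_size / block_size)        # 100
recovery_blocks = math.ceil(source_blocks * redundancy)    # 10

# Each recovery block can stand in for one damaged source block,
# so the worst case you can still repair is:
max_damaged_blocks  = recovery_blocks                      # 10
max_contiguous_loss = max_damaged_blocks * block_size      # ~1 kB
print(source_blocks, recovery_blocks, max_contiguous_loss)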

Experiment with block sizes smaller and larger than 100 bytes and see how well the data can be recovered. The bigger the block size chosen, the harder it is to recover from small errors but the easier it is to recover from bigger ones. The smaller the block size, the longer it takes to compute parity and to repair.

The problem with optical media is that you don't know where the errors will occur or how large or small they will be. All you know is the smallest block of data a sector holds. So choosing the preferred block size is easy, but the number of source and recovery blocks still needs to be determined. This requires tradeoffs among the above 3 parameters.

So one needs to get a feel for choosing the right number of recovery blocks, which in turn is a function of the number of source blocks, which in turn is a function of the preferred block size.

I'll leave it at that.

There is another app called dvdisaster that does the same thing and is much simpler to use, but there is no flexibility in choosing the number of recovery blocks, and that to me is a weakness because you have no idea how much damage you can recover from. The only thing it allows the user to configure is the redundancy, i.e. 14.3% for normal, 33.5% for high, and a custom value you can set yourself. Another disadvantage is that it only works with data discs (CD or DVD), not other types like VCD or S-VCD.

However, the approach of using ISO images is the one I prefer; it adds a little more protection. I will get into disc image creation later.
 
It's a very good idea but it will require a lot of effort. As you might have considered, it will require special software for reading and writing the disc and, as you rightly pointed out, a fast processor, all depending upon the scheme used for detecting and correcting.

#lll_aritra_lll CRC can only detect errors, it CANNOT correct them, so practically it's of no use in this situation.

#blr_p look into Error correction codes

Hamming codes are the simplest of the lot.

Turbo codes are the fastest, while Viterbi decoding is widely used, mainly in the telecom field (CDMA); other popular ones are the BCH and RS (Reed-Solomon) coding schemes.
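To make the idea concrete, here is a minimal Hamming(7,4) sketch in Python (just an illustration; it corrects a single flipped bit per 7-bit word):

# Hamming(7,4): 4 data bits plus 3 parity bits, corrects any single bit flip.
# Bit positions are 1..7; parity bits sit at positions 1, 2 and 4.
def encode(d):                      # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(c):                     # c = list of 7 received bits
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3      # syndrome = 1-based error position, 0 = clean
    if pos:
        c[pos - 1] ^= 1             # flip the bad bit back
    return [c[2], c[4], c[5], c[6]] # the recovered data bits

word = encode([1, 0, 1, 1])
word[5] ^= 1                        # simulate a single-bit read error
print(correct(word))                # -> [1, 0, 1, 1]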
 
It's a very good idea but it will require a lot of effort. As you might have considered, it will require special software for reading and writing the disc and, as you rightly pointed out, a fast processor, all depending upon the scheme used for detecting and correcting.

Just 2 apps for imaging, depending on the kind of disc. It takes as long to make an ISO as it takes to copy off all the data from the disc, about 5-10 minutes for a DVD. The added advantage of this is that you get to retest your old discs and see how well they hold up. If there are problems, maybe it's easier to replace something you got a year or two ago than five years ago.

Parity generation can be tuned via the 3 parameters mentioned earlier; once you find the right mix, which was the purpose of the earlier exercise, you just replicate it for as many discs as you have. It's a gamble but it's still better than nothing.

#blr_p look into Error correction codes

Hamming codes are the simplest of the lot.

Turbo codes are the fastest, while Viterbi decoding is widely used, mainly in the telecom field (CDMA); other popular ones are the BCH and RS (Reed-Solomon) coding schemes.

Do you know of any readily available apps that employ these coding schemes and could be used for this purpose? They would need to work on one image file per disc.

Bear in mind the amount of data, i.e. 4.7 GB for a DVD; I'd think these codes work on much smaller chunks of data and would need to be customised to handle a larger dataset. Also, these codes assume retransmits over a communications medium, so they are very fast and work on small chunks, whereas a disc write is a one-time process with no chance of a retransmit.

I've found a couple that use other algorithms, but they are proprietary and Windows-only, which means you need to use whatever app is provided; additionally you would need to do extensive testing to see how robust they are. One of them is optimised for encoding but takes very long to do a repair.

QuickPar uses Reed-Solomon and has been widely tested; the problem of course is that PAR2, even with optimised algorithms, takes more time. I use another app that is more flexible than QuickPar, but QuickPar is an easier app for a beginner to learn with.

To give an idea, a DVD ISO image with 20% redundancy and 1,300 recovery blocks takes about 3 hours on my PC (a ThinkPad T43, quite an old model). Conceivably that would be much faster on a more modern machine or one with multiple cores. With scripting, multiple jobs can be queued up. QuickPar came out in 2003, when machines were much slower, forcing people to use very low redundancy (2-5%) as well as few recovery blocks, but today I think the procedure is feasible.
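For the scripting side, a minimal sketch using the command-line par2cmdline tool (a different program from QuickPar, but it writes the same PAR2 format; the folder path is just an example, and -b6500 is chosen so that 20% redundancy gives roughly 1,300 recovery blocks):

import glob
import subprocess

# Queue up one PAR2 job per ISO image: 6500 source blocks, 20% redundancy.
for iso in sorted(glob.glob("/data/images/*.iso")):
    subprocess.run(
        ["par2", "create", "-b6500", "-r20", iso + ".par2", iso],
        check=True,
    )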

At the end of the day it's down to how valuable you consider your data to be. For me, if I save anything then that means it's valuable. HDDs are very good for this and are much more reliable than optical media, but optical media is the cheapest option and the above procedure is intended to make it more robust.
 
#blr_p, nicely explained.

A DVD disc has a block size of 2048 bytes (2 kB) or 2352 bytes? So if we choose that for QuickPar and keep 10% recovery, then we would have about 450 MB of recovery data for a 4.5 GB disc, right?
 
Processing the parity information and recovering data is very taxing. In the case of RAID on HDDs, the RAID controller does it, but in this case it has to be done in software, i.e. on the CPU, which will be very CPU-intensive.

Also, in the case of an HDD, the recovered data is immediately written to a new HDD while syncing, but in this case it has to be kept around ... like recovering to an ISO. Then why not store the ISO on the HDD itself and relieve the system of all the hefty processing? Also, if there is any slight read error on the parity info disc, everything is lost if you try to recover a damaged disc.

And you would need two DVD drives to perform this, or mount the parity disc in some virtual drive.

At a time like this, when HDD prices are hitting the roof, the idea makes sense, but when HDD prices fall, well, HDDs are preferred all the way.
 
#blr_p, nicely explained.

A DVD disc has a block size of 2048 bytes (2 kB) or 2352 bytes? So if we choose that for QuickPar and keep 10% recovery, then we would have about 450 MB of recovery data for a 4.5 GB disc, right?

The block size depends on how the image is created. If it's only user data then the block size will be 2048, but if it's a RAW image then it will be 2352. A raw DVD image is not much use, so 2048, which applies to user data only, is what will be used.

The situation is more complicated with CDs. If it's a data disc, then again it's 2048, but otherwise it depends on the type. Audio CDs do not carry the extra sector-level error correction data that data discs have, so it's 2352 bytes per sector.

Then there is CD-ROM XA, where the user data per sector is 2048 or 2324 bytes depending on the form (this is what VCDs and S-VCDs use).

None of this complexity exists with DVD; it's just 2048.

It's important to get the right block size as that influences efficiency. The idea is that if the source blocks align with the sectors extracted from the disc, then the smallest number of recovery blocks will be needed to repair the data. Otherwise it will require more recovery blocks, and this in turn will influence the chances of recovery; you might get lucky or not.
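A rough sketch of the numbers behind the question above (illustration only; PAR2 tops out at 32,768 source blocks, so for a DVD image you use a block size that is a whole multiple of the 2048-byte sector, and the 750-sectors-per-block figure here is just an example that keeps the block count in the few-thousand range):

import math

disc_size   = 4_500_000_000            # ~4.5 GB of user data on the disc
sector_size = 2048                     # user data per DVD sector
redundancy  = 0.10                     # 10%

block_size      = 750 * sector_size                        # 1,536,000 bytes
source_blocks   = math.ceil(disc_size / block_size)        # 2930
recovery_blocks = math.ceil(source_blocks * redundancy)    # 293
recovery_bytes  = recovery_blocks * block_size             # ~450 MB
print(source_blocks, recovery_blocks, recovery_bytes / 1e6, "MB")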

This is why the basics have to be understood properly at the time of generating parity.

Processing the parity information and recovering data is very taxing. In the case of RAID on HDDs, the RAID controller does it, but in this case it has to be done in software, i.e. on the CPU, which will be very CPU-intensive.

Best is to try it and see how long it takes; you can tweak the QuickPar parameters.

Also, in the case of an HDD, the recovered data is immediately written to a new HDD while syncing, but in this case it has to be kept around ... like recovering to an ISO. Then why not store the ISO on the HDD itself and relieve the system of all the hefty processing?

Storing the ISO on the hard drive is fine, but it takes up more space and does not include any parity data, so if you get HDD read errors you will again lose data, though that is much less likely than with optical media. Better would be to store just the parity info on the HDD and leave the optical disc as it is.

Also, if there is any slight read error on the parity info disc, everything is lost if you try to recover a damaged disc.

This is a good point, but no.

With PAR2, even if there are read errors in the parity info, only the recovery block in which the error occurs is affected; the rest is good to go. Checksums are stored for each and every recovery block, so if one recovery block has problems then the others will be used.

This is why more recovery blocks is good: fewer of them will be affected and the chances of recovery stay good. But generating more recovery blocks takes more time than generating fewer. Say, for example, you used just one recovery block and got a read error; only in that case would there be a total loss of the parity info. You never want to have just one recovery block.
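A conceptual sketch of why that works (this is not the actual PAR2 on-disk format, just the idea of per-block checksums letting you skip damaged recovery blocks):

import hashlib

def usable_recovery_blocks(blocks):
    """blocks = [(block_data, stored_md5_hexdigest), ...]
    Keep only the recovery blocks whose stored checksum still matches."""
    return [d for d, md5 in blocks if hashlib.md5(d).hexdigest() == md5]

good_block = b"recovery data"
bad_block  = b"recovery dXta"                      # corrupted on read-back
stored     = hashlib.md5(good_block).hexdigest()   # checksum written at creation

print(len(usable_recovery_blocks([(good_block, stored), (bad_block, stored)])))  # -> 1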

This is another reason to use PAR2: the specs are published and there is lots of discussion around using it, compared to other, more proprietary algorithms.

Typically 300 recovery blocks for 3000 source blocks with 10% redundancy is a reasonable amount. But what is reasonable depends on how fast your CPU is, how much risk you are willing to take, and how many discs you have to do. These choices will be influenced by your personal experience with data loss on optical media: how much did you lose on a disc in the past? This is a personal choice; there is no one size fits all. That is why I suggested playing with the text file before moving to disc images, as it is much faster to learn that way.

You can even try to simulate loss with a CD-R and a marker, colouring over part of the disc to see how much can be recovered, though leave this for later when we discuss imaging apps. You can always clean the disc with some Brasso and reuse it.

Right now I use 15 or 20% redundancy.

20% means do five discs and store the total parity on another disc.

15% means do six discs and store the total parity on another disc.
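The arithmetic behind those groupings, as a quick sketch (assuming roughly 4.38 GiB of data per single-layer DVD):

disc_capacity = 4.38        # GiB of data per single-layer DVD, approximately

for redundancy, discs in [(0.20, 5), (0.15, 6)]:
    parity_total = redundancy * discs * disc_capacity
    print(f"{int(redundancy * 100)}% x {discs} discs -> "
          f"{parity_total:.2f} GiB of parity, fits on one extra disc")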

And you would need two DVD drives to perform this, or mount the parity disc in some virtual drive.

No, just one drive is necessary.

You would need an app that can read the disc sector by sector, with multiple re-reads of a sector if necessary; if it still fails, it inserts zeroes and moves on. Once the image is created on your HDD, you put the parity files in the same folder as the image and run the repair. Depending on the damage it will take some time, and once done you re-burn the fixed image. You have repaired your disc.
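A minimal sketch of that sector-by-sector read with retries and zero-fill (assuming Linux and a drive at /dev/sr0; dedicated tools such as ddrescue do this job far better, this is only to show the idea):

SECTOR  = 2048
RETRIES = 5

def image_disc(device="/dev/sr0", out="disc.iso"):
    with open(device, "rb", buffering=0) as src, open(out, "wb") as dst:
        while True:
            data = None
            for _ in range(RETRIES):
                try:
                    src.seek(dst.tell())          # re-seek before every retry
                    data = src.read(SECTOR)
                    break
                except OSError:                   # unreadable sector, retry
                    data = None
            if data is None:
                dst.write(b"\x00" * SECTOR)       # give up: pad with zeroes
                continue
            if not data:                          # end of the disc
                break
            dst.write(data)

image_disc()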


At a time like this, when HDD prices are hitting the roof, the idea makes sense, but when HDD prices fall, well, HDDs are preferred all the way.

Sure, but I don't want all my data on HDDs. Things like movies I watch only once, so why do I need to store them on an HDD? I prefer data that is frequently accessed to be on the HDD, whilst data that is used only once in a while can easily be stored on optical media, provided there is adequate protection against data loss.

rareravi said:
Nero SecurDisc uses the same concept, I think.

http://www.securdisc.net/eng/how-to-secure.html

How do I retrieve data from damaged discs?

I want to be able to retrieve my files if a disc is accidentally damaged.

After you've copied all your files onto a disc, SecurDisc uses the empty space to add redundant and checksum data. This significantly increases the chances of your files being retrieved, even if the disc itself is damaged.

Dvdisaster can do the same thing too. I prefer not to store the parity info for a disc on the same disc itself, but rather on another disc as well as on the HDD. Secondly, I only write discs to 90-95% capacity and rarely up to 100%, because IME errors start from the outer edge and move inwards. There can always be exceptions, but this is the general case. With parity data I don't care, as I can recover the disc with reasonable confidence. 20% redundancy means you can lose nearly 700 MB on the disc and still recover it, and with many recovery blocks it does not matter if the sector errors are small and scattered all over the place.

Also, what are the parameters for this recovery? All I can see is choosing the level of redundancy, which means it's just like dvdisaster. There is no idea of recovery blocks, source blocks or anything, because there is no published spec and a proprietary algorithm is used. It will be faster, but you have no idea just how much damage, or what kind, you can recover from. Somebody else already made that decision for you.
 
SecurDisc Solution: Data Reliability

After you've copied all your files onto a disc, SecurDisc uses the empty space to add redundant and checksum data. This significantly increases the chances of your files being retrieved, even if the disc itself is damaged.

I have never tried SecurDisc, and it is probably a bad idea to store the parity data on the same media.
 