Any way to reduce the size of texts?
Please note: neither base64 nor encryption was designed to reduce string length. What you should be looking at is compression; I think you should look at gzcompress and gzdeflate.
Example using the decoded version of your text:
$original = "In other cases, some countries have gradually learned to produce the same products and services that previously only the U.S. and a few other countries could produce. Real income growth in the U.S. has slowed.";

$base64     = base64_encode($original);
$compressed = base64_encode(gzcompress($original, 9));
$deflate    = base64_encode(gzdeflate($original, 9));
$encode     = base64_encode(gzencode($original, 9)); // gzip format, computed for comparison

$base64Length     = strlen($base64);
$compressedLength = strlen($compressed);
$deflateLength    = strlen($deflate);
$encodeLength     = strlen($encode);

echo "<pre>";
echo "Using GZ Compress = ", 100 - number_format(($compressedLength / $base64Length) * 100, 2), "% Improvement", PHP_EOL;
echo "Using Deflate = ", 100 - number_format(($deflateLength / $base64Length) * 100, 2), "% Improvement", PHP_EOL;
echo "</pre>";
Output
Using GZ Compress = 32.86% Improvement
Using Deflate = 35.71% Improvement
Base64 is not compression or encryption, it is encoding. You can pass text data through the gzip compression algorithm (http://php.net/manual/en/function.gzcompress.php) before you store it in the database, but that will basically make the data unsearchable via MySQL queries.
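For illustration only, here is a minimal sketch of that compress-on-write / decompress-on-read pattern with PDO; the articles table, its columns, and the connection details are assumptions made up for this example, not part of your schema:
<?php
// Assumes a hypothetical table: articles (id INT AUTO_INCREMENT PRIMARY KEY, body BLOB).
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'password');

$original = "Some long text you want to store...";

// Compress in PHP before the INSERT and store the raw bytes in a BLOB column.
$insert = $pdo->prepare("INSERT INTO articles (body) VALUES (?)");
$insert->bindValue(1, gzcompress($original, 9), PDO::PARAM_LOB);
$insert->execute();
$id = $pdo->lastInsertId();

// Read it back and decompress in PHP; MySQL itself cannot search inside this column.
$select = $pdo->prepare("SELECT body FROM articles WHERE id = ?");
$select->execute([$id]);
$text = gzuncompress($select->fetchColumn());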
Okay, it's really challenging! (at least for me) ... You have 10 TB of text and you want to load it into your MySQL database and perform a full-text search on the tables!
Maybe some clustering or some performance tricks on good hardware will work for you, but if that's not the case, you may find this approach interesting.
First, you need a script that loads these 50 billion pieces of text one after another, splits each one into words, and treats those words as keywords; that means giving each word a numeric id and then saving it in a table. For example, I am piece of large text.
would become something like this:
[1: piece][2: large][3: text]
and I'm the next large part!
would be:
[4: next][2: large][5: part]
By the way, the words I, am, of, I'm, the plus the punctuation marks . and ! have been eliminated because they usually mean nothing in a keyword-based search. However, you can also keep them in your keywords array if you wish.
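To make that step concrete, here is a rough sketch of the splitting and id assignment; the stop-word list and the helper function are my own assumptions, so adapt them to your data:
<?php
// Words (and punctuation) you want to ignore; extend as needed.
$stopWords = ['i', 'am', 'of', "i'm", 'the'];

function extractKeywords(string $text, array $stopWords): array {
    // Lowercase, strip punctuation (keep apostrophes), split on whitespace.
    $clean = preg_replace('/[^a-z0-9\' ]+/', ' ', strtolower($text));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    // Drop stop words and duplicates.
    return array_values(array_unique(array_diff($words, $stopWords)));
}

// Assign each keyword a numeric id, reusing the id when a keyword was seen before.
$keywordIds = [];   // keyword => id
$nextId = 1;
foreach (["I am piece of large text.", "I'm the next large part!"] as $text) {
    foreach (extractKeywords($text, $stopWords) as $word) {
        if (!isset($keywordIds[$word])) {
            $keywordIds[$word] = $nextId++;
        }
    }
}
print_r($keywordIds); // piece => 1, large => 2, text => 3, next => 4, part => 5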
Give the original text a unique id. You can calculate the md5 of the original text or simply assign it a numeric id. Then store this id somewhere.
You will need a table to keep the relationships between texts and keywords. It would be a many-to-many structure like this:
[text_id][text]
1 -> I am piece of large text.
2 -> I'm the next large part!
[keyword_id][keyword]
1 -> piece
2 -> large
3 -> text
4 -> next
5 -> part
[keyword_id][text_id]
1 -> 1
2 -> 1
3 -> 1
4 -> 2
2 -> 2
5 -> 2
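If it helps, this is roughly what those three tables could look like in MySQL, created here from PHP with PDO; every table and column name is just a placeholder for whatever you use:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=search;charset=utf8mb4', 'user', 'password');

$pdo->exec("CREATE TABLE IF NOT EXISTS texts (
    text_id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    text    LONGTEXT NOT NULL
)");

$pdo->exec("CREATE TABLE IF NOT EXISTS keywords (
    keyword_id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    keyword    VARCHAR(255) NOT NULL UNIQUE
)");

// The many-to-many link; the extra key lets you also look up keywords by text.
$pdo->exec("CREATE TABLE IF NOT EXISTS keyword_text (
    keyword_id BIGINT UNSIGNED NOT NULL,
    text_id    BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (keyword_id, text_id),
    KEY idx_text (text_id)
)");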
Now, imagine how much easier it would be (especially for MySQL!) when somebody searches for large text!
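As a sketch of what that lookup could look like against the (hypothetical) tables above, here is one way to match only texts that contain all of the searched keywords:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=search;charset=utf8mb4', 'user', 'password');

$terms = ['large', 'text'];
$placeholders = implode(',', array_fill(0, count($terms), '?'));

// Join the link table back to texts and require every search term to be present.
$sql = "SELECT t.text_id, t.text
        FROM texts t
        JOIN keyword_text kt ON kt.text_id   = t.text_id
        JOIN keywords k      ON k.keyword_id = kt.keyword_id
        WHERE k.keyword IN ($placeholders)
        GROUP BY t.text_id
        HAVING COUNT(DISTINCT k.keyword) = " . count($terms);

$stmt = $pdo->prepare($sql);
$stmt->execute($terms);
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['text_id'], ': ', $row['text'], PHP_EOL;
}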
As far as I have found on the net, you would end up with about 50,000 or 60,000 words as your keywords, or at most 600,000-700,000 words if you just keep everything as a keyword. Either way, you can easily guess that 50,000 words would take far less space than 10 TB of text-based data.
I hope that helps, and if you need, I can explain more or help you make this work somehow! :)