While gzip is still the most widely used compression program on Unix (the successor to compress), it shows its age. It is no longer actively developed. It implements the DEFLATE method that originated in pkzip (a combination of LZ77 and Huffman encoding). Being a single-threaded program, it uses only one core on multicore servers, although there is a good parallel implementation, pigz, which compensates for that and is blazingly fast.
It was actually MS-DOS pkzip that provided a strong impetus for the development of sophisticated compression programs, and the method used in gzip is similar to the deflate method developed by Phil Katz for pkzip, the classic archiver for DOS. Many talented authors competed with each other on DOS for the best compression program, as compression was really important in the period when floppies were the main method of transporting data from one computer to another and of exchanging information between computers in general. Local networks did exist at this time (with Novell as the primary vendor), but they were expensive and not very common. The Internet did not yet exist (although email did), and BBSes were only starting to appear. When the dust settled and winners emerged, pkzip was on top.
Right now on Linux, for middle-sized text tarballs (say up to 100MB) the winner is 7za, since for a smaller compressed size you can tolerate much slower compression. The xz utility, included with most Linux flavors, is much slower, but despite this it is now widely used for distributing sources and patches in Linux, emerging as the standard data exchange format for this purpose and displacing gzip. For files over 100MB, however, it is way too slow: on a typical server, compressing a 2GB file takes half an hour.

For large text files, such as genomic data coded in FASTA/FASTQ format, the bzip2 format has proved generally preferable, as it provides around 30% better compression with similar compression and decompression speed. The parallel implementation of bzip2, pbzip2, is blazingly fast and is now probably the optimal archiver for this important area, where files are really huge. It is much faster than gzip and has decent error recovery capabilities (important in case the archive is corrupted).
Storing data in any compressed format means that you take a greater risk than when storing data in a plain tarball. You save space, but recovery in case of bad disk blocks becomes trickier or even impossible. So generally you need to store such archives on RAID 1 volumes.
Another problem is that people rarely check their tarballs. Naive users think that they have their files backed up, until it comes time to restore. Then they find out that they have lost almost everything because gzip craps out 10% of the way through the archive.
One possible remedy is gzrecover from the gzip Recovery Toolkit, a program that attempts to skip over bad data in a gzip archive. This has saved a lot of people from an otherwise hopeless situation.
There are generally two things to keep in mind here:
With the option -r, gzip can compress a whole tree, not just a single file. If you have files larger than 4GB you should not use this option. It is safer to create a tarball and compress it, as shown in the sketch below. This is a tried and true method that works up to several terabytes (although the question arises why you need a single tarball several terabytes in size; it is better to spend some time reorganizing the data so that you can have several tarballs instead of one; listing a 4TB tarball can take more than 24 hours even on a computer with a fast CPU and SSD drives).
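A minimal sketch of the two approaches (the /data directory is hypothetical):

gzip -r /data                        # compresses every file in the tree in place, each becoming file.gz
tar cf - /data | gzip > data.tar.gz  # single compressed tarball; the safer route for files over 4GB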
Right now, for large files, pigz (a parallel implementation that can use multiple cores) should always be used instead of classic gzip. Pigz is not what tar's -z option invokes, so to combine it with tar you either pass it via tar's --use-compress-program option (see below) or use a pipe.
The pipe version looks like this (see bash - How to use Pigz with Tar - Stack Overflow):
Mark Adler's top-voted answer on that SO page provides a solution for specifying the compression level as well as the number of processors to use:
tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
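For the reverse direction, a minimal sketch of the matching decompression pipe (same archive name as above; -d decompresses, -c writes to stdout):

pigz -dc archive.tar.gz | tar xf -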
The man page is available in PDF format from zlib.net
DESCRIPTION
Pigz compresses using threads to make use of multiple processors and cores. The input is broken up into 128 KB chunks with each compressed in parallel. The individual check value for each chunk is also calculated in parallel. The compressed data is written in order to the output, and a combined check value is calculated from the individual check values.
The compressed data format generated is in the gzip, zlib, or single-entry zip format using the deflate compression method. The compression produces partial raw deflate streams which are concatenated by a single write thread and wrapped with the appropriate header and trailer, where the trailer contains the combined check value.
Each partial raw deflate stream is terminated by an empty stored block (using the Z_SYNC_FLUSH option of zlib), in order to end that partial bit stream at a byte boundary. That allows the partial streams to be concatenated simply as sequences of bytes. This adds a very small four to five byte overhead to the output for each input chunk.
The default input block size is 128K, but can be changed with the -b option. The number of compress threads is set by default to the number of online processors, which can be changed using the -p option. Specifying -p 1 avoids the use of threads entirely.
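For instance, a small sketch combining these two knobs (the file name and values are arbitrary):

pigz -b 512 -p 8 bigfile.tar    # 512 KiB input blocks, 8 compression threads -> bigfile.tar.gz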
The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread. This can be turned off using the -i or --independent option, so that the blocks can be decompressed independently for partial error recovery or for random access. This also inserts an extra empty block to flag independent blocks by prefacing each with the nine-byte sequence (in hex): 00 00 FF FF 00 00 00 FF FF.

Decompression can't be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances. Parallel decompression can be turned off by specifying one process (-dp 1 or -tp 1).

All options on the command line are processed before any names are processed. If no names are provided on the command line, or if "-" is given as a name (but not after "--"), then the input is taken from stdin. Compressed files can be restored to their original form using pigz -d or unpigz.
OPTIONS
-# --fast --best
Regulate the speed of compression using the specified digit #, where -1 or --fast indicates the fastest compression method (less compression) and -9 or --best indicates the slowest compression method (best compression). -0 is no compression. -11 gives a few percent better compression at a severe cost in execution time, using the zopfli algorithm by Jyrki Alakuijala. The default is -6.
-b --blocksize mmm
Set compression block size to mmmK (default 128 KiB).
-c --stdout --to-stdout
Write all processed output to stdout (won't delete).
-d --decompress --uncompress
Decompress the compressed input.
-f --force
Force overwrite, compress .gz, links, and to terminal.
-h --help
Display a help screen and quit.
-i --independent
Compress blocks independently for damage recovery.
-k --keep
Do not delete original file after processing.
-K --zip
Compress to PKWare zip (.zip) single entry format.
-l --list
List the contents of the compressed input.
-L --license
Display the pigz license and quit.
-m --no-time
Do not store or restore the modification time. -Nm will store or restore the name, but not the modification time. Note that the order of the options is important.
-M --time
Store or restore the modification time. -nM will store or restore the modification time, but not the name. Note that the order of the options is important.
-n --no-name
Do not store or restore the file name or the modification time. This is the default when decompressing. When the file name is not restored from the header, the name of the compressed file with the suffix stripped is the name of the decompressed file. When the modification time is not restored from the header, the modification time of the compressed file is used (not the current time).
-N --name
Store or restore both the file name and the modification time. This is the default when compressing.
-p --processes n
Allow up to n processes (default is the number of online processors)
-q --quiet --silent
Print no messages, even on error.
-r --recursive
Process the contents of all subdirectories.
-R --rsyncable
Input-determined block locations for rsync.
-S --suffix .sss
Use suffix .sss instead of .gz (for compression).
-t --test
Test the integrity of the compressed input.
-v --verbose
Provide more verbose output.
-V --version
Show the version of pigz. -vV also shows the zlib version.
-z --zlib
Compress to zlib (.zz) instead of gzip format.
--
All arguments after "--" are treated as file names (for names that start with "-").

These options are unique to the -11 compression level:
-F --first
Do iterations first, before block split (default is last).
-I, --iterations n
Number of iterations for optimization (default 15).
-J, --maxsplits n
Maximum number of split blocks (default 15).
-O --oneblock
Do not split into smaller blocks (default is block splitting).
COPYRIGHT NOTICE
This software is provided "as-is", without any express or implied warranty. In no event will the author be held liable for any damages arising from the use of this software.
Copyright (C) 2007-2017 Mark Adler <[email protected]>
GZIP competes with other compressors

Gzip is weaker in compression ratio but has other advantages, such as high speed (on a single core it probably remains the fastest) and wide availability. It is still more or less adequate and, despite being obsolete from the compression-ratio standpoint, is widely used.
By default, applying gzip to a file replaces the original file with a compressed file that has the extension .gz. The created archive keeps the same ownership, mode, access time, and modification time. If no files are specified, or if a file name is "-", the standard input is compressed to the standard output. gzip will only attempt to compress regular files. In particular, it will ignore symbolic links.
If the new file name is too long for its file system, gzip truncates it. gzip attempts to truncate only the parts of the file name longer than 3 characters. (A part is delimited by dots.) If the name consists of small parts only, the longest parts are truncated. For example, if file names are limited to 14 characters, gzip.msdos.exe is compressed to gzi.msd.exe.gz. Names are not truncated on systems which do not have a limit on file name length.
The behavior of gzip became the de facto standard for similar programs; most options are often the same.
When compressing a file, gzip by default replaces it with the compressed file. While doing so, gzip keeps the original file name and timestamp in the compressed file. These are used when decompressing the file with the -N option (the default). This is useful when the compressed file name was truncated or when the time stamp was not preserved after a file transfer. However, due to limitations in the current gzip file format, fractional seconds are discarded. Also, time stamps must fall within the range 1970-01-01 00:00:00 through 2106-02-07 06:28:15 UTC, and hosts whose operating systems use 32-bit time stamps are further restricted to time stamps no later than 2038-01-19 03:14:07 UTC. The upper bounds assume the typical case where leap seconds are ignored.
Compressed files can be restored to their original form using the -d option (gzip -d). Alternatively, you can use gunzip or zcat. Decompressing to standard output (gzip -dc) naturally loses permissions, owners, and timestamps. If the original name saved in the compressed file is not suitable for its file system, a new name is constructed from the original one to make it legal.
gunzip takes a list of files on its command line and replaces each file whose name ends with `.gz', `.z', `.Z', `-gz', `-z' or `_z' and which begins with the correct magic number with an uncompressed file without the original extension. gunzip also recognizes the special extensions `.tgz' and `.taz' as shorthands for `.tar.gz' and `.tar.Z' respectively. When compressing, gzip uses the `.tgz' extension if necessary instead of truncating a file with a `.tar' extension.
gunzip can currently decompress files created by gzip, zip, compress or pack. The detection of the input format is automatic. When using the first two formats, gunzip checks a 32 bit CRC (cyclic redundancy check). For pack, gunzip checks the uncompressed length. The compress format was not designed to allow consistency checks. However gunzip is sometimes able to detect a bad `.Z' file. If you get an error when uncompressing a `.Z' file, do not assume that the `.Z' file is correct simply because the standard uncompress does not complain. This generally means that the standard uncompress does not check its input, and happily generates garbage output. The SCO `compress -H' format (lzh compression method) does not include a CRC but also allows some consistency checks.
Files created by zip can be uncompressed by gzip only if they have a single member compressed with the 'deflation' method. This feature is only intended to help conversion of tar.zip files to the tar.gz format. To extract a zip file with a single member, use a command like `gunzip <foo.zip' or `gunzip -S .zip foo.zip'. To extract zip files with several members, use unzip instead of gunzip.
zcat is identical to `gunzip -c'. zcat uncompresses either a list of files on the command line or its standard input and writes the uncompressed data on standard output. zcat will uncompress files that have the correct magic number whether they have a `.gz' suffix or not.
gzip uses the Lempel-Ziv algorithm used in zip and PKZIP. The amount of compression obtained depends on the size of the input and the distribution of common substrings. Typically, text such as source code or English is reduced by 60-70%. Compression is generally much better than that achieved by LZW (as used in compress), Huffman coding (as used in pack), or adaptive Huffman coding (compact).
Compression is always performed, even if the compressed file is slightly larger than the original. The worst case expansion is a few bytes for the gzip file header, plus 5 bytes every 32K block, or an expansion ratio of 0.015% for large files. Note that the actual number of used disk blocks almost never increases. gzip normally preserves the mode, ownership and time stamps of files when compressing or decompressing.
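A quick check of that figure, taking the stated 5 bytes of overhead per 32K block: 5 / 32768 ≈ 0.00015, i.e. roughly 0.015%, matching the expansion ratio quoted above.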
The gzip file format was specified by P. Deutsch in "GZIP file format specification version 4.3", Internet RFC 1952 (May 1996). The zip deflation format was specified by P. Deutsch in "DEFLATE Compressed Data Format Specification version 1.3", Internet RFC 1951 (May 1996).
Here are some realistic examples of running gzip. Most of the examples (those that do not involve tar) are also valid for pigz.
Compress the content of a mounted ISO into a gz file:

gzip -rc /mnt > /tmp/iso.gz

You can time this operation; the speed of compressing an ISO can serve as a poor man's performance test of the computer:

time gzip -rc /mnt > /tmp/iso.gz
From 11 Simple Gzip Examples RootUsers
- 1. Compress a single file
This will compress file.txt and create file.txt.gz; note that this will remove the original file.txt file:

gzip file.txt

- 2. Compress multiple files at once
This will compress all files specified in the command, note again that this will remove the original files, turning file1.txt, file2.txt and file3.txt into file1.txt.gz, file2.txt.gz and file3.txt.gz:

gzip file1.txt file2.txt file3.txt

To instead compress all files within a directory, see example 8 below.
- 3. Compress a single file and keep the original
You can instead keep the original file and create a compressed copy:

gzip -c file.txt > file.txt.gz

The -c flag outputs the compressed copy of file.txt to stdout; this is then redirected to file.txt.gz, keeping the original file.txt in place. Newer versions of gzip may also have -k or --keep available, which could be used instead with "gzip -k file.txt".
- 4. Compress all files recursively
All files within the directory and all subdirectories can be compressed recursively with the -r flag:

[root@centos test]# ls -laR
.:
drwxr-xr-x. 2 root root 24 Jul 28 18:05 example
-rw-r--r--. 1 root root  8 Jul 28 17:09 file1.txt
-rw-r--r--. 1 root root  3 Jul 28 17:54 file2.txt
-rw-r--r--. 1 root root  5 Jul 28 17:54 file3.txt

./example:
-rw-r--r--. 1 root root  5 Jul 28 18:00 example.txt

[root@centos test]# gzip -r *
[root@centos test]# ls -laR
.:
drwxr-xr-x. 2 root root 27 Jul 28 18:07 example
-rw-r--r--. 1 root root 38 Jul 28 17:09 file1.txt.gz
-rw-r--r--. 1 root root 33 Jul 28 17:54 file2.txt.gz
-rw-r--r--. 1 root root 35 Jul 28 17:54 file3.txt.gz

./example:
-rw-r--r--. 1 root root 37 Jul 28 18:00 example.txt.gz

In the above example there are 3 .txt files in the test directory, which is our current working directory; there is also an example subdirectory which contains example.txt. Upon running gzip with the -r flag over everything, all files were recursively compressed.
This can be reversed by running "gzip -dr *", where -d is used to decompress and -r performs this on all of the files recursively.
- 5. Decompress a gzip compressed file
To reverse the compression process and get the original file back, you can use the gzip command itself or gunzip, which is also part of the gzip package:

gzip -d file.txt.gz

OR

gunzip file.txt.gz

Both of these commands will produce the same result, decompressing file.txt.gz to file.txt and removing the compressed file.txt.gz file.
Similar to example 3, it is possible to decompress a file and keep the original .gz file as below.
gunzip -c file.txt.gz > file.txt

As mentioned in step 4, -d can be combined with -r to decompress all files recursively.
- 6. List compression information
With the -l or --list flag we can see useful information regarding a compressed .gz file, such as the compressed and uncompressed size of the file as well as the compression ratio, which shows us how much space our compression is saving.

[root@centos ~]# gzip -l linux-3.18.19.tar.gz
         compressed        uncompressed  ratio uncompressed_name
          126117045           580761600  78.3% linux-3.18.19.tar
[root@centos ~]# ls -lah
-rw-r--r--. 1 root root 554M Jul 28 17:24 linux-3.18.19.tar
-rw-r--r--. 1 root root 121M Jul 28 17:25 linux-3.18.19.tar.gz

In this example, a gzipped copy of the Linux kernel source shows a compression ratio of 78.3% (the fraction of space saved), taking up 121MB rather than 554MB.
- 7. Adjust compression level
The level of compression applied to a file using gzip can be specified as a value between 1 (least compression) and 9 (best compression). Using option 1 will complete faster, but the space saved from the compression will not be optimal. Using option 9 will take longer to complete, however you will have the largest amount of space saved. The below example compares the differences between -1 and -9; as shown, while -1 finishes much faster it compresses around 5% less (approximately 30MB more space required).

[root@centos ~]# time gzip -1 linux-3.18.19.tar
real 0m13.602s
user 0m12.908s
sys  0m0.662s
[root@mirror1 ~]# gzip -l linux-3.18.19.tar.gz
         compressed        uncompressed  ratio uncompressed_name
          156001021           580761600  73.1% linux-3.18.19.tar
[root@centos ~]# time gzip -9 linux-3.18.19.tar
real 0m58.129s
user 0m57.193s
sys  0m0.735s
[root@centos ~]# gzip -l linux-3.18.19.tar.gz
         compressed        uncompressed  ratio uncompressed_name
          125064095           580761600  78.5% linux-3.18.19.tar

-1 can also be specified with the flag --fast, while -9 can also be specified with the flag --best. By default gzip uses a compression level of -6, which is slightly biased towards higher compression at the expense of speed. When selecting a value between 1 and 9 it is important to consider what is more important to you, the amount of space saved or the amount of time spent compressing; the default -6 option provides a fair trade-off.
- 8. Compress a directory
With the help of the tar command, we can create a tar file of a whole directory and gzip the result, all in one step, as tar allows us to specify a compression method to use:

tar czvf etc.tar.gz /etc/

This example creates a compressed etc.tar.gz file of the entire /etc/ directory. The tar flags are as follows: "c" creates a new tar archive, "z" specifies that we want to compress with gzip, "v" provides verbose information, and "f" specifies the file to create. The resulting etc.tar.gz file contains all files within /etc/ compressed using gzip.
- 9. Integrity test
The -t or --test flag can be used to check the integrity of a compressed file. On a normal file, the result will be listed as OK, as shown below.
[root@centos test]# gzip -tv file1.txt.gz
file1.txt.gz: OK

I have now manually modified this file with a text editor and added a random value, essentially introducing corruption, and it is now no longer valid.
[root@centos test]# gzip -tv file1.txt.gz
file1.txt.gz:
gzip: file1.txt.gz: invalid compressed data--crc error
gzip: file1.txt.gz: invalid compressed data--length error

The compressed .gz file makes use of a cyclic redundancy check (CRC) in order to detect errors. The CRC value can be viewed by running gzip with the -l and -v flags, as shown below.
[root@centos test]# gzip -lv file1.txt.gz
method  crc      date   time       compressed  uncompressed   ratio uncompressed_name
defla   08db5c50 Jul 28 18:15              40     167772160  100.0% file1.txt

- 10. Concatenate multiple files
Multiple files can be concatenated into a single .gz file:

gzip -c file1.txt > files.gz
gzip -c file2.txt >> files.gz

The files.gz archive now contains the contents of both file1.txt and file2.txt.
If you decompress files.gz you will get a file named "files" which contains the content of both .txt files.
The output is similar to running �cat file1.txt file2.txt�. If instead you want to create a single file that contains multiple files you can use the tar command which supports gzip compression, as covered above in example 8.
- 11. Additional commands included with gzip
The gzip package provides some very useful commands for working with compressed files, such as zcat, zgrep and zless/zmore.
As you can probably tell by the names of the commands, these are essentially the cat, grep, and less/more commands, however they work directly on compressed data. This means that you can easily view or search the contents of a compressed file without having to decompress it and then view or search it in a second step.
[root@centos test]# zcat test.txt.gz
test example text
[root@centos test]# zgrep exa test.txt.gz
example

This is especially useful when searching through or reviewing log files which have been compressed during log rotation.
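Similarly, a rotated log can be paged through directly with zless (the file name here is hypothetical):

zless /var/log/messages-20190708.gz    # view a compressed log without unpacking it first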
This is the output of the command `gzip -h':
gzip version-number
usage: gzip [-cdfhlLnNrtvV19] [-S suffix] [file ...]
 -c --stdout      write on standard output, keep original files unchanged
 -d --decompress  decompress
 -f --force       force overwrite of output file and compress links
 -h --help        give this help
 -l --list        list compressed file contents
 -L --license     display software license
 -n --no-name     do not save or restore the original name and time stamp
 -N --name        save or restore the original name and time stamp
 -q --quiet       suppress all warnings
 -r --recursive   operate recursively on directories
 -S .suf --suffix .suf  use suffix .suf on compressed files
 -t --test        test compressed file integrity
 -v --verbose     verbose mode
 -V --version     display version number
 -1 --fast        compress faster
 -9 --best        compress better
 file...          files to (de)compress. If none given, use standard input.
Report bugs to <[email protected]>.
This is the output of the command `gzip -v texinfo.tex':
texinfo.tex: 69.7% -- replaced with texinfo.tex.gz
The following command will find all gzip files in the current directory and subdirectories, and extract them in place without destroying the original:
find . -name '*.gz' -print | sed 's/^\(.*\)[.]gz$/gunzip < "&" > "\1"/' | sh
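Assuming a gzip recent enough to ship -k/--keep (mentioned in example 3 above), a shorter alternative with the same effect:

find . -name '*.gz' -exec gunzip -k {} +    # decompress each file in place, keeping the original .gz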
The format for running the gzip program is:
gzip option ...
For each compressed file, the --list option reports the following fields:

compressed size:    size of the compressed file
uncompressed size:  size of the uncompressed file
ratio:              compression ratio (0.0% if unknown)
uncompressed_name:  name of the uncompressed file
The uncompressed size is given as `-1' for files not in gzip format, such as compressed `.Z' files. To get the uncompressed size for such a file, you can use:
zcat file.Z | wc -c
In combination with the `--verbose' option, the following fields are also displayed:
method:      compression method (deflate, compress, lzh, pack)
crc:         the 32-bit CRC of the uncompressed data
date & time: time stamp for the uncompressed file
The crc is given as ffffffff for a file not in gzip format.
With `--verbose', the size totals and compression ratio for all files is also displayed, unless some sizes are unknown. With `--quiet', the title and totals lines are not displayed.
The gzip format represents the input size modulo 2^32, so the uncompressed size and compression ratio are listed incorrectly for uncompressed files 4 GB and larger. To work around this problem, you can use the following command to discover a large uncompressed file's true size:
zcat file.gz | wc -c
A null suffix forces gunzip to try decompression on all given files regardless of suffix, as in:

gunzip -S "" *       (*.* for MSDOS)

Previous versions of gzip used the `.z' suffix. This was changed to avoid a conflict with pack.
Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once. If one member is damaged, other members might still be recovered after removal of the damaged member. Better compression can usually be obtained if all members are decompressed and then recompressed in a single step.
This is an example of concatenating gzip files:
gzip -c file1 > foo.gz
gzip -c file2 >> foo.gz
Then
gunzip -c foo
is equivalent to
cat file1 file2
In case of damage to one member of a `.gz' file, other members can still be recovered (if the damaged member is removed). However, you can get better compression by compressing all members at once:
cat file1 file2 | gzip > foo.gz
compresses better than
gzip -c file1 file2 > foo.gz
If you want to recompress concatenated files to get better compression, do:
zcat old.gz | gzip > new.gz
If a compressed file consists of several members, the uncompressed size and CRC reported by the `--list' option applies to the last member only. If you need the uncompressed size for all members, you can use:
zcat file.gz | wc -c
If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the `-z' option to invoke gzip transparently. gzip is designed as a complement to tar, not as a replacement.
The environment variable GZIP can hold a set of default options for gzip. These options are interpreted first and can be overwritten by explicit command line parameters. For example:
for sh:    GZIP="-8v --name"; export GZIP
for csh:   setenv GZIP "-8v --name"
for MSDOS: set GZIP=-8v --name
When writing compressed data to a tape, it is generally necessary to pad the output with zeroes up to a block boundary. When the data is read and the whole block is passed to gunzip for decompression, gunzip detects that there is extra trailing garbage after the compressed data and emits a warning by default if the garbage contains nonzero bytes. You have to use the `--quiet' option to suppress the warning. This option can be set in the GZIP environment variable, as in:
for sh:  GZIP="-q" tar -xfz --block-compress /dev/rst0
for csh: (setenv GZIP "-q"; tar -xfz --block-compress /dev/rst0)
In the above example, gzip is invoked implicitly by the `-z' option of GNU tar. Make sure that the same block size (`-b' option of tar) is used for reading and writing compressed data on tapes. (This example assumes you are using the GNU version of tar.)
The algorithms used in gzip came from MS-DOS. The most popular DOS compression program was pkzip by the late Phil Katz (who died when he was just 37 years old), founder of the company PKWARE. He essentially created the zip format, later replicated by the Info-ZIP programs for Linux:
Phillip Walter Katz (November 3, 1962 - April 14, 2000) was a computer programmer best known as the co-creator of the Zip file format for data compression, and the author of PKZIP, a program for creating zip files that ran under DOS. A copyright lawsuit between System Enhancement Associates (SEA) and Katz's company, PKWARE, Inc., was widely publicized in the BBS community in the late 1980s. Phil Katz's software business was very successful, but he struggled with social isolation and chronic alcoholism in the last years of his life.
Starting with pkzip 2.0 (released in 1993), it used so-called "deflating": a lossless data compression algorithm based on a combination of the LZ77 algorithm and Huffman coding. The resulting file format has since become ubiquitous on DOS and later Windows, as well as on BBSes and later the Internet; almost all files with the .ZIP (or .zip) extension are in PKZIP 2.x format.
Utilities to read and write these files are now available on all common platforms. The deflate method itself was later specified in RFC 1951, and Windows, starting with 2000, is able to work with such files natively.
Gzip is an attempt to replicate part of the functionality of pkzip in the Unix environment. Like pkzip, gzip uses the deflate method, but its file format is not compatible with the pkzip 2.0 format.
Gzip proved to be one of the most popular programs in the GNU toolchain, and good knowledge of gzip is indispensable for any Unix/Linux administrator, as many files still use this format despite its age and a compression ratio lower than that provided by bzip2 (which excels on text and bioinformatics files) and xz.
Oct 09, 2019 | www.aaronrenn.com
So you thought you had your files backed up - until it came time to restore. Then you found out that you had bad sectors and you've lost almost everything because gzip craps out 10% of the way through your archive. The gzip Recovery Toolkit has a program - gzrecover - that attempts to skip over bad data in a gzip archive. This saved me from exactly the above situation. Hopefully it will help you as well.
I'm very eager for feedback on this program. If you download and try it, I'd appreciate an email letting me know what your results were. My email is [email protected]. Thanks.
ATTENTION

99% of "corrupted" gzip archives are caused by transferring the file via FTP in ASCII mode instead of binary mode. Please re-transfer the file in the correct mode first before attempting to recover from a file you believe is corrupted.
Disclaimer and Warning

This program is provided AS IS with absolutely NO WARRANTY. It is not guaranteed to recover anything from your file, nor is what it does recover guaranteed to be good data. The bigger your file, the more likely that something will be extracted from it. Also keep in mind that this program gets faked out and is likely to "recover" some bad data. Everything should be manually verified.
Downloading and Installing

Note that version 0.8 contains major bug fixes and improvements. See the ChangeLog for details. Upgrading is recommended. The old version is provided in the event you run into troubles with the new release.
You need the following packages:
- gzrt-0.8.tar.gz (2013-10-03)
- gzrt-0.7.tar.gz (previous release) - gzrt sources
- github repository
- zlib - You might already have this.
- GNU cpio (version 2.6 or higher) - Only if your archive is a compressed tar file and you don't already have this (try "cpio --version" to find out)
First, build and install zlib if necessary. Next, unpack the gzrt sources. Then cd to the gzrt directory and build the gzrecover program by typing make. Install manually by copying it to the directory of your choice.

Usage

Run gzrecover on a corrupted .gz file. If you leave the filename blank, gzrecover will read from the standard input. Anything that can be read from the file will be written to a file with the same name, but with .recovered appended (any .gz is stripped). You can override this with the -o option.
To get a verbose readout of exactly where gzrecover is finding bad bytes, use the -v option to enable verbose mode. This will probably overflow your screen with text so best to redirect the stderr stream to a file. Once gzrecover has finished, you will need to manually verify any data recovered as it is quite likely that our output file is corrupt and has some garbage data in it. Note that gzrecover will take longer than regular gunzip. The more corrupt your data the longer it takes. If your archive is a tarball, read on.
For tarballs, the tar program will choke because GNU tar cannot handle errors in the file format. Fortunately, GNU cpio (tested at version 2.6 or higher) handles corrupted files out of the box.
Here's an example:
$ ls *.gz
my-corrupted-backup.tar.gz
$ gzrecover my-corrupted-backup.tar.gz
$ ls *.recovered
my-corrupted-backup.tar.recovered
$ cpio -F my-corrupted-backup.tar.recovered -i -v

Note that newer versions of cpio can spew voluminous error messages to your terminal. You may want to redirect the stderr stream to /dev/null. Also, cpio might take quite a long while to run.
Copyright

The gzip Recovery Toolkit v0.8
Copyright (c) 2002-2013 Aaron M. Renn ([email protected])

The gzrecover program is licensed under the GNU General Public License.
Oct 09, 2019 | stackoverflow.com
George ,Jun 24, 2016 at 2:49
Are you sure that it is a gzip file? I would first run 'file SMS.tar.gz' to validate that. Then I would read the gzip Recovery Toolkit page.
JohnEye ,Oct 4, 2016 at 11:27
Recovery is possible but it depends on what caused the corruption.

If the file is just truncated, getting some partial result out is not too hard; just run

gunzip < SMS.tar.gz > SMS.tar.partial

which will give some output despite the error at the end.
If the compressed file has large missing blocks, it's basically hopeless after the bad block.
If the compressed file is systematically corrupted in small ways (e.g. transferring the binary file in ASCII mode, which smashes carriage returns and newlines throughout the file), it is possible to recover but requires quite a bit of custom programming, it's really only worth it if you have absolutely no other recourse (no backups) and the data is worth a lot of effort. (I have done it successfully.) I mentioned this scenario in a previous question .
The answers for .zip files differ somewhat, since zip archives have multiple separately-compressed members, so there's more hope (though most commercial tools are rather bogus, they eliminate warnings by patching CRCs, not by recovering good data). But your question was about a .tar.gz file, which is an archive with one big member.
Here is one possible scenario that we encountered. We had a tar.gz file that would not decompress; trying to unzip gave the error:

gzip -d A.tar.gz
gzip: A.tar.gz: invalid compressed data--format violated

I figured out that the file may have been originally uploaded over a non-binary ftp connection (we don't know for sure).
The solution was relatively simple, using the unix dos2unix utility:

dos2unix A.tar.gz
dos2unix: converting file A.tar.gz to UNIX format ...
tar -xvf A.tar
file1.txt
file2.txt
....etc.

It worked! This is one slim possibility, and maybe worth a try - it may help somebody out there.
Jul 30, 2019 | askubuntu.com
- @ChristopheDeTroyer Tarballs are compressed in such a way that you have to decompress them in full, then take out the file you want. I think that .zip folders are different, so if you want to be able to take out individual files fast, try them. – GKFX Jun 3 '16 at 13:04
Jul 28, 2019 | askubuntu.com
CMCDragonkai, Jun 3, 2016 at 13:04
1. Using the command-line tar

Yes, just give the full stored path of the file after the tarball name.

Example: suppose you want the file etc/apt/sources.list from etc.tar:

tar -xf etc.tar etc/apt/sources.list

This will extract sources.list and create the directories etc/apt under the current directory.

- You can use the -t listing option instead of -x, maybe along with grep, to find the path of the file you want (see the sketch after this list)
- You can also extract a single directory
- tar has other options like --wildcards, etc. for more advanced partial extraction scenarios; see man tar
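A small sketch of the -t plus grep approach from the first bullet, reusing the etc.tar example above:

tar -tf etc.tar | grep 'sources.list'    # locate the stored path inside the archive
tar -xf etc.tar etc/apt/sources.list     # then extract exactly that path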
2. Extract it with the Archive Manager

Open the tar in Archive Manager from Nautilus, go down into the folder hierarchy to find the file you need, and extract it.

- On a server or command-line system, use a text-based file manager such as Midnight Commander (mc) to accomplish the same.

3. Using Nautilus/Archive-Mounter

Right-click the tar in Nautilus, and select Open with ArchiveMounter.

The tar will now appear similar to a removable drive on the left, and you can explore/navigate it like a normal drive and drag/copy/paste any file(s) you need to any destination.
Jul 28, 2019 | unix.stackexchange.com
Midnight Commander uses a virtual filesystem (VFS) for displaying files, such as the contents of a .tar.gz archive, or of an .iso image. This is configured in mc.ext with rules such as this one (Open is Enter, View is F3):

regex/\.([iI][sS][oO])$
    Open=%cd %p/iso9660://
    View=%view{ascii} isoinfo -d -i %f

When I press Enter on an .iso file, mc will open the .iso and I can browse individual files. This is very useful.

Now my question: I also have files which are disk images, i.e. created with

pv /dev/sda1 > sda1.img

I would like mc to "browse" the files inside these images in the same fashion as .iso. Is this possible? How would such a rule look?
Nov 08, 2018 | stackoverflow.com
user1118764 , Sep 7, 2012 at 6:58
I normally compress using tar zcvf and decompress using tar zxvf (using gzip due to habit).

I've recently gotten a quad core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores are unused during compression/decompression.
Is there any way I can utilize the unused cores to make it faster?
Warren Severin , Nov 13, 2017 at 4:37
The solution proposed by Xiong Chiamiov above works beautifully. I had just backed up my laptop with .tar.bz2 and it took 132 minutes using only one cpu thread. Then I compiled and installed tar from source: gnu.org/software/tar I included the options mentioned in the configure step: ./configure --with-gzip=pigz --with-bzip2=lbzip2 --with-lzip=plzip I ran the backup again and it took only 32 minutes. That's better than 4X improvement! I watched the system monitor and it kept all 4 cpus (8 threads) flatlined at 100% the whole time. THAT is the best solution. – Warren Severin Nov 13 '17 at 4:37

Mark Adler , Sep 7, 2012 at 14:48

You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:

tar cf - paths-to-archive | pigz > archive.tar.gz

By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.

tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
How do you use pigz to decompress in the same fashion? Or does it only work for compression? – user788171 Feb 20 '13 at 12:43

pigz does use multiple cores for decompression, but only with limited improvement over a single core. The deflate format does not lend itself to parallel decompression. The decompression portion must be done serially. The other cores for pigz decompression are used for reading, writing, and calculating the CRC. When compressing on the other hand, pigz gets close to a factor of n improvement with n cores. – Mark Adler Feb 20 '13 at 16:18

The hyphen here is stdout (see this page). – Garrett Mar 1 '14 at 7:26

Yes. 100% compatible in both directions. – Mark Adler Jul 2 '14 at 21:29

There is effectively no CPU time spent tarring, so it wouldn't help much. The tar format is just a copy of the input file with header blocks in between files. – Mark Adler Apr 23 '15 at 5:23

Jen , Jun 14, 2013 at 14:34
You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.

For example use:

tar -c --use-compress-program=pigz -f tar.file dir_to_zip
This is an awesome little nugget of knowledge and deserves more upvotes. I had no idea this option even existed and I've read the man page a few times over the years. – ranman Nov 13 '13 at 10:01Valerio Schiavoni , Aug 5, 2014 at 22:38
Unfortunately by doing so the concurrent feature of pigz is lost. You can see for yourself by executing that command and monitoring the load on each of the cores. – Valerio Schiavoni Aug 5 '14 at 22:38bovender , Sep 18, 2015 at 10:14
@ValerioSchiavoni: Not here, I get full load on all 4 cores (Ubuntu 15.04 'Vivid'). – bovender Sep 18 '15 at 10:14Valerio Schiavoni , Sep 28, 2015 at 23:41
On compress or on decompress ? – Valerio Schiavoni Sep 28 '15 at 23:41Offenso , Jan 11, 2017 at 17:26
I prefer tar cf - dir_to_zip | pv | pigz > tar.file. pv helps me estimate, you can skip it. But still it is easier to write and remember. – Offenso Jan 11 '17 at 17:26

Maxim Suslov , Dec 18, 2014 at 7:31
Common approach

There is an option for the tar program:

-I, --use-compress-program PROG
       filter through PROG (must accept -d)

You can use a multithreaded version of an archiver or compressor utility.

The most popular multithreaded archivers are pigz (instead of gzip) and pbzip2 (instead of bzip2). For instance:

$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive

The archiver must accept -d. If your replacement utility doesn't have this parameter and/or you need to specify additional parameters, then use pipes (add parameters if necessary):

$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz

Input and output of the singlethreaded and multithreaded versions are compatible. You can compress using the multithreaded version and decompress using the singlethreaded version and vice versa.
p7zip

For compression with p7zip you need a small shell script like the following:

#!/bin/sh
case $1 in
  -d) 7za -txz -si -so e;;
  *)  7za -txz -si -so a .;;
esac 2>/dev/null

Save it as 7zhelper.sh. Here is an example of usage:

$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z

xz

Regarding multithreaded XZ support: if you are running version 5.2.0 or above of XZ Utils, you can utilize multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").

This is a fragment of the man page for the 5.1.0alpha version:

Multithreaded compression and decompression are not implemented yet, so this option has no effect for now.

However this will not work for decompression of files that haven't also been compressed with threading enabled. From the man page for version 5.2.2:

Threaded decompression hasn't been implemented yet. It will only work on files that contain multiple blocks with size information in block headers. All files compressed in multi-threaded mode meet this condition, but files compressed in single-threaded mode don't even if --block-size=size is used.
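A concrete sketch of the environment-variable approach described above (archive and directory names are hypothetical; -J selects tar's xz filter):

XZ_DEFAULTS="-T 0" tar -Jcf archive.tar.xz /path/to/dir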
Recompiling with replacement

If you build tar from sources, then you can recompile with parameters
--with-gzip=pigz --with-bzip2=lbzip2 --with-lzip=plzipAfter recompiling tar with these options you can check the output of tar's help:
$ tar --help | grep "lbzip2\|plzip\|pigz"
  -j, --bzip2        filter the archive through lbzip2
      --lzip         filter the archive through plzip
  -z, --gzip, --gunzip, --ungzip
                     filter the archive through pigz

This is indeed the best answer. I'll definitely rebuild my tar! – user1985657 Apr 28 '15 at 20:41
I just found pbzip2 and mpibzip2 . mpibzip2 looks very promising for clusters or if you have a laptop and a multicore desktop computer for instance. – user1985657 Apr 28 '15 at 20:57oᴉɹǝɥɔ , Jun 10, 2015 at 17:39
This is a great and elaborate answer. It may be good to mention that multithreaded compression (e.g. with pigz) is only enabled when it reads from the file. Processing STDIN may in fact be slower. – oᴉɹǝɥɔ Jun 10 '15 at 17:39
Plus 1 for the xz option. It is the simplest, yet effective approach. – selurvedu May 26 '16 at 22:13

panticz.de , Sep 1, 2014 at 15:02
You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:

tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/

A nice TL;DR for @MaximSuslov's answer. – einpoklum Feb 11 '17 at 15:59
If you want to have more flexibility with filenames and compression options, you can use:

find /my/path/ -type f -name "*.sql" -o -name "*.log" -exec \
 tar -P --transform='s@/my/path/@@g' -cf - {} + | \
 pigz -9 -p 4 > myarchive.tar.gz

Step 1: find

find /my/path/ -type f -name "*.sql" -o -name "*.log" -exec

This command will look for the files you want to archive, in this case /my/path/*.sql and /my/path/*.log. Add as many -o -name "pattern" as you want.

-exec will execute the next command using the results of find: tar

Step 2: tar

tar -P --transform='s@/my/path/@@g' -cf - {} +

--transform is a simple string replacement parameter. It will strip the path of the files from the archive so the tarball's root becomes the current directory when extracting. Note that you can't use the -C option to change directory, as you'd lose the benefits of find: all files of the directory would be included.

-P tells tar to use absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.

-cf - tells tar to use the tarball name we'll specify later.

{} + uses every file that find found previously.

Step 3: pigz

pigz -9 -p 4

Use as many parameters as you want. In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression. If you run this on a heavily loaded webserver, you probably don't want to use all available cores.

Step 4: archive name

> myarchive.tar.gz

Finally.
May 28, 2018 | www.reddit.com
TIP: 7-zip's XZ compression on a multiprocessor system is often faster and compresses better than gzip ( self.linuxadmin )
kristopolous 4 years ago:
I did this a while back also. Here's a graph: http://i.imgur.com/gPOQBfG.png
X axis is compression level (min to max), Y is the size of the file that was compressed. I forget what the file was.

TyIzaeL 4 years ago:
That is a great start (probably better than what I am doing). Do you have time comparisons as well?

kristopolous 4 years ago:
http://www.reddit.com/r/linuxquestions/comments/1gdvnc/best_file_compression_format/caje4hm there's the post

TyIzaeL 4 years ago:
Very nice. I might work on something similar to this next time I'm bored.

kristopolous 4 years ago:
nope.

TyIzaeL 4 years ago:
That's a great point to consider among all of this. Compression is always a tradeoff between how much CPU and memory you want to throw at something and how much space you would like to save. In my case, hammering the server for 3 minutes in order to take a backup is necessary because the uncompressed data would bottleneck at the LAN speed.

randomfrequency 4 years ago:
You might want to play with 'pigz' - it's gzip, multi-threaded. You can use 'pv' to restrict the rate of the output, and it accepts signals to control the rate limiting.

rrohbeck 4 years ago:
Also pbzip2 -1 to -9 and pigz -1 to -9. With -9 you can surely make the backup CPU bound. I've given up on compression though: rsync is much faster than straight backup and I use btrfs compression/deduplication/snapshotting on the backup server.

TyIzaeL 4 years ago:
pigz -9 is already on the chart as pigz --best. I'm working on adding the others though.

TyIzaeL 4 years ago:
I'm running gzip, bzip2, and pbzip2 now (not at the same time, of course) and will add results soon. But in my case the compression keeps my db dumps from being IO bound by the 100mbit LAN connection. For example, lzop in the results above puts out 6041.632 megabits in 53.82 seconds for a total compressed data rate of 112 megabits per second, which would make the transfer IO bound. Whereas the pigz example puts out 3339.872 megabits in 81.892 seconds, for an output data rate of 40.8 megabits per second. This is just on my dual-core box with a static file; on the 8-core server I see the transfer takes a total of about three minutes. It's probably being limited more by the rate at which the MySQL server can dump text from the database, but if there was no compression it'd be limited by the LAN speed. If we were dumping 2.7GB over the LAN directly, we would need 122mbit/s of real throughput to complete it in three minutes.

Shammyhealz 4 years ago:
I thought the best compression was supposed to be LZMA? Which is what the .7z archives are. I have no idea of the relative speed of LZMA and gzip.

TyIzaeL 4 years ago:
xz archives use the LZMA2 format (which is also used in 7z archives). LZMA2 speed seems to range from a little slower than gzip to much slower than bzip2, but results in better compression all around.

primitive_screwhead 4 years ago:
However LZMA2 decompression speed is generally much faster than bzip2, in my experience, though not as fast as gzip. This is why we use it, as we decompress our data much more often than we compress it, and the space saving/decompression speed tradeoff is much more favorable for us than either gzip or bzip2.

crustang 4 years ago:
I mentioned how 7zip was superior to all other zip programs in /r/osx a few days ago and my comment was buried in favor of the OS X circlejerk... it feels good seeing this data.

RTFMorGTFO 4 years ago:
I love 7zip. Why... Tar supports xz, lzma, lzop, lzip, and any other kernel-based compression algorithms. It's also much more likely to be preinstalled on your given distro.

crustang 4 years ago:
I've used 7zip at my old job for a backup of our business software's database. We needed speed, a high level of compression, and encryption. Portability wasn't high on the list since only a handful of machines needed access to the data. All machines were multi-processor and 7zip gave us the best of everything given the requirements. I haven't really looked at anything deeply - including tar, which my old boss didn't care for.
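Since the thread points out that tar can drive xz directly, a minimal sketch; the -T0 multi-threading flag assumes xz 5.2 or later, and XZ_OPT is the environment variable xz reads for extra options:

    # single-threaded xz through tar
    tar -cJf archive.tar.xz dir_to_zip/

    # let xz use all available cores
    XZ_OPT='-6 -T0' tar -cJf archive.tar.xz dir_to_zip/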
May 28, 2018 | rpm.pbone.net
p7zip RPM build for RedHat EL 6. For other distributions, see p7zip.

Name: p7zip
Version: 9.20.1
Release: 1.el6.rf
Vendor: Dag Apt Repository, http://dag_wieers_com/apt/
Date: 2011-04-20 15:23:34
Group: Applications/Archiving
Source RPM: p7zip-9.20.1-1.el6.rf.src.rpm
Size: 14.84 MB
Packager: Dag Wieers < dag_wieers_com>
Summary: Very high compression ratio file archiver

Description:
p7zip is a port of 7za.exe for Unix. 7-Zip is a file archiver with a very high
compression ratio. The original version can be found at http://www.7-zip.org/.
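As a quick orientation, a minimal usage sketch for the bundled 7za binary (file and path names are illustrative; -mx sets the compression level, -mmt enables multithreading):

    # create an archive with maximum compression, using multiple threads
    7za a -mx=9 -mmt=on archive.7z /path/to/files

    # extract, preserving directory structure
    7za x archive.7z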
RPM found in directory: /mirror/apt.sw.be/redhat/el6/en/x86_64/rpmforge/RPMS
May 28, 2018 | www.reddit.com
TIL pigz exists "A parallel implementation of gzip for modern multi-processor, multi-core machines" ( self.linux )
submitted 3 years ago by msiekkinen

tangre 3 years ago:
Why wouldn't gzip be updated with this functionality instead? Is there a point in keeping it separate?

ilikerackmounts 3 years ago:
There are certain file sizes where pigz makes no difference, and in general you need at least 2 cores to feel the benefits; there are quite a few reasons. That being said, pigz and its bzip2 counterpart pbzip2 can be symlinked in place when emerged with gentoo using the "symlink" use flag.

msiekkinen 3 years ago:
adam@eggsbenedict ~ $ eix pigz
[I] app-arch/pigz
Available versions: 2.2.5 2.3 2.3.1 (~)2.3.1-r1 {static symlink |test}
Installed versions: 2.3.1-r1(02:06:01 01/25/14)(symlink -static -|test)
Homepage: http://www.zlib.net/pigz/
Description: A parallel implementation of gzip

exdirrk 3 years ago:
> in general you need at least 2 cores to feel the benefits
Is it even possible to buy any single-core CPUs outside of some kind of specialized embedded system these days?

tw4 3 years ago:
Virtualization. Yes, but nevertheless it's possible to allocate only one.

FakingItEveryDay 3 years ago:
Giving a VM more than one CPU is quite a rare circumstance.

too_many_secrets 3 years ago:
Depends on your circumstances. It's rare that we have any VMs with a single CPU, but we have thousands of servers and a lot of things going on.

FakingItEveryDay 3 years ago:
You can, but often shouldn't. I can only speak for VMware here; other hypervisors may work differently. Generally you want to size your VMware VMs so that they are around 80% CPU utilization. When any VM with multiple cores needs compute power, the hypervisor will make it wait until it can free that number of CPUs, even if the task in the VM only needs one core. This makes the multi-core VM slower by having to wait longer to do its work, and also makes other VMs on the hypervisor slower, as they must all wait for it to finish before they can get a core allocated.
May 28, 2018 | hadafq8.wordpress.com
Posted on January 26, 2015 by Sandeep Shenoy. This topic is not Solaris specific, but certainly helps Solaris users who are frustrated with the single-threaded implementation of all officially supported compression tools such as compress, gzip, and zip.

pigz (pig-zee) is a parallel implementation of gzip that is well suited to modern multi-processor, multi-core machines. By default, pigz breaks up the input into multiple chunks of size 128 KB and compresses each chunk in parallel with the help of lightweight threads. The number of compression threads defaults to the number of online processors. The chunk size and the number of threads are configurable. Compressed files can be restored to their original form using the -d option of pigz or gzip. As per the man page, decompression is not parallelized out of the box, but may show some improvement compared to the existing old tools. The following example demonstrates the advantage of using pigz over gzip in compressing and decompressing a large file. e.g.:
Original file, and the target hardware.

$ ls -lh PT8.53.04.tar
-rw-r--r-- 1 psft dba 4.8G Feb 28 14:03 PT8.53.04.tar

$ psrinfo -pv
The physical processor has 8 cores and 64 virtual processors (0-63)
The core has 8 virtual processors (0-7)
The core has 8 virtual processors (56-63)
SPARC-T5 (chipid 0, clock 3600 MHz)

gzip compression.

$ time gzip --fast PT8.53.04.tar

real 3m40.125s
user 3m27.105s
sys 0m13.008s

$ ls -lh PT8.53*
-rw-r--r-- 1 psft dba 3.1G Feb 28 14:03 PT8.53.04.tar.gz

/* the following prstat, vmstat outputs show that gzip is compressing the tar file using a single thread – hence low CPU utilization. */

$ prstat -p 42510
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
42510 psft 2616K 2200K cpu16 10 0 0:01:00 1.5% gzip/1

$ prstat -m -p 42510
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
42510 psft 95 4.6 0.0 0.0 0.0 0.0 0.0 0.0 0 35 7K 0 gzip/1

$ vmstat 2
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 0 776242104 917016008 0 7 0 0 0 0 0 0 0 52 52 3286 2606 2178 2 0 98
1 0 0 776242104 916987888 0 14 0 0 0 0 0 0 0 0 0 3851 3359 2978 2 1 97
0 0 0 776242104 916962440 0 0 0 0 0 0 0 0 0 0 0 3184 1687 2023 1 0 98
0 0 0 775971768 916930720 0 0 0 0 0 0 0 0 0 39 37 3392 1819 2210 2 0 98
0 0 0 775971768 916898016 0 0 0 0 0 0 0 0 0 0 0 3452 1861 2106 2 0 98

pigz compression.

$ time ./pigz PT8.53.04.tar

real 0m25.111s    <== wall clock time is 25s compared to gzip's 3m 40s
user 17m18.398s
sys 0m37.718s

/* the following prstat, vmstat outputs show that pigz is compressing the tar file using many threads – hence busy system with high CPU utilization. */

$ prstat -p 49734
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
49734 psft 59M 58M sleep 11 0 0:12:58 38% pigz/66

$ vmstat 2
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 0 778097840 919076008 6 113 0 0 0 0 0 0 0 40 36 39330 45797 74148 61 4 35
0 0 0 777956280 918841720 0 1 0 0 0 0 0 0 0 0 0 38752 43292 71411 64 4 32
0 0 0 777490336 918334176 0 3 0 0 0 0 0 0 0 17 15 46553 53350 86840 60 4 35
1 0 0 777274072 918141936 0 1 0 0 0 0 0 0 0 39 34 16122 20202 28319 88 4 9
1 0 0 777138800 917917376 0 0 0 0 0 0 0 0 0 3 3 46597 51005 86673 56 5 39

$ ls -lh PT8.53.04.tar.gz
-rw-r--r-- 1 psft dba 3.0G Feb 28 14:03 PT8.53.04.tar.gz

$ gunzip PT8.53.04.tar.gz    <== shows that the pigz compressed file is compatible with gzip/gunzip

$ ls -lh PT8.53*
-rw-r--r-- 1 psft dba 4.8G Feb 28 14:03 PT8.53.04.tar

Decompression.

$ time ./pigz -d PT8.53.04.tar.gz

real 0m18.068s
user 0m22.437s
sys 0m12.857s

$ time gzip -d PT8.53.04.tar.gz

real 0m52.806s    <== compare gzip's 52s decompression time with pigz's 18s
user 0m42.068s
sys 0m10.736s

$ ls -lh PT8.53.04.tar
-rw-r--r-- 1 psft dba 4.8G Feb 28 14:03 PT8.53.04.tar
Of course, other tools such as Parallel BZIP2 (pbzip2), a parallel implementation of bzip2, are worth a try too. The idea here is to highlight the fact that there are better tools out there to get the job done quickly compared to the existing/old tools that are bundled with the operating system distribution.
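As a quick reference, the tunables described in the post map directly to command-line flags; a minimal sketch (file names are illustrative):

    # pigz: -p sets the thread count, -b the block size in KiB (default 128)
    pigz -p 16 -b 512 PT8.53.04.tar

    # pbzip2: -p sets the processor count, -1..-9 the compression level
    pbzip2 -p8 -9 PT8.53.04.tar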
May 21, 2018 | superuser.com
I find myself having to compress a number of very large files (80-ish GB), and I am surprised at the (lack of) speed my system is exhibiting. I get about 500 MB/min conversion speed; using top, I seem to be using a single CPU at approximately 100%.

I am pretty sure it's not (just) disk access speed, since creating a tar file (that's how the 80G file was created) took just a few minutes (maybe 5 or 10), but after more than 2 hours my simple gzip command is still not done.

In summary:

tar -cvf myStuff.tar myDir/*

took less than 5 minutes to create an 87G tar file;

gzip myStuff.tar

took two hours and 10 minutes, creating a 55G zip file.

My question: Is this normal? Are there certain options in gzip to speed things up? Would it be faster to concatenate the commands and use tar -cvfz? I saw reference to pigz - Parallel Implementation of GZip - but unfortunately I cannot install software on the machine I am using, so that is not an option for me. See for example this earlier question.

I am intending to try some of these options myself and time them - but it is quite likely that I will not hit "the magic combination" of options. I am hoping that someone on this site knows the right trick to speed things up.

When I have the results of other trials available I will update this question - but if anyone has a particularly good trick available, I would really appreciate it. Maybe gzip just takes more processing time than I realized...
UPDATE
As promised, I tried the tricks suggested below: change the amount of compression, and change the destination of the file. I got the following results for a tar that was about 4.1GB:

flag   user      system   size     sameDisk
-1     189.77s   13.64s   2.786G   +7.2s
-2     197.20s   12.88s   2.776G   +3.4s
-3     207.03s   10.49s   2.739G   +1.2s
-4     223.28s   13.73s   2.735G   +0.9s
-5     237.79s    9.28s   2.704G   -0.4s
-6     271.69s   14.56s   2.700G   +1.4s
-7     307.70s   10.97s   2.699G   +0.9s
-8     528.66s   10.51s   2.698G   -6.3s
-9     722.61s   12.24s   2.698G   -4.0s

So yes, changing the flag from the default -6 to the fastest -1 gives me a 30% speedup, with (for my data) hardly any change to the size of the zip file. Whether I'm using the same disk or another one makes essentially no difference (I would have to run this multiple times to get any statistical significance).

If anyone is interested, I generated these timing benchmarks using the following two scripts:
#!/bin/bash
# compare compression speeds with different options
sameDisk='./'
otherDisk='/tmp/'
sourceDir='/dirToCompress'

logFile='./timerOutput'
rm $logFile

for i in {1..9}
do
    /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $sameDisk $logFile
    /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $otherDisk $logFile
done

And the second script (compressWith):

#!/bin/bash
# use: compressWith sourceDir compressionFlag destinationDisk logFile
echo "compressing $1 to $3 with setting $2" >> $4
tar -c $1 | gzip -$2 > $3test-$2.tar.gz

Three things to note:
- Using /usr/bin/time rather than time, since the built-in command of bash has many fewer options than the GNU command
- I did not bother using the --format option, although that would make the log file easier to read
- I used a script-in-a-script since time seemed to operate only on the first command in a piped sequence (so I made it look like a single command...)

With all this learnt, my conclusions are:
- Speed things up with the -1 flag (accepted answer)
- Much more time is spent compressing the data than reading it from disk
- Invest in faster compression software (pigz seems like a good choice)

Thanks everyone who helped me learn all this!

You can change the speed of gzip using --fast, --best, or -# where # is a number between 1 and 9 (1 is fastest but gives less compression, 9 is slowest but gives more compression). By default gzip runs at level 6.

The reason tar takes so little time compared to gzip is that there's very little computational overhead in copying your files into a single file (which is what it does). gzip, on the other hand, is actually using compression algorithms to shrink the tar file. The problem is that gzip is constrained (as you discovered) to a single thread.
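As a concrete illustration of picking a level, a minimal sketch that archives and compresses in one pass (directory and file names are illustrative):

    # stream tar straight into gzip at level 1 instead of the default 6
    tar -cf - myDir | gzip -1 > myStuff.tar.gz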
Enter pigz, which can use multiple threads to perform the compression. An example of how to use this would be:

tar -c --use-compress-program=pigz -f tar.file dir_to_zip

There is a nice succinct summary of the --use-compress-program option over on a sister site.
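If you need to pass options such as a thread count to pigz, reasonably recent GNU tar versions accept a quoted command with arguments in --use-compress-program; with older versions, piping (as in the find example earlier on this page) is the safe fallback. A sketch, assuming such a tar:

    # limit pigz to 8 threads while tar builds the archive
    tar -cf tar.file --use-compress-program="pigz -p 8" dir_to_zip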
Thanks for your answer and links. I actually mentioned pigz in the question.

David Spillett:
> I seem to be using a single CPU at approximately 100%.
That implies there isn't an I/O performance issue but that the compression is only using one thread (which will be the case with gzip).
If you manage to achieve the access/agreement needed to get other tools installed, then 7zip also supports multiple threads to take advantage of multi-core CPUs, though I'm not sure whether that extends to the gzip format as well as its own.

If you are stuck using just gzip for the time being and have multiple files to compress, you could try compressing them individually - that way you'll use more of that multi-core CPU by running more than one process in parallel.
Be careful not to overdo it though because as soon as you get anywhere near the capacity of your I/O subsystem performance will drop off precipitously (to lower than if you were using one process/thread) as the latency of head movements becomes a significant bottleneck.
Thanks for your input. You gave me an idea (for which you get an upvote): since I have multiple archives to create, I can just write the individual commands followed by a & - then let the system take care of it from there. Each will run on its own processor, and since I spend far more time on compression than on I/O, it will take the same time to do one as to do all 10 of them. So I get "multi core performance" from an executable that's single-threaded...
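A minimal sketch of that idea in shell (archive names are illustrative): each & launches a gzip in the background, and wait blocks until they all finish.

    # one gzip process per archive, spread across the available cores
    for f in archive1.tar archive2.tar archive3.tar; do
        gzip "$f" &
    done
    wait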
[Jun 23, 2019] Utilizing multi core for tar+gzip-bzip compression-decompression Published on Jun 23, 2019 | stackoverflow.com