


Friday, 11 November 2016

'split' command in Linux to break large file into smaller chunks

In this article, we look at the split command in Linux, which breaks a large file into smaller pieces.


To demonstrate this, we create a file testfile.txt using the seq command. For those who do not know seq, it prints a sequence of numbers, which we redirect into a file. Let's do it.

# Dump 'seq' output to 'testfile.txt'
[root@LinuxBox split_test]# seq 10000 > testfile.txt

# First 10 lines
[root@LinuxBox split_test]$ head testfile.txt
1
2
3
4
5
6
7
8
9
10

# Last 10 lines
[root@LinuxBox split_test]$ tail testfile.txt
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000

1. Basic use of split

The most basic use of split is without any option. In this case, we pass only the file name as an argument to split, as shown below. Once it has run, use the ls command to list the smaller parts of the file.

[root@LinuxBox split_test]$ split testfile.txt

# Large file has been split into number of smaller files
[root@LinuxBox split_test]$ ls
testfile.txt  xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj

We can see that a number of files with names in the format x-- have been created. To make sure that they are parts of the original file, we check their line counts and their contents.

# Original file -> 10000 Lines
# 10 parts -> 1000 lines each
[root@LinuxBox split_test]$ wc -l *
10000 testfile.txt
 1000 xaa
 1000 xab
 1000 xac
 1000 xad
 1000 xae
 1000 xaf
 1000 xag
 1000 xah
 1000 xai
 1000 xaj
20000 total

# Check the contents of first part
[root@LinuxBox split_test]$ head xaa
1
2
3
4
5
6
7
8
9
10

# Check the contents of last file
[root@LinuxBox split_test]$ tail xaj
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000
[root@LinuxBox split_test]$ 
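Looking at the first and last chunk is a quick sanity check. A stricter check is to join all chunks again and compare the result with the original. The following is a minimal sketch, assuming GNU coreutils; the scratch directory and the file name rejoined.txt are made up for this example.

```shell
# Work in a throwaway directory so that x?? matches only our chunks.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 10000 > testfile.txt      # same test file as in the article
split testfile.txt            # default: 1000 lines per chunk, names xaa, xab, ...

# The shell expands x?? in alphabetical order, which is the order
# in which split wrote the chunks.
cat x?? > rejoined.txt

# cmp -s prints nothing and returns 0 when both files are byte-for-byte identical.
if cmp -s testfile.txt rejoined.txt; then
    result="identical"
else
    result="different"
fi
echo "$result"
```

If the chunks are complete and in order, this prints identical.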

In this case, the file has been split into 10 smaller chunks based on the number of lines, so that every chunk contains 1000 lines. Instead, you might want the file to be split into a specific number of chunks, say 5. Let's see how to do that.

2. Split a file in 'n' smaller parts - Option -n

We can define the number of parts a file should be split into using option -n. The syntax is split -n [No. of chunks] [file name]. Let's create 5 chunks of our file testfile.txt.

# Specify the number of chunks
[root@LinuxBox split_test]$ split -n 5 testfile.txt

# There are 5 chunks created
[root@LinuxBox split_test]$ ls
testfile.txt  xaa  xab  xac  xad  xae

# Their line counts vary, because -n divides the file by bytes, not by lines
[root@LinuxBox split_test]$ wc -l *
10000 testfile.txt
 2177 xaa
 1955 xab
 1956 xac
 1955 xad
 1957 xae
20000 total

# But together they hold the same data as the original
[root@LinuxBox split_test]$ head xaa
1
2
3
4
5
6
7
8
9
10
[root@LinuxBox split_test]$ tail xae
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000
[root@LinuxBox split_test]$ 

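So, this command has created 5 chunks of the file for us. Because -n divides the file by size in bytes, the chunks hold different numbers of lines, and a line may even be cut in two at a chunk boundary. Put together, though, the chunks have the same contents as the original file. Next, we see how chunks can be created based on the size of every chunk.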
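If cutting a line in two is a problem, GNU split accepts the form -n l/N, which still creates N chunks but breaks only at line boundaries. The following is a minimal sketch, assuming GNU coreutils and a scratch directory created just for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 10000 > testfile.txt

# l/5 = five chunks, split only at line boundaries (GNU split).
split -n l/5 testfile.txt

wc -c x??     # byte sizes: roughly equal
wc -l x??     # line counts: differ, because later lines are longer

# Every chunk should end with a newline, i.e. no line was cut in half.
# Command substitution strips a trailing newline, so the result is
# empty exactly when the last byte of the chunk is a newline.
intact=yes
for f in x??; do
    if [ -n "$(tail -c 1 "$f")" ]; then
        intact=no
    fi
done
echo "all lines intact: $intact"
```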

3. Split a file into chunks of equal sizes - Option -b

We've seen how a file can be split based on number of lines and number of chunks. Now we see how to split a file based on the size of every chunk, so as to create chunks of equal size. For this, we use option -b as split -b [size] [file name]. The size is taken as bytes by default; GNU split also accepts a suffix such as K or M for larger units.

# We specify the chunk size to be 10000 bytes
[root@LinuxBox split_test]$ split -b 10000 testfile.txt

# It creates 5 chunks for us
[root@LinuxBox split_test]$ ll
total 108
-rw-r--r--. 1 root root 48894 Nov 11 14:09 testfile.txt
-rw-r--r--. 1 root root 10000 Nov 11 14:38 xaa
-rw-r--r--. 1 root root 10000 Nov 11 14:38 xab
-rw-r--r--. 1 root root 10000 Nov 11 14:38 xac
-rw-r--r--. 1 root root 10000 Nov 11 14:38 xad
-rw-r--r--. 1 root root  8894 Nov 11 14:38 xae

We can see that there are 5 chunks created, 4 of them with a size of 10000 bytes and the last one with the leftover data. So far, we can split a file based on the size of each chunk and the number of chunks. What about the number of lines in each chunk? 1000 lines is only the default value, and we can change it as needed.
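As a side note on -b, the size does not have to be a plain byte count. With GNU split, a suffix such as K multiplies the number by 1024. The following is a minimal sketch, assuming GNU coreutils and a scratch directory; it also works out how many chunks to expect.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 10000 > testfile.txt      # 48894 bytes

# 10K means 10 * 1024 = 10240 bytes per chunk (GNU split).
split -b 10K testfile.txt

ls -l x??

# Expected chunk count: total size divided by chunk size, rounded up.
total=$(wc -c < testfile.txt)
chunk=10240
expected=$(( (total + chunk - 1) / chunk ))
echo "expected chunks: $expected"
```

For the 48894-byte test file this gives 5 chunks, the last one smaller than the rest.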

4. Creating chunks with 'n' lines each - Option -l

With the -l option of split, we can set the number of lines each chunk should contain. The syntax is the same, only with a different option this time. Let's split the file with each chunk having 1200 lines.

# Specify the number of lines -> 1200
[root@LinuxBox split_test]$ split -l 1200 testfile.txt

[root@LinuxBox split_test]$ ls
testfile.txt  xaa  xab  xac  xad  xae  xaf  xag  xah  xai

[root@LinuxBox split_test]$ wc -l *
10000 testfile.txt
 1200 xaa
 1200 xab
 1200 xac
 1200 xad
 1200 xae
 1200 xaf
 1200 xag
 1200 xah
  400 xai
20000 total

# Verify their contents
[root@LinuxBox split_test]$ head xaa
1
2
3
4
5
6
7
8
9
10
[root@LinuxBox split_test]$ tail xai
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000
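The chunk count and the size of the last chunk above follow from simple integer arithmetic. A minimal sketch, with variable names chosen only for this example:

```shell
total=10000      # lines in testfile.txt
per_chunk=1200   # value passed to -l

# Round up: a partly filled last chunk still counts as a chunk.
chunks=$(( (total + per_chunk - 1) / per_chunk ))

# Lines left over for the last chunk.
last=$(( total % per_chunk ))
# If the division is exact, the last chunk is a full one.
if [ "$last" -eq 0 ]; then
    last=$per_chunk
fi

echo "chunks: $chunks, lines in last chunk: $last"
# prints: chunks: 9, lines in last chunk: 400
```

This matches the listing: 8 full chunks of 1200 lines (xaa to xah) and xai with the remaining 400.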

5. Numeric suffixes - Option -d

We have seen that the names of the chunks are alphabetical, in the format x--, where each - is a letter. We can change the suffix to digits so that the names read x00, x01 and so on (which is easier to follow), using option -d as below.

# Numeric Suffixes
[root@LinuxBox split_test]$ split -d testfile.txt

[root@LinuxBox split_test]$ ls
testfile.txt  x00  x01  x02  x03  x04  x05  x06  x07  x08  x09

6. Suffix length - Option -a

We can also change the suffix length using option -a, so that x01 would read as x0001 if we specify a suffix length of 4. Let's check this.

# Suffix Length = 4
[root@LinuxBox split_test]$ split -d -a 4 testfile.txt

[root@LinuxBox split_test]$ ls
testfile.txt  x0000  x0001  x0002  x0003  x0004  x0005  x0006  x0007  x0008  x0009
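The options above can be combined, and split also takes an optional second argument that replaces the default x prefix. The following is a minimal sketch, assuming GNU coreutils; the prefix part_ and the scratch directory are made up for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 10000 > testfile.txt

# -l 2500 : 2500 lines per chunk
# -d      : numeric suffixes
# -a 3    : suffixes are 3 characters long
# part_   : hypothetical prefix used instead of the default "x"
split -d -a 3 -l 2500 testfile.txt part_

ls part_*
```

This should leave four chunks, part_000 to part_003, next to testfile.txt.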

That's it! Thank you.
