Friday, 30 March 2018

AWK Programming Tutorial - Field separator and Field references

Awk field separator and field references: this is the third article in our tutorial series on awk. In the first article we introduced awk, and in the second one we wrote a Hello World program in awk. In this article, we will learn how awk separates fields and how to reference them.


Referencing Fields and Records

In the first article from this tutorial series, Introduction to awk, we covered the following points:
  • awk assumes that its input is structured data
  • It interprets each line from the input file(s) as a Record
  • Each line contains strings/words separated (or delimited) by whitespace or some other character. These separators are referred to as delimiters.
  • Each of those strings/words separated by the delimiter is called a Field.

Let's consider a familiar example to understand records, fields and delimiters: the /etc/passwd file:

messagebus:x:107:111::/var/run/dbus:/bin/false
uuidd:x:108:112::/run/uuidd:/bin/false
sshd:x:110:65534::/var/run/sshd:/usr/sbin/nologin
foouser:x:1001:1001:,,,:/home/foouser:/bin/bash

In the above file, each line is interpreted as a record. Since every word/string is separated by a colon ( : ), the colon is the delimiter, and each word separated by it, e.g. foouser, 1001, /bin/bash, is a field.

In awk, we reference each field using the $ operator followed by a number or an awk variable. To keep things simple here, we will learn more about awk variables in later articles. Thus, we can reference the first field of a record with $1, the second field with $2, the third field with $3, and so on. $0 references the entire record (i.e. the whole input line).
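
Before moving to a file-based example, here is a quick standalone illustration of the difference between a single field reference and $0. The input line is an ad-hoc one fed in via echo, not one of the tutorial's files:

$ echo "one two three" | awk '{ print $2 }'
two

$ echo "one two three" | awk '{ print $0 }'
one two three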

Let's take a look at the following example. We have an input file result.txt with contents as below [snipped]:

Student Subject Marks
James Biology 31
Velma Biology 43
Kibo Biology 81
Louis Biology 11
Phyllis Biology 18
Zenaida Biology 55
Gillian Biology 38
Constance Biology 16
Giselle Biology 73

We can see that each line is a record and each record has 3 fields. Now let's reference the records and the individual fields using their respective identifiers.

# Referencing first field
$ awk '{ print $1 }' result.txt
Student
James
Velma
Kibo
...
...

# Referencing second field
$ awk '{ print $2 }' result.txt
Subject
Biology
Biology
Biology
...
...

# Referencing third field
$ awk '{ print $3 }' result.txt 
Marks
31
43
81
...
...

# Referencing multiple fields (in a different order)
$ awk '{ print $3, $1, $2 }' result.txt
Marks Student Subject
31 James Biology
43 Velma Biology
81 Kibo Biology
...
...

# Referencing a record
$ awk '{ print $0 }' result.txt
Student Subject Marks
James Biology 31
Velma Biology 43
Kibo Biology 81
...
...

Field Separator

In the above examples, we did not specify any field separator or delimiter in the awk command, so we can conclude that awk uses whitespace as its default field separator. awk also allows us to set a field separator of our own choice with the -F option followed by the delimiter. Let's check this with the /etc/passwd file, whose fields are delimited by a colon.

# /etc/passwd file contents (snipped)
$ cat /etc/passwd
...
messagebus:x:107:111::/var/run/dbus:/bin/false
uuidd:x:108:112::/run/uuidd:/bin/false
sshd:x:110:65534::/var/run/sshd:/usr/sbin/nologin
foouser:x:1001:1001:,,,:/home/foouser:/bin/bash
...

$ awk -F ':' '{ print $3, $1, $7 }' /etc/passwd
...
107 messagebus /bin/false
108 uuidd /bin/false
110 sshd /usr/sbin/nologin
1001 foouser /bin/bash
...
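
The -F option is not limited to a colon; any delimiter we choose can be supplied. As a quick ad-hoc illustration (the comma-separated line below is just an example piped in via echo, not one of the tutorial's files):

$ echo "alpha,beta,gamma" | awk -F ',' '{ print $2 }'
beta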

While writing an awk script, we can change the field separator using the awk variable FS. We need to instruct awk to use the custom delimiter before it starts reading lines from the input file, and this is where the BEGIN block comes in handy: it is executed before any input lines are read. Similarly, there is an END block, which is executed after all the lines from the input file have been read. Both the BEGIN and END blocks are optional.
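
To see the order in which these blocks run, here is a minimal sketch reusing result.txt from above (output snipped in the same way as the earlier examples):

$ awk 'BEGIN { print "before any record" } { print $1 } END { print "after all records" }' result.txt
before any record
Student
James
Velma
...
...
after all records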

So, we can write an awk script passwd.awk as:

BEGIN { FS = ":" }
{
    print $3, $1, $7
}

As covered in our first tutorial (link), we can run the instructions from this script using the -f option, as below:

$ awk -f passwd.awk /etc/passwd
...
107 messagebus /bin/false
108 uuidd /bin/false
110 sshd /usr/sbin/nologin
1001 foouser /bin/bash
...

To make the output more readable, we can introduce a tab ( \t ) character between the output fields.

$ cat passwd.awk
BEGIN { FS = ":" }
{
    print $3 "\t" $1 "\t" $7
}

$ awk -f passwd.awk /etc/passwd
107	messagebus	/bin/false
108	uuidd	/bin/false
110	sshd	/usr/sbin/nologin
1001	foouser	/bin/bash

By default, all the instructions in the script are executed on every single line of the input file. To execute these instructions only on selected lines, we can introduce pattern matching by enclosing a regular expression within slashes ( /REGEX/ ). awk will then execute the instructions from the script only on those lines matching the regex.

To verify this, let's use our result.txt file again. From the entire list of students and their marks in various subjects, we can filter only the records of students who scored exactly 50 marks, whatever the subject. So we use 50 as the pattern to match, as shown below:

$ awk ' /50/ {print $1"\t"$2"\t"$3} ' result.txt
Ori	Chemistry	50
Hyatt	Mathematics	50

Or we can filter only those records where the student's name starts with the string Jo. For this, we can use the regex ^Jo with the match operator ( ~ ) against the first field ( $1 ), which holds the student's name.

$ awk ' $1 ~ /^Jo/ { print $1"\t"$2"\t"$3 }' result.txt 
John	Biology	55
Jonas	Mathematics	40

Or we can negate the match using the bang, or logical NOT, operator ( ! ) as shown below (the result is too long, hence not shown):

$ awk ' $1 !~ /^Jo/ { print $1"\t"$2"\t"$3 }' result.txt
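
Pattern matching also combines with a custom field separator. As a hedged example tying both topics together (the exact output depends on your system's /etc/passwd; from the snippet shown earlier, at least foouser should appear among the matches):

$ awk -F ':' ' /bash/ { print $1 } ' /etc/passwd
...
foouser
...

Note that /bash/ matches anywhere in the record; to match only the shell field, the ~ operator against $7 would be the stricter choice.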

That's all for the scope of this article. Please share your feedback and suggestions in the comments section below and stay tuned for more articles on this topic.
