Saturday, 24 March 2018

AWK Programming Tutorial - Introduction to awk

In some of our articles we have learned about one of the most important and useful utilities in Linux for text processing - sed. Using sed, we can:
  • Edit one or more files in place
  • Simplify and automate file edits on one or more files at a time without using vi
  • Write scripts to process and convert text
Link to articles on sed : sed command in Linux

In this article, we introduce you to another immensely powerful text processing utility in Bash - awk. awk programming is very useful for manipulating structured data, especially for creating summarized reports out of data that has some structure (like a table, with columns and rows). Below are some of the common tasks we can perform using awk:

  • Interpret a text file like a SQL database table and process its fields and records.
  • Perform arithmetic and logical operations
  • Perform string operations
  • Create and use conditionals and loops
  • Define and use functions
  • Execute a shell command and process its output
  • Extract, analyze and create reports from the data
With the power and range that awk offers, we can often replace an entire shell script with a single awk one-liner.
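As a quick taste of such a one-liner, here is a sketch (the input data is hypothetical, piped in via printf) that sums the numbers in the second column - a task that would otherwise need a shell loop:

```shell
# Sum the second column of some sample input; sum is an awk variable,
# and the END block runs after all lines have been processed.
printf 'a 10\nb 20\nc 30\n' | awk '{ sum += $2 } END { print sum }'
# prints: 60
```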

How sed and awk operate

Both awk and sed work in a similar way:
  • Read an input line from a file
  • Store it in a buffer (or create a copy of it, whatever is easier to understand)
  • Run instructions from the script on the buffered (or copied) input line. This won't actually change the original input line.
  • Replace the original data with the processed data (in the case of sed).

sed and awk Instructions
  • A sed or awk instruction consists of two components - a pattern and a process
  • The pattern part is a regular expression
  • The process part is the action we wish to perform
  • It reads the first line from the input file and the first instruction from the script
  • It then matches the pattern against the line.
  • If there is no match, the current instruction is skipped and the next one is picked up.
  • If a match is found, the instruction is executed on the line.
  • Once all the instructions have been executed on the current line, the cycle repeats for all the remaining lines of the input file.
  • As soon as all instructions are executed on the current line, sed prints the output. awk's behaviour is a bit different: subsequent instructions in the script control further processing of the line.
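The difference in printing behaviour can be sketched with a small comparison (the input here is hypothetical, piped in via printf):

```shell
# sed prints every input line by default, after applying its instructions:
printf 'one\ntwo\n' | sed 's/one/ONE/'
# prints: ONE
#         two

# awk prints nothing unless an instruction tells it to:
printf 'one\ntwo\n' | awk '{ gsub(/one/, "ONE") }'
# prints nothing

# Adding an explicit print emits each (possibly modified) line:
printf 'one\ntwo\n' | awk '{ gsub(/one/, "ONE"); print }'
# prints: ONE
#         two
```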

How to use awk?

So far, it is clear that both sed and awk run instructions on a single line from the input file at a time. We can provide these instructions either on the command line itself, or store them in a file. Depending on the scenario, we have two syntaxes for the awk command:

When using command line:

$ awk '<instructions>' <input_file/s>

When using a script file:

$ awk -f <script_file> <input_file/s>

The awk instructions in the script file have the same components we discussed earlier - pattern and process. The latter component is a bit more complex than in sed, as it can contain variables, conditionals, loops, functions etc.
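As a minimal sketch of a script file with a variable and an END block (the file name count.awk and the sample data are hypothetical):

```shell
# Write a two-instruction awk script to a file...
cat > count.awk <<'EOF'
/Biology/ { count++ }               # pattern: /Biology/; process: increment a variable
END       { print "matches:", count }  # runs once, after the last input line
EOF

# ...and run it against some sample input with -f:
printf 'James Biology 31\nKibo Chemistry 81\n' | awk -f count.awk
# prints: matches: 1
```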

awk works best on structured data, so it considers each line from the input file(s) as a record. Being structured data, each line will have strings/words delimited, or separated, by spaces, tabs or some other character (a comma in the case of CSV files). awk interprets each of those strings/words as a field. We can reference these fields using their column numbers as $1, $2, $3 and so on, where $1 represents the first field of the record, $2 the second field and so on. $0 references the whole line or record. Let's understand this from the example below.
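A quick sketch of these field references, using echo to supply a throwaway record:

```shell
# $2 is the second whitespace-separated field of the record:
echo 'alpha beta gamma' | awk '{ print $2 }'
# prints: beta

# $0 is the whole record:
echo 'alpha beta gamma' | awk '{ print $0 }'
# prints: alpha beta gamma
```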

Consider that we have an input file with contents as below (snipped):

$ head result.txt
Student Subject Marks
James Biology 31
Velma Biology 43
Kibo Biology 81
Louis Biology 11
Phyllis Biology 18
Zenaida Biology 55
Gillian Biology 38
Constance Biology 16
Giselle Biology 73

In order to print the first field of every line from the input file, we use the reference variable $1 as:

$ awk '{ print $1 }' result.txt
Student
James
Velma
Kibo
(output snipped)

Note that we have not used any pattern here, so the instructions will simply be executed on each and every line of the input file. In the next example, we take a look at using a regular expression to process only selected lines (those that match the pattern) from the input file. Consider that we just need to find the lines that match the pattern /Ki/ (remember to enclose any pattern inside forward slashes).

$ awk '/Ki/' result.txt 
Kibo Biology 81
Kirsten Biology 16
Kieran Biology 45
Kitra Chemistry 47

In the above example, we have not mentioned any instruction, so the default instruction is used: print every line that matches the pattern.

In the next example, we will use both the pattern and the instruction. Let's say we want to extract the marks from those records that match the pattern /Ki/. Remember, marks is the 3rd field, so we should reference it accordingly.

$ awk '/Ki/ { print $3 }' result.txt 
81
16
45
47

So far, we have not used delimiters in our examples. But, just for your information, awk interprets whitespace as the delimiter by default. In order to specify a custom delimiter, say a comma for a CSV file, we can use the option -F (and not -f, which we use to provide a script file as input) as below:

$ awk -F, '/some_pattern/ {print $2}' input_csv_file
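To make this concrete, here is a sketch with some hypothetical CSV data piped in via printf:

```shell
# -F, tells awk to split each record on commas instead of whitespace,
# so $2 is the second comma-separated field of the matching line:
printf 'James,Biology,31\nKibo,Biology,81\n' | awk -F, '/Kibo/ { print $2 }'
# prints: Biology
```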

We can also use multiple instructions in the awk command. These instructions need to be separated by semicolons as shown below:

$ awk '/Ki/ { print $3; print $2; print $1 }' result.txt 

Awesome! Don't worry if it's too much to digest. We have many articles coming up covering each of these topics. Just give it a try with simple awk commands. You may face some errors while using awk; you just need to correct your syntax and you should be good. Normally, the causes of errors are:

  • Not opening/closing the braces ({ })
  • Not opening/closing the single quotes (' ')
  • Not opening or closing the slashes for patterns (/ /)

Please put your comments or suggestions in the comments section below and stay tuned for more articles on Awesome awk!

