Pattern matching
In awk, simple matching goes like this using the ~ operator. (all examples use myFile.txt)
Code: Select all
C:\> type myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
C:\>awk "/divID/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
The above says to find any lines that has the string "divID" . For pattern matching, the regex pattern to find is usually enclosed in / /.
If you want case-insensitive search , use the IGNORECASE variable
Code: Select all
C:\>awk "BEGIN{IGNORECASE=1}/divid/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
Setting IGNORECASE to 0 toggles it back to case-sensitive.
If you want to find all records with 2nd column starting with "A", then
Code: Select all
C:\>awk -F"," "$2 ~ /^A/ {print}" myFile.txt
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
First, give the -F"," option because the file is "," delimited. Then use $2 because its the 2nd column. Then using the regex /^A/. "^" means "starts with". After that "{print}" action will print the relevant records.
In awk, you can negate matches using !~ operator. For example , you want to find
records that doesn't have "DET" as the 3rd field
Code: Select all
C:\>awk -F"," "$3 !~ /DET/{print}" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
If you just want to find records that doesn't have the string "DET", just do a !/DET/ using the "!" operator
Code: Select all
C:\>awk -F"," "!/DET/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
These are very simple examples on using regex operator ~, !~ for searching strings.
String replacement
Awk provides the sub() and gsub() functions to replace strings in files
The syntax for sub() is
sub(regexp, replacement [, target])
for example, replace "LAN" with "NAL"
Code: Select all
C:\>awk "{sub(\"LAN\",\"NAL\", $0); print }" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,[color=#800000]NAL[/color],1,W,N,1,57,36,21
1981,NL,[color=#800000]NAL[/color],2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
sub() only replaces one occurence of the string. For global replacement, use gsub() which has the same syntax as sub().
To replace the "BAL" string from the 4th line only, use NR==4 as the "pattern". then use sub().
Code: Select all
C:\>awk "NR==4 { sub(\"BAL\",\"LAB\") } {print}" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,[color=#800000]LAB[/color],1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
.....
to be continued
- berserker