When analyzing a longer text, especially one you wrote yourself, it helps to read it in a different way. Here the different way is a concordance: a list of the words, word pairs, or word triples in the text together with how often each occurs.
Assume your text is provided as PDF. Convert the PDF to text using pdftotext, which is part of the poppler package. Then replace the line breaks in the text file with spaces using the C program below (called linebreak.c):
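#include <stdio.h>

/* Join lines: a single line break becomes a space, a blank line ends the paragraph */
int main(int argc, char *argv[]) {
	int c, flag = 0;	/* flag counts consecutive newlines */
	FILE *fp;

	if (argc >= 2) {
		if ((fp = fopen(argv[1],"rb")) == NULL) return 1;
	} else {
		fp = stdin;
	}

	while ((c = fgetc(fp)) != EOF) {
		if (c == '\n') {
			flag += 1;
			if (flag > 1) { putchar(c); flag = 0; }	/* second newline in a row: emit it */
			else putchar(' ');
		} else {
			flag = 0;
			putchar(c);
		}
	}

	return 0;
}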
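The first two steps can be chained in a small shell function. This is only a sketch under a few assumptions: linebreak.c has been compiled by hand (for example with cc -o linebreak linebreak.c), the resulting binary is on PATH, and pdf2flat is a made-up name that does not appear in the original scripts. Passing - as the output file tells pdftotext to write to standard output.

```shell
# pdf2flat FILE.pdf: hypothetical helper, not part of the original scripts.
# Prints the text of FILE.pdf with single line breaks replaced by spaces.
# Assumes pdftotext (poppler) and the compiled linebreak binary are on PATH.
pdf2flat() {
    pdftotext "$1" - | linebreak
}
```

A call such as pdf2flat mytext.pdf > mytext.flat (file names are placeholders) leaves one paragraph per line, ready for the Perl programs that follow.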
Then generate a list of single words with the Perl program below:
#!/bin/perl -W
# Print word concordances

use strict;
my (%H,@F);

while (<>) {
	chomp;
	s/\s+$//;	# rtrim
	@F = split;
	foreach my $w (@F) {
		$w =~ s/^\s+//;	# ltrim
		$w =~ s/\s+$//;	# rtrim
		$H{$w} += 1;
	}
}

foreach my $w (sort keys %H) {
	printf("\t%6d\t%s\n",$H{$w},$w);
}
To print all word pairs, replace the while loop above with
while (<>) {
	chomp;
	s/\s+$//;	# rtrim
	@F = split;
	for(my $i=0; $i<$#F; ++$i) {
		$F[$i] =~ s/^\s+//;	# ltrim
		$F[$i] =~ s/\s+$//;	# rtrim
		$F[$i+1] =~ s/^\s+//;	# ltrim
		$F[$i+1] =~ s/\s+$//;	# rtrim
		$H{$F[$i] . " " . $F[$i+1]} += 1;
	}
}
Similarly, for word triples, replace the loop with
while (<>) {
	chomp;
	s/\s+$//;	# rtrim
	@F = split;
	for(my $i=0; $i+1<$#F; ++$i) {
		$F[$i] =~ s/^\s+//;	# ltrim
		$F[$i] =~ s/\s+$//;	# rtrim
		$F[$i+1] =~ s/^\s+//;	# ltrim
		$F[$i+1] =~ s/\s+$//;	# rtrim
		$F[$i+2] =~ s/^\s+//;	# ltrim
		$F[$i+2] =~ s/\s+$//;	# rtrim
		$H{$F[$i] . " " . $F[$i+1] . " " . $F[$i+2]} += 1;
	}
}
Printing concordances using Perl hashes is very simple, as one can see.
Here is an example using the man page of expect, produced with the sequence of commands below:
( TERM=dumb; man expect ) | linebreak | word3concord | sort -r
The truncated result is
    16  For example, the
    13  example, the following
    12  the current process.
     9  the end of
     8  using Expectk, this
     8  this option is
     8  sent to the
     8  flag causes the
     8  body is executed
     8  Expectk, this option
     8  (When using Expectk,
     7  to the current
     7  the spawn id
     7  the most recent
     7  the current process
     7  the corresponding body
     7  option is specified
     7  is specified as
     7  corresponding body is
     7  by Don Libes,
     7  be used to
     6  set for the
     6  of the current
     6  is set for
     6  is an alias
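The plain sort -r works here because the counts are printed right-aligned in a fixed width. A numeric sort on the first field is an alternative that does not depend on the width, and head does the truncation. A minimal sketch, where topn is a made-up name:

```shell
# topn N: hypothetical helper, not part of the original scripts.
# Keeps the N most frequent entries of a concordance read from stdin.
topn() {
    sort -k1,1nr | head -n "$1"
}
```

For example, appending | topn 25 to the concordance command instead of | sort -r prints only the 25 most frequent entries.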