Text Analysis using Concordance

When analyzing a longer text, especially one you wrote yourself, it helps to read it in a different way, for example through a concordance.

Assume your text is provided as a PDF. Convert the PDF to text using pdftotext, which is part of the poppler package. Then replace single line breaks in the text file with spaces using the C program below (called linebreak.c):

#include <stdio.h>

int main(int argc, char *argv[]) {
        int c, flag=0;
        FILE *fp;

        if (argc >= 2) {
                if ((fp = fopen(argv[1],"rb")) == NULL)
                        return 1;
        } else {
                fp = stdin;
        }

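        /* A single newline becomes a space; the second of two
           consecutive newlines is kept, so paragraph breaks survive. */
        while ((c = fgetc(fp)) != EOF) {
                if (c == '\n') {
                        flag += 1;
                        if (flag > 1) { putchar(c); flag = 0; }
                        else putchar(' ');
                } else {
                        flag = 0;
                        putchar(c);
                }
        }

        if (fp != stdin)
                fclose(fp);
        return 0;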
}

Then generate a list of (single) words with the Perl program below:

#!/bin/perl -W
# Print word concordances

use strict;

my (%H,@F);

while (<>) {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        foreach my $w (@F) {
                $w =~ s/^\s+//; # ltrim
                $w =~ s/\s+$//; # rtrim
                $H{$w} += 1;
        }
}

foreach my $w (sort keys %H) {
        printf("\t%6d\t%s\n",$H{$w},$w);
}

To print all word pairs, replace the while loop above with

while (<>) {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        for(my $i=0; $i<$#F; ++$i) {
                $F[$i] =~ s/^\s+//;     # ltrim
                $F[$i] =~ s/\s+$//;     # rtrim
                $F[$i+1] =~ s/^\s+//;   # ltrim
                $F[$i+1] =~ s/\s+$//;   # rtrim
                $H{$F[$i] . " " . $F[$i+1]} += 1;
        }
}

Similarly, for word triples replace the loop with

while (<>) {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        for(my $i=0; $i+1<$#F; ++$i) {
                $F[$i] =~ s/^\s+//;     # ltrim
                $F[$i] =~ s/\s+$//;     # rtrim
                $F[$i+1] =~ s/^\s+//;   # ltrim
                $F[$i+1] =~ s/\s+$//;   # rtrim
                $F[$i+2] =~ s/^\s+//;   # ltrim
                $F[$i+2] =~ s/\s+$//;   # rtrim
                $H{$F[$i] . " " . $F[$i+1] . " " . $F[$i+2]} += 1;
        }
}

Printing concordances using Perl hashes is very simple, as one can see.
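
The pair and triple loops above differ only in how many adjacent words are joined into one hash key and in the loop bound. As a minimal sketch, written in C like linebreak.c, the same key construction can be expressed for any window size n. The names ngram_windows and ngram_key are hypothetical helpers, not part of the scripts above, and counting the keys would still be left to a hash as in the Perl code.

```c
#include <stdio.h>
#include <stddef.h>

/* Number of n-word windows in a line of nwords words
   (0 if the line is shorter than n words or n is 0). */
size_t ngram_windows(size_t nwords, size_t n)
{
        return (n == 0 || nwords < n) ? 0 : nwords - n + 1;
}

/* Join words[i] .. words[i+n-1] with single spaces into out.
   Returns 0 on success, -1 if the window is out of range
   or out is too small to hold the key. */
int ngram_key(char *out, size_t outlen, const char *const words[],
              size_t nwords, size_t i, size_t n)
{
        size_t used = 0;

        if (i >= ngram_windows(nwords, n))
                return -1;
        for (size_t k = 0; k < n; ++k) {
                int w = snprintf(out + used, outlen - used, "%s%s",
                                 k ? " " : "", words[i + k]);
                if (w < 0 || (size_t)w >= outlen - used)
                        return -1;
                used += (size_t)w;
        }
        return 0;
}
```

With n = 2 this yields the same keys as the pair loop, and with n = 3 the same keys as the triple loop; a line of five words gives four pairs and three triples.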

Here is an example using the man page of expect and the following sequence of commands:

( TERM=dumb; man expect ) | linebreak | word3concord | sort -r

The truncated result is

            16  For example, the
            13  example, the following
            12  the current process.
             9  the end of
             8  using Expectk, this
             8  this option is
             8  sent to the
             8  flag causes the
             8  body is executed
             8  Expectk, this option
             8  (When using Expectk,
             7  to the current
             7  the spawn id
             7  the most recent
             7  the current process
             7  the corresponding body
             7  option is specified
             7  is specified as
             7  corresponding body is
             7  by Don Libes,
             7  be used to
             6  set for the
             6  of the current
             6  is set for
             6  is an alias

Migrating from delicious.com to WordPress

I have been a loyal user of del.icio.us since 2006. I have written on this in my post Saving URLs in del.icio.us Still Troublesome. But now enough is enough. Here is a list of annoyances:

  1. You can neither export nor import your data anymore.
  2. The service is generally slow, i.e., it takes a lot of time to just load the site in your browser.
  3. The service is sometimes not available.
  4. You cannot change URLs without deleting the entire post.
  5. The company behind the service does not answer any inquiries.
  6. The site is blocked by a number of company firewalls because it is marked as “social”.


No Perl and PHP on Mainframe from IBM

IBM no longer provides Perl for its mainframe machines, see Software withdrawal: Selected IBM System z platform products (a copy is here: IBM-Withdrawal-ENUS913-252). It looks like they have not heard that Perl is the duct tape that holds the internet together. Similarly, IBM withdrew PHP from its mainframe platform. So Wikipedia and Facebook will not run on big iron. Not that Wikipedia or Facebook ever wanted to, but now IBM has pulled the plug.

In the same vein, all IBM has to offer its customers is 32-bit COBOL on the mainframe, so customers can only use less than 2 GB of memory, see Memory Limitations with IBM Enterprise COBOL Compiler.

In earlier times IBM tried to sell its VisualAge products, which were notoriously slow and never really took off. Now they aggressively sell WebSphere.

Who makes these decisions? And who approves this?

In defense of IBM, there is a company called Rocket Software which provides Perl and PHP. So it’s like going to McDonald’s and ordering a hamburger, only for the clerk to tell you that you should buy the bread separately from the nearby bakery.

Calculating number of seats in parliament using d’Hondt’s method

Wikipedia contains an article on d’Hondt’s method for calculating the number of seats given the number of votes for each party. I wrote a short Perl program for this calculation, including the case where d’Hondt’s method by design leads to drawing lots. Its input is a list of party names and their corresponding votes. The number of seats is given with parameter -s. This implementation of d’Hondt uses integer division, i.e., it rounds each quotient down to the nearest integer (floor).
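
To make the seat allocation concrete, here is a minimal sketch in C. It is not the Perl program described above. The function name dhondt and its interface are hypothetical, and, unlike the floor division mentioned above, this sketch compares quotients exactly by cross-multiplication, so a tie is reported only when quotients are truly equal. Each round, the party with the largest quotient votes/(seats+1) receives a seat; if more parties tie for the remaining seats than there are seats left, the function stops and returns the number of seats that would have to be decided by lot.

```c
#include <assert.h>
#include <stddef.h>

/* Distribute total_seats among n parties by d'Hondt's method.
   votes[i] is the vote count of party i, seats[i] receives its seats.
   Returns 0 if all seats were assigned, otherwise the number of seats
   left over because more parties tie for them than seats remain. */
int dhondt(const long votes[], int seats[], size_t n, int total_seats)
{
        int remaining = total_seats;

        assert(n > 0 && total_seats >= 0);
        for (size_t i = 0; i < n; ++i)
                seats[i] = 0;

        while (remaining > 0) {
                /* Find a party with the largest quotient; a/b > c/d is
                   tested as a*d > c*b to avoid any rounding. */
                size_t best = 0;
                for (size_t i = 1; i < n; ++i)
                        if ((long long)votes[i] * (seats[best] + 1) >
                            (long long)votes[best] * (seats[i] + 1))
                                best = i;

                /* Snapshot the winning quotient before seats change. */
                const long long best_votes = votes[best];
                const long long best_div = seats[best] + 1;

                int tied = 0;
                for (size_t i = 0; i < n; ++i)
                        if ((long long)votes[i] * best_div ==
                            best_votes * (seats[i] + 1))
                                ++tied;

                if (tied > remaining)
                        return remaining;   /* lots must be drawn */

                /* Every party at the top quotient gets one seat. */
                for (size_t i = 0; i < n; ++i)
                        if ((long long)votes[i] * best_div ==
                            best_votes * (seats[i] + 1)) {
                                ++seats[i];
                                --remaining;
                        }
        }
        return 0;
}
```

For the made-up vote counts 100000, 80000, 30000 and 20000, eight seats are split 4, 3, 1, 0 and the function returns 0. With nine seats the last one is contested by three parties with the same quotient of 20000, so the function returns 1 and leaves the first eight seats as before.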


Working with System V IPC queues in Perl and PHP

Continuing Working with System V IPC queues from a month ago, this post shows how to access IPC queues with Perl and PHP. A typical scenario is a web application that wants an external application to process its data. In that scenario, many messages/tasks from the web application can be queued in an IPC queue for successive processing by another program that is independent of the web application and possibly has more access rights.

To use System V queues in PHP, make sure that PHP has been compiled with POSIX support. On Red Hat you need the php-process package; on Ubuntu it is present by default.
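
The queueing scenario can be sketched with the underlying C calls msgsnd and msgrcv, which are the same System V primitives that Perl and PHP wrap. This is only an illustration in C, not the Perl or PHP code this post is about. The names task_msg, TASK_MAX, enqueue_task and dequeue_task are hypothetical, and the queue id is assumed to come from msgget.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define TASK_MAX 128    /* assumed upper bound for one task text */

struct task_msg {
        long mtype;                 /* message type, must be > 0 */
        char mtext[TASK_MAX + 1];   /* payload plus room for a NUL */
};

/* Producer side (e.g. the web application): queue one task.
   Text longer than TASK_MAX-1 bytes is truncated.
   Returns 0 on success, -1 on error (for example a full queue). */
int enqueue_task(int qid, const char *text)
{
        struct task_msg m;

        m.mtype = 1;
        snprintf(m.mtext, TASK_MAX, "%s", text);
        return msgsnd(qid, &m, strlen(m.mtext), IPC_NOWAIT);
}

/* Consumer side (the external program): fetch the oldest task
   without blocking. Returns 0 on success, -1 if the queue is
   empty or an error occurred. */
int dequeue_task(int qid, char *buf, size_t buflen)
{
        struct task_msg m;
        ssize_t len;

        if (buflen == 0)
                return -1;
        len = msgrcv(qid, &m, TASK_MAX, 0, IPC_NOWAIT);
        if (len < 0)
                return -1;
        m.mtext[len] = '\0';
        snprintf(buf, buflen, "%s", m.mtext);
        return 0;
}
```

Because msgrcv is called with message type 0, tasks come out in the order they were queued. A real consumer would more likely block in msgrcv (flags 0) than poll with IPC_NOWAIT.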
