Text Analysis using Concordance

When analyzing longer text, especially if this text was written by oneself, it helps to read the text in a different way, here using a concordance.

Assume your text is provided as PDF. Convert PDF to text using pdftotext, which part of package poppler. Replace line breaks in text file with spaces using below C program (called linebreak.c):

#include <stdio.h>

int main(int argc, char *argv[]) {
        int c, flag=0;
        FILE *fp;

        if (argc >= 2) {
                if ((fp = fopen(argv[1],"rb")) == NULL)
                        return 1;
        } else {
                fp = stdin;
        }

        while ((c = fgetc(fp)) != EOF) {
                if (c == '\n') {
                        flag += 1;
                        if (flag > 1) { putchar(c); flag = 0; }
                        else putchar(' ');
                } else {
                        flag = 0;
                        putchar(c);
                }
        }

        return 0;
}

Then generate a list of (single) words with below Perl program:

#!/bin/perl -W
# Print word concordances

use strict;

my (%H,@F);

while (<>) {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        foreach my $w (@F) {
                $w =~ s/^\s+//; # ltrim
                $w =~ s/\s+$//; # rtrim
                $H{$w} += 1;
        }
}

foreach my $w (sort keys %H) {
        printf("\t%6d\t%s\n",$H{$w},$w);
}

To print all word pairs replace above loop with

while (<>) {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        for(my $i=0; $i<$#F; ++$i) {
                $F[$i] =~ s/^\s+//;     # ltrim
                $F[$i] =~ s/\s+$//;     # rtrim
                $F[$i+1] =~ s/^\s+//;   # ltrim
                $F[$i+1] =~ s/\s+$//;   # rtrim
                $H{$F[$i] . " " . $F[$i+1]} += 1;
        }
}

Similar, for word triples replace the loop with

while (<>) {
        chomp;
        s/\s+$//;       # rtrim
        @F = split;
        for(my $i=0; $i+1<$#F; ++$i) {
                $F[$i] =~ s/^\s+//;     # ltrim
                $F[$i] =~ s/\s+$//;     # rtrim
                $F[$i+1] =~ s/^\s+//;   # ltrim
                $F[$i+1] =~ s/\s+$//;   # rtrim
                $F[$i+2] =~ s/^\s+//;   # ltrim
                $F[$i+2] =~ s/\s+$//;   # rtrim
                $H{$F[$i] . " " . $F[$i+1] . " " . $F[$i+2]} += 1;
        }
}

Printing concordances using Perl hashes is very simple, as one can see.

Here is an example from the man-page of expect using below sequence of commands:

( TERM=dumb; man expect ) | linebreak | word3concord | sort -r

Truncated result is

            16  For example, the
            13  example, the following
            12  the current process.
             9  the end of
             8  using Expectk, this
             8  this option is
             8  sent to the
             8  flag causes the
             8  body is executed
             8  Expectk, this option
             8  (When using Expectk,
             7  to the current
             7  the spawn id
             7  the most recent
             7  the current process
             7  the corresponding body
             7  option is specified
             7  is specified as
             7  corresponding body is
             7  by Don Libes,
             7  be used to
             6  set for the
             6  of the current
             6  is set for
             6  is an alias

Contributing to Hugo Static

The discussion forum for Hugo contains a description: Hugo development – how to contribute code. Also see Contributing to Hugo.

1. Preparation

First set GOPATH as

export GOPATH=$HOME/tmp/H

then

cd $GOPATH

Fetch source with go get

time go get -u -v github.com/spf13/hugo

takes around 1-2 minutes as it has to download almost 200MB.

Now change to the Hugo source code and compile

cd src/github.com/spf13/hugo/
time make hugo

Compilation from scratch takes roughly 1-2 minutes. Recompiling a single file usually takes less than 10 seconds.

In the same directory, run test-cases with

time make check

which takes less than a minute.

All timings are on an AMD FX(tm)-8120 Eight-Core Processor clocked with 3.1 GHz running Linux 4.11.3, and using Go 1.8.3.

2. Fork in Github, git branch and pull-request

Fork https://github.com/spf13/hugo by pressing the “Fork” icon:

Move original Git repository out of your way, clone the new fork, add or modify files as required, add, and commit them:

cd $GOPATH/src/github.com/spf13/
mv hugo hugo.original
time git clone git@github.com:eklausme/hugo.git

cd hugo
git branch YOURNAME
git checkout YOURNAME

go fmt
git add YOURFILE
git commit

A git clone of hugo alone takes less than 10 seconds. Watch out to run go fmt before git add.

Contributors are asked to provide single commits. In case you have multiple, then squash them into one, i.e., git rebase -i and git push -f.

Finally press the pull-request button in Github:

Be prepared to wait weeks or even months before your pull-request will be accepted or even rejected, so patience is required.

Converting WordPress Export File to Hugo

I have written on the Hugo static site generator here. Now I have written a migration program in the Go programming language to convert from WordPress export format to Hugo format. This program wp2hugo.go is in GitHub. It can be freely downloaded and does not need any further dependencies, except, of course, Go. The Go software is in Arch Linux or Ubuntu.

To convert a blog from WordPress you have to create an export file.

If the blog is not too voluminous one downloads a single XML-file which contains all posts and pages. If the blog in question is larger then you will receive an e-mail from WordPress.com that you can download a ZIP which contains two or more XML files in them. If you have such a ZIP-file, then unpack it, for example by using p7zip. Then run

go run wp2hugo.go XML1 XML2 ...

Continue reading

Exporting Exchange/Outlook GAL to vCard

This post is about the Microsoft Exchange GAL, i.e., the global address list. The task is to export the data in the GAL to vCard format.

Microsoft Outlook stores local caches of the GAL in %userprofile%\Local Settings\Application Data\Microsoft\Outlook, see Administering the offline address book in Outlook. On my computer they look like this

 Listing of D:\Users\...\AppData\Local\Microsoft\Outlook\Offline Address Books\...

21.05.2016  20:44    <DIR>          .
21.05.2016  20:44    <DIR>          ..
21.05.2016  20:44         3.818.260 uanrdex.oab
21.05.2016  20:44           686.956 ubrowse.oab
21.05.2016  20:44        56.310.184 udetails.oab
21.05.2016  20:44                20 updndex.oab
21.05.2016  20:44         1.373.676 urdndex.oab
21.05.2016  20:44            25.915 utmplts.oab
               6 Files,      62.215.011 Bytes

Continue reading

GCC 6.1 Compiler Optimization Level Benchmarks

In Effect of Optimizer in gcc on Intel/AMD and Power8 I measured speed ratios between optimized and non-optimized C code of three on Intel/AMD, and eight on Power8 (PowerPC) for integer calculations. For floating-point calculations the factors were two and three, respectively.

Michael Larabel in GCC 6.1 Compiler Optimization Level Benchmarks: -O0 To -Ofast + FLTO measured various optimization flags of the newest GCC.

For a Poisson solver the speed ratio between optimized and non-optimized code was five.

HimenoBenchmarkGCC61

Convert ASCII to Hex and vice versa in C and Excel VBA

In Downloading Binary Data, for example Boost C++ Library I already complained about some company policies regarding the transfer of binary data. If the openssl command is available on the receiving end, then things are pretty straightforard as the aforementioned link shows, in particular you then have Base64 encoding. If that is not the case but you have a C compiler, or at least Excel, then you can work around it.

C program ascii2hex.c converts from arbitrary data to hex, and vice versa. Excel VBA (Visual Basic for Applications) ascii2hex.xls converts from hex to arbitrary data.

To convert from arbitrary data to a hex representation

ascii2hex -h yourBinary outputInHex

Back from hex to ASCII:

ascii2hex -a inHex outputInBinary

Continue reading