Rewriting Perl to plain C

Perl script was running too slow. Rewriting it in C made it 20 times faster.

1. Problem statement. Analyze call-tree dependency in COBOL programs. There are 77 million lines of COBOL code in ca. 30,000 files. These 30,000 COBOL programs could potentially include 74,000 COPY-books comprising 10 million lines of additional code. COBOL COPY-books are analogous to C header-files. So in total there are around 88 million lines of COBOL code. Just for comparison: the Linux kernel has ca. 20 million lines of code.

COBOL program analysis started with a simple Perl script. This Perl script is less than 200 lines, including comments. This script produced the desired dependency information.

Wading through all this COBOL code took up to 25 minutes in serial mode, and 13 minutes using 4 cores on an HP EliteBook notebook using Intel Skylake i7-6600U clocked 2.8 GHz. It took 36 minutes on an AMD FX-8120 clocked with 3.1 GHz. This execution time was deemed too long to see any changes in the output changing something in the Perl script. All runs are on Arch Linux 4.14.11-1 SMP PREEMPT.

2. Result. Rewriting the Perl script in C resulted in a speed improvement of factor 20 when run in serial mode, i.e., run times are now 110s on one core. It runs in 32s when using 8 cores on an AMD FX-8120. C program uses taylormade hashing routines.
Continue reading

Advertisements

Hashing with Iterator in C

Many scripting languages, like Awk, Perl, Python, Lua, and so on, have hash-tables already built-in. I needed hashing functionality similar to Perl’s hash-variables. In particular I needed lookup, insert, and iterating over all entries.

I wrote on hashing in Hashing Just Random Numbers and Hash functions: An empirical comparison — article by Peter Kankowski. See also Revisiting hash table performance, or TommyDS: a C library of hashtables and tries.

1. Structure. Each hash element contains a key and a value. Additionally there is an indicator whether storage allocation using malloc() for key and value are done during hash insert, or are done elsewhere, e.g., constant strings. All hash elements having the same hash value, see below, are inserted into a singly linked list. Additionally all elements are part of singly linked list.

struct hashElem {
        int dup;        // memory ownership: 0, 0x01, 0x02, 0x03
        char *key, *val;
        struct hashElem *next;  // next hash element with same key or NULL
        struct hashElem *nextAll;       // for iterator over all hash entries
};

There is a hash header which contains the start of the list, and the actual table.

typedef struct {
        int size, collisions, load;
        const char *name;       // only for diagnostic and debugging
        struct hashElem **tab;  // hash table with size times (hashElem*)
        struct hashElem *first; // first element in hash
} hash_t;

Continue reading