# Unix sort Issue

I wondered why Unix sort behaved strangely.

printf "A0 1\nA  1\n" | sort


delivered

A0 1
A  1


Of course, I expected A to come before A0. This was strange, as printf "A1 1\nA 1\n" | sort produced

A  1
A1 1


just as expected. Also, printf "A0\nA\n" | sort orders A before A0, as expected.

Solution: Use LC_ALL before sort. So

printf "A0 1\nA  1\n" | LC_ALL=C sort


delivered

A  1
A0 1


I realized this when I called sort with --debug flag,

printf "A0 1\nA  1\n" | sort --debug


which shows the empleyed locale:

sort: using ‘en_US.UTF-8’ sorting rules
A0 1
____
A  1
____


To check that my expected sort-order was indeed the “right” order, I wrote the following simple Perl-script to sort, which confirmed my understanding of ASCII sorting:

#!/bin/perl -W
use strict;
my @F = <>;     # slurp
for my $i (sort @F) { print$i; }


# Parallelization and CPU Cache Overflow

In the post Rewriting Perl to plain C the runtime of the serial runs were reported. As expected the C program was a lot faster than the Perl script. Now running programs in parallel showed two unexpected behaviours: (1) more parallelizations can degrade runtime, and (2) running unoptimized programs can be faster.

In the following we use the C program siriusDynCall and the Perl script siriusDynUpro which was described in above mentioned post. The program or scripts reads roughly 3GB of data. Before starting the program or script all this data has been already read into memory by using something like wc or grep.

1. AMD Processor. Running 8 parallel instances, s=size=8, p=partition=1(1)8:

for i in 1 2 3 4 5 6 7 8; do time siriusDynCall -p$i -s8 * > ../resultCp$i & done
real 50.85s
user 50.01s
sys 0


Merging the results with the sort command takes a negligible amount of time

sort -m -t, -k3.1 resultCp* > resultCmerged


Best results are obtained when running just s=4 instances in parallel:

$for i in 1 2 3 4 ; do /bin/time -p siriusDynCall -p$i -s4 * > ../dyn4413c1p\$i & done
real 33.68
user 32.48
sys 1.18


# Rewriting Perl to plain C

Perl script was running too slow. Rewriting it in C made it 20 times faster.

1. Problem statement. Analyze call-tree dependency in COBOL programs. There are 77 million lines of COBOL code in ca. 30,000 files. These 30,000 COBOL programs could potentially include 74,000 COPY-books comprising 10 million lines of additional code. COBOL COPY-books are analogous to C header-files. So in total there are around 88 million lines of COBOL code. Just for comparison: the Linux kernel has ca. 20 million lines of code.

COBOL program analysis started with a simple Perl script. This Perl script is less than 200 lines, including comments. This script produced the desired dependency information.

Wading through all this COBOL code took up to 25 minutes in serial mode, and 13 minutes using 4 cores on an HP EliteBook notebook using Intel Skylake i7-6600U clocked 2.8 GHz. It took 36 minutes on an AMD FX-8120 clocked with 3.1 GHz. This execution time was deemed too long to see any changes in the output changing something in the Perl script. All runs are on Arch Linux 4.14.11-1 SMP PREEMPT.

2. Result. Rewriting the Perl script in C resulted in a speed improvement of factor 20 when run in serial mode, i.e., run times are now 110s on one core. It runs in 32s when using 8 cores on an AMD FX-8120. C program uses taylormade hashing routines.

# Hashing with Iterator in C

Many scripting languages, like Awk, Perl, Python, Lua, and so on, have hash-tables already built-in. I needed hashing functionality similar to Perl’s hash-variables. In particular I needed lookup, insert, and iterating over all entries.

1. Structure. Each hash element contains a key and a value. Additionally there is an indicator whether storage allocation using malloc() for key and value are done during hash insert, or are done elsewhere, e.g., constant strings. All hash elements having the same hash value, see below, are inserted into a singly linked list. Additionally all elements are part of singly linked list.

struct hashElem {
int dup;        // memory ownership: 0, 0x01, 0x02, 0x03
char *key, *val;
struct hashElem *next;  // next hash element with same key or NULL
struct hashElem *nextAll;       // for iterator over all hash entries
};


There is a hash header which contains the start of the list, and the actual table.

typedef struct {
const char *name;       // only for diagnostic and debugging
struct hashElem **tab;  // hash table with size times (hashElem*)
struct hashElem *first; // first element in hash
} hash_t;


# Filippo Mantovani: ARM for HPC

On 23-Oct-2017 Filippo Mantovani held a talk in Darmstadt on “Mobile technology for production-ready high-performance computing systems: The path of the Mont-Blanc project”. Unfortunately I was unable to attend, but Mr. Mantovani sent me his Darmstadt Seminar slides. As his slides and documents are very interesting to people using or intending to use ARM in HPC, I copy these documents here, so they are easily available. I also copied a report on “MB3_D6.4 Report on application tuning and optimization on ARM platform“.

# Optimal Product Portfolio

## 1. Prerequisites

Assume one owns $n>0$ different products, $n\in\mathbb{N}$, usually $n\approx 100$. Each product has a cost $c_i>0$, $c_i\in\mathbb{R}$, associated with this ownership, $i=1,\ldots, n$. Let $N=\left\{1,\ldots,n\right\}$, and $\mathbb{P}(N)$ be the powerset of $N$, i.e., the set of all subsets. The powerset $\mathbb{P}(N)$ has $2^n$ elements.

There are two real-valued functions $f$ and $g$ operating on $\mathbb{P}(N)$, i.e., $f,g\colon\mathbb{P}(N)\to\mathbb{R^+}$. Function $g(I)$ denotes the sum of the cost of the product set given by $I\in\mathbb{P}(N)$, $g$ is therefore easy to evaluate. I.e.,

$\displaystyle{ g(I) = \sum_{i\in I} c_i }$

Function $f(I)$ denotes the buying cost if one can dispose the set of products given by $I\in\mathbb{P}(N)$. This means, it costs money to get rid of the products given by set $I$. As the products have interrelationships, $f$ is more costly to evaluate than $g$ and is given by some table lookup. Function $f$ does not depend on $c_i$.

Some notes on $f$: Assume a non-negative matrix $A=(a_{ij})$, where $a_{ij}$ denotes the cost to decouple the $i$-th product from the $j$-th. Then

$\displaystyle{ f(I) = \sum_{i\in I} \sum_{j\in N} a_{ij} }$

Usually $a_{ij}=0$, if $i, i.e., $A$ is upper triangular, because decoupling product $i$ from product $j$ does not need to be done again for decoupling the other direction from $j$ to $i$. More zeros in above sum are necessary if decoupling product $i$ is sufficient if done once and once only. In practical application cases $f$ depends on a real-valued parameter $\gamma\in[0,1]$ giving different solution scenarios.

Example: For three products let $c_1=8, c_2=4, c_3=5$, thus cost of ownership is $g(\emptyset)=0$, $g(\left\{1\right\})=8$, $g(\left\{1,2\right\})=8+4=12$, $g(\left\{1,3\right\})=8+5=13$, $g(\left\{1,2,3\right\})=8+4+5=17$, and so on for the rest of $2^3-5=3$ sets. Function $f$ gives positive values in a similar fashion, although, as mentioned before, these values are not related to $c_i$.

## 2. Problem statement

Find the set $I$ of products which gives the least cost, i.e., find $I$ for

$\displaystyle{ \min_{I\in\mathbb{P}(N)}\left(f(I)-g(I)\right) }$

Rationale: One invests $f(I)$ to get rid of the products in set $I$ but saves ownership costs given by $g(I)$.

$I$ does not need to be unique. If $f$ depends on $\gamma$ then $I$ will likely also depend on $\gamma$.

## 3. Challenges

As $n$ can be “large”, evaluating all possible combinations becomes prohibitive on todays computers. Maybe one of our gentle readers knows a method to find the optimum, or maybe just a nearby solution to above problem, or a probabilistic approach. Any comment is welcome.

# Statistics of this Blog: Crossed 60.000 Views

This blog was viewed more than 60.000 times since its inception and had more than 45.000 visitors. I wrote about the development of this blog as follows:

1. 2014/05/07: 5,000 Views
2. 2014/09/10: 10,000 Views
3. 2014/12/27: 15,000 Views
4. 2015/04/23: 20,000 Views
5. 2016/01/24: 30,000 Views
6. 2016/11/12: 40,000 Views
7. 2017/05/18: 50,000 views

The averages per month are:

Between 1,000 and 2,000 views per month in bar chart form:

As Countries, USA, Germany, and India at the top:

Referrers, Google by far the most important referrer:

The most popular blog posts are: