Reply to: Neural Network Back-Propagation Revisited with Ordinary Differential Equations

This article is worth reading: Neural Network Back-Propagation Revisited with Ordinary Differential Equations

I replied:

Thank you very much for this very informative article providing many links, the Python code, and the results.

According to the mentioned paper by Owens and Filkin, the expected speedup from using a stiff ODE solver should be a factor of two to 1,000. Your results do demonstrate that the number of iterations is clearly far lower than for gradient descent and all its variants. Unfortunately, the reported run times are far slower than for gradient descent. This was not expected.

The following factors could play a role in this:

  1. All the methods search for a local minimum (the gradient should be zero; the Hessian is not checked or known). They do not necessarily find a global minimum. So when these different methods are run, each probably iterates towards a different local minimum, and therefore each method has likely done something different.
  2. I wonder why you have used the zvode solver from scipy.integrate, which is meant for complex-valued systems. I would recommend vode, or even better lsoda. Your chosen tolerances are quite strict (atol=1e-8, rtol=1e-6), at least for the beginning. Such tight tolerances may force smaller step-sizes than actually required. In particular, as the start values are random, there seems to be no compelling reason to use strict tolerances right from the start. It is also known that strict tolerances might lead the ODE code to use higher-order BDF methods, which in turn are not stable enough for highly stiff ODEs; only BDF methods up to order 2 are A-stable. So atol=1e-4, rtol=1e-4 might show different behaviour (see the sketch after this list).
  3. Although it is expected that the resulting ODE is stiff, it might be the case that in your particular setting the system was only very mildly stiff. Your charts give some indication of that, at least in the beginning. This can be checked by simply re-running with a non-stiff code, e.g., dopri5, again with the relaxed tolerances.
  4. As can be seen in your chart, above a learning accuracy of 80% the ODE solver takes a huge number of steps. It would be interesting to know whether at this stage there are many Jacobian evaluations for the ODE solver, or whether there are many rejected steps due to Newton convergence failures within the ODE code.
  5. scipy.integrate apparently does not allow for automatic differentiation. Therefore the ODE solver must resort to numerical differencing to evaluate the Jacobian, which is very slow. Using something like Zygote (automatic differentiation in Julia) might help here.
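
For illustration, here is a minimal sketch (not the article's code) of how points 2 and 3 could be tried with scipy.integrate.ode: the gradient flow dw/dt = -∇L(w) integrated with lsoda and relaxed tolerances. The loss and its gradient below are a toy stand-in for the network's back-propagated gradient.

```python
import numpy as np
from scipy.integrate import ode

# Toy ill-conditioned quadratic loss L(w) = 0.5 * w^T A w; its gradient A w
# stands in for the back-propagated gradient of the real network.
A = np.diag([1.0, 10.0, 100.0])

def grad_loss(w):
    return A @ w

def rhs(t, w):
    return -grad_loss(w)          # gradient flow dw/dt = -grad L(w)

w0 = np.random.randn(3)           # random start values, as in the article
solver = ode(rhs)
# lsoda switches automatically between stiff and non-stiff methods;
# alternatives: set_integrator("vode", method="bdf") or "dopri5".
solver.set_integrator("lsoda", atol=1e-4, rtol=1e-4)
solver.set_initial_value(w0, 0.0)

while solver.successful() and solver.t < 10.0:
    solver.integrate(solver.t + 1.0)
    print(solver.t, np.linalg.norm(grad_loss(solver.y)))
```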

I always wondered why the findings of Owens and Filkin were not widely adopted. Your paper provides an answer, although a negative one. Taking into account the above points, I still have hope that stiff ODE solvers hold great potential for machine learning with regard to performance. You already mentioned that ODE solvers provide the benefit that hyperparameters no longer need to be estimated.

Optimal Product Portfolio

1. Prerequisites

Assume one owns n>0 different products, n\in\mathbb{N}, usually n\approx 100. Each product has a cost c_i>0, c_i\in\mathbb{R}, associated with this ownership, i=1,\ldots, n. Let N=\left\{1,\ldots,n\right\}, and \mathbb{P}(N) be the powerset of N, i.e., the set of all subsets. The powerset \mathbb{P}(N) has 2^n elements.

There are two real-valued functions f and g operating on \mathbb{P}(N), i.e., f,g\colon\mathbb{P}(N)\to\mathbb{R}^+. Function g(I) denotes the sum of the costs of the product set given by I\in\mathbb{P}(N); g is therefore easy to evaluate:

\displaystyle{      g(I) = \sum_{i\in I} c_i  }

Function f(I) denotes the cost one has to pay to dispose of the set of products given by I\in\mathbb{P}(N). This means it costs money to get rid of the products in the set I. As the products have interrelationships, f is more costly to evaluate than g and is given by some table lookup. Function f does not depend on the c_i.

Some notes on f: Assume a non-negative matrix A=(a_{ij}), where a_{ij} denotes the cost to decouple the i-th product from the j-th. Then

\displaystyle{      f(I) = \sum_{i\in I} \sum_{j\in N} a_{ij}  }

Usually a_{ij}=0 if i<j, i.e., A is lower triangular, because decoupling product i from product j does not need to be done again in the other direction, from j to i. Further zeros appear in the above sum if decoupling product i is sufficient when done once and once only. In practical applications f depends on a real-valued parameter \gamma\in[0,1], giving different solution scenarios.

Example: For three products let c_1=8, c_2=4, c_3=5; thus the cost of ownership is g(\emptyset)=0, g(\left\{1\right\})=8, g(\left\{1,2\right\})=8+4=12, g(\left\{1,3\right\})=8+5=13, g(\left\{1,2,3\right\})=8+4+5=17, and so on for the remaining 2^3-5=3 sets. Function f takes positive values in a similar fashion, although, as mentioned before, these values are not related to the c_i.

2. Problem statement

Find the set I of products which gives the least cost, i.e., find I for

\displaystyle{      \min_{I\in\mathbb{P}(N)}\left(f(I)-g(I)\right)  }

Rationale: One invests f(I) to get rid of the products in set I but saves ownership costs given by g(I).

The set I does not need to be unique. If f depends on \gamma, then I will likely also depend on \gamma.
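
For the three-product example from section 1, a brute-force sketch in Python illustrates the objective. This is for illustration only: the decoupling-cost matrix A, and hence f, is made up here, and exhaustive enumeration is of course only feasible for small n.

```python
from itertools import combinations

c = {1: 8, 2: 4, 3: 5}                 # ownership costs c_i from the example
N = sorted(c)

# Hypothetical decoupling-cost matrix A = (a_ij) with a_ij = 0 for i < j;
# the nonzero values are invented purely for illustration.
A = [[1.0, 0.0, 0.0],
     [2.0, 1.0, 0.0],
     [4.0, 3.0, 2.0]]

def g(I):                              # g(I) = sum of c_i over i in I
    return sum(c[i] for i in I)

def f(I):                              # f(I) = sum_{i in I} sum_{j in N} a_ij
    return sum(A[i - 1][j - 1] for i in I for j in N)

# Enumerate all 2^n subsets of N and pick the one minimizing f(I) - g(I).
subsets = [J for k in range(len(N) + 1) for J in combinations(N, k)]
best = min(subsets, key=lambda I: f(I) - g(I))
print(list(best), f(best) - g(best))
```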

3. Challenges

As n can be “large”, evaluating all 2^n possible combinations becomes prohibitive on today's computers. Maybe one of our gentle readers knows a method to find the optimum of the above problem, or at least a nearby solution, or a probabilistic approach. Any comment is welcome.

Five-Value Theorem of Nevanlinna

In German this theorem is known as the Fünf-Punkte-Satz. The theorem is astounding. It says: if two meromorphic functions share five values ignoring multiplicity, then both functions are equal. Two functions, f(z) and g(z), are said to share the value a if f(z) - a = 0 and g(z) - a = 0 have the same solutions (zeros).

More precisely, suppose f(z) and g(z) are meromorphic functions and a_1, a_2, \ldots, a_5 are five distinct values. If

\displaystyle{      E(a_i,f) = E(a_i,g), \qquad 1\le i\le 5,  }

where

\displaystyle{      E(a,h) = \left\{ z | h(z) = a \right\},  }

then f(z) \equiv g(z).

For a generalization see Some generalizations of Nevanlinna’s five-value theorem. The statement above has been reproduced from that paper.

The identity theorem makes assumptions on values in the codomain, namely that the two functions agree on a set with an accumulation point, and concludes that the functions are identical. The five-value theorem makes assumptions on sets in the domain of the functions in question, namely the preimages E(a_i,\cdot).

Taking e^z and e^{-z} as examples, one sees that these two meromorphic functions share the four values a_1=0, a_2=1, a_3=-1, a_4=\infty but are not equal. So sharing four values is not enough.
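
Spelled out with the notation above: neither function attains the values 0 or \infty, and the preimages of 1 and of -1 coincide,

\displaystyle{      E(0,e^z)=E(\infty,e^z)=\emptyset=E(0,e^{-z})=E(\infty,e^{-z}),  }

\displaystyle{      E(1,e^z)=\left\{2\pi ik \mid k\in\mathbb{Z}\right\}=E(1,e^{-z}), \qquad E(-1,e^z)=\left\{(2k+1)\pi i \mid k\in\mathbb{Z}\right\}=E(-1,e^{-z}),  }

yet e^z\not\equiv e^{-z}.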

There is also a four-value theorem of Nevanlinna. If two meromorphic functions, f(z) and g(z), share four values counting multiplicities, then f(z) is a Möbius transformation of g(z).

According to Frank and Hua: We simply say “2 CM + 2 IM implies 4 CM”. So far it is still not known whether “1 CM + 3 IM implies 4 CM”; here CM means counting multiplicities, IM means ignoring multiplicities.

For a full proof there are books, which are unfortunately behind a paywall, e.g.,

  1. Gerhard Jank, Lutz Volkmann: Einführung in die Theorie der ganzen und meromorphen Funktionen mit Anwendungen auf Differentialgleichungen
  2. Lee A. Rubel, James Colliander: Entire and Meromorphic Functions
  3. Chung-Chun Yang, Hong-Xun Yi: Uniqueness Theory of Meromorphic Functions, five-value theorem proved in §3

For an introduction to complex analysis, see for example Terry Tao:

  1. 246A, Notes 0: the complex numbers
  2. 246A, Notes 1: complex differentiation
  3. 246A, Notes 2: complex integration
  4. Math 246A, Notes 3: Cauchy’s theorem and its consequences
  5. Math 246A, Notes 4: singularities of holomorphic functions
  6. 246A, Notes 5: conformal mapping, covers Picard’s great theorem
  7. 254A, Supplement 2: A little bit of complex and Fourier analysis, proves Poisson-Jensen formula for the logarithm of a meromorphic function in relation to its zeros within a disk

Why does deep and cheap learning work so well?

Very interesting.

the morning paper

Why does deep and cheap learning work so well? Lin & Tegmark 2016

Deep learning works remarkably well, and has helped dramatically improve the state-of-the-art in areas ranging from speech recognition, translation, and visual object recognition to drug discovery, genomics, and automatic game playing. However, it is still not fully understood why deep learning works so well.

So begins a fascinating paper looking at connections between machine learning and the laws of physics – showing us how properties of the real world help to make many machine learning tasks much more tractable than they otherwise would be, and giving us insights into why depth is important in networks. It’s a paper I enjoyed reading, but my abilities stop at appreciating the form and outline of the authors’ arguments – for the proofs and finer details I refer you to the full paper.

A paradox

How do neural networks with comparatively…


Methods of Proof — Diagonalization

Very clear presentation on the uncountability of the real numbers, and the halting problem.

Further keywords: Cantor, natural numbers, real numbers, diagonalization, bijection, Turing halting problem, proof by contradiction.

Math ∩ Programming

A while back we featured a post about why learning mathematics can be hard for programmers, and I claimed a major issue was not understanding the basic methods of proof (the lingua franca between intuition and rigorous mathematics). I boiled these down to the “basic four,” direct implication, contrapositive, contradiction, and induction. But in mathematics there is an ever growing supply of proof methods. There are books written about the “probabilistic method,” and I recently went to a lecture where the “linear algebra method” was displayed. There has been recent talk of a “quantum method” for proving theorems unrelated to quantum mechanics, and many more.

So in continuing our series of methods of proof, we’ll move up to some of the more advanced methods of proof. And in keeping with the spirit of the series, we’ll spend most of our time discussing the structural form of the proofs…


On Differential Forms

Abstract. This article will give a very simple definition of k-forms, or differential forms. It requires only basic knowledge of matrices and determinants. Furthermore, a very simple proof will be given that the double exterior differentiation of k-forms vanishes.

MSC 2010: 58A10

1. Basic definitions.

We denote the submatrix of A=(a_{ij})\in\mathbb{R}^{m\times n} consisting of the rows i_1,\ldots,i_k and the columns j_1,\ldots,j_k by

\displaystyle{      [A]^{j_1\,\ldots\,j_k}_{i_1\,\ldots\,i_k} := \begin{pmatrix}  a_{i_1j_1} & \ldots & a_{i_1j_k}\\  \vdots & \ddots & \vdots\\  a_{i_kj_1} & \ldots & a_{i_kj_k}  \end{pmatrix}  }
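
As a small illustration (not part of the original article), this submatrix selection can be written in NumPy with np.ix_; the concrete matrix and index sets below are made up:

```python
import numpy as np

A = np.arange(1.0, 13.0).reshape(3, 4)   # an example 3x4 matrix
rows = [0, 2]                            # i_1, i_2 (0-based indices)
cols = [1, 3]                            # j_1, j_2 (0-based indices)

sub = A[np.ix_(rows, cols)]              # the submatrix built from these rows and columns
print(sub)
print(np.linalg.det(sub))                # its determinant, a 2x2 minor of A
```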

Continue reading

Announcement: 11th International Conference on Parallel Programming and Applied Mathematics

The conference will take place September 6-9, 2015, in Krakow, Poland.

Parallel Programming and Applied Mathematics, PPAM for short, is a biennial conference started in 1994, with the proceedings published by Springer in the Lecture Notes in Computer Science series, see PPAM. It is sponsored by IBM, Intel, Springer, AMD, RogueWave, and HP. The last conference had a fee of 420 EUR.

It is held in conjunction with the 6th Workshop on Language based Parallel Programming.

Prominent speakers are:

Continue reading

Day 2, Workshop Programming of Heterogeneous Systems in Physics

Day 2 of the conference featured the talks below. Prof. Dr. Bernd Brügmann gave a short introduction. He pointed out that Jena ranks number 10 in physics in Germany, has ca. 100,000 inhabitants, and 20,000 students.

  1. Dr. Karl Rupp, Vienna, Lessons Learned in Developing the Linear Algebra Library ViennaCL. Notes: C++ operator overloading normally creates temporaries, and special trickery is necessary to circumvent this; ViennaCL is not callable from Fortran due to C++ operator overloading; eigen.tuxfamily.org; Karl Rupp’s slides; with CUDA 5 and 6, OpenCL and CUDA are more or less on par.
  2. Prof. Dr. Rainer Heintzmann, Jena, CudaMat – a toolbox for Cuda computations. Continue reading

Day 1, Workshop Programming of Heterogeneous Systems in Physics

As announced in Workshop Programming of Heterogeneous Systems in Physics, July 2014, I attended this two-day conference in Jena, Germany. Below are speakers and pictures with my personal notes.

  1. Dipl.-Ing. Hans Pabst from Intel, Programming for the future: scaling forward with cores and vectors. Hans Pabst Continue reading

Simple Exercises for a C Programming Language Course

I sometimes teach a C programming language course. I use the following simple exercises for the students to solve on their own. Most solutions are no longer than 10-20 lines of C code.

  1. Exercising simple loop and printf(): Print a table of square root values for arguments 1 to 30.
  2. Exercising printf(), scanf(), arrays, if-statement, flag-handling: Read a list of numbers into an array, and then sort the array with Bubblesort.
  3. Continue reading

Effort Estimation Using Learning Curves

A couple of times I had to conduct effort estimations that contained a number of repetitive tasks. The question was how long it would take to finish all these tasks. For example, I had to estimate how many days it would take to program interfaces between one system and a couple of other systems to exchange some kind of data (trading products in my particular case). There were many systems involved, and many kinds of data (in my case futures, swaps, bonds, money markets, etc.).

It is common wisdom that you gain experience and become faster by applying the same solution tactics to the same kind of problem. Of course, the effort per task cannot go to zero or even approach zero: there must be a lower limit on how far the effort for each task can decrease, no matter how long we have been learning. All this is the topic of learning curves.
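
As a hypothetical illustration (a common power-law model with a plateau, not taken from this post), the effort for the k-th repetition of a task could be modelled like this:

```python
def effort(k, first=5.0, plateau=1.0, b=0.5):
    """Effort (e.g., in days) for the k-th repetition of the same task:
    starts at `first`, decays like a power law, never drops below `plateau`."""
    return plateau + (first - plateau) * k ** (-b)

# Estimated total effort for 20 similar tasks.
total = sum(effort(k) for k in range(1, 21))
print(round(total, 1))
```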

Continue reading

Electronic Mathematical Journals

The journals below, loosely related to topics in Numerical Analysis, offer unencumbered full text:

  1. Electronic Transactions on Numerical Analysis
  2. Applied Mathematics E-Notes
  3. The Electronic Journal of Linear Algebra
  4. arXiv

Unfortunately, Numerische Mathematik, BIT, SIAM Journal on Scientific Computing (SISC), SIAM Journal on Numerical Analysis (SINUM) and Linear Algebra and its Applications do not provide free, complete articles.

Physics related:

  1. Living Reviews in Relativity