I am a post-doctoral researcher in machine learning, working with Prof Zoubin Ghahramani on the Automatic Statistician Project.
I am a member of King's College at the University of Cambridge.
My academic interests span machine learning and artificial intelligence, including information theory (such as error correction and data compression), probabilistic programming, automated model construction, and the theory of computation. I built an Inverse Calculator and a self-learning guessing-game AI called Sensipede.
You can find my website here: https://q4.github.io/
Publications
Compressing Sets and Multisets of Sequences
Christian Steinruecken, March 2015. (IEEE Transactions on Information Theory). IEEE. DOI: 10.1109/TIT.2015.2392093. ISSN: 0018-9448. Note: a previous version was published at the Data Compression Conference 2014.
Abstract:
This article describes lossless compression algorithms for multisets of sequences, taking advantage of the multiset’s unordered structure. Multisets are a generalisation of sets where members are allowed to occur multiple times. A multiset can be encoded naively by simply storing its elements in some sequential order, but then information is wasted on the ordering. We propose a technique that transforms the multiset into an order-invariant tree representation, and derive an arithmetic code that optimally compresses the tree. Our method achieves compression even if the sequences in the multiset are individually incompressible (such as cryptographic hash sums). The algorithm is demonstrated practically by compressing collections of SHA-1 hash sums, and multisets of arbitrary, individually encodable objects.
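To make the savings concrete, here is a minimal Python sketch (my illustration, not the paper's implementation) of the order-invariant tree idea for a multiset of fixed-length bitstrings: each trie node encodes how many of its elements continue with a 0-bit, using a Binomial(n, 1/2) code, which is ideal when the bits are i.i.d. uniform, as with hash sums. The function name and the uniform-bit assumption are mine.

```python
import math
import random

def trie_code_length(strings, depth=0):
    # Ideal code length (bits) of the order-invariant trie for a
    # multiset of equal-length bitstrings, assuming i.i.d. uniform bits.
    # (Illustrative sketch, not the paper's implementation.)
    n = len(strings)
    if n == 0 or depth == len(strings[0]):
        return 0.0
    left = [s for s in strings if s[depth] == '0']
    right = [s for s in strings if s[depth] == '1']
    # Encode the left-child count k under Binomial(n, 1/2):
    # cost = -log2( C(n,k) / 2^n ) = n - log2 C(n,k) bits.
    k = len(left)
    cost = n - math.log2(math.comb(n, k))
    return (cost + trie_code_length(left, depth + 1)
                 + trie_code_length(right, depth + 1))

# 100 random 32-bit strings: naive sequential storage costs 3200 bits;
# the trie saves about log2(100!) ~ 525 bits by discarding the ordering.
mset = [''.join(random.choice('01') for _ in range(32)) for _ in range(100)]
print(100 * 32, trie_code_length(mset))
```

For a multiset of n distinct elements, the total saving is exactly log2(n!) bits, matching the information content of the discarded ordering.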
Improving PPM with dynamic parameter updates
Christian Steinruecken, Zoubin Ghahramani, David MacKay, April 2015. (In Proceedings of the Data Compression Conference). Edited by Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, James A. Storer. Snowbird, UT, USA. IEEE Computer Society. ISSN: 1068-0314.
Abstract:
This article makes several improvements to the classic PPM algorithm, resulting in a new algorithm with superior compression effectiveness on human text. The key differences between our algorithm and classic PPM are that (A) rather than the original escape mechanism, we use a generalised blending method with explicit hyper-parameters that control the way symbol counts are combined to form predictions; (B) different hyper-parameters are used for classes of different contexts; and (C) these hyper-parameters are updated dynamically using gradient information. The resulting algorithm (PPM-DP) compresses human text better than all currently published variants of PPM, CTW, DMC, LZ, CSE and BWT, with runtime only slightly slower than classic PPM.
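As a rough illustration of idea (C), here is a toy Python sketch (my own simplification, not the published PPM-DP algorithm): an order-1 model is blended with an order-0 model through an explicit weight w = sigmoid(theta), and theta is updated by gradient ascent on the log-probability of each observed symbol. The class name, learning rate, and Laplace smoothing are all assumptions of the sketch.

```python
import math
from collections import defaultdict

ALPHABET = 256  # byte alphabet

class BlendedModel:
    # Toy sketch: order-1 counts blended with order-0 counts via an
    # explicit weight w = sigmoid(theta); theta adapts by gradient
    # ascent on log-probability. (Not the published PPM-DP algorithm.)

    def __init__(self, lr=0.1):
        self.theta = 0.0                                 # blending hyper-parameter
        self.lr = lr                                     # learning rate (assumed)
        self.c1 = defaultdict(lambda: defaultdict(int))  # order-1 counts
        self.n1 = defaultdict(int)
        self.c0 = defaultdict(int)                       # order-0 counts
        self.n0 = 0

    def update(self, ctx, x):
        # Predict x in context ctx, adapt theta and counts,
        # and return the code length of x in bits.
        w = 1.0 / (1.0 + math.exp(-self.theta))
        p1 = (self.c1[ctx][x] + 1.0) / (self.n1[ctx] + ALPHABET)  # Laplace-smoothed
        p0 = (self.c0[x] + 1.0) / (self.n0 + ALPHABET)
        p = w * p1 + (1.0 - w) * p0
        # d/dtheta log p = (p1 - p0) * w * (1 - w) / p
        self.theta += self.lr * (p1 - p0) * w * (1.0 - w) / p
        self.c1[ctx][x] += 1; self.n1[ctx] += 1
        self.c0[x] += 1; self.n0 += 1
        return -math.log2(p)

model, prev, bits = BlendedModel(), None, 0.0
for x in b"abracadabra abracadabra":
    bits += model.update(prev, x)
    prev = x
print(bits / 23, "bits per symbol")  # 23 symbols in the demo string
```

The real algorithm uses many context classes, each with its own hyper-parameters; the single global theta here only shows the gradient-update mechanic.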
Compressing combinatorial objects
Christian Steinruecken, January 2016.
Abstract:
Most of the world’s digital data is currently encoded in a sequential form, and compression methods for sequences have been studied extensively. However, there are many types of non-sequential data for which good compression techniques are still largely unexplored. This paper contributes insights and concrete techniques for compressing various kinds of non-sequential data via arithmetic coding, and derives re-usable probabilistic data models from fairly generic structural assumptions. Near-optimal compression methods are described for certain types of permutations, combinations and multisets; and the conditions for optimality are made explicit for each method.
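For flavour, here is a short Python sketch (my illustration, not the paper's code) of the uniform-model baselines: the optimal code lengths for a permutation, a combination, and a multiset, plus a standard Lehmer-code ranking that realises the permutation bound by mapping each permutation to a unique integer in [0, n!).

```python
import math
from math import comb, factorial

def permutation_rank(perm):
    # Lehmer-code ranking: maps a permutation of distinct items to a
    # unique integer in [0, n!), i.e. an optimal code of ceil(log2 n!)
    # bits under a uniform model. (Standard construction, shown here
    # for illustration only.)
    n = len(perm)
    rank, remaining = 0, sorted(perm)
    for i, x in enumerate(perm):
        j = remaining.index(x)
        rank += j * factorial(n - 1 - i)
        remaining.pop(j)
    return rank

print(permutation_rank([2, 0, 1]))  # -> 4, an integer in [0, 3!)

# Optimal code lengths (bits) under uniform models:
n, k = 20, 8
print(math.log2(factorial(n)))        # permutation of n items
print(math.log2(comb(n, k)))          # k-subset (combination) of n items
print(math.log2(comb(n + k - 1, k)))  # size-k multiset over an n-symbol alphabet
```

Under non-uniform structural assumptions these counts become probabilities, and an arithmetic coder can approach the corresponding entropies in the same way.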