Yuanlu Xu's Homepage

Marching with Humility, Humor and Curiosity.

Slides: [pptx] [pdf]      Reading Report: [pdf]

General Introduction

In this talk, we introduce image parsing and video parsing through two papers co-authored by S.C. Zhu: primal sketch [1] and video primal sketch [2]. Examples of primal sketch and video primal sketch are shown in Fig.1 and Fig.2, respectively. The models employed in the papers are discussed in detail, together with their extensions. Finally, we offer some thoughts on mid-level image and video parsing.

Fig.1 An Example of primal sketch, reproduced from the project page of [1].
Fig.2 An Example of video primal sketch, reproduced from the project page of [2].

Both primal sketch and video primal sketch follow a similar framework, as shown in Fig.3. An image is segmented into two kinds of regions, a region of primitives and a region of texture, while a video is segmented into an explicit region and an implicit region.

Fig.3 Framework of primal sketch and video primal sketch.

Two categories of models are discussed in this talk: models for texture and models for textons. We summarize them as follows.

 

Texture Modeling

FRAME Model [3]

  • Minimax Entropy Principle: an approach to find the best representation of a probability distribution with the current state of knowledge.
  • Deriving FRAME Model: representing texture as a linear combination of filter response statistics based upon three assumptions and Minimax Entropy Principle.
  • Choice of Filters: inferring FRAME model by constructing the filter bank and selecting filters.
  • Synthesizing Texture: reconstructing texture by applying Gibbs sampling to the FRAME model.
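
To make the "filter response statistics" above concrete, here is a minimal sketch that computes normalized histograms of filter responses for an image, the kind of statistic FRAME matches. The tiny filter bank, the bin count and the value range are all hypothetical choices for illustration; the filter banks and the filter selection procedure in [3] are far richer, and no Gibbs sampling is attempted here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter_response(image, kernel):
    """Valid-mode 2D cross-correlation of an image with a kernel."""
    windows = sliding_window_view(image, kernel.shape)  # (H-kh+1, W-kw+1, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def response_histogram(image, kernel, bins=15, value_range=(-4.0, 4.0)):
    """Normalized marginal histogram of one filter's responses."""
    responses = np.clip(filter_response(image, kernel), *value_range)
    counts, _ = np.histogram(responses, bins=bins, range=value_range)
    return counts / counts.sum()

# Hypothetical toy filter bank (assumption, not the bank used in [3]).
FILTER_BANK = {
    "dx": np.array([[-1.0, 0.0, 1.0]]),
    "dy": np.array([[-1.0], [0.0], [1.0]]),
    "laplacian": np.array([[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]),
}

def frame_statistics(image):
    """One response histogram per filter in the bank."""
    return {name: response_histogram(image, k) for name, k in FILTER_BANK.items()}

rng = np.random.default_rng(0)
texture = rng.normal(size=(64, 64))   # stand-in for a texture image
stats = frame_statistics(texture)
```

Two images with similar `stats` would look alike to such a model even when their pixels differ, which is the sense in which texture is described statistically rather than pixel by pixel.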

ST-FRAME Model [2]

  • Three kinds of filters: static filters, motion filters, flicker filters.
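
As a rough intuition for the three filter kinds, the sketch below applies one crude stand-in for each to a video stored as a (frames, height, width) array. These are assumptions for illustration only: a spatial gradient for the static filter, a frame difference for the flicker filter, and a velocity-compensated residual for the motion filter. They are not the filters used in [2].

```python
import numpy as np

def static_response(video, t):
    """Spatial filter inside a single frame: horizontal central difference."""
    frame = video[t]
    return frame[:, 2:] - frame[:, :-2]

def flicker_response(video, t):
    """Temporal filter at fixed pixel locations: difference of consecutive frames."""
    return video[t + 1] - video[t]

def motion_residual(video, t, velocity):
    """Residual after displacing frame t by velocity = (dy, dx) pixels per frame.

    Close to zero where the content really moves with this velocity
    (circular shift, so borders wrap around).
    """
    predicted = np.roll(video[t], shift=velocity, axis=(0, 1))
    return video[t + 1] - predicted

# Toy video: a random pattern translating one pixel to the right per frame.
rng = np.random.default_rng(0)
base = rng.random((16, 16))
video = np.stack([np.roll(base, shift=t, axis=1) for t in range(3)])

right = motion_residual(video, 0, (0, 1))   # matching velocity
down = motion_residual(video, 0, (1, 0))    # wrong velocity
flicker = flicker_response(video, 0)
```

In this toy video the residual vanishes only for the matching velocity, while the plain frame difference does not, which hints at why motion and flicker are treated as separate kinds of filters.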

Texton Modeling

Sparse Coding

  • Definition and Explanation: modeling data vectors as sparse linear combinations of basis elements.
  • Energy Function: defining the energy function as a reconstruction residual term plus sparsity term.
  • Convex Optimization: the constraint on the dictionary basis (the L2-norm of each basis element is less than or equal to one) and the relaxed sparsity term (replacing the L0-norm with the L1-norm).
  • Solving Sparse Coding:
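
The energy function and the L1 relaxation listed above can be sketched as follows for a fixed dictionary. The solver shown is ISTA (iterative soft-thresholding), chosen here only because it is short; it is not claimed to be the solver used in the talk or in [1], [2]. The names `D`, `alpha`, `lam` and the toy data are assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * L1-norm."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def energy(x, D, alpha, lam):
    """Reconstruction residual term plus L1 sparsity term."""
    residual = x - D @ alpha
    return 0.5 * residual @ residual + lam * np.abs(alpha).sum()

def ista(x, D, lam, n_iter=200):
    """Minimize energy(x, D, alpha, lam) over alpha with D held fixed."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1 / Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)            # gradient of the residual term
        alpha = soft_threshold(alpha - step * grad, step * lam)
    return alpha

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50))
D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # enforce L2-norm <= 1 per basis element

true_alpha = np.zeros(50)
true_alpha[[3, 17, 41]] = [1.0, -2.0, 1.5]       # a sparse ground-truth code
x = D @ true_alpha

alpha_hat = ista(x, D, lam=0.1)
```

With the dictionary fixed, this problem is convex in `alpha`, which is what the L1 relaxation buys over the L0-norm.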

Online Sparse Coding [4]

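  • Comparison: batch dictionary learning sweeps over the whole training set at each iteration, whereas the online method of [4] processes one sample (or a small mini-batch) at a time.
  • Online Dictionary Learning: alternating between sparse coding of the newly drawn sample and a dictionary update by block-coordinate descent with warm restarts.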
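
A minimal sketch of online dictionary learning in the spirit of [4]: each incoming sample is sparse-coded on the current dictionary, two running matrices accumulate sufficient statistics, and the dictionary columns are then updated one at a time and projected back onto the unit L2 ball. The ISTA coder, the function names and all parameter values are assumptions for illustration; [4] uses a different sparse coder and adds refinements such as mini-batches.

```python
import numpy as np

def sparse_code(x, D, lam, n_iter=100):
    """L1-regularized coding of x on a fixed dictionary D (ISTA, an assumed choice)."""
    step = 1.0 / max(np.linalg.norm(D, 2) ** 2, 1e-12)
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = alpha - step * (D.T @ (D @ alpha - x))
        alpha = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return alpha

def update_dictionary(D, A, B):
    """One block-coordinate descent sweep over the columns of D."""
    for j in range(D.shape[1]):
        if A[j, j] < 1e-12:          # atom not used so far: leave it unchanged
            continue
        u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
        D[:, j] = u / max(np.linalg.norm(u), 1.0)   # project onto the unit L2 ball
    return D

def online_dictionary_learning(samples, n_atoms, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    dim = samples.shape[1]
    D = rng.normal(size=(dim, n_atoms))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((n_atoms, n_atoms))   # running sum of alpha alpha^T
    B = np.zeros((dim, n_atoms))       # running sum of x alpha^T
    for x in samples:                  # one sample at a time
        alpha = sparse_code(x, D, lam)
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        D = update_dictionary(D, A, B)  # warm start from the previous dictionary
    return D

rng = np.random.default_rng(1)
samples = rng.normal(size=(200, 16))   # toy data standing in for image patches
D = online_dictionary_learning(samples, n_atoms=32)
```

Because only `A` and `B` are kept between samples, memory use does not grow with the number of samples seen, which is the practical appeal over batch learning.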

Adapted Version in Primitive Modeling

  • Eight types of primitives: blob, terminator, edge, ridge, multi-ridge, corner, junction, cross.

Adapted Version in Explicit Region Modeling

  • Two dictionaries: a dictionary for trackable and sketchable regions and a dictionary for trackable and non-sketchable regions.

Inference Algorithm

Sketch Pursuit

  • Probability Model: sparse coding residual error plus FRAME modeling residual error plus dictionary coding length plus FRAME coding length.
  • Two phases: a Data-Driven (Gestalt laws) approach.
  • Phase 1: sequentially adding most prominent strokes into sketch graph.
  • Phase 2: refining the sketch graph by reversible graph operators.
  • Solving Sparse Coding: solving by an Expectation-Maximization (EM)-like algorithm.
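
The greedy flavour of Phase 1 ("sequentially adding the most prominent strokes") can be illustrated with generic matching pursuit. This is an analogy only, not the authors' sketch pursuit: strokes are replaced by hypothetical unit-norm atoms, there is no sketch graph, and the reversible graph operators of Phase 2 are not modeled.

```python
import numpy as np

def greedy_pursuit(signal, atoms, min_gain=1e-6, max_atoms=10):
    """Repeatedly add the atom that reduces the squared residual the most.

    atoms: (dim, n_atoms) array with unit-norm columns (assumption).
    Stops when the best available gain drops below min_gain.
    """
    residual = signal.astype(float).copy()
    selected = []
    for _ in range(max_atoms):
        scores = atoms.T @ residual
        best = int(np.argmax(np.abs(scores)))
        gain = scores[best] ** 2          # drop in squared error for a unit-norm atom
        if gain < min_gain:
            break
        selected.append((best, float(scores[best])))
        residual -= scores[best] * atoms[:, best]
    return selected, residual

# Toy example: an orthonormal set of atoms and a signal built from two of them.
rng = np.random.default_rng(0)
atoms, _ = np.linalg.qr(rng.normal(size=(8, 8)))
signal = 3.0 * atoms[:, 2] - 1.0 * atoms[:, 5]

selected, residual = greedy_pursuit(signal, atoms)
```

The most prominent component is picked first, and the loop stops on its own once no remaining atom yields a worthwhile gain.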

Reviews, Problems, and Vista

Problem in Video Primal Sketch

We list the experimental results in [2] as follows. As shown in Table 1, most of the model parameters are spent on representing explicit regions.

As shown in Table 4, most of the error of video primal sketch comes from reconstructing explicit regions (i.e. the sparse coding residual error).

As discussed in the texton modeling section, the authors develop a special dictionary for representing trackable and non-sketchable regions. This raises a question: since modeling implicit regions costs both fewer parameters and less error, why are trackable and non-sketchable regions categorized as explicit regions?

A Philosophy Problem

Probability model for the primal sketch representation:

Probability model for the video primal sketch representation:

The two representations above share a similar definition: a probability model with two inconsistent energy measurements (i.e. pixel-wise energy loss and statistics-wise energy loss). As a result, solving the above problems suffers from great computational complexity.
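
The inconsistency between the two measurements can be seen in a small sketch. Here an intensity histogram stands in for the filter-response statistics of a texture model (an assumption made to keep the example short), and a circularly shifted copy of a random texture plays the role of a reconstruction.

```python
import numpy as np

def pixelwise_loss(a, b):
    """Sum of squared pixel differences, as in a sparse coding residual."""
    return float(np.sum((a - b) ** 2))

def statistics_loss(a, b, bins=16, value_range=(0.0, 1.0)):
    """L1 distance between normalized intensity histograms."""
    ha, _ = np.histogram(a, bins=bins, range=value_range)
    hb, _ = np.histogram(b, bins=bins, range=value_range)
    return float(np.abs(ha / a.size - hb / b.size).sum())

rng = np.random.default_rng(0)
texture = rng.random((32, 32))                # values in [0, 1)
shifted = np.roll(texture, shift=5, axis=1)   # same pixels, different positions

pixel_error = pixelwise_loss(texture, shifted)
stat_error = statistics_loss(texture, shifted)
```

The shifted copy has exactly the same statistics, so `stat_error` is zero, yet `pixel_error` is large. The two losses judge the same pair of images very differently, so a single objective that mixes them has no common scale on which the two terms can be traded off.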

Viewed philosophically, the above problem is similar to the debate about contrary vs. uniform (also described as an eternal debate by S.C. Zhu). Since the current research philosophy tends toward metaphysics, we usually treat an image as separate regions without any connections. I believe that, in the not too distant future, more effective representations of texture and texton should take consistent forms.

Vista

Is that even possible? I believe so. In quantum physics, wave-particle duality explains the relationship between the contrary and the uniform. By analogy, texture and texton should coexist in the atoms of images and videos: pixels.

According to the Schrödinger equation, the particle position we observe is the integral of a probability wave. In my opinion, a new intuition for video parsing is this: trackable and sketchable motion is represented as the integral of a single probability distribution, while textured motion is represented as the integral of the composition of several probability waves.

Limited by current computer hardware, the above representation is definitely unsolvable. But I believe, one day, WE WILL REACH THE FINAL SOLUTION. ^_^

Reference

  • [1] Primal Sketch: Integrating Texture and Structure. C.E. Guo, S.C. Zhu and Y.N. Wu. Computer Vision and Image Understanding (CVIU), 106(1), 5-19, 2007.
  • [2] Video Primal Sketch: A Generic Middle-Level Representation of Video. Z. Han, Z. Xu and S.C. Zhu. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2011.
  • [3] FRAME: Filters, Random field And Maximum Entropy - Towards a Unified Theory for Texture Modeling. S.C. Zhu, Y.N. Wu and D.B. Mumford. International Journal of Computer Vision (IJCV), 27(2), 107-126, 1998.
  • [4] Online Dictionary Learning for Sparse Coding. J. Mairal, F. Bach, J. Ponce and G. Sapiro. In Proc. of International Conference on Machine Learning (ICML), 2009.