Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin, in JMLR, 2010.

Download the full text


Abstract

Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods.
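
As a rough sketch of the underlying problem (the notation here is assumed, not quoted from the paper), regularized Maxent training minimizes, over the weight vector w,

  \min_{\mathbf{w}}\; L(\mathbf{w}) \;=\; \sum_{x} \tilde{P}(x)\,\log\Big(\sum_{y} e^{\mathbf{w}^{\top}\mathbf{f}(x,y)}\Big)
      \;-\; \sum_{x,y} \tilde{P}(x,y)\,\mathbf{w}^{\top}\mathbf{f}(x,y) \;+\; \frac{1}{2\sigma^{2}}\,\|\mathbf{w}\|^{2},

where \tilde{P} denotes the empirical distributions and \mathbf{f}(x,y) the feature vector. In this view, a coordinate descent step picks a single weight w_t and (approximately) solves the one-variable subproblem \min_{z} L(\mathbf{w} + z\,\mathbf{e}_{t}), while iterative scaling methods instead minimize an upper bound on the change of L that decouples across variables.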


Bib Entry

@article{HHCL10,
  author = {Huang, Fang-Lan and Hsieh, Cho-Jui and Chang, Kai-Wei and Lin, Chih-Jen},
  title = {Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models},
  journal = {Journal of Machine Learning Research},
  year = {2010}
}

Related Publications

  1. Large Linear Classification When Data Cannot Fit In Memory

    Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin, in TKDD, 2012.
    Full Text Code Abstract BibTeX Details Best Paper Award, KDD 10
    Recent advances in linear classification have shown that for applications such as document classification, the training can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. As data cannot fit in memory, many design considerations are very different from those for traditional algorithms. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.
    @article{yu2010large,
      author = {Yu, Hsiang-Fu and Hsieh, Cho-Jui and Chang, Kai-Wei and Lin, Chih-Jen},
      title = {Large Linear Classification When Data Cannot Fit In Memory},
      journal = {ACM Transactions on Knowledge Discovery from Data},
      year = {2012}
    }
    
    Details
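
    A minimal sketch of the block-minimization idea described above (the file layout, function names, and inner learner here are assumptions, not the authors' implementation): the data are pre-split into blocks on disk, and at each step a single block is loaded and used to update the current model.

    # Hypothetical sketch: cycle over on-disk blocks, keeping only one block
    # in memory at a time and updating the model with each loaded block.
    import pickle

    def block_minimization(block_paths, update_model, model=None, outer_iters=10):
        """block_paths: pickled (X_block, y_block) files on disk;
        update_model: any in-memory learning step, e.g. a few epochs of
        SGD or dual coordinate descent on the loaded block."""
        for _ in range(outer_iters):
            for path in block_paths:
                with open(path, "rb") as f:          # disk I/O: load one block
                    X_block, y_block = pickle.load(f)
                model = update_model(model, X_block, y_block)
        return model
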
  2. Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models

    Kai-Wei Chang and Dan Roth, in KDD, 2011.
    Full Text Slides Poster Code Abstract BibTeX Details
    As the size of data sets used to build classifiers steadily increases, training a linear model efficiently with limited memory becomes essential. Several techniques deal with this problem by loading blocks of data from disk one at a time, but usually take a considerable number of iterations to converge to a reasonable model. Even the best block minimization techniques [1] require many block loads since they treat all training examples uniformly. As disk I/O is expensive, reducing the amount of disk access can dramatically decrease the training time.
    @inproceedings{ChangRo11,
      author = {Chang, Kai-Wei and Roth, Dan},
      title = {Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models},
      booktitle = {KDD},
      year = {2011}
    }
    
    Details
  3. Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

    Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin, in JMLR, 2010.
    Full Text Abstract BibTeX Details
    Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods.
    @article{HHCL10,
      author = {Huang, Fang-Lan and Hsieh, Cho-Jui and Chang, Kai-Wei and Lin, Chih-Jen},
      title = {Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models},
      journal = {Journal of Machine Learning Research},
      year = {2010}
    }
    
    Details
  4. A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification

    Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin, in JMLR, 2010.
    Full Text Code Abstract BibTeX Details
    Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.
    @article{YCHL10,
      author = {Yuan, Guo-Xun and Chang, Kai-Wei and Hsieh, Cho-Jui and Lin, Chih-Jen},
      title = {A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification},
      journal = {Journal of Machine Learning Research},
      year = {2010}
    }
    
    Details
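
    As a brief sketch of the problem class compared in this paper (notation assumed here), L1-regularized linear classification solves

      \min_{\mathbf{w}} \; \|\mathbf{w}\|_{1} \;+\; C \sum_{i=1}^{l} \xi(\mathbf{w}; \mathbf{x}_{i}, y_{i}),

    where \xi is, for example, the logistic loss \log(1 + e^{-y_{i}\mathbf{w}^{\top}\mathbf{x}_{i}}) or the squared hinge (L2) loss \max(0,\, 1 - y_{i}\mathbf{w}^{\top}\mathbf{x}_{i})^{2}; the non-differentiable \|\mathbf{w}\|_{1} term is what yields sparse solutions and complicates training.
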
  5. Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

    Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin, in JMLR, 2010.
    Full Text Code Abstract BibTeX Details
    Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements.
    @article{CHCRL10,
      author = {Chang, Yin-Wen and Hsieh, Cho-Jui and Chang, Kai-Wei and Ringgaard, Michael and Lin, Chih-Jen},
      title = {Training and Testing Low-degree Polynomial Data Mappings via Linear SVM},
      journal = {Journal of Machine Learning Research},
      year = {2010}
    }
    
    Details
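
    One way to try the idea with off-the-shelf tools (an illustration only, not the authors' code; scikit-learn's LinearSVC is backed by LIBLINEAR): expand the data with an explicit degree-2 polynomial mapping and train a linear SVM on the expanded features.

    # Explicit low-degree polynomial mapping followed by a linear SVM.
    from sklearn.datasets import load_digits
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import LinearSVC

    X, y = load_digits(return_X_y=True)
    model = make_pipeline(PolynomialFeatures(degree=2),      # explicit degree-2 mapping
                          LinearSVC(C=1.0, max_iter=5000))   # linear SVM on mapped data
    model.fit(X, y)
    print("training accuracy:", model.score(X, y))
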
  6. A Sequential Dual Method for Large Scale Multi-Class Linear SVMs

    S. Sathiya Keerthi, S. Sundararajan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin, in KDD, 2008.
    Full Text Code Abstract BibTeX Details
    Efficient training of direct multi-class formulations of linear Support Vector Machines is very useful in applications such as text classification with a huge number of examples as well as features. This paper presents a fast dual method for this training. The main idea is to sequentially traverse through the training set and optimize the dual variables associated with one example at a time. The speed of training is enhanced further by shrinking and cooling heuristics. Experiments indicate that our method is much faster than state of the art solvers such as bundle, cutting plane and exponentiated gradient methods.
    @inproceedings{KSCHL08,
      author = {Keerthi, S. Sathiya and Sundararajan, S. and Chang, Kai-Wei and Hsieh, Cho-Jui and Lin, Chih-Jen},
      title = {A Sequential Dual Method for Large Scale Multi-Class Linear SVMs},
      booktitle = {KDD},
      year = {2008}
    }
    
    Details
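
    A common direct multi-class formulation of this kind is the Crammer-Singer one, sketched below with assumed notation: one weight vector \mathbf{w}_{m} per class, trained jointly as

      \min_{\{\mathbf{w}_{m}\},\,\boldsymbol{\xi}} \;\; \frac{1}{2}\sum_{m}\|\mathbf{w}_{m}\|^{2} + C\sum_{i}\xi_{i}
      \quad\text{s.t.}\quad \mathbf{w}_{y_{i}}^{\top}\mathbf{x}_{i} - \mathbf{w}_{m}^{\top}\mathbf{x}_{i} \ge e_{i}^{m} - \xi_{i} \;\;\forall\, m, i,

    where e_{i}^{m} = 0 if m = y_{i} and 1 otherwise. The dual has one block of variables per training example, which is what makes a sequential, example-by-example dual update natural.
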
  7. A Dual Coordinate Descent Method for Large-Scale Linear SVM

    Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan, in ICML, 2008.
    Full Text Slides Code Abstract BibTeX Details Top-10 cited paper at ICML 08
    In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ε-accurate solution in O(log(1/ε)) iterations. Experiments indicate that our method is much faster than state of the art solvers such as Pegasos, TRON, SVMperf, and a recent primal coordinate descent implementation.
    @inproceedings{HCLKS08,
      author = {Hsieh, Cho-Jui and Chang, Kai-Wei and Lin, Chih-Jen and Keerthi, S. Sathiya and Sundararajan, S.},
      title = {A Dual Coordinate Descent Method for Large-Scale Linear SVM},
      booktitle = {ICML},
      year = {2008}
    }
    
    Details
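
    A simplified sketch of a dual coordinate descent update of this kind for the L1-loss (hinge) case, in dense NumPy (shrinking, random permutation, and sparse-data handling are omitted; the names are assumed):

    # Dual CD for an L1-loss linear SVM: min_a 0.5*a'Qa - e'a, 0 <= a_i <= C,
    # with Q_ij = y_i*y_j*x_i'x_j, while maintaining w = sum_i y_i*a_i*x_i.
    import numpy as np

    def dual_cd_l1_svm(X, y, C=1.0, epochs=10):
        l, n = X.shape
        alpha = np.zeros(l)
        w = np.zeros(n)
        Qii = np.einsum("ij,ij->i", X, X)            # diagonal entries x_i'x_i
        for _ in range(epochs):
            for i in range(l):
                G = y[i] * w.dot(X[i]) - 1.0         # partial gradient of the dual
                if alpha[i] == 0.0:                  # projected gradient for [0, C]
                    PG = min(G, 0.0)
                elif alpha[i] == C:
                    PG = max(G, 0.0)
                else:
                    PG = G
                if PG != 0.0 and Qii[i] > 0.0:
                    old = alpha[i]
                    alpha[i] = min(max(alpha[i] - G / Qii[i], 0.0), C)
                    w += (alpha[i] - old) * y[i] * X[i]
        return w
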
  8. Coordinate Descent Method for Large-scale L2-loss Linear SVM

    Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin, in JMLR, 2008.
    Full Text Code Abstract BibTeX Details
    Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure converges globally at a linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON.
    @article{ChangHsLi08,
      author = {Chang, Kai-Wei and Hsieh, Cho-Jui and Lin, Chih-Jen},
      title = {Coordinate Descent Method for Large-scale L2-loss Linear SVM},
      journal = {Journal of Machine Learning Research},
      year = {2008}
    }
    
    Details
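
    For reference, a sketch of the objective and subproblem in question (notation assumed here): the L2-loss linear SVM minimizes

      D(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i}\max\big(0,\,1 - y_{i}\mathbf{w}^{\top}\mathbf{x}_{i}\big)^{2},

    and each coordinate descent step fixes all but one weight w_{j} and approximately minimizes the one-variable function z \mapsto D(\mathbf{w} + z\,\mathbf{e}_{j}) by Newton steps with a line search. Maintaining the residuals 1 - y_{i}\mathbf{w}^{\top}\mathbf{x}_{i} means each such step only touches the instances for which feature j is nonzero, which is why the method accesses data feature-by-feature rather than instance-by-instance.
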
  9. LIBLINEAR: A Library for Large Linear Classification

    Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin, in JMLR, 2008.
    Full Text Code Abstract BibTeX Details
    LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
    @article{FCHWL08,
      author = {Fan, Rong-En and Chang, Kai-Wei and Hsieh, Cho-Jui and Wang, Xiang-Rui and Lin, Chih-Jen},
      title = {LIBLINEAR: A Library for Large Linear Classification},
      journal = {Journal of Machine Learning Research},
      year = {2008}
    }
    
    Details
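
    LIBLINEAR ships its own command-line tools and C/Python interfaces; as a quick illustration (not taken from the paper), it can also be used indirectly through scikit-learn, whose "liblinear" solver wraps the library.

    # Logistic regression on sparse text features, solved by LIBLINEAR
    # through scikit-learn's "liblinear" solver (illustrative usage only;
    # the dataset is downloaded on first use).
    from sklearn.datasets import fetch_20newsgroups_vectorized
    from sklearn.linear_model import LogisticRegression

    X, y = fetch_20newsgroups_vectorized(subset="train", return_X_y=True)
    clf = LogisticRegression(solver="liblinear", C=1.0)
    clf.fit(X, y)                                    # sparse input is handled natively
    print("training accuracy:", clf.score(X, y))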