UNIVERSITY OF CALIFORNIA
Information Systems Under Schema
Evolution: Analyzing
Change Histories and Management Tools
PI: Carlo Zaniolo
Computer Science Department
University of California
Los Angeles
zaniolo@cs.ucla.edu
Work supported by NSF Grant 0917333.
Disclaimer:
Statements in this report represent the author's opinions and not those of the NSF
Abstract
The significant progress made by database research on schema mapping (e.g.,
composition, invertibility), data exchange, and query rewriting, can provide
breakthrough solutions for the Database Schema Evolution problem. But as of
today, information systems are sorely lacking the methods and tools needed to
cope with the problem, and to reduce the cost of data migration, rework of queries,
application rewriting, and downtime created by schema changes. In fact, this
old problem has been made worse by the success of scientific databases (e.g.,
Ensembl) and web information systems (e.g., Wikipedia)---where the fast evolution
of applications and requirements characterizing the web and the scientific discovery
process is exacerbated by the number and diversity of users and organizations
cooperating on these endeavors. Fortunately, the openness of these public-domain
information systems (vs. corporate ones), and the abundance of their interesting
evolution histories make it possible to build a comprehensive testbed to determine
the strengths, limitations, and potentials of candidate methods and tools proposed
for the problem. Thus, this project is building (i) an open-source curated repository
containing evolution histories from key information systems, (ii) benchmarks
for a comprehensive set of tools tested therein, and (iii) instruments to collect
and analyze evolution histories. These are then used to (a) compare and evaluate
existing approaches, methods and tools, and (b) entice researchers to evaluate
and improve their techniques and add their test cases to the benchmark. A transformative
impact can be expected upon schema mapping research and applications, inasmuch
as theoretical solutions are now validated and improved on real-life case studies.
These, in turn, are expected to significantly transform and improve scientific
databases and web information systems.
Project Summary Since organizations and companies are
adapting continuously to a fast changing reality, their Information Systems
have to rapidly evolve to satisfy changing requirements. This results in frequent
modifications of the database schema, and in the consequent rework of queries
and applications. In fact, in scientific databases and web information systems
changes are occurring at an even more dramatic pace than in traditional databases.
This problem, known in the literature as Schema Evolution or Versioning, has long
been recognized by practitioners and researchers as challenging and pressing,
both from theoretical and engineering standpoints. The solutions currently available
in the commercial world only address the most basic requirements for schema
evolution support, leaving practitioners struggling with the error-prone and
labor-intensive activities of: (i) unassisted schema redesign, (ii) data migration,
and (iii) manual adaptation of queries, updates, and applications. Even more
troublesome problems are caused by a lack of tools to document the large portion
of the schema that is generated by evolution rather than by the original design. On
the other hand, researchers have produced many useful techniques that address
the issues of mapping (i.e. discovery, composition, invertibility), data exchange,
and query rewriting. These techniques are applicable to schema evolution, but
their practical impact has been limited because the research community has
not been able to refine and validate their solutions on large real-life schema
evolution scenarios. This project will change that by building an extensive
testbed of schema evolution test cases and providing effective tools to document
the schema segments generated by evolution rather than the original design.
Intellectual Merit and Need for NSF Support
The goal of our project, in accordance with the very mission of the IIS program
of NSF, is to provide the research and practitioner communities with: (i) a curated
open-source repository containing evolution histories from real-life information
systems, (ii) tools to automate the collection, analysis, and documentation
of schema evolution histories, and (iii) a benchmark to assess the effectiveness
of schema evolution support tools. The PI, his students, and collaborators are
uniquely qualified to carry out this project successfully because of their previous
work on database design and on supporting schema evolution for snapshot and
temporal databases.
Educational Merit Project funds will make
possible the training of PhD students who will be working on improving and extending
the data collection and benchmark we are designing. Undergraduate interns will
also be able to learn from real-life evolution histories. The tool-suite for
schema history analysis we are going to develop will serve as an invaluable
educational tool for students, researchers and practitioners. We will be working
with the UCLA Center for Excellence in Engineering and Diversity to recruit
women and minorities as interns.
Broader Impact The gap between research and
practice has often proved to be the innovation bottleneck of our society. The
ambitious endeavor we are proposing aims at reducing this gap, thus maximizing
the benefits of research to society. Furthermore, these funds will support the
launching of an open-source community that will manage and extend the testbed
and system workbench developed by the project. This proposal describes a timely
and innovative approach to achieve these goals. A broad range of scientific,
educational, and economic activities will benefit from these advances.
Keywords: schema evolution, testbed, benchmark, data
collection, scientific databases
B Table of Contents
Contents
A Project Summary
B Table of Contents
C Project Description
C.1 Results of Previous NSF-Funded Research
C.2 Introduction
C.3 Real-Life Testbed
C.3.1 A Preliminary Study
C.3.2 Complete Dataset
C.4 Software Infrastructure for evolutionary database design
C.4.1 Dataset Collection Suite
C.4.2 Analysis Suite
C.4.3 Mapping Mining
C.4.4 Historical Metadata Manager
C.5 A Benchmark for systems supporting schema evolution
C.5.1 Benchmark Design
C.5.2 Surveying Existing Tools
C.6 Work Plan and Deliverables
C.7 Outreach and dissemination activities
C Project Description
C.1 Results of Previous
NSF-Funded Research
Carlo Zaniolo is the PI of the NSF grant IIS-0339259 "SGER: Collaborative
Research: Support for Design of Evolving Information Systems"(Grant amount: $50,000;
Grant period: 09/2003-08/2004). This exploratory project led to the NSF grant:
IIS-0705345 "III-COR: Collaborative Research: Graceful Evolution and Historical
Queries in Information Systems a Unified Approach" (Grant amount: $204,749 Grant
period: 09/2007-08/2010). These two projects focus on developing the technology
that allows information systems to archive and query the history of databases that
have evolved in both schema and content. Significant progress has been made here
too, including (i) techniques and systems that enable us to publish and query
the history of databases using XML and XQuery [115,116,117,123,118],
(ii) methods and tools that support graceful schema evolution [33,34,32],
and (iii) transaction-time temporal databases that support historical queries
on database history under schema evolution [77,35].
These projects have given us in-depth experience and implementation platforms
that provide us with indispensable and unique expertise on the problem of evolution
itself.
Carlo Zaniolo is the PI of the NSF grant IIS-0742267 "SGER: Efficient
Support for Mining Queries in Data Stream Management Systems" (Grant amount: $200,000;
Grant period: 09/2007-08/2009). The first objective of this project is to tame
the computational complexity of mining queries, by developing data stream mining
algorithms that are fast and light enough for online mining of data streams. This
line of research has produced novel on-line techniques to support efficient mining
of frequent patterns [80,111].
The second objective is to extend current DSMS and their SQL-based query languages
with constructs and optimization techniques needed to design and develop a data
stream mining workbench, the first of its kind. As discussed in [16,17,110],
much progress has already been made, and we are now approaching completion of
our first system prototype.
C.2 Introduction
Nowadays organizations and companies are forced to continuously
adapt to a fast-changing business reality. As a consequence, the corresponding
Information Systems are also required to rapidly evolve to meet new requirements
[64]. At the data management level
this breaks the traditional notion of the database schema as an immutable object
that is carefully designed in advance with the expectation that it will last a long
time. Instead, the database schema becomes a constantly changing entity,
thus inducing the continuous reworking of queries and applications. This problem,
known as schema evolution or schema versioning in the literature,
has long been recognized by practitioners and researchers as a challenging and
pressing one, from theoretical, engineering and economical perspectives [108,102,5,72,15,98,99,101,103,70,96,97,28,53,92,114,74,45,56,54,26,89,29,57,91].
The described phenomenon becomes more pressing in the context
of web Information Systems, due to the collaborative nature of their development
and usage. Scientific projects, such as the Large Hadron Collider at
CERN in Geneva [2], or the Ensembl genetic database
[1], face even more intense evolution
requirements, in order to match the mutating needs of scientists from several
collaborating research centers. In our preliminary investigations [33]
we contacted various research centers and sampled their evolution histories, discovering
incredibly intense scenarios. For instance, Wikipedia [119]
evolved through over 170 schema versions in 4.5 years, and the Ensembl DB has seen over
410 schema versions in about nine years of lifetime.
Evolving an information system, and its data management core in
particular, is a labor-intensive and risky activity that can take 40% to 75% of
the overall development costs [14],
and thus appropriate tool support is strongly desired. The study [33]
of the schema evolution of the backend of the free encyclopedia Wikipedia confirms
our initial hunch on the costs of such evolution. In the Wikipedia schema evolution
history, each evolution step has caused up to 70% of the tested queries to fail,
requiring manual query adaptation [33].
Traditionally the design of the database schema was relegated
to an initial phase, as a consequence of a careful analysis of the system requirements.
However, the fast dynamics of today's business reality call for a more continuous
design activity. The schema is, in fact, evolved throughout the lifetime of a
system, which brings up another critical issue, i.e., documentation. Operating
under tight schedules, database administrators often fail to keep
the documentation up to date with the modified schema. This is vividly illustrated
in an actual comment1 of
a senior Wikipedia DB Administrator: "Move schema documentation to inline
comments in tables.sql, where it's less likely to be left years out of date."
The lack of good documentation also makes the key task of determining data and
metadata provenance extremely difficult, as we will discuss in Section C.4.
Existing Commercial Tools A preliminary analysis
of the state-of-the-art systems supporting evolution, both from open source [109,83,84,3,85,94,44,4,79,93]
and commercial [48,58,86,59,47,67,25,112,10,105,104,9,8,7,27,30,113,106,107,120,46,121,65,31,66]
worlds, calls for immediate actions. The solutions available on the market barely
address the most basic requirements to support schema evolution, leaving practitioners
struggling with the error-prone and labor-intensive activities of: (i) unassisted
schema re-design, (ii) data migration, (iii) manual adaptation of application
queries and updates, (iv) documentation of the evolution itself, and (v) provenance
of data and metadata.
In particular, the big players in the world of commercial DBMSs
have been mainly focusing on reducing the downtime when the schema is updated
[87] and on assistive
design tools [40,109].
Other tools of interest are [63] and
LiquiBase2. All of the
mentioned approaches focus on basic features that, although needed, provide only
limited help with critical tasks, such as application adaptation and documentation,
and thus fail to reduce the high maintenance costs.
Research Endeavors The limitations of commercial
systems are hardly surprising given the intrinsic complexity of the problem,
which involves theoretical issues such as mapping discovery [82],
mapping composition [69,50,81,23],
mapping invertibility [51,52,49],
query rewriting [42,43,41],
and view-update [18,39,62].
This very practical problem is, in fact, fueling some of the most exciting theoretical
endeavors undertaken by the database research community. On the other hand, the
research community that has the strong theoretical background needed to produce
theoretically sound solutions lacks the large-scale real-life case studies needed
to guide and validate their work for practical usefulness.
Many interesting results have indeed been produced on the general
problem of schema evolution. Without trying to be exhaustive, we can mention the
impact-minimizing methodology of [89],
the unified methodology for application and database evolution of [57],
the application-code generation of [29],
the framework for metadata model management of [73],
the visual tool of [88],
and the approaches presented in [21,24,114,122,75].
Further related works include the results on mapping information
preservation by Barbosa et al. [19],
the ontology-based repository of [26],
and the schema versioning approaches of [60].
XML schema evolution has been addressed in [78]
by means of a guideline-driven approach. Object-oriented schema evolution has
been investigated in [54].
In the context of data warehouses, X-TIME represents an interesting step toward
schema versioning by means of the notion of augmenting schema [56,95].
In [34,32]
we have proposed a first practical solution to snapshot database evolution, by
harnessing the theoretical advances on schema evolution modeling [23,114],
mapping composition [69,50,81,23],
mapping invertibility [51,52,49],
and query rewriting [42,43,41].
Similarly in [77,35]
we have proposed a transaction-time DB that supports schema evolution. In [38],
we discuss the vision of a framework combining schema evolution and data integration.
The gap between academia and practitioners
Advances in computer science require a continuous feedback process on the quality
of the obtained results. This is particularly difficult in the area of databases,
where, due to privacy and business concerns, it is often difficult for researchers
to test their innovative solutions on real-life data. We can thus say that, due
to the impedance mismatch between the academic and practitioners' communities
[100], researchers
have limited visibility on the actual needs of the industrial world, and limited
opportunity to validate their solutions on real-life case studies. In fact, for
the problem of schema evolution we are considering, the intense and fruitful research
effort mentioned above is not finding an easy inroad into commercial systems. The
present project aims at removing this bottleneck, by providing data and tools that will
help to (i) expose the actual needs of existing information systems to the research
community, (ii) turn recent research advances into innovative industrial solutions,
and (iii) automate documentation and provenance for databases under schema evolution.
The proposed solution By conducting exploratory
studies on real-life Web Information Systems [33],
and scientific databases, we have experienced the benefit of having first-hand
information on schema evolution needs, and had the opportunity to test
and validate our systems [34,32,77,35]
on real evolution histories. This project will extend this opportunity and its
benefits to the entire research community, and enhance its potential benefits
by:
- collecting a large-scale dataset of real-life schema evolution
histories and releasing both the dataset and its thorough analysis to the
community;
- developing a tool-suite capable of automating collection,
analysis, documentation, and provenance of schema evolution histories;
- designing a benchmark for systems supporting schema evolution
and assessing the current state-of-the-art system support.
We plan to conduct a large-scale campaign to collect evolution
histories from Web Information Systems, scientific projects, government organizations,
and open source systems; we have already made preliminary inquiries with some
of these organizations. The enthusiastic responses we have received from them confirm
the feasibility and also the desirability of our data collection plans.
The collected data will be released to the public via an open, curated
repository. This repository will represent, for researchers and developers, an
invaluable resource for validating innovative technologies and for obtaining better
insight on the challenges posed by Information Systems evolution. Besides storing
the evolution histories, the repository will also allow temporal querying of the
metadata histories, by building on and extending the Historical Metadata Manager presented
in [35]. Each evolution
history will be analyzed, providing several statistics that capture the essence
of the underlying evolution process. Furthermore, automated tools will derive,
from the SQL migration scripts, a high-level specification of the evolution based
on the notion of Schema Modification Operators (SMOs) such as those used in [34].
This will provide the much needed documentation of the evolution process itself
and cure the problem of degrading documentation quality that is now associated
with evolution. The SMO-based representation of the evolution will also enable
precise and automated data and metadata provenance analysis, crucial for scientific
databases.
In order to collect and analyze large datasets, and assure the
scalability of this process, we will develop automatic collection and analysis
tools for evolution histories, including improvements and extensions of the very
primitive ones described in [33].
These will automate the download, versioning, and analysis of schema evolution
histories, and support the quasi-automated collection of query workloads [33].
The tool-suite will be released to the public, greatly facilitating the extraction
of statistics that provide a comprehensive view over long and complex evolution
histories. This will provide database administrators and researchers with invaluable
knowledge. The availability of automatic tools and of a large repository will
be a key component for the formation of a community effort, where researchers and
practitioners will be enabled to access and contribute to this unique repository
of case studies, thus creating a reference point for both communities.
A second key objective of this project is to design and implement
a benchmark for schema evolution tools based on this testbed. This will enable
researchers to validate and compare their approaches and tools. In contrast to
previous approaches based on synthetic data generation [12,11],
our benchmark will be based on a mix of the most crucial and most common evolution
steps from the actual evolution histories in our dataset. Moreover, we will exploit
actual data and actual workloads in a series of experiments to test the features
of systems supporting schema evolution. By measuring and stress-testing the state-of-the-art
tools from the commercial, open source, and academic worlds, we will contribute to
improving the current systems, technology, and methods for schema evolution support.
The benchmark will also be released to the public to provide an invaluable tool to
validate and compare novel solutions against real-world schema evolution scenarios.
This will promote: (i) better targets for the research endeavors tackling real-world
pressing issues, (ii) validation of the research approaches, and (iii) comparison
and assessment of the various techniques against a reference testbed.
Impact Beyond Schema Evolution The usefulness
of the dataset we will collect goes well beyond the schema evolution problem.
In fact, the schemas, data, and query workloads we will provide will benefit researchers
working on query optimization, indexing, partitioning, privacy, and many others.
Our datasets will allow them to investigate the effectiveness of their approaches,
compared with state-of-the-art competitors, all measured on neutral ground. This
will contribute to bridging the gap between the academic and industrial worlds, facilitating
the industrial adoption of advanced research solutions.
C.3 Real-Life Testbed
In this section, we discuss the dataset we are planning to collect.
First we summarize a preliminary study we conducted on Wikipedia, and then explain
how this will be extended to hundreds of other schema evolution histories, in
order to build the first real-life testbed for schema evolution. The size and
nature of this endeavor calls for automated tools, which are discussed in Section C.4.
C.3.1 A Preliminary
Study
In recent years, web Information Systems have witnessed ever-increasing
popularity. Among others, Wikipedia, a multilingual, web-based, free-content encyclopedia,
has seen a great success3,
mainly due to its revolutionary concept of collaborative authoring. Wikipedia
is built on a software platform called MediaWiki4,
which powers over 30,000 wiki websites5.
MediaWiki is an open-source, data-intensive, and collaborative web-portal software.
The system architecture of MediaWiki relies on a relational DBMS backend, i.e., MySQL,
PostgreSQL, or Oracle, to store and manage massive amounts of data (several
hundred gigabytes in the case of Wikipedia). Web pages are dynamically generated
by querying the underlying database and rendering its content in HTML. In the
following we analyze the schema evolution of MediaWiki, based on its 171 schema
versions between April 2003 (first schema revision) and November 2007 (date of
this analysis). The complete analysis is reported in [33].
Basic Statistics In Figure 1,
we report the size of the MediaWiki DB schema throughout its history, in terms of the number of
tables and columns. The graphs show a clear growth trend, with
the number of tables increasing from 17 to 34 (100% increase) and the number
of columns from 100 to 242 (142%).
Figure 1: MediaWiki Schema Size: Number of Tables and Columns
Figure 2: Histograms of Tables and Columns Lifetimes
Figure 2 shows a histogram representation
of the table and column lifetimes, in terms of number of versions. The lifetimes
range from very long ones, e.g., the user table, which was alive throughout
the entire history, to short ones, e.g., the random table, which only survived
for two revisions. On average, each table lasted 103.3 versions (60.4% of the
total DB history). Similarly, each column lasted 97.17 versions on average (56.8% of
the total DB history).
Table 1: Macro-Classification of Schema Changes (One evolution
step may have more than one change type)
Macro-Classification of Changes We group the
170 evolution steps based on the types of evolution they present as in Table 1.
While the "actual schema changes" have an impact on the queries, as they modify
the schema layout, the evolution of the DBMS engine, indexes, and data types
(while being relevant to performance) does not require any query correction, because
of the physical data-independence provided by the DBMS. Table 1
shows the frequencies6
of the types of changes among the 170 evolution steps.
Schema Modification Operators To better understand the
relational DB schema evolution, we introduce a classification of the "actual schema
changes". Different formalisms can be exploited for this purpose. Shneiderman
and Thomas proposed in [102]
a comprehensive set of schema changes, including structural schema changes and
also changes regarding the keys and dependencies. More recently, Bernstein et
al. have also proposed a set of schema evolution primitives using algebra-based
constraints [22].
Among several options, we chose the Schema Modification Operators
(SMOs) that we proposed in [34,77].
SMOs capture the essence of the existing works, but can also express schema changes
not modeled by previous approaches. For example, by using function7
in the ADD COLUMN operator, SMOs can support semantic conversion of columns
(e.g., currency exchange), column concatenation/split (e.g., different address
formats), and other similar changes that have been heavily exploited in modeling
MediaWiki schema changes. The effectiveness of SMOs has been validated in [34,77],
where the PRISM and PRIMA systems used SMOs to describe schema evolution in transaction-time
databases and to support historical query reformulations over multi-schema-version
transaction-time databases.
The syntax of SMO is similar to that of SQL DDL, and provides
a concise way to describe typical modifications of a database schema and the corresponding
data migration. Every SMO takes as input a schema and produces as output a new
version of the same schema. Note that SMOs can be arbitrarily combined to describe
complex structural changes, such as those that occurred in the MediaWiki DB schema evolution.
The SMO-based representation of the evolution will be automatically extracted
by means of tools described in Section C.4.
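As a flavor of these operators, the following illustrative sequence (table and column
names are hypothetical; the complete syntax and semantics of SMOs are defined in [34,77])
merges two tables, adds a derived column, renames a column, and drops an obsolete table:

-- merge two tables into one (the data of both is migrated into the new table)
MERGE TABLE archive_old, archive_new INTO archive;
-- add a column whose value is computed from existing columns
ADD COLUMN full_address AS concat(street, city) INTO customer;
-- rename a column, preserving its content
RENAME COLUMN phone IN customer TO phone_number;
-- drop an obsolete table
DROP TABLE temp_stats;

Each statement is both human-readable documentation of the change and a specification
from which the corresponding data migration and query rewriting can be derived.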
Figure 3: SMOs distribution
Micro-Classification of Changes Using SMOs In this context
we exploit SMOs as a pure classification instrument to provide a fine-grained
analysis of the types of change the schema has been subject to. While there might
be several ways to describe a schema evolution step by means of SMOs, we carefully
select, analyzing the available documentation, the most natural set of SMOs describing
each schema change in the MediaWiki history. Figure 3
shows the distribution of the SMOs, presenting, for each type, how many times
it has been used in the entire schema evolution history, and a graphical representation
of the number of SMOs per schema version. It is interesting to note that the more
sophisticated SMOs (e.g., MERGE TABLE), while indispensable, are
not very common. The balance between column/table additions and deletions highlights
the "content preserving" attitude of Wikipedia8.
Impact on Applications In order to study the effect of
schema evolution on the frontend application, we analyze the impact of the schema
changes on six representative sets of queries. Each experiment tests the success
or failure of a set of queries, originally designed to run on a specific schema
version, when issued against other schema versions.
To simulate a case where current applications are run on databases
under older schema versions, we test three sets of queries, valid on the last
schema version, on all the previous schema versions (Figure 4).
Also, to study how legacy applications succeed or fail on newer versions
of the database schema, we test three sets of legacy queries on all the subsequent
schema versions.
Figure 4: a) Average query success rate against following
schema versions (the queries are designed for the 28th version,
and run against all the following versions); b) real Wikipedia profiler queries
executed against previous schema versions.
The six sets considered in our experiments are as follows:
Real-world templates, current (Figure 4.a):
the 500 most common query templates (extracted9
from over 780 million query instances), derived from the Wikipedia on-line
profiler10 and post-processed
for cleaning11.
Lab-gen queries, current (Figure 4.a):
2496 query instances generated by a local installation of the current version
of MediaWiki (release 1.11, schema version 171), interacting with the frontend12
and logging the queries issued against the underlying MySQL DBMS.
Lab-gen templates, current (Figure 4.a):
148 templates of queries extracted from the above queries.
Lab-gen queries, legacy (Figure 4.b):
4175 query instances generated by a local installation of an old version of MediaWiki
(release 1.3, schema
version 28), interacting with the frontend and logging the queries issued against
the underlying MySQL DBMS.
Lab-gen templates, legacy (Figure 4.b):
74 templates extracted from the above lab-gen queries, legacy.
Synthetic probe queries, legacy (Figure 4.b):
133 synthetic queries accessing single columns (i.e., SELECT tab_j.att_i
FROM tab_j) of schema version 28, designed to highlight the affected
schema portion.
Each set has been tested against all schema versions: the resulting
query execution success rates are shown in Figure 4.a
and Figure 4.b. The outliers in the graphs
(sudden and extremely low values) are due to syntactically incorrect DB schema
versions. Due to lack of space we omit further discussion of the results (more
comments are available in [33]),
but we simply note that these experiments provide clear evidence of the strong
impact of schema changes on applications, and support the case for better schema
evolution support.
C.3.2 Complete Dataset
We plan to extend the dataset by collecting and analyzing the
schema change histories of many other information systems. Currently, we consider
over a hundred candidate information systems from various contexts: web and open
source platforms, scientific projects, and governmental and administrative systems. Web IS
include wiki software, such as MediaWiki or TikiWiki, content management systems
(CMS) like Joomla, Drupal, XOOPS, and TYPO3, and many others, often released under
open source license. The typical schema evolution ranges from tens to hundreds
of schema versions. Scientific IS also demonstrate very dynamic cases of schema
evolution. For instance, the Ensembl project, funded by the European Bioinformatics Institute
and the Wellcome Trust Sanger Institute, provides a data-centric platform used
to support the homonymous human genome database, and 15 other genetic research
endeavors. The Ensembl DB has witnessed an intense schema evolution history. In about
9 years of life-time, more than 410 schema versions were released to the public, which
equates to almost one version a week. Also, we have contacted the European Organization
for Nuclear Research (CERN) and obtained access to their physics databases, which
often go through extensive schema changes. For instance, one of CERN's projects,
CASTOR (CERN Advanced STORage manager), which hosts 15 petabytes of data, reports
149 schema versions during the last three years (which averages to about one version
per week). We plan to further extend the pool of information systems to be included
especially in the area of Scientific DBs. We are, thus, actively opening collaborations
with multiple research centers, including the National Center for Biotechnology Information
(NCBI), which hosts several medical and biological databases [6].
We are also interested in schema evolution in administrative IS,
such as government and university databases. We have contacted the former IT director
of several Information Systems for the Italian government and obtained preliminary
access to some of their schema evolution histories. We also contacted administrators
of UCLA and Politecnico di Milano to obtain access to some administrative schema
evolution histories. While for scientific and open source projects the public
nature of the systems eliminates privacy issues, for administrative databases
privacy poses a major challenge in our endeavor. We are thus managing this on a case-by-case
basis, guaranteeing privacy and anonymity where required.
We have already contacted and obtained access to almost 100 databases
from open source projects, about 50 scientific databases, and about 20 administrative
governmental databases, for a grand-total of almost 4,000 schema versions. In
most cases, we have identified schema change histories ranging from tens
to hundreds of schema versions, and plan to explore them in great detail, as
in the Wikipedia case. This large corpus of case studies will provide a unique perspective
on the problem of schema evolution itself, and allow us to develop and validate
technologies against the actual needs of evolution faced by real-life information
systems.
C.4 Software Infrastructure
for evolutionary database design
As anticipated in the introduction,
schema evolution is typically handled under tight development schedules. Non-functional
aspects, such as documentation and analysis, are thus the first to suffer. Documentation
tends to degrade over time, and, as a consequence, the understanding of the schema
itself becomes more and more difficult. This is testified by various examples
of schema modifications applied and rolled back more than once in the Wikipedia
schema evolution history [33]. This
calls for tools that can: (i) provide a synthetic view over the entire evolution
history, (ii) capture the evolution in a more conceptual way, thus providing automated
documentation, and (iii) support data and metadata provenance tasks.
We, thus, plan to develop appropriate tools for automating the
data collection and analysis process. Furthermore, by automatically analyzing
the migration scripts we will derive the mappings between subsequent schema versions,
providing an added value to the bare schema histories. This will also enable
users to query the data collection, to freely inspect the history of metadata
and derive valuable provenance information. By allowing unrestricted temporal queries
over annotated metadata histories, we also enable unanticipated uses of our dataset,
and guarantee the extensibility of our platform. Hence, we propose a software infrastructure
that supports collection, analysis and querying of large-scale schema evolution
histories.
The proposed set of tools will enable the development of evolutionary
database design methodologies [13],
similarly to what has been proposed in software engineering under the name of
"agile methodologies" [71],
with the popular case of "extreme programming" [20].
C.4.1 Dataset Collection
Suite
In order to automate the schema history collection process, we
leverage the systematic recording of the source code evolution, typical of large
projects, such as biological and physics databases, Web Information Systems, or
large open source projects. Based on our preliminary studies we know, in fact,
that it is rather common for large organizations to maintain the history of their
database schema and application source code by versioning, in CVS or SVN repositories,
the SQL scripts that create the schema and the "patch" SQL scripts that migrate data
from the old to the current schema version. These repositories, storing textual
source-code information, serve the bare minimum purpose of tracking software
evolution, but fail to provide a high-level view of the evolution process. Furthermore,
any data provenance task is extremely taxing, and, due to limited and degrading
documentation, even the simple understanding of the current schema might become
challenging.
What is needed is a set of automated tools, capable of extracting
from such repositories the hidden knowledge about metadata evolution. We will
develop such tools, relying on the experience we obtained in designing [33]
and [35]. In particular,
the schema evolution histories, which are currently shredded in a textual format
in the CVS or SVN repositories, will be automatically collected and stored into
the Historical Metadata DB (HMDB).
The HMDB captures the history of the SQL standard information_schema,
together with various additional metadata (SMOs, user queries, etc.). Due to space
limitations, we discuss only a selected subset of the HMDB content. We
introduce it in a snapshot format, which is a simple extension of the SQL standard
information_schema, to later comment on the corresponding temporal archive.
schemata(schema_name, default_character_set_name)
tables(table_schema, table_name, table_type, table_rows)
columns(table_schema, table_name, column_name, data_type, default_value)
queries(query_id, query_text, schema_name, issue_timestamp)
query_exec(query_id, exec_timestamp, exec_success)
smos(smo_id, smo_text, smo_type, timestamp)
The first three relations, schemata, tables,
and columns, come from the information_schema, while the relations
named queries, query_exec, and smos capture additional metadata, including the mapping
between subsequent schema versions in terms of Schema Modification Operators (Section C.4.3
will discuss how SMOs can be extracted from schema histories). The queries
relation archives the user queries (as templates), the target schema, and the query
issuing time. The table query_exec stores the results (boolean flag exec_success)
of testing each of the above query templates against the subsequent schema versions.
Lastly, the table smos stores the SMOs describing the semantics of each
schema change.
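For instance, the success rate of each query template across the tested schema versions
can be computed directly from these relations. The following SQL query is an illustrative
sketch that assumes only the attributes listed above (with exec_success being a boolean flag):

SELECT q.query_id, q.schema_name,
       AVG(CASE WHEN e.exec_success THEN 1.0 ELSE 0.0 END) AS success_rate
FROM   queries q
JOIN   query_exec e ON e.query_id = q.query_id
GROUP BY q.query_id, q.schema_name;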
In order to effectively and efficiently archive the history of
these metadata, we leverage an XML-based format we developed and implemented
in [115,116,117,123],
named V-document.
By building on the advanced features of the transaction-time
temporal DBMS ArchIS [118]
and our system PRIMA [77],
we will provide access to metadata histories via an effective XQuery querying
interface, discussed in Section C.4.4.
In the remainder of this section, we discuss how the collected
data can be manipulated, enriched, and queried to provide a complete framework
for evolution support and documentation.
C.4.2 Analysis Suite
Due to the large size of schemas and schema histories, a raw
collection of schema versions is of very little use for researchers planning to
test their techniques, or DB Administrators managing a database. For this reason,
we are planning to develop a complete analysis suite. The HMDB representation
of the evolution history allows us to automate the collection of rich statistics
about the evolution itself, similarly to what we did for the Wikipedia case study.
Some of the statistics we are considering include simple measures such as: schema
growth (tables and columns), table and column life-time, key evolution, overall
editing activity over time. Furthermore, we plan to create and automate richer
statistics combining several factors to determine the overall level of evolution
activity in different portions of a DB schema. For example, by studying the amount
of change (to the set of columns, datatypes, keys, indexes, etc.) of a table we
can determine how (un)stable it is, and by observing the percentage of queries in
the workload involving the table we can capture how critical the table is.
In the case of Wikipedia, this would show how the core article
management tables are critical and unstable, thus warning the DB Administrator
and focusing his or her attention, so as to proactively invest in a better design of this portion.
This will provide an invaluable tool for both DBAs and researchers,
to design the evolution and to test innovative technologies.
The tool-suite we are designing contains a rich set of built-in
statistics, and will allow users to specify further statistics by combining simple
SQL queries.
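As an illustration of how such a combined statistic might be expressed on top of the
HMDB (a sketch only: it assumes the relations of Section C.4.1 and uses crude
name-matching via LIKE as a proxy for "the query or SMO involves the table"), consider:

SELECT tn.table_name,
       (SELECT COUNT(*) FROM smos s
         WHERE s.smo_text LIKE '%' || tn.table_name || '%') AS change_activity,
       (SELECT COUNT(*) FROM queries q
         WHERE q.query_text LIKE '%' || tn.table_name || '%') AS workload_share
FROM  (SELECT DISTINCT table_name FROM tables) tn
ORDER BY change_activity DESC, workload_share DESC;

Tables ranking high on both measures are the unstable-and-critical "hot spots" that
deserve the DBA's attention.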
C.4.3 Mapping Mining
One of the key tasks to achieve
automated documentation and provenance is to capture the relationship between
subsequent schema versions. This task is similar to the problem of schema matching
and mapping [82,68,55,90,76,37,61],
well known in the data integration and data exchange worlds. While schema matching
approaches tackle the more general problem of mapping unrelated schemas, in our
scenario extra information is available in terms of SQL data migration scripts.
These scripts capture the relationships between schema versions by means of an
operational specification. By leveraging this extra knowledge, we plan to design
a set of heuristics capable of analyzing SQL migration scripts and deriving an SMO-based
representation of the evolution. In our preliminary tests this approach has proved very
promising, and by exploiting the specific characteristics of our scenario, we
expect to obtain very high levels of precision and recall.
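For instance, a migration-script fragment like the one below (table names are hypothetical)
would be recognized and summarized by a single MERGE TABLE operator; the actual heuristics
will, of course, have to cope with the many syntactic variants of the same pattern:

-- input: fragment of a SQL migration script
CREATE TABLE archive AS SELECT * FROM archive_old;
INSERT INTO archive SELECT * FROM archive_new;
DROP TABLE archive_old;
DROP TABLE archive_new;

-- output: mined SMO-level summary of the same evolution step
MERGE TABLE archive_old, archive_new INTO archive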
We will take full advantage of the large body of work we developed
in [34,77,35,38]
to derive from the SMO specification of the evolution a logical mapping relating
subsequent schema versions. This will provide automated documentation, and data
provenance (by means of query rewriting). This will also enable us to exploit the
schema evolution support functionalities of the PRISM system we developed in [34],
thus providing a very complete suite for schema evolution.
C.4.4 Historical Metadata
Manager
The Historical Metadata Manager (HMM)
we designed in [36] is a tool capable
of storing and querying metadata histories, and once fully implemented will represent
a perfect interface for posing temporal queries over the schema histories. The metadata
histories are captured in the Historical Metadata DB (HMDB), discussed
above, and represented in an XML format.
This allows us to exploit XQuery as a powerful temporal query
language. Due to space limitations we present the effectiveness of the proposed
tools by means of a few simple examples; the interested reader can find further
details in [36].
The queries we present are grouped in two classes: (i) queries
over the history of the information_schema (Query 1-Query
3), and (ii) queries exploiting SMOs and logged queries (Query 4
and Query 5).
Simple but very useful temporal queries are snapshot queries such as
the following:
Query 1. What was the Wikipedia database schema valid on 2007-01-01?
for $db in document("wiki-hmdb.xml")/db,
    $tn in $db/Tables/row/table_name[@ts <= "2007-01-01" and @te > "2007-01-01"]
return
  <table> {$tn/@text}
    <cols>{ $db/Columns/row[table_name = $tn]/
            column_name[@ts <= "2007-01-01" and @te > "2007-01-01"]/text() }</cols>
  </table>
Similarly, it is possible (by range queries) to inspect specific
portions of the history, retrieving all the modifications that occurred in
a given period. Next, we give an example of temporal aggregate queries that can
be exploited to observe general trends of the schema evolution (the corresponding
XQuery for this and the following queries is omitted due to space limitations,
more details can be found in [36]):
Query 2. Retrieve the number of columns in the Wikipedia DB
throughout the history.
The output of such a query can be nicely rendered as we showed in
Figure 1, where it is easy to notice a clear growth
trend in the Wikipedia schema size of roughly 31% a year.
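As an indication of what such a temporal aggregate might look like, the following is a
minimal XQuery sketch that reuses the @ts/@te conventions of Query 1 and counts the
columns alive at each version timestamp (further details are in [36]):

let $db := document("wiki-hmdb.xml")/db
for $v in distinct-values($db/Columns/row/column_name/@ts)
order by $v
return
  <col_count version="{$v}">
    { count($db/Columns/row/column_name[@ts <= $v and @te > $v]) }
  </col_count>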
Furthermore, to analyze quality and stability of the design it
is possible to pose queries retrieving stable or unstable portions of the schema,
such as the following:
Query 3. Which table in the Wikipedia DB schema remained
valid for the longest period?
The answer to this query reveals that the user table was
the most stable in the DB schema. Another interesting class of queries exploits
the explicit characterization of the change semantics provided by SMOs to track
back (or forward) the evolution of information stored in a given schema element.
Consider the following example:
Query 4. Retrieve the previous version of the information
currently stored in table `page'.
Finally, by exploiting the information about query templates and
their execution, it is possible to retrieve and visualize the impact of the schema
evolution on the original user queries, as exemplified by the following query,
whose results have been shown in Figure 4:
Query 5. What's the success rate of legacy queries (valid
on 2004-08-15) after each Wikipedia schema change?
These relatively simple queries will be naturally supported by
the HMM on the history of metadata we archive. This querying capability
will provide a powerful tool for dissecting the metadata history, enabling a better
understanding of the ongoing schema evolution.
C.5 A Benchmark for systems
supporting schema evolution
As we have discussed in the previous section, information systems
are subject to intense schema evolution. A lot of effort has been devoted to developing
theories and tools to mitigate this problem. Various approaches present different
levels of support and focus on different aspects. We believe that it will be very
beneficial to have a standardized benchmark that will allow for a fair comparison
between them and also encourage further development of such tools.
In fact, we plan to leverage the extensive data of schema evolution
history from various information systems for building a benchmark based on real-world
case studies. By selecting a representative subset of evolution steps, we will
be able to highlight the key features that a tool targeting schema evolution must
provide. The large size of the evolution histories we collected can also be used
to test scalability.
C.5.1 Benchmark Design
We first discuss the main features a schema evolution support
system (SESS) should provide and then briefly present how this can be experimentally
measured.
Schema Evolution Support Features We divide the set of
desirable features into four main categories: documentation, predictability,
transformation, and reversibility.
Documentation refers to a set of features that systematically record
schema versions in history, including snapshot data, queries, and other database
objects14. This set of
features provides the minimum level of support, guaranteeing traceability of the
evolution. This set of features is offered to a certain extent by open-source
and commercial tools.
Predictability: during the design of a change in the schema, it is of
key importance to be able to predict the impact of the given evolution step
on schema, data, and queries. We group in this category the set of features of a SESS
that provide what-if scenario support for the various components of a database
(e.g., schema, data, queries, updates, integrity constraints, etc.). This set
of features is only partially supported by commercial and open-source tools.
Transformation: the actual modification of the schema components is also
very important. In this category we group features like: support of the data migration
process, automatic query adaptation, automatic integrity constraints adaptation,
index adjustment. Some of the features in this class are very advanced and only
available in academic prototypes [34,82].
Reversibility: the last class considered includes those features enabling
a seamless rollback of the evolution process. These features are partially achieved
at a basic level by some commercial tools, and supported by some of the
academic tools we considered.
Quantitative Measurement for the Features Some of the above-mentioned
features can only be present or absent in a given system; others, however, can be measured.
For example, when considering data migration, we can analyze
two key measures: migration time and downtime. The first represents the
overall time needed to accomplish a given migration task, and the second refers
to the amount of time (a subset of the former) during which the migrated data
cannot be accessed. By leveraging the large dataset we collect, we will define
a set of standardized experiments aimed at measuring, in a fair and uniform way,
key characteristics of an SESS. This will provide a quantitative measure of how
effective and efficient an SESS is in providing a given feature.
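As a simple illustration of what a single timed experiment might look like (a sketch
with hypothetical table and column names; the actual experiments will be drawn from the
collected histories), consider a column-concatenation step of the kind discussed in
Section C.3.1. Migration time is the elapsed time of the whole script, while downtime is
the window during which the table is locked or legacy queries on the old columns fail:

-- hypothetical evolution step to be timed by the benchmark
ALTER TABLE customer ADD COLUMN full_name VARCHAR(255);
UPDATE customer SET full_name = first_name || ' ' || last_name;  -- bulk data migration
ALTER TABLE customer DROP COLUMN first_name;                     -- legacy queries on the
ALTER TABLE customer DROP COLUMN last_name;                      -- old columns now fail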
C.5.2 Surveying Existing
Tools
We will apply the proposed benchmark to the state-of-the-art SESS,
ranging from commercial [48,58,86,59,47,67,25,112,10,105,104,9,8,7,27,30,113,106,107,120,46,121,65,31,66]
(e.g. MySQL Workbench, SwisSQL migration tool), open source tools [109,83,84,3,85,94,44,4,79,93]
(e.g. Liquibase, Hibernate), and academic tools [82,34,75,23,88]
(e.g. CLIO, PRISM). Some of them deliver quite limited features, while others
provide the most advanced features we consider, such as automatic rewriting of legacy
queries and updates. We plan to take a closer look into these systems, which
will also allow us to fine-tune our benchmark. The overall results of this activity
will be the first real-life schema evolution benchmark and a thorough survey of
the existing tools, which will provide a complete assessment of the maturity level
of the approaches to schema evolution.
C.6 Work Plan and Deliverables
The project objective and milestones will be achieved according
to the following workplan:
- In the first year, we will build the bulk of our testbed
by collecting and organizing the data sets we have described in the previous
sections and adding several new ones. We will build tools that automate the
collection and management of such data sets, and can scale up to thousands
of schema versions. We will set up an appropriate distribution infrastructure,
by means of an open, curated repository, which will be made accessible from the very
beginning to promote the formation of a community of users and contributors.
- In the second year, we will design and implement the analysis
tools for our software platform. This will enable us to extract from our testbed
(that will be continuously updated and extended throughout the rest of the
project) a large set of statistics. Moreover, during this year, we will tackle
the problem of Schema Modification Operators Mining from SQL data migration
scripts. This will produce tools to automatically capture the evolution in
terms of a more conceptual representation, thus enabling evolution documentation
and provenance support. Lastly, we will design and implement the Historical Metadata
Manager, which will make it easy to pose historical queries on the evolution
of the schema and other metadata.
- In the third year, we will design a benchmark for schema
evolution based on our testbed (that has been growing during these three years).
We will evaluate schema evolution/mapping tools from the commercial and the
research world on multiple features using quality metrics based on sound rationale.
We will design and automate experiments to test and validate traditional and
innovative approaches to schema evolution. By applying the benchmark to the
many open-source, commercial and academic tools available, we will provide
a precise characterization of state of the art tools, thus motivating and
guiding researchers and companies toward solving the current problem.
C.7 Outreach and dissemination
activities
The success of the research endeavor we are undertaking critically
depends on achieving a successful dissemination of the results, tools, and data
sets developed by our project. To guarantee this success we have scheduled a spectrum
of activities that include:
- extensive and timely documentation on results and software
(all of which will be downloadable) available from the project home page http://www.schemaevolution.org,
- Collaboration with several universities and research groups
(in addition to the groups at UCSD, Politecnico di Milano, and UCR we are already
working with). By leveraging the critical mass of schema evolution histories
collected in the project, and the tool-suite we are going to develop, we will
attract a large number of researchers and practitioners, particularly those
working with scientific information systems,
- The formation of an open-source community contributing
to this testbed. Toward that goal, we have already contacted Open Source groups,
which have agreed to contribute and participate in the tasks of forming and
administering the community. This activity is very important for the success
of the project, as it will allow us to: (i) make the invaluable deliverables
of our project available to a vast public, (ii) contribute to creating a reference
point for the community, and, thus, (iii) improve the level of assessment
of scientific results.
- Disseminating our results and tools through papers at research
conferences and technical meetings, and through demos and tutorials.
Through these activities, we will be able to reach a large community,
since the data we collect can play an important role well beyond the Schema Evolution
research area. In general, the testbed and benchmark created by this project will
contribute to assuring verifiable scholarly standards and to creating consensus in
the research community. This dovetails with recent efforts of major database conferences
and journals to assure the repeatability of scientific results, and to provide
special tracks for experiments and analysis, aiming at better assessing research
advances.
References
- [1]
- http://ensembl.org.
- [2]
- https://twiki.cern.ch/twiki/bin/view/pssgroup/pdbnumbers?rev=4.
- [3]
- Schema-compare: a PHP script that will compare two database
schemas and show the differences side-by-side http://sourceforge.net/projects/schema-compare/.
- [4]
- Simple schema migration for django http://www.aswmc.com/dbmigration/.
- [5]
- Edbms - the evolutionary database management system - concepts
and summary. In Technical Report, Information Research Associates, Austin,
Texas, 1983.
- [6]
- See: http://www.ncbi.nlm.nih.gov.
- [7]
- AdventNet, Inc. Swissql console 4.2: Sql query translation
/ conversion tool http://www.swissql.com/products/sql-translator/sql-converter.html.
- [8]
- AdventNet, Inc. Swissql data migration tool http://www.swissql.com/products/datamigration/data-migration.html?ad-main.
- [9]
- AdventNet, Inc. Swissql dbchangemanager 4.6 http://www.swissql.com/products/database-compare-synchronize-tool/.
- [10]
- Aldon. Aldon lifecycle manager : Multiplatform change management
software http://www.aldon.com/pdf/Aldon_database.pdf.
- [11]
- Bogdan Alexe, Wang-Chiew Tan, and Yannis Velegrakis. Comparing
and evaluating mapping systems with stbenchmark. Proc. VLDB Endow.,
1(2):1468-1471, 2008.
- [12]
- Bogdan Alexe, Wang-Chiew Tan, and Yannis Velegrakis. Stbenchmark:
towards a benchmark for mapping systems. Proc. VLDB Endow., 1(1):230-244,
2008.
- [13]
- Scott W. Ambler and Pramodkumar J. Sadalage.
Refactoring Databases : Evolutionary Database Design (Addison Wesley Signature
Series) http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20\&path=ASIN/0321293533.
Addison-Wesley Professional, March 2006.
- [14]
- Grigoris Antoniou, Dimitris Plexousakis, and Giorgos Flouris.
Evolving ontology evolution. In SOFSEM 2006: Theory and Practice of Computer
Science, pages 14-29. Springer Berlin / Heidelberg, 2006.
- [15]
- Gad Ariav. Temporally oriented data definitions - managing
schema evolution in temporally oriented databases. Data and Knowledge
Engineering, 6(1):451-467, 1991.
- [16]
- Y. Bai, H. Thakkar, H. Wang, and C. Zaniolo.
Time-stamp management and query execution in data stream management systems.
IEEE Internet Computing, 12(6):13-21, 2008.
- [17]
- Yijian Bai and Carlo Zaniolo. Minimizing latency and memory
in dsms: a unified approach to quasi-optimal scheduling. In SSPS,
pages 58-67, 2008.
- [18]
- F. Bancilhon and N. Spyratos. Update semantics
of relational views. ACM Trans. Database Syst., 6(4):557-575, 1981.
- [19]
- Denilson Barbosa, Juliana Freire, and Alberto O. Mendelzon.
Designing information-preserving mapping schemes for xml. In VLDB,
pages 109-120, 2005.
- [20]
- Kent Beck. Extreme Programming Explained: Embrace Change.
Addison-Wesley Professional, 1st edition, October 1999.
- [21]
- Philip A. Bernstein. Applying model management to
classical meta data problems. In CIDR, 2003.
- [22]
- Philip A. Bernstein, Todd J. Green, Sergey Melnik,
and Alan Nash. Implementing mapping composition. In VLDB, 2006.
- [23]
- Philip A. Bernstein, Todd J. Green, Sergey Melnik,
and Alan Nash. Implementing mapping composition. VLDB J., 17(2):333-353,
2008.
- [24]
- Philip A. Bernstein and Erhard Rahm. Data warehouse
scenarios for model management. In ER, 2003.
- [25]
- BEST SoftTool, Inc. Sql db compare http://bestsofttool.com/SQLDBCompare/SDC_Feature.aspx.
- [26]
- Hassina Bounif and Rachel Pottinger. Schema repository
for database schema evolution. DEXA, 0:647-651, 2006.
- [27]
- Bradmark Technologies, Inc. Dbcontrol for udb: Database
administration and schema evolution for the ibm db2 universal database environment.
datasheet http://www.bradmark.com/site/products/pdfs/dbcontrol_db2.pdf.
- [28]
- C. De Castro, F. Grandi, and M. R. Scalas.
Schema versioning for multitemporal relational databases. Information
Systems, 22(5):249-290, 1997.
- [29]
- Anthony Cleve and Jean-Luc Hainaut. Co-transformations
in database applications evolution. LNCS: Generative and Transformational
Techniques in Software Engineering, pages 409-421, 2006.
- [30]
- Troy Coleman. Evolution of online schema change to database
definition on demand. blog http://ibmsystemsmag.blogs.com/db2utor/2007/09/evolution-of-on.html,
September 2007.
- [31]
- The Australian Software Company. Sql delta for oracle
databases - beta http://www.sqldelta.com/oraclebeta.html.
- [32]
- Carlo Curino, Hyun J. Moon, MyungWon Ham, and Carlo
Zaniolo. The prism workwench: Database schema evolution without tears. In
ICDE, 2009.
- [33]
- Carlo Curino, Hyun J. Moon, Letizia Tanca, and Carlo
Zaniolo. Schema Evolution in Wikipedia: toward a Web Information System Benchmark.
ICEIS, 2008.
- [34]
- Carlo Curino, Hyun J. Moon, and Carlo Zaniolo. Graceful
database schema evolution: the prism workbench. Very Large DataBases (VLDB),
1, 2008.
- [35]
- Carlo Curino, Hyun J. Moon, and Carlo Zaniolo. Managing
the history of metadata in support for db archiving and schema evolution.
In Fifth International Workshop on Evolution and Change in Data Management
(ECDM 2008), 2008.
- [36]
- Carlo Curino, Hyun J. Moon, and Carlo Zaniolo. Managing
the history of metadata in support for db archiving and schema evolution.
In ER Workshop on Evolution and Change in Data Management (ECDM),
2008.
- [37]
- Carlo Curino, Giorgio Orsi, and Letizia Tanca. X-som: A
flexible ontology mapper. In DEXA Workshops, pages 424-428, 2007.
- [38]
- Carlo Curino, Letizia Tanca, and Carlo Zaniolo. Information
systems integration and evolution: Ontologies at rescue. In International
Workshop on Semantic Technologies in System Maintenance (STSM), 2008.
- [39]
- U. Dayal and P.A. Bernstein. On the Correct Translation
of Update Operations on Relational Views. ACM Transactions on Database
Systems, 8(3):381-416, 1982.
- [40]
- DB2 development team. DB2 Change Management Expert. 2006.
- [41]
- Alin Deutsch, Alan Nash, and Jeff Remmel. The chase revisited.
In Principles of database systems (PODS), pages 149-158, New York,
NY, USA, 2008. ACM.
- [42]
- Alin Deutsch and Val Tannen. Optimization properties for
classes of conjunctive regular path queries. In DBPL '01, pages 21-39,
London, UK, 2002. Springer-Verlag.
- [43]
- Alin Deutsch and Val Tannen. Mars: A system for publishing
xml from mixed and redundant storage. In VLDB, 2003.
- [44]
- Jan Dittberner. Migrate versioning http://code.google.com/p/sqlalchemy-migrate/wiki/MigrateVersioning.
- [45]
- Nina Edelweiss and Álvaro Freitas Moreira. Temporal
and versioning model for schema evolution in object-oriented databases. Data
Knowl. Eng., 53(2):99-128, 2005.
- [46]
- edule. Synchronize sql server databases http://e-dule.com/sql_server/compare_synchronize_databases/Databases_Synchronization_Tool_DB_SynchroComp.asp.
- [47]
- Embarcadero. Embarcadero change manager. datasheet, 2008
http://www.embarcadero.com/products/changemanager/cmdatasheet.pdf.
- [48]
- John Chun et al. Db2 and ruby on rails, part
1: Getting started with db2 and ruby on rails http://www.ibm.com/developerworks/db2/library/techarticle/dm-0705chun/.
article, 2007.
- [49]
- Ronald Fagin. Inverting schema mappings. ACM Trans.
Database Syst., 32(4), 2007.
- [50]
- Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and
Wang Chiew Tan. Composing schema mappings: Second-order dependencies
to the rescue. In PODS, pages 83-94, 2004.
- [51]
- Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and
Wang-Chiew Tan. Quasi-inverses of schema mappings. In PODS '07, pages
123-132, 2007.
- [52]
- Ronald Fagin and Alan Nash. The structure of inverses in
schema mappings. In IBM Research Report RJ10425, 2007.
- [53]
- Enrico Franconi, Fabio Grandi, and Federica Mandreoli.
Schema evolution and versioning: A logical and computational characterisation,
2001.
- [54]
- Renata de Matos Galante, Clesio Saraiva dos Santos,
Nina Edelweiss, and Alvaro Freitas Moreira. Temporal and versioning model
for schema evolution in object-oriented databases. Data & Knowledge
Engineering, 53(2):99-128, 2005.
- [55]
- F. Giunchiglia, P. Shvaiko, and M. Yatskevich.
S-match: an algorithm and an implementation of semantic matching. In Y. Kalfoglou,
M. Schorlemmer, A. Sheth, S. Staab, and M. Uschold, editors,
Semantic Interoperability and Integration, number 04391 in Dagstuhl
Seminar Proceedings, 2005.
- [56]
- Matteo Golfarelli, Jens Lechtenbörger, Stefano Rizzi,
and Gottfried Vossen. Schema versioning in data warehouses. In ER (Workshops),
pages 415-428, 2004.
- [57]
- Jean-Marc Hick and Jean-Luc Hainaut. Database application
evolution: a transformational approach. Data Knowl. Eng., 59(3):534-558,
2006.
- [58]
- IBM. Db2 change management expert overview, 2007 http://publib.boulder.ibm.com/infocenter/mptoolic/v1r0/index.jsp?topic=/com.ibm.db2tools.chx.doc.ug/chxucoview01.htm.
- [59]
- Idera. Sql change manager. datasheet http://www.idera.com/products/sqlchange/Datasheet.pdf.
- [60]
- H. V. Jagadish, Inderpal Singh Mumick, and Michael
Rabinovich. Scalable versioning in distributed databases with commuting updates.
In Conference on Data Engineering, pages 520-531, 1997.
- [61]
- Y. Kalfoglou and M. Schorlemmer. Ontology mapping:
the state of the art. The Knowledge Engineering Review, 18(01):1-31,
2003.
- [62]
- Yannis Kotidis, Divesh Srivastava, and Yannis Velegrakis.
Updates through views: A new hope. In ICDE '06: Proceedings of the 22nd
International Conference on Data Engineering, page 2, Washington,
DC, USA, 2006. IEEE Computer Society.
- [63]
- T. Lemke and R. Manthey. The schema evolution
assistant: Tool description, 1995.
- [64]
- Ying Liu, Shu ren Zhang, and Mei qi Fang. Ecological
analysis on evolution of information systems. In I3E (2), pages 308-315,
2007.
- [65]
- DBBalance LTD. Cross-database studio. datasheet http://www.dbbalance.com/CDBSStudio6.pdf.
- [66]
- Miro International Pty Ltd. Sql sync toolkit ver.
2.3 [build 1194] http://www.bizkit.ru/index.php?option=com_content&task=view&id=2&Itemid=51&lang=.
- [67]
- Red Gate Software Ltd. Sql solutions http://www.red-gate.com/products/SQL_Refactor/index.htm.
- [68]
- J. Madhavan, P. A. Bernstein, and E. Rahm.
Generic schema matching with cupid citeseer.ist.psu.edu/madhavan01generic.html,
year = 2001, bdsk-url-1 = citeseer.ist.psu.edu/madhavan01generic.html. In
The VLDB Journal, pages 49-58.
- [69]
- Jayant Madhavan and Alon Y. Halevy. Composing mappings
among data sources. In VLDB, 2003.
- [70]
- S. Marche. Measuring the stability of data models.
European Journal of Information Systems, 2(1):37-47, 1993.
- [71]
- Robert Cecil Martin. Agile Software Development:
Principles, Patterns, and Practices. Prentice Hall PTR, Upper Saddle
River, NJ, USA, 2003.
- [72]
- L. Edwin McKenzie and Richard T. Snodgrass. Schema
evolution and the relational algebra. Information Systems, 15(2):207-232,
1990.
- [73]
- Sergey Melnik, Erhard Rahm, and Philip A. Bernstein.
Rondo: A programming platform for generic model management. In SIGMOD,
2003.
- [74]
- T. Mens and T. Tourwè. A survey of software
refactoring. IEEE Trans. Softw. Eng., 30(2):126-139, 2004.
- [75]
- MIT. Cl-migration : a port of the database migrations feature
of ruby on rails to common lisp http://common-lisp.net/project/cl-migrations/.
- [76]
- P. Mitra, Natalya Fridman Noy, and A. R.
Jaiswal. Omen: A probabilistic ontology mapping tool. Workshop on Meaning
Coordination and Negotiation at ISWC-04, 2004.
- [77]
- Hyun J. Moon, Carlo Curino, Alin Deutsch, Chien-Yi
Hou, and Carlo Zaniolo. Managing and querying transaction-time databases under
schema evolution. Very Large DataBases (VLDB), 1, 2008.
- [78]
- Mirella M. Moro, Susan Malaika, and Lipyeow Lim. Preserving
XML Queries during Schema Evolution. In WWW, pages 1341-1342, 2007.
- [79]
- Joel Moss. Cake db migrations v3.2 http://bakery.cakephp.org/articles/view/cake-db-migrations-v2-1.
- [80]
- B. Mozafari, H. Thakkar, and C. Zaniolo.
Verifying and mining frequent patterns from large windows over data streams.
In ICDE, 2008.
- [81]
- Alan Nash, Philip A. Bernstein, and Sergey Melnik.
Composition of mappings given by embedded dependencies. In PODS,
2005.
- [82]
- The Database Group @ University of Toronto. Clio
project http://queens.db.toronto.edu/project/clio/index.php.
- [83]
- open source. Liquibase http://www.liquibase.org/manual/home.
- [84]
- open source. Metabase : Dbms-independent php api to access
sql databases http://freshmeat.net/projects/metabase/.
- [85]
- open source. Tora : toolkit for oracle http://tora.sourceforge.net/.
- [86]
- Oracle. Oracle enterprise manager 10g change management
pack. datasheet http://www.oracle.com/technology/products/oem/pdf/ds_change_pack.pdf.
- [87]
- Oracle development team. Oracle database 10g online data
reorganization and redefinition http://www.oracle.com/technology/deploy/availability/.
Oracle White Paper, 2005.
- [88]
- G. Papastefanatos, F. Anagnostou, Y. Vassiliou,
and P. Vassiliadis. Hecataeus: A what-if analysis tool for database schema
evolution. Software Maintenance and Reengineering, 2008. CSMR 2008. 12th
European Conference on, pages 326-328, April 2008.
- [89]
- Young-Gook Ra. Relational schema evolution for program
independency. Intelligent Information Technology, pages 273-281,
2005.
- [90]
- E. Rahm and P. A. Bernstein. A survey of approaches
to automatic schema matching. The VLDB Journal, 10(4):334-350, 2001.
- [91]
- E. Rahm and P. A. Bernstein. An online bibliography
on schema evolution. SIGMOD Record, 35(4):30-31, 2006.
- [92]
- Sudha Ram and G. Shankaranarayanan. Research issues
in database schema evolution: the road not taken. In Boston University
School of Management, Paper No: 2003-15, 2003.
- [93]
- LLC Red Hat Middleware. Relational persistence for
java and .net http://hibernate.org/.
- [94]
- Sebastian Riedel. Dbix::migration - seamless db schema
up- and downgrades http://search.cpan.org/~daniel/DBIx-Migration-0.05/lib/DBIx/Migration.pm.
- [95]
- Stefano Rizzi and Matteo Golfarelli. X-time: Schema versioning
and cross-version querying in data warehouses. In ICDE, pages 1471-1472,
2007.
- [96]
- J.F. Roddick. A Survey of Schema Versioning Issues for
Database Systems. Information and Software Technology, 37(7):383-393,
1995.
- [97]
- J.F. Roddick. A Model for Schema Versioning in Temporal
Database Systems. In Proc. 19th. ACSC Conf., pages 446-452, 1996.
- [98]
- John F. Roddick. Dynamically changing schemas within
database models. The Austrailian Computer Journal, 23(3):105-109,
1991.
- [99]
- John F. Roddick. Sql/se - a query language extension
for databases supporting schema evolution. SIGMOD Record, 21(3):10-16,
1992.
- [100]
- Arnon Rosenthal and Len Seligman. Pragmatics and open problems
for inter-schema constraint theory. In ICDEW '06: Proceedings of the 22nd
International Conference on Data Engineering Workshops, page 1,
Washington, DC, USA, 2006. IEEE Computer Society.
- [101]
- M. R. Scalas, A. Cappeli, and C. De Castro.
A model for schema evolution in temporal relational databases. In the
European Conference on Computers in Design, Manufacturing and Production (CompEuro),
1993.
- [102]
- Ben Shneiderman and Glenn Thomas. An architecture for automatic
relational database system conversion. ACM Transactions on Database Systems,
7(2):235-257, 1982.
- [103]
- D. Sjoberg. Quantifying schema evolution. Information
and Software Technology, 35(1):35-44, 1993.
- [104]
- DTM Software. Dtm migration kit http://www.sqledit.com/mk/index.html.
- [105]
- DTM Software. Dtm schema comparer http://www.sqledit.com/scmp/index.html.
- [106]
- Quest Software. The all-in-one sql server management and
development toolset. datasheet http://www.quest.com/Quest_Site_Assets/PDF/DSD_ToadSQLServer40.pdf.
- [107]
- Quest Software. Automated sql server change management.
datasheet http://www.quest.com/Quest_Site_Assets/PDF/DSD-ChangeDirector-F-US.pdf.
- [108]
- S.Y.W. Su and H. Lam. Application program conversion
due to database changes. In VLDB, 1976.
- [109]
- MySQL Development Team. Mysql workbench: A data modeling
guide for developers and dbas. White paper, 2008.
- [110]
- H. Thakkar, B. Mozafari, and C. Zaniolo.
A data stream mining system. In ICDM, 2008.
- [111]
- Hetal Thakkar, Barzan Mozafari, and Carlo Zaniolo. Designing
an inductive data stream management system: the stream mill experience. In
SSPS, pages 79-88, 2008.
- [112]
- Toad. Toad dba suite for oracle. datasheet http://www.quest.com/Quest_Site_Assets/PDF/DSD_Toad_DBASuite_US.pdf.
- [113]
- Adept SQL Tools. Comparing sql schema and table data
with adeptsql diff http://www.adeptsql.com/sql_compare_tool_docs/index.htm.
- [114]
- Yannis Velegrakis, RenĀ»e J. Miller, and Lucian Popa.
Mapping adaptation under evolving schemas. In VLDB, 2003.
- [115]
- F. Wang and C. Zaniolo. XBiT: An XML-based Bitemporal
Data Model. In ER, 2004.
- [116]
- Fusheng Wang and Carlo Zaniolo. An XML-Based Approach to
Publishing and Querying the History of Databases. World Wide Web: Web
Information Systems Engineering, 8(3):233-259, 2005.
- [117]
- Fusheng Wang, Carlo Zaniolo, and Xin Zhou. Temporal xml?
sql strikes back! In TIME, pages 47-55, 2005.
- [118]
- Fusheng Wang, Carlo Zaniolo, and Xin Zhou. Archis: an xml-based
approach to transaction-time temporal database systems. VLDB J.,
accepted for publication, 2009.
- [119]
- Wikimedia Foundation. Wikipedia, the free encyclopedia
http://en.wikipedia.org/,
2008. [Online].
- [120]
- xSQL Software. What's new: Support for sql server 2008
and more http://www.xsqlsoftware.com/Sql2008Support/Index.aspx.
- [121]
- dbMaestro Yaniv Yehuda, Joint Managing Director.
Database compare and synchronize methods as a part of software release management
process http://www.dbmaestro.com/DbMaestro/24/1/525.aspx.
- [122]
- Cong Yu and Lucian Popa. Semantic adaptation of schema
mappings when schemas evolve. In VLDB, 2005.
- [123]
- Xin Zhou, Fusheng Wang, and Carlo Zaniolo. Efficient temporal
coalescing query support in relational database systems. In DEXA,
2006.
Footnotes:
1. This is found in a CVS commit at: http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=8792.
2. Available on-line: http://www.liquibase.org/.
3. Wikipedia consistently ranks among the ten most popular websites according to Alexa.
4. http://www.mediawiki.org
5. See http://s23.org/wikistats/.
6. Note that each evolution step might contain more than one type of change.
7. Both from system libraries and user-defined.
8. The main exception is the information supporting user rights management, which was significantly reduced in the database after version v9335 (the 65th version), when it was moved to the application layer.
9. The templates are extracted by ignoring constants and retaining only the query structure (a minimal extraction sketch follows these footnotes).
10. Available on-line at http://noc.wikimedia.org/cgi-bin/report.py.
11. The cleaning process removes syntactical errors introduced by the template extraction of the Wikipedia profiler.
12. To exercise as many query types as possible, we tried to trigger all features accessible from the web browser.
13. The oldest version compatible with the environment of our experimental setting.
14. In this proposal, database objects refers to indexes, views, stored procedures, and triggers, unless otherwise mentioned.
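To illustrate footnote 9, the following is a minimal, hypothetical sketch of how such query templates could be derived: literal constants are replaced by placeholders and whitespace is normalized, so that queries differing only in their constants map to the same template. The function name extract_template and the regular expressions are illustrative assumptions and do not reproduce the actual extraction code of the Wikipedia profiler.

    import re

    def extract_template(query):
        """Reduce a SQL query to its template: replace literal constants
        with '?' placeholders and keep only the query structure
        (illustrative sketch, not the actual profiler code)."""
        # Replace single-quoted string literals (handling escaped quotes).
        template = re.sub(r"'(?:[^'\\]|\\.)*'", "?", query)
        # Replace standalone numeric literals.
        template = re.sub(r"\b\d+(?:\.\d+)?\b", "?", template)
        # Normalize whitespace so layout differences do not create new templates.
        return re.sub(r"\s+", " ", template).strip()

    # Queries that differ only in their constants reduce to the same template.
    q1 = "SELECT page_title FROM page WHERE page_id = 42"
    q2 = "SELECT page_title FROM page WHERE page_id = 7"
    assert extract_template(q1) == extract_template(q2)
    # -> SELECT page_title FROM page WHERE page_id = ?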