UNIVERSITY OF CALIFORNIA
Information Systems Under Schema
Evolution: Analyzing
Change Histories and Management Tools
PI: Carlo Zaniolo
Computer Science Department
University of California
Los Angeles
zaniolo@cs.ucla.edu
Work supported by NSF Grant 0917333.
Disclaimer:
Statements in this report represent the author's opinions and not those of the NSF
Abstract
The significant progress made by database research on schema mapping (e.g.,
composition, invertibility), data exchange, and query rewriting, can provide
breakthrough solutions for the Database Schema Evolution problem. But as of
today, information systems are sorely lacking the methods and tools needed to
cope with the problem, and to reduce the cost of data migration, rework of queries,
application rewriting, and downtime created by schema changes. In fact, this
old problem has been made worse by the success of scientific databases (e.g.,
Ensembl) and web information systems (e.g., Wikipedia)---where the fast evolution
of applications and requirements characterizing the web and the scientific discovery
process is exacerbated by the number and diversity of users and organizations
cooperating on these endeavors. Fortunately, the openness of these public-domain
information systems (vs. corporate ones), and the abundance of their interesting
evolution histories make it possible to build a comprehensive testbed to determine
the strengths, limitations, and potentials of candidate methods and tools proposed
for the problem. Thus, this project is building (i) an open-source curated repository
containing evolution histories from key information systems, (ii) benchmarks
for a comprehensive set of tools tested therein, and (iii) instruments to collect
and analyze evolution histories. These are then used to (a) compare and evaluate
existing approaches, methods and tools, and (b) entice researchers to evaluate
and improve their techniques and add their test cases to the benchmark. A transformative
impact can be expected upon schema mapping research and applications, inasmuch
as theoretical solutions are now validated and improved on real-life case studies.
These, in turn, are expected to significantly transform and improve scientific
databases and web information systems.
Project Summary Since organizations and companies are
adapting continuously to a fast changing reality, their Information Systems
have to rapidly evolve to satisfy changing requirements. This results in frequent
modifications of the database schema, and in the consequent rework of queries
and applications. In fact, in scientific databases and web information systems
changes are occurring at an even more dramatic pace than in traditional databases.
This problem, known in the literature as Schema Evolution or Versioning, has long
been recognized by practitioners and researchers as challenging and pressing,
both from theoretical and engineering standpoints. The solutions currently available
in the commercial world only address the most basic requirements for schema
evolution support, leaving practitioners struggling with the error-prone and
labor-intensive activities of: (i) unassisted schema redesign, (ii) data migration,
and (iii) manual adaptation of queries, updates, and applications. Even more
troublesome problems are caused by a lack of tools to document the large portion
of the schema that is generated by evolution rather than by the original design. On
the other hand, researchers have produced many useful techniques that address
the issues of mapping (i.e. discovery, composition, invertibility), data exchange,
and query rewriting. These techniques are applicable to schema evolution, but
their practical impact has been limited because the research community has
not been able to refine and validate their solutions on large real-life schema
evolution scenarios. This project will change that by building an extensive
testbed of schema evolution test cases and providing effective tools to document
the schema segments generated by evolution rather than the original design.
Intellectual Merit and Need for NSF Support
The goal of our project, in accordance with the very mission of the IIS program
of NSF, is to provide the research and practitioner communities with: (i) a curated
open-source repository containing evolution histories from real-life information
systems, (ii) tools to automate the collection, analysis, and documentation
of schema evolution histories, and (iii) a benchmark to assess the effectiveness
of schema evolution support tools. The PI, his students, and collaborators are
uniquely qualified to carry out this project successfully because of their previous
work on database design and on supporting schema evolution for snapshot and
temporal databases.
Educational Merit Project funds will make
possible the training of PhD students who will be working on improving and extending
the data collection and benchmark we are designing. Undergraduate interns will
also be able to learn from real-life evolution histories. The tool-suite for
schema history analysis we are going to develop will serve as an invaluable
educational tool for students, researchers and practitioners. We will be working
with the UCLA Center for Excellence in Engineering and Diversity to recruit
women and minorities as interns.
Broader Impact The gap between research and
practice has often proved to be the innovation bottleneck of our society. The
ambitious endeavor we are proposing aims at reducing this gap, thus maximizing
the benefits of research to society. Furthermore, these funds will support the
launching of an open-source community that will manage and extend the testbed
and system workbench developed by the project. This proposal describes a timely
and innovative approach to achieve these goals. A broad range of scientific,
educational, and economic activities will benefit from these advances.
Keywords: schema evolution, testbed, benchmark, data
collection, scientific databases
B Table of Contents
Contents
A Project Summary
B Table of Contents
C Project Description
C.1 Results of Previous NSF-Funded Research
C.2 Introduction
C.3 Real-Life Testbed
C.3.1 A Preliminary Study
C.3.2 Complete Dataset
C.4 Software Infrastructure for evolutionary database design
C.4.1 Dataset Collection Suite
C.4.2 Analysis Suite
C.4.3 Mapping Mining
C.4.4 Historical Metadata Manager
C.5 A Benchmark for systems supporting schema evolution
C.5.1 Benchmark Design
C.5.2 Surveying Existing Tools
C.6 Work Plan and Deliverables
C.7 Outreach and dissemination activities
C Project Description
C.1 Results of Previous
NSF-Funded Research
Carlo Zaniolo is the PI of the NSF grant IIS-0339259 "SGER: Collaborative
Research: Support for Design of Evolving Information Systems"(Grant amount: $50,000;
Grant period: 09/2003-08/2004). This exploratory project led to the NSF grant:
IIS-0705345 "III-COR: Collaborative Research: Graceful Evolution and Historical
Queries in Information Systems a Unified Approach" (Grant amount: $204,749 Grant
period: 09/2007-08/2010). These two projects focus on developing the technology
that allows information systems to archive and query the history of databases that
have evolved in both schema and content. Significant progress has been made here
too, including (i) techniques and systems that enable us to publish and query
the history of databases using XML and XQuery [115,116,117,123,118],
(ii) methods and tools that support graceful schema evolution [33,34,32],
and (iii) transaction-time temporal databases that support historical queries
on database history under schema evolution [77,35].
These projects have given us in-depth experience and implementation platforms
that provide us with indispensable and unique expertise on the problem of evolution
itself.
Carlo Zaniolo is the PI of the NSF grant IIS-0742267 "SGER: Efficient
Support for Mining Queries in Data Stream Management Systems" (Grant amount: $200,000;
Grant period: 09/2007-08/2009). The first objective of this project is to tame
the computational complexity of mining queries, by developing data stream mining
algorithms that are fast and light enough for online mining of data streams. This
line of research has produced novel on-line techniques to support efficient mining
of frequent patterns [80,111].
The second objective is to extend current DSMS and their SQL-based query languages
with constructs and optimization techniques needed to design and develop a data
stream mining workbench, the first of its kind. As discussed in [16,17,110],
much progress has already been made, and we are now approaching completion of
our first system prototype.
C.2 Introduction
Nowadays organizations and companies are forced to continuously
adapt to a fast-changing business reality. As a consequence, the corresponding
Information Systems are also required to rapidly evolve to meet new requirements
[64]. At the data management level
this breaks the traditional notion of the database schema as an immutable object
that is carefully designed in advance with the expectation that it will last a long
time. Instead, the database schema becomes a constantly changing entity,
thus inducing the continuous reworking of queries and applications. This problem,
known as schema evolution or schema versioning in the literature,
has long been recognized by practitioners and researchers as a challenging and
pressing one, from theoretical, engineering and economical perspectives [108,102,5,72,15,98,99,101,103,70,96,97,28,53,92,114,74,45,56,54,26,89,29,57,91].
The described phenomenon becomes more pressing in the context
of web Information Systems, due to the collaborative nature of their development
and usage. Scientific projects, such as the Large Hadron Collider at
CERN in Geneva [2], or the Ensembl genetic database
[1], face even more intense evolution
requirements, in order to match the mutating needs of scientists from several
collaborating research centers. In our preliminary investigations [33]
we contacted various research centers and sampled their evolution histories, discovering
incredibly intense scenarios. For instance, Wikipedia [119]
evolved through over 170 schema versions in 4.5 years, and the Ensembl DB has seen over
410 schema versions in about nine years of lifetime.
Evolving an information system, and its data management core in
particular, is a labor-intensive and risky activity that can take 40% to 75% of
the overall development costs [14],
and thus appropriate tool support is strongly desired. The study [33]
of the schema evolution of the backend of the free encyclopedia Wikipedia confirms
our initial hunch on the costs of such evolution. In the Wikipedia schema evolution
history, each evolution step has caused up to 70% of the tested queries to fail,
requiring manual query adaptation [33].
Traditionally the design of the database schema was relegated
to an initial phase, as a consequence of a careful analysis of the system requirements.
However, the fast dynamics of today's business reality call for a more continuous
design activity. The schema is, in fact, evolved throughout the lifetime of a
system, which brings up another critical issue, i.e., documentation. Operating
under tight schedules, database administrators often fail to keep
the documentation up to date with the modified schema. This is vividly illustrated
in an actual comment1 of
a senior Wikipedia DB Administrator: "Move schema documentation to inline
comments in tables.sql, where it's less likely to be left years out of date."
The lack of good documentation also makes the key task of determining data and
metadata provenance extremely difficult, as we will discuss in Section C.4.
Existing Commercial Tools A preliminary analysis
of the state-of-the-art systems supporting evolution, both from open source [109,83,84,3,85,94,44,4,79,93]
and commercial [48,58,86,59,47,67,25,112,10,105,104,9,8,7,27,30,113,106,107,120,46,121,65,31,66]
worlds, calls for immediate actions. The solutions available on the market barely
address the most basic requirements to support schema evolution, leaving practitioners
struggling with the error-prone and labor-intensive activities of: (i) unassisted
schema re-design, (ii) data migration, (iii) manual adaptation of application
queries and updates, (iv) documentation of the evolution itself, and (v) provenance
of data and metadata.
In particular, the big players in the world of commercial DBMSs
have been mainly focusing on reducing the downtime when the schema is updated
[87] and on assistive
design tools [40,109].
Other tools of interest are [63] and
LiquiBase2. All of the
mentioned approaches focus on basic features that, although needed, provide only
limited help with critical tasks, such as application adaptation and documentation,
and thus fail to reduce the high maintenance costs.
Research Endeavors The limitations of commercial
systems are hardly surprising given the intrinsic complexity of the problem,
which involves theoretical issues such as mapping discovery [82],
mapping composition [69,50,81,23],
mapping invertibility [51,52,49],
query rewriting [42,43,41],
and view-update [18,39,62].
This very practical problem is, in fact, fueling some of the most exciting theoretical
endeavors undertaken by the database research community. On the other hand, the
research community that has the strong theoretical background needed to produce
theoretically sound solutions lacks the large-scale real-life case studies needed
to guide and validate their work for practical usefulness.
Many interesting results have indeed been produced on the general
problem of schema evolution. Without trying to be exhaustive, we can mention the
impact-minimizing methodology of [89],
the unified methodology for application and database evolution of [57],
the application-code generation of [29],
the framework for metadata model management of [73],
the visual tool of [88],
and the approaches presented in [21,24,114,122,75].
Further related works include the results on mapping information
preservation by Barbosa et al. [19],
the ontology-based repository of [26],
and the schema versioning approaches of [60].
XML schema evolution has been addressed in [78]
by means of a guideline-driven approach. Object-oriented schema evolution has
been investigated in [54].
In the context of data warehouses, X-TIME represents an interesting step toward
schema versioning by means of the notion of augmenting schema [56,95].
In [34,32]
we have proposed a first practical solution to snapshot database evolution, by
harnessing the theoretical advances on schema evolution modeling [23,114],
mapping composition [69,50,81,23],
mapping invertibility [51,52,49],
and query rewriting [42,43,41].
Similarly in [77,35]
we have proposed a transaction-time DB that supports schema evolution. In [38],
we discuss the vision of a framework combining schema evolution and data integration.
The gap between academia and practitioners
Advances in computer science require a continuous feedback process on the quality
of the obtained results. This is particularly difficult in the area of databases,
where, due to privacy and business concerns, it is often difficult for researchers
to test their innovative solutions on real-life data. We can thus say that, due
to the impedance mismatch between the academic and practitioners' communities
[100], researchers
have limited visibility on the actual needs of the industrial world, and limited
opportunity to validate their solutions on real-life case studies. In fact, for
the problem of schema evolution we are considering, the intense and fruitful research
effort mentioned above is not finding an easy inroad into commercial systems. The
present project aims at removing this bottleneck, by providing data and tools that will
help to (i) expose the actual needs of existing information systems to the research
community, (ii) turn recent research advances into innovative industrial solutions,
and (iii) automate documentation and provenance for databases under schema evolution.
The proposed solution By conducting exploratory
studies on real-life Web Information Systems [33],
and scientific databases, we have experienced the benefit of having first-hand
information on schema evolution needs, and had the opportunity to test
and validate our systems [34,32,77,35]
on real evolution histories. This project will extend this opportunity and its
benefits to the entire research community, and enhance its potential benefits
by:
- collecting a large-scale dataset of real-life schema evolution
histories and releasing both the dataset and its thorough analysis to the
community;
- developing a tool-suite capable of automating collection,
analysis, documentation, and provenance of schema evolution histories;
- designing a benchmark for systems supporting schema evolution
and assessing the current state-of-the-art system support.
We plan to conduct a large-scale campaign to collect evolution
histories from Web Information Systems, scientific projects, government organizations,
and open source systems; we have already made preliminary inquiries with some
of these organizations. The enthusiastic responses we have received from them confirm
the feasibility and also the desirability of our data collection plans.
The collected data will be released to the public via an open, curated
repository. This repository will represent, for researchers and developers, an
invaluable resource for validating innovative technologies and for obtaining better
insight on the challenges posed by Information Systems evolution. Besides storing
the evolution histories, the repository will also allow temporal querying of the
metadata histories, by building on and extending the Historical Metadata Manager presented
in [35]. Each evolution
history will be analyzed, providing several statistics that capture the essence
of the underlying evolution process. Furthermore, automated tools will derive,
from the SQL migration scripts, a high-level specification of the evolution based
on the notion of Schema Modification Operators (SMOs) such as those used in [34].
This will provide the much needed documentation of the evolution process itself
and cure the problem of degrading documentation quality that is now associated
with evolution. The SMO-based representation of the evolution will also enable
precise and automated data and metadata provenance analysis, crucial for scientific
databases.
In order to collect and analyze large datasets, and assure the
scalability of this process, we will develop automatic collection and analysis
tools for evolution histories, including improvements and extensions of the very
primitive ones described in [33].
These will automate the download, versioning, and analysis of schema evolution
histories, and support the quasi-automated collection of query workloads [33].
The tool-suite will be released to the public, greatly facilitating the extraction
of statistics that provide a comprehensive view over long and complex evolution
histories. This will provide database administrators and researchers with invaluable
knowledge. The availability of automatic tools and of a large repository will
be a key component for the formation of a community effort, where researchers and
practitioners will be enabled to access and contribute to this unique repository
of case studies, thus creating a reference point for both communities.
A second key objective of this project is to design and implement
a benchmark for schema evolution tools based on this testbed. This will enable
researchers to validate and compare their approaches and tools. In contrast to
previous approaches based on synthetic data generation [12,11],
our benchmark will be based on a mix of the most crucial and most common evolution
steps from the actual evolution histories in our dataset. Moreover, we will exploit
actual data and actual workloads in a series of experiments to test the features
of systems supporting schema evolution. By measuring and stress-testing the state-of-the-art
tools from the commercial, open source, and academic worlds, we will contribute to
improving the current systems, technology, and methods for schema evolution support.
The benchmark will also be released to the public to provide an invaluable tool to
validate and compare novel solutions against real-world schema evolution scenarios.
This will promote: (i) better targets for the research endeavors tackling real-world
pressing issues, (ii) validation of the research approaches, and (iii) comparison
and assessment of the various techniques against a reference testbed.
Impact Beyond Schema Evolution The usefulness
of the dataset we will collect goes well beyond the schema evolution problem.
In fact, the schemas, data, and query workloads we will provide will benefit researchers
working on query optimization, indexing, partitioning, privacy, and many others.
Our datasets will allow them to investigate the effectiveness of their approaches,
compared with state-of-the-art competitors, all measured on neutral ground. This
will contribute to bridging the gap between the academic and industrial worlds, facilitating
the industrial adoption of advanced research solutions.
C.3 Real-Life Testbed
In this section, we discuss the dataset we are planning to collect.
First we summarize a preliminary study we conducted on Wikipedia, and then explain
how this will be extended to hundreds of other schema evolution histories, in
order to build the first real-life testbed for schema evolution. The size and
nature of this endeavor calls for automated tools, which are discussed in Section C.4.
C.3.1 A Preliminary
Study
In recent years, web Information Systems have witnessed ever-increasing
popularity. Among others, Wikipedia, a multilingual, web-based, free-content encyclopedia,
has seen a great success3,
mainly due to its revolutionary concept of collaborative authoring. Wikipedia
is built on a software platform called MediaWiki4,
which powers over 30,000 wiki websites5.
MediaWiki is an open-source, data-intensive, and collaborative web-portal software.
The system architecture of MediaWiki relies on a relational DBMS backend, i.e., MySQL,
PostgreSQL, or Oracle, to store and manage massive amounts of data (several
hundred gigabytes in the case of Wikipedia). Web pages are dynamically generated
by querying the underlying database and rendering its content in HTML. In the
following we analyze the schema evolution of MediaWiki, based on its 171 schema
versions between April 2003 (first schema revision) and November 2007 (date of
this analysis). The complete analysis is reported in [33].
Basic Statistics In Figure 1,
we report the size of the MediaWiki DB schema throughout its history, in terms of the number of
tables and columns. The graphs show a clear growth trend, with
the number of tables increasing from 17 to 34 (100% increase) and the number
of columns from 100 to 242 (142%).
Figure 1: MediaWiki Schema Size: Number of Tables and Columns
Figure 2: Histograms of Tables and Columns Lifetimes
Figure 2 shows a histogram representation
of the table and column lifetimes, in terms of number of versions. The lifetimes
range from very long ones, e.g., the user table, which was alive throughout
the entire history, to short ones, e.g., the random table, which only survived
for two revisions. On average, each table lasted 103.3 versions (60.4% of the
total DB history). Similarly, each column lasted 97.17 versions on average (56.8% of
the total DB history).
Table 1: Macro-Classification of Schema Changes (One evolution
step may have more than one change type)
Macro-Classification of Changes We group the
170 evolution steps based on the types of evolution they present as in Table 1.
While the "actual schema changes" have an impact on the queries, as they modify
the schema layout, the evolution of the DBMS engine, indexes, and data types
(while being relevant to performance) does not require any query correction, because
of the physical data-independence provided by the DBMS. Table 1
shows the frequencies6
of the types of changes among the 170 evolution steps.
Schema Modification Operators To better understand the
relational DB schema evolution, we introduce a classification of the "actual schema
changes". Different formalisms can be exploited for this purpose. Shneiderman
and Thomas proposed in [102]
a comprehensive set of schema changes, including structural schema changes and
also changes regarding the keys and dependencies. More recently, Bernstein et
al. have also proposed a set of schema evolution primitives using algebra-based
constraints [22].
Among several options, we chose the Schema Modification Operators
(SMOs) that we proposed in [34,77].
SMOs capture the essence of the existing works, but can also express schema changes
not modeled by previous approaches. For example, by using function7
in the ADD COLUMN operator, SMOs can support semantic conversion of columns
(e.g., currency exchange), column concatenation/split (e.g., different address
formats), and other similar changes that have been heavily exploited in modeling
MediaWiki schema changes. The effectiveness of SMOs has been validated in [34,77],
where the PRISM and PRIMA systems used SMOs to describe schema evolution in transaction-time
databases and to support historical query reformulations over multi-schema-version
transaction-time databases.
The syntax of SMO is similar to that of SQL DDL, and provides
a concise way to describe typical modifications of a database schema and the corresponding
data migration. Every SMO takes as input a schema and produces as output a new
version of the same schema. Note that SMOs can be arbitrarily combined to describe
complex structural changes, such as those that occurred in the MediaWiki DB schema evolution.
The SMO-based representation of the evolution will be automatically extracted
by means of tools described in Section C.4.
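As a flavor of these operators, the following illustrative sequence (table and column
names are hypothetical; the complete syntax and semantics of SMOs are defined in [34,77])
merges two tables, adds a derived column, renames a column, and drops an obsolete table:

-- merge two tables into one (the data of both is migrated into the new table)
MERGE TABLE archive_old, archive_new INTO archive;
-- add a column whose value is computed from existing columns
ADD COLUMN full_address AS concat(street, city) INTO customer;
-- rename a column, preserving its content
RENAME COLUMN phone IN customer TO phone_number;
-- drop an obsolete table
DROP TABLE temp_stats;

Each statement is both human-readable documentation of the change and a specification
from which the corresponding data migration and query rewriting can be derived.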
Figure 3: SMOs distribution
Micro-Classification of Changes Using SMOs In this context
we exploit SMOs as a pure classification instrument to provide a fine-grained
analysis of the types of change the schema has been subject to. While there might
be several ways to describe a schema evolution step by means of SMOs, we carefully
select, analyzing the available documentation, the most natural set of SMOs describing
each schema change in the MediaWiki history. Figure 3
shows the distribution of the SMOs, presenting, for each type, how many times
it has been used in the entire schema evolution history, and a graphical representation
of the number of SMOs per schema version. It is interesting to note that the more
sophisticated SMOs (e.g., MERGE TABLE), while indispensable, are
not very common. The balance between column/table additions and deletions highlights
the "content preserving" attitude of Wikipedia8.
Impact on Applications In order to study the effect of
schema evolution on the frontend application, we analyze the impact of the schema
changes on six representative sets of queries. Each experiment tests the success
or failure of a set of queries, originally designed to run on a specific schema
version, when issued against other schema versions.
To simulate a case where current applications are run on databases
under older schema versions, we test three sets of queries, valid on the last
schema version, on all the previous schema versions (Figure 4).
Also, to study how legacy applications succeed or fail on newer versions
of the database schema, we test three sets of legacy queries on all the subsequent
schema versions.
Figure 4: a) Average query success rate against following
schema versions (the queries are designed for the 28th version,
and run against all the following versions); b) real Wikipedia profiler queries
executed against previous schema versions.
The six sets considered in our experiments are as follows:
Real-world templates, current (Figure 4.a):
the 500 most common query templates (extracted9
from over 780 million query instances), derived from the Wikipedia on-line
profiler10 and post-processed
for cleaning11.
Lab-gen queries, current (Figure 4.a):
2496 query instances generated by a local installation of the current version
of MediaWiki (release 1.11, schema version 171), interacting with the frontend12
and logging the queries issued against the underlying MySQL DBMS.
Lab-gen templates, current (Figure 4.a):
148 templates of queries extracted from the above queries.
Lab-gen queries, legacy (Figure 4.b):
4175 query instances generated by a local installation of an old version of MediaWiki
(release 1.3, schema
version 28), interacting with the frontend and logging the queries issued against
the underlying MySQL DBMS.
Lab-gen templates, legacy (Figure 4.b):
74 templates extracted from the above lab-gen queries, legacy.
Synthetic probe queries, legacy (Figure 4.b):
133 synthetic queries accessing single columns (i.e., SELECT tab_j.att_i
FROM tab_j) of schema version 28, designed to highlight the affected
schema portion.
Each set has been tested against all schema versions: the resulting
query execution success rates are shown in Figure 4.a
and Figure 4.b. The outliers in the graphs
(sudden and extremely low values) are due to syntactically incorrect DB schema
versions. Due to lack of space we omit further discussion of the results (more
comments are available in [33]),
but we simply note that these experiments provide clear evidence of the strong
impact of schema changes on applications, and support the case for better schema
evolution support.
C.3.2 Complete Dataset
We plan to extend the dataset by collecting and analyzing the
schema change histories of many other information systems. Currently, we consider
over a hundred candidate information systems from various contexts: web and open
source platforms, scientific projects, and governmental and administrative systems. Web IS
include wiki software, such as MediaWiki or TikiWiki, content management systems
(CMS) like Joomla, Drupal, XOOPS, and TYPO3, and many others, often released under
open source license. The typical schema evolution ranges from tens to hundreds
of schema versions. Scientific IS also demonstrate very dynamic cases of schema
evolution. For instance, the Ensembl project, funded by the European Bioinformatics Institute
and the Wellcome Trust Sanger Institute, provides a data-centric platform used
to support the homonymous human genome database, and 15 other genetic research
endeavors. The Ensembl DB has witnessed an intense schema evolution history. In about
9 years of life-time, more than 410 schema versions were released to the public, which
equates to almost one version a week. Also, we have contacted the European Organization
for Nuclear Research (CERN) and obtained access to their physics databases, which
often go through extensive schema changes. For instance, one of CERN's projects,
CASTOR (CERN Advanced STORage manager), which hosts 15 petabytes of data, reports
149 schema versions during the last three years (which averages to about one version
per week). We plan to further extend the pool of information systems to be included
especially in the area of Scientific DBs. We are, thus, actively opening collaborations
with multiple research centers, including the National Center for Biotechnology Information
(NCBI), which hosts several medical and biological databases [6].
We are also interested in schema evolution in administrative IS,
such as government and university databases. We have contacted the former IT director
of several Information Systems for the Italian government and obtained preliminary
access to some of their schema evolution histories. We also contacted administrators
of UCLA and Politecnico di Milano to obtain access to some administrative schema
evolution histories. While for scientific and open source projects the public
nature of the systems eliminates privacy issues, for administrative databases
privacy poses a major challenge in our endeavor. We are thus managing this on a case-by-case
basis, guaranteeing privacy and anonymity where required.
We have already contacted and obtained access to almost 100 databases
from open source projects, about 50 scientific databases, and about 20 administrative
governmental databases, for a grand-total of almost 4,000 schema versions. In
most cases, we have identified schema change histories ranging from tens
to hundreds of schema versions, and plan to explore them in great detail, as
in the Wikipedia case. This large corpus of case studies will provide a unique perspective
on the problem of schema evolution itself, and allow us to develop and validate
technologies against the actual needs of evolution faced by real-life information
systems.
C.4 Software Infrastructure
for evolutionary database design
As anticipated in the introduction,
schema evolution is typically handled under tight development schedules. Non-functional
aspects, such as documentation and analysis, are thus the first to suffer. Documentation
tends to degrade over time, and, as a consequence, the understanding of the schema
itself becomes more and more difficult. This is testified by various examples
of schema modifications applied and rolled back more than once in the Wikipedia
schema evolution history [33]. This
calls for tools that can: (i) provide a synthetic view over the entire evolution
history, (ii) capture the evolution in a more conceptual way, thus providing automated
documentation, and (iii) support data and metadata provenance tasks.
We, thus, plan to develop appropriate tools for automating the
data collection and analysis process. Furthermore, by automatically analyzing
the migration scripts we will derive the mappings between subsequent schema versions,
providing an added value to the bare schema histories. This will also enable
users to query the data collection, to freely inspect the history of metadata
and derive valuable provenance information. By allowing unrestricted temporal queries
over annotated metadata histories, we also enable unanticipated uses of our dataset,
and guarantee the extensibility of our platform. Hence, we propose a software infrastructure
that supports collection, analysis and querying of large-scale schema evolution
histories.
The proposed set of tools will enable the development of evolutionary
database design methodologies [13],
similarly to what has been proposed in software engineering under the name of
"agile methodologies" [71],
with the popular case of "extreme programming" [20].
C.4.1 Dataset Collection
Suite
In order to automate the schema history collection process, we
leverage the systematic recording of the source code evolution, typical of large
projects, such as biological and physics databases, Web Information Systems, or
large open source projects. Based on our preliminary studies we know, in fact,
that it is rather common for large organizations to maintain the history of their
database schema and application source code by versioning, in CVS or SVN repositories,
the SQL scripts that create the schema and the "patch" SQL scripts that migrate data
from the old to the current schema version. These repositories, storing textual
source-code information, serve the bare minimum purpose of tracking software
evolution, but fail to provide a high-level view of the evolution process. Furthermore,
any data provenance task is extremely taxing, and, due to limited and degrading
documentation, even the simple understanding of the current schema might become
challenging.
What is needed is a set of automated tools, capable of extracting
from such repositories the hidden knowledge about metadata evolution. We will
develop such tools, relying on the experience we obtained in designing [33]
and [35]. In particular,
the schema evolution histories, which are currently shredded in a textual format
in the CVS or SVN repositories, will be automatically collected and stored into
the Historical Metadata DB (HMDB).
The HMDB captures the history of the SQL standard information_schema,
together with various additional metadata (SMOs, user queries, etc.). Due to space
limitations, we discuss only a selected subset of the HMDB content. We
introduce it in a snapshot format, which is a simple extension of the SQL standard
information_schema, to later comment on the corresponding temporal archive.
schemata(schema_name, default_character_set_name)
tables(table_schema, table_name, table_type, table_rows)
columns(table_schema, table_name, column_name, data_type, default_value)
queries(query_id, query_text, schema_name, issue_timestamp)
query_exec(query_id, exec_timestamp, exec_success)
smos(smo_id, smo_text, smo_type, timestamp)
The first three relations, schemata, tables,
and columns, come from the information_schema, while the relations
named queries, query_exec, and smos capture additional metadata, including the mapping
between subsequent schema versions in terms of Schema Modification Operators (Section C.4.3
will discuss how SMOs can be extracted from schema histories). The queries
relation archives the user queries (as templates), the target schema, and the query
issuing time. The table query_exec stores the results (boolean flag exec_success)
of testing each of the above query templates against the subsequent schema versions.
Lastly, the table smos stores the SMOs describing the semantics of each
schema change.
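For instance, the success rate of each query template across the tested schema versions
can be computed directly from these relations. The following SQL query is an illustrative
sketch that assumes only the attributes listed above (with exec_success being a boolean flag):

SELECT q.query_id, q.schema_name,
       AVG(CASE WHEN e.exec_success THEN 1.0 ELSE 0.0 END) AS success_rate
FROM   queries q
JOIN   query_exec e ON e.query_id = q.query_id
GROUP BY q.query_id, q.schema_name;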
In order to effectively and efficiently archive the history of
these metadata, we leverage an XML-based format we developed and implemented
in [115,116,117,123],
named V-document.
By building on the advanced features of the transaction-time
temporal DBMS ArchIS [118]
and our system PRIMA [77],
we will provide access to metadata histories via an effective XQuery querying
interface, discussed in Section C.4.4.
In the remainder of this section, we discuss how the collected
data can be manipulated, enriched, and queried to provide a complete framework
for evolution support and documentation.
C.4.2 Analysis Suite
Due to the large size of schemas and schema histories, a raw
collection of schema versions is of very little use for researchers planning to
test their techniques, or DB Administrators managing a database. For this reason,
we are planning to develop a complete analysis suite. The HMDB representation
of the evolution history allows us to automate the collection of rich statistics
about the evolution itself, similarly to what we did for the Wikipedia case study.
Some of the statistics we are considering include simple measures such as: schema
growth (tables and columns), table and column life-time, key evolution, overall
editing activity over time. Furthermore, we plan to create and automate richer
statistics combining several factors to determine the overall level of evolution
activity in different portions of a DB schema. For example, by studying the amount
of change (to the set of columns, datatypes, keys, indexes, etc.) of a table we
can determine how (un)stable it is, and by observing the percentage of queries in
the workload involving the table we can capture how critical the table is.
In the case of Wikipedia, this would show how the core article
management tables are critical and unstable, thus warning the DB Administrator
and focusing his or her attention, so as to proactively invest in a better design of this portion.
This will provide an invaluable tool for both DBAs and researchers,
to design the evolution and to test innovative technologies.
The tool-suite we are designing contains a rich set of built-in
statistics, and will allow users to specify further statistics by combining simple
SQL queries.
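As an illustration of how such a combined statistic might be expressed on top of the
HMDB (a sketch only: it assumes the relations of Section C.4.1 and uses crude
name-matching via LIKE as a proxy for "the query or SMO involves the table"), consider:

SELECT tn.table_name,
       (SELECT COUNT(*) FROM smos s
         WHERE s.smo_text LIKE '%' || tn.table_name || '%') AS change_activity,
       (SELECT COUNT(*) FROM queries q
         WHERE q.query_text LIKE '%' || tn.table_name || '%') AS workload_share
FROM  (SELECT DISTINCT table_name FROM tables) tn
ORDER BY change_activity DESC, workload_share DESC;

Tables ranking high on both measures are the unstable-and-critical "hot spots" that
deserve the DBA's attention.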
C.4.3 Mapping Mining
One of the key tasks to achieve
automated documentation and provenance is to capture the relationship between
subsequent schema versions. This task is similar to the problem of schema matching
and mapping [82,68,55,90,76,37,61],
well known in the data integration and data exchange worlds. While schema matching
approaches tackle the more general problem of mapping unrelated schemas, in our
scenario extra information is available in terms of SQL data migration scripts.
These scripts capture the relationships between schema versions by means of an
operational specification. By leveraging this extra knowledge, we plan to design
a set of heuristics capable of analyzing SQL migration scripts and deriving an SMO-based
representation of the evolution. In our preliminary tests this approach has proved very
promising, and by exploiting the specific characteristics of our scenario, we
expect to obtain very high levels of precision and recall.
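For instance, a migration-script fragment like the one below (table names are hypothetical)
would be recognized and summarized by a single MERGE TABLE operator; the actual heuristics
will, of course, have to cope with the many syntactic variants of the same pattern:

-- input: fragment of a SQL migration script
CREATE TABLE archive AS SELECT * FROM archive_old;
INSERT INTO archive SELECT * FROM archive_new;
DROP TABLE archive_old;
DROP TABLE archive_new;

-- output: mined SMO-level summary of the same evolution step
MERGE TABLE archive_old, archive_new INTO archive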
We will take full advantage of the large body of work we developed
in [34,77,35,38]
to derive from the SMO specification of the evolution a logical mapping relating
subsequent schema versions. This will provide automated documentation, and data
provenance (by means of query rewriting). This will also enable us to exploit the
schema evolution support functionalities of the PRISM system we developed in [34],
thus providing a very complete suite for schema evolution.
C.4.4 Historical Metadata
Manager
The Historical Metadata Manager (HMM)
we designed in [36] is a tool capable
of storing and querying metadata histories, and once fully implemented will represent
a perfect interface for posing temporal queries over the schema histories. The metadata
histories are captured in the Historical Metadata DB (HMDB), discussed
above, and represented in an XML format.
This allows us to exploit XQuery as a powerful temporal query
language. Due to space limitations we present the effectiveness of the proposed
tools by means of a few simple examples; the interested reader can find further
details in [36].
The queries we present are grouped in two classes: (i) queries
over the history of the information_schema (Query 1-Query
3), and (ii) queries exploiting SMOs and logged queries (Query 4
and Query 5).
Simple but very useful temporal queries are snapshot queries such as
the following:
Query 1. What was the Wikipedia database schema valid on 2007-01-01?
for $db in document("wiki-hmdb.xml")/db,
    $tn in $db/Tables/row/table_name[@ts <= "2007-01-01" and @te > "2007-01-01"]
return
  <table> {$tn/@text}
    <cols>{ $db/Columns/row[table_name = $tn]/
            column_name[@ts <= "2007-01-01" and @te > "2007-01-01"]/text() }</cols>
  </table>
Similarly, it is possible (by range queries) to inspect specific
portions of the history, retrieving all the modifications that occurred in
a given period. Next, we give an example of temporal aggregate queries that can
be exploited to observe general trends of the schema evolution (the corresponding
XQuery for this and the following queries is omitted due to space limitations,
more details can be found in [36]):
Query 2. Retrieve the number of columns in the Wikipedia DB
throughout the history.
The output of such a query can be nicely rendered as we showed in
Figure 1, where it is easy to notice a clear growth
trend in the Wikipedia schema size of roughly 31% a year.
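As an indication of what such a temporal aggregate might look like, the following is a
minimal XQuery sketch that reuses the @ts/@te conventions of Query 1 and counts the
columns alive at each version timestamp (further details are in [36]):

let $db := document("wiki-hmdb.xml")/db
for $v in distinct-values($db/Columns/row/column_name/@ts)
order by $v
return
  <col_count version="{$v}">
    { count($db/Columns/row/column_name[@ts <= $v and @te > $v]) }
  </col_count>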
Furthermore, to analyze quality and stability of the design it
is possible to pose queries retrieving stable or unstable portions of the schema,
such as the following:
Query 3. Which table in the Wikipedia DB schema remained
valid for the longest period?
The answer to this query reveals that the user table was
the most stable in the DB schema. Another interesting class of queries exploits
the explicit characterization of the change semantics provided by SMOs to track
back (or forward) the evolution of information stored in a given schema element.
Consider the following example:
Query 4. Retrieve the previous version of the information
currently stored in table `page'.
Finally, by exploiting the information about query templates and
their execution, it is possible to retrieve and visualize the impact of the schema
evolution on the original user queries, as exemplified by the following query,
whose results have been shown in Figure 4:
Query 5. What's the success rate of legacy queries (valid
on 2004-08-15) after each Wikipedia schema change?
These relatively simple queries will be naturally supported by
the HMM on the history of metadata we archive. This querying capability
will provide a powerful tool for dissecting the metadata history, enabling a better
understanding of the ongoing schema evolution.
C.5 A Benchmark for systems
supporting schema evolution
As we have discussed in the previous section, information systems
are subject to intense schema evolution. A lot of effort has been devoted to developing
theories and tools to mitigate this problem. Various approaches present different
levels of support and focus on different aspects. We believe that it will be very
beneficial to have a standardized benchmark that will allow for a fair comparison
between them and also encourage further development of such tools.
In fact, we plan to leverage the extensive data of schema evolution
history from various information systems for building a benchmark based on real-world
case studies. By selecting a representative subset of evolution steps, we will
be able to highlight the key features that a tool targeting schema evolution must
provide. The large size of the evolution histories we collected can also be used
to test scalability.
C.5.1 Benchmark Design
We first discuss the main features a schema evolution support
system (SESS) should provide and then briefly present how this can be experimentally
measured.
Schema Evolution Support Features We divide the set of
desirable features into four main categories: documentation, predictability,
transformation, and reversibility.
Documentation refers to a set of features that systematically record
schema versions in history, including snapshot data, queries, and other database
objects14. This set of
features provides the minimum level of support, guaranteeing traceability of the
evolution. This set of features is offered to a certain extent by open-source
and commercial tools.
Predictability: during the design of a change in the schema, it is of
key importance to be able to predict the impact of the given evolution step
on schema, data, and queries. We group in this category the set of features of a SESS
that provide what-if scenario support for the various components of a database
(e.g., schema, data, queries, updates, integrity constraints, etc.). This set
of features is only partially supported by commercial and open-source tools.
Transformation: the actual modification of the schema components is also
very important. In this category we group features like: support of the data migration
process, automatic query adaptation, automatic integrity constraints adaptation,
index adjustment. Some of the features in this class are very advanced and only
available in academic prototypes [34,82].
Reversibility: the last class considered includes those features enabling
a seamless rollback of the evolution process. These features are partially achieved
at a basic level by some commercial tools, and supported by some of the
academic tools we considered.
Quantitative Measurement for the Features Some of the above-mentioned
features can only be present or absent in a given system; others, however, can be measured.
For example, when considering data migration, we can analyze
two key measures: migration time and downtime. The first represents the
overall time needed to accomplish a given migration task, and the second refers
to the amount of time (a subset of the former) during which the migrated data
cannot be accessed. By leveraging the large dataset we collect, we will define
a set of standardized experiments aimed at measuring, in a fair and uniform way,
key characteristics of an SESS. This will provide a quantitative measure of how
effective and efficient an SESS is in providing a given feature.
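As a simple illustration of what a single timed experiment might look like (a sketch
with hypothetical table and column names; the actual experiments will be drawn from the
collected histories), consider a column-concatenation step of the kind discussed in
Section C.3.1. Migration time is the elapsed time of the whole script, while downtime is
the window during which the table is locked or legacy queries on the old columns fail:

-- hypothetical evolution step to be timed by the benchmark
ALTER TABLE customer ADD COLUMN full_name VARCHAR(255);
UPDATE customer SET full_name = first_name || ' ' || last_name;  -- bulk data migration
ALTER TABLE customer DROP COLUMN first_name;                     -- legacy queries on the
ALTER TABLE customer DROP COLUMN last_name;                      -- old columns now fail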
C.5.2 Surveying Existing
Tools
We will apply the proposed benchmark to the state-of-the-art SESS,
ranging from commercial [48,58,86,59,47,67,25,112,10,105,104,9,8,7,27,30,113,106,107,120,46,121,65,31,66]
(e.g. MySQL Workbench, SwisSQL migration tool), open source tools [109,83,84,3,85,94,44,4,79,93]
(e.g. Liquibase, Hibernate), and academic tools [82,34,75,23,88]
(e.g. CLIO, PRISM). Some of them deliver quite limited features, while others
provide the most advanced features we consider, such as automatic rewriting of legacy
queries and updates. We plan to take a closer look into these systems, which
will also allow us to fine-tune our benchmark. The overall results of this activity
will be the first real-life schema evolution benchmark and a thorough survey of
the existing tools, which will provide a complete assessment of the maturity level
of the approaches to schema evolution.
C.6 Work Plan and Deliverables
The project objective and milestones will be achieved according
to the following workplan:
- In the first year, we will build the bulk of our testbed
by collecting and organizing the data sets we have described in the previous
sections and adding several new ones. We will build tools that automate the
collection and management of such data sets, and can scale up to thousands
of schema versions. We will set up an appropriate distribution infrastructure,
by means of an open, curated repository, which will be made accessible from the very
beginning to promote the formation of a community of users and contributors.
- In the second year, we will design and implement the analysis
tools for our software platform. This will enable us to extract from our testbed
(that will be continuously updated and extended throughout the rest of the
project) a large set of statistics. Moreover, during this year, we will tackle
the problem of Schema Modification Operators Mining from SQL data migration
scripts. This will produce tools to automatically capture the evolution in
terms of a more conceptual representation, thus enabling evolution documentation
and provenance support. Lastly, we will design and implement the Historical Metadata
Manager, which will make it easy to pose historical queries on the evolution
of the schema and other metadata.
- In the third year, we will design a benchmark for schema
evolution based on our testbed (that has been growing during these three years).
We will evaluate schema evolution/mapping tools from the commercial and the
research world on multiple features using quality metrics based on sound rationale.
We will design and automate experiments to test and validate traditional and
innovative approaches to schema evolution. By applying the benchmark to the
many open-source, commercial and academic tools available, we will provide
a precise characterization of state of the art tools, thus motivating and
guiding researchers and companies toward solving the current problem.
C.7 Outreach and dissemination
activities
The success of the research endeavor we are undertaking critically
depends on achieving a successful dissemination of the results, tools, and data
sets developed by our project. To guarantee this success we have scheduled a spectrum
of activities that include:
- extensive and timely documentation on results and software
(all of which will be downloadable) available from the project home page http://www.schemaevolution.org,
- Collaboration with several universities and research groups
(in addition to the groups at UCSD, Politecnico di Milano, and UCR we are already
working with). By leveraging the critical mass of schema evolution histories
collected in the project, and the tool-suite we are going to develop, we will
attract a large number of researchers and practitioners, particularly those
working with scientific information systems,
- The formation of an open-source community contributing
to this testbed. Toward that goal, we have already contacted Open Source groups,
which have agreed to contribute and participate in the tasks of forming and
administering the community. This activity is very important for the success
of the project, as it will allow us to: (i) make the invaluable deliverables
of our project available to a vast public, (ii) contribute to creating a reference
point for the community, and, thus, (iii) improve the level of assessment
of scientific results.
- Disseminating our results and tools through papers at research
conferences and technical meetings, and through demos and tutorials.
Through these activities, we will be able to reach a large community,
since the data we collect can play an important role well beyond the Schema Evolution
research area. In general, the testbed and benchmark created by this project will
contribute to assuring verifiable scholarly standards and to creating consensus in
the research community. This dovetails with recent efforts of major database conferences
and journals to assure the repeatability of scientific results, and to provide
special tracks for experiments and analysis, aiming at better assessing research
advances.
References
- [1]
- http://ensembl.org.
- [2]
- https://twiki.cern.ch/twiki/bin/view/pssgroup/pdbnumbers?rev=4.
- [3]
- Schema-compare: a PHP script that will compare two database
schemas and show the differences side-by-side http://sourceforge.net/projects/schema-compare/.
- [4]
- Simple schema migration for django http://www.aswmc.com/dbmigration/.
- [5]
- Edbms - the evolutionary database management system - concepts
and summary. In Technical Report, Information Research Associates, Austin,
Texas, 1983.
- [6]
- See: http://www.ncbi.nlm.nih.gov.
- [7]
- AdventNet, Inc. Swissql console 4.2: Sql query translation
/ conversion tool http://www.swissql.com/products/sql-translator/sql-converter.html.
- [8]
- AdventNet, Inc. Swissql data migration tool http://www.swissql.com/products/datamigration/data-migration.html?ad-main.
- [9]
- AdventNet, Inc. Swissql dbchangemanager 4.6 http://www.swissql.com/products/database-compare-synchronize-tool/.
- [10]
- Aldon. Aldon lifecycle manager : Multiplatform change management
software http://www.aldon.com/pdf/Aldon_database.pdf.
- [11]
- Bogdan Alexe, Wang-Chiew Tan, and Yannis Velegrakis. Comparing
and evaluating mapping systems with stbenchmark. Proc. VLDB Endow.,
1(2):1468-1471, 2008.
- [12]
- Bogdan Alexe, Wang-Chiew Tan, and Yannis Velegrakis. Stbenchmark:
towards a benchmark for mapping systems. Proc. VLDB Endow., 1(1):230-244,
2008.
- [13]
- Scott W. Ambler and Pramodkumar J. Sadalage.
Refactoring Databases : Evolutionary Database Design (Addison Wesley Signature
Series) http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20\&path=ASIN/0321293533.
Addison-Wesley Professional, March 2006.
- [14]
- Grigoris Antoniou, Dimitris Plexousakis, and Giorgos Flouris.
Evolving ontology evolution. In SOFSEM 2006: Theory and Practice of Computer
Science, pages 14-29. Springer Berlin / Heidelberg, 2006.
- [15]
- Gad Ariav. Temporally oriented data definitions - managing
schema evolution in temporally oriented databases. Data and Knowledge
Engineering, 6(1):451-467, 1991.
- [16]
- Y. Bai, H. Thakkar, H. Wang, and C. Zaniolo.
Time-stamp management and query execution in data stream management systems.
IEEE Internet Computing, 12(6):13-21, 2008.
- [17]
- Yijian Bai and Carlo Zaniolo. Minimizing latency and memory
in dsms: a unified approach to quasi-optimal scheduling. In SSPS,
pages 58-67, 2008.
- [18]
- F. Bancilhon and N. Spyratos. Update semantics
of relational views. ACM Trans. Database Syst., 6(4):557-575, 1981.
- [19]
- Denilson Barbosa, Juliana Freire, and Alberto O. Mendelzon.
Designing information-preserving mapping schemes for xml. In VLDB,
pages 109-120, 2005.
- [20]
- Kent Beck. Extreme Programming Explained: Embrace Change.
Addison-Wesley Professional, 1st edition, October 1999.
- [21]
- Philip A. Bernstein. Applying model management to
classical meta data problems. In CIDR, 2003.
- [22]
- Philip A. Bernstein, Todd J. Green, Sergey Melnik,
and Alan Nash. Implementing mapping composition. In VLDB, 2006.
- [23]
- Philip A. Bernstein, Todd J. Green, Sergey Melnik,
and Alan Nash. Implementing mapping composition. VLDB J., 17(2):333-353,
2008.
- [24]
- Philip A. Bernstein and Erhard Rahm. Data warehouse
scenarios for model management. In ER, 2003.
- [25]
- BEST SoftTool, Inc. Sql db compare http://bestsofttool.com/SQLDBCompare/SDC_Feature.aspx.
- [26]
- Hassina Bounif and Rachel Pottinger. Schema repository
for database schema evolution. DEXA, 0:647-651, 2006.
- [27]
- Bradmark Technologies, Inc. Dbcontrol for udb: Database
administration and schema evolution for the ibm db2 universal database environment.
datasheet http://www.bradmark.com/site/products/pdfs/dbcontrol_db2.pdf.
- [28]
- C. De Castro, F. Grandi, and M. R. Scalas.
Schema versioning for multitemporal relational databases. Information
Systems, 22(5):249-290, 1997.
- [29]
- Anthony Cleve and Jean-Luc Hainaut. Co-transformations
in database applications evolution. LNCS: Generative and Transformational
Techniques in Software Engineering, pages 409-421, 2006.
- [30]
- Troy Coleman. Evolution of online schema change to database
definition on demand. blog http://ibmsystemsmag.blogs.com/db2utor/2007/09/evolution-of-on.html,
September 2007.
- [31]
- The Australian Software Company. Sql delta for oracle
databases - beta http://www.sqldelta.com/oraclebeta.html.
- [32]
- Carlo Curino, Hyun J. Moon, MyungWon Ham, and Carlo
Zaniolo. The prism workwench: Database schema evolution without tears. In
ICDE, 2009.
- [33]
- Carlo Curino, Hyun J. Moon, Letizia Tanca, and Carlo
Zaniolo. Schema Evolution in Wikipedia: toward a Web Information System Benchmark.
ICEIS, 2008.
- [34]
- Carlo Curino, Hyun J. Moon, and Carlo Zaniolo. Graceful
database schema evolution: the prism workbench. Very Large DataBases (VLDB),
1, 2008.
- [35]
- Carlo Curino, Hyun J. Moon, and Carlo Zaniolo. Managing
the history of metadata in support for db archiving and schema evolution.
In Fifth International Workshop on Evolution and Change in Data Management
(ECDM 2008), 2008.
- [36]
- Carlo Curino, Hyun J. Moon, and Carlo Zaniolo. Managing
the history of metadata in support for db archiving and schema evolution.
In ER Workshop on Evolution and Change in Data Management (ECDM),
2008.
- [37]
- Carlo Curino, Giorgio Orsi, and Letizia Tanca. X-som: A
flexible ontology mapper. In DEXA Workshops, pages 424-428, 2007.
- [38]
- Carlo Curino, Letizia Tanca, and Carlo Zaniolo. Information
systems integration and evolution: Ontologies at rescue. In International
Workshop on Semantic Technologies in System Maintenance (STSM), 2008.
- [39]
- U. Dayal and P.A. Bernstein. On the Correct Translation
of Update Operations on Relational Views. ACM Transactions on Database
Systems, 8(3):381-416, 1982.
- [40]
- DB2 development team. DB2 Change Management Expert. 2006.
- [41]
- Alin Deutsch, Alan Nash, and Jeff Remmel. The chase revisited.
In Principles of database systems (PODS), pages 149-158, New York,
NY, USA, 2008. ACM.
- [42]
- Alin Deutsch and Val Tannen. Optimization properties for
classes of conjunctive regular path queries. In DBPL '01, pages 21-39,
London, UK, 2002. Springer-Verlag.
- [43]
- Alin Deutsch and Val Tannen. Mars: A system for publishing
xml from mixed and redundant storage. In VLDB, 2003.
- [44]
- Jan Dittberner. Migrate versioning http://code.google.com/p/sqlalchemy-migrate/wiki/MigrateVersioning.
- [45]
- Nina Edelweiss and Álvaro Freitas Moreira. Temporal
and versioning model for schema evolution in object-oriented databases. Data
Knowl. Eng., 53(2):99-128, 2005.
- [46]
- edule. Synchronize sql server databases http://e-dule.com/sql_server/compare_synchronize_databases/Databases_Synchronization_Tool_DB_SynchroComp.asp.
- [47]
- Embarcadero. Embarcadero change manager. datasheet, 2008
http://www.embarcadero.com/products/changemanager/cmdatasheet.pdf.
- [48]
- John Chun et al. Db2 and ruby on rails, part
1: Getting started with db2 and ruby on rails http://www.ibm.com/developerworks/db2/library/techarticle/dm-0705chun/.
article, 2007.
- [49]
- Ronald Fagin. Inverting schema mappings. ACM Trans.
Database Syst., 32(4), 2007.
- [50]
- Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and
Wang Chiew Tan. Composing schema mappings: Second-order dependencies
to the rescue. In PODS, pages 83-94, 2004.
- [51]
- Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and
Wang-Chiew Tan. Quasi-inverses of schema mappings. In PODS '07, pages
123-132, 2007.
- [52]
- Ronald Fagin and Alan Nash. The structure of inverses in
schema mappings. In IBM Research Report RJ10425, 2007.
- [53]
- Enrico Franconi, Fabio Grandi, and Federica Mandreoli.
Schema evolution and versioning: A logical and computational characterisation,
2001.
- [54]
- Renata de Matos Galante, Clesio Saraiva dos Santos,
Nina Edelweiss, and Alvaro Freitas Moreira. Temporal and versioning model
for schema evolution in object-oriented databases. Data & Knowledge
Engineering, 53(2):99-128, 2005.
- [55]
- F. Giunchiglia, P. Shvaiko, and M. Yatskevich.
S-match: an algorithm and an implementation of semantic matching. In Y. Kalfoglou,
M. Schorlemmer, A. Sheth, S. Staab, and M. Uschold, editors,
Semantic Interoperability and Integration, number 04391 in Dagstuhl
Seminar Proceedings, 2005.
- [56]
- Matteo Golfarelli, Jens Lechtenbörger, Stefano Rizzi,
and Gottfried Vossen. Schema versioning in data warehouses. In ER (Workshops),
pages 415-428, 2004.
- [57]
- Jean-Marc Hick and Jean-Luc Hainaut. Database application
evolution: a transformational approach. Data Knowl. Eng., 59(3):534-558,
2006.
- [58]
- IBM. Db2 change management expert overview, 2007 http://publib.boulder.ibm.com/infocenter/mptoolic/v1r0/index.jsp?topic=/com.ibm.db2tools.chx.doc.ug/chxucoview01.htm.
- [59]
- Idera. Sql change manager. datasheet http://www.idera.com/products/sqlchange/Datasheet.pdf.
- [60]
- H. V. Jagadish, Inderpal Singh Mumick, and Michael
Rabinovich. Scalable versioning in distributed databases with commuting updates.
In Conference on Data Engineering, pages 520-531, 1997.
- [61]
- Y. Kalfoglou and M. Schorlemmer. Ontology mapping:
the state of the art. The Knowledge Engineering Review, 18(01):1-31,
2003.
- [62]
- Yannis Kotidis, Divesh Srivastava, and Yannis Velegrakis.
Updates through views: A new hope. In ICDE '06: Proceedings of the 22nd
International Conference on Data Engineering, page 2, Washington,
DC, USA, 2006. IEEE Computer Society.
- [63]
- T. Lemke and R. Manthey. The schema evolution
assistant: Tool description, 1995.
- [64]
- Ying Liu, Shu ren Zhang, and Mei qi Fang. Ecological
analysis on evolution of information systems. In I3E (2), pages 308-315,
2007.
- [65]
- DBBalance LTD. Cross-database studio. datasheet http://www.dbbalance.com/CDBSStudio6.pdf.
- [66]
- Miro International Pty Ltd. Sql sync toolkit ver.
2.3 [build 1194] http://www.bizkit.ru/index.php?option=com_content&task=view&id=2&Itemid=51&lang=.
- [67]
- Red Gate Software Ltd. Sql solutions http://www.red-gate.com/products/SQL_Refactor/index.htm.
- [68]
- J. Madhavan, P. A. Bernstein, and E. Rahm.
Generic schema matching with cupid citeseer.ist.psu.edu/madhavan01generic.html,
year = 2001, bdsk-url-1 = citeseer.ist.psu.edu/madhavan01generic.html. In
The VLDB Journal, pages 49-58.
- [69]
- Jayant Madhavan and Alon Y. Halevy. Composing mappings
among data sources. In VLDB, 2003.
- [70]
- S. Marche. Measuring the stability of data models.
European Journal of Information Systems, 2(1):37-47, 1993.
- [71]
- Robert Cecil Martin. Agile Software Development:
Principles, Patterns, and Practices. Prentice Hall PTR, Upper Saddle
River, NJ, USA, 2003.
- [72]
- L. Edwin McKenzie and Richard T. Snodgrass. Schema
evolution and the relational algebra. Information Systems, 15(2):207-232,
1990.
- [73]
- Sergey Melnik, Erhard Rahm, and Philip A. Bernstein.
Rondo: A programming platform for generic model management. In SIGMOD,
2003.
- [74]
- T. Mens and T. Tourwè. A survey of software
refactoring. IEEE Trans. Softw. Eng., 30(2):126-139, 2004.
- [75]
- MIT. Cl-migration : a port of the database migrations feature
of ruby on rails to common lisp http://common-lisp.net/project/cl-migrations/.
- [76]
- P. Mitra, Natalya Fridman Noy, and A. R.
Jaiswal. Omen: A probabilistic ontology mapping tool. Workshop on Meaning
Coordination and Negotiation at ISWC-04, 2004.
- [77]
- Hyun J. Moon, Carlo Curino, Alin Deutsch, Chien-Yi
Hou, and Carlo Zaniolo. Managing and querying transaction-time databases under
schema evolution. Very Large DataBases (VLDB), 1, 2008.
- [78]
- Mirella M. Moro, Susan Malaika, and Lipyeow Lim. Preserving
XML Queries during Schema Evolution. In WWW, pages 1341-1342, 2007.
- [79]
- Joel Moss. Cake db migrations v3.2 http://bakery.cakephp.org/articles/view/cake-db-migrations-v2-1.
- [80]
- B. Mozafari, H. Thakkar, and C. Zaniolo.
Verifying and mining frequent patterns from large windows over data streams.
In ICDE, 2008.
- [81]
- Alan Nash, Philip A. Bernstein, and Sergey Melnik.
Composition of mappings given by embedded dependencies. In PODS,
2005.
- [82]
- The Database Group @ University of Toronto. Clio
project http://queens.db.toronto.edu/project/clio/index.php.
- [83]
- open source. Liquibase http://www.liquibase.org/manual/home.
- [84]
- open source. Metabase : Dbms-independent php api to access
sql databases http://freshmeat.net/projects/metabase/.
- [85]
- open source. Tora : toolkit for oracle http://tora.sourceforge.net/.
- [86]
- Oracle. Oracle enterprise manager 10g change management
pack. datasheet http://www.oracle.com/technology/products/oem/pdf/ds_change_pack.pdf.
- [87]
- Oracle development team. Oracle database 10g online data
reorganization and redefinition http://www.oracle.com/technology/deploy/availability/.
Oracle White Paper, 2005.
- [88]
- G. Papastefanatos, F. Anagnostou, Y. Vassiliou,
and P. Vassiliadis. Hecataeus: A what-if analysis tool for database schema
evolution. Software Maintenance and Reengineering, 2008. CSMR 2008. 12th
European Conference on, pages 326-328, April 2008.
- [89]
- Young-Gook Ra. Relational schema evolution for program
independency. Intelligent Information Technology, pages 273-281,
2005.
- [90]
- E. Rahm and P. A. Bernstein. A survey of approaches
to automatic schema matching. The VLDB Journal, 10(4):334-350, 2001.
- [91]
- E. Rahm and P. A. Bernstein. An online bibliography
on schema evolution. SIGMOD Record, 35(4):30-31, 2006.
- [92]
- Sudha Ram and G. Shankaranarayanan. Research issues
in database schema evolution: the road not taken. In Boston University
School of Management, Paper No: 2003-15, 2003.
- [93]
- LLC Red Hat Middleware. Relational persistence for
java and .net http://hibernate.org/.
- [94]
- Sebastian Riedel. Dbix::migration - seamless db schema
up- and downgrades http://search.cpan.org/~daniel/DBIx-Migration-0.05/lib/DBIx/Migration.pm.
- [95]
- Stefano Rizzi and Matteo Golfarelli. X-time: Schema versioning
and cross-version querying in data warehouses. In ICDE, pages 1471-1472,
2007.
- [96]
- J.F. Roddick. A Survey of Schema Versioning Issues for
Database Systems. Information and Software Technology, 37(7):383-393,
1995.
- [97]
- J.F. Roddick. A Model for Schema Versioning in Temporal
Database Systems. In Proc. 19th. ACSC Conf., pages 446-452, 1996.
- [98]
- John F. Roddick. Dynamically changing schemas within
database models. The Austrailian Computer Journal, 23(3):105-109,
1991.
- [99]
- John F. Roddick. Sql/se - a query language extension
for databases supporting schema evolution. SIGMOD Record, 21(3):10-16,
1992.
- [100]
- Arnon Rosenthal and Len Seligman. Pragmatics and open problems
for inter-schema constraint theory. In ICDEW '06: Proceedings of the 22nd
International Conference on Data Engineering Workshops, page 1,
Washington, DC, USA, 2006. IEEE Computer Society.
- [101]
- M. R. Scalas, A. Cappeli, and C. De Castro.
A model for schema evolution in temporal relational databases. In the
European Conference on Computers in Design, Manufacturing and Production (CompEuro),
1993.
- [102]
- Ben Shneiderman and Glenn Thomas. An architecture for automatic
relational database system conversion. ACM Transactions on Database Systems,
7(2):235-257, 1982.
- [103]
- D. Sjoberg. Quantifying schema evolution. Information
and Software Technology, 35(1):35-44, 1993.
- [104]
- DTM Software. Dtm migration kit http://www.sqledit.com/mk/index.html.
- [105]
- DTM Software. Dtm schema comparer http://www.sqledit.com/scmp/index.html.
- [106]
- Quest Software. The all-in-one sql server management and
development toolset. datasheet http://www.quest.com/Quest_Site_Assets/PDF/DSD_ToadSQLServer40.pdf.
- [107]
- Quest Software. Automated sql server change management.
datasheet http://www.quest.com/Quest_Site_Assets/PDF/DSD-ChangeDirector-F-US.pdf.
- [108]
- S.Y.W. Su and H. Lam. Application program conversion
due to database changes. In VLDB, 1976.
- [109]
- MySQL Development Team. Mysql workbench: A data modeling
guide for developers and dbas. White paper, 2008.
- [110]
- H. Thakkar, B. Mozafari, and C. Zaniolo.
A data stream mining system. In ICDM, 2008.
- [111]
- Hetal Thakkar, Barzan Mozafari, and Carlo Zaniolo. Designing
an inductive data stream management system: the stream mill experience. In
SSPS, pages 79-88, 2008.
- [112]
- Toad. Toad dba suite for oracle. datasheet http://www.quest.com/Quest_Site_Assets/PDF/DSD_Toad_DBASuite_US.pdf.
- [113]
- Adept SQL Tools. Comparing sql schema and table data
with adeptsql diff http://www.adeptsql.com/sql_compare_tool_docs/index.htm.
- [114]
- Yannis Velegrakis, RenĀ»e J. Miller, and Lucian Popa.
Mapping adaptation under evolving schemas. In VLDB, 2003.
- [115]
- F. Wang and C. Zaniolo. XBiT: An XML-based Bitemporal
Data Model. In ER, 2004.
- [116]
- Fusheng Wang and Carlo Zaniolo. An XML-Based Approach to
Publishing and Querying the History of Databases. World Wide Web: Web
Information Systems Engineering, 8(3):233-259, 2005.
- [117]
- Fusheng Wang, Carlo Zaniolo, and Xin Zhou. Temporal xml?
sql strikes back! In TIME, pages 47-55, 2005.
- [118]
- Fusheng Wang, Carlo Zaniolo, and Xin Zhou. Archis: an xml-based
approach to transaction-time temporal database systems. VLDB J.,
accepted for publication, 2009.
- [119]
- Wikimedia Foundation. Wikipedia, the free encyclopedia
http://en.wikipedia.org/,
2008. [Online].
- [120]
- xSQL Software. What's new: Support for sql server 2008
and more http://www.xsqlsoftware.com/Sql2008Support/Index.aspx.
- [121]
- dbMaestro Yaniv Yehuda, Joint Managing Director.
Database compare and synchronize methods as a part of software release management
process http://www.dbmaestro.com/DbMaestro/24/1/525.aspx.
- [122]
- Cong Yu and Lucian Popa. Semantic adaptation of schema
mappings when schemas evolve. In VLDB, 2005.
- [123]
- Xin Zhou, Fusheng Wang, and Carlo Zaniolo. Efficient temporal
coalescing query support in relational database systems. In DEXA,
2006.
Footnotes:
1. This is found in a CVS commit at: http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=8792.
2. Available on-line: http://www.liquibase.org/.
3. Wikipedia consistently ranks among the ten most popular websites according to Alexa.
4. http://www.mediawiki.org
5. See http://s23.org/wikistats/.
6. Note that each evolution step might contain more than one type of change.
7. Both from system libraries and user-defined.
8. The main exception is the information supporting user rights management, which was significantly reduced in the database after version v9335 (the 65th version), when it was moved to the application layer.
9. The templates are extracted by ignoring constants and retaining only the query structure (a minimal extraction sketch follows these footnotes).
10. Available on-line at http://noc.wikimedia.org/cgi-bin/report.py.
11. The cleaning process removes syntactical errors introduced by the template extraction of the Wikipedia profiler.
12. To exercise as many query types as possible, we tried to trigger all features accessible from the web browser.
13. The oldest version compatible with the environment of our experimental setting.
14. In this proposal, database objects refers to indexes, views, stored procedures, and triggers, unless otherwise mentioned.
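To illustrate footnote 9, the following is a minimal, hypothetical sketch of how such query templates could be derived: literal constants are replaced by placeholders and whitespace is normalized, so that queries differing only in their constants map to the same template. The function name extract_template and the regular expressions are illustrative assumptions and do not reproduce the actual extraction code of the Wikipedia profiler.

    import re

    def extract_template(query):
        """Reduce a SQL query to its template: replace literal constants
        with '?' placeholders and keep only the query structure
        (illustrative sketch, not the actual profiler code)."""
        # Replace single-quoted string literals (handling escaped quotes).
        template = re.sub(r"'(?:[^'\\]|\\.)*'", "?", query)
        # Replace standalone numeric literals.
        template = re.sub(r"\b\d+(?:\.\d+)?\b", "?", template)
        # Normalize whitespace so layout differences do not create new templates.
        return re.sub(r"\s+", " ", template).strip()

    # Queries that differ only in their constants reduce to the same template.
    q1 = "SELECT page_title FROM page WHERE page_id = 42"
    q2 = "SELECT page_title FROM page WHERE page_id = 7"
    assert extract_template(q1) == extract_template(q2)
    # -> SELECT page_title FROM page WHERE page_id = ?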