Optimizing Data Warehouse Loading Procedures for Enabling Useful-Time Data Warehousing

Ricardo Jorge Santos
CISUC – DEI – FCT, University of Coimbra
Coimbra, Portugal
lionsoftware.ricardo@gmail.com

Jorge Bernardino
CISUC – ISEC – Superior Institute of Engineering, Polytechnic Institute of Coimbra
Coimbra, Portugal
jorge@isec.pt

ABSTRACT
The purpose of a data warehouse is to aid decision making. As the real-time enterprise evolves, synchronism between transactional data and data warehouses is redefined. To cope with real-time requirements, data warehouses must enable continuous data integration, in order to deal with the most recent business data. Traditional data warehouses are unable to support any dynamics in structure and content while they are available for OLAP. Their data is periodically updated because they are unprepared for continuous data integration. For real-time enterprises that need decision support while the transactions are occurring, (near) real-time data warehousing seems very promising. In this paper we survey and test today's most used loading techniques and analyze which are the best data loading methods, presenting a methodology for efficiently supporting continuous data integration in data warehouses. To accomplish this, we use techniques such as table structure replication with minimum content and query predicate restrictions for selecting data, to enable loading data into the data warehouse continuously, with minimum impact on query execution time. We demonstrate the efficiency of the method using the TPC-H benchmark, executing query workloads while simultaneously performing continuous data integration.

Keywords
ETL, Data Warehouse Loading, Continuous Data Integration.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IDEAS 2009, September 16–18, 2009, Cetraro, Calabria, Italy.
Copyright 2009 ACM 978-1-60558-402-7/09/09…$5.00.

1. INTRODUCTION
Essentially, operational systems are transactional systems which support the daily business processes and store the complete detailed information for each business transaction, in real-time. These systems use on-line transactional processing (OLTP). Since they need to ensure their availability during each business transaction, OLTP databases frequently store only the current daily information considered essential for those transactions. They usually process small, atomic and isolated transactions, which typically read and/or write a very small amount of data. These features allow minimizing data processing efforts and processing response time for each transaction, so that the system can handle a high volume of simultaneous transactions, therefore maximizing performance and availability for its users. The main purpose of a data warehouse (DW) is to aid decision making. It collects data from multiple heterogeneous operational source systems, storing summarized integrated business data in a central repository used by analytical applications with different user requirements. The databases of a DW usually store the complete business history. Due to this, they frequently have a huge number of rows and grow to gigabytes, terabytes and even petabytes of storage size. The common process for obtaining decision making information from DWs is based on executing ad-hoc queries and tools which perform on-line analytical processing (OLAP) [7]. Due to the dimension of the tables in the DW, decision support queries usually access huge amounts of data, making performance one of the most important issues in data warehousing.
Traditionally, it has been well accepted that DW databases are updated periodically, typically on a daily, weekly or even monthly basis [28]. This update policy was defined as common practice due to two major assumptions: 1) since the DW has huge sized tables, doing data integration in a continuous fashion, similar to OLTP operational systems, would create immense bottlenecks in OLAP processing and data updating procedures, due to the time and resource consumption implied in loading data into such tables and updating the data structures (such as indexes, partitions and materialized views, among others) used for performance optimization; and 2) business decisions were made on a daily, weekly or monthly basis, dispensing with near real-time business data.
This type of update policy meant that the DW's data was never up-to-date, because transactional records saved between those updates were not included in its databases. This implied that the most recent operational records were excluded from the results supplied by OLAP tools. Until recently, using periodically updated data for obtaining decision support information was not a crucial issue. However, for enterprises such as e-business, stock brokering, online telecommunications, and health systems, for instance, relevant information needs to be delivered as fast as possible to knowledge workers or decision systems which rely on it to react in a near real-time manner, according to the new and most recent data captured by the organization's information system [8]. Active Data Warehousing refers to a new trend where DWs are updated as frequently as possible, due to the high demands of users for fresh data. The term is also designated as Real-Time Data Warehousing (RTDW) for that reason in [24]. Today's business models and processes take advantage of the latest technology in informatics and telecommunications, bringing the capability and the necessity of obtaining decision making information while the business transactions themselves are taking place.
Nowadays, IT managers are facing crucial challenges in deciding whether to build a real-time data warehouse instead of a conventional one, and whether their existing DW has become inadequate and needs to be converted into a real-time DW in order to remain competitive. In some specific cases, data update delays greater than a few seconds or minutes may jeopardize the usefulness of the whole system. This makes supporting RTDW a critical issue for such applications.
The term Real-Time in RTDW is a relative concept, because it is impossible to update both operational source systems and OLAP databases at exactly the same moment in time. Usually, what is referred to as RTDW is the effort to enable updating OLAP databases as soon as possible after business transactions are committed, minimizing the time gap between their actual occurrence and their propagation into the DW. This allows DW users or tools to take decisions and react actively to a set of business transactions which might have occurred in the very recent past, contrarily to what they could do if they were using traditional DWs. In our opinion, the main goal for changing the data update policy of a DW from periodical batch load updates towards continuous data integration is to deliver decision making information to OLAP and/or business intelligence tools, enabling them to react while the transaction itself is occurring. If this can be accomplished, we consider that the DW is capable of delivering decision making information in Useful-Time. Therefore, the main issue in useful-time data warehousing is: how can relevant information for aiding decision making be delivered to OLAP end users and tools in order to bring business intelligence to the transaction itself?
The DW's data area is updated by Extraction, Transformation and Loading (ETL) tools. ETL tools are responsible for identifying and extracting relevant data from operational systems, customizing and integrating it into a common format, cleaning and conforming it into an adequate format for updating the DW data area and, finally, loading the final formatted data into its database. The efficiency and feasibility of the methods in this last step of ETL are crucial for enabling useful-time data warehousing. In a nutshell, accomplishing near zero latency between OLTP and OLAP systems consists in ensuring data integration in a near continuous fashion from the former type of systems into the latter. In order to make this feasible, several issues need to be taken under consideration: (1) operational OLTP systems are designed to meet well-specified (short) response time requirements aiming for maximum system availability, which means that a RTDW scenario would have to cope with the overhead this implies for OLTP systems; (2) the tables in a data warehouse's database directly related to transactional records (commonly named fact tables) are usually huge in size, and therefore the addition of new data and consequent procedures such as index updating or referential integrity checks would certainly have an impact on OLAP systems' performance and data availability. Our work is focused on the DW perspective; for that reason we present an efficient methodology for continuous data integration, performing the ETL loading process.
This paper presents a solution which enables efficient continuous data integration in DWs, while allowing OLAP execution simultaneously, with minimum decrease of performance. With this, we seek to minimize the delay between the recording of transactional information and its reflected update in the decision support database. The issues focused on in this paper concern the DW end of the system, referring to how to perform the loading processes of ETL procedures and the usage of the DW's data area. The items concerning extracting and transforming operational (OLTP) source systems data are not the focus of this paper. Therefore, we shall always be referring to the data warehouse point of view in the remainder of the paper.
The remainder of this paper is organized as follows. In section 2, we refer to the requirements for useful-time data warehousing. Section 3 presents background and related work in real-time data warehousing. Section 4 presents the best data loading methods for actual data warehouses. Section 5 explains our continuous data integration methodology, along with its experimental evaluation, demonstrating its functionality. The final section contains concluding remarks and future work.

2. REQUIREMENTS FOR USEFUL-TIME DATA WAREHOUSING
The size of DWs implies that each decision support query which is executed usually accesses large amounts of records, performing actions such as joins, sorting, aggregating and calculus functions. To optimize these accesses – and, consequently, performance – the DW uses predefined data structures (such as indexes or partitions, for instance), which are also large in size and have a considerably high level of complexity. This implies that it is very difficult to efficiently update the DW's data area in useful-time, for the propagation of near real-time data would most likely overload the server, given its update frequency and volume, involving immense complex operations on the DW's data structures and dramatically degrading OLAP performance. Our aim is to transform a standard DW using batch loading during update windows (during which OLAP is not allowed) into a near zero latency analytical environment providing current data. The requirements for this kind of analytical environment go beyond what is typical in a traditional DW. The major issue is how to enable continuous data integration while assuring that it minimizes the negative impact on the system's main features, such as availability and response time of both OLTP and OLAP systems. An in-depth discussion of these features from the analytical point of view (to enable timely consistent analysis) is given in [6]. The useful-time DW provides access to an accurate, integrated, consolidated view of the organization's information and helps to deliver near real-time information to its users. This requires efficient ETL techniques enabling continuous data integration, the focus of this paper. It becomes necessary to cope with at least two radical data state changes. First, it is necessary to perform continuous data update actions, due to the continuous data integration, which should mostly concern row insertions. Second, these update actions must be performed in parallel with the execution of OLAP, which – due to its new near real-time nature – will probably be solicited more often. Therefore, the main contributions of this paper are threefold:
• Maximizing freshness of data by efficiently and rapidly integrating the most recent decision making data, obtained from OLTP, into the data warehouse in useful-time;
• Minimizing OLAP response time while simultaneously integrating data in a continuous fashion;
• Maximizing the data warehouse's availability by reducing its update time window, in which users and OLAP applications are off-line.
3. RELATED WORK
So far, research has mostly dealt with the problem of maintaining the DW in its traditional periodical update setup [14, 27]. In a different line of research, data streams [1, 2, 15, 20, 31, 32] could possibly appear as a potential solution. However, research in data streams has focused on topics concerning the front-end, such as on-the-fly computation of queries, without a systematic treatment of the issues raised at the back-end of a DW [10]. Much of the recent work dedicated to RTDW is also focused on conceptual ETL modelling [4, 5, 19, 23, 33, 34], lacking the presentation of concrete extraction, transformation and loading algorithms along with their consequent OLTP and OLAP performance issues. DW operational processes are normally composed of a labor-intensive workflow, involving data extraction, transformation, integration, cleaning and transportation. To deal with this, specialized tools are available today in the market [36, 37, 38, 39], under the general title of Extraction – Transformation – Loading (ETL) tools.
Temporal data warehouses address the issue of supporting temporal information efficiently in data warehousing systems [25]. In [27], the authors present efficient techniques (e.g. temporal view self-maintenance) for maintaining data warehouses without disturbing source operations. A related challenge is supporting large-scale temporal aggregation operations in DWs [26]. In [4], the authors describe an approach which clearly separates the DW refreshment process from its traditional handling as a view maintenance or bulk loading process. They provide a conceptual model of the process, which is treated as a composite workflow, but they do not describe how to efficiently propagate the data. [21] discusses data currency quality factors in DWs and proposes a DW design that considers these factors.
An important issue for near real-time data integration is the accommodation of delays, which has been investigated for (business) transactions in temporal active databases [18]. The conclusion is that temporal faithfulness for transactions has to be provided, which preserves the serialization order of a set of business transactions. Although possibly lagging behind real-time, a system that behaves in a temporally faithful manner guarantees the expected serialization order.
In [23], the authors describe the ARKTOS ETL tool, capable of modeling and executing practical ETL scenarios by providing explicit primitives for capturing common tasks (such as data cleaning, scheduling and data transformations) using a declarative language. ARKTOS offers graphical and declarative features for defining DW transformations and tries to optimize the execution of complex sequences of transformation and cleansing tasks.
In [13], a zero-delay DW is described using Gong, which assists in providing confidence in the data available to every branch of the organization. Gong is a Tecco product [3] that offers uni- or bi-directional replication of data between homogeneous and/or heterogeneous distributed databases. Gong's database replication enables zero-delay business to assist in the daily running and decision making of the organization.
But not all transactional information needs to be immediately dealt with in real-time decision making requirements. We can define which groups of data are more important to include rapidly in the DW and other groups of data which can be updated at a later time. Recently, in [9], the authors presented an interesting architecture on how to define the types of update and time priorities (immediate, at specific time intervals, or only on DW offline updates) and the respective synchronization for each group of transactional data items.

4. SEARCHING FOR MOST EFFICIENT DATA LOADING PROCEDURES
The maturity of today's DBMS shows the evolution of all kinds of data management routines. Along with hardware evolution, we can manage greater amounts of data, increasingly faster. Their performance leads us to assume that today they may be able to deal with continuous data integration in DWs. One of the primary features to consider is searching for the most efficient data loading procedures, whether OLAP is performed or not. In this section, we focus on determining the actual fastest and most secure way of getting data into the data warehouse without jeopardizing data quality, demonstrating which are the most efficient data loading methods. To do this, we consider the set of most common and available data loading techniques for the most used DBMS, such as Oracle, MySQL, Postgres, SQL Server, etc.
Nowadays, DBMS supply standard bulk loading utilities for batch loading data. Oracle has SQL*Loader [16] for importing data directly from outside the database itself. This approach is useful in traditional periodically updated data warehouses, in which the data to load usually represents a large number of rows. In these cases, OLTP data is usually extracted, cleaned, transformed and saved into a flat text-file format, which will be loaded into the DW later on by its DBMS bulk loader, such as SQL*Loader. Another option is loading data using standard SQL statements or stored procedures through middleware drivers, such as ODBC. They can execute with higher or lower frequency, according to the needed update frequency in the DW. We also need to consider whether the OLTP systems use the same DBMS as the DWs.
Therefore, we aim to test which of the following data loading methods is the most efficient:
1) DBMS bulk data loaders such as SQL*Loader, using flat-text files (see the control-file sketch after Table 1);
2) Insertion from outside the DBMS using SQL statements and/or stored procedures executed through middleware drivers, such as ODBC; or
3) Insertion from inside the DBMS using SQL statements and/or stored procedures without using middleware drivers.
To perform these tests, we used the TPC-H decision support benchmark [22] and created a 25 Gigabyte TPC-H database with the Oracle 10g DBMS [16], on an Intel Pentium IV 2.8GHz CPU with 1 Gigabyte of 400 MHz RAM and an IDE UDMA-133 7200rpm 180 Gigabyte hard drive. The purpose of this database is to support a business referring to customer orders for diverse articles. There are two tables that mainly support every new business transaction: table Orders, which saves every new customer order, and table LineItem, which saves all ordered items for each customer order. To test data integration in this database, we simulate a scenario which considers that it must be able to deal with loading 1,000,000 transactions, in which each transaction is composed of 1 row for the Orders table and an average of 4 rows for the LineItem table. We assume that this amount of data, characterized in Table 1, represents 1 day (24 hours) of the transactional operational systems' work.

Table 1. Characterization of the data used in the experiments
  Table      Nr. of rows to load   Data size to load
  Orders     1,000,000             120 Mbytes
  LineItem   4,000,000             435 Mbytes
  Total      5,000,000             555 Mbytes
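As a point of reference for method 1, a SQL*Loader run of this kind is driven by a small control file along the following lines. This is only a sketch: the data file name, field order and date format are assumptions and not taken from the paper.

  -- sketch of a SQL*Loader control file for appending new Orders rows
  LOAD DATA
  INFILE 'new_orders.dat'
  APPEND
  INTO TABLE Orders
  FIELDS TERMINATED BY '|'
  (O_OrderKey, O_CustKey, O_OrderStatus, O_TotalPrice,
   O_OrderDate DATE "YYYY-MM-DD", O_OrderPriority,
   O_Clerk, O_ShipPriority, O_Comment)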
4.1 Loading Data as a One-Step Batch
As we have mentioned before, the integration of new data in DWs can be done using one of two possible data update policies: executing periodical data updates, in the traditional DW fashion, or integrating data continuously, in a near real-time fashion. In the first case, the new decision support data to integrate in the DW is usually a large amount of data, easily reaching dozens or hundreds of megabytes or more, concerning all operational transactions that occurred since the last DW update was performed. This update is performed while the DW is offline to OLAP users and tools, meaning that the data must be fully integrated as soon as possible, in order to minimize this time window. All the data structures in the DW which are used for performance optimization, such as indexes, partitions, materialized views, etc., are typically rebuilt when this update is executed, regaining their maximum efficiency according to the data's nature and features. This allows the database to regain maximum processing speed once more. In the continuous data integration policy used in useful-time data warehousing, there is no foresight on how to maintain the efficiency of the performance optimization data structures, for the nature and features of the new data to be integrated are unknown. Therefore, it is obvious that sooner or later, after a certain number of insertions, the performance decreases and becomes too poor to be considered acceptable. To regain performance it is necessary to rebuild the performance optimization data structures, similar to what is done for the traditional DW. Therefore, optimizing data loading while the DW databases are offline to end users and OLAP tools is important for both traditional and useful-time data warehousing.
Table 2 shows the results for inserting 1,000,000 complete transactions as soon as possible into the 25 Gbyte TPC-H database. The Oracle bulk loader, SQL*Loader, was used with optimized parameters according to the expertise referred to in articles published by experts in this field [29, 30].

Table 2. Time (HH:MM:SS) spent inserting a load of 1,000,000 TPC-H transactions with each loading method
                   Table Orders   Table LineItem   Total
  SQL*Loader       00:03:01       00:23:14         00:26:15
  ODBC SQL Insert  00:21:11       01:26:32         01:47:43
  DBMS SQL Insert  00:02:16       00:08:19         00:10:35

Many authors [8, 12, 29] mention that using bulk loaders is the fastest way of getting data into the DW. Observing Table 2, we can state that this is true if the data comes from outside the DBMS. However, if the data already lies within the DBMS, the standard SQL Insert statement outperforms the bulk loader. This is an important conclusion, which should be considered when designing useful-time DW ETL tools. Since multi-tasking is available and more than one parallel instance of each data loading method can be executed at the same time, we also tested loading data with several simultaneous data loading instances. The experiments showed that executing several instances of bulk loaders at the same time for integrating data into the same tables is not efficient. Contrarily to what could be expected at first, every extra running instance of the bulk loader degraded performance by approximately 30%. This happens because SQL*Loader tries to insert data into the same tablespace segments, causing immense data block locking problems and consequent database waits. Therefore, only one instance of the bulk loader should be used at a time, at least for each table to update. Tests with the other data loading methods revealed the same results, although with performance degrading by approximately 10%.
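To illustrate the distinction drawn from Table 2: when the new rows already reside inside the DW's DBMS (for example, in a previously loaded staging area), method 3 reduces to plain SQL, with no file handling and no middleware round trips. The staging table names below are hypothetical and not part of the TPC-H schema:

  -- data already inside the DBMS: a plain INSERT ... SELECT per table
  INSERT INTO Orders   SELECT * FROM Orders_Staging;
  INSERT INTO LineItem SELECT * FROM LineItem_Staging;
  COMMIT;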
4.2 Loading Data in a Continuous Fashion
To integrate data in a continuous, near real-time manner into the DW database, instead of having a considerably large amount of data to insert in a one-step batch load, we need to consider that the data arrives in small amounts, each representing a business transaction, which needs to be rapidly stored. In the test scenario mentioned earlier, we want to load 1,000,000 complete transactions in 24 hours. This means that a complete transaction must be integrated and committed into the TPC-H database every 0.0864 seconds, during 24 hours. This is very different from loading a large amount of data in bulk and committing the changes when all is completed, as shown in the prior sub-section. When integrating small amounts of new data, the new rows of each loaded transaction are committed in their respective tables. This means that the number of commits that needs to be executed and the frequency at which data arrives make it a new challenge. Since the new data comes from the OLTP systems, which operate outside the DW database, it has been formerly extracted, cleaned and transformed. When it is ready to be integrated into the DW database, data loading can be done using one of the two following methods mentioned before: using a bulk loader resorting to a flat-text file containing the new data, or using ODBC SQL Insert procedures.
Regarding Oracle SQL*Loader, the bulk loader performs the following actions to complete a data loading procedure: create and open a new log file for writing the loading procedure status; access and open the flat-text source file which has the new data to load; read the data in the flat-text source file; write the data into the corresponding database table(s); commit the changes in the database table(s); close the flat-text source file; write the results of the data load procedure in the log file and close it. As mentioned before, loading 1,000,000 complete transactions in 24 hours implies the insertion of one complete transaction every 0.0864 seconds. After testing, we concluded that every execution of SQL*Loader took between 1.5 and 1.9 seconds to execute the actions mentioned above. For update frequencies lower than 2 seconds, the bulk loader is not efficient, for the log files show that many records are not loaded into their tables because the operating system file operations cannot keep up with the execution frequency. Therefore, to use SQL*Loader efficiently, it needs to execute once every 2 seconds, loading 24 transactions each time, to load 1,000,000 complete transactions a day. When using ODBC middleware for loading and committing each complete TPC-H transaction, we need to determine if it can assure the integration of 1,000,000 transactions within the 24-hour limit. To do this, we tested inserting the data with a varying number of simultaneous loaders. The results are shown in Figure 1.

Fig. 1. ODBC insertion volume for running simultaneous data loading middleware instances (transactions loaded in 24 hours: 1 loader – 455,625; 2 loaders – 627,706; 4 loaders – 935,956; 8 loaders – 1,474,375; 16 loaders – 1,491,250)
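For reference, the loading periods used above follow directly from the 24-hour target:

  86,400 s / 1,000,000 transactions = 0.0864 s per transaction
  2 s / 0.0864 s ≈ 23.1, hence 24 transactions per 2-second SQL*Loader cycle
  8 transactions × 0.0864 s ≈ 0.691 s between ODBC loading cycles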
As seen in Figure 1, a single ODBC loader cannot insert 1 million transactions in 24 hours. Using 8 simultaneous instances of the ODBC loader is the most efficient method, for 16 simultaneous ODBC loaders increase the number of loaded transactions only insignificantly. Given this, we can state that to accomplish loading 1,000,000 transactions in a day we can integrate a set of 8 complete transactions every 691 ms using 8 simultaneous instances. In conclusion, if we use SQL*Loader, we can integrate new data every 2 seconds, 24 complete transactions each time. If we use the ODBC middleware loader, we can integrate 8 new complete transactions each time, every 0.691 seconds. This is the closest to continuous data integration that both loading methods can get.

4.3 Loading Data in Large Tables vs. Loading Data in Small Tables
Since data is stored physically in tablespaces, the resources involved in the management of their segments can also interfere with data loading performance. The number of tablespace segments which needs to be managed for each table is proportional to the amount of data that lies in the table. Therefore, we also tested inserting 1,000,000 transactions into the original tables Orders and LineItem against inserting that same data into empty replicas of these tables, using the same procedures as shown in sub-section 4.1.

Table 3. Time (HH:MM:SS) inserting a single complete load of 1,000,000 TPC-H transactions using empty tables
                   Empty Orders Replica   Empty LineItem Replica   Total      Time Improv.
  SQL*Loader       00:01:13               00:06:05                 00:07:18   72%
  ODBC SQL Insert  00:10:02               00:42:09                 00:52:11   52%
  DBMS SQL Insert  00:00:55               00:04:55                 00:05:50   45%

Comparing the results in Table 3 with Table 2, loading new data into an empty table is significantly faster than loading it into a table with a considerable initial size. This should also be taken into account when designing useful-time data warehousing ETL. We shall now present our continuous data integration methodology, based on the conclusions illustrated in this section.
5. USEFUL-TIME DATA WAREHOUSE LOADING METHODOLOGY
Our methodology is focused on four major areas: (1) data warehouse schema adaptation; (2) ETL loading procedures; (3) OLAP query adaptation; and (4) DW database packing and reoptimization. It is mainly based on a very simple principle: new row insertion procedures in tables with few (or no) contents are performed much faster than in big sized tables, as shown in the previous section of this paper. It is obvious and undeniable that data handling in small sized tables is much less complex and much faster than in large sized tables. As a matter of fact, this is mainly the reason why OLTP data sources are maintained with the fewest possible necessary records, in order to maximize their availability. The proposed continuous data warehouse loading methodology is presented in Figure 2.

Fig. 2. General architecture of the proposed continuous data warehouse loading methodology

5.1 Adapting the Data Warehouse Schema
Suppose a very simple sales data warehouse with the schema illustrated in Figure 3, having two dimensional tables (Store and Customer, representing business descriptor entities) and one fact table (Sales, storing business measures aggregated from transactions). To simplify the figure, the Date dimension is not shown. This DW allows storing the sales value per store, per customer, per day. The primary keys are represented in bold, while the referential integrity constraints with foreign keys are represented in italic. The factual attribute S_Value is additive. This property in facts is very important for our methodology, as we shall demonstrate further on.

Fig. 3. Sample sales data warehouse schema

For the area concerning the data warehouse schema, we adopt the following method:
Data warehouse schema adaptation for supporting useful-time data warehousing: creation of an exact structural replica of all the tables of the DW that could eventually receive new data. These tables (also referred to as temporary tables) are created empty of contents, with no defined indexes, primary key, or constraints of any kind, including referential integrity. For each table, an extra attribute must be created for storing a unique sequential identifier related to the insertion of each row within the temporary tables.
The modified sub-schema is shown in Figure 4. The unique sequential identifier attribute in each temporary table records the sequence in which each row is appended to the database. This allows identifying the exact sequence of each newly inserted row, useful for restoring prior data states in disaster recovery procedures, and also for discarding dimensional rows which have more recent updates. For instance, if the same customer has had two updates in the OLTP systems which, consequently, lead to the insertion of two new rows in the temporary table CustomerTmp, only the most recent one is relevant. This can be defined by considering as most recent the row with the highest CTmp_Counter for that same customer (CTmp_CustKey).

  StoreTmp:    StTmp_StoreKey, StTmp_Description, StTmp_Address, StTmp_PostalCode, StTmp_Phone, StTmp_EMail, StTmp_Manager, StTmp_Counter
  CustomerTmp: CTmp_CustKey, CTmp_Name, CTmp_Address, CTmp_PostalCode, CTmp_Phone, CTmp_EMail, CTmp_Counter
  SalesTmp:    STmp_StoreKey, STmp_CustomerKey, STmp_Date, STmp_Value, STmp_Counter

Fig. 4. Sample DW sub-schema for supporting useful-time DW
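As a concrete illustration of this adaptation, the temporary fact table from Figure 4 could be declared as follows. This is a minimal sketch: the attribute names come from Figure 4, the column types are assumed, and no index, primary key or foreign key is declared on purpose.

  -- temporary fact replica: no indexes, no primary key, no referential integrity
  CREATE TABLE SalesTmp (
    STmp_StoreKey     NUMBER,
    STmp_CustomerKey  NUMBER,
    STmp_Date         DATE,
    STmp_Value        NUMBER,
    STmp_Counter      NUMBER    -- unique sequential insertion identifier
  );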
The authors of the ARKTOS tool [23] refer that their own experience, as well as the most recent literature, suggests that the main problems of ETL tools do not consist only of performance problems (as would normally be expected), but also of aspects such as complexity, practicability and price. By performing only the record insertion procedures inherent to continuous data integration, using empty or small sized tables without any kind of constraint or attached physical file related to them, we guarantee the simplest and fastest logical and physical support for achieving our goals [12]. The fact that the only significant change in the logical and physical structure of the data warehouse's schema is the simple adaptation shown in Figure 4 allows implementing the ETL procedures in a manner that maximizes their operability. Data loading may be done by simple standard SQL instructions or DBMS batch loading software such as SQL*Loader [16], with a minimum of complexity. There is no need for developing complex routines for updating the data area, in which the needed data is easily accessible, independently of the ETL tools used.

5.2 ETL Loading Procedures
To refresh the DW, once the ETL application has extracted and transformed the OLTP data into the correct format for loading the data area, it immediately proceeds to insert that record as a new row in the corresponding temporary table, filling the unique sequential identifier attribute with the autoincremented sequential number. This number should start at 1 for the first record to insert in the DW after executing the packing and reoptimizing technique (explained in section 5.4 of this paper), and then be autoincremented by one unit for each record insertion. The algorithm for accomplishing continuous data integration by the ETL tool may be similar to:

  Trigger for each new OLTP record (after it is committed)
    Extract new record from OLTP system
    Clean and transform the OLTP data, shaping it into the DW destination table's format
    Increment record insertion unique counter
    Create a new record in the data warehouse temporary destination table
    Insert the data in the temporary destination table's new record, along with the value of the record insertion unique counter
  End_Trigger
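In a DBMS such as the Oracle 10g used in the experiments, the unique counter and the final step of this algorithm could be realized roughly as below. This is only a sketch of one possible implementation: the sequence name is hypothetical and the bind variables stand for the already cleaned and transformed OLTP values.

  -- the insertion counter implemented as a database sequence (hypothetical name)
  CREATE SEQUENCE stmp_counter_seq START WITH 1;

  -- the only write performed per transaction: a plain append into the temporary table
  INSERT INTO SalesTmp (STmp_StoreKey, STmp_CustomerKey, STmp_Date,
                        STmp_Value, STmp_Counter)
  VALUES (:store_key, :cust_key, :sale_date, :sale_value,
          stmp_counter_seq.NEXTVAL);
  COMMIT;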
In the following, we demonstrate a practical example explaining situations regarding the updating of the data warehouse shown in Figure 4. Figure 5 presents the insertion of a row into the data warehouse temporary fact table, recording a sales transaction of value 100 which took place at 2008-05-02 in the store with St_StoreKey = 1, related to the customer with C_CustKey = 10, identified by STmp_Counter = 1001. Meanwhile, other transactions occurred, and the organization's OLTP system recorded that instead of a value of 100 for the mentioned transaction, it should be 1000. The rows in the temporary fact table with STmp_Counter = 1011 and STmp_Counter = 1012 reflect this modification of values. The first eliminates the value of the initial transactional row and the second has the new real value, due to the additivity of the STmp_Value attribute. The definition of which attributes are additive and which are not should be the responsibility of the Database Administrator (DBA). According to [11], the most useful facts in a data warehouse are numeric and additive. The method for data loading uses the simplest method for writing data: appending new records. Any other type of writing method requires the execution of more time consuming and complex tasks.

Fig. 5. Partial contents of temporary fact table SalesTmp with exemplification record insertions

5.3 OLAP Query Adaptation
Consider the following query, calculating the total revenue per store in the last 7 days:

  SELECT S_StoreKey, Sum(S_Value) AS Last7DaysSV
  FROM Sales
  WHERE S_Date >= SystemDate() - 7
  GROUP BY S_StoreKey

To take advantage of the method and include the most recent data in the OLAP query response, queries should be rewritten taking under consideration the following rule: the FROM clause should join all rows from the required original and temporary tables with relevant data, excluding all fixed restriction predicate values from the WHERE clause whenever possible. The modification of the prior instruction is illustrated below. It can be seen that the relevant rows from both tables are joined for supplying the OLAP query answer, filtering the rows used in the resulting dataset according to the restrictions in the original instruction.

  SELECT S_StoreKey, Sum(S_Value) AS Last7DaysSV
  FROM (SELECT S_StoreKey, S_Value
        FROM Sales
        WHERE S_Date >= SystemDate() - 7)
       UNION ALL
       (SELECT STmp_StoreKey, STmp_Value
        FROM SalesTmp
        WHERE STmp_Date >= SystemDate() - 7)
  GROUP BY S_StoreKey

An interesting and relevant aspect of the proposed methodology is that if users wish to query only the most recent information, they only need to do so against the temporary replicated tables. For instance, if the temporary tables are meant to be filled with data for each business day before they are recreated, and we want to know the sales value of the current day, per store, the adequate response could be obtained with the following SQL instruction:

  SELECT STmp_StoreKey, Sum(STmp_Value) AS TodaysValue
  FROM SalesTmp
  WHERE STmp_Date = SystemDate()
  GROUP BY STmp_StoreKey
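One practical way to apply the rewrite rule only once, instead of editing every OLAP query individually, is to hide the union behind a view. This is not part of the paper's method, merely a sketch of how the rule could be packaged, assuming the column lists of Sales and SalesTmp shown in Figures 3 and 4:

  -- a view exposing original plus freshly integrated rows under one name
  CREATE OR REPLACE VIEW SalesAll AS
    SELECT S_StoreKey, S_CustomerKey, S_Date, S_Value FROM Sales
    UNION ALL
    SELECT STmp_StoreKey, STmp_CustomerKey, STmp_Date, STmp_Value FROM SalesTmp;

Existing queries could then reference SalesAll in their FROM clause; whether the DBMS pushes the restriction predicates into both branches of the view is optimizer-dependent.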
This way, our method aids the processing of the data warehouse's most recent data, for this kind of data is stored within the temporary replica tables, which are presumed to be small in size. This minimizes the CPU, memory and I/O costs involved in processing queries over the most recent data. Theoretically, this would make it possible to deliver the most recent decision making information while the business transaction itself occurs.

5.4 Packing and Reoptimizing the Data Warehouse
Since the data is integrated into tables without optimization of any kind that could speed up querying, such as indexes, for instance, query functionality is affected, implying a decrease in performance. After a certain number of insertions the performance becomes too poor to be considered acceptable. To regain performance it is necessary to execute a pack routine which will update the original DW schema tables using the records in the temporary tables, and recreate them empty of contents, along with rebuilding the original tables' indexes and materialized views, so that maximum processing speed is obtained once more. For updating the original DW tables, the rows in the temporary tables should be aggregated according to the original tables' primary keys, keeping the rows with the highest unique counter attribute value wherever duplicate values exist in non-additive attributes, for they represent the most recent records. The time needed for executing these procedures represents the only time period in which the DW is unavailable to OLAP tools and end users, for they need to be executed exclusively. The appropriate moment for doing this can be determined by the DBA, or automatically, taking under consideration parameters such as a determined number of records in the temporary tables, the amount of physically occupied space, or yet a predefined period of time. The determination of this moment should consist of the best possible compromise between its frequency of execution and the amount of time it takes away from overall user availability, which depends on the physical, logical and practical characteristics inherent to each specific DW implementation and is not an object of discussion in this paper.
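For the sample schema, the fact-table part of such a pack routine could look roughly like the sketch below. It is only an illustration under stated assumptions: the additive S_Value measure is folded into Sales per primary key, the temporary table is then recreated empty, and one hypothetical index is rebuilt; the handling of dimensional tables and materialized views is omitted.

  -- fold the appended rows into the original fact table, adding the additive measure
  MERGE INTO Sales s
  USING (SELECT STmp_StoreKey AS store_key, STmp_CustomerKey AS cust_key,
                STmp_Date AS sale_date, SUM(STmp_Value) AS sale_value
         FROM SalesTmp
         GROUP BY STmp_StoreKey, STmp_CustomerKey, STmp_Date) t
  ON (s.S_StoreKey = t.store_key AND s.S_CustomerKey = t.cust_key
      AND s.S_Date = t.sale_date)
  WHEN MATCHED THEN UPDATE SET s.S_Value = s.S_Value + t.sale_value
  WHEN NOT MATCHED THEN INSERT (S_StoreKey, S_CustomerKey, S_Date, S_Value)
                        VALUES (t.store_key, t.cust_key, t.sale_date, t.sale_value);

  -- leave the temporary table empty of contents again
  TRUNCATE TABLE SalesTmp;

  -- rebuild the original table's performance structures (index name hypothetical)
  ALTER INDEX sales_date_idx REBUILD;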
5.5 Experimental Evaluation
In order to evaluate our methodology, we tested the execution of a set of TPC-H queries { Q1, Q2, Q4, Q6, Q11, Q12, Q13, Q14, Q16, Q22 } while continuously integrating data, aiming for the insertion of 1,000,000 transactions coming from outside the database within 24 hours, representing 555 Mbytes of data. Both data loading methods were experimented with. Following what was mentioned in section 4 of this paper, we used one instance of SQL*Loader loading 24 transactions each time, every 2 seconds. We also tested ODBC middleware data loading, inserting 8 simultaneous transactions at a time, every 0.691 seconds. We compare these results to the standard execution time of the mentioned TPC-H query workload, without performing continuous data integration. These results are shown in Table 4.

Table 4. Experimental evaluation of the continuous data integration methodology with the new DW schema
  Standard OLAP exec. time (seconds):                             3842
  OLAP exec. time without data integration:                       4012   (4.4% increase in exec. time)
  OLAP exec. time using ODBC data loading:                        10244  (166.6% increase in exec. time)
  OLAP exec. time using bulk loading:                             4848   (26.2% increase in exec. time)

As seen in Table 4, the extra time needed for executing the query workload when accessing and joining the temporary tables with the original ones led to an increase of 4.4% over the standard execution time (4012 – 3842 = 170 seconds). This represents an almost insignificant impact on OLAP response time. Concerning the data loading methods, using ODBC middleware allows inserting data at a higher frequency than bulk loading, but the impact on OLAP response time for the former is much higher than for the latter. The extra time needed for executing the query workload using the bulk loading method is 1006 seconds (4848 – 3842), representing an increase of 26.2% over the standard execution time, against 6402 seconds and an increase of 166.6% over the standard execution time for ODBC data loading, which is much worse. Therefore, using frequent bulk loading of a small set of complete transactions, with predefined time intervals between executions which assure data quality, seems to be the best method.

5.6 Final Remarks
Notice that only record insertions are used for updating the DW for all related transactions in the OLTP source systems. Since this type of procedure does not require any record locking in the tables (except for the appended record itself) nor search operations for previously stored data before writing (as in update or delete instructions), the time necessary to accomplish this is minimal. The issue of record locking is further minimized by the fact that the referred tables do not have any indexes or primary keys, implying absolutely no record locking, except for the appended record itself. Furthermore, since they do not have constraints of any sort, including referential integrity and primary keys, there is no need to execute time consuming tasks such as index updating or referential integrity cross checks. Kimball refers in [12] that many ETL tools use an UPDATE ELSE INSERT function when loading data, considering this a performance killer. With our method, any appending, updating or eliminating data tasks on OLTP systems reflect themselves only as new record insertions in the data warehouse, which allows minimizing row, block and table locks and other concurrent data access problems. Physical database tablespace fragmentation is also avoided, since there is no deletion of data, only sequential increments. This allows us to state that the data update time window for our methods is minimal for the insertion of new data, maximizing the availability of that data, and consequently contributing to effectively increasing the data warehouse's global availability and minimizing any negative impact on its performance.
Since we want to obtain data in near real-time, the time gap between recording OLTP transactions and their extraction by ETL processes is minimal, occurring nearly at the same time, which somewhat reduces error probability. We can also assume that the amount of intermediate "information buckets" which the data passes through in the ETL area is also minimal, for temporary storage is almost not needed. Furthermore, instead of extracting a significant amount of OLTP data, which is what happens in "traditional" bulk loading, the volume of information extracted and transformed in real-time is extremely reduced (commonly representing a few dozen bytes), for it consists of only one transaction per execution cycle. All this allows assuming that the extraction and transformation phase will be cleaner and more time efficient.
As a limitation of our methodology, DW contexts in which additive attributes are difficult or even impossible to define for their fact tables may invalidate its practice. It also does not consider the optimization of materialized views.
6. CONCLUSIONS AND FUTURE WORK
This paper addresses the necessary requirements for useful-time data warehousing, which imply the capability to integrate data into the data warehouse in a continuous fashion. We demonstrate which are the best DW loading methods for each context and present a methodology for achieving useful-time data warehousing by enabling continuous data integration, while minimizing the impact on query execution at the user end of the DW. This is achieved by data structure replication and by adapting query instructions in order to take advantage of the new schemas, while executing the best previously determined continuous data integration methods. We have shown its functionality by resorting to a simulation using the TPC-H benchmark, performing continuous data integration against the execution of query workloads, for each data warehouse loading method used. All scenarios show that it is possible to achieve useful-time data warehousing (UTDW) in exchange for an average increase in query execution time. This should be considered the price to pay for real-time capability within the DW.
As future work we intend to develop an ETL tool which will integrate this methodology with extraction and transformation routines for the OLTP systems. There is also room for optimizing the query instructions used in our methods, as well as for including the problems referring to the updating of materialized views.

7. REFERENCES
[1] D. J. Abadi, D. Carney, et al.: "Aurora: A New Model and Architecture for Data Stream Management", The VLDB Journal, 12(2), pp. 120-139, 2003.
[2] S. Babu, and J. Widom: "Continuous Queries Over Data Streams", SIGMOD Record, 30(3), pp. 109-120.
[3] T. Binder: Gong User Manual, Tecco Software AG, 2003.
[4] M. Bouzeghoub, F. Fabret, and M. Matulovic: "Modeling Data Warehouse Refreshment Process as a Workflow Application", Int. Workshop on Design and Management of Data Warehouses (DMDW), 1999.
[5] R. M. Bruckner, B. List, and J. Schiefer: "Striving Towards Near Real-Time Data Integration for Data Warehouses", Int. Conf. on Data Warehousing and Knowledge Discovery (DAWAK), 2002.
[6] R. M. Bruckner, and A. M. Tjoa: "Capturing Delays and Valid Times in Data Warehouses – Towards Timely Consistent Analyses", Journal of Intelligent Information Systems (JIIS), 19:2, pp. 169-190, 2002.
[7] S. Chaudhuri, and U. Dayal: "An Overview of Data Warehousing and OLAP Technology", SIGMOD Record, 26(1), pp. 65-74, 1997.
[8] W. H. Inmon, R. H. Terdeman, J. Norris-Montanari, and D. Meers: Data Warehousing for E-Business, J. Wiley & Sons, 2001.
[9] C. Italiano, and J. E. Ferreira: "Synchronization Options for Data Warehouse Designs", IEEE Computer Magazine, 2006.
[10] Karakasidis, P. Vassiliadis, and E. Pitoura: "ETL Queues for Active Data Warehousing", IQIS'05, 2005.
[11] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite: The Data Warehouse Lifecycle Toolkit – Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley Computer Publishing, 1998.
[12] R. Kimball, and J. Caserta: The Data Warehouse ETL Toolkit, Wiley Computer Publishing, 2004.
[13] E. Kuhn: "The Zero-Delay Data Warehouse: Mobilizing Heterogeneous Databases", Int. Conf. on Very Large Data Bases (VLDB), 2003.
[14] W. Labio, J. Yang, Y. Cui, H. Garcia-Molina, and J. Widom: "Performance Issues in Incremental Warehouse Maintenance", Int. Conf. on Very Large Data Bases (VLDB), 2000.
[15] D. Lomet, and J. Gehrke: Special Issue on Data Stream Processing, IEEE Data Engineering Bulletin, 26(1), 2003.
[16] Oracle Corporation, 2005. www.oracle.com
[17] T. B. Pedersen: "How is BI Used in Industry?", Int. Conf. on Data Warehousing and Knowledge Discovery (DAWAK), 2004.
[18] J. F. Roddick, and M. Schrefl: "Towards an Accommodation of Delay in Temporal Active Databases", 11th Australasian Database Conference (ADC), 2000.
[19] Simitsis, P. Vassiliadis, and T. Sellis: "Optimizing ETL Processes in Data Warehouses", Int. Conf. on Data Engineering (ICDE), 2005.
[20] U. Srivastava, and J. Widom: "Flexible Time Management in Data Stream Systems", Int. Conf. on Principles of Database Systems (PODS), 2004.
[21] D. Theodoratus, and M. Bouzeghoub: "Data Currency Quality Factors in Data Warehouse Design", Int. Workshop on Design and Management of Data Warehouses (DMDW), 1999.
[22] TPC-H decision support benchmark, Transaction Processing Council, www.tpc.com.
[23] P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis: "ARKTOS: Towards the Modelling, Design, Control and Execution of ETL Processes", Information Systems, 26(8), 2001.
[24] White: "Intelligent Business Strategies: Real-Time Data Warehousing Heats Up", DM Review, www.dmreview.com/article_sub_cfm?articleId=5570, 2002.
[25] J. Yang: "Temporal Data Warehousing", Ph.D. Thesis, Dept. of Computer Science, Stanford University, 2001.
[26] J. Yang, and J. Widom: "Incremental Computation and Maintenance of Temporal Aggregates", 17th Int. Conf. on Data Engineering (ICDE), 2001.
[27] J. Yang, and J. Widom: "Temporal View Self-Maintenance", 7th Int. Conf. on Extending Database Technology (EDBT), 2001.
[28] T. Zurek, and K. Kreplin: "SAP Business Information Warehouse – From Data Warehousing to an E-Business Platform", 17th Int. Conf. on Data Engineering (ICDE), 2001.
[29] D. Burleson: Oracle Data Load (Import, SQL*Loader) Speed Tips, Burleson Consulting, http://www.dba-oracle.com/oracle_tips_load_speed.htm, 2006.
[30] Oracle Corporation: http://www.oracle.com/technology/products/database/utilities/index.html, 2008.
[31] N. Jain, S. Mishra, A. Srinivasan, J. Gehrke, J. Widom, H. Balakrishnan, U. Çetintemel, M. Cherniack, R. Tibbetts, and S. B. Zdonik: "Towards a Streaming SQL Standard", Int. Conf. on Very Large Data Bases (VLDB), 1(2): 1379-1390, 2008.
[32] N. Polyzotis, S. Skiadopoulos, P. Vassiliadis, A. Simitsis, and N. Frantzell: "Meshing Streaming Updates with Persistent Data in an Active Data Warehouse", IEEE Transactions on Knowledge and Data Engineering, 20(7): 976-991, 2008.
[33] H. Agrawal, G. Chafle, S. Goyal, S. Mittal, and S. Mukherjea: "An Enhanced Extract-Transform-Load System for Migrating Data in Telecom Billing", Int. Conf. on Data Engineering (ICDE), pp. 1277-1286, 2008.
[34] A. Simitsis, and P. Vassiliadis: "A Method for the Mapping of Conceptual Designs to Logical Blueprints for ETL Processes", Decision Support Systems, (45), pp. 22-40, 2008.
[35] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, and S. Skiadopoulos: "A Generic and Customizable Framework for the Design of ETL Scenarios", Information Systems, 30(7): 492-525, 2005.
[36] IBM, IBM Data Warehouse Manager, available at http://www-3.ibm.com/software/data/db2/datawarehouse/.
[37] Informatica, PowerCenter, available at http://www.informatica.com/products/data+integration/powercenter/default.htm.
[38] Microsoft, Data Transformation Services, available at http://www.microsoft.com.
[39] Oracle Corporation, "Oracle Warehouse Builder product page", available at http://otn.oracle.com/products/warehouse/content.html.
Bouzeghoub, F. Fabret, and M. Matulovic: “Modeling Data Burleson Consulting, http://www.dba- Warehouse Refreshment Process as a Workflow Application”, Int. oracle.com/oracle_tips_load_speed.htm, 2006. Workshop on Design and Management of DW (DMDW), 1999. [30] Oracle Corporation: [5] R. M. Bruckner, B. List, and J. Schiefer: “Striving Towards Near http://www.oracle.com/technology/products/database/utilities/index. Real-Time Data Integration for Data Warehouses”, Int. Conf. Data html, 2008. Warehousing and Knowledge Discovery (DAWAK), 2002. [31] N. Jain, S. Mishra, A. Srinivasan, J. Gehrke, J. Widom, H. [6] R. M. Bruckner, and A. M. Tjoa: “Capturing Delays and Valid Balakrishnan, U. Çetintemel, M. Cherniack, R. Tibbetts, S. B. Times in Data Warehouses – Towards Timely Consistent Analyses”. Zdonik: “Towards a Streaming SQL Standard”, Int. Conf. Very Journal of Intelligent Inf. Systems (JIIS), 19:2, pp. 169-190, 2002. Large Data Bases (VLDB), 1(2): 1379-1390, 2008. [7] S. Chaudhuri, and U. Dayal: “An Overview of Data Warehousing [32] N. Polyzotis, S. Skiadopoulos, P. Vassiliadis, A. Simitsis, N. and OLAP Technology”, SIGMOD Record, Volume 26, Number 1, Frantzell: “Meshing Streaming Updates with Persistent Data in an pp. 65-74, 1997. Active Data Warehouse”, IEEE Transactions on Knowl. Data Eng, [8] W. H. Inmon, R. H. Terdeman, J. Norris-Montanari, and D. Meers: 20(7): 976-991, 2008. Data Warehousing for E-Business, J. Wiley & Sons, 2001. [33] H. Agrawal, G. Chafle, S. Goyal, S. Mittal, S. Mukherjea: “An [9] C. Italiano, and J. E. Ferreira: “Synchronization Options for Data Enhanced Extract-Transform-Load System for Migrating Data in Warehouse Designs”, IEEE Computer Magazine, 2006. Telecom Billing”, Int. Conference on Data Engineering (ICDE), pp. [10] Karakasidis, P. Vassiliadis, and E. Pitoura: “ETL Queues for Active 1277-1286, 2008. Data Warehousing”, IQIS’05, 2005. [34] A. Simitsis, P. Vassiliadis: “A method for the mapping of [11] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite: The Data conceptual designs to logical blueprints for ETL processes”, Warehouse Lifecycle Toolkit – Expert Methods for Designing, Decision Support Systems (45) 22–40, 2008. Developing and Deploying Data Warehouses, Wiley Computer Pub, [35] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, S. 1998. Skiadopoulos: “A generic and customizable framework for the [12] R. Kimball, and J. Caserta: The Data Warehouse ETL Toolkit, Wiley design of ETL scenarios”, Inform. Systems, 30(7): 492-525, 2005. Computer Publishing, 2004. [36] IBM, IBM Data warehouse manager, available at http://www- [13] E. Kuhn: “The Zero-Delay Data Warehouse: Mobilizing 3.ibm.com/software/data/db2/datawarehouse/. Heterogeneous Databases”, Int. Conf. on Very Large Data Bases [37] Informatica, Power Center, available at (VLDB), 2003. http://www.informatica.com/products/data+integration/powercenter/ [14] W. Labio, J. Yang, Y. Cui, H. Garcia-Molina, and J. Widom: default.htm. “Performance Issues in Incremental Warehouse Maintenance”, Int. [38] Microsoft, Data transformation services, available at Conf. on Very Large Data Bases (VLDB), 2000. http://www.microsoft.com. [15] D. Lomet, and J. Gehrke: Special Issue on Data Stream Processing, [39] Oracle Corporation, “Oracle warehouse builder product page”, IEEE Data Eng. Bulletin, 26(1), 2003. available at http://otn.oracle.com/products/warehouse/content.html 299