I've Seen "Enough": Incrementally Improving Visualizations to Support Rapid Decision Making
1 . I’ve Seen “Enough”: Incrementally Improving Visualizations to Support Rapid Decision Making Sajjadur Rahman1 Maryam Aliakbarpour2 Ha Kyung Kong1 Eric Blais3 Karrie Karahalios1,4 Aditya Parameswaran1 Ronitt Rubinfield2 1 University of Illinois (UIUC) 4 Adobe Research MIT 2 3 University of Waterloo {srahman7,hkong6,kkarahal,adityagp}@illinois.edu {maryama@,ronitt@csail}.mit.edu eblais@uwaterloo.ca ABSTRACT gorithm might provide. If a user sees the visualization at any of Data visualization is an effective mechanism for identifying trends, the intermediate time points, they may make incorrect decisions. insights, and anomalies in data. On large datasets, however, gen- For example, at time t1 , the user may reach an incorrect conclu- erating visualizations can take a long time, delaying the extraction sion that the values at the start and the end are lower than most of of insights, hampering decision making, and reducing exploration the trend, while in fact, the opposite is true—this anomaly is due time. One solution is to use online sampling-based schemes to to the skewed samples that were drawn to reach t1 . The visual- generate visualizations faster while improving the displayed esti- ization continues to vary at t2 , t4 , and t7 , with values fluctuating mates incrementally, eventually converging to the exact visualiza- randomly based on the samples that were drawn. Indeed, a user tion computed on the entire data. However, the intermediate vi- may end up having to wait until the values stabilize, and even then sualizations are approximate, and often fluctuate drastically, lead- may not be able to fully trust the results. One approach to amelio- ing to potentially incorrect decisions. We propose sampling-based rate this issue would be to use confidence intervals to guide users incremental visualization algorithms that reveal the “salient” fea- in deciding when to draw conclusions—however, prior work has tures of the visualization quickly—with a 46× speedup relative to demonstrated that users are not able to interpret confidence inter- baselines—while minimizing error, thus enabling rapid and error- vals correctly [15]. At the same time, the users are subject to the free decision making. We demonstrate that these algorithms are same vagaries of the samples that were drawn. optimal in terms of sample complexity, in that given the level of in- teractivity, they generate approximations that take as few samples as possible. We have developed the algorithms in the context of an incremental visualization tool, titled I NC V ISAGE, for trendline and heatmap visualizations. We evaluate the usability of I NC V IS - AGE via user studies and demonstrate that users are able to make effective decisions with incrementally improving visualizations, es- pecially compared to vanilla online-sampling based schemes. 1. INTRODUCTION Visualization is emerging as the most common mechanism for exploring and extracting value from datasets by novice and expert analysts alike. This is evidenced by the huge popularity of data vi- Figure 1: I NC V ISAGE example. sualization tools such as Microsoft’s PowerBI and Tableau, both of Another approach, titled I NC V ISAGE1 , that we espouse in this which have 100s of millions of dollars in revenue just this year [4, paper and depict in the second row is the following: at each time 2]. 
And yet data visualization on increasingly large datasets, re- point ti , reveal one additional segment for a i-segment trendline, mains cumbersome: when datasets are large, generating visualiza- by splitting one of the segments for the trendline at ti−1 , when the tions can take hours, impeding interaction, preventing exploration, tool is confident enough to do so. Thus, I NC V ISAGE is very con- and delaying the extraction of insights [38]. One approach to gen- servative at t1 and just provides a mean value for the entire range, erating visualizations faster is to sample a small number of data- then at t2 , it splits the single segment into two segments, indicating points from the dataset online; by using sampling, we can view that the trend increases towards the end. Overall, by t7 , the tool has visualizations that incrementally improve over time and eventually indicated many of the important features of the trend: it starts off converge to the visualization computed on the entire data. How- high, has a bump in the middle, and then increases towards the end. ever, such intermediate visualizations are approximate, and often This approach reveals features of the eventual visualization in the fluctuate drastically, leading to incorrect insights and conclusions. order of prominence, allowing users to gain early insights and draw The key question we wish to address in this paper is the follow- conclusions early. This approach is reminiscent of interlaced pixel- ing: can we develop a sampling-based incremental visualization based image generation in browsers [40], where the image slowly algorithm that reveals the features of the eventual visualization goes from being blurry to sharp over the course of the rendering, quickly, but does so in a manner that is guaranteed to be correct? displaying the most salient features of the visualization before the Illustrative Example. We describe the goals of our sampling al- less important features. Similar ideas have also been developed in gorithms via a pictorial example. In Figure 1, we depict, in the other domains such as signal processing [50] and geo maps [13]. first row, the variation of present sampling algorithms as time pro- So far, we’ve described trendlines; the I NC V ISAGE approach can gresses and more samples are taken: at t1 , t2 , t4 , t7 , and when all be applied to heatmap visualizations as well—depicted in row 4 for of the data has been sampled. This is, for example, what visu- 1 I NC V ISAGE is a portmanteau of “Inc”, i.e., short for incremental, and “Envisage”, alizing the results of an online-aggregation-like [20] sampling al- i.e., to visualize.
2 .the corresponding online-aggregation-like approach shown in row approach can be used in tandem with online aggregation-based ap- 3—as is typical in heatmaps, the higher the value, the darker the proaches. IFOCUS [32], PFunk-H [10], and ExploreSample [53] color. Here, there is no notion of confidence intervals, so row 3 is are other approximate visualization algorithms targeted at gener- our current best approach for depicting the approximate heatmap. ating visualizations rapidly while preserving perceptual insights. Once again row 3—the status quo—fluctuates tremendously, not IFOCUS emphasizes the preservation of pairwise ordering of bars letting analysts draw meaningful insights early and confidently. On in a bar chart, as opposed to the actual values; PFunk-H uses per- the other hand, row 4—the I NC V ISAGE approach—repeatedly sub- ceptual functions from graphical perception research to terminate divides a block into four blocks when it is confident enough to do visualization generation early; ExploreSample approximates scat- so, emphasizing early, that the right hand top corner has a higher terplots, ensuring that overall distributions and outliers are pre- value, while the values right below it are somewhat lower. In fact, served. Lastly, M4 [29] uses rasterization to reduce the dimension- the intermediate visualizations may be preferable because users can ality of a time series without impacting the resulting visualization. get the big picture view without being influenced by noise. None of these methods emphasize revealing features of visualiza- Open Questions. Naturally, developing I NC V ISAGE brings a whole tions incrementally. host of open questions, that span the spectrum from usability to al- Outline. The outline of the remainder of this paper is as follows: gorithmic process. First, it is not clear at what rate we should be in Section 2 we formally define the incremental visualization prob- displaying the results of the incremental visualization algorithm. lem. Section 3 outlines our incremental visualization algorithm When can we be sure that we know enough to show the ith incre- while Section 4 details the system architecture of I NC V ISAGE. In ment, given that the (i − 1)th increment has been shown already? Section 5 we summarize the experimental results and the key take- How should the ith increment differ from the (i − 1)th increment? aways. Then we present the user study design and outcomes in How do we prioritize sampling to ensure that we get to the ith in- Section 6 (for usability) and 7 (for comparison to traditional online crement as soon as possible, but with guarantees? Can we show sampling schemes). Section 8 describes other related work on data that our algorithm is in a sense ‘optimal’, in that it aims to take as visualization and analytics. few samples as possible to display the ith increment with guaran- tees? And at the same time, how do we ensure that our algorithm is 2. PROBLEM FORMULATION lightweight enough that computation doesn’t become a bottleneck? In this section, we first describe the data model, and the visual- How do we place the control in the user’s hands in order to control ization types we focus on. We then formally define the problem. the level of interactivity needed? The other open questions involved are related to how users in- 2.1 Visualizations and Queries terpret incremental visualizations. Can users understand and make sense of the guarantees provided? Can they use these guarantees to Data and Query Setting. 
We operate on a dataset consisting of make well-informed decisions and terminate early without waiting a single large relational table R. Our approach also generalizes to for the entire visualization to be generated? multiple tables obeying a star or a snowflake schemata but we fo- cus on the single table case for ease of presentation. As in a tradi- Contributions. In this paper, we address all of these open ques- tional OLAP setting, we assume that there are dimension attributes tions. Our primary contribution in the paper is the notion of incre- and measure attributes—dimension attributes are typically used as mentally improving visualizations that surface important features group-by attributes in aggregate queries, while measure attributes as they are determined with high confidence — bringing a con- are the ones typically being aggregated. For example, in a prod- cept that is commonly used in other settings, e.g., rasterization and uct sales scenario, day of the year would be a typical dimension signal processing, to visualizations. Given a user specified inter- attribute, while the sales would be a typical measure attribute. activity threshold (described later), we develop incremental visual- I NC V ISAGE supports two kinds of visualizations: trendlines and izations algorithms for I NC V ISAGE that minimizes error. We intro- heatmaps. These two types of visualizations can be translated to duce the concept of improvement potential to help us pick the right aggregate queries QT and QH respectively: improvement per increment. We find, somewhat surprisingly, that a remarkably simple algorithm works best under a sub-Gaussian QT = SELECT Xa , AVG(Y) FROM R assumption [45] about the data, which is reasonable to assume in GROUP BY Xa ORDER BY Xa , and real-world datasets (as we show in this paper). We further demon- QH = SELECT Xa , Xb , AVG(Y) FROM R strate that these algorithms are optimal in that they generate approx- GROUP BY Xa , Xb ORDER BY Xa , Xb imations within some error bound given the interactivity threshold. When users don’t provide their desired interactivity threshold, we Here, Xa and Xb are dimension attributes in R, while Y is a mea- can pick appropriate parameters that help them best navigate the sure attribute being aggregated. The trendline and heatmap visu- tradeoff between error and latency. We additionally perform user alizations are depicted in Figure 1. For trendlines (query QT ), studies to evaluate the usability of an incremental visualization in- the attribute Xa is depicted along the x-axis while the aggregate terface, and evaluate whether users are able to make effective deci- AVG(Y) is depicted along the y-axis. On the other hand, for sions with incrementally improving visualizations. We found that heatmaps (query QH ) the two attributes Xa and Xb are depicted they are able to effectively determine when to stop the visualization along the x-axis and y-axis, respectively. The aggregate AVG(Y) and make a decision, trading off latency and error, especially when is depicted by the color intensity for each block (i.e., each Xa , Xb compared to traditional online sampling schemes. combination) of the heatmap. Give a continuous color scale, the higher the value of AVG(Y), the higher the color intensity. The Prior Work. Our work is complementary to other work in the complete set of queries (including WHERE clauses and other ag- incremental visualization space. 
In particular, sampleAction [17] gregates) that are supported by I NC V ISAGE can be found in Sec- and online aggregation [20] both perform online sampling to depict tion 3.5—we focus on the simple setting for now. aggregate values, along with confidence-interval style estimates to Note that we are implicitly focusing on Xa and Xb that are ordi- depict the uncertainty in the current aggregates. Online aggrega- nal, i.e., have an order associated with them so that it makes sense tion presents these values as raw values, while sampleAction dis- to visualize them as a trendline or a heatmap. As we will demon- plays the corresponding bar chart. In both cases, however, these ap- strate subsequently, this order is crucial in letting us approximate proaches prevent users from getting early insights since they need portions of the visualization that share similar behavior. While our to wait for the values to stabilize. As we will discuss later, our techniques could also be applied to Xa , Xb that are not ordinal by
3 .enforcing some order, e.g., a lexicographic order, the resulting vi- one of the segments S in the i-segment approximation into two sualizations are not as meaningful. new segments to give an (i + 1)-segment approximation: we call Sampling Engine. We assume that we have a sampling engine this process splitting a segment. The group within S ∈ Li im- that can efficiently retrieve random tuples from R corresponding to mediately following which the split occurs is referred to as a split different values of the dimension attribute(s) Xa and/or Xb (along group. Any group in the interval I ∈ S except for the last group with optional predicates from a WHERE). Focusing on QT for can be chosen as a split group. As an example, in Figure 1, the now, given a certain value of Xa = ai , our engine provides us a entire second row corresponds to an incrementally improving visu- random tuple that satisfies Xa = ai . Then, by looking up the value alization, where the 2-segment approximation is generated by tak- of Y corresponding to this tuple, we can get an estimate for the av- ing the segment in the 1-segment approximation corresponding to erage of Y for Xa = ai . Our sampling engine is drawn from Kim ([1, 36], 0.5), and splitting it at group 30 to give ([1, 30], 0.4) and et al. [32] and maintains an in-memory bitmap on the dimension ([31, 36], 0.8). Therefore, the split group is 30. The reason why attributes, allowing us to quickly identify tuples matching arbitrary we enforce two neighboring segment approximations to be related conditions [33]. Bitmaps are highly compressible and effective for in this way is to ensure that there is continuity in the way the visu- read-only workloads [34, 52, 51], and have been applied recently alization is generated, making it a smooth user experience. If, on to sampling for approximate generation of bar charts [32]. the other hand, each subsequent segment approximation had no re- Thus, given our sampling engine, we can retrieve samples of Y lationship to the previous one, it could be a very jarring experience given Xa = ai (or Xa = ai ∧ Xb = bi for heat maps). We call the for the user with the visualizations varying drastically, making it multiset of values of Y corresponding to Xa = ai across all tuples hard for them to be confident in their decision making. We show to be a group. This allows us to say that we are sampling from a in Section 5 that removing this restriction results in visualizations group, where implicitly we mean we are sampling the correspond- that are not significantly better in terms of error, but are much more ing tuples and retrieving the Y value. costly to compute. Next, we describe our problem of incrementally generating visual- Underlying Data Model and Output Model. To characterize izations more formally. We focus on trendlines (i.e., QT ); the cor- the performance of an incrementally improving visualization, we responding definitions and techniques for heatmaps are described need a model for the underlying data. We represent the underly- in Appendix A. ing data as a sequence of m distributions D1 , . . . , Dm with means µ1 , . . . , µm where, µi = AVG(Y) for xi ∈ Xa . To generate our 2.2 Incremental Visualizations incrementally improving visualization and its constituent segment approximations, we draw samples from distributions D1 , . . . , Dm . From this point forward, we describe the concepts in the context Our goal is to approximate the mean values (µ1 , . . . , µm ) while of our visualizations in row 2 of Figure 1. 
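To make the sampling setup concrete, the following is a minimal sketch of the group-sampling abstraction described above: a toy stand-in for the bitmap-backed engine that returns random Y values for a requested group, and a helper that estimates AVG(Y) per group. The class and function names and the synthetic data are ours for illustration; this is not the IncVisage implementation.

import random
from collections import defaultdict

class ToyGroupSampler:
    """Toy stand-in for the bitmap-backed sampling engine: returns one uniformly
    random Y value from the tuples whose dimension attribute equals `group`."""
    def __init__(self, rows):
        # rows: iterable of (x_a, y) pairs; index them by group up front.
        self.by_group = defaultdict(list)
        for x, y in rows:
            self.by_group[x].append(y)

    def sample(self, group):
        return random.choice(self.by_group[group])

def estimate_group_means(sampler, groups, samples_per_group):
    """Estimate AVG(Y) for each group by drawing a fixed number of samples."""
    means = {}
    for g in groups:
        draws = [sampler.sample(g) for _ in range(samples_per_group)]
        means[g] = sum(draws) / len(draws)
    return means

# Example: m = 36 groups (as in Figure 1) with synthetic Y values.
rows = [(g, random.gauss(0.5 + 0.01 * g, 0.1)) for g in range(1, 37) for _ in range(1000)]
sampler = ToyGroupSampler(rows)
estimates = estimate_group_means(sampler, range(1, 37), samples_per_group=50)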
taking as few samples as possible. Segments and k-Segment Approximations. We denote the cardi- The output of a k-segment approximation Lk can be represented nality of our group-by dimension attribute Xa as m, i.e., |Xa | = alternately as a sequence of values (ν1 , . . . , νm ) such that νi is m. In Figure 1 this value is 36. At all time points over the course equal to the value corresponding to the segment that encompasses of visualization generation, we display one value of AVG(Y) cor- xi , i.e., ∀xi ∈Sj νi = ηj , where Sj (I, ηj ) ∈ Lk . By comparing responding to each group xi ∈ Xa , i ∈ 1 . . . m—thus, the user (ν1 , . . . , νm ) with (µ1 , . . . , µm ), we can evaluate the error corre- is always shown a complete trendline visualization. To approxi- sponding to a k-segment approximation, as we describe later. mate our trendlines, we use the notion of segments that encompass Incrementally Improving Visualization Generation Algorithm. multiple groups. We define a segment as follows: Given our data model, an incrementally improving visualization Definition 1. A segment S corresponds to a pair (I, η), where η is generation algorithm proceeds in iterations: at the ith iteration, this a value, while I is an interval I ⊆ [1, m] spanning a consecutive algorithm takes some samples from the distributions D1 , . . . , Dm , sequence of groups xi ∈ Xa . and then at the end of the iteration, it outputs the i-segment approx- For example, the segment S ([2, 4], 0.7) corresponds to the interval imation Li . Thus, a certain number of samples are taken in each of xi corresponding to x2 , x3 , x4 , and has a value of 0.7. Then, a iteration, and one segment approximation is output at the end of it. k-segment approximation of a visualization comprises k disjoint We denote the number of samples taken in iteration i as Ni . When segments that span the entire range of xi , i = 1 . . . m. Formally: Lm is output, the algorithm terminates. Definition 2. A k-segment approximation is a tuple Lk = S1 , . . . , Sk such that the segments S1 , . . . , Sk partition the interval [1, m] into 2.3 Characterizing Performance k disjoint sub-intervals. There are multiple ways to characterize the performance of in- crementally improving visualization generation algorithms. In Figure 1, at t2 , we display a 2-segment approximation, with seg- ments ([1, 30], 0.4) and ([31, 36], 0.8); and at t7 , we display a 7- Sample Complexity, Interactivity and Wall-Clock Time. The segment approximation, comprising ([1, 3], 0.8), ([4, 14], 0.4), . . ., first and most straightforward way to evaluate performance is by and ([35, 36], 0.7). When the number of segments is unspecified, measuring the samples taken in each iteration k, Nk , i.e., the sam- we simply refer to this as a segment approximation. ple complexity. Since the time taken to acquire the samples is proportional to the number of samples in our sampling engine (as Incrementally Improving Visualizations. We are now ready to shown in [32]), this is a proxy for the time taken in each itera- describe our notion of incrementally improving visualizations. tion. Recent work has identified 500ms as a “rule of thumb” for Definition 3. An incrementally improving visualization is defined interactivity in exploratory visual data analysis [38], beyond which to be a sequence of m segment approximations, (L1 , . . . , Lm ), analysts end up getting frustrated, and as a result explore fewer hy- where the ith item Li , i > 1 in the sequence is a i-segment approx- potheses. 
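Definitions 1 and 2 translate directly into code. The sketch below (our own naming, not part of IncVisage) represents a segment as an interval plus a value and expands a k-segment approximation into the per-group output sequence (ν1, ..., νm); the 2-segment approximation shown at t2 in Figure 1 is used as the example.

from typing import List, NamedTuple

class Segment(NamedTuple):
    lo: int       # first group in the interval (1-based, inclusive)
    hi: int       # last group in the interval (inclusive)
    value: float  # the single value eta displayed for every group in the interval

def expand(approx: List[Segment], m: int) -> List[float]:
    """Expand a k-segment approximation into the output sequence (nu_1, ..., nu_m)."""
    nu = [0.0] * m
    for seg in approx:
        for i in range(seg.lo, seg.hi + 1):
            nu[i - 1] = seg.value
    return nu

# The 2-segment approximation displayed at t2 in Figure 1:
L2 = [Segment(1, 30, 0.4), Segment(31, 36, 0.8)]
nu = expand(L2, m=36)   # 0.4 for groups 1..30, 0.8 for groups 31..36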
To enforce this rule of thumb, we can ensure that our imation, formed by selecting one of the segments in the (i − 1)- algorithms take only as many samples per iteration as is feasible segment approximation Li−1 , and dividing that segment into two. within 500ms — a time budget. We also introduce a new metric Thus, the segment approximations that comprise an incremen- called interactivity that quantifies the overall user experience: m tally improving visualization have a very special relationship to k=1 Nk × (m − k + 1) λ= each other: each one is a refinement of the previous, revealing k one new feature of the visualization and is formed by dividing where Nk is the number of samples taken at iteration k and k is
4 .the number of iterations where Nk > 0. The larger the λ, the to other query classes in Section 3.5. We can derive similar algo- smaller the interactivity: this measure essentially captures the av- rithms and guarantees for heatmaps in Appendix A. erage waiting time across iterations where samples are taken. We explore the measure in detail in Section 3.4 and 5.5. A more direct 3.1 Case 1: The Ideal Scenario way to evaluate performance is to measure the wall-clock time for We first consider the ideal case where the means µ1 , . . . , µm of each iteration. the distributions D1 , . . . , Dm are known. Then, our goal reduces to Error Per Iteration. Since our incrementally improving visual- obtaining the best k segment approximation of the distributions at ization algorithms trade off taking fewer samples to return results iteration k, while respecting the refinement restriction. Say the in- faster, it can also end up returning segment approximations that are crementally improving visualization generation algorithm has ob- incorrect. We define the 2 squared error of a k-segment approxi- tained a k-segment approximation Lk at the end of iteration k. mation Lk with output sequence (ν1 , . . . , νm ) for the distributions Then, at iteration (k + 1), the task is to identify a segment Si ∈ Lk D1 , . . . , Dm with means µ1 , . . . , µm as such that splitting Si into two new segments T and U minimizes the 1 m error of the corresponding (k + 1)-segment approximation Lk+1 . err(Lk ) = (µi − νi )2 (1) We describe the approach, followed by its justification. m i=1 Approach. At each iteration, we split the segment Si ∈ Lk into We note that there are several reasons a given k-segment ap- T and U that maximizes the quantity |T |·|U | |Si |·m (µT − µU )2 . Intu- proximation may be erroneous with respect to the underlying mean itively, this quantity—defined below as the improvement potential values (µ1 , . . . , µm ): (1) We are representing the data using k- of a refinement—picks segments that are large, and within them, segments as opposed to m-segments. (2) We use incremental re- splits where we get roughly equal sized T and U , with large differ- finement: each segment approximation builds on the previous, pos- ences between µT and µU . sibly erroneous estimates. (3) The estimates for each group and each segment may be biased due to sampling. Justification. The 2 squared error of a segment Si (Ii , ηi ), where These types of error are all unavoidable — the first two reasons Ii = [p, q] and 1 ≤ p ≤ q ≤ m, for the distributions Dp , . . . , Dq enable a visualization that incrementally improves over time, while with means µp , . . . , µq is q the last one occurs whenever we perform sampling: (1) While a 1 1 err(Si ) = (µj − ηi )2 = (µj − ηi )2 k-segment approximation does not capture the data exactly, it pro- (q − p + 1) |Si | j=p j∈Si vides a good approximation preserving visual features, such as the overall major trends and the regions with a large value. Moreover, Here, |Si | is the number of groups (distributions) encompassed computing an accurate k-segment approximation requires fewer by segment Si . When the means of the distributions are known, samples and therefore faster than an accurate m-segment approx- err(Si ) will be minimized if we represent the value of segment Si imation. 
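Equation 1 and the interactivity measure λ are simple enough to state in code. The sketch below is self-contained and uses our own naming; the segment approximation is passed as its expanded per-group sequence (ν1, ..., νm), and the ground-truth means are hypothetical.

def err(nu, true_means):
    """Average squared l2 error of a segment approximation (Equation 1);
    nu[i-1] is the displayed value for group i, true_means[i-1] its true mean."""
    m = len(true_means)
    return sum((mu - v) ** 2 for mu, v in zip(true_means, nu)) / m

def interactivity(samples_per_iteration):
    """lambda = sum_k N_k * (m - k + 1) / k', where k' counts iterations with N_k > 0."""
    m = len(samples_per_iteration)
    k_prime = sum(1 for n in samples_per_iteration if n > 0)
    weighted = sum(n * (m - k) for k, n in enumerate(samples_per_iteration))  # k is 0-based, so m - k equals m - (k+1) + 1
    return weighted / k_prime

true_means = [0.5] * 30 + [0.8] * 6   # hypothetical ground truth for m = 36 groups
nu = [0.4] * 30 + [0.8] * 6           # the 2-segment approximation from Figure 1
print(err(nu, true_means))            # 30 * 0.1^2 / 36, roughly 0.0083
print(interactivity([100] * 36))      # constant sampling: N1 * (m + 1) / 2 = 1850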
(2) Incremental refinement allows users to have a fluid as the mean of the groups encompassed by Si , i.e., setting ηi = experience, without the visualization changing completely between µSi = j∈Si µj /|Si | minimizes err(Si ). Therefore, in the ideal neighboring approximations. At the same time, is not much worse scenario, the error of the segment Si is in error than visualizations that change completely between ap- 1 1 proximations, as we will see later. (3) And lastly, sampling is nec- err(Si ) = (µj − ηi )2 = µ2j − µ2Si (2) |Si | j∈Si |Si | j∈Si essary for us to return visualizations faster, but perhaps at the cost of erroneous estimates. Then, using Equation 1, we can express the 2 squared error of the k-segment approximation Lk as follows: 2.4 Problem Statement m k 1 |Si | The goal of our incrementally improving visualization genera- err(Lk ) = (µi − νi )2 = err(Si ) m m tion algorithm is, at each iteration k, generate a k-segment approx- i=1 i=1 imation that is not just “close” to the best possible refinement at that Now, Lk+1 is obtained by splitting a segment Si ∈ Lk into two point, but also takes the least number of samples to generate such segments T and U . Then, the error of Lk+1 is: an approximation. Further, since the decisions made by our algo- |Si | |T | |U | rithm can vary depending on the samples retrieved, we allow the err(Lk+1 ) = err(Lk ) − err(Si ) + err(T ) + err(U ) m m m user to specify a failure probability δ, which we expect to be close |Si | 2 |T | 2 |U | 2 to zero, such that the algorithm satisfies the “closeness” guarantee = err(Lk ) + µ − µ − µ m Si m T m U with probability at least 1 − δ. |T | · |U | Problem 1. Given a query QT , and the parameters δ, , design an = err(Lk ) − (µT − µU )2 . |Si | · m incrementally improving visualization generation algorithm that, at each iteration k, returns a k-segment approximation while taking where the second equality is due to Equation 5 and the fact that as few samples as possible, such that with probability 1 − δ, the er- the j µ2j is fixed no matter the segment approximation that is ror of the k-segment approximation Lk returned at iteration k does used, while last equality comes from the fact that |T | + |U | = |Si | not exceed the error of the best k-segment approximation formed and µSi = (|T |µT + |U |µU )/|Si |. We use the above expression by splitting a segment of Lk−1 by no more than . to define the notion of improvement potential. The improvement potential of a segment Si ∈ Lk is the minimization of the error of Lk+1 obtained by splitting Si into T and U . Thus, the improvement 3. VISUALIZATION ALGORITHMS potential of segment Si relative to T and U is In this section, we gradually build up our solution to Problem 1. |T | · |U | We start with the ideal case when we know the means of all of ∆(Si , T, U ) = (µT − µU )2 (3) the distributions up-front, and then work towards deriving an error |Si | · m guarantee for a single iteration of an incrementally improving visu- For any segment Si = (Ii , ηi ), every group in the interval Ii except alization algorithm. Then, we propose our incrementally improving the last one can be chosen to be the split group (see Section 2.2). visualization generation algorithm ISplit assuming the same guar- Therefore, there are |Si | − 1 possible ways to choose T, U ⊆ Si antee across all iterations. We further discuss how we can tune the such that T ∪ U = S. The split group maximizing the improve- guarantee across iterations in Section 3.4. 
We consider extensions ment potential of Si , minimizes the error of Lk+1 . The maximum
5 .improvement potential of a segment is expressed as follows: error of the best refinement L∗k+1 of Lk by at most err (L†k+1 ) − |T | · |U | err (L∗k+1 ) ≤ . ∆ (Si ) = max ∆(Si , T, U ) = max (µT − µU )2 T,U ⊆Si T,U ⊆Si |Si | · m Proof. The estimated improvement potential of the refinement Lk+1 Lastly, we denote the improvement potential of a given Lk+1 by satisfies φ(Lk+1 , Si , T, U ), where φ(Lk+1 , Si , T, U ) = ∆(Si , T, U ). There- fore, the maximum improvement potential of Lk+1 , φ (Lk+1 ) = |φ(Lk+1 ) − φ(Lk+1 )| maxSi ⊆Lk ∆ (Si ). When the means of the distributions are known, |S| 2 |T | 2 |U | 2 at iteration (k + 1), the optimal algorithm simply selects the refine- ≤ (µS − µS2 ) + (µT − µT2 ) + (µU − µU2 ) m m m ment corresponding to φ (Lk+1 ), which is the segment approxi- mation with the maximum improvement potential. ≤ . 2 3.2 Case 2: The Online-Sampling Scenario Together this inequality, the identity err(Lk+1 ) = err(Lk )−φ(Lk+1 ), We now consider the case where the means µ1 , . . . , µm are un- and the inequality φ(Lk+1 ) ≤ φ(L†k+1 ) imply that known and we estimate each mean by drawing samples. Similar err (L†k+1 ) − err (L∗k+1 ) to the previous case, we want to identify a segment Si ∈ Lk such that splitting Si into T and U results in the maximum improvement = φ(L∗k+1 ) − φ(L†k+1 ) potential. We will first describe our approach for a given iteration = φ(L∗k+1 ) − φ(L∗k+1 ) + φ(L∗k+1 ) − φ(L†k+1 ) assuming samples have been taken, then we will describe our ap- proach for selecting samples, following which, we will establish a ≤ φ(L∗k+1 ) − φ(L∗k+1 ) + φ(L†k+1 ) − φ(L†k+1 ) lower-bound. ≤ . 3.2.1 Selecting the Refinement Given Samples We first describe our approach, and then the justification. 3.2.2 Determining the Sample Complexity Approach. As in the previous setting, our goal is to identify the To achieve the guarantee for Theorem 1, we need to retrieve a refinement with the maximum improvement potential. Unfortu- certain number of samples from each of the distributions D1 , . . . , Dm . nately, since the means are unknown, we cannot measure the ex- We now describe our approach for drawing samples. act improvement potential, so we minimize the empirical improve- Approach. Perhaps somewhat surprisingly, we find that we need 2 ment potential based on samples seen so far. Analogous to the to draw a constant C = 2882amσ ln 4m δ from each distribu- previous section, we simply pick the refinement that maximizes tion Di to satisfy the requirements of Theorem 1. (We will de- |T |·|U | |Si |·m (µT − µU )2 , where the µs are the empirical estimates of the fine these parameters subsequently.) Thus, our sampling approach means. is remarkably simple—and is essentially uniform sampling—plus, Justification. At iteration (k + 1), we first draw samples from the as we show in the next subsection, other approaches cannot pro- ˜1 , . . . , µ distributions D1 , . . . , Dm to obtain the estimated means µ ˜m . vide significantly better guarantees. What is not simple, however, For each Si ∈ Lk , we set its value to ηi = µSi = j∈Si µ ˜j /|Si |, is showing that taking C samples allows us to satisfy the require- which we call the estimated mean of Si . For any refinement Lk+1 ments for Theorem 1. 
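In both the ideal and the online-sampling case, the refinement step boils down to scanning every admissible split group of every segment and keeping the split that maximizes |T|·|U| / (|Si|·m) · (µT − µU)². The sketch below performs that scan for a single segment; the helper name is ours, and the means are assumed to be the empirical estimates produced as described above.

def best_split(group_means, lo, hi, m):
    """Return (improvement potential, split group) for the best split of the
    segment S = [lo, hi] (1-based, inclusive). group_means[i-1] is the
    (estimated) mean of group i; any group except the last can be the split group."""
    size = hi - lo + 1
    best = None
    for split in range(lo, hi):
        left = group_means[lo - 1:split]     # segment T: groups lo .. split
        right = group_means[split:hi]        # segment U: groups split+1 .. hi
        mu_t = sum(left) / len(left)
        mu_u = sum(right) / len(right)
        phi = len(left) * len(right) / (size * m) * (mu_t - mu_u) ** 2
        if best is None or phi > best[0]:
            best = (phi, split)
    return best

# Hypothetical estimated means for m = 8 groups; splitting the single segment [1, 8]:
means = [0.5, 0.5, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
print(best_split(means, 1, 8, m=8))   # -> (0.04, 4): split right after group 4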
Another benefit of uniform sampling is that of Lk , we then let the estimated improvement potential of Lk+1 be we can switch between showing our segment approximations or |T | 2 |U | 2 |S| 2 |T | · |U | 2 showing the actual running estimates of each of the groups (as in φ(Lk+1 , Si , T, U ) = µ + µ − µ = (µT − µU ) online aggregation [20])—for the latter purpose, uniform sampling m T m U m Si |Si | · m is trivially optimal. For simplicity we denote φ(Lk+1 , Si , T, U ) and φ(Lk+1 , Si , T, U ) Justification. To justify our choice, we assume that the data is gen- as φ(Lk+1 ) and φ(Lk+1 ), respectively. erated from a sub-Gaussian distribution [45]. Sub-Gaussian distri- Our goal is to get a guarantee for err(Lk+1 ) is relative to err(Lk ). butions form a general class of distributions with a strong tail decay Instead of a guarantee on the actual error err, for which we would property, with its tails decaying at least as rapidly as the tails of a need to know the means of the distributions, our guarantee is in- Gaussian distribution. This class encompasses Gaussian distribu- stead on a new quantity, err , which we define to be the empirical tions as well as distributions where extreme outliers are rare—this error. Given Si = (Ii , ηi ) and Equation 5, err is defined as fol- is certainly true when the values derive from real observations that lows: err (Si ) = |S1i | j∈Si µ2j − ηi2 . Although err (Si ) is not are bounded. We validate this in our experiments. In particular, any equal to err(Si ) when ηi = µS , err (Si ) converges to err(Si ) Gaussian distribution with variance σ 2 is sub-Gaussian with pa- as ηi gets closer to µS (i.e., as more samples are taken). Simi- rameter σ 2 . Therefore, we represent the distributions D1 , . . . , Dm larly, the error of k-segment approximation err (Lk ) converges to by m sub-Gaussian distributions with mean µi and sub-Gaussian err(Lk ). We show experimentally (see Section 5) that optimizing parameter σ. for err (Lk+1 ) gives us a good solution of err(Lk+1 ) itself. Given this generative assumption, we can determine the number To derive our guarantee on err , we need one more piece of ter- of samples required to obtain an estimate with a desired accuracy minology. At iteration (k+1), we define T (I, η), where I = [p, q] using Hoeffding’s inequality [21]. Given m independent random and 1 ≤ p ≤ q ≤ m to be a boundary segment if either p or q is samples x1 , . . . , xm with sub-Gaussian parameter σi2 and mean µi a split group in Lk . In other words, at the iteration (k + 1), all the m segments in Lk and all the segments that may appear in Lk+1 after t2 Pr | (xi − µi )| > t ≤ 2 exp − m . splitting a segment are called boundary segments. Next, we show i=1 2 i=1 σi2 that if we estimate the boundary segments accurately, then we can Given Hoeffding’s inequality along with the union bound, we can find a split which is very close to the best possible split. derive an upper bound on the number of samples we need to esti- Theorem 1. If for every boundary segment T of the k-segment ap- mate the mean µi of each xi . proximation Lk , we obtain an estimate µT of the mean µT that sat- Lemma 1. For a fixed δ > 0 and a k-segment approximation of the isfies µT2 − µT2 ≤ 6|T m | , then the refinement L†k+1 of Lk that max- distributions D1 , . . . , Dm represented by m independent random imizes the estimated value φ(L†k+1 ) will have error that exceeds the samples x1 , . . . , xm with sub-Gaussian parameter σ 2 and mean
6 . 288 a σ 2 4m µi ∈ [0, a] if we draw C = 2m ln δ samples uniformly formally, we define the distribution Di over the group i as below µT2 µT2 Fixed dist. with value 1/2 − if i/t = 1 (mod 3). from each xi , then with probability at least 1 − δ, − ≤ m 6|T | for every boundary segment T of Lk . D(i) = Bernoulli dist. with mean p 2i/k if i/t = 2 (mod 3). Fixed dist. with value 1/2 + if i/t = 0 (mod 3). Proof. Fix any boundary segment T contained in the segment S ∈ where p1 , p2 , . . . , pk/2 are k/2 random variables that are picked Lk . Then, we draw samples xi,1 xi,2 , . . . , xi,C uniformly from from {1/2 − , 1/2 + } each with probability a half. It is not xi ∈ T , then hard to see that any D is a k-flat segment approximation, so the 1 C 1 error of the optimal k-segment approximation is zero. Therefore, µT − µT = xi,j − µi A outputs a segment approximation, namely Lo , which is -close C|T | i∈T j=1 |T | i∈T to the D with probability 2/3. For the segment j, we define pj 1 C (respectively pˆj ) to be the average mean of the middle t groups = (xi,j − µi ). C|T | i∈T j=1 of the j-th segment that D (respectively Lo ) assigns to them. In t other words, pj = i=1 D(3 (j − 1) t + t + i)/t (respectively t xi ’s are sub-Gaussian random variables with parameter σ 2 . There- pˆj = i=1 Lo (3 (j − 1) t + t + i)/t). With probability 2/3, we fore, have 1 m Pr |µ2T − µ2T | > m 6 |T | = Pr |µT − µT | (µT + µT ) > m 6 |T | ≥ err(D, Lo ) = (D(i) − Lo (i))2 m i=1 m ≤ Pr |µT − µT | > 12 a |T | 1 k/2 t ≥ (D (3 (j − 1) t + t + i) − Lo (3 (j − 1) t + t + i))2 m j=1 i=1 C C m = Pr | (xi,j − µi )| > 12 a k/2 k/2 2 i∈T j=1 t 2 = (pj − pˆj )2 ≥ |pj − pˆj | 2 2 m j=1 3k2 j=1 C m δ ≤ 2 exp − 288a2 |T | σ 2 ≤ 2m By the union bound, the probability that one of the 2m boundary using the Cauchy-Schwarz inequality. segments has an inaccurate estimate is at most δ. To reach a contradiction, we show that we cannot estimate many of the pj ’s accurately. Since we assumed that A draws o(k/ 2 ) samples, there are at least k/4 segments that o(1/ 2 ) samples are 3.2.3 Deriving a Lower bound drawn from. In the following lemma, we show that we cannot esti- mate pˆj ’s in these segments with high probability. We can derive a lower bound for the sample complexity of any algorithm for Problem 1: Lemma 2. Assume Algorithm B draws o(1/ 2 ) samples from a distribution over {0, 1} with an unknown bias p which is 1/2 + Theorem 2. Say we have a dataset D of m groups with means or 1/2 − . B outputs pˆ as an estimation of p. With probability at (µ1 , µ2 , . . . , µm ). Assume there exists an algorithm A that finds a least 1/3, k-segment approximation which is -close to the optimal k-segment approximation with probability 2/3. For sufficiently small , A |ˆ p − p| ≥ . √ draws Ω( k/ 2 ) samples. The theorem states that any algorithm that outputs a k-segment √ Proof. We reduce the problem of estimating p, to the problem of approximation of which is -close to a dataset has to draw O( k/ 2 ) distinguishing two distributions D1 and D2 over {0, 1} which have samples from the dataset. To show this, at a high level, we pick a average value of 1/2 + and 1/2 − respectively. The reduction dataset goes as follows: Algorithm B , runs B to find pˆ. If pˆ is greater than √ as the “hard case” and show that any algorithm that draws o( k/ 2 ) samples cannot estimate the means of the many of the 1/2, then B outputs D1 . Otherwise, it outputs D2 . It is clear that segments accurately. Therefore, the output has error more than . 
B is correct if and only if |ˆ p − p| ≥ . In Theorem 4.7, Chapter 4 of Yossef et al. [12], it has been shown that any hypothesis tester that distinguishes between D1 and D2 Proof. Assume for contradiction that A uses o(k 2 ) samples. We with probability 2/3 must use at least Ω(1/h2 (D1 , D2 )) samples build a datasets randomly on m groups, namely D, and show that where h(D1 , D2 ) is the Hellinger distance between D1 and D2 . A cannot find -close k-segment approximations for D with high The Hellinger distance between D1 and D2 over a finite set X (in probability, which contradicts our assumption. our case X = {0, 1}) is defined by Cam et al. [14] as follows: Let be equal to 217 . For sufficiently small , we can assume 2 1 is at most 1/4. Without loss of generality, assume k is even and h(D1 , D2 ) := 2 D1 (x) − D2 (x) m is a multiple of 3k/2. Otherwise, increase k and m by adding x∈X dummy groups and this does not affect our calculation by more than = 1− D1 (x)D2 (x) constant factors. We build D via the following process: First, we x∈X partition the m groups into k/2 segments. Each of the segments √ contains 2m/k groups and we define t to be a third of that i.e. = 1−2 1/4 − 2 = 1− 1−4 2 t := (2m)/(3k). In each segment, all the samples from the first t 4 2 √ groups have a fixed value 1/2 − . All the samples from the last t = √ ≥ 2 groups have a fixed value 1/2 + . To decide about the value of the 1+ 1−4 2 samples from the middle groups, we randomly choose the mean p to be either 1/2 − or 1/2 + each with probability a half. In Since B draws o(1/ 2 ) samples, it cannot output the correct an- particular, a sample from the middle t groups is a Bernoulli random swer with probability at least 2/3. Therefore, with probability at variable which is one (or zero) with probability p (or 1 − p). More least 1/3, the error of pˆ is at least .
7 . Let r be the number of segments j such that we draw o(1/ 2 ) the number of samples by α at each iteration), and all-upfront (i.e., samples from them but |ˆ pj − pj | is less than . Using the lemma non-interactive) sampling. To compare these different approaches above, E[r] is at least k/6. By the Chernoff bound, to the constant (uniform) sampling approach and amongst each other, we first compute the total sample complexity, interactivity, k 1 1 and error guarantees of each of them as a function of their other Pr r < ≤ Pr r < 1 − · E[r] ≤ e−k/48 ≤ parameters. Letting Tk denote the total number of samples drawn 12 2 6 in the first k iterations, the interactivity of a sampling approach de- where the last inequality true for sufficiently large k. Thus, with m Tk probability 5/6 more than k/12 has error of at least . This implies fined in Section 2.3 can be written as: λ = k=1 k , where k that err(D, L0 ) ≥ /216 > which contradicts our assumption. is the number of iterations we take non-zero samples. The error guarantees we consider are, as above, the average error over all iterations. This error guarantee is 3.3 The ISplit Algorithm err = m 2 k = m 288 a σ 2 ln 4m = 288 a σ 2 ln 4m m 1 m m2 Tk δ m2 δ Tk We now present our incrementally improving visualization gen- i=1 i=1 i=1 eration algorithm ISplit. Given the parameters and δ, ISplit main- We provide evidence that the estimated err and λ are similar to err tains the same guarantee of error ( ) in generating the segment ap- and λ on real datasets in Section 5.5.2. We are now ready to derive proximations in each iteration. Theorem 1 and Lemma 1 suffice the expressions for both error and interactivity for the sampling to show that the ISplit algorithm is a greedy approximator, that, at approaches mentioned earlier. each iteration identifies a segment approximation that is at least close to the best segment approximation for that iteration. 3.4.1 Expressions for Error and Interactivity Data: Xa , Y, δ, In this section we derive the analytical expressions of both in- 1 Start with the 1-segment approximator L = (L1 ). 2 for k = 2, . . . , m do teractivity and error for all of these four approaches (see Table 1). 3 Lk−1 = (S1 , . . . , Sk−1 ). In particular, for succinctness, we have left the formulae for Error 4 for each Si ∈ Lk−1 do for geometric and linear decrease as is without simplification; on 5 for each group q ∈ Sp do substituting Tk into the expression for Error, we obtain fairly com- 6 Draw C samples uniformly. Compute mean µ ˜q end plicated formulae with dozens of terms, including multiple uses of 7 Compute µSi = |S | 1 q∈Si µ˜q the q-digamma function Ψ [25]—for the geometric decrease case, i end the Euler-Mascheroni constant [47]—for the linear decrease case, 8 |T |·|U | Find (T, U ) = argmaxi; T ,U ⊆Si |S |·m (µT − µU )2 . and the Harmonic number Hm —for the constant sampling case. i 9 Update Lk+1 = S1 , . . . , Si−1 , T, U, Si+1 , . . . , Sk . Constant sampling. The constant sampling approach is the one end described above where we draw N1 samples in the first iteration Algorithm 1: ISplit and in all subsequent iterations as well. We denote this approach by UN1 . With this approach, the total number of samples drawn The parameter is often difficult for end-users to specify. 
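The pseudocode of Algorithm 1 above is mangled by the text extraction. The following is a hedged Python rendering of the same greedy loop: in every iteration it draws C fresh samples per group, re-estimates the group means, and splits the segment whose best split has the largest estimated improvement potential. The helper and parameter names are ours, and the constant C is reconstructed from Lemma 1 as it appears in the extracted text.

import math

def samples_per_group(epsilon, delta, m, a, sigma):
    """Per-group, per-iteration sample count (Lemma 1, as extracted above):
    C = 288 * a * sigma^2 / (epsilon^2 * m) * ln(4m / delta)."""
    return math.ceil(288 * a * sigma ** 2 / (epsilon ** 2 * m) * math.log(4 * m / delta))

def isplit(sample, m, C):
    """Greedy ISplit loop. `sample(i)` returns one random Y value from group i
    (1-based); C fresh samples are drawn per group in every iteration."""
    segments = [(1, m)]          # the 1-segment approximation L1
    history = []
    for _ in range(2, m + 1):
        # Re-estimate every group mean with C fresh samples.
        mu = [sum(sample(i) for _ in range(C)) / C for i in range(1, m + 1)]
        # Find the segment and split group with the largest estimated improvement potential.
        best = None
        for idx, (lo, hi) in enumerate(segments):
            for split in range(lo, hi):
                t, u = mu[lo - 1:split], mu[split:hi]
                phi = (len(t) * len(u) / ((hi - lo + 1) * m)
                       * (sum(t) / len(t) - sum(u) / len(u)) ** 2)
                if best is None or phi > best[0]:
                    best = (phi, idx, split)
        _, idx, split = best
        lo, hi = segments[idx]
        segments[idx:idx + 1] = [(lo, split), (split + 1, hi)]
        history.append([(s, e, sum(mu[s - 1:e]) / (e - s + 1)) for s, e in segments])
    return history   # history[j] is the (j + 2)-segment approximation as (lo, hi, estimated mean) triples

# Example with hypothetical parameters and data:
# C = samples_per_group(epsilon=0.05, delta=0.05, m=36, a=1.0, sigma=0.1)
# history = isplit(lambda i: random.gauss(0.5 + 0.01 * i, 0.1), m=36, C=C)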
There- (UN1 ) fore, in I NC V ISAGE, we allow users to instead explicitly specify an in the first k iterations is Tk = kN1 and the error guarantee expected time budget per iteration, B—as explained in Section 2.3, satisfies m the number of samples taken determines the time taken per itera- 288 a σ 2 4m 1 288 a σ 2 Hm 4m tion, so we can reverse engineer the number of samples from B. err(UN1 ) = ln = ln , m2 δ i=1 k N1 m2 N1 δ So, given the sampling rate f of the sampling engine (see Sec- tion 2.1) and B, the sample complexity per iteration per group is where Hm is the m-th Harmonic number. Finally, the interactivity simply C = B × f /m. Using Lemma 1, we can compute the cor- of the constant sampling approach is m responding . Another way of setting B is to use standard rules of k=1 kN1 N1 (m + 1) thumb for interactivity (see Section 2.3). λ(UN1 ) = = . k 2 Linear decrease. In the linear decrease approach, we draw N1 3.4 Tuning Across Iterations samples in the first iteration, and draw Nk = Nk−1 − β sam- So far, we have considered only the case where the algorithm ples for some fixed parameter β ≥ 0 in each subsequent itera- takes the same number of samples C per group across iterations tions. We denote this sampling approach by LN1 ,β . The total num- and does not reuse the samples across different iterations. For any ber of samples drawn in the first k iterations of this approach is given iteration, this provides us with an -guarantee for the error (LN ,β ) relative to the best segment approximation. If we instead reuse Tk 1 = ki=1 Ni = N1 k − k (k−1) 2 β . The guarantee on the the samples drawn in previous iterations, the error at iteration k is average error per iteration for this approach is 288 a σ 2 m 288 a σ 2 2 k = mkC ln 4m δ , where k C is the total number of samples err(LN1 ,β ) = i=1 m2 Tk ln 4m δ drawn so far. Therefore, the decrease in the error up to iteration k, 288 a σ 2 (α−1)(−Ψ(0) (m−2N1 /β)Ψ(0) (−2N1 /β)+Ψ(0) (m+1)+γ ) 4m k , from the error up to iteration (k − 1), k−1 , is (k − 1)/k, = m2 (N1 +β/2) ln δ . where k = (k − 1)/k k−1 . This has the following effect: later (n) where the Ψ (x) is the nth derivative of the digamma function [25] iterations both take the same amount of time as previous ones, and and γ is the Euler-Mascheroni constant [47]. The interactivity of produce only minor updates of the visualization. the linear decrease approach is Sampling Approaches. This observation indicates that we can ob- k Nk (m−k+1) k (N1 −β(k−1))·(m−k+1) tain higher interactivity with only a small effect on the accuracy λ(LN1 ,β ) = k=1 k = k=1 k . of the approximations by considering variants of the algorithm that = N1 m− k −1 + β 2k 2 − 3k + 1 − 3mk + 3m . 2 6 decrease the number of samples across iterations instead of draw- ing the same number of samples in each iteration. We consider Geometric decrease. In the geometric decrease approach, we three natural approaches for determining the number of samples draw N1 samples in the first iteration and draw Nk = Nk−1 /α we draw at each iteration: linear decrease (i.e., reduce the number samples for some fixed parameter α > 1 in each subsequent it- of samples by β at each iteration), geometric decrease (i.e., divide erations. We denote this sampling approach by GN1 ,α . This ap-
8 . Approach decrease parameter Tk Error Interactivity k (k−1) β m 1 k −1 N1 β 2 Linear Decrease β N1 k − 2 A k=1 Tk N1 m− 2 m + 6 2k − 3k + 1 − 3mk + 3m αk −1 m 1 N1 m (α−1) αm +αm −1 Geometric Decrease α N1 k−1 A α (α−1) k=1 Tk m (α−1)2 αm−1 A·Hm N1 (m+1) Constant Sampling α = 1, or β = 0 kN1 N1 2 A·m All-upfront — Tk=1 = N1 and Tk>1 = 0 N1 mN1 288 a σ 2 4m Table 1: Expressions for error and interactivity for different sampling approaches. Here, A = m2 ln δ N1 proach draws Nk = αk−1 samples at iteration k, for a total of the same interactivity measure as the constant sampling approach, (GN ,α ) k αk −1 it still has a strictly worse error guarantee. Tk 1 = Ni = i=1 N1 αk−1samples in the first k iter- (α−1) ations. The error guarantee of this approach is Theorem 3. If for a setting of parameters, a constant sampling m approach and an all-upfront sampling approach have the same in- 288 a σ 2 αk−1 (α−1) err(GN1 ,α ) = m2 N1 αk −1 ln 4m δ teractivity, then the error of constant sampling is strictly less than i=1 the error of all-upfront sampling. 2 (0) (0) 288 a σ (α − 1) Ψα (m + 1) − Ψα (1) = ln 4m Proof. Assume that we have two sampling approaches AUN1 and δ m2 N1 α ln α UN1 with the same interactivity. Thus, we have the following rela- (n) where Ψα (x) is the q-digamma function. The interactivity of the tionship between N1 and N1 . geometric decrease approach is (U ) N (m + 1) k m N1 = λ(AUN1 ) = λ N1 = 1 ≤ m N1 . Nk (m−k+1) 2 λ(GN1 ,α ) = k=1 k Given the equation, we have k k 288 a σ 2 = N1 (m+1) 1 − N1 k err(AUN1 ) = mN1 ln 4m δ k αk−1 k αk−1 k=1 k=1 288 a σ 2 4m N1 (m+1) αk −1 N1 αk +1 −α(k +1)+k ≥ ln = k αk −1 (α−1) − k αk −1 (α−1)2 . mN1 δ m (UN ) For the case where k = m, then ≥ Hm · err 1 . N1 m (α − 1) αm − αm + 1 Note that Hm = O(log n). Hence, given the same interactivity, the λ(GN1 ,α ) = . m (α − 1)2 αm−1 non-incremental sampling has a large error compare to the uniform All-upfront. Finally, the (non-interactive) all-upfront sampling sampling. approach is the one where we draw N1 samples in the first itera- tion and no samples thereafter. Let (AUN1 ) denote this sampling Optimal interactivity and decrease parameters. We now show approach. For any k ≥ 1, the total sample complexity of this ap- that for both linear and geometric decrease approaches, the optimal (AUN1 ) interactivity is obtained for an explicit choice of decrease parameter proach is Tk = N1 . The error guarantee of this approach that depends only on the initial number of samples and the total is number of iterations. To do so, we prove the following two lemmas m 288 a σ 2 288 a σ 2 (lemma 3 and 4). err(AUN1 ) = m2 ln 4m δ 1 N1 = mN1 ln 4m δ , i=1 Lemma 3. In geometric sampling with a fixed N1 , α∗ = (N1 − and its interactivity is 1)1/(m−1) has the optimal interactivity and it has smaller error than all of the geometric sampling approaches with α > α∗ . m N1 λ(AUN1 ) = k=1 k = m N1 . Proof. If α > α∗ , it is clear that we do not draw any sample at the last iteration, and therefore, k < m. However, when α ≤ α∗ , then 3.4.2 Comparing the Sampling Approaches k = m. Hence, We now compare different sampling approaches based on the ex- N1 m (α − 1) αm − αm + 1 pressions of error and interactivity obtained in Section 3.4.1. We λ(GN1 ,α ) = . m (α − 1)2 αm−1 show that among the four sampling approaches, geometric decrease is the most desirable approach in order to optimize for both error Then, (G ∂λ N1 ,α ) α−m ((m+1)αm −(m−1)αm+1 +m−1−α(1+m)) and interactivity. 
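The four schedules in Table 1 are easy to compare numerically: enumerate the per-iteration sample counts Nk, accumulate the cumulative totals Tk, and evaluate λ. The sketch below uses our own function names and example parameters (N1 = 5000, m = 36, arbitrary decrease parameters); a schedule is truncated once fewer than one sample would be drawn.

def constant(n1, m):
    return [n1] * m

def linear(n1, beta, m):
    return [n1 - k * beta if n1 - k * beta >= 1 else 0 for k in range(m)]

def geometric(n1, alpha, m):
    return [n1 / alpha ** k if n1 / alpha ** k >= 1 else 0 for k in range(m)]

def all_upfront(n1, m):
    return [n1] + [0] * (m - 1)

def interactivity(schedule):
    """lambda = (sum over iterations of the cumulative sample count T_k) / k',
    where k' is the number of iterations with N_k > 0."""
    k_prime = sum(1 for n in schedule if n > 0)
    total, cumulative = 0, 0
    for n in schedule:
        cumulative += n
        total += cumulative
    return total / k_prime

n1, m = 5000, 36
for name, sched in [("constant", constant(n1, m)),
                    ("linear", linear(n1, 100, m)),
                    ("geometric", geometric(n1, 1.2, m)),
                    ("all-upfront", all_upfront(n1, m))]:
    print(f"{name:12s} lambda = {interactivity(sched):12.1f}")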
To do so, we first discount the all-upfront sam- = ∂α (α−1)3 . pling approach by showing that this approach has a strictly worse α−m (αm (2−(m−1)(α−1)))−(α−1)(m+1)−2) error guarantee than constant sampling—the sampling approach = (α−1)3 proposed in Section 3.2.2. We then obtain the optimal values in By writing the binomial approximation for αm = (1 − (1 − α))m . terms of interactivity for the decrease parameter of the geometric It is not hard to see that the derivative is negative for α ≥ 1 for suf- (α) and linear (β) decrease approaches. Furthermore, we show that ficiently large m. Therefore, λ(GN1 ,α ) (α∗ ) is the minimum among given the initial number of samples, the interactivity of geometric all α’s in (1, α∗ ]. decrease approach with the optimal decrease parameter is strictly For α > α∗ , we stop sampling after k = log N1 iterations. By better than the interactivity of the linear decrease and constant sam- log α pling approaches. Finally, we suggest a theoretically optimal range replacing k in the interactivity formula we can obtain an expres- of the geometric decrease parameter that leads to a pareto frontier sion for λ(GN1 ,α ) . On examining the derivative we find out that along which users can optimize for either error or interactivity. λ(GN1 ,α ) is again an increasing function of α. Therefore, α∗ has All-upfront VS Constant Sampling. Clearly, if we compare the optimal interactivity. all-upfront sampling with the constant sampling approach with the Now, we show that GN1 ,α∗ has smaller error compared to GN1 ,α same number of initial samples N1 , then the constant sampling ap- for α > α∗ . Suppose α > α∗ . It is clear that proach has smaller error. In fact, more is true: even if we allow the (GN ,α ) N1 N1 (GN ,α∗ ) all-upfront to take more samples in the initial iteration so that it has Nk 1 = ≤ ∗ k−1 = Nk 1 . (α)k−1 (α )
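Lemmas 3 and 4 give the interactivity-optimal decrease parameters in closed form, and Table 1 gives λ in closed form as well. The small sketch below (our own naming) computes α*, β*, and the resulting λ values so that the ordering claimed in Theorem 4 can be checked for concrete N1 and m.

def optimal_parameters(n1, m):
    """alpha* = (N1 - 1)^(1/(m-1)) and beta* = (N1 - 1)/(m - 1) (Lemmas 3 and 4);
    both schedules then end with exactly one sample at iteration m."""
    return (n1 - 1) ** (1 / (m - 1)), (n1 - 1) / (m - 1)

def lam_constant(n1, m):
    return n1 * (m + 1) / 2

def lam_linear(n1, beta, m):
    return n1 * (m + 1) / 2 - beta * (m * m - 1) / 6

def lam_geometric(n1, alpha, m):
    # Closed form from Table 1: N1 (m (alpha-1) alpha^m - alpha^m + 1) / (m (alpha-1)^2 alpha^(m-1))
    return n1 * (m * (alpha - 1) * alpha ** m - alpha ** m + 1) / (m * (alpha - 1) ** 2 * alpha ** (m - 1))

n1, m = 5000, 36
alpha_star, beta_star = optimal_parameters(n1, m)
print(alpha_star, beta_star)              # roughly 1.275 and 142.8
print(lam_geometric(n1, alpha_star, m),   # smallest lambda, i.e., best interactivity
      lam_linear(n1, beta_star, m),
      lam_constant(n1, m))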
9 . (GN ,α ) (GN ,α∗ ) Thus, for any k ∈ [m], Tk 1 ≤ Tk 1 . Therefore, we can Proof. By Lemma 4, for a fixed N1 , the interactivity of LN1 ,β is see that the error of the GN1 ,α∗ is smaller than GN1 ,α . minimum at β ∗ = (N1 − 1)/(m − 1). In addition, the uniform m 288 a σ 2 sampling approach is a special case of linear decrease sampling ap- err(GN1 ,α∗ ) = i=1 (GN ,α∗ ) ln 4m δ proach when β is zero, and therefore has worse interactivity com- m2 Tk 1 pared to LN1 ,β ∗ . Thus, it suffices to compare the interactivity of m 288 a σ 2 4m ≤ i=1 (GN ,α ) ln δ LN1 ,β ∗ and GN1 ,α∗ to conclude the theorem. m2 Tk 1 (LN ,β ∗ ) (GN ,α∗ ) (GN ,α∗ ) First, we prove that Nk 1 is at least Nk 1 . Nk 1 = = err(GN1 ,α ) . N1 /(α∗ )k−1 is a upward convex function with respect to k. Thus, Thus, the proof is complete. the line segment between any two points of this function lies above it. This sampling approach takes N1 samples in the first itera- Lemma 4. In linear sampling with a fixed N1 , β ∗ = (N1 − (GN ,α∗ ) 1)/(m − 1) has the optimal interactivity and it has strictly smaller tion and one sample in iteration m. Thus, Nk 1 is below error than all of the linear sampling approaches with β > β ∗ . the segment that connects (1, N1 ) and (m, 1). On the other hand (LN ,β ∗ ) Nk 1 = N1 − (k − 1)β ∗ is a linear function of k. Also, Proof. To compute the interactivity of the linear decrease approach, LN1 ,β ∗ takes N1 samples in the first iteration and one sample in it- first we need to find k based on β. If the sampling lasts for m it- (LN ,β ∗ ) erations, we draw at least one sample in the last iteration. In other eration m. Thus, Nk 1 is on the segment (1, N1 ) and (m, 1). (LN ,β ∗ ) (GN ,α∗ ) words, N1 − (m − 1)β is at least one. Therefore, β ≤ β ∗ = Therefore, Nk 1 is at least Nk 1 and the equality hap- (N1 − 1)/(m − 1) the number of iterations, k , is m. Then, the pens only at k = 1 and k = m. Using the interactivity formula it interactivity is is not hard to see (GN ,α∗ ) m+1 β m Nk 1 (m−k+1) λ(LN1 ,β ) = N1 − m2 − 1 . λ(GN1 ,α∗ ) = 2 6 i=1 m m (LN ,β ∗ ) (LN1 ,β ) Nk 1 (m−k+1) For sufficiently large m, λ is decreasing with respect to β. < = λ(LN1 ,β∗ ) Therefore, β ∗ has the optimal interactivity among β’s in [0, β ∗ ]. i=1 m For β > β ∗ , we stop sampling after k = Nβ1 iterations. By Therefore, the geometric decrease sampling approach, GN1 ,α∗ , has replacing k in the interactivity formula, we have better interactivity than the interactivity of any linear decrease sam- pling method. N1 (m + 1) N2 (3m + 1)β λ(LN1 ,β ) = − 1 + In lemma 3, we prove that for fixed N1 λ(GN1 ,α ) (α∗ ) is the min- 2 6β 6 imum among all α’s in (1, α∗ ]. Now, constant sampling is a spe- (LN ,β ) It is not hard to see that ∂λ ∂β1 is positive and this function is cial case of geometric decrease with parameter α = 1. Therefore, given N1 , the geometric decrease sampling approach, GN1 ,α∗ , has increasing with respect to β. Therefore, β ∗ has the optimal interac- better interactivity than the interactivity of the constant sampling tivity among β’s in [β ∗ , β]. Hence β ∗ has the optimal interactivity. approach. This completes our proof. Now, we show that LN1 ,β ∗ has smaller error compared to LN1 ,β for β > β ∗ . Suppose β > β ∗ . It is clear that Next, we discuss how the optimal decrease parameter α∗ con- (LN ,β ) ∗ (LN ,β ∗ ) Nk 1 = N1 − (k − 1)β ≤ N1 − (k − 1)β = Nk 1 . tributes to a knee shaped error-interactivity trade-off curve. (LN ,β ) (LN ,β ∗ ) Optimal α and Knee Region. 
As shown in Lemma 3, given Thus, for any k ∈ [m], Tk 1 ≤ Tk 1 . Therefore, we can N1 , we can compute the optimal decrease parameter α∗ , such that easily see error of the LN1 ,β ∗ is smaller than LN1 ,β . any value of α > α∗ results in higher error and lesser interactiv- m 288 a σ 2 ity (higher λ). This behavior results into the emergence of a knee err(LN1 ,β∗ ) = i=1 (LN ,β ∗ ) ln 4m δ m2 Tk 1 region in the error-interactivity curve which is confirmed in our m 2 288 a σ 4m experimental results (see Section 5.5). Essentially, starting from ≤ ln i=1 (LN ,β ) m2 Tk 1 δ α = 1 any value α ≤ α∗ has smaller error than any value α > α∗ . Therefore, for any given N1 there is an optimal region [1, α∗ ]. For = err(LN1 ,β ) . example, for N1 = 5000, 25000, 5000, and 10000, the optimal Thus, the proof is complete. range of α is [1, 1.024], [1, 1.028], [1, 1.03], and [1, 1.032], re- spectively. By varying α along the optimal region one can either Next, we show that given the initial samples N1 , geometric de- optimize for error or interactivity. For example, starting with α = 1 crease has the optimal interactivity among the three interactive ap- as α → α∗ we trade accuracy for interactivity. proaches (linear decrease and constant sampling being the other two). 3.5 Extensions Optimal Interactivity Given N1 . Our experimental results in The incrementally improving visualization generation algorithm Section 5.5 suggests that the geometric decrease approach with the described previously for simple queries with the AVG aggrega- optimal choice of parameter α∗ has better error than not only the tion function can also be extended to support aggregations such as all-upfront approach but the linear decrease and constant sampling COUNT and SUM, as well as additional predicates (via a WHERE approaches as well. This remains an open question, but when we clause). fix the initial sample complexity N1 (which is proportional to the bound on the maximum time taken per iteration as provided by the 3.5.1 The COUNT Aggregate Function user), we can show that geometric decrease with the right choice Given that we know the total number of tuples in the database, of parameter α does have the optimal interactivity among the three estimating the COUNT aggregate function essentially corresponds interactive approaches. to the problem of estimating the fraction of tuples τi that belong Theorem 4. Given N1 , the interactivity of geometric decrease with to each group i. Formally, τi = mni nj when nj is the num- j=1 parameter α∗ = (N1 − 1)1/(m−1) is strictly better than the inter- ber of tuples in group j. We focus on the setting below when we activity of any linear decrease approach and constant sampling. only use our bitmap index. We note that approximation techniques
10 .for COUNT queries have also been studied previously [7, 24, 26], Known Group Sizes. A simple variant of Algorithm 1 can also for the case when no indexes are available. As we see below, we be used to approximate SUM in this case. Let ni be the size of will also use the COUNT estimation as a subroutine for incremen- group i and κ = maxj nj . As in the original setting, the algorithm tally improving approximations to the SUM function in the setting computes the estimate µ ˜i of the average of the group. Then s˜i = where we don’t know the group sizes. ˜i is an estimate on the sum of each group i, namely si , that is ni µ Approach. Using our sampling engine, we can estimate the frac- used in place of the average in the rest of the algorithm. We have: tions τi by scanning the bitmap index for each group. When we 288a2 σ 2 mn2 Theorem 6. Assume we have Ci = 2 κ2 i ln 4m δ samples retrieve another sample from group i, we can also estimate the num- † ber of tuples we needed to skip over to reach the tuple that belongs from group i. Then, the refinement Lk+1 of Lk that maximizes the to group i—the indices allow us to estimate this information di- estimated improvement potential φ+ (L†k+1 ) will have error that rectly, providing us an estimate for τi , i.e., τ˜i . exceeds the error of the best refinement L∗k+1 of Lk by at most Theorem 5. With an expected total number of samples Ccount = err + (L†k+1 ) − err + (L∗k+1 ) ≤ κ2 . m + γ −2 ln(2m/δ)/2 , the τ˜i , ∀i can be estimated to within a factor γ, i.e., |˜ τi − τi | ≤ γ holds with probability at least 1 − δ. Proof. Fix any boundary segment T in Lk . Then, we draw xi,1 , xi,2 , . . . , xi,Ci from the groups i ∈ T . It is not hard to see Essentially, the algorithm takes as many samples from each group until the index of the sample is ≥ γ −2 ln(2m/δ)/2 . We show 1 1 Ci ni xi,j that overall, the expected number of samples is bounded above by s˜T − sT = |T | s˜i − si = |T | Ci − ni µi i∈T i∈T j=1 Ccount . Since this number is small, we don’t even need to incre- Ci mentally estimate or visualize COUNT. = 1 ni xi,j − ni µi |T | Ci Ci i∈T j=1 Proof. To show this, we use repeated applications of Hoeffding’s If xi,j is a sub-Gaussian random variable with parameter σ 2 , then inequality and union bound, along with the analysis of a Bernoulli ni xi,j /Ci is a sub-Gaussian random variable with parameter (ni σ/Ci )2 . trial. Let θi,j be the index of the j-th sample from group i. An- Using the Hoeffding bound for sub-Gaussian random variables, other way of interpreting θi,j is that among the first θi,j items in κ2 m the dataset, j of them were from group i. This is equivalent to s2T − s2T | > Pr |˜ 6|T | drawing θi,j samples in the bitmap where only j of them are from κ2 m the group i. Thus, j/θi,j is an unbiased estimate for τi . Using the = Pr |˜ sT − sT |(˜ sT + sT ) > 6|T | Hoeffding’s inequality, we have that Ci j δ ni κ2 m 2 ≤ Pr | (xi,j − µi )| > Pr − τi > γ ≤ 2e−2γ θi,j ≤ i∈T j=1 Ci 12a ni θi,j m i∈T −2 whenever θi,j is greater than or equal to θ0 = γ ln(2m/δ)/2 . 2 4 2 Therefore, if the θi,j ’s are big enough, we can assume we have a ≤ 2 exp − κ m 2 good estimation of τi ’s with probability 1 − δ. 288a2 σ 2 ni n2 i /Ci In this approach, we do not query the dataset. However, we query i∈T i∈T the bitmap index. To reach θi,j that is greater than θ0 , we query the bitmap index to obtain θi,1 , θi,2 , . . . , θi,j until we see θi,j which is 2 2 κ m 2 ≤ 2 exp − at least θ0 . 
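As a concrete illustration of the stopping rule just described, the following minimal sketch estimates each group fraction by walking the bitmap index until the position of the latest sample from a group reaches θ0 = γ⁻² ln(2m/δ)/2. The helper next_sample_position is hypothetical (it stands in for the sampling engine's index lookup); the sketch is an illustration of Theorem 5, not the engine's actual implementation.

import math

def estimate_group_fractions(next_sample_position, m, gamma, delta):
    # Estimate tau_i = n_i / n for every group i using only the bitmap index.
    # next_sample_position(i) is a hypothetical helper: successive calls return
    # theta_{i,1}, theta_{i,2}, ..., the dataset positions of group i's tuples.
    theta_0 = math.log(2 * m / delta) / (2 * gamma ** 2)
    tau_hat = {}
    for i in range(m):
        j, theta = 0, 0.0
        while theta < theta_0:              # keep sampling group i until its
            j += 1                          # latest position reaches theta_0
            theta = next_sample_position(i)
        tau_hat[i] = j / theta              # j / theta_{i,j} estimates tau_i
    return tau_hat

With γ = ε/(24a), these are exactly the estimates that the unknown-group-size SUM algorithm described later consumes.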
Here, we compute the expected value of the number of 288a2 σ 2 |T |2 n2 i /Ci queries where each query is a Bernoulli trial. We define a Bernoulli i∈T trial as follows: we draw an item from the dataset if the item be- δ longs to group i then it is a success. Otherwise, it is a failure. We ≤ 4m . know that among the first θi,j−1 , j − 1 of them were successful. where the last inequality is true because Thus, we have Ci 288a2 σ 2 m 4m 288a2 σ 2 |T |3 4m E[j] = E[j − 1] + 1 ≤ E[#success in θ0 trials] + 1 = τi θ0 + 1. 2 = 2 2 ln ≥ 2 κ2 m2 ln . ni κ δ δ Therefore, for m groups, the expected number of queries to the Therefore, for every boundary segment T of the k-segment ap- bitmap index, Ccount = θ0 + m = m + γ −2 ln(2m/δ)/2 . proximation Lk , we obtain an estimate s˜T of the mean sT that κ2 m satisfies s˜2T − s2T ≤ 6|T | . 3.5.2 The SUM Aggregate Function By replacing C with Cq in line 6 of Algorithm 1 and substitut- The problem of obtaining the segment approximations for SUM ing the calculations for mean of groups and boundary segments to is similar to the AVG case—at each iteration (k + 1), a segment their respective sums, we obtain our incrementally improving vi- is split into two new segments to obtain the k + 1-segment ap- sualization generation algorithm for SUM with known group sizes proximation Lk+1 such that the estimated improvement potential case. Note that here we have κ2 instead of : while at first glance is maximized. For the SUM problem, we define the estimated im- this may seem like an amplification of the error, it is not so: first, provement potential as φ+ . In the online sampling scenario, we the scales are different—and visualizing the SUM is like visualiz- again obtain a guarantee for the empirical error of the k-segment ing AVG multiplied by κ—an error by one “pixel” for the former approximation for SUM, err + . There are two settings we con- would equal κ times the error by one “pixel” for the latter but are sider for the SUM aggregate function: when the group sizes (i.e., visually equivalent; second, our error function uses a squared 2 - the number of tuples in each group) are known, and when they are like distance, explaining the κ2 . unknown. For both cases we show that if we estimate the boundary Unknown Group Sizes. For this case, we simply employ the segments accurately, then we can find a split which is very close known group size case, once the COUNT estimation is used to to the best possible split. At iteration (k + 1), we define T (I, η), first approximate the τi with γ = /24a. The task of approximat- where I = [p, q] and 1 ≤ p ≤ q ≤ m to be a boundary segment if ing the SUM aggregate function when we do not know the size of either p or q is a split group in Lk (see Section 3.2). each group can be completed by combining the algorithmic tools
11 .described in earlier sections. Specifically, we can use the approach of all flights landing in ORD airport on December 22nd. Our al- described in the COUNT section to first approximate the size of gorithms still work even if we have selection predicates on one or each group. We can then modify the Algorithm 1 for approximat- more attributes, as long as we have an index on the group-by at- ing the AVG function to show the fractional sum. Since we have tribute. The sampling engine’s bitmap indexes allow us to retrieve, ˜i κt , where κt denotes the total number of items: m s˜i = τ˜i µ i=1 ni on demand, tuples that are from any specific group i that satisfy the , it suffices to run our Algorithm 1 for visualizing the τi µi and mul- selection conditions specified, by using appropriate AND and OR tiply all of them by κ2t . Therefore, we state the following theorem: operators [33]. 1152 a2 σ 2 τ ˜2 Theorem 7. Assume we have Ci = 2m i ln 4m δ samples † from group i. Then, the refinement Lk+1 of Lk that maximizes the 4. I NC V ISAGE SYSTEM ARCHITECTURE estimated φ+ (L†k+1 ) will have error that exceeds the error of the Figure 2 depicts the overall architecture of I NC V ISAGE. The I N - C V ISAGE client is a web-based front-end that captures user input best refinement L∗k+1 of Lk by at most err + (L†k+1 ) − err + and renders visualizations produced by the I NC V ISAGE back-end. (L∗k+1 ) ≤ κ2t . The I NC V ISAGE back-end is composed of three components: (A) a query parser, which is responsible for parsing the input query QT Proof. Fix any boundary interval T in Lk . Then, we draw xi,1 , or QH (see Section 2.1), (B) a view processor, which executes IS- xi,2 , . . . , xi,C from the groups i ∈ T . It is not hard to see plit (see Section 3), and (C) a sampling engine which returns the samples requested by the view processor at each iteration. As dis- κt sT − sT | = |T1 | |˜ s˜i − si = |T | ˜ i − τ i µi τ˜i µ cussed in Section 2.1, I NC V ISAGE uses a bitmap-based sampling i∈T i∈T engine to retrieve a random sample of records matching a set of ad- = |Tκt τ˜i ˜ µ i − ˜ τ i µ i + ˜ τ i µ i − τ i µ i hoc conditions. The sampling engine is the same as the sampling | i∈T engine used in IFOCUS [32]. At the end of each iteration, the view κt κt processor sends the visualizations generated to the front-end en- = |T | τ˜i (˜ µ i − µ i ) + |T | µ (˜ i iτ − τ i ) i∈T i∈T coded in json. To reduce the amount of json data transferred, only Ci the changes in the visualization (i.e., the refinements) are commu- κt τ ˜ x i i,j κt ≤ |T | Ci − τ˜i µ i + |T | µ i (˜ τi − τ i ) nicated to the front-end. The front-end then parses the data and i∈T j=1 i∈T generates the visualization. Ci κt τ ˜i xi,j τ ˜i µi κt = |T | Ci − Ci + |T | τi − τi ) . µi (˜ i∈T j=1 i∈T Now, we show that each of the term above are less than m κt /(24a|T |). If xi,j is a sub-Gaussian random variable with parameter σ 2 , then τi σ/Ci )2 . τ˜i xi,j /Ci is a sub-Gaussian random variable with parameter (˜ Using the Hoeffding bound for sub-Gaussian random variables, Ci τ ˜i xi,j τ ˜i µi m Pr Ci − Ci > 24a i∈T j=1 2 m2 ≤ 2 exp − 1152 a2 σ2 ˜i2 /Ci τ i∈T δ ≤ 4m . Figure 2: I NC V ISAGE Architecture Thus, the above expression is true for all T with probability 1−δ/2. The front-end is responsible for capturing user input and ren- Then, the first term in the previous equation is at most mκt /(24a|T |). dering visualizations generated by the I NC V ISAGE back-end. The Let γ = /(24a). 
Using the algorithm explained for estimating τi visualizations (i.e., the segment approximations) generated by the in the COUNT section, with the right set of parameter, one can back-end incrementally improve over time, but the users can pause assume with probability 1 − δ/2, |τ˜i − τi | is at most γ. Thus, with and replay the visualizations on demand. Figure 3 depicts the probability 1 − δ, s˜T − sT is at most κt /(12a). Thus, we have web front-end for I NC V ISAGE comprising four parts: (A) a query κt κ2t builder used to formulate queries based on user input; (B) a visual- s2T − s2T | = |˜ |˜ sT − sT |(˜ sT + sT ) ≤ · 2κt a = 12a 6|T | ization display pane; (C) a play bar to interact with the incremen- Therefore, for every boundary segment T of the k-segment ap- tally improving visualization being generated by allowing users to proximation Lk , we obtain an estimate s˜T of the mean sT that pause the visualization generation or rewind to older iterations (i.e., κ2 previous segment approximations), and (D) a snapshot pane for satisfies s˜2T − s2T ≤ t 6|T | . saving the image of the current iteration or segment approxima- tion being viewed in case the user wants to compare the current Similar to the known group sizes case, by replacing C with Cq in visualization with future ones and identify the differences. There is line 6 of Algorithm 1 and substituting the calculations for mean of an additional color legend (E) for heatmaps to allow users to inter- groups and boundary segments to their respective sums, we obtain pret the values of the different heatmap blocks. For the user study our incrementally improving visualization generation algorithm for described in Section 6, we additionally added a question-answer SUM with unknown group sizes case. module (F) to the front-end for submitting answers to the user study questions and also for displaying the points (i.e. the score) obtained 3.5.3 Selection Attributes by the user. Consider the following query: QT = SELECT Xa , AVG(Y) FROM R GROUP 5. PERFORMANCE EVALUATION BY Xa ORDER BY Xa WHERE Pred We evaluate our algorithms on three primary metrics: the error Here, we may have additional predicates on Xa or some other of the visualizations, the degree of correlation with the “best” al- attributes. For instance, we may want to view the average delay gorithm in choosing the split groups, and the runtime performance.
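Returning to the refinement messages mentioned in the architecture description above: the back-end ships only the per-iteration changes, encoded in JSON. The snippet below shows one hypothetical shape such a delta could take; the field names are illustrative and are not INCVISAGE's actual wire format.

import json

# Hypothetical shape of one refinement message; the paper states only that the
# per-iteration changes are JSON-encoded, so these field names are illustrative.
refinement = {
    "iteration": 4,                                  # the 4-segment approximation
    "split_segment": {"start": 121, "end": 365},     # segment chosen for splitting
    "split_group": 240,                              # group at which it is split
    "new_segments": [
        {"start": 121, "end": 240, "value": 18.7},   # estimated mean of each new segment
        {"start": 241, "end": 365, "value": 31.2},
    ],
}
payload = json.dumps(refinement)  # shipped to the front-end, which patches its current view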
12 . Datasets and Queries. The datasets used in the performance eval- uation experiments and the user studies (Section 6 and Section 7) are shown in Table 2 and are marked by ticks ( ) to indicate where they were used. For the US flight dataset we show the results for the attribute Arrival Delay (FLA) and Departure Delay (FLD). For all three years of the NYC taxi data, we present the results for the attribute Trip Time. For the weather dataset, we show results for the attribute Temperature. For all the datasets, we remove the out- liers. To verify our sub-Gaussian assumption, we generated a Q-Q plot [49] for each of the selected attributes to compare the distri- butions with Gaussian distributions. The FL, T11, T12, and T13 Figure 3: Front End datasets all exhibit right-skewed Gaussian distributions while WG exhibits a truncated Gaussian distribution We also performed his- We also discuss how the choice of the initial samples N1 and sam- togram analysis of the datasets to further confirm the observations pling factor α impacts the error and interactivity. obtained from Q-Q plot analysis. The results can be found in Ap- pendix 4. Unless stated explicitly, we use the same query in all the 5.1 Experimental Setup experiments—calculate the average of an attribute at each day of Algorithms Tested. Each of the incrementally improving visual- the year. Here, the number of groups (days), m = 366. ization generation algorithms that we evaluate performs uniform Metrics. We use the following metrics to evaluate the algorithms: sampling, and take either B (time budget) and f (sampling rate of Error: We measure the error of the visualizations generated at each the sampling engine) as input, or (desired error threshold) and δ iteration k via err(Lk ) (see Section 2.3). The average error across (the probability of correctness) as input, and in either case com- iterations is computed as: err(Lk ) = m 1 m k=1 err(Lk ). putes the required N1 and α. The algorithms are as follows: Time: We also evaluate the wall-clock execution time. ISplit: At each iteration k, the algorithm draws Nmk samples uni- Correlation: We use Spearman’s ranked correlation coefficient to formly from each group, where Nk is the total number of samples measure the degree of correlation between two algorithms in choos- requested at iteration k and m is the total number of groups. ISplit ing the order of groups as split groups. We use the order in which uses the concept of improvement potential (see Section 3) to split the groups were selected as split groups by an algorithm to compute an existing segment into two new segments. a ranked list for the purpose of applying Spearman’s correlation RSplit: At each iteration k, the algorithm takes the same samples as coefficient. Spearman’s ranked correlation coefficient captures the ISplit but then selects a segment, and a split group within that, all at correlation between two ranked lists. If we consider the iteration at random to split. Our goal in including this algorithm is to evaluate which a group was chosen as a split group as the rank of that group, whether the improvement potential based approach can generate we can get a ranked list of the groups for each of the algorithms. visualizations with lower error compared to the random choices. Then, we can compute the correlation between any two ranked lists. 
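For concreteness, a minimal sketch of this correlation computation follows, using the standard Spearman formula r(E, F) = 1 − 6 Σᵢ (eᵢ − fᵢ)² / (m(m² − 1)) over the two split-group rank lists; it is an illustration rather than the exact evaluation harness used in the experiments.

def spearman_rank_correlation(e, f):
    # e[i], f[i]: rank of group i under algorithms E and F, i.e., the iteration
    # at which each algorithm picked group i as a split group.
    # Standard Spearman formula: r = 1 - 6 * sum(d_i^2) / (m * (m^2 - 1)).
    m = len(e)
    d2 = sum((ei - fi) ** 2 for ei, fi in zip(e, f))
    return 1.0 - (6.0 * d2) / (m * (m ** 2 - 1))

# Identical orderings give r = 1; a fully reversed ordering gives r = -1.
print(spearman_rank_correlation([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))   # 1.0
print(spearman_rank_correlation([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))   # -1.0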
ISplit-Best: The algorithm simulates the ideal case where the mean Let e1 , · · · , em and f1 , · · · , fm are the two ranked lists where ei of all the groups are known upfront (see Section 3.1). The visual- and fi are the ranks of group i assigned by algorithm E and F , izations generated have the lowest possible error at any given itera- respectively. The Spearman’s ranked correlation coefficient (r) is tion (i.e. for any k-segment approximation) among approaches that defined as follows: perform refinement of previous approximation. We include this al- 6 m 2 i=1 (ei − fi ) gorithm to measure how close the error of ISplit is to the lowest r(E, F ) = 1 − m(m − 1) 2 possible error when the iterative refinement constraint is respected. A value close to 1 (-1) for the Spearman’s coefficient indicates posi- DPSplit: At a high level, at each iteration k, this algorithm takes tive (negative) correlation between the two lists, while a value close the same samples as ISplit, but instead of performing refinement, to 0 indicates no correlation. DPSplit recomputes the best possible k-segment approximation us- ing dynamic programming. We include this algorithm to measure Interactivity: We use a new metric called interactivity (defined in the impact of the iterative refinement constraint. There are two rea- Section 2.3) to select appropriate parameters for ISplit. Interactivity sons why this algorithm is not preferred: the visualizations change is essentially the average waiting time for generating the segment drastically in each iteration, and the dynamic programming com- approximations. We explore the measure in Section 5.5. putation can be extremely costly online. Implementation Setup. We evaluate the runtime performance of DPSplit-Best: This algorithm simulates the case where the means all our algorithms on a bitmap-based sampling engine [33]. In ad- of all the groups are known upfront, and the same approach for dition, we implement a SCAN algorithm which performs a sequen- producing k-segment approximations used by DPSplit is used. The tial scan of the entire dataset. This is the approach a traditional visualizations generated have the lowest possible error among the database system would use. Since both ISplit and SCAN are im- algorithms mentioned above since they have advance knowledge of plemented on the same engine, we can make direct comparisons of the means, and do not need to obey iterative refinement. execution times between the two algorithms. All our experiments are single threaded and are conducted on a HP-Z230-SFF worksta- Name Description #Rows Size (GB) E U tion with an Intel Xeon E3-1240 CPU and 16GB memory running Intel Sensor dataset Sensor Berkeley Research lab [1]. 2.2M 0.73 Ubuntu 12.04 LTS. We set the disk read block size to 256KB. To FL US Flight dataset [5] . 74M 7.2 avoid any speedups resulting from the file buffer cache, we perform T11 2011 NYC taxi trip data 170M 6.3 all the I/O operations using Direct I/O. Unless explicitly stated we for 2011 [3] assume the time budget B = 500ms and use the parameter val- 2012 NYC taxi trip data T12 for 2012 [3] 166M 4.5 ues of N1 = 25000, α = 1.02 (with a geometric decrease), and 2013 NYC taxi trip data δ = 0.05. The choice of the parameters is further explored in Sec- T13 166M 4.3 for 2013 [3] tion 5.5. All experiments were averaged over 30 trials. Weather data of US WG 415M 27 cities from 1987–2015 [6] Table 2: Datasets Used for the Experiments and User Studies. 
E = Experiments and U = User Studies (Section 6 and 7).
5.2 Comparing Execution Time
13 . In this section, we compare the Wall Clock time of ISplit, DPSplit similarly; this is because there is less skew in that dataset amongst and SCAN for the datasets mentioned in Table 2. the samples taken; and the errors are low because the trendline ends Summary: ISplit is several orders of magnitude faster than SCAN in up having a single prominent feature—a bell-shape. revealing incremental visualizations. The completion time of DPSplit exceeds the completion time of even SCAN. Moreover, when generat- 5.4 Releasing the Refinement Restriction ing the early segment approximations, DPSplit always exhibits higher Next, we compare ISplit with DPSplit and DPSplit-Best. latency compared to ISplit. Summary: Given the same set of parameters, DPSplit and ISplit have We depict the wall-clock time in Figure 4 for three different roughly similar error; as expected DPSplit-Best has much lower error datasets for ISplit and DPSplit at iteration 10, 50, and 100, and at than both ISplit and DPSplit. completion, and for SCAN. First note that as the size of the dataset Figure 7 depicts the error across iterations for ISplit (our best on- increases, the time for SCAN drastically increases since the amount line sampling algorithm from the previous section), DPSplit, and of data that needs to be scanned increases. On the other hand, the DPSplit-Best for all the datasets—we aim to evaluate the impact time for completion for ISplit stays stable, and much smaller than of the refinement restriction, and whether it leads to substantially SCAN for all datasets: on the largest dataset, the completion time lower error. From Figure 7, at each iteration DPSplit-Best has the th is almost 16 that of SCAN. When considering earlier iterations, lowest error, while ISplit and DPSplit have very similar error. Thus, ISplit performs even better, revealing the first 10, 50, and 100 seg- in order to reduce the drastic variation of the segment approxima- ment approximations within ≈ 5 seconds, ≈ 13 seconds, and ≈ 22 tions, while not having a significant impact on error, ISplit is a bet- seconds, respectively, for all datasets, allowing users to get insights ter choice than DPSplit. Furthermore from Section 5.2, we found early by taking a small number of samples—beyond iteration 50 that ISplit’s execution time is much more reasonable than DPSplit. the refinements are minor, and as a result, users can terminate the Once again, for WG, the errors are all very similar due to the single visualization early if need be. Compared to SCAN, the speed-up prominent feature in the visualization. Note also that we noticed in revealing the first 10 features of the visualization is ≈ 12X, cases, especially when N1 is small, where DPSplit is outperformed ≈ 22X, and ≈ 46X for the FL, T11 and WG datasets. Lastly we by ISplit in the earlier iterations, since it may “overfit” based on a note that ISplit reveals each increment within the 500ms bound for few skewed samples. interactivity, as required. When comparing ISplit to DPSplit, we first note that DPSplit 5.5 Optimizing for Error and Interactivity is taking the same samples as ISplit, so its differences in execu- The goal of an incrementally improving visualization generation tion time are primarily due to computation time. We see some algorithm is to provide an approximation of the original visualiza- strange behavior in that while DPSplit is worse than ISplit for the tion at interactive speed. 
On the other hand, generating highly inac- early iterations, for completion it is much worse, and in fact worse curate approximations is also not desirable. Hence, the algorithms than SCAN. The dynamic programming computation complexity need to optimize for both accuracy (error) and interactivity. So far, depends on the number of segments. Therefore, the computation we have kept the sampling parameters fixed across algorithms; here starts occupying a larger fraction of the overall wall-clock time for we study the impact of these parameters. Specifically, we evaluate the latter iterations, rendering DPSplit impractical for latter itera- the impact of the initial sample (N1 ) and sampling factor (α) on tions. These observations are confirmed in Figure 5—for DPSplit, the error-interactivity trade-off. We also show that our simulations the CPU time accounts for the major portion of the wall clock time. resulting from theoretical claims in Section 3.4 match the experi- As the number of iterations increases, the CPU time increases dras- mental results. tically. By the time DPSplit completes, the CPU time exceeds the Summary: We find: (a) Geometrically decreasing the sample complex- wall clock time of SCAN. Even for earlier iterations, DPSplit is ity per iteration leads to higher interactivity. (b) Given the time budget worse than ISplit, revealing the first 10, 50, and 100 features within (B), only a small set of sampling factors (α) improves interactivity. (c) ≈ 7 seconds, ≈ 27 seconds, and ≈ 75 seconds, respectively, as The theoretical and experimental error-interactivity trade-off curves be- opposed to 5, 13, and 22 for ISplit. Thus, at the same time as ISplit have essentially the same, producing similarly shaped knee regions. has completed 100 iterations, DPSplit has only completed 50.As 5.5.1 Parameter Selection we will see later, this additional time does not come with a com- mensurate decrease in error, making DPSplit much less desirable We now empirically evaluate the trade-off between error and in- than ISplit as an incrementally improving visualization generation teractivity. We focus on “decreasing” sampling factors, i.e., those algorithm. that result in the sample complexity decreasing across iterations— we have found that “increasing” sampling factors lead to signifi- 5.3 Incremental Improvement Evaluation cantly worse λ (i.e., interactivity), while error is not reduced by We now compare the error for ISplit with RSplit and ISplit-Best. much. We consider both geometric and linear decrease, as well as Summary: (a) The error of ISplit, RSplit, and ISplit-Best reduce across upfront sampling (Section 3.4). Figure 8 captures the trade-off be- iterations. At each iteration, ISplit exhibits lower error in generating tween average error (across iterations) on the y axis and logarithm visualizations than RSplit. (b) ISplit exhibits higher correlation with of interactivity, i.e., log λ on the x axis for the Flight (FLA), T11 ISplit-Best in the choice of split groups, with ≥ 0.9 for any N1 greater and WG dataset (other datasets are similar)—focus on the circles than 25000. RSplit has near-zero correlation overall. and triangles for now. Each line in the chart corresponds a certain Figure 6 depicts the iterations on the x-axis and the 2 -error on the number of initial samples N1 , and either geometric decrease (de- y axis of the corresponding segment approximations for each of noted with a “/”), or a linear one (denoted with a “-”). 
Each point the algorithms for two datasets (others are similar). For example, in each line corresponds to a specific value of α. For all lines, we in Figure 6, at the first iteration, all three algorithms obtain the 1- find a similar behavior or shape as α increases, which we explain segment approximation with roughly similar error; and as the num- for N1 = 25000 for geometric and linear decrease, depicted again ber of iterations increase, the error continues to decrease. The error in Figure 8d with α annotated. If we follow the geometric decrease for ISplit is lower than RSplit throughout, for all datasets, justifying points (circles) Figure 8b, we start at I ≈ 6.75 at the point corre- our choice of improvement potential as a good metric to guide split- sponding to α = 1 for geometric decrease and 0 for linear decrease, ting criteria. ISplit-Best has lower error than the other two, this is and then as α is increased the points move to the left—indicating because ISplit-Best has access to the means for each group before- that the interactivity improves, while error increases slightly. Then, hand. For the WG dataset, ISplit-Best and ISplit perform roughly we have a knee in the curve—for any α > 1.028, the trails start
moving back to the right (lower interactivity) but also see a simultaneous increase in error. A similar knee is seen if we trace the linear decrease points (triangles)—note that α = 1 is shared between the linear and geometric decrease—here, the sampling is constant across iterations. For other values of α depicted for the linear decrease points, this indicates the reduction in the number of samples per round, e.g., 50, 500, 1000. This behavior of a knee in the curve is seen for all N1 values. We also find that for the same N1, the linear decrease has worse interactivity compared to geometric decrease.
Figure 4: Comparison of Wall Clock Time (in seconds) of SCAN, and of ISplit and DPSplit at iterations 10, 50, and 100 and at completion, for the FL (70m), T11 (170m), and WG (415m) datasets.
Figure 5: Comparison of CPU Time (in seconds) of the same algorithms for the FL (70m), T11 (170m), T12 (166m), T13 (166m), and WG (415m) datasets.
Finally, on examining the upfront sampling
scheme (squares), we find that both geometric decrease and linear decrease have much better interactivity and lower error.
Overall, when selecting parameters, we would like to identify parameters that result in the favorable knee region of the curve. We find that α ∈ [1.0, 1.028], with N1 relatively small, helps us stay in this region empirically. We select α = 1.02 to balance between error and interactivity if we set the maximum allowable delay per iteration B = 500ms [38]. Based on our sampling rate, fetching 25000 samples takes around 500ms; thus we set N1 = 25000.
Figure 6: Comparison of the ℓ2 squared error of ISplit, RSplit, and ISplit-Best across iterations (panels: FLA, FLD, T11, T12, T13, WG).
Figure 7: Comparison of the ℓ2 squared error of ISplit, DPSplit, and DPSplit-Best across iterations (same datasets).
From our theoretical results in Section 3.4, the range of α for this
N1 was [1, 1.028], so our experiments agree with the theory. Next, we simulate the error-interactivity trade-off curve for the sampling approaches using the expressions of error and interactivity obtained in Section 3.4.
Figure 8: Error-interactivity trade-off curve. (/) = Geometric decrease, (-) = Linear decrease, (u) = Upfront Sampling. Panels a-c vary N1 for FLA, T11, and WG; panels d-f fix the initial samples at 25000 for FLA, T11, and WG, with the decrease parameters annotated.
5.5.2 Simulations vs. Empirical Observations
Figure 10 captures the trade-off between the upper bound of the error averaged (across iterations) on the y axis and the logarithm of interactivity, i.e., log λ, on the x axis for the Flight (FLA), T11, and WG datasets.
Each line in the chart corresponds to a certain number of initial samples N1, and either geometric decrease (denoted with a "/" and represented by circles) or a linear one (denoted with a "-" and represented by triangles). Furthermore, for each N1, we also plot the error and interactivity pair for upfront sampling (denoted by "u" and represented by squares). For geometric decrease, for all lines, we find a similar behavior as in our empirical results—a knee shape emerges as α increases starting from 1. The theoretical value of the optimal decrease factor α∗ is annotated for each initial sample N1. Furthermore, for each N1 the optimal α∗ is highlighted by a red arc and is the point with the best interactivity in each line—the same as the empirical observation. For the linear decrease, given the same decrease factors β used in the experiments, the simulation results match the experimental results. Similar to the empirical results, upfront sampling has worse error and interactivity than all the other approaches. Therefore, we can clearly see that the simulation results mimic the empirical results obtained in Section 3.4.
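The short sketch below mirrors this simulation at a small scale. It assumes the interactivity definition λ = (1/m) Σₖ Nₖ(m − k + 1), which is consistent with the closed form for linear decrease given in Section 3.4; the per-iteration schedules, the budget assigned to upfront sampling, and the parameter values are illustrative assumptions rather than the exact setup behind Figure 10.

def schedule(N1, m, kind, param=None):
    # Per-iteration sample counts N_k for the schedules compared above.
    # "geometric": N_k = N1 / alpha^(k-1); "linear": N_k = N1 - (k-1) * beta;
    # "upfront": everything in iteration 1 (here given the same total budget
    # as constant sampling, an illustrative normalization). Counts are floored
    # at 1 so that at least one sample is drawn in the last iteration.
    if kind == "upfront":
        return [N1 * m] + [0] * (m - 1)
    if kind == "geometric":
        return [max(1.0, N1 / param ** (k - 1)) for k in range(1, m + 1)]
    if kind == "linear":
        return [max(1.0, N1 - (k - 1) * param) for k in range(1, m + 1)]
    raise ValueError(kind)

def interactivity(samples_per_iteration):
    # lambda = (1/m) * sum_k N_k * (m - k + 1): samples drawn early delay every
    # remaining iteration, so they are weighted by the number of iterations left.
    m = len(samples_per_iteration)
    return sum(n_k * (m - k + 1)
               for k, n_k in enumerate(samples_per_iteration, 1)) / m

# Example with m = 366 groups and N1 = 25000 initial samples.
for kind, param in [("geometric", 1.02), ("linear", 50), ("upfront", None)]:
    print(kind, interactivity(schedule(25000, 366, kind, param)))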
5.5.3 Error vs. Sample Complexity
Figure 9 captures the correlation between ISplit-Best and both ISplit and RSplit in terms of the choice of split groups. We run several simulations of both ISplit and RSplit with different initial samples (N1) while setting α = 1.02. Therefore, as N1 increases, the overall sample complexity of the simulation also increases. The x-axis represents the initial samples of the simulations while the y-axis represents the Spearman's coefficient (r(E, F)) of the corresponding simulation. For a fixed α, the higher the number of initial samples, the higher the overall sample complexity of the algorithm. Figure 9 confirms that as the sampling complexity increases, ISplit starts to exhibit higher correlation with ISplit-Best in choosing the split groups. Beyond a certain sampling complexity, r(E, F) starts to taper off—indicating that further increasing the sampling complexity will not improve the correlation. RSplit, on the other hand, is completely uncorrelated with ISplit-Best. For small sampling complexity (N1 = 500, the first green circle in the plots), even ISplit does not exhibit any correlation with ISplit-Best. Due to insufficient sampling, the choices of split groups are so erroneous that it seems as if ISplit is choosing split groups randomly rather than intelligently. For our choice of initial samples N1 = 25000 (the circle highlighted in red), ISplit exhibits high correlation (r(E, F) > 0.78) with ISplit-Best for all three datasets.
Figure 9: Spearman's Ranked Correlation Coefficient with varying sample complexity. E = {ISplit, RSplit} and F = ISplit-Best (panels: FLA, FLD, T11, T12, T13, WG).
Figure 10: Simulation of the error-interactivity trade-off curve (panels: FL, T11, WG).
6. EVALUATION: INTERPRETABILITY
We now present an evaluation of the usability of INCVISAGE, and more broadly evaluate how users interpret and use incrementally improving visualizations. We aim to address the following questions: 1) Are users willing to use approximate visualizations if it saves them time? 2) How confident are users when interpreting visualizations? 3) Of the two types of visualizations, do users prefer the heatmap or the trendline visualization?
6.1 Study Design and Participants
The study consisted of five phases: (a) an introduction phase that explained the essential components of INCVISAGE, (b) an exploration phase that allowed the participants to familiarize themselves with the interface by exploring a sensor dataset (see Section 5.1), (c) a quiz phase where the participants used the same interface to answer targeted questions on the flight dataset, (d) an interview to collect qualitative data during the quiz phase, and (e) a closing survey to obtain feedback on INCVISAGE. We describe the quiz phase in Section 6.2. The interview and survey phases are presented in Section 6.3. All of the studies were conducted in the same lab
setting with the same two researchers. The first researcher led the introduction and exploration phases; the second led the quiz, interview, and survey phases. Our study was conducted prior to the development of ISplit, and was meant to assess the utility of incremental visualizations—nevertheless, we ensured that the interactivity criterion of 500ms per iteration was met [38].
18 . We recruited 20 participants (11 male, 9 female) via flyers across mit an approximate answer early to gain more points. On the other a university and via a university email newsletter. Our participants hand, for the multiple choice questions, a correct answer submitted included 11 graduate students (8 in computer science), one business at later iterations would yield lower scores. undergraduate student, two researchers, and six university employ- Interface Issues. Analyzing the quiz results proved more diffi- ees. All of the participants had experience with data analysis tools cult than expected. Out of 280 total submissions, 10 submissions such as R, Matlab, and/or Excel. The median age of the participants had interface issues—4 of those occurred due to ambiguity in the was 28 (µ = 32.06; σ = 11.9). Out of the 20 participants, 7 re- questions (R1, R4, R7 in Table 3), while others were due to mis- ported that they analyzed or worked with data “Daily”, 7 answered takes made by the participants in selecting the visualization to be “Monthly”, while the remaining participants answered “Weekly”. generated.For example, for one question (R1), dealing with depar- The duration of the sessions ranged from approximately 60 to 90 ture delay two participants selected the “departure airport” attribute minutes. Each participant received $10 per hour at the end of their as opposed to “origin airport”. A similar issue arose with ques- session. tions R4 and R7. The ambiguity arose from attribute names in the original dataset—to maintain consistency, instead of renaming 6.2 The Quiz these attributes on-the-fly to fix the ambiguity, we did not make any We now explain the design of and findings from the quiz phase. changes. We explicitly separate out interface issues when analyz- ing the results. 6.2.1 The Quiz Phase Design The purpose of the quiz phase was to evaluate whether partici- 6.2.2 Quantitative Analysis pants were willing to compromise accuracy to make rapid decisions In discussing the participants’ quiz performance, we first investi- when posed various types of questions. One way to capture such gate their answer submission trends. Finally, we report the progress behavior is via a point based system where early submissions are of each participant. As the participants interacted with the tool, we rewarded while late submissions are penalized. With this incentive, recorded their responses to each question, the iterations at which participants would be encouraged to submit their answers quickly. a participant submitted an answer, the time taken to submit an an- We describe the scoring function we used in our point based system swer after starting the visualization, and the points obtained based later in this section. We first categorize the questions used during on the scoring function. the quiz phase. Summary: The majority of the submissions for both trendlines (75%) We used two types of quiz questions: extrema-based (E1-7), and and heatmaps (77.5%) were within the first 25% of the total number of range-based (R1-7). These questions are listed in Table 3. The iterations. Even though the participants chose to trade accuracy for time extrema-based questions asked a participant to find the highest or in both cases, they did so with reasonably high accuracy. lowest values in a visualization. The range-based questions asked Trading accuracy for time. First, we analyze how the participants a participant to estimate the average value over a time range (e.g., traded accuracy for time. 
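A compact restatement of the scoring rule from Section 6.2.1 is given below; the 100-point scaling and the clamping of negative free-form accuracies to zero are assumptions for illustration, since the text only specifies S = P · A with a 100-point maximum.

def quiz_score(k, m, answer, correct, multiple_choice):
    # S = P * A, scaled here to a 100-point maximum (the scaling is an assumption).
    P = (m - k) / m                       # earlier submissions earn a larger P
    if multiple_choice:
        A = 1.0 if answer == correct else 0.0
    else:
        # free-form accuracy: 1 - |u - c| / |c|, clamped at 0 (assumption)
        A = max(0.0, 1.0 - abs(answer - correct) / abs(correct))
    return 100 * P * A

# A free-form answer of 27.5 against a true value of 25, submitted at iteration
# 40 of 366, earns 100 * (326/366) * 0.9, i.e., about 80 points.
print(quiz_score(40, 366, 27.5, 25, multiple_choice=False))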
Figure 11 shows a dot graph analysis months, days of the week). The purpose of such a categorization of the participants’ submission trends as a function of iteration. is to evaluate the accuracy and confidence of participants in finding For both the trendline and heatmap visualizations, we separated both specific points of interest (extrema) and patterns over a range the statistics for the correct and incorrect submissions (Figure 11a (range) when given I NC V ISAGE. The extrema based questions and 11b). Correct submissions are represented by green circles. In- were free form questions; the range based questions were multi- correct submissions are either represented by blue circles (interface ple choice questions. Two of the range based questions were for- issue) or red circles. The x-axis represents at what fraction of the matted as free form questions but, operationally, were multiple- total number of iterations an answer was submitted. choice questions with seven possible options (e.g., the day of the For trendlines, the majority of the submissions (75%) were made week). To prevent order effects, ten participants started the quiz at around 25% of the total iterations, except for question E4 (Fig- with heatmaps; the other ten started with trendlines. Addition- ure 11a). Question E4 asks for a day of the year that shows the ally, we randomized the order of the questions for each participant. highest departure delay across all years. During the study, we dis- Next, we define the scoring function used in assessing quiz perfor- covered that the departure delays for December 22 and December mances. 23 were very similar. Due to the proximity of the values, these The Scoring Function. The scoring function relied on two vari- dates were not split into separate groups until a later iteration. One ables: the iteration number at which the participant submitted an participant even waited until iteration 237 (out of 366 iterations) answer—the higher the iteration the lower the score, and whether to submit their answer. Figure 11b shows the trends for heatmaps. or not that answer was accurate—the higher the accuracy the higher Similar to trendlines, the majority of the participants (77.5%) chose the score. The participants were informed prior to the quiz phase to submit answers earlier (within 25% of the iterations) except for that the score was based on these two variables, but the exact func- questions R5 and R7, where once again there were two days of tion was not provided. The maximum attainable score for each the week with very similar delays, leading to the relevant heatmap answer was 100 points.The score was computed as a product S = block being split in a later iteration. P · A, where P was based on the iteration, and A on the accu- The submission trends indicate that participants opted for a higher racy. If a participant submitted an answer at iteration k, we set reward and submitted their answers as early as possible, when the P = m−k m , i.e., the fraction of the remaining number of iterations visualizations became stable or when obvious extrema emerged, over the total number of iterations, m. Thus, based on the defini- trading accuracy for time. This observation is confirmed in Sec- tion of P , any answer submission at the last iteration receives zero tion 6.3. Figure 12 plots the accuracy of all the submissions ac- points, irrespective of question type. To compute A, let c be the cording to the scoring functions described in Section 6.2.1. 
The correct answer to a question, and let u be the answer submitted by x-axis represents at what fraction of the total number of iterations the participant. The accuracy A of a multiple choice question is 0 an answer was submitted; accuracy appears on the y-axis. For both if u = c and 1 otherwise. The accuracy A of a free-form ques- the trendline (Figure 12a) and heatmap (Figure 12b) visualizations, tion is 1 − |u−c| |c| , measuring how far the answer is from the correct the accuracy of the majority of the submissions is reasonably high. one. For the free form questions, submitting an incorrect answer Submission trends. Here we show at what fraction of the itera- that is close to the actual answer could result in a high score. Since tions (% iteration) the participants typically opted to submit their the free form questions were range based, participants could sub- answers. We also analyze the submission trends for the different
19 . Type Extrema-based questions Range-based questions E1. In the state of NY (destination), which day of the year R1. Choose the correct option based on the day of the year. Anyone traveling from suffers most from the delay caused by the LGA airport (origin) during has to suffer from the highest National Aviation System (NAS delay)? Departure Delay. A) Early January B) Summer C) Late December Trendline R2. Choose the correct option based on the day of the year. UA (carrier) aircrafts E2. Find the busiest day of the year in the ORD (origin) have the worst Aircraft Delay statistics on: airport. The higher the Taxi Out time, the busier the airport. A) Jan 01- Jan 10 B) Jun 10- Jun 20 C) Dec 01- Dec 10 E3. Find the Month that has the day (of month) with the R3. Choose the correct option. Overall, the Arrival Delay is the worst in: shortest Arrival Delay. A) First half of January B) Mid July C) Last half of December E4. Which Day of the year has the highest Departure Delay? E5. For the Week of Year Vs.Year heatmap, which week R4. Choose the correct option. Which of the following months has the most of the year had the highest Departure Delay? Days of Month with a low Arrival Delay? A) Feb B) Jul C) Oct Heatmap E6. Find the (day of month, month) pair that had the R5. Over the course of months, which Day of Week exhibits a higher Carrier Delay highest Aircraft Delay in DEN (destination) airport. for AA (Carrier)? R6. Choose the correct option. For the ATL (destination) airport, which of the E7. Find the (day of month, month) pair in the year 2013, following months exhibited a higher Arrival Delay for the majority of the years? that had the highest Weather Delay. A) Jan B) Jul C) Nov R7. Over the Years, which Day of Week had the highest Arrival Delay in the city of Atlanta- GA (departure city)? Table 3: Categorization of the user study questions. Correct Incorrect Interface (Figure 13b); the extrema-based questions were submitted earlier E3-I E7-I compared to the range-based questions. Comparison of the sub- E3-C E7-C mission trends of the extrema-based questions for trendlines and E4-I E6-I heatmaps shows that heatmap submissions occurred much earlier. E4-C E6-C This may be due to the fact that the color dimension of the heatmap R3-I R7-I helped users compare contrasting blocks, while the continuous gen- R3-C R7-C eration of peaks and valleys in the trendline led to abrupt changes Questions Questions E2-I R6-I in the visualization that were difficult to interpret. Hence, partici- E2-C R6-C pants waited a bit longer for the trendline visualization to stabilize, R2-I R5-I whereas for the heatmap, the changes in color were not as abrupt R2-C R5-C and the participants were more confident when they spotted an ex- R1-I R4-I tremum. R1-C R4-C 100 100 E1-I E5-I E1-C E5-C 80 80 0 10 20 30 40 50 60 70 80 0 20 40 60 80 100 % Iterations % Iterations % iteration % iteration a. Trendline b. Heatmap 60 60 Figure 11: Per-question statistics for the iterations at which 40 40 participants submitted answers for trendlines (l) and heatmaps (r). 20 20 Range Based Extrema Based 0 0 R-C R-I E-C E-I R-C R-I E-C E-I 1.0 1.0 Questions Questions a. Trendline b. Heatmap 0.8 0.8 Figure 13: Per category statistics of iterations at which par- Accuracy Accuracy 0.6 0.6 ticipants submitted answers for (a) trendline and (b) heatmap. 0.4 0.4 Extrema = E, Range = R, Correct = C, Incorrect = I. 0.2 0.2 Analyzing the Participants. In this section, we analyze the 0.0 0.0 progress of each participant individually. 
Figure 14 shows a pivot −10 0 10 20 30 40 50 60 70 −20 0 20 40 60 80 100 120 table with graphical marks that depict the progress of the partic- %Iteration %Iteration ipants during the quiz phase. Each row in the table corresponds a. Trendline b. Heatmap to one participant. Each cell in a row, with the exception of the Figure 12: Accuracy vs Submission (% iteration) statistics for last two, corresponds to the questions that participant answered. (a) trendline and (b) heatmap visualizations. Although all participants answered the same set of questions, the order varied due to randomization. The last two columns show question types in each visualization. Figure 13 presents the box and the average points obtained by the participant and the average of whisker plot of the answer submission trends. For trendlines (Fig- the percentage of iterations at the point the participant to submit- ure 13a), the range-based questions were submitted earlier (75% of ted their answer. Both quantities are represented by circles. The the submissions at % iteration ≈ 15%) compared to the extrema- higher the number of points, the larger the radius of the circle – based questions (75% of the submissions at %iteration ≈ 28%). while the lower the iteration percentage, the larger the radius of the This difference in submission trends across types may be due to circle. For ease of analysis, we divide both the points and the it- the fact that the range based questions asked participants to find eration percentages into five ranges. The point ranges are: ≤ 55, more coarse grained information (e.g., the delay statistics in a spe- 56-65,66-75,76-85 and > 85 while the iteration percentage ranges cific range) which may be easier than finding peaks and valleys. are: ≤ 10%, 11-15%, 16-20%, 21-25% and >25%. All points Also the range-based questions were multiple choice—the hints falling into the same range are represented by a circle with the same provided by the multiple choice options may have provided guid- radius. Similarly, all the iteration percentages falling in the same ance to the participants. We see the opposite trend for heatmaps ranges have the same radius. The participants whose last two cells
contain larger circles performed better during the study.

[Figure 14: Analyzing the participants. (Legend: Correct, Incorrect, Interface, Avg. Point, Avg. Iteration; y-axis: Participant No.)]

The first 12 rows are participants with a computer science (CS) background, while the remaining participants did not have a CS background. Aside from observing that many participants answered their final three to four questions correctly, participants exhibited no patterns during the quiz. The median average point and average iteration percentage were 70.73 and 14.77%, respectively. Only one participant, P18, answered all of the questions correctly, with an average iteration percentage of 13.37% and the highest point average (86.62). P6 submitted their answers faster than all other participants, with a 7.06% iteration percentage, while attaining the second highest point average (84.73). P16, who suffered most from interface-related issues, obtained the lowest point average (38.97).

6.3 Interview and Survey Phase
We now present a qualitative analysis of the participants' perceptions based on their interview and survey responses.

Summary: Participants were highly confident (confidence = 8.5 out of 10) when they submitted answers for both visualization types. Some participants, however, suggested providing explicit guarantees to further boost their confidence. Both trendline and heatmap visualizations were highly interpretable.

Interview. We conducted semi-structured interviews to gauge our participants' motivations for stopping the visualization at a certain iteration, and their confidence in their answers. The main motivations for terminating a visualization were the emergence of obvious extrema (N = 5), gaining sufficient confidence in an answer (N = 10), or both (N = 5). When asked to rate their confidence at the time of submission on a scale of 1 to 10 (10 being the most confident), most participants rated their confidence very high (µ = 8.5 and σ = 1.03 out of 10). However, some participants (N = 4) indicated that they would have continued until the final iteration if they were freely exploring the data (and not in an assessment setting). If they were pressed for time or the dataset was very large, they would choose to stop before the final visualization iteration, but likely at a later iteration than the one they stopped at in our quiz phase. One of those participants (P8) mentioned that providing an explicit guarantee along with the visualizations would further increase their confidence when making decisions.

Survey. The survey consisted of seven Likert scale questions to measure the interpretability of the visualizations and the usability of IncVisage, along with three free-form questions to collect the participants' feedback on IncVisage. The heatmap and trendline visualizations received similar average ratings (out of 5) for interpretability (heatmap: µ = 4.45, σ = 0.51; trendline: µ = 4.50, σ = 0.83) and satisfaction levels (heatmap: µ = 4.55, σ = 0.60; trendline: µ = 4.50, σ = 0.83). This was not because all participants liked the heatmap and the trendline equally, but rather because the number of participants who preferred each visualization was evenly divided: seven participants preferred the heatmap, six preferred the trendline, and seven rated both visualizations equally. While some participants found each visualization useful for different purposes, some were more enthusiastic about their preference for one type of visualization over the other. Two participants even indicated that their confidence level varied based on the type of visualization; both rated their confidence for heatmap answers higher than for trendline answers. The advantages and disadvantages of each visualization became more evident through the interview and the survey. For example, the heatmap was useful in finding global patterns quickly, but opinions were split on interpreting values via hue and brightness. In general, participants found the trendline visualization familiar and could observe the extrema more easily. Next, we present the participants' preferences for the trendline and heatmap visualizations.

6.3.1 Research Q3: Heatmap vs. Trendline
Participants who preferred the heatmap over the trendline visualization appreciated the extra color dimension of heatmaps. Participants generally found the colorful visualization aesthetically pleasing. P17 commented that the heatmap is "really interesting to watch over time. Especially, at first when things were one or two blocks of color and then to sort of watch the data emerge and to then watch the different boxes become something . . . I actually caught myself watching for a long time." However, the ease of interpreting the colors was debatable; some participants (N=7) stated that colors helped them distinguish the values of the competing groups, while others (N=4) found comparing and averaging colors burdensome and not intuitive. It was especially difficult to perceive color differences when the difference between values was small or when the compared blocks were distant on the screen.

Another emerging theme centered on the emergence of easily noticeable color patterns in the early iterations. One participant (P12) commented that the heatmap isolated the interesting patterns faster than the trendline. Although the quick emergence of color patterns is advantageous in making faster decisions, one participant (P8) accurately pointed out the danger of making a decision too soon, as it could lead to confirmation bias if the later iterations diverge from the initial color pattern.

The familiarity of the trendline visualization attracted participants with its intuitively interpretable values and easily noticeable changes between consecutive values. Participants also found differentiating high and low points that were distant on a line graph much easier than comparing the different shades of a color in the heatmap visualization. However, the numerous peaks and valleys disturbed some participants, as the focal point of the visualization became uncertain. Hovering over a specific point was also harder on the trendline, since the selection was made solely based on the x-coordinate of the mouse, and even a small perturbation to the right would result in a different selection and value.

All but one participant, P16, believed both the heatmap and trendline visualizations were easily interpretable. In the survey, the average usability of IncVisage was rated 4.25 out of 5 (σ = 0.64). Participants also noted an easy learning curve for IncVisage; all of them felt comfortable with the tool after an hour of use.

6.4 Limitations and Future Work
We identified some limitations and possible future directions. Our approach to approximate visualizations relies heavily on the smoothness of the data; noise and outliers in the dataset impede generating a useful approximation quickly. As highlighted in Section 7.2, when the values of the point of interest and its neighbor(s) are very close, IncVisage might choose to reveal that point only at later iterations. As a result, any user looking for quick insights may select an incorrect, sub-optimal point. IncVisage currently does not offer any explicit guarantee of an answer, which was pointed out as a limitation of the tool by one of the participants (P8). In the next version, we could incorporate a measure of confidence in the tool by noting the variation of values from one iteration to the next. As the groups converge towards their actual values, the segment approximations begin to stabilize, reducing the variation between successive segment approximations.
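As a rough illustration of the stability-based confidence measure proposed above, the following minimal sketch (in Python; not part of IncVisage, and the threshold and data layout are assumptions made for the example) summarizes the variation between two successive segment approximations as a single number that a front end might surface alongside the visualization.

```python
# Minimal sketch (not the tool's actual design): quantify how much the
# displayed k-segment approximation changed relative to the previous
# iteration, as a simple stability signal.

def flatten(segments, m):
    """Expand [(start, end, value), ...] (1-indexed, inclusive) into m per-group values."""
    values = [0.0] * m
    for start, end, value in segments:
        for i in range(start - 1, end):
            values[i] = value
    return values

def iteration_change(prev_segments, curr_segments, m):
    """Mean absolute per-group change between two successive segment approximations."""
    prev_vals = flatten(prev_segments, m)
    curr_vals = flatten(curr_segments, m)
    return sum(abs(p - c) for p, c in zip(prev_vals, curr_vals)) / m

if __name__ == "__main__":
    m = 6
    approx_k2 = [(1, 3, 0.8), (4, 6, 0.4)]                   # 2-segment approximation
    approx_k3 = [(1, 3, 0.82), (4, 5, 0.35), (6, 6, 0.55)]   # refined 3-segment approximation
    delta = iteration_change(approx_k2, approx_k3, m)
    print(f"mean per-group change: {delta:.3f}")
    print("stable enough to act on" if delta < 0.05 else "still fluctuating")
```

A front end could, for instance, flag the visualization as "stabilizing" once this quantity stays below a small threshold for a few consecutive iterations.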
Table 4: Overall Accuracy and Submission Time Statistics Per Question Category

             Trendline                                         Heatmap
             Extrema Based            Range Based              Extrema Based            Range Based
Approach     Accuracy   Time (sec)    Accuracy   Time (sec)    Accuracy   Time (sec)    Accuracy   Time (sec)
IncVisage    94.55%     25.0638       62.50%     22.0822       83.47%     31.6012       97.50%     34.7992
OLA          45.83%     26.4022       52.50%     27.3125       79.13%     31.3846       95%        25.4782

We discussed the interface issues in Section 6.2. The limitations of the user study fall into three categories: the ambiguity of three quiz questions, interface control, and participant demographics. The ambiguity of three questions (R1, R4, and R7) in the quiz phase led to unintended interface issues (4 issues out of 280 submissions). The result is that participants answered these questions incorrectly due to incorrect queries rather than because of incorrect interpretation of the visualizations. This highlights the limitations of IncVisage with prepared attributes, i.e., if someone downloads a dataset as is and tries to use it with our system, similar ambiguities may occur. From the interface perspective, two participants (P4 and P8) suggested adding axes to the snapshots, which would help them compare values across iterations and, in turn, ensure that the approximation is approaching the actual values. Participants also desired more control as they explored the dataset. One participant (P8) suggested including a command-line interface to allow for more specific queries, while others suggested adding more options, and even different visualization styles. Other interface suggestions that arose included the ability to zoom in and out and to select a specific area from which to sample more. Participants also offered archival suggestions. Two participants (P10 and P15) suggested adding an option to download the final visualization, snapshots, and the data summary. This archival feature would help users explore larger datasets over a longer period of time. Finally, our participant pool demographics do not match the demographics of the general audience intended for this tool. Future studies will reach a larger, more diverse audience.

7. EVALUATION: DECISION MAKING
Previously, we evaluated the interpretability of IncVisage, compared trendlines to heatmaps, and qualitatively studied how users felt about the inherent uncertainty. We now compare IncVisage with traditional online-aggregation-like [20] approaches (OLA) that depict approximations of visualizations as samples are taken (as in the first and third row of Figure 1). Specifically, does IncVisage lead to faster and/or more accurate decision making?

7.1 Study Design and Participants
Our study design was similar to our previous study, with four phases: an introduction, a quiz phase, an interview for qualitative data, and finally a closing survey. We introduced the IncVisage approach as the "grouped value" approach, and the OLA approach as the "single value" approach.

We recruited 20 participants (11 male, 9 female) via a university email newsletter. Our participants included 12 graduate students (4 in computer science), 5 undergraduate students, one researcher, and two university employees. All of the participants had experience with data analysis tools. The median age of the participants was 25 (µ = 26.06; σ = 6.84). Out of the 20 participants, 2 reported that they analyzed or worked with data "Daily", 10 answered "Monthly", and 5 "Weekly", while the remaining participants worked with data rarely. The duration of the sessions ranged from approximately 60 to 75 minutes. Each participant received $10 per hour at the end of their session. All of the studies were conducted in the same lab setting with the same two researchers.

Quiz Phase Design. In designing the quiz phase, we used the flight dataset (details in Section 5), with 20 questions in total—10 on trendlines and 10 on heatmaps. In each case, five questions were reserved for IncVisage, and five for OLA. We used the same categorization of questions as in our first study—range-based and extrema-based. These questions are listed in Table 5. As before, we randomized the order of the tools, and the order of the questions.

The Scoring Function. As in Section 6, a score was provided to the user as they answered questions. The score was computed as a product S = P · A, where P corresponded to how quickly the user made a decision, and A to the accuracy. The formulae for A were similar to the previous study. Instead of setting P to be proportional to the number of remaining iterations, here, in order to allow the scores to be comparable between OLA and IncVisage, we set P to be (T − t)/T, where T is the time taken to compute the visualization by scanning the entire dataset, and t is the time taken by the user.

7.2 Quantitative Analysis of the Quiz Phase
In discussing the participants' quiz performance, we investigate both their accuracy (using A above) and their answer submission time for both IncVisage and OLA.

Summary: For trendlines, IncVisage exhibits a 62.52% higher accuracy than OLA across both question types, while also reducing the submission time by 10.83%. For heatmaps, IncVisage exhibits a 4.05% higher accuracy than OLA across both question types. The submission time for range-based questions with IncVisage is higher than with OLA.

Accuracy and Submission Time Statistics. Table 4 summarizes the overall accuracy and the submission times for both IncVisage and OLA. For trendlines, IncVisage outperformed OLA in terms of both accuracy and submission times. For extrema-based questions, the overall accuracy of IncVisage was almost double that of OLA. The submission times were also lower for IncVisage for both types of questions. Overall, users are able to make faster and more accurate decisions using IncVisage than OLA. There is a dip in the overall accuracy for the range-based questions for both approaches. Since the accuracy of a range-based question was either 0 or 1, any incorrect submission incurred a higher penalty, thereby reducing overall accuracy.

For heatmaps, IncVisage exhibited better accuracy than OLA—in particular, an improvement of 4.05%. For extrema-based questions, the submission times were almost equal. However, for range-based questions, submissions with IncVisage took longer than with OLA. We found that when using IncVisage with heatmaps, participants waited for the larger blocks to break up in order to compute averages over ranges, thereby increasing submission times but providing more accurate answers. As it turns out, the initial (larger) blocks in IncVisage do provide average estimates across ranges and could have been used to answer the questions quickly. Unfamiliarity with incremental heatmap visualizations, and with heatmaps in general, contributed to this delay. In hindsight, we could have educated the participants more about how to draw conclusions from the IncVisage heatmaps, and this may have reduced submission times.
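To make the scoring concrete, here is a minimal sketch of the score S = P · A with P = (T − t)/T described in Section 7.1 above. This is illustrative only, not the study's actual harness; in particular, the accuracy term A is only described as "similar to the previous study", so the 0/1 accuracy used below is an assumption for the example.

```python
# Illustrative scoring sketch: S = P * A with P = (T - t) / T, where T is the
# time to compute the exact visualization by scanning the entire dataset and
# t is the participant's submission time.

def time_factor(t_user: float, t_full_scan: float) -> float:
    """P = (T - t) / T, clamped to [0, 1] so late submissions score zero."""
    p = (t_full_scan - t_user) / t_full_scan
    return max(0.0, min(1.0, p))

def score(t_user: float, t_full_scan: float, accuracy: float) -> float:
    """S = P * A."""
    return time_factor(t_user, t_full_scan) * accuracy

if __name__ == "__main__":
    T = 120.0  # hypothetical full-scan time (seconds)
    # A correct answer (A = 1) submitted after 25 seconds:
    print(round(score(25.0, T, 1.0), 3))  # 0.792
    # The same submission time with an incorrect range-based answer (A = 0):
    print(round(score(25.0, T, 0.0), 3))  # 0.0
```

Because P is normalized by the full-scan time T rather than by an iteration count, the same formula applies to both OLA and IncVisage, which is what makes the scores comparable across the two approaches.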
Table 5: User study questions.
1. Trendline: Which day of the year enjoys the shortest Arrival Delay?
   Heatmap: Which Week of Year had the highest Departure Delay?
2. Trendline: Find the day with the highest carrier delay for US Airways.
   Heatmap: Among the following three months, which month has the highest number of Days (of Month) with a high Arrival Delay? A) FEB B) JUL C) OCT
3. Trendline: Overall, the Arrival Delay is the worst/highest in: A) JAN 01-JAN 15 B) JUL 11-JUL 25 C) DEC 16-DEC 31
   Heatmap: Which of the following Months exhibits a higher Arrival Delay for the majority of the Years? A) JAN B) JUL C) NOV
4. Trendline: Which day of the year shows a higher Departure Delay on average for the flights departing the LGA airport?
   Heatmap: Find the (Day of Month, Month) pair that had the highest Aircraft Delay.
5. Trendline: Which of the following ranges contains the day with the highest Aircraft Delay for the UA aircrafts? A) JAN 01-JAN 10 B) JUN 10-JUN 20 C) DEC 21-DEC 31
   Heatmap: Which Year contains the Day of Week with the highest Departure Delay?
6. Trendline: Which Day of the year has the highest Departure Delay?
   Heatmap: Find the (Day of Month, Month) pair in the year 2013 that had the highest Weather Delay.
7. Trendline: In the city of Atlanta-GA, which Day of the year has the highest NAS (National Aviation System) Delay?
   Heatmap: Which of the following Months exhibited a lower Departure Delay for the majority of the Years? A) FEB B) JUN C) SEP
8. Trendline: In the city of Chicago-IL, bad weather forces longer weather delays on average in the following period: A) Day 1-15 B) Day 190-205 C) Day 350-366
   Heatmap: Which (Day of Month, Month) pair has the highest Arrival Delay?
9. Trendline: Which day of the year 2010 enjoys the shortest Arrival Delay?
   Heatmap: Which Week of Year contains the Day of Week with the highest Arrival Delay?
10. Trendline: In the Year 2011, which of the following months have lower Security Delays? A) JAN B) JUL C) NOV
    Heatmap: Which of the following Years contains the month with the highest Security Delay? A) 2004 B) 2008 C) 2012 D) 2015

Per Question Statistics. Figure 15 shows the dot graph analysis of accuracy for the extrema-based questions for both visualization types. The submissions using IncVisage and OLA are highlighted by cyan and magenta circles, respectively. The x-axis represents the accuracy, while the y-axis represents the questions. For trendlines (Figure 15a), the majority (99%) of the submissions with IncVisage were in close proximity to the correct answer, whereas with OLA the accuracy drops considerably—only 55% of the submissions were in close proximity to the correct answer. For heatmaps (Figure 15b), again more submissions with IncVisage (80%) are in close proximity to the correct answer compared to OLA (52%).

[Figure 15: Accuracy statistics of extrema based questions for (a) trendline and (b) heatmap visualizations.]

Figure 16 shows the accuracy for the range-based questions for both visualization types. The y-axis represents the accuracy, while the x-axis represents the questions. The submissions using IncVisage and OLA are highlighted by cyan and magenta bars, respectively. For trendlines (Figure 16a), none of the submissions for Q5 was correct. For the rest of the questions, submissions with IncVisage had higher accuracy than OLA. For heatmaps (Figure 16b), the accuracy of IncVisage is only slightly better than that of OLA.

[Figure 16: Accuracy statistics of range based questions for (a) trendline and (b) heatmap visualizations.]

Submission Trends. Figure 17 plots the accuracy of all the submissions according to the scoring function described in Section 7.1. The x-axis represents submission time in seconds; accuracy appears on the y-axis. For both trendline (Figure 17a) and heatmap (Figure 17b) visualizations, participants opted to submit their answers as quickly as possible for both approaches, i.e., the participants chose to trade accuracy for time.

[Figure 17: Accuracy vs Submission Time statistics for (a) trendline and (b) heatmap visualizations.]

7.3 Interview and Survey Phase
In this section, we present a qualitative analysis of the participants' perceptions of both approaches based on their interview and survey responses.

Summary: Participants were reasonably confident when they submitted answers for both visualization types using IncVisage. The trendline visualization using OLA was unstable and had low interpretability, which resulted in participants submitting answers with low confidence. The majority of the participants preferred the IncVisage representations for both visualizations.

Interview. We conducted semi-structured interviews to gauge our participants' motivations for stopping the visualization at a certain iteration, and their confidence in their answers (for both IncVisage and OLA). The main motivations for terminating a visualization were the emergence of obvious extrema (N = 10), gaining sufficient confidence in an answer (N = 6), or both (N = 4). When asked to rate their confidence at the time of submission on a scale of 1 to 10 (10 being the most confident), responses varied depending on the visualization type. For trendlines, participants were reasonably confident (µ = 6.53, σ = 1.89) when
23 .using I NC V ISAGE, but much less confident (µ = 4.44, σ = 1.27) from that of incrementally improving visualizations. Offline AQP when using OLA. For heatmaps, participants were slightly more systems [9, 8, 11, 18] operate on precomputed samples. Garo- confident when using OLA (µ = 7.15, σ = 0.86) than when using falakis et al. [18] provides a detailed survey of the work in this I NC V ISAGE (µ = 6.76, σ = 1.97). Majority of the participants space. Unlike these approaches, I NC V ISAGE deals with ad-hoc vi- (N = 3) who preferred OLA liked the fact that they were able to sualizations. see each individual data point at all times. I NC V ISAGE on the other Approximate Visualization Algorithms. We have already dis- hand, hid information early on by approximating the visualizations cussed IFOCUS [32], PFunkH [10] and ExploreSample [53] in which was less desirable to them. Among 20 participants, majority the introduction section. IFOCUS [32], PFunk-H [10], and Ex- (N = 12) preferred the I NC V ISAGE representation over the OLA ploreSample [53] are online approximate visualization algorithms. (N = 6) representation of the visualizations while two participants IFOCUS [32] generates bar charts preserving ordering guarantees equally preferred the two approaches. between bars quickly, but approximately. PFunk-H [10] uses per- When using I NC V ISAGE. participants were able to interpret the ceptual functions to provide approximate answers that differ from initial high level approximations to identify specific regions of in- the true answers by a perceptually indiscernible amount. Explore- terest and eventually found the answer. On the other hand, they Sample [53] is also an online sampling algorithm that deals with thought OLA to be unstable and difficult to interpret. One of the 2d scatterplots coupled with heatmaps; approximating the visual- participants (P14) said the following—“For I NC V ISAGE, it was ization while preserving the outliers and the overall distribution. easier to know when I wanted to stop because I had the overall idea Unlike these one-shot approaches, we generate visualizations that first. And then I was just waiting to get the precise answer because incrementally improve over time. Concurrent work has shown that I knew it was coming. So it was the difference. OLA, it was a shot analysts are willing to use approximate visualizations in real data in the dark where I see a little bit where it is, I would wait to see exploration scenarios [39]. This work introduces an optimistic vi- if it stays as the answer”. Another participant (P15) also expressed sualization system that allows users to explore approximate visu- similar observations—“With single values, there was just so much alizations and verify the results of any visualization they feel un- going on I was like ‘OK, where am I trying to focus on. What certain about at a later time. The visualization to be verified is is either the highest or the lowest?’ versus the grouped values, it computed in the background while user continues to explore the started out really simple and then became more complex to be able data. I NC V ISAGE’s approach can be combined with this approach, to show the differences”. The same participant also preferred the since verification of decisions made using approximate visualiza- aesthetics of I NC V ISAGE—“I preferred the grouped (one), because tions may be valuable (and thus the two directions provide orthog- it made it easier to kind of narrow down at least the range. 
So if onal and complementary benefits). you didn’t know the exact date, you could at least be close. Versus Incremental Visualization. We have already discussed Online with the single value, it could, there could be two values that look aggregation [20] and sampleAction [17] in the introduction sec- similar and if you picked the wrong one, you were very wrong, tion. Online aggregation [20] introduced the idea of incremen- potentially.” tal data analysis, while sampleAction [17] uses online aggrega- Survey. The survey consisted of four Likert scale questions for tion for displaying incremental visualizations. Jermaine et al. [27, each visualization type to measure the interpretability and the us- 28] further explored incremental database queries. Online aggre- ability of the competing approaches: I NC V ISAGE and OLA. Also, gation places the onus on the users to regulate the sampling of the for each visualization type, there were two free-form questions ask- groups—instead I NC V ISAGE automates the sampling process, and ing the participants to list the positive and negative aspects of both produces smooth refinements across iterations. Recent user stud- I NC V ISAGE and OLA. For trendlines, I NC V ISAGE received higher ies by Zgraggen et al. [54] demonstrate that an OLA-style system average ratings (out of 5) for interpretability (I NC V ISAGE: µ = outperforms one-shot computation of the visualization in terms of 3.93; σ = 0.92; OLA: µ = 2.67; σ = 1.29) and satisfaction levels number of insights gained—they do not contribute any new algo- (I NC V ISAGE: µ = 3.93; σ = 0.88; OLA: µ = 2.67; σ = 0.82). On rithms, however. In our work, we additionally demonstrate that the other hand, for heatmaps, OLA received slightly better average I NC V ISAGE reduces the number of user mistakes made in decision ratings for interpretability (I NC V ISAGE: µ = 3.73; σ = 0.80; OLA: making compared to OLA. µ = 4.00; σ = 0.76) and satisfaction levels (I NC V ISAGE: µ = 3.60; Bertolotto et al. [13] uses a multiresolution spatial maps con- σ = 0.99; OLA: µ = 3.87; σ = 0.64). struction technique [41] to compute representations of spatial maps Limitations and Future Work. Similar to our first study, users at various resolutions offline to reduce time taken for storage and again mentioned uncertainty as an issue when submitting an an- therefore data transfer at each resolution, while preserving geo- swer. For the heatmap representation of I NC V ISAGE the confi- graphical features of interest including the size and interaction of dence of the users could have further boosted if we had a mea- points, lines and shapes—this is therefore a offline data compres- sure of confidence for the visualizations represented which in turn sion approach for a fixed spatial map. I NC V ISAGE, on the other could have further improved the submission times. This could be hand, uses sampling to compute the k-increments of visualizations an interesting direction for future work. One possible approach is online and can support ad-hoc queries. to use both I NC V ISAGE and OLA representations side by side so Visualization Tools. In recent years, several interactive visualiza- that users can gain further confidence by seeing individual blocks tion tools have been introduced [44, 48, 31, 46, 42, 43]. The algo- alongside the larger blocks. rithms provided in this paper can be incorporated in these tools so that users can quickly identify key features of data. 8. RELATED WORK Scalable Visualizations. 
A number of recent tools support scalable In this section, we review papers from multiple research areas visualization generation [37, 30, 35] by precomputing and stor- and explain how they relate to I NC V ISAGE. ing aggregates—this can be prohibitive on datasets with many at- Approximate Query Processing (AQP). AQP schemes can op- tributes. M4 [29] uses rasterization parameters to reduce the dimen- erate online, i.e., select samples on the fly, and offline, i.e, select sionality of a trendline at query level—selects the groups that cor- samples prior to queries being issued. Among online schemes, cer- rectly represent the distributions. I NC V ISAGE on the other hand, tain approaches respect a predefined accuracy constraint for com- reveals features of visualizations in the order of prominence for ar- puting certain fixed aggregates without indices [22, 23], and with bitrary queries. indexes [19, 36]. The objectives and techniques are quite different Approximation of Distributions. Approximating a data distribu-
24 .tion by histograms has also been studied previously [7, 24, 26]. [20] J. M. Hellerstein et al. Online aggregation. SIGMOD Rec., 26(2), 1997. These methods do not sample iteratively from groups—they op- [21] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963. erate in a one-shot manner, and focus only on COUNT queries. [22] W.-C. Hou et al. Statistical estimators for relational algebra expressions. In Donjerkovic et al. [16] maintains histograms over evolving data, PODS, pages 276–287. ACM, 1988. once again for COUNT queries. [23] W.-C. Hou et al. Processing aggregate relational queries with hard time constraints. In SIGMOD Rec., volume 18, pages 68–77. ACM, 1989. 9. CONCLUSIONS [24] P. Indyk et al. Approximating and testing k-histogram distributions in sub-linear We introduced the notion of incrementally improving visualiza- time. In PODS, pages 15–22. ACM, 2012. tions and demonstrated that our incremental visualization tool, I N - [25] F. Jackson. The basic gamma-function and the elliptic functions. Proceedings of the Royal Society of London., 76(508):127–144, 1905. C V ISAGE , helps users gain insights and make decisions quickly. [26] H. V. Jagadish et al. Optimal histograms with quality guarantees. In VLDB, On very large datasets, I NC V ISAGE is able to achieve a 46× speedup volume 98, pages 24–27, 1998. relative to SCAN in revealing the first 10 salient features of a visu- [27] C. Jermaine et al. The sort-merge-shrink join. ACM Transactions on Database Systems (TODS), 31(4):1382–1416, 2006. alization with suitable error guarantees that are comparable to a [28] S. Joshi et al. Materialized sample views for database approximation. IEEE dynamic programming approach, but without a high computational TKDE, 20(3):337–351, 2008. overhead. Our user studies demonstrate that users chose to trade [29] U. Jugel et al. M4: a visualization-oriented time series data aggregation. VLDB accuracy for time to make rapid decisions, that too at higher accu- Endow., 7(10):797–808, 2014. racy than traditional approaches. There are a number of interesting [30] S. Kandel et al. Profiler: Integrated statistical analysis and visualization for data quality assessment. In AVI, pages 547–554. ACM, 2012. future directions, such as modeling and displaying the degree of [31] A. Key et al. Vizdeck: self-organizing dashboards for visual analytics. In uncertainty, along with a wider range of operations (e.g. pausing at SIGMOD Conf., pages 681–684. ACM, 2012. segment level or group level), and alternative views (e.g., overlay- [32] A. Kim et al. Rapid sampling for visualizations with ordering guarantees. ing incremental visualizations and traditional approaches). Finally, VLDB, 8(5):521–532, 2015. [33] A. Kim et al. Speedy browsing and sampling with needletail. Technical report, gaining a better understanding of the sorts of decisions for which 2016. https://arxiv.org/abs/1611.04705. one-shot approaches and incremental visualization approaches are [34] N. Koudas. Space efficient bitmap indexing. In CIKM, pages 194–201, 2000. appropriate is a promising future direction. [35] L. Lins et al. Nanocubes for real-time exploration of spatiotemporal datasets. IEEE TVCG, 19(12):2456–2465, 2013. [36] R. J. Lipton et al. Efficient sampling strategies for relational database 10. [1] IntelREFERENCES sensor dataset. http://db.csail.mit.edu/labdata/labdata.html. operations. Theoretical Computer Science, 116(1):195–226, 1993. 
[2] Microsoft’s power bi hits 5m subscribers, adds deeper excel integration. [37] Z. Liu et al. immens: Real-time visual querying of big data. In Computer http://www.pcworld.com/article/3047083/. Accessed: 05-22-2016. Graphics Forum, volume 32, pages 421–430. Wiley Online Library, 2013. [3] Nyc taxi dataset. http://publish.illinois.edu/dbwork/open-data/. [38] Z. Liu et al. The effects of interactive latency on exploratory visual analysis. [4] Tableau q2 earnings: Impressive growth in customer base and revenues. IEEE TVCG, 20(12):2122–2131, 2014. http://www.forbes.com/sites/greatspeculations/2015/07/31/tableau-q2-earnings- [39] D. Moritz et al. Trust, but verify: Optimistic visualizations of approximate impressive-growth-in-customer-base-and-revenues. queries for exploring big data. In CHI, 2017. [5] Us flight dataset. http://stat-computing.org/dataexpo/2009/the-data.html. [40] S. G. Perlman. System and method for rendering graphics and video on a [6] Wunderground weather dataset. https://www.wunderground.com/. display, June 26 2007. US Patent 7,236,204. [7] J. Acharya et al. Fast and near-optimal algorithms for approximating [41] E. Puppo et al. Towards a formal model for multiresolution spatial maps. In distributions by histograms. In PODS, pages 249–263. ACM, 2015. Advances in Spatial Databases, pages 152–169. Springer, 1995. [8] S. Acharya et al. The aqua approximate query answering system. In SIGMOD [42] T. Siddiqui et al. Effortless data exploration with zenvisage: an expressive and Rec., volume 28, pages 574–576. ACM, 1999. interactive visual analytics system. VLDB Endowment, 10(4):457–468, 2016. [9] S. Agarwal et al. Blinkdb: queries with bounded errors and bounded response [43] T. Siddiqui et al. Fast-forwarding to desired visualizations with zenvisage. times on very large data. In EuroSys, pages 29–42. ACM, 2013. CIDR, 2017. [10] D. Alabi et al. Pfunk-h: Approximate query processing using perceptual [44] C. Stolte et al. Polaris: A system for query, analysis, and visualization of models. In HILDA Workshop, pages 10:1–10:6. ACM, 2016. multidimensional relational databases. IEEE TVCG, 8(1):52–65, 2002. [11] B. Babcock et al. Dynamic sample selection for approximate query processing. [45] K. Stromberg. Probability for analysts. CRC Press, 1994. In SIGMOD Conf., pages 539–550. ACM, 2003. [46] M. Vartak et al. Seedb: efficient data-driven visualization recommendations to [12] Z. Bar-Yossef. The Complexity of Massive Data Set Computations. PhD thesis, support visual analytics. VLDB Endow., 8(13):2182–2193, 2015. Berkeley, CA, USA, 2002. AAI3183783. [47] Weisstein, Eric W, Wolfram Research, Inc. Euler-mascheroni constant. 2002. [13] M. Bertolotto et al. Progressive vector transmission. In Proceedings of the 7th [48] R. Wesley et al. An analytic data engine for visualization in tableau. In ACM international symposium on Advances in geographic information systems, SIGMOD Conf., pages 1185–1194. ACM, 2011. pages 152–157. ACM, 1999. [49] M. B. Wilk et al. Probability plotting methods for the analysis for the analysis [14] L. Cam and G. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer of data. Biometrika, 55(1):1–17, 1968. Series in Statistics. Springer New York, 2000. [50] A. P. Witkin. Scale-space filtering: A new approach to multi-scale description. [15] M. Correll et al. Error bars considered harmful: Exploring alternate encodings In ICASSP, volume 9, pages 150–153. IEEE, 1984. for mean and error. IEEE TVCG, 20(12):2142–2151, 2014. [51] K. Wu et al. 
Optimizing bitmap indices with efficient compression. ACM [16] D. Donjerkovic et al. Dynamic histograms: Capturing evolving data sets. In Transactions on Database Systems (TODS), 31(1):1–38, 2006. ICDE’00, pages 86–86. IEEE Computer Society Press; 1998, 2000. [52] K. Wu et al. Analyses of multi-level and multi-component compressed bitmap [17] D. Fisher et al. Trust me, i’m partially right: incremental visualization lets indexes. ACM Transactions on Database Systems (TODS), 35(1):2, 2010. analysts explore large datasets faster. In CHI’12, pages 1673–1682. ACM, 2012. [53] Y. Wu et al. Efficient evaluation of object-centric exploration queries for [18] M. N. Garofalakis. Approximate query processing: Taming the terabytes. In visualization. VLDB, 8(12):1752–1763, 2015. VLDB, 2001. [54] E. Zgraggen et al. How progressive visualizations affect exploratory analysis. [19] P. J. Haas et al. Selectivity and cost estimation for joins based on random IEEE TVCG, 2016. sampling. Journal of Computer and System Sciences, 52(3):550–569, 1996.
25 .APPENDIX Dm,n with means µ1,1 , . . . , µm,n where, µi,j = AVG(Y) for xi ∈ Xa and xj ∈ Xb . To generate our incrementally improving visual- A. PROBLEM FORMULATION (HEATMAPS) ization and its constituent block approximations, we draw samples In this section, we describe the concepts in the context of our from distributions D1,1 , . . . , Dm,n . When drawing samples from visualizations in row 4 of Figure 1, i.e, incrementally improving the distribution Di,j , our sampling engine retrieves a sample only heatmap visualizations. when it satisfies the condition Xa = xi ∧ Xb = xj (see Section Blocks and bk -Block Approximations. We denote the cardinality 2.1). Our goal is to approximate the mean values µ1,1 , . . . , µm,n of our group-by dimension attributes Xa as m and Xb as n, i.e., where, µi,j while taking as few samples as possible. |Xa | = m and |Xb | = n. In Figure 1, Xa = 10 and Xb = The output of a bk -block approximation Mk can be represented 11. Over the course of visualization generation, we display one alternately as a sequence of values (ν1,1 , . . . , νm,n ) such that νi,j value of AVG(Y) corresponding to each combination of groups is equal to the value corresponding to the block that encompasses xi ∈ Xa , i ∈ 1 . . . m and xj ∈ Xb , j ∈ 1 . . . n—thus, the user is xi,j . By comparing (ν1,1 , . . . , νm,n ) with µ1,1 , . . . , µm,n , we can always shown a complete heatmap visualization. We call the pair evaluate the error corresponding to a bk -block approximation, as (xi , xj ) as group combination and denote by xi,j . To approximate we describe next. our heatmaps, we use the notion of blocks that encompass one or Error. We define the 2 squared error of a bk -block approxima- more group combinations. We define a block as follows: tion Mk with output sequence (ν1,1 , . . . , νm,n ) for the distribu- Definition 4. A block β corresponds to a pair (I × J, η), where η tions D1,1 , . . . , Dm,n with means µ1,1 , . . . , µm,n as is a value, while I is an interval I ⊆ [1, m] spanning a consecu- 1 m n tive sequence of groups xi ∈ Xa and J is an interval J ⊆ [1, n] err(Mk ) = (µi,j − νi,j )2 (4) spanning a consecutive sequence of groups xj ∈ Xb . The block mn i=1 j=1 encompasses the m × n group combinations xi,j where xi ∈ I and xj ∈ J. A.1 Problem Statement For example, the block β ([2, 4] × [1, 2], 0.7) has a value of 0.7 and Now we formally define the problem for heatmaps which is sim- encompasses the group combinations x2,1 , x2,2 , x3,1 , x3,2 , x4,1 , ilar to Problem 1: x4,2 . Then, a bk -block approximation of a visualization comprises Problem 2 (Heatmap). Given a query QH , and the parameters bk disjoint blocks that span the entire range of xi , i = 1 . . . m and δ, , design an incrementally improving visualization generation al- xj , j = 1 . . . n. We explain the value of bk later. Formally, a gorithm that, at each iteration k, returns a bk -block approximation bk -block approximation is defined as follows: while taking as few samples as possible, such that with probability Definition 5. A bk −block approximation is a tuple Mk = β1 , . . ., 1−δ, the error of the bk -block approximation Mk returned at itera- βbk such that the block β1 , . . . , βk partition the interval [1, m] × tion k does not exceed the error of the best bk -block approximation [1, n] into bk disjoint sub-intervals. formed by splitting a block of Mk−1 by no more than . As an example from Figure 1, at t2 , we are displaying a 4-block ap- proximation, comprising four blocks ([1, 6]×[1, 9], 0.8), ([7, 10]× B. 
ALGORITHM FOR HEATMAPS [1, 9], 0.4),([1, 6] × [10, 11], 1.0), and ([7, 10] × [10, 11], 1.4). In this section, we build up our solution to the incrementally When the number of blocks is unspecified, we simply refer to this improving visualization generation algorithm for heatmaps, ISPlit- as a block approximation. Grid. We present the major ideas, concepts and proofs required to Incrementally Improving Visualizations. An incrementally im- explain ISPlit-Grid. proving visualization is defined to be a sequence of block approx- imations, m × n in total, (M1 , . . . , Mm×n ), where the ith item B.1 Case 1: The Ideal Scenario Mi in the sequence is a bi -block approximation, and is formed by We first consider the ideal scenario when the means of the dis- selecting one of the block in the bi−1 -block approximation Mi−1 tributions are known (see Section 3.1. In the context of heatmaps, (the preceding one in the sequence), and dividing that block into when the means of the distributions are known, the task reduces to either two or four blocks. identifying the block βi , splitting which will minimize the bk+1 - Similar to trendlines each block approximation is a refinement block approximation Mk+1 . We now dive straight into the def- of the previous, revealing new features of the visualization and is inition of the 2 -squared error of a block. The 2 squared er- formed by dividing one of the block β in the bi -block approxi- ror of a block βi (Ii × Ji , ηi ), where Ii = [p, q], Ji = [r, s] mation into either two or four new blocks to give an bi+1 -block (1 ≤ p ≤ q ≤ m and 1 ≤ r ≤ s ≤ n), approximating the approximation: we call this process splitting a block. The group distributions Dp,r , . . . , Dq,s with means µp,r , . . . , µq,s is combination within β ∈ Mi immediately following which the split q s occurs is referred to as a split group combination. Any group com- 1 2 err(βi ) = µj,j − ηi bination in the interval I × J ∈ β except for the last one can be (q − p + 1) × (q − p + 1) j=p j =r chosen as a split group combination. Since an existing block can 1 2 be split into either two or four blocks, then k ≤ bk ≤ (3k − 2) = I µj,j − ηi where (bk − k)%2 = 0. As an example, in Figure 1, the entire |βi | × |βiJ | j∈βiI j ∈βiJ fourth row corresponds to an incrementally improving visualiza- tion, where, for example, the 4-block approximation (k = 2 and For the ideal scenario, we can rewrite the expression for err(βi ) b2 = 4 where, 2 ≤ b2 ≤ 4 and (b2 − 2)%2 = 0) is gener- as follows: 1 ated by taking the block in the 1-block approximation correspond- err(βi ) = µ2j,j − µ2βi (5) ing to ([1, 10] × [1, 11], 0.5), and splitting it at group combination |βiI | × |βiJ | j∈βiI j ∈βiJ x6,9 to give ([1, 6] × [1, 9], 0.8), ([7, 10] × [1, 9], 0.4),([1, 6] × Then, using Equation 4, we can express the 2 squared error of the [10, 11], 1.0), and ([7, 10]×[10, 11],1.4). Therefore, the split group bk -block approximation Mk as follows: combination is x6,9 consisting of the pair (6,9). The reasoning be- bk hind imposing such restriction was discussed in Section 2.2. |βiI | × |βiJ | err(Mk ) = err(βi ) Underlying Data Model and Output Model. We represent the mn i=1 heatmap visualization as a sequence of m×n distributions D1,1 , . . .,
26 .Now, let’s assume Mk+1 is obtained by splitting a block βi ∈ Mk close to the best possible split. We can prove the following for into four blocks T , U , V , and W . Then, the error of Mk+1 is: heatmaps: |βiI | × |βiJ | Theorem 8. If for every boundary block T of the bk -block approxi- err(Mk+1 ) = err(Mk ) − err(βi ) mn ˜T of the mean µT that satisfies mation Mk , we obtain an estimate µ I |T ||T | J |U ||U J | I 2 2 mn + err(T ) + err(U ) ˜ T − µT ≤ µ , m mn 10|T I ||T J | |W I ||W J | |V I ||V J | † + err(V ) + err(W ) then the refinement Mk+1 of Mk that minimizes the estimated er- mn mn † ror err(Mk+1 ) will have error that exceeds the error of the best I J |β | × |βi | 2 ∗ = .err(Mk ) + i µβi refinement Mk+1 of Mk by at most mn † ∗ err(Mk+1 ) − err(Mk+1 )≤ . |T I ||T J | 2 |U I ||U J | 2 − µT − µU m mn Proof. The estimated improvement potential of the refinement Lk+1 |V I ||V J | 2 |W I ||W J | 2 satisfies − µV − µW mn mn |φ(Mk+1 ) − φ(Mk+1 )| ≤ . We use the above expression to define the notion of improvement 2 potential for heatmaps. The improvement potential of a block βi ∈ We can obtain this inequality by using the expression for φ(Mk+1 ) Mk is the minimization of the error of Mk+1 obtained by splitting and φ(Mk+1 ), and substituting the terms like µ ˜2T − µ2T with mn βi into T ,U ,V and W . Thus, the improvement potential of block 10|T I ||T J | . Together this inequality, the identity err(Mk+1 ) = βi relative to T ,U ,V and W is † err(Mk ) − φ(Mk+1 ), and the inequality φ(Mk+1 ) ≤ φ(Mk+1 ) |T I ||T J | 2 |U I ||U J | 2 imply that ∆(βi , T, U, V, W ) = µT + µU mn mn † err (Mk+1 ∗ ) − err (Mk+1 ) I J I J |V ||V | 2 |W ||W | 2 ∗ † µV + µW − = φ(Mk+1 ) − φ(Mk+1 ) mn mn I J ∗ ∗ ∗ † |βi | × |βi | 2 = φ(Mk+1 ) − φ(Mk+1 ) + φ(Mk+1 ) − φ(Mk+1 ) µβi mn ∗ ≤ φ(Mk+1 ∗ ) − φ(Mk+1 † ) + φ(Mk+1 † ) − φ(Mk+1 ) Now, The split group combination maximizing the improvement potential of βi , minimizes the error of Mk+1 . Therefore, the max- ≤ . imum improvement potential of a block is expressed as follows: ∆ (βi ) = max ∆(βi , T, U, V, W ) 2.2.2 Determining the Sample Complexity T,U,V,W ⊆βi To achieve the error guarantee for Theorem 8, we need to re- Lastly, we denote the improvement potential of a given Mk+1 by trieve a certain number of samples from each of the distributions φ(Mk+1 , βi , T, U, V, W ), where φ(Mk+1 , βi , T, U, V, W ) D1,1 , . . . , Dm,n . Similar to trendlines, we again assume that the = ∆(βi , T, U, V, W ). Therefore, the maximum improvement po- data is generated from a sub-gaussian distribution. Given the gener- tential of Mk+1 , φ (Mk+1 ) =maxβi ⊆Mk ∆ (βi ). When the means ative assumption, we can determine the number of samples required of the distributions are known, at iteration (k + 1), the optimal al- to obtain an estimate with a desired accuracy using Hoeffding’s in- gorithm simply selects the refinement corresponding to φ (Lk+1 ), equality [21] which leads us to the following Lemma: which is the block approximation with the maximum improvement Lemma 5. For a fixed δ > 0 and a bk -block approximation of the potential. distributions D1,1 , . . . , Dm,n represented by m × n independent random samples x1,1 , . . . , xm,n with sub-Gaussian parameter σ 2 2.2 Case 2: The Online-Sampling Scenario 2 and mean µi,j ∈ [0, a] if we draw C = 8002 mn aσ ln 4mnδ sam- In the online sampling scenario, the means of the distributions are unknown. 
Similar to trendlines, we describe our approach for ples uniformly from each xi,j , then with probability at least 1 − δ, selecting a refinement at a single iteration assuming samples have µT2 − µT2 ≤ 10|TmnI ||T J | for every boundary block T of Mk . already been drawn from the distributions. Then, we describe our approach for selecting samples. Proof. Fix any boundary block T contained in the block βi ∈ Lk . Then, we draw samples xi,j,1 , xi,j,,2 , . . . , xi,j,,C uniformly from 2.2.1 Selecting the Refinement Given Samples xi,j such that xi ∈ T I and xj ∈ T J , then In the online sampling scenario, we define a new notion of er- 1 C 1 µT − µT = xi,j,g − µi,j ror err and optimize for that error (see Section 3.2). Since, we C|T I ||T J | i∈T I j∈T J g=1 |T I ||T J | i∈T I j∈T J draw samples from the distributions, we can obtain the estimated 1 C improvement potential as follows: = (xi,j,g − µi,j ). I J I J C|T I ||T J | i∈T I j∈T J g=1 |T ||T | 2 |U ||U | 2 φ(Mk+1 , βi , T, U, V, W ) = µT + µU mn mn xi,j ’s are sub-Gaussian random variables with parameter σ 2 . There- I J I J |V ||V | 2 |W ||W | 2 fore, µV + µW − mn mn Pr |µ2T − µ2T | > m = Pr |µT − µT | (µT + µT ) > mn I J 10 |T | 10 |T I ||T J | |βi | × |βi | 2 µβi ≤ Pr |µT − µT | > mn mn 20 a |T I ||T J | At iteration (k + 1), all the blocks in Mk and all the segments C C mn that may appear in Mk+1 after splitting a block are called boundary = Pr | (xi,j,g − µi,j )| > 20 a i∈T I j∈T J g=1 blocks. In the following theorem, we show that if we estimate the boundary blocks accurately, then we can find a split which is very C ≤ 2 exp − 800a m n 2 2 2 ≤ δ 2 |T | σ 2 2mm
27 .By the union bound, the probability that one of the 2mn boundary Data: Xa , Y, δ, 1 Start with the 1-segment approximation: ∀i∈[1,m] D(i, 1) = err(L[1,i] , 1) segments has an inaccurate estimate is at most δ. and P (i, 1) = −1. Set L = (L1 ) 2 for k = 2, . . . , m do 2.3 The ISPlit-Grid Algorithm 3 for i = 1, . . . , m do Given the claims in the previous sections, we now present our in- 4 for j = 2, . . . , k and j ≤ i do 5 D(i, j) = crementally improving visualization generation algorithm for heatmaps argmaxp∗ ∈[1,i−1] D(p∗ , j − 1) + err(S([p∗ + 1, i], η)) ISplit-Grid. Given the parameters and δ, ISplit-Grid maintains the 6 P (i, j) = p∗ such that D(i,j) is minimized same guarantee of error ( ) in generating the segment approxima- end tions in each iteration. Theorem 8 and Lemma 5 suffice to show that end 7 Recursively construct Lk by traversing P starting from P (m, k) the ISplit-Grid algorithm is a greedy approximator, that, at each it- 8 L = L ∪ Lk eration identifies a segment approximation that is at least close to end the best segment approximation for that iteration. Algorithm 3: DPSplit Data: Xa , Xb , Y, δ, 1 Start with the 1-block approximator M = (M1 ). DPSplit maintains a m × k Dynamic Programming (DP) table 2 for k = 2, . . . , m do D where each entry D(i, j) corresponds to the error, err(L[1,i] , j) 3 Mk−1 = (β1 , . . . , βk−1 ). of the j-segment approximation of the distributions D1 , . . . , Di . 4 for each βi ∈ Mk−1 do 5 for each group combination xq,r ∈ βp do DP Split also maintains another m × k table P , where each en- 6 Draw C samples uniformly. Compute mean µ ˜ q,r try P (i, j) corresponds to the split group that minimizes the error end corresponding to D(i, j). Given m distributions, at each iteration 1 7 Compute µβi = I J µ ˜ q,r |β ||β | i i q∈β I r∈β J k, all the entries from D(1, 1) to D(m, k) are updated, i.e., the er- i i end ror of the segment approximation is recomputed based on the new |T I ||T J | 2 samples drawn from the distributions. In the final iteration k = m, 8 Find (T, U, V, W ) = argmaxp∗ ; T ,U,V,W ⊆β µT + p∗ mn the entire table is updated. Therefore, the time complexity of DP- |β I ∗ |×|β J∗ | |U I ||U J | 2 µU + |V I ||V J | 2 µV + |W I ||W J | 2 µW − p p µ2β ∗ . Split is O(m × k2 ) even though the samples taken in each iteration mn mn mn mn p is the same as ISplit. 9 Update Mk+1 = β1 , . . . , βi−1 , T, U, V, W, βi+1 , . . . , βk . end Algorithm 2: ISplit-Grid 4. DATASET ANALYSIS In this section, we present the Q-Q plot [49] analysis of the datasets (Table 2) used in Section 5. We present the results for the 3. RELEASING THE REFINEMENT Arrival Delay attribute of the Flight dataset (FLA), the Trip Time attribute of the NYC taxi dataset for the year 2011 (T11), and the RESTRICTION: DPSplit Temperature attribute of the weather dataset (WG). We also present In this section, we present an incremental visualization gener- the corresponding histograms of the datasets. We exclude the re- ation algorithm for trendlines, DPSplit. At each iteration k, DP- sults of the Departure Delay attribute of Flight dataset due to sim- Split generates the entire k-segment approximation—releasing the ilarity in distribution to the Arrival Delay attribute. For the same refinement restriction. The algorithm works as follows: Given the reason, we also exclude the results for T12 and T13 (similarity to task of approximating the distributions D1 , . . . , Dm , at each it- T11 dataset). 
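As a companion to the pseudocode of Algorithm 3 (DPSplit) above, the following is a minimal Python sketch of the underlying dynamic program. It is an illustrative re-implementation, not the authors' code: it operates directly on estimated group means and uses an unweighted within-segment sum of squared errors, whereas the paper's err(·) additionally normalizes by segment width.

```python
# Sketch of a DPSplit-style dynamic program: D[i][j] is the best error of
# covering the first i groups with j segments; P[i][j] records the split
# position used to reconstruct the segments (cf. lines 2-7 of Algorithm 3).

from typing import List, Tuple

def best_k_segments(means: List[float], k: int) -> List[Tuple[int, int, float]]:
    """Return the k-segment approximation of `means` minimizing total SSE.

    Each segment is (start, end, value) with 0-indexed, inclusive bounds.
    """
    m = len(means)
    s1 = [0.0] * (m + 1)  # prefix sums of the means
    s2 = [0.0] * (m + 1)  # prefix sums of the squared means
    for i, mu in enumerate(means):
        s1[i + 1] = s1[i] + mu
        s2[i + 1] = s2[i] + mu * mu

    def sse(lo: int, hi: int) -> float:
        """Within-segment SSE of means[lo..hi] around their average."""
        n = hi - lo + 1
        total = s1[hi + 1] - s1[lo]
        return (s2[hi + 1] - s2[lo]) - total * total / n

    INF = float("inf")
    D = [[INF] * (k + 1) for _ in range(m + 1)]
    P = [[-1] * (k + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, min(k, i) + 1):
            for p in range(j - 1, i):          # last segment covers groups p..i-1
                cand = D[p][j - 1] + sse(p, i - 1)
                if cand < D[i][j]:
                    D[i][j], P[i][j] = cand, p

    # Backtrack through P to recover the segments.
    segments, i, j = [], m, k
    while j > 0:
        p = P[i][j]
        value = (s1[i] - s1[p]) / (i - p)
        segments.append((p, i - 1, value))
        i, j = p, j - 1
    return list(reversed(segments))

if __name__ == "__main__":
    mu_hat = [0.8, 0.82, 0.78, 0.4, 0.42, 1.3]  # hypothetical estimated group means
    for seg in best_k_segments(mu_hat, 3):
        print(seg)
```

The triple loop mirrors the table-filling step of Algorithm 3, and re-running it at each iteration k with freshly improved mean estimates gives the O(m × k²) per-iteration cost noted in the text.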
eration k, DPSplit draws samples uniformly from each group (as Figure 18(a) and 18(b) shows the Q-Q plot of FLA and T11, in ISplit) and then computes the 2 -squared error of all possible respectively. We plot the theoretical gaussian distribution in the x- k-segment approximations that can be generated. DPSplit then axis and the ordered values of the attributes of the dataset in the chooses the k-segment approximation that yields the least error to y-axis. The shape of the plot determines the type of the distri- be output. Therefore, instead of refining the (k − 1)-segment ap- bution. Both the datasets exhibit a right skewed gaussian distribu- proximation, DPSplit computes the k-segment approximation based tion confirmed by their corresponding histograms (Figure 18(d) and on the improved estimates of the means µ1 , . . . , µm of the distri- 18(e))—in both case, the peak is off center and the the tail stretches butions obtained at iteration k. We use the same notion of error away from the center to the right side. On the other hand, the WG of a segment err(Si ) (Equation 5) as ISplit. We represent the (Figure 18(c)) dataset exhibits a truncated gaussian distribution that error of a segment approximation that approximate the distribu- is clipped on both sides (Figure 18(f)). tions Dp , . . . , Dq with k-segments as err(L[p,q] , k). For ISplit, err(Lk ) = err(L[1,m] , k)
[Figure 18: Q-Q plots of (a) FLA, (b) T11, and (c) WG; histograms of (d) FLA, (e) T11, and (f) WG.]
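For completeness, here is a small sketch of how a Q-Q check like the one in Figure 18 can be produced. This is not the authors' analysis script: a synthetic right-skewed sample stands in for an attribute such as Arrival Delay so that the example is self-contained.

```python
# Illustrative Q-Q plot and histogram check against a Gaussian reference.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.gamma(shape=2.0, scale=10.0, size=5000)  # right-skewed stand-in data

fig, (ax_qq, ax_hist) = plt.subplots(1, 2, figsize=(8, 3.5))

# Q-Q plot: theoretical Gaussian quantiles (x-axis) vs. ordered data (y-axis).
stats.probplot(values, dist="norm", plot=ax_qq)
ax_qq.set_title("Q-Q plot vs. Gaussian")

# Histogram: an off-center peak with a long right tail indicates right skew.
ax_hist.hist(values, bins=50)
ax_hist.set_title("Histogram")

plt.tight_layout()
plt.savefig("qq_check.png")
```

A roughly straight Q-Q line indicates an approximately Gaussian attribute, while the bowed shape produced by skewed data corresponds to the right-skewed cases (FLA, T11) discussed above.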