How we used cubes and native SQL features to divide and conquer more than 51 billion possible data flows, transparently to end users.

The Census Transportation Planning Products (CTPP) is a special tabulation of the American Community Survey conducted by the US Census Bureau. It contains valuable data comprising home and work locations as well as journey-to-work travel flows, including demographic characteristics, for a variety of state, regional, and local transportation policy and planning efforts. CTPP data also supports corridor and project studies, environmental analyses, and emergency operations management. The source data is contained in more than 15,000 separate files and over 300 tables, grouped by state. Once loaded into a database, CTPP represents 2.5 terabytes of data.

Beyond 20/20 Inc. was commissioned by AASHTO to build a web-based data access system that would allow users to interact with the data, perform statistical calculations, build custom aggregations, view charts and maps, and download specific data sets originating from multiple files. This article is one of a series describing some of the key technical challenges we faced in completing the CTPP data access system, and the solutions we used to conquer them. CTPP data can be found at http://data5.ctpp.transportation.org/ctpp

Challenge 1: How to choose the appropriate data model for CTPP data and its intended use?

The optimal data model was derived through analysis, trials, and gradual convergence on a workable design. The source data was delivered as a set of 15,000 CSV files, each with:

- One or two complex geography dimensions, including Residence and Workplace data
- Between 0 and 3 categorization dimensions (e.g. Age, Sex, Persons in Household)
- A single measure indicating the meaning and units of the numbers in the table (e.g. number of persons, mean household income)
- Both an estimate and a 90% confidence margin of error

The categorization dimensions are used across multiple tables; in fact, the tables contain different cross-tabulations of a relatively small set of variables. In addition, each categorization dimension contains a "Total" member, indicating the value of the measure when the dimension is not used to sub-select the data at all (a simplified sketch of this table structure appears below).

In order to make such an application work efficiently, we recognized that the data would need to reside in a database. Furthermore, given the multi-dimensional nature of the data, the power of an OLAP engine would be beneficial for certain operations, such as custom aggregation. We therefore chose to use a Microsoft SQL Server 2012 Analysis Services database. Our challenge, then, was: what is the best approach for designing the data cubes within this database?

One of the powerful features of OLAP is that it is intended to be used with record-level data, aggregating the records into statistical tables. In the case of CTPP, however, record-level data is not available: the US Census Bureau is extremely careful to ensure that individuals' privacy cannot be jeopardized in the data. In addition, the survey results also needed to be grossed up using sophisticated statistical models to produce accurate estimates in the tables, as only a small percentage of the population is surveyed each year.

If we were to create a cube for each source table, there would be over 15,000 cubes in the database: completely unmanageable. Even if the data for all states were brought together in each table, we would still have more than 300 cubes.
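To make the source-table structure concrete, here is a minimal sketch of one such table in Python. All dimension names, member values, and figures are invented for illustration; this is not the actual CTPP file layout or schema.

    # A minimal sketch of one CTPP-style source table. Dimension names, member
    # values, and figures are invented for illustration; this is not the actual
    # CTPP file layout or schema.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Cell:
        estimate: float   # tabulated estimate (e.g. number of workers)
        moe_90: float     # 90% confidence margin of error

    # Each categorization dimension carries an explicit "Total" member, used
    # when the dimension is not applied to sub-select the data.
    dimensions = {
        "Residence": ["Tract 0101", "Tract 0102"],      # geography dimension
        "Age":       ["Total", "16-24", "25-64", "65+"],
        "Sex":       ["Total", "Male", "Female"],
    }

    # Cells are keyed by one member from each dimension (only a few shown);
    # every cell is pre-tabulated, since no record-level data is released.
    workers_by_age_and_sex = {
        ("Tract 0101", "Total", "Total"):  Cell(estimate=1540, moe_90=210),
        ("Tract 0101", "16-24", "Male"):   Cell(estimate=120,  moe_90=45),
        ("Tract 0101", "16-24", "Female"): Cell(estimate=95,   moe_90=40),
    }

    # The "Total" members let a user ignore a dimension entirely:
    print(workers_by_age_and_sex[("Tract 0101", "Total", "Total")].estimate)  # 1540

Hundreds of tables cross-tabulate the same small pool of categorization dimensions in different combinations, so structures like this overlap heavily from one table to the next.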
Given the overlap between dimensions, it seemed to make sense to try to merge the tables into a much smaller number of cubes (a simplified sketch of this merging idea appears at the end of this article). However, as the underlying records were not available, we needed to accept that not all cells in the merged cubes would contain data. Conversely, there were also strong reasons for not putting the data into one very large cube:

a) It did not make sense to put the tables containing only Residence, those containing only Workplace, and those containing both in the same cube.
b) Some tables provided slightly different data for the same cross-tabulations / sub-totals, due to the perturbation applied to the data to protect individuals' privacy.

It was therefore clear that we needed to model the data into a relatively small number of complex cubes, but that experimentation was necessary to determine the optimum organization of those cubes.

The next challenge was how to load the data in a way that allowed us to experiment to find the right number and content of the cubes. We'll address that challenge in the next article in this series...
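In the meantime, the sketch below gives a rough, simplified illustration of the merging idea described above: overlapping tables unioned into one sparse, cube-like structure. Table names, dimensions, and values are invented, and this is a toy model rather than the Analysis Services design we ultimately built.

    # A rough sketch of merging overlapping CTPP-style tables into one sparse,
    # cube-like structure. Names and values are invented for illustration; the
    # production system used SQL Server 2012 Analysis Services cubes.

    def merge_tables(tables):
        """Union the cells of several tables into a single sparse cube.

        Keys are frozensets of (dimension, member) pairs, so tables that use
        different subsets of the shared dimensions can coexist. Cross-tabulations
        that no source table provides are simply absent, because the underlying
        survey records are not available to derive them.
        """
        cube = {}
        for table_name, cells in tables.items():
            for key, cell in cells.items():
                # Keep the contributing table in the cell, so two tables reporting
                # slightly different (perturbed) values for the same sub-total
                # do not overwrite one another.
                cube.setdefault(key, {})[table_name] = cell
        return cube

    # Two tables cross-tabulating the same small pool of variables.
    # Each cell is (estimate, 90% margin of error).
    workers_by_age = {
        frozenset({("Residence", "Tract 0101"), ("Age", "16-24")}): (120, 45),
        frozenset({("Residence", "Tract 0101"), ("Age", "Total")}): (1540, 210),
    }
    workers_by_sex = {
        frozenset({("Residence", "Tract 0101"), ("Sex", "Female")}): (760, 150),
    }

    cube = merge_tables({"workers_by_age": workers_by_age,
                         "workers_by_sex": workers_by_sex})

    # The Age x Sex cross-tabulation was never published, so that cell is simply
    # missing from the merged cube -- an accepted consequence of merging.
    wanted = frozenset({("Residence", "Tract 0101"),
                        ("Age", "16-24"), ("Sex", "Female")})
    print(wanted in cube)   # False

Keeping the contributing table as part of each cell also accommodates point (b) above: two tables can legitimately report slightly different perturbed values for the same cross-tabulation without one silently overwriting the other.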