Data Marts
This page documents a feature in SanteDB version 3.0.
Introduction
SanteDB's Business Intelligence services allow developers of SanteDB applets to define custom reports, queries, views, and (starting with version 3.0) custom data marts.
A data mart is a replica of the primary SanteDB database in which selected data is pre-pivoted, calculated, and stored in a schema better suited to running reports. This makes reports easier to write and faster to run. It also allows system administrators to host data for reporting purposes on a separate physical server.
The SanteDB BI data mart infrastructure allows data marts to be extended and derived from one another. A data mart definition is defined in a new BI asset file with a DatamartDefinition element and contains:
Metadata about the data mart definition, including authors and permissions to access the data mart.
A data source registration, which identifies the <dataSource> that can be referenced by reports, queries and views.
A <schema> element, which identifies the tables, views and relationships between these objects within the data mart.
A <flows> element, which defines one or more data flows and data pipelines for extracting data from one or more SanteDB data sources (including other data marts) to populate the new data mart.
Previous versions of SanteDB (namely OpenIZ and SanteDB 1.0) shipped with a series of ETL jobs written in Talend; however, this solution was not portable, did not allow data marts to be defined in an end-user configurable manner, and did not work in the mobile application environment. The new SanteDB 3.0 data mart infrastructure seeks to resolve these limitations.
Defining a Datamart
Datamarts are defined by creating a new asset file in the bi/ folder of your applet. The XML file has a structure as illustrated below:
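The following is a minimal sketch of such a file. The XML namespace, identifiers and metadata child elements shown are assumptions for illustration; DatamartDefinition, produces, schema and flows are the elements documented on this page.

```xml
<!-- Sketch only: skeleton of a data mart definition.
     Namespace, ids and <meta> children are illustrative assumptions. -->
<DatamartDefinition xmlns="http://santedb.org/bi"
                    id="org.example.bi.datamart.sample"
                    name="sample-mart">
  <!-- Metadata: authors, access permissions, etc. -->
  <meta>
    <authors>
      <add>Example Author</add>
    </authors>
  </meta>
  <!-- The BI data source this mart produces -->
  <produces id="org.example.bi.dataSource.sample" name="sample-source" />
  <!-- Tables, views and relationships in the mart -->
  <schema>
    <!-- table definitions -->
  </schema>
  <!-- Data flows which populate the mart -->
  <flows>
    <!-- flow definitions -->
  </flows>
</DatamartDefinition>
```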
Deriving / Extending Data Marts
SanteDB 3.0 provides a default datamart named org.santedb.bi.dataSource.warehouse, and other projects developed by SanteSuite (such as SanteEMR, SanteIMS) may have their own. Implementers can either extend these data marts or base entirely new marts on them.
Extending an Existing Mart
Extending an existing data mart is useful when:
You wish to add new views to an already existing mart
You wish to add new tables to an existing mart
You wish to change or modify data in an existing mart after it is populated (post processing tasks)
To extend a data mart, the extends element is used with a reference to the other data mart:
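A minimal sketch of such a reference; the attribute used on the extends element is an assumption:

```xml
<!-- Sketch: extend the default warehouse mart (attribute name assumed) -->
<DatamartDefinition xmlns="http://santedb.org/bi"
                    id="org.example.bi.datamart.extension"
                    name="extension-mart">
  <extends ref="org.santedb.bi.dataSource.warehouse" />
  <!-- additional tables, views or post-processing flows go here -->
</DatamartDefinition>
```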
Deriving a New Mart
It is also possible for developers to base an entirely new data mart on an existing definition. When using a derivation, the underlying mart definition is "copied", which saves developers from re-defining common assets.
Deriving a data mart is done with the basedOn element:
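For example (the attribute on the basedOn element is an assumption):

```xml
<!-- Sketch: derive a new mart from the default warehouse definition (attribute name assumed) -->
<DatamartDefinition xmlns="http://santedb.org/bi"
                    id="org.example.bi.datamart.derived"
                    name="derived-mart">
  <basedOn ref="org.santedb.bi.dataSource.warehouse" />
</DatamartDefinition>
```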
Produced Data Source
The primary purpose of the data mart is to provide a BI data source which can be used by reports and queries like any other connection in SanteDB. The data source which is created or used as the output of the data mart is identified in the <produces> element of the datamart definition.
Existing Data Source
If you would like to output your data mart to an existing BI data source, use the ref option:
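A minimal sketch, assuming ref is carried directly on the <produces> element and that the referenced data source id is hypothetical:

```xml
<!-- Sketch: output the mart to an already-registered BI data source (placement of @ref assumed) -->
<produces ref="org.example.bi.dataSource.reporting" />
```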
Configured Database
If you have configured a connection string in the SanteDB server configuration, or would like to place the output of your datamart in an existing location, you can specify the connectionString attribute on your datamart definition:
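A minimal sketch, assuming the attribute is carried on the <produces> element and that a connection string named "reporting" exists in the server configuration; both are assumptions:

```xml
<!-- Sketch: send the mart output to a pre-configured connection string
     (attribute placement and the name "reporting" are assumptions) -->
<produces connectionString="reporting" name="reporting-mart" />
```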
New Database
If you would like the BI layer to create a new database on your database server, you can simply specify the name, permissions, and other metadata for your new data source.
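A minimal sketch; the child elements used for permissions and other metadata are assumptions:

```xml
<!-- Sketch: have the BI layer create a new database for the mart
     (metadata child elements are assumptions) -->
<produces id="org.example.bi.dataSource.reporting" name="reporting">
  <meta>
    <!-- authors, access policies and other metadata -->
  </meta>
</produces>
```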
You can specify a different database server and technology for all data marts by creating a new connection string in your server configuration file with the biSkel attribute.
Schema Definition
The <schema> element of the data mart definition is used to define the tables and views for the data mart in a database technology agnostic manner.
For example, to define a data mart which has two tables, FACILITY_TBL and PATIENT_BY_DOB_TBL:
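A minimal sketch of such a schema; the column names, data types and the key attribute are assumptions, while type="ref" and <otherTable> are described later on this page:

```xml
<!-- Sketch: two tables in the mart schema (column names/types are assumptions) -->
<schema>
  <table name="FACILITY_TBL">
    <column name="FACILITY_ID" type="uuid" key="true" />
    <column name="FACILITY_NAME" type="string" />
  </table>
  <table name="PATIENT_BY_DOB_TBL">
    <column name="PATIENT_ID" type="uuid" key="true" />
    <column name="DOB" type="date" />
    <!-- link back to the facility of registration -->
    <column name="FACILITY_ID" type="ref">
      <otherTable ref="FACILITY_TBL" />
    </column>
  </table>
</schema>
```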
Basic Table Definition
The basic table definition in a schema is expanded below:
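A sketch of a single table definition with each part annotated; the key attribute name is an assumption, while name, type="ref" and <otherTable> are documented on this page:

```xml
<!-- Sketch: anatomy of a basic table definition (undocumented attribute names are assumptions) -->
<table name="FACILITY_TBL">
  <!-- the primary key column; the key="true" flag is an assumed attribute name -->
  <column name="FACILITY_ID" type="uuid" key="true" />
  <!-- name is the column name in the mart, type is the logical data type -->
  <column name="FACILITY_NAME" type="string" />
  <!-- a reference column creates a foreign key to another table -->
  <column name="PARENT_FACILITY_ID" type="ref">
    <otherTable ref="FACILITY_TBL" />
  </column>
</table>
```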
Foreign Keys
Links between tables can be established using a type="ref" on your column and including an <otherTable> reference. For example, to link an ADDRESS table to a PATIENTS table:
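A minimal sketch of the link; column names beyond the ADDRESS and PATIENTS table names are assumptions:

```xml
<!-- Sketch: ADDRESS rows carry a foreign key to PATIENTS (column names assumed) -->
<table name="PATIENTS">
  <column name="PATIENT_ID" type="uuid" key="true" />
  <column name="MRN" type="string" />
</table>
<table name="ADDRESS">
  <column name="ADDRESS_ID" type="uuid" key="true" />
  <column name="PATIENT_ID" type="ref">
    <otherTable ref="PATIENTS" />
  </column>
  <column name="CITY" type="string" />
</table>
```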
This results in a schema where each ADDRESS row carries a foreign key reference to a row in the PATIENTS table.
Parent Tables
SanteDB's primary CDR stores data as a series of objects in a hierarchy (see: Conceptual Information Model). In a relational database this is represented as a "table per class" pattern using the entity classes.
This means that every Place, Person, Patient, and Provider should also have an entry in Entity carrying the core data elements; each table in the chain merely adds more data to the previous one.
The BI schema definition allows the developer to shortcut this and automatically create views which link these tables up the hierarchy, using the <parent> element on the table. For example:
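A minimal sketch of a parent link, reconstructed from the description below; the column names are assumptions:

```xml
<!-- Sketch: PATIENT derives from PERSON via <parent> (column names assumed) -->
<table name="PERSON">
  <column name="PERSON_ID" type="uuid" key="true" />
  <column name="GENDER" type="string" />
  <column name="FIRST_NAME" type="string" />
  <column name="LAST_NAME" type="string" />
  <!-- the mother is, at minimum, a PERSON -->
  <column name="MOTHER_ID" type="ref">
    <otherTable ref="PERSON" />
  </column>
</table>
<table name="PATIENT">
  <parent ref="PERSON" />
  <column name="MRN" type="string" />
</table>
```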
This would allow a PATIENT to have a MOTHER which is either just a PERSON (who has only a gender, first and last name), or a PATIENT with an MRN.
The BI layer also automatically establishes a view VW_PATIENT which can be used to get all applicable fields for the PATIENT; the SQL generated for this definition essentially joins PATIENT to its parent PERSON table on their keys.
Data Flows
The schema of the data mart defines its structure; the data flows of the data mart definition describe how data is extracted from another source to populate the data mart.
Each data flow defines one or more parameters (which can be passed by a caller) and a single data pipeline.
Data Pipeline
The order of the elements in the <pipeline> element does not necessarily dictate the order in which the items are called; execution of a data pipeline starts with all un-streamed components and works backwards. This creates a streaming IEnumerable data flow for each component. For example:
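A minimal sketch of such a pipeline, reconstructed from the execution analysis below; the data source identifiers, provider name, SQL and the writer's child elements are assumptions:

```xml
<!-- Sketch: reader -> crosstab -> writer pipeline
     (data source ids, SQL and writer children are assumptions) -->
<pipeline>
  <connection name="inputConnection" mode="read-only">
    <dataSource ref="org.santedb.bi.dataSource.main" />
  </connection>
  <connection name="outputConnection" mode="read-write">
    <dataSource ref="org.example.bi.dataSource.reporting" />
  </connection>
  <reader name="read_addresses">
    <connection ref="inputConnection" />
    <sql>
      <add>
        <providers>npgsql</providers>
        SELECT ent_id, adr_typ, adr_val FROM addresses
      </add>
    </sql>
  </reader>
  <crosstab name="pivot_addresses">
    <input ref="read_addresses" />
    <!-- pivot configuration (see the Crosstab step below) -->
  </crosstab>
  <writer name="write_addresses">
    <connection ref="outputConnection" />
    <input ref="pivot_addresses" />
  </writer>
</pipeline>
```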
The execution engine's analysis would conclude that the terminal node of this pipeline is the <writer> step, which relies on <crosstab>, which relies on <reader>, so the order of execution would be:
1. Open the inputConnection connection
2. Execute the SQL and name the stream read_addresses
3. Pivot the output of read_addresses and name the stream pivot_addresses
4. Open the outputConnection connection
5. Write the stream pivot_addresses to the outputConnection
Pipeline Steps
The <pipeline> element of a flow represents the activities which are to be performed for the flow. Each pipeline step must carry a name="" attribute, and should reference other steps where possible. For example, referencing components is performed with the ref="" attribute as illustrated below:
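A minimal sketch; the step names are illustrative and the reader's child elements are as documented in the Data Reader section below:

```xml
<!-- Sketch: the reader references the named connection step via ref="" -->
<connection name="inputConnection" mode="read-only">
  <dataSource ref="org.santedb.bi.dataSource.main" />
</connection>
<reader name="read_facilities">
  <connection ref="inputConnection" />
  <!-- sql definition goes here -->
</reader>
```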
Connect to Database
Input: None
Output: DataIntegrationConnection
Connections to the database are made using the <connection> pipeline step. The connection pipeline step is illustrated below.
@name (String): Identifies the name of the connection for referencing in other data components.
@mode (read-only or read-write): Indicates the connection mode. Using read-only for source data is recommended, as it allows the data source to connect to a replica (if configured).
dataSource/@ref (Reference): If the connection uses an already defined BI data source, this is the reference to that data source.
dataSource/@connection (Connection String): The connection string in the SanteDB configuration for the database (for direct connections).
For example, to define a connection to the main SanteDB iCDR database:
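A minimal sketch; the data source id org.santedb.bi.dataSource.main is assumed here to be the registration for the primary iCDR database:

```xml
<!-- Sketch: read-only connection to the primary iCDR data source (id assumed) -->
<connection name="inputConnection" mode="read-only">
  <dataSource ref="org.santedb.bi.dataSource.main" />
</connection>
```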
Execute in Transaction
Input: DataIntegrationConnection
Output: IEnumerable
When executing multiple operations, it is often desirable to execute the output in a transaction. Transactions represent an atomic series of updates, and either all the components succeed, or none do. Transactions create a new sub-scope from the data flow which contains them.
The transaction requires that the connection be opened in read-write mode.
@name (String): Identifies the name of the transaction for referencing in other data components.
connection (Reference to a connection step, or an inline connection): The connection on which the transaction should be executed.
pipeline (List of steps): The pipeline steps to execute in the transaction.
For example:
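A minimal sketch; the <transaction> element name is inferred from this section and is an assumption:

```xml
<!-- Sketch: run writer steps atomically on the output connection (element name assumed) -->
<transaction name="load_patients_tx">
  <connection ref="outputConnection" />
  <pipeline>
    <!-- writer and other steps executed in the transaction -->
  </pipeline>
</transaction>
```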
Call Another Flow
Input: None
Output: IEnumerable
The <call> pipeline instruction can be used to call another flow and pass parameter arguments to the called flow.
dataFlow/@ref (Reference to a data flow): The data flow which is to be executed/called.
args/* (Argument list): One or more arguments, each identified by its argument type.
args/*/@name (String): The name of the argument being passed.
args/*/value (Variable): The value to pass for the argument.
For example, to call FOO_FLOW, passing the input connection and a flag, and then output the result to the log:
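A minimal sketch; the argument element names, the way a connection is passed as an argument, and the log step's configuration are assumptions:

```xml
<!-- Sketch: call FOO_FLOW with two arguments then log its output
     (argument element names and log configuration are assumptions) -->
<call name="call_foo">
  <dataFlow ref="FOO_FLOW" />
  <args>
    <ref name="inputConnection">inputConnection</ref>
    <bool name="includeDeleted">false</bool>
  </args>
</call>
<log name="log_foo_results">
  <input ref="call_foo" />
</log>
```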
Data Reader
Input: DataIntegrationConnection
Output: IEnumerable
A data reader is used to open a streaming result set based on a SQL query executed against a connection.
@name (Name): The name of the stream output that the reader creates (to be used in subsequent flow steps with @ref).
connection/@ref (Data source): The data source or data connection step that should be used to execute the SQL.
sql/add (SQL): The SQL statement to be added for the dialects of SQL indicated in providers.
sql/add/providers (Provider name): The SQL providers that the SQL applies to.
schema/@ref (Table reference): A reference to the table which contains the schema the output is expected to be in.
schema/columns (Table): If defining a schema inline, the columns returned by the SQL.
For example, to read all identifiers in the SanteDB primary database:
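A minimal sketch; the provider name and the SQL (including the table it reads from) are assumptions about the iCDR schema:

```xml
<!-- Sketch: stream entity identifiers from the primary database
     (provider name and SQL are assumptions) -->
<reader name="read_identifiers">
  <connection ref="inputConnection" />
  <sql>
    <add>
      <providers>npgsql</providers>
      SELECT ent_id, id_val, aut_id FROM ent_id_tbl
    </add>
  </sql>
</reader>
```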
Data Writer
Input: IEnumerable
Output: IEnumerable
A data writer is used to write data to a connection. The output of the data writer is a stream of the records that were written.
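A minimal sketch of a writer step; its child elements (connection, input, and the target table reference) are assumptions based on the conventions of the other pipeline steps on this page:

```xml
<!-- Sketch: write the pivoted stream to a mart table (child element names are assumptions) -->
<writer name="write_addresses">
  <connection ref="outputConnection" />
  <input ref="pivot_addresses" />
  <schema ref="ADDRESS" />
</writer>
```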
Records that could not be written to the output connection are streamed out of the data writer with additional $reject and $reject.reason fields added to the output object. Use a <filter> or <split> pipeline component to filter for these records.
Column Transform
Input: IEnumerable
Output: IEnumerable
A column mapping pipeline step is used to change the names of input data stream fields to other output stream field names, and to apply transforms.
Using the lookup option requires loading the entire source of the lookup into memory, and is intended for very small reference lookups. If a larger join is desired, use a <join> pipeline component.
Crosstab
Input: IEnumerable
Output: IEnumerable
The crosstab pipeline step allows developers to pivot a dataset based on an accumulator.
The <input> stream MUST be sorted by the @key as the most significant sorting column. Failure to sort the input result set may result in unpredictable crosstab output.
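A minimal sketch; the <pivot> child element and its attribute names (key, columnDef, value, fn) as well as the accumulator function name are assumptions for this pipeline step:

```xml
<!-- Sketch: pivot address types into columns, one row per entity
     (pivot attribute and accumulator names are assumptions) -->
<crosstab name="pivot_addresses">
  <input ref="read_addresses" />
  <pivot key="ent_id" columnDef="adr_typ" value="adr_val" fn="first" />
</crosstab>
```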
Output to Log
Input: IEnumerable
Output: IEnumerable
For diagnostic purposes, it is often useful to log data and messages to the output. SanteDB data flows can log to one of the following places:
The application log (the trace log)
The registration for the particular execution of the datamart refresh
The console
Logs may be sourced from an input stream or may be standalone components.
For example, to log a simple message to the console:
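A minimal sketch of a standalone log step; the attribute used to select the console target is an assumption:

```xml
<!-- Sketch: write a simple message to the console (target attribute name assumed) -->
<log name="log_start" target="console">Starting refresh of the address mart</log>
```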
To log the contents of an input stream to the execution log (which would require an administrator to log in to view):
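A sketch under the same assumptions, this time sourcing the log from an input stream (rejected_records is a hypothetical stream name):

```xml
<!-- Sketch: log each record of a stream to the execution log (target attribute name assumed) -->
<log name="log_rejects" target="execution">
  <input ref="rejected_records" />
</log>
```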
Filter
The filter object allows developers to filter an incoming result set and pass only those records which match the specified criteria to the next stage of the pipeline.
@name (Name): The name of the flow step.
input/@ref (IEnumerable): The input data source.
all or any (Array): Collection of conditions to be satisfied to pass a record through the filter (all = all conditions must pass, any = at least one condition must pass).
all/when (Condition): A single condition on the filter.
all/when/@field (Column): The name of the column in the input stream to use as the LHS.
all/when/@op (eq = equal, ne = not equal, lt = less than, gt = greater than, lte = less than or equal, gte = greater than or equal): Indicates the manner in which @field is compared to the value.
all/when/bool or all/when/string or all/when/int (bool, string, int or uuid): The value to compare as the RHS of the when condition.
This logic can be used to filter out rejects and log them to a separate table.
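A minimal sketch, combining the documented filter elements with the $reject field emitted by the data writer; the writer step name write_addresses is hypothetical:

```xml
<!-- Sketch: pass only records rejected by the write_addresses step -->
<filter name="rejected_records">
  <input ref="write_addresses" />
  <all>
    <when field="$reject" op="eq">
      <bool>true</bool>
    </when>
  </all>
</filter>
```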
Union Result Sets
Two or more result sets can be unioned together. The result is a new enumerable containing the unioned result sets in the order they were included.
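A minimal sketch; the <union> element name and its child elements are assumptions, and the referenced stream names are hypothetical:

```xml
<!-- Sketch: concatenate two reader streams (element names are assumptions) -->
<union name="all_people">
  <input ref="read_patients" />
  <input ref="read_providers" />
</union>
```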
Halt Execution
The result set can be halted at any time with the <halt> pipeline step. The HALT operation currently halts execution as soon as the first record is encountered.
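A minimal sketch; the child elements of <halt> are assumptions, and rejected_records is a hypothetical stream name:

```xml
<!-- Sketch: abort the flow if any rejected record reaches this step
     (child element names are assumptions) -->
<halt name="halt_on_reject">
  <input ref="rejected_records" />
</halt>
```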