The matching engine in SanteDB is a multi-stage process whereby an inbound record (the target record) is compared with the current dataset within the CDR. The matching process occurs in three stages:
- Blocking : In the blocking phase, records are queried from the CDR's database infrastructure. Blocking reduces the total number of records that must be scored, since scoring can be more computationally expensive.
- Scoring : In the scoring stage, the target record is compared to the records retrieved in the blocking phase. Each configured attribute is transformed/normalized and then a score is applied for each attribute. There are also API hooks for implementing partial weighting/partial scoring of an attribute (based on frequency of data in the database, NLP, or other future methods).
- Classification : In the classification stage, each record's score (the sum of its attribute scores) is grouped into Match, NonMatch, or ProbableMatch. The thresholds for these groups are configurable.
SanteDB allows multiple match configurations to be defined, and allows one of them to be designated as the "default" match configuration used for regular operation (whenever data is inserted, updated, etc.).
In the blocking phase of the matching execution, the candidate record (named $input) is compared with records persisted in the database using one or more blocking configurations.
Records are read from the database (as blocks), with multiple blocks being combined using either an intersect or a union function. Blocks can be loaded as either:
- SOURCE Records: The source records are loaded from the database as they were sent. Blocking in this mode is less CPU intensive and less database intensive; however, it relies on source information as a "picture" of what data is available for a patient.
- MASTER Records: The blocks are loaded using the MDM layer and are computed based on existing known and suspected links. This method of blocking more closely resembles what users see in the UI when MDM is enabled; however, it slows down matching performance because each record must be cross-referenced with the master data record. It also allows for matching based on records of truth.
Blocks from each statement are combined to form a result set (in C#, an IEnumerable<T>), which is passed into the scoring stage.
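The union and intersect combination described above can be sketched in a few lines. This is only an illustration: it treats each block as a set of record identifiers, and the function name `combine_blocks`, the operator strings, and the record ids are all hypothetical rather than part of SanteDB's API.

```python
from functools import reduce
from typing import List, Set


def combine_blocks(blocks: List[Set[str]], operator: str) -> Set[str]:
    """Combine blocks of record ids with a union or an intersect operator."""
    if not blocks:
        return set()
    if operator == "union":
        # A record survives if it appears in any block.
        return reduce(lambda acc, block: acc | block, blocks)
    if operator == "intersect":
        # A record survives only if it appears in every block.
        return reduce(lambda acc, block: acc & block, blocks)
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical results of two blocking statements.
ssn_block = {"rec-1", "rec-2"}
demographic_block = {"rec-2", "rec-3"}

union_result = combine_blocks([ssn_block, demographic_block], "union")
intersect_result = combine_blocks([ssn_block, demographic_block], "intersect")
```

A union widens the candidate set handed to scoring, while an intersect narrows it, so the choice trades recall against scoring cost.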
In the example below, the SanteDB matching engine will load records for an $input as follows:
- If the input record contains an SSN identifier, it will filter records in the database by matching SSN. It will then perform an MDM gather (i.e. the matching mode is performed on MASTER records). These records will be UNIONed with
- The results of a local query whereby:
  - If the $input has a Given name, then the given name must match, AND
  - If the $input has a Family name, then the family name must match, AND
  - If the $input has a dateOfBirth, then the date of birth must match, AND
  - If the $input has a gender concept, then the gender must match
In pseudocode terms, the blocking query for an $input of John Smith, Male, SSN 203-204-3049 born 1980-03-02 would resemble:
SELECT master FROM
masters AS master
LEFT JOIN locals AS local
WHERE
local.identifier[SSN].value = '203-204-3049'
UNION
SELECT local FROM
locals AS local
WHERE
local.name.component[Given].value = 'John'
AND local.name.component[Family].value = 'Smith'
AND local.dateOfBirth = '1980-03-02'
AND local.gender = 'Male'
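Each clause of the local query applies only when the corresponding value is present on the input. The sketch below shows one way that conditional filter could be assembled. The function `build_local_filter`, the dictionary keys, and the rendered string are assumptions made for illustration; the string is for display only and is not how SanteDB builds its queries.

```python
from typing import Dict, Optional


def build_local_filter(candidate: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Return filter clauses only for attributes that the input actually has."""
    # Hypothetical mapping from input keys to the paths used in the pseudocode.
    paths = {
        "given": "name.component[Given].value",
        "family": "name.component[Family].value",
        "dateOfBirth": "dateOfBirth",
        "gender": "gender",
    }
    clauses: Dict[str, str] = {}
    for key, path in paths.items():
        value = candidate.get(key)
        if value:  # skip attributes that are missing or empty on the input
            clauses[path] = value
    return clauses


# An input with no gender concept: the gender clause is simply omitted.
john = {"given": "John", "family": "Smith", "dateOfBirth": "1980-03-02", "gender": None}
clauses = build_local_filter(john)
where = " AND ".join(f"local.{path} = '{value}'" for path, value in clauses.items())
```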
During the scoring phase, the records from the blocking stage are compared to the $input record on an attribute-by-attribute basis using a collection of assertions about the attribute. If all assertions on an attribute evaluate to TRUE, then the matchWeight score is added to that record's total score. If the assertions are FALSE, then the nonMatchWeight score (a negative score) is added to the record's total score.
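A minimal sketch of this weighting rule follows. The `AttributeRule` class, the weights, and the sample records are hypothetical, and the sketch assumes that a single failing assertion is enough to apply the non-match weight.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]
Assertion = Callable[[str, str], bool]


@dataclass
class AttributeRule:
    path: str
    match_weight: float      # added when every assertion is true
    non_match_weight: float  # negative; added otherwise (assumption: any failure)
    assertions: List[Assertion]


def score_record(candidate: Record, block: Record, rules: List[AttributeRule]) -> float:
    """Sum the match or non-match weight of every configured attribute."""
    total = 0.0
    for rule in rules:
        a, b = candidate[rule.path], block[rule.path]
        if all(check(a, b) for check in rule.assertions):
            total += rule.match_weight
        else:
            total += rule.non_match_weight
    return total


# Hypothetical weights: family name agrees, date of birth does not.
rules = [
    AttributeRule("family", 4.0, -2.0, [lambda a, b: a.lower() == b.lower()]),
    AttributeRule("dateOfBirth", 5.0, -3.0, [lambda a, b: a == b]),
]
candidate = {"family": "Smith", "dateOfBirth": "1980-03-02"}
block = {"family": "SMITH", "dateOfBirth": "1980-03-20"}
score = score_record(candidate, block, rules)  # 4.0 + (-3.0)
```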
Overall, the process of comparing a blocked record (named $block) with the $input record is as follows:
- 1. The scoring attribute may declare that it depends on another attribute being scored (i.e. don't evaluate this attribute unless the state attribute has passed). If the attribute it depends on was not scored as a positive (match), then the current attribute is assigned the whenNull() score.
- 2. The attribute path on the configuration is read from both the $input and $block records.
- 4. The matching engine then determines if any transforms have been configured (see: Transforming Data). This is a process whereby data is extracted, tokenized, shifted, padded, etc. on both the $input and $block variables. The result of each transform is stored as the new attribute value in memory, and the next transform is applied against the output of the previous one.
- 5. Finally, the actual assertion is applied. The assertion is usually a binary logical operator (less than, equal, etc.), the result of which determines whether the matchWeight or nonMatchWeight score is applied.
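The per-attribute evaluation can be pictured as a small function that checks the dependency, reads the path, chains the transforms, and applies the assertion. Everything named here (`evaluate_attribute`, `year_only`, the use of `None` to signal that the null rule applies) is a hypothetical stand-in, not SanteDB's implementation.

```python
from typing import Callable, Dict, List, Optional

Transform = Callable[[str], str]


def evaluate_attribute(
    candidate: Dict[str, str],
    block: Dict[str, str],
    path: str,
    transforms: List[Transform],
    assertion: Callable[[str, str], bool],
    depends_on_passed: Optional[bool] = None,
) -> Optional[bool]:
    """Return the assertion result, or None when the null rule should apply."""
    # Dependency check: skip evaluation if the attribute depended on did not match.
    if depends_on_passed is False:
        return None
    # Read the configured path from both records.
    a, b = candidate.get(path), block.get(path)
    if a is None or b is None:
        return None
    # Apply each transform to both values, feeding each output into the next.
    for transform in transforms:
        a, b = transform(a), transform(b)
    # Apply the assertion to the transformed values.
    return assertion(a, b)


def year_only(value: str) -> str:
    """Stand-in for a date-part extraction on an ISO date string."""
    return value[:4]


same_year = evaluate_attribute(
    {"dateOfBirth": "1980-03-02"},
    {"dateOfBirth": "1980-11-17"},
    "dateOfBirth",
    [year_only],
    lambda a, b: a == b,
)
skipped = evaluate_attribute(
    {"family": "Smith"}, {"family": "Smith"}, "family", [],
    lambda a, b: a == b, depends_on_passed=False,
)
```

Here `same_year` is true because both dates reduce to the same year before the equality assertion runs, while `skipped` is `None` because the dependency did not pass.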
Either the inbound record or the source records from the database may be missing the specified attribute. When this is the case, the attribute's whenNull attribute controls how that attribute is evaluated. The behaviors are:
The date_extract is applied to both records and then the assertion of "eq" is applied. The following data transforms are available in SanteDB.
The scores for each of the scored attributes are then summed for each $block record and the block is classified as:
- Match: The two records are determined to agree with one another according to configuration. The matching engine is instructing whatever called it (the MDM, MPI, etc.) that the two should be linked/merged/combined/etc.
- Possible: The two records are not "non-matches" however there is insufficient confidence that the two are definite matches. Whatever called the matching operation should flag the two for manual reconciliation.
- Non-Match: The two records are definite non-matches.
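Classification amounts to comparing the summed score against two configured thresholds. In the sketch below the threshold values and the boundary comparisons (`>=` for a match, `>` for a possible match) are assumptions chosen for illustration.

```python
def classify(score: float, non_match_threshold: float, match_threshold: float) -> str:
    """Place a summed score into one of the three classification groups."""
    if score >= match_threshold:
        return "Match"
    if score > non_match_threshold:
        return "Possible"
    return "Non-Match"


# Hypothetical thresholds: at or below 2.0 is a non-match, 8.0 and above is a match.
labels = [classify(score, 2.0, 8.0) for score in (9.0, 5.0, 1.0)]
```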
After the suspected truth is cleared, the job will begin the process of re-matching the registered dataset for SanteDB. The matching process is multi-threaded and is designed to use the machine on which the match process runs as efficiently as possible. To do this, the following pattern is used:
The batch matching process registers 4 primary threads on the actively configured thread pool to control the match process:
- Gather Thread: This worker is responsible for querying data from the source dataset in 1,000 record batches. The rate at which the records are loaded will depend on the speed of the persistence layer (SanteDB 2.1.x or 2.3.x) as well as the disk performance of the storage technology.
- Match Thread: This worker is responsible for breaking the 1,000 record batches into smaller partitions of records (depending on CPU of host machine). The configured matching algorithm is then launched for each record in the batch on independent threads (i.e. matching is done concurrently on the host machine).
- Writer Thread: Once the match thread workers have completed their matching task, the results are queued for the writer thread. The writer thread is responsible for committing the matches to the read/write database.
- Monitor Thread: The monitoring thread is responsible for updating the status of the job.
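The gather, match, and write hand-off can be sketched with two bounded queues. This is a simplified illustration and not SanteDB's implementation: the function names are hypothetical, the source is an in-memory list, the worker count is arbitrary, and the monitor is reduced to a progress counter that a monitoring thread could read.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 1000  # batch size described in the text
DONE = object()    # sentinel marking the end of a queue


def gather(source, batches):
    """Read the source in fixed-size batches and queue each one."""
    for start in range(0, len(source), BATCH_SIZE):
        batches.put(source[start:start + BATCH_SIZE])
    batches.put(DONE)


def match(batches, results, match_one, workers=4):
    """Run the matcher for each record of each batch on a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while (batch := batches.get()) is not DONE:
            results.put(list(pool.map(match_one, batch)))
    results.put(DONE)


def write(results, store, progress):
    """Commit matched batches and update the counter a monitor would report."""
    while (batch := results.get()) is not DONE:
        store.extend(batch)
        progress["written"] += len(batch)


def run(source, match_one):
    # Bounded queues keep the gatherer from running far ahead of the matcher.
    batches, results = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    store, progress = [], {"written": 0}
    threads = [
        threading.Thread(target=gather, args=(source, batches)),
        threading.Thread(target=match, args=(batches, results, match_one)),
        threading.Thread(target=write, args=(results, store, progress)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store, progress


# Stand-in matcher that doubles each record, run over 2,500 fake records.
store, progress = run(list(range(2500)), lambda record: record * 2)
```

Separating the stages this way lets database reads, CPU-bound matching, and database writes overlap instead of waiting on one another.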
The performance of the batch matching will depend on the speed of the host machine as well as the version of SanteDB that is being used.
SanteSuite's community server was used for testing in the following configuration:
- Application Server:
  - 4 VCPU
  - 4 GB RAM
  - Non-persistent (RAM-only) REDIS Cache
- Database Server:
  - 12 VCPU
  - 12 GB RAM
  - RAID 1+0 SSD disk infrastructure (4+4 SSD)
The versions of SanteDB tested yielded the following results:
- Version < 2.1.160 of SanteDB: ~28,000 records per hour
- Version > 2.1.165 of SanteDB: ~50,000 records per hour
- Version 2.3.x of SanteDB (internal alpha): ~100,000 records per hour