Matching Engine
SanteMPI uses the SanteDB matching engine. This page may be moved in the future to a common page as the SanteDB matching engine supports more than just Patient resources.
The matching engine in SanteDB is a multi-stage process whereby an inbound record (the target record) is compared with the current dataset within the CDR. The matching process occurs in three stages:
Blocking : In the blocking phase, records are queried from the CDR's database infrastructure. The blocking phase is used to reduce the total number of records passed to the scoring stage, which is more computationally expensive.
Scoring : In the scoring stage, the target record is compared to the records retrieved from the blocking phase. Each configured attribute is transformed/normalized and then a score is applied for each attribute. Furthermore, there are API hooks for implementing partial weighting/partial scoring of an attribute (based on frequency of data in the database, NLP, or other future methods).
Classification : In the classification stage, each record's score (the sum of each attribute's score) is grouped into Match, NonMatch, or ProbableMatch. These thresholds are configurable.
SanteDB allows for the configuration of multiple match configurations, and allows configuration for the "default" match configuration to be used for regular operation (whenever data is inserted, updated, etc.).
Blocking
In the blocking phase of the matching execution, the candidate record (named $input) is compared with records persisted in the database using one or more blocking configurations.
A blocking configuration is expressed using the HDSI Query Syntax which is translated into SQL. Blocking instructions can use any supported Filter Functions for the selected database.
Records are read from the database (as blocks) with multiple blocks being combined with either an intersect or union function. Blocks can be loaded as either:
SOURCE Records: Whereby the source records are loaded from the database as they were sent. Blocking in this mode is less CPU intensive and less database intensive; however, it relies on source information as a "picture" of what data is available for a patient.
MASTER Records: Whereby the blocks are loaded using the MDM layer and are computed based on existing known and suspected links. This method of blocking more closely resembles what users see in the UI when MDM is enabled; however, it does slow down matching performance, as each record must be cross-referenced with the master data record. It also allows for matching based on records of truth.
Blocks from each statement are combined together to form a result set (in C#, an IEnumerable&lt;T&gt;) which is passed into the scoring stage.
In the example below, the SanteDB matching engine will load records for an $input record by:
If the input record contains an SSN identifier, it will filter records in the database by matching SSN. It will then perform an MDM gather (i.e. the matching is performed on MASTER records). These records will be UNIONed with the results of a local query whereby:
If the $input has a Given name, then the given name must match, AND
If the $input has a Family name, then the family name must match, AND
If the $input has a dateOfBirth, then the date of birth must match, AND
If the $input has a gender concept, then the gender must match
In pseudocode terms, the blocking query for an $input of John Smith, Male, SSN 203-204-3049, born 1980-03-02 would resemble:
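A sketch of that query (illustrative pseudocode only; the identifier, name, and filter syntax here is not the exact HDSI or SQL emitted by the engine):

```
SELECT master_records
  WHERE identifier[SSN].value = "203-204-3049"     -- MDM gather on MASTER records
UNION
SELECT local_records
  WHERE name[Given].value  = "John"                -- present on $input, so included
    AND name[Family].value = "Smith"
    AND dateOfBirth        = "1980-03-02"
    AND genderConcept      = Male
```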
The actual SQL generated by the SanteDB iCDR is much more complex; this example illustrates the concept.
Scoring
During the scoring phase, the records from the blocking stage are compared to the $input record on an attribute-by-attribute basis using a collection of assertions about the attribute. If all assertions on an attribute evaluate to TRUE, then the matchWeight score is added to that record's total score; if the assertions are FALSE, then the nonMatchWeight score (a negative score) is added to the record's total score.
Overall, the process of comparing a blocked record (named $block) with the $input record is:
The scoring attribute may declare that it depends on another attribute being scored (i.e. don't evaluate the city attribute unless the state attribute has passed). If the dependent attribute was not scored as a positive (match), then the current attribute is assigned the whenNull score.
The attribute path in the configuration is read from both $block and $input.
The matching engine determines if either the $block or $input attribute value extracted is null. If either is null, then the When Null Action is taken.
The matching engine then determines if any transforms have been configured (see: Transforming Data). This is a process whereby data is extracted, tokenized, shifted, padded, etc. on both the $input and $block variables. The result of each transform is stored as the new attribute value in memory, and the next transform is applied against the output of the previous.
Finally, the actual assertion is applied. The assertion is usually a binary logical operator (less than, equal, etc.), the result of which causes the matchWeight or nonMatchWeight to be applied.
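The steps above can be sketched as follows. This is illustrative Python, not the SanteDB API; the attribute configuration shape (a dictionary with `path`, `transforms`, `assertion`, and weight keys) and the function name `score_attribute` are assumptions made for the sketch:

```python
# Conceptual sketch of scoring one configured attribute of $input vs $block.
def score_attribute(attr, input_rec, block_rec, scored):
    """attr: dict describing one configured attribute.
    scored: {attribute name: bool} results of attributes already evaluated,
    used for dependency checks. Returns (score, matched)."""
    # 1. Dependency check: only evaluate if the attribute this one depends
    #    on was itself scored as a match.
    dep = attr.get("depends_on")
    if dep is not None and not scored.get(dep, False):
        return attr["when_null_score"], False

    # 2. Read the attribute path from both records.
    a = input_rec.get(attr["path"])
    b = block_rec.get(attr["path"])

    # 3. Null handling (simplified here to a single configured score).
    if a is None or b is None:
        return attr["when_null_score"], False

    # 4. Apply each configured transform to both values in sequence; each
    #    transform's output feeds the next.
    for transform in attr.get("transforms", []):
        a, b = transform(a), transform(b)

    # 5. Apply the assertion; add matchWeight or nonMatchWeight.
    if attr["assertion"](a, b):
        return attr["match_weight"], True
    return attr["non_match_weight"], False
```

For example, a date-of-birth attribute with a year-extract transform would score two records born in the same year as a match even when the full dates differ.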
When Null Actions
There can be instances where either the inbound record or the source records from the database are missing the specified attribute. When this is the case, the attribute's whenNull setting controls the behavior of the evaluation of that attribute. The behaviors are:
@whenNull | Description |
match | When the value is null in either record, the attribute is treated as a match and the matchWeight is applied. |
nonmatch | When the value is null in either record, the attribute is treated as a non-match (i.e. assume the missing data would not match). |
ignore | When the value is null in either record, the attribute is ignored and excluded from consideration in the overall score. |
disqualify | When the value is null in either record, the entire record is disqualified as a candidate (i.e. it doesn't matter what the other attribute scores are, the record is considered not a match). |
zero | When the value is null in either record, a score of zero is applied for the attribute. |
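The table above can be summarized as a small dispatch function. This is an illustrative sketch of the described behaviors, not SanteDB code; the return convention `(score, disqualified)` is an assumption:

```python
# Sketch of the whenNull behaviors for an attribute whose value is null
# in either the $input or $block record.
def when_null(action, match_weight, non_match_weight):
    """Returns (score, disqualified) for the attribute."""
    if action == "match":
        return match_weight, False       # treat missing data as agreeing
    if action == "nonmatch":
        return non_match_weight, False   # assume missing data would not match
    if action == "ignore":
        return None, False               # attribute excluded from the sum
    if action == "disqualify":
        return non_match_weight, True    # record can never be a match
    if action == "zero":
        return 0.0, False                # contributes nothing either way
    raise ValueError(f"unknown whenNull action: {action}")
```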
Transforming Data
For example, a date_extract transform may be applied to both records before an assertion of "eq" is applied. The following data transforms are available in SanteDB.
Transform | Applies | Action |
addresspart_extract | Entity Address | Extracts a portion of an address for consideration |
date_difference | Date | Extracts the difference (in weeks, months, years, etc.) between the A record and B record. |
date_extract | Date | Extracts a portion of the date from both records. |
name_alias | Entity Name | Considers any of the configured aliases for the name meeting a particular threshold of relevance (i.e. Will = Bill is stronger than Tess = Theresa) |
abs | Number | Returns the absolute value of a number |
dmetaphone | Text | Returns the double metaphone code (with configured code length) from the input string |
length | Text | Returns the length of text string |
levenshtein | Text | Returns the Levenshtein string edit distance between the A and B values. |
sorensen_dice | Text | Returns the Sorensen-Dice coefficient of the A and B text values. |
jaro_winkler | Text | Returns the Jaro-Winkler (with default threshold of 0.7) between A and B values. |
metaphone | Text | Returns the metaphone phonetic code for the attribute |
similarity | Text | Returns a percentage similarity (based on Levenshtein distance) of string A to string B (i.e. 80% similar or 90% similar). |
soundex | Text | Extracts the soundex codification of the input values. |
substr | Text | Extracts a portion of the input values |
tokenize | Text | Tokenizes the input string based on splitting characters, the tokenized values can then be transformed independently using any of the transforms listed in this table. |
timespan_extract | TimeSpan | Extracts a portion of a timespan such as minutes, hours, seconds. |
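Two of the text transforms above, levenshtein and similarity, can be sketched in a few lines. This is an illustrative Python implementation of the standard algorithms, not the SanteDB source; expressing similarity as a fraction of the longer string's length is an assumption about how the percentage is derived:

```python
# Classic dynamic-programming edit distance between strings a and b.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Percentage-style similarity derived from the edit distance.
def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For instance, "SMITH" and "SMYTH" differ by one substitution, giving a similarity of 0.8 (80% similar).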
Classification
The scores for each of the scored attributes are then summed for each $block
record and the block is classified as:
Match: The two records are determined to agree with one another according to configuration; the matching engine instructs whatever called it (the MDM, MPI, etc.) that the two should be linked/merged/combined/etc.
Possible: The two records are not "non-matches"; however, there is insufficient confidence that the two are definite matches. Whatever called the matching operation should flag the two for manual reconciliation.
Non-Match: The two records are definite non-matches.
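The classification stage can be sketched as a simple thresholding of the summed attribute scores. This is illustrative Python; the threshold values are arbitrary assumptions, stand-ins for whatever is set in the match configuration:

```python
# Sketch of the classification stage: sum the per-attribute scores and
# bucket the total by configured thresholds (values here are illustrative).
MATCH_THRESHOLD = 10.0
NON_MATCH_THRESHOLD = 4.0

def classify(attribute_scores):
    total = sum(attribute_scores)
    if total >= MATCH_THRESHOLD:
        return "Match"          # caller should link/merge the records
    if total <= NON_MATCH_THRESHOLD:
        return "NonMatch"       # definite non-match
    return "ProbableMatch"      # flag for manual reconciliation
```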
Bulk / Batch Matching
Each data type which is registered in the Master Data Management pattern has a corresponding Match Job registered. This job resets the suspected ground truth using the following rules:
Remove / Delete: | Keep: |
Candidate (suspected) links which were established automatically by a previous match run. | Links which have been confirmed, rejected, or otherwise reconciled by a human user. |
After the suspected truth is cleared, the job will begin the process of re-matching the registered dataset for SanteDB. The matching process is multi-threaded and designed to use the machine on which the match process runs as efficiently as possible. To do this, the following pattern is used:
The batch matching process registers 4 primary threads on the actively configured thread pool to control the match process:
Gather Thread: This worker is responsible for querying data from the source dataset in 1,000 record batches. The rate at which the records are loaded will depend on the speed of the persistence layer (SanteDB 2.1.x or 2.3.x) as well as the disk performance of the storage technology.
Match Thread: This worker is responsible for breaking the 1,000 record batches into smaller partitions of records (depending on CPU of host machine). The configured matching algorithm is then launched for each record in the batch on independent threads (i.e. matching is done concurrently on the host machine).
Writer Thread: Once the match thread workers have completed their matching task, the results are queued for the writer thread. The writer thread is responsible for committing the matches to the read/write database.
Monitor Thread: The monitoring thread is responsible for updating the status of the job.
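The gather/match/writer pattern above can be sketched with Python threads and queues. This is a conceptual illustration of the pipeline, not the SanteDB thread-pool implementation; the batch size is reduced from 1,000 for readability, and the monitor thread is omitted:

```python
# Sketch of the gather -> match -> writer batch-matching pipeline.
import queue
import threading

BATCH_SIZE = 4          # SanteDB uses 1,000-record batches
STOP = object()         # sentinel marking the end of the stream

def gather(source, batch_q):
    """Gather thread: read the source dataset in fixed-size batches."""
    for i in range(0, len(source), BATCH_SIZE):
        batch_q.put(source[i:i + BATCH_SIZE])
    batch_q.put(STOP)

def match_worker(batch_q, result_q, match_fn):
    """Match thread: run the configured match function over each batch."""
    while True:
        batch = batch_q.get()
        if batch is STOP:
            result_q.put(STOP)
            break
        result_q.put([match_fn(rec) for rec in batch])

def writer(result_q, sink):
    """Writer thread: commit match results to the read/write store."""
    while True:
        results = result_q.get()
        if results is STOP:
            break
        sink.extend(results)

def run_pipeline(source, match_fn):
    batch_q, result_q, sink = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=gather, args=(source, batch_q)),
        threading.Thread(target=match_worker, args=(batch_q, result_q, match_fn)),
        threading.Thread(target=writer, args=(result_q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

In the real job, the match stage fans each batch out over multiple independent threads; here a single worker keeps the sketch short.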
The performance of the batch matching will depend on the speed of the host machine as well as the version of SanteDB that is being used.
SanteSuite's community server was used for testing in the following configuration:
Application Server:
4 VCPU
4 GB RAM
Non-persistent (ram-only) REDIS Cache
Database Server:
12 VCPU
12 GB RAM
RAID 1+0 SSD disk infrastructure (4+4 SSD)
The versions of SanteDB tested yielded the following results:
Version < 2.1.160 of SanteDB: ~28,000 records per hour
Version > 2.1.165 of SanteDB: ~50,000 records per hour
Version 2.3.x of SanteDB (internal alpha): ~100,000 records per hour
It is important to ensure that your host system is configured such that the thread pool (accessed through the Probes administrative panel) has at minimum, 5 available worker threads to complete batch matching.