I have a requirement to compare one million DB records against another one million records in a second database. Could you please suggest the best possible approach to achieve this?
Ranjini Elamkulam Mana
Please provide specific details of your request.
Does every record from one DB need to be compared against every record in the other?
Is the comparison based on a primary key?
What you have said is very broad and abstract; please provide more details so we can help you.
Yes, we would like to compare every record in the first table against every record in the second table.
Even before addressing questions about technology, you realize that is 1m^2 comparisons, right (i.e. 1,000,000,000,000)?
That's correct... we have to perform that many comparisons.
So, assuming each comparison takes 1ms, that will take 31 years.
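As a sanity check on that estimate, here is the arithmetic spelled out (the 1 ms per comparison figure is the assumption from the comment above):

```python
# Naive pairwise comparison: every record in table A against every record in table B.
rows_a = rows_b = 1_000_000
comparisons = rows_a * rows_b            # 10^12 comparisons
seconds = comparisons * 0.001            # assuming 1 ms per comparison
years = seconds / (365.25 * 24 * 3600)
print(f"{comparisons:,} comparisons ≈ {years:.1f} years")  # ≈ 31.7 years
```

So "31 years" is not an exaggeration; the quadratic approach is simply off the table at this scale.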
This can be achieved in Ab Initio using a join component... I believe it can do this in a few minutes.
So presumably, then, you will be joining the records on some sort of key? In that case, you wouldn't be comparing each record to each other record. If so, you should be able to use the Change Data Capture shape (described in a comment below).
Could you please help me understand how Change Data Capture works? Do you have an example showing its functionality?
Absolutely: Find Changes shape
Out of curiosity, can't this be done using SQL queries instead?
The ChangeDataCapture shape works pretty well if you need to compare two different sets with the same primary key. This is not a quadratic-complexity algorithm, since it uses the primary key to make it linear in the size of the set. We have this in production with millions of records and it performs pretty well.
If you're sure you need a 1:1 comparison of every record, you must go into more detail about the overall algorithm (what is a "comparison"? which result do you want to retain from each comparison? can you stop before the end if you find a "good" result? etc.) in order to optimize it. Once you have a very detailed algorithm, the community will surely help you with more specific hints.
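To illustrate why a key-based compare is linear rather than quadratic: the sketch below is not Boomi's actual implementation, just a hypothetical version of the same idea. One set is indexed by primary key in a single pass, then the other set is streamed through it, giving O(N + M) work instead of N×M comparisons. The `find_changes` function and the sample rows are made up for illustration.

```python
# Hypothetical sketch of a key-based "find changes" comparison.
# Index the old set by primary key, then stream the new set through it:
# O(N + M) with a hash index instead of N * M pairwise comparisons.

def find_changes(old_rows, new_rows, key="id"):
    """Classify new_rows as added/changed/unchanged vs old_rows; report deletes."""
    old_by_key = {row[key]: row for row in old_rows}   # one pass over the old set
    added, changed, unchanged = [], [], []
    for row in new_rows:                               # one pass over the new set
        old = old_by_key.pop(row[key], None)
        if old is None:
            added.append(row)                          # key not present in old set
        elif old != row:
            changed.append(row)                        # key present, payload differs
        else:
            unchanged.append(row)
    deleted = list(old_by_key.values())                # keys never matched by new set
    return added, changed, deleted, unchanged

old = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
new = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
added, changed, deleted, unchanged = find_changes(old, new)
print(added)    # [{'id': 3, 'name': 'c'}]
print(changed)  # [{'id': 2, 'name': 'B'}]
print(deleted)  # [{'id': 1, 'name': 'a'}]
```

Without a shared key to join on, none of this applies and you are back to the quadratic comparison, which is why the earlier questions about the key matter so much.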
Boomi is an integration platform; my 2 cents are that comparing millions of records between two different databases is not a Boomi task and should not be done in Boomi.
It is a database task. You have to look at it from the database's point of view: what options are available in databases to do this? Putting this in Boomi is not a good design, from any aspect.
You may write database objects to do this and simply call them from a Boomi process. I don't know Ab Initio, but it must be using some DB techniques to perform this; explore what it is doing behind the scenes and implement that logic in your database.
Assuming that you have a proper database design, with all the correct indexes in place, you should do this in the database, using either an INNER/OUTER JOIN or a CROSS/OUTER APPLY.
Doing this anywhere else will require transmitting all those records before stepping through each one, which you want to avoid.
As mentioned by others here though, it really depends on what you're comparing, and what you're looking for in there.
I mean, there are materialised views if you need the query to run really fast, if it runs regularly and the data doesn't change that often. If this is all an offline operation and speed is less of a concern, then there wouldn't be any point.
Then, for the comparisons in the database: the query you design needs to work on the data as a set operation, not on a record-by-record basis, if that makes sense. You will probably need a common table expression type of setup, as this allows you to filter datasets inline and then perform operations on those datasets (similar to temp tables in how it operates). That functionality, wrapped in a stored procedure, could then be executed from Boomi. This definitely isn't a one-size-fits-all answer, and with this amount of data you need to know what you're doing.
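A minimal sketch of the set-based idea, using SQL's `EXCEPT` so the engine does the work instead of the client stepping through rows. This uses Python's built-in `sqlite3` as a stand-in for the real database, and the table and column names (`source_tbl`, `target_tbl`, `id`, `val`) are invented for illustration:

```python
# Set-based comparison in SQL (sqlite3 stands in for the real database;
# table and column names are illustrative). EXCEPT returns rows present
# in one result set but not the other, covering both missing and changed rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_tbl (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE target_tbl (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO source_tbl VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO target_tbl VALUES (1, 'a'), (2, 'B'), (4, 'd');
""")

# Rows in source that are missing from or different in target, and vice versa.
only_in_source = conn.execute(
    "SELECT id, val FROM source_tbl EXCEPT SELECT id, val FROM target_tbl"
).fetchall()
only_in_target = conn.execute(
    "SELECT id, val FROM target_tbl EXCEPT SELECT id, val FROM source_tbl"
).fetchall()
print(sorted(only_in_source))  # [(2, 'b'), (3, 'c')]
print(sorted(only_in_target))  # [(2, 'B'), (4, 'd')]
```

In a real deployment the two queries would run inside a stored procedure on the database server, and the Boomi process would only invoke it and pick up the (much smaller) difference set.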