Find Same Patterns

Started by Saleh, June 06, 2023, 01:59:25 PM


Saleh

Hi friends, what do you think is the fastest algorithm for identifying similar patterns inside two different files?

for example:

file1.txt:
abcd vbtegu nurf ert jivre wtryg myff vrtgg

file2.txt:
bygw 3wtgr vwrere we ert tge abcd yth5g

Same patterns are:  ert  abcd

MrBcx

Hi Saleh,

Right off the bat, check out Levenshtein Distance: https://en.wikipedia.org/wiki/Levenshtein_distance
It might give you some ideas.
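
If it helps to see it, here's a minimal C++ sketch of the classic dynamic-programming version. Keep in mind Levenshtein measures how many single-character edits turn one whole string into another, which is a different problem from spotting shared tokens, so treat it as background rather than a solution:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Classic two-row dynamic-programming Levenshtein distance: the
// minimum number of single-character insertions, deletions, and
// substitutions needed to turn string a into string b.
int levenshtein(const std::string& a, const std::string& b)
{
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = (int)j;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = (int)i;
        for (size_t j = 1; j <= b.size(); ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({ prev[j] + 1,           // deletion
                                cur[j - 1] + 1,        // insertion
                                prev[j - 1] + cost }); // substitution
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

int main()
{
    std::cout << levenshtein("kitten", "sitting") << "\n";  // prints 3
}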

I would probably compose my own algorithm in BCX but I would need to know more about the data.

* Do file1.txt and file2.txt have the same number of lines of text in them?

* Are you wanting to compare the matches only between corresponding file1 and file2 line numbers?
In other words, if file1.txt, line number 1 contains one or more patterns that might be found
in multiple lines of file2.txt, is that something you want reported?

* Are the patterns ( I'd call them tokens ) always space separated, like in your examples?

The basic approach I would use is:

parse the patterns from file1.txt into a string array

parse the patterns from file2.txt into a string array

iterate through the first string array looking for matches in the second string array.
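
In rough C++ terms (I'd compose the real thing in BCX), those three steps look something like this, with the two lines from your example standing in for the parsed file contents:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split a line of space-separated tokens into a string array.
std::vector<std::string> tokens(const std::string& line)
{
    std::istringstream ss(line);
    std::vector<std::string> out;
    std::string t;
    while (ss >> t) out.push_back(t);
    return out;
}

int main()
{
    // Stand-ins for the parsed contents of file1.txt and file2.txt.
    std::vector<std::string> a = tokens("abcd vbtegu nurf ert jivre wtryg myff vrtgg");
    std::vector<std::string> b = tokens("bygw 3wtgr vwrere we ert tge abcd yth5g");

    // O(n*m) scan: for every token in the first array,
    // look for an exact match in the second array.
    for (const auto& x : a)                  // prints: abcd, ert
        for (const auto& y : b)
            if (x == y) { std::cout << x << "\n"; break; }
}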


A completely different approach could be to use a dictionary.  Parse and load a dictionary object
with the patterns from file1.txt.  Next, attempt to add the parsed patterns of file2.txt to that
dictionary.  If the add fails, it means it already exists in the dictionary and you've found a match.
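
With C++'s std::unordered_set standing in for the dictionary, it looks roughly like this; insert() returns a flag that is false when the key already exists, which is exactly the "add fails" signal:

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main()
{
    // Stand-ins for the parsed tokens of each file.
    std::vector<std::string> file1 = { "abcd", "vbtegu", "nurf", "ert" };
    std::vector<std::string> file2 = { "bygw", "ert", "abcd", "yth5g" };

    // Load the dictionary with file1's tokens.
    std::unordered_set<std::string> dict(file1.begin(), file1.end());

    // Try to add file2's tokens; insert().second is false when the
    // token is already present, i.e. the add "fails": a match.
    for (const std::string& t : file2)
        if (!dict.insert(t).second)
            std::cout << "match: " << t << "\n";

    // Caveat: a token repeated within file2 would also "fail" against
    // its own earlier insert; test dict.count(t) instead if that matters.
}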



Saleh

Hi Kevin, thank you for your attention.

Quote: * Do file1.txt and file2.txt have the same number of lines of text in them?
Not always, but it can be assumed that the number of lines in both files is the same.

Quote: * Are you wanting to compare the matches only between corresponding file1 and file2 line numbers?
In other words, if file1.txt, line number 1 contains one or more patterns that might be found
in multiple lines of file2.txt, is that something you want reported?
No, line numbers don't matter; it's enough to determine whether even one of the thousands of patterns inside file1 exists anywhere in file2.

Quote: * Are the patterns ( I'd call them tokens ) always space separated, like in your examples?
Yes, of course, though the delimiter can be anything, such as - or something else. The patterns are numbers; they're actually nothing but CRC32 hashes:
Like 12455246-589856-57986-57898666

Quote: The basic approach I would use is:

parse the patterns from file1.txt into a string array

parse the patterns from file2.txt into a string array

iterate through the first string array looking for matches in the second string array.


A completely different approach could be to use a dictionary.  Parse and load a dictionary object
with the patterns from file1.txt.  Next, attempt to add the parsed patterns of file2.txt to that
dictionary.  If the add fails, it means it already exists in the dictionary and you've found a match.

Interesting idea, Kevin. To complete the picture of what I have in mind: the files are very large, around 500 MB, and contain fixed-length numeric strings (CRC32) separated by -.
Exactly like the following two files:

FILE1.TXT
{
   78576585674-8968457469-3553545466-136467645664-....
}

FILE2.TXT
{
   345647457865-8968457469-68758978946-67885785677-....
}

airr

Hi, Saleh.

Would you be willing to provide two sample files we could work with containing a subset of the entire set?

Also, what are you using to generate the CRC32 output as decimal?  I ask because the tool I use (rhash, which also handles a bunch of other formats: md5, sha, etc.) outputs in hex, as does the crc32 utility available on Linux.  Having it generate a hex number saves space in the long run.

Thanks,

AIR.

airr

If you don't mind having your eyes possibly glazed over by the time you finish reading, you may want to look at STRING SEARCHING ALGORITHMS

Saleh

Quote from: airr on June 07, 2023, 10:07:37 PM
Hi, Saleh.

Would you be willing to provide two sample files we could work with containing a subset of the entire set?

Also, what are you using to generate the CRC32 output as decimal?  I ask because the tool I use (rhash, which also handles a bunch of other formats: md5, sha, etc.) outputs in hex, as does the crc32 utility available on Linux.  Having it generate a hex number saves space in the long run.

Thanks,

AIR.
Hi air,
CRC32 was just an example to illustrate the issue; you can even assume that we compressed, say, the ten-digit number 4356432546 into something like "XcT3".

For example, we can consider:

FILE1.TXT
{
   awzft-mytyh4-sffre-Qarf4-....
}

FILE2.TXT
{
   frtge,Ueguj,tyhwgf,sffre-...
}

The correct answer should be: sffre

The problem is that reading the file line by line, or even loading it into memory all at once, slows down the algorithm.

Quote: If you don't mind having your eyes possibly glazed over by the time you finish reading, you may want to look at STRING SEARCHING ALGORITHMS
Thanks for sharing. I will definitely take a look at these solutions.

MrBcx

A few more thoughts on this ...

When you said the data were numbers, I thought the search time could be significantly reduced because string searching is inherently more costly.  But when you said you were applying your own compression technique that transforms the numbers into text, that possibility of faster scanning went out the window.

Whichever way you decide to load the data, removing duplicates from file1.txt and file2.txt before storing those in a data structure will help speed up the comparisons.  In other words, if file1.txt contains tokens that are duplicated inside file1.txt, remove the duplicates from the file1.txt input stream.

Then do the same thing for file2.txt.
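
A hash set gives you that de-duplication for free while loading. A small C++ sketch, assuming '-' is the separator:

#include <fstream>
#include <string>
#include <unordered_set>

// Load every '-'-separated token of a file into a set; duplicate
// tokens within the same file are silently absorbed by insert().
std::unordered_set<std::string> load_tokens(const std::string& path)
{
    std::unordered_set<std::string> out;
    std::ifstream in(path);
    std::string tok;
    while (std::getline(in, tok, '-')) {
        // Strip stray whitespace/newlines around the token.
        tok.erase(0, tok.find_first_not_of(" \t\r\n"));
        tok.erase(tok.find_last_not_of(" \t\r\n") + 1);
        if (!tok.empty()) out.insert(tok);
    }
    return out;
}

Call it once per file and you end up with two duplicate-free sets to compare.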



MrBcx

Saleh, I posed your original question to ChatGPT:

Question:  What do you think is the fastest algorithm for identifying similar patterns inside two different files?

Answer:  There isn't a definitive answer to the question of the fastest algorithm for identifying similar patterns inside two different files, as it depends on various factors such as the size and complexity of the files, the type of patterns being sought, and the specific requirements of the task. However, I can mention a few common algorithms that are often used for pattern matching and comparison tasks:

    Hash-based algorithms: These algorithms involve computing hash values for patterns in the files and comparing them to identify similarities. Examples include Rolling Hash and Rabin-Karp algorithm. Hash-based algorithms can be efficient for finding exact matches but may not be suitable for more complex pattern matching.

    Edit distance algorithms: These algorithms measure the difference between two strings by counting the minimum number of operations required to transform one string into another. Levenshtein distance and its variations are commonly used for comparing the similarity of sequences or text.

    Suffix trees and arrays: Suffix trees and suffix arrays are data structures that store all the suffixes of a given string. They can be used for efficient substring matching and can help identify common substrings between two files.

    Longest Common Subsequence (LCS): LCS algorithms find the longest subsequence shared by two sequences. These algorithms are useful for finding patterns that do not need to be contiguous and can be applied to text or other sequences of symbols.

    Content-based algorithms: These algorithms involve analyzing the content of the files using techniques such as text mining, natural language processing, or machine learning. They can be effective for finding semantic similarities between files but may be more computationally expensive.

The choice of algorithm depends on the specific requirements of your task, the nature of the patterns you are trying to identify, and the trade-offs between speed and accuracy. It's often beneficial to analyze and understand the characteristics of your data before selecting an algorithm or combination of algorithms to achieve the desired performance.
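
To make the first item concrete: a rolling hash updates the previous window's hash in O(1) instead of recomputing it from scratch. Here's a minimal C++ sketch of the polynomial rolling hash that Rabin-Karp uses (64-bit unsigned arithmetic wraps modulo 2^64, which serves as the modulus here):

#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    const std::string text = "abcd vbtegu nurf ert";
    const size_t W = 4;       // window length
    const uint64_t B = 257;   // polynomial base

    // Precompute B^(W-1) for removing the outgoing character.
    uint64_t high = 1;
    for (size_t i = 1; i < W; ++i) high *= B;

    // Hash of the first window.
    uint64_t h = 0;
    for (size_t i = 0; i < W; ++i) h = h * B + (unsigned char)text[i];
    std::cout << "hash[0] = " << h << "\n";

    // Roll: drop the leftmost character, bring in the next one.
    for (size_t i = W; i < text.size(); ++i) {
        h = (h - (unsigned char)text[i - W] * high) * B
            + (unsigned char)text[i];
        std::cout << "hash[" << i - W + 1 << "] = " << h << "\n";
    }
}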

Saleh

Hi Kevin, thank you for your attention to this issue.
It is true that the contents of both files can be written as sequences of numbers, but in fact there is no numerical relationship between the lines that would allow special numeric sorting algorithms, and there are also no duplicates within the files. In fact, each line represents a non-repeating random string.
The patterns are simple and straightforward, so I think hash-based algorithms are suitable for this. Anyway, if you have an idea or a piece of code in this area, it would be very useful.
Thank you.

MrBcx

One thing seems certain - you're going to need a lexer to break apart the strings: 

I like this one ...  ;)

https://bcxbasiccoders.com/smf/index.php?topic=529
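
Since you asked for code, here's a stand-alone C++ sketch that puts the pieces together: stream both files token by token so neither has to be loaded whole, hash file1's tokens into a set, then report each file2 token found in the set. The file names and the separator set are placeholders; adjust them to your data:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

// Read a file token by token (any of the given delimiter characters,
// or a line break, ends a token) without holding the whole file in
// memory, calling fn(token) for each complete token.
template <typename Fn>
void for_each_token(const std::string& path, const std::string& delims, Fn fn)
{
    std::ifstream in(path, std::ios::binary);
    std::string tok;
    char c;
    while (in.get(c)) {
        if (delims.find(c) != std::string::npos || c == '\n' || c == '\r') {
            if (!tok.empty()) { fn(tok); tok.clear(); }
        } else {
            tok += c;
        }
    }
    if (!tok.empty()) fn(tok);
}

int main()
{
    const std::string delims = "-, ";   // adjust to the actual separators

    // Pass 1: hash every distinct token of file1.txt into a set.
    std::unordered_set<std::string> seen;
    for_each_token("file1.txt", delims,
                   [&](const std::string& t) { seen.insert(t); });

    // Pass 2: stream file2.txt and report tokens already in the set.
    std::unordered_set<std::string> reported;   // avoid double-reporting
    for_each_token("file2.txt", delims,
                   [&](const std::string& t) {
                       if (seen.count(t) && reported.insert(t).second)
                           std::cout << t << "\n";
                   });
}

For 500 MB inputs, the set built from file1.txt is the main memory cost; with fixed-length, duplicate-free tokens that is roughly the file's size plus hash-table overhead.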

Saleh

Quote from: MrBcx on June 08, 2023, 02:28:43 PM
One thing seems certain - you're going to need a lexer to break apart the strings: 

I like this one ...  ;)

https://bcxbasiccoders.com/smf/index.php?topic=529

Thanks, Kevin. Yes, a quick lexer is a basic prerequisite.