PhraseMiner
PhraseMiner is a powerful tool for identifying and extracting intra-document fuzzy match sentences and frequently occurring sub-sentence phrases, expressions and project-specific terms in several languages. It is an easy-to-use set of Word macros with a download size of under 200KB. It needs no installation as such: you just copy it to the Word template directory and load it as an add-in. PhraseMiner includes four main modules (FuzzyMiner, LCSMiner, SubsegmentMiner and TermMiner) that let you perform a whole range of useful text analysis and extraction functions at the click of a button and that provide results in a form that can be fed into translation environments such as Déjà Vu or memoQ. It requires Word 2003 or later and will not run natively on a Mac.
FuzzyMiner identifies and extracts "internal" or intra-document fuzzy matches, i.e. sentences in the current document that are similar to each other but which do not necessarily yet have a corresponding fuzzy match in a TM. While analysis against a TM often flags a disappointing number of fuzzy matches, there are often quite a few such "intra-document" matches in the average document. Some CAT tools give you a percentage analysis of such sentences or "homogeneity" analysis but do not identify or extract the actual sentences to let you work on them in one go. FuzzyMiner opens these sentences in a new document and displays the first fuzzy in each series in normal font and the other "fuzzy repetitions" in each series in italics underneath.
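To make the idea of an internal fuzzy series concrete, here is a minimal Python sketch. It is not PhraseMiner's own code (PhraseMiner is a set of Word macros); the function name `group_internal_fuzzies`, the 0.75 threshold and the use of `difflib` character similarity are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def group_internal_fuzzies(sentences, threshold=0.75):
    """Group sentences that resemble the first sentence of an existing series.

    Hypothetical helper: the similarity measure and threshold are assumptions,
    not PhraseMiner's actual matching rules.
    """
    series = []  # each entry: [first_sentence, fuzzy_repetition, ...]
    for sentence in sentences:
        for group in series:
            ratio = SequenceMatcher(None, group[0].lower(), sentence.lower()).ratio()
            if ratio >= threshold:
                group.append(sentence)  # a "fuzzy repetition" of the series head
                break
        else:
            series.append([sentence])  # no similar series yet, so start a new one
    # Keep only series that actually contain at least one fuzzy repetition.
    return [group for group in series if len(group) > 1]
```

In each returned group, the first item plays the role of the sentence shown in normal font and the remaining items are the ones that would be shown in italics. Note that this sketch also groups exact repetitions, since they score 1.0.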
LCSMiner compares each sentence against all the other sentences to identify and extract the Longest Common Subsegment (consecutive sequence of words) in any pair of sentences. Some documents have almost no fuzzy matches but may have quite a high percentage of sentences with a common sub-sentence portion. Tests have shown that by setting the number of consecutive words to about five, LCSMiner can often provide another level of leveraging not available through conventional fuzzy matching mechanisms. Like FuzzyMiner, LCSMiner opens a new document and displays the first LCS in normal font and the other LCS leveraged sentences in italics.
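The pairwise comparison described above can be sketched as follows. This is an illustration only, assuming whitespace tokenisation and a `min_words` cut-off of five; the function name `longest_common_subsegments` is hypothetical and the real macros may work differently.

```python
from difflib import SequenceMatcher
from itertools import combinations


def longest_common_subsegments(sentences, min_words=5):
    """Return (i, j, phrase) for each sentence pair sharing a long enough word run.

    Assumption: words are split on whitespace and compared case-insensitively.
    """
    results = []
    for (i, a), (j, b) in combinations(enumerate(sentences), 2):
        words_a, words_b = a.lower().split(), b.lower().split()
        matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
        # Longest run of consecutive words present in both sentences.
        match = matcher.find_longest_match(0, len(words_a), 0, len(words_b))
        if match.size >= min_words:
            phrase = " ".join(words_a[match.a:match.a + match.size])
            results.append((i, j, phrase))
    return results
```

Comparing every pair is quadratic in the number of sentences, which is acceptable for a sketch but would need care on very long documents.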
SubsegmentMiner identifies and extracts sentence subsegments, i.e. whole sentences that are contained in longer sentences but are not long enough to be flagged as a fuzzy match at sentence level.
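A rough sketch of this containment check, assuming that punctuation and case are ignored and that a "contained" sentence must appear as an unbroken run of words in the longer one. The names `find_subsegments` and `contains_run` are made up for this example.

```python
import re


def words(sentence):
    """Lower-cased word tokens, with punctuation dropped (an assumption)."""
    return re.findall(r"\w+", sentence.lower())


def contains_run(long_words, short_words):
    """True if short_words occurs as a consecutive run inside long_words."""
    n = len(short_words)
    return any(long_words[k:k + n] == short_words
               for k in range(len(long_words) - n + 1))


def find_subsegments(sentences):
    """Return (short_sentence, longer_sentence) pairs where the short one is contained."""
    tokenised = [words(s) for s in sentences]
    hits = []
    for i, short in enumerate(tokenised):
        for j, long_ in enumerate(tokenised):
            # The strict length test also rules out comparing a sentence with itself.
            if short and len(short) < len(long_) and contains_run(long_, short):
                hits.append((sentences[i], sentences[j]))
    return hits
```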
TermMiner (TermMinerEn, Fr, Sp, Ger and Swe) uses stop words and regular expressions to identify and extract key sub-sentence expressions and terms in English, French, Spanish, German or Swedish. The results can be sorted by frequency, highlighted in the source document and displayed in context. After opening the source file and pressing the TermMiner button, you are prompted to choose a stop word file in the appropriate language (supplied with PhraseMiner). The advantage of this approach is that the user has access to these files (which are just Word files containing a list of words) and can tweak the stop words according to the results obtained. TermMiner calculates the number of times each extracted term or phrase occurs in the source document and displays them in a new document:
3 currency and treasury bonds
7 safe-haven asset
2 debt ceiling
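For readers curious about how stop words can drive term extraction, the following Python sketch shows one common approach: cut the text at punctuation and stop words, then count the multi-word chunks that remain. This is an assumed technique for illustration, not a description of TermMiner's internals; the real macros may well treat some stop words differently (the sample term "currency and treasury bonds" suggests they do). The function name `mine_terms` and the `min_words` parameter are hypothetical.

```python
import re
from collections import Counter


def mine_terms(text, stop_words, min_words=2):
    """Count candidate terms: runs of non-stop words between stop words or punctuation."""
    counts = Counter()
    # Punctuation ends a candidate phrase, so split on it first.
    for chunk in re.split(r"[.,;:!?()\n]+", text.lower()):
        phrase = []
        # The trailing None flushes the phrase still open at the end of the chunk.
        for word in re.findall(r"[\w'-]+", chunk) + [None]:
            if word is None or word in stop_words:
                if len(phrase) >= min_words:
                    counts[" ".join(phrase)] += 1
                phrase = []
            else:
                phrase.append(word)
    return counts
```

Editing the stop word set changes where phrases are cut, which is the same kind of tweaking the stop word files allow.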
HLSource highlights fuzzies, terms and LCSs in the original document. You start from the document containing the extracted fuzzies, LCSs or terms (without frequencies), with the source file open in the background.
SortByFreq sorts the TermMiner terms list first by descending frequency, with the most frequent term at the top, and then in alphabetical order:
7 safe-haven asset
3 currency and treasury bonds
2 debt ceiling
DelFreq removes the frequencies and the tab from the SortByFreq terms.
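The two list operations above are simple enough to sketch in a few lines of Python. The sketch assumes each line has the form frequency, tab, term, which is inferred from the description of DelFreq; the function names `sort_by_freq` and `del_freq` are illustrative only.

```python
def sort_by_freq(lines):
    """Sort 'frequency<TAB>term' lines by descending frequency, then alphabetically."""
    def key(line):
        freq, term = line.split("\t", 1)
        return (-int(freq), term.lower())
    return sorted(lines, key=key)


def del_freq(lines):
    """Drop the leading frequency and tab, leaving only the term."""
    return [line.split("\t", 1)[1] if "\t" in line else line for line in lines]
```

Applying `del_freq` to the output of `sort_by_freq` yields a plain term list in frequency order, which is the form HLSource expects (terms without frequencies).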
DVXEVFuzzyMin basically does the same thing as FuzzyMiner but starts from a DVX external view table and also inserts an "Internal fuzzy" comment in the "Comments" column. The same thing could be done with a memoQ external view table. The rows containing the extracted internal fuzzies can then be viewed together in DVX by selecting "All rows with comments".
Segment: splits the source document into sentences for use with Context and SplitAtComma.
SplitAtComma: having split the source document into sentences with Segment, SplitAtComma further splits the sentences of the source document at commas. Quite often, subdividing sentences at commas can provide even further leverage for FuzzyMiner, LCSMiner and SubsegmentMiner.
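As a rough illustration of these two steps, the sketch below segments text with a deliberately naive rule (a sentence ends at a full stop, exclamation mark or question mark followed by whitespace) and then splits each sentence at commas. The rule and the function names `segment` and `split_at_comma` are assumptions; the segmentation rules actually used by the macros are not documented here.

```python
import re


def segment(text):
    """Split text into sentences at ., ! or ? followed by whitespace (naive rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_at_comma(sentences):
    """Further split already segmented sentences at commas."""
    parts = []
    for sentence in sentences:
        parts.extend(p.strip() for p in sentence.split(",") if p.strip())
    return parts
```

The shorter units produced by `split_at_comma` give a matcher more chances to find repeated material, which is the extra leverage mentioned above.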
TermSnts identifies and extracts the sentences from the segmented document that contain two or more terms mined by TermMiner, or any other list of terms.
So if you have a list of terms like this:
great freedom
form sentences
larger phrases
clumped into groups
time consuming
TermSnts will extract sentences like this (where each sentence extracted contains two or more of the above terms):
Although we may feel that we have great freedom in how we can use words to express ourselves, the truth is that languages impose strong constraints on how words can be combined to form sentences.
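The selection rule used by TermSnts (keep a sentence if it contains two or more listed terms) can be sketched as follows. This assumes simple case-insensitive substring matching; the function name `sentences_with_terms` and the `min_terms` parameter are hypothetical.

```python
def sentences_with_terms(sentences, terms, min_terms=2):
    """Return the sentences that contain at least min_terms of the given terms."""
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Case-insensitive substring test: an assumption, not the macro's exact rule.
        found = [term for term in terms if term.lower() in lowered]
        if len(found) >= min_terms:
            hits.append(sentence)
    return hits
```

With the term list above, the example sentence is kept because it contains both "great freedom" and "form sentences"; a sentence containing only "time consuming" would be skipped.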
PhraseMiner screenshots
PhraseMiner is available under the same conditions as CodeZapper, i.e. you pay a small, one-time development donation of twenty euros, which entitles you to free future updates.
The latest version is 3.1.8.
Contact me here to obtain PhraseMiner by email