MultiTrans 4.4 Term Extractor Tutorial, Level I




I. Introduction


When you create a TextBase in MultiTrans (as described in the MultiTrans TextBase Builder Tutorial, Level I), you have the option of automatically creating a Terminology Extraction file. When this option is selected, MultiTrans automatically creates two lists for each of the languages in the TextBase. The first lists the word forms and their frequencies in the TextBase (similar to the list created by the WordSmith Tools WordList function as described in the WordSmith Tools WordList Tutorial, Level I). The second lists sequences of two or more word forms (candidate multi-word terms) that occur at least twice in the TextBase.
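
To make the idea of the two lists more concrete, here is a minimal Python sketch of how lists of this kind can be built. It is only an illustration and not MultiTrans's actual algorithm: the function names (tokenize, word_frequencies, repeated_sequences), the tokenization rule, the maximum length of four words and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word forms (hypothetical tokenization rule)."""
    # Letters/digits, optionally joined by a hyphen or an apostrophe.
    return re.findall(r"[^\W_]+(?:[-'’][^\W_]+)*", text.lower())


def word_frequencies(tokens):
    """First list: each word form with its number of occurrences."""
    return Counter(tokens)


def repeated_sequences(tokens, max_len=4, min_freq=2):
    """Second list: sequences of 2..max_len word forms seen min_freq times or more.

    This simple sketch ignores sentence boundaries, so a sequence may
    span the end of one sentence and the start of the next.
    """
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_freq}


sample = ("Body mass index is a simple index. "
          "Body mass index is used to classify obesity.")
tokens = tokenize(sample)
words = word_frequencies(tokens)
candidates = repeated_sequences(tokens)

print(words.most_common(3))
print(sorted(candidates.items()))
```

Notice that a purely frequency-based approach like this one also returns sequences such as "index is", which are repeated but are not terms. This is one reason the lists contain candidate terms that still need to be evaluated by a person.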


To see how the term extractor fits into the translation process in MultiTrans, consult the MultiTrans Work Flow diagram.


You can find out more about MultiTrans by consulting the MultiCorpora website at http://www.multicorpora.com. When you open MultiTrans, you can also click on the MultiTrans Help icon to read the help files providing information on MultiTrans’s different functions.


II. Getting ready


  1. Save the files you will need for the exercises:
    1. Create a sub-directory called MultiTrans Term Extraction (or any other name that you wish). (For instructions, see Creating a sub-directory in Windows.)
    2. Copy or move to this sub-directory the TextBase and TermBase you created in the MultiTrans TextBase Builder Tutorial, Level I. Be sure to copy or move all of the files that MultiTrans has created for this TextBase and TermBase. For a list of these files, see the MultiTrans TextBase Builder Tutorial, Level I. (See Note 1.)
  2. Open MultiTrans.


III. Viewing the terminology extraction lists created by MultiTrans


  1. Open your TextBase or the TextBase you saved and extracted to your folder at the beginning of this tutorial. (For instructions, see Opening a TextBase and TermBase in MultiTrans.)
  2. The TextBase opens in the TextBase Search window. The TextBase window is divided into three sections (not including the Process Bar, which you will see on the left unless you have chosen to hide it).
    1. The rightmost part of the screen is divided vertically in two. The left-hand pane contains the source (French) text, and the right-hand pane contains the target (English) text. You will notice different colours of text; the colours are there simply to differentiate between segments in the texts.
    2. The pane to the left contains three tabs: Search, TextBases and Terminology. The TextBases tab is divided into two sub-tabs at the bottom: File List and Alignments. The Terminology tab is also divided into two sub-tabs at the bottom: Word Count and Term Count.
    3. The Word Count sub-tab displays the list of graphical words in the TextBase and their frequencies. If a paperclip icon appears beside a word form, it indicates that there is a record for this word form in the TermBase.
    4. The Term Count sub-tab shows you automatically identified candidate multi-word terms and their frequency in the TextBase. If a paperclip icon appears next to the candidate term, this indicates that there is a record for this unit in the TermBase. (See note.)


IV. Evaluating the kinds of items that are extracted and analyzing the usefulness of each list


  1. Look through the TextBase and identify five to ten single- and multi-word items in your source language (French) that you think might be pertinent for terminological or other research.
  2. Look at the Word Count sub-tab of the Terminology tab.
    1. What kinds of units are identified (parts of speech, forms, etc.)? How are they organized?
    2. Click on the headings of the columns. What does this do? How can this help you to find items that interest you?
    3. Do you find all of the single-word items you identified as potentially pertinent for research? Where are they in the lists?
    4. What proportion of the items identified do you think might be useful for doing research? Where are they located in the results?
  3. Look at the Term Count sub-tab.
    1. What kinds of units (or series of word forms) are identified (parts of speech, structures, etc.)? How are they organized?
    2. Click on the headings of the columns. What does this do? How can this help you to find items that interest you?
    3. Do you find all of the multi-word items you identified as potentially pertinent for research? Where are they in the lists?
    4. What proportion of the candidate terms identified do you think might be useful for doing research? Where are they in the lists?
    5. Do you observe challenges in the identification of these multi-word units? What are they?
    6. Do you observe any potential challenges that MultiTrans handled well? How do you think it did so?
  4. From the File menu, select Close TextBase Search. You will be returned to the MultiTrans General references page.


V. Observing the options available in MultiTrans for term extraction 


  1. Find the copies of the Word files you downloaded and used to create your TextBase in the MultiTrans TextBase Builder Tutorial, Level I (WHO-obesityEN.doc and WHO-obesityFR.doc). (See note.)
  2. Click on the TextBase Builder icon on the MultiTrans General references page. Follow the first few steps for creating a new TextBase, as you did in the MultiTrans TextBase Builder Tutorial, Level I, to create a new French-English TextBase. Stop at Step 3 – Output Method Selection.
    1. When asked by MultiTrans, give this new TextBase a name such as YourLastName_Extraction.tcs and store it in the sub-directory you created to store the files for this exercise.
  3. When you reach Step 6 – Term Extractor, ensure that the Create Terminology Extractor file checkbox is checked. Click on the Customize button beside this option. The Terminology Extractor Options dialogue box opens.
  4. The first tab displayed is the Exclude List tab. This tab allows you to specify a list of word forms that will be excluded from extraction (e.g. that will not be identified as beginnings or ends of candidate terms). Make sure the Use exclude list (recommended) checkbox is checked.
  5. Click the Browse button beside the English field. Find and select the exclude list for English: exclude.en. Click the Open button. In the French field, do the same for the French exclude list: exclude.fr.
  6. The Length tab allows you to specify a maximum length for the candidate terms extracted. Check the checkbox for this option and, using the arrow buttons, set this length to 6 words.
  7. The Frequency tab allows you to set the minimum number of occurrences a candidate term must have to appear in the list. Using the arrow buttons, set this frequency to 3.
  8. Click the OK button. You will return to the Step 6 – Terminology Extractor dialogue box.
  9. Follow the remaining steps to create and open your new TextBase. (You can continue to use the TermBase you created earlier in this exercise by selecting it at the appropriate step in opening the TextBase.)
  10. Evaluate the results of the terminology extraction. Do you see any differences in the extraction of the terminology, as compared to the original TextBase? If so, what are they and how, in your opinion, do they affect the usefulness of the lists?
  11. If you wish, try repeating the steps above, but choosing different settings. What effect do these changes have on the results? (See note.)
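
If you would like to see how the three kinds of settings above can interact, the following Python sketch applies an exclude list, a maximum length and a minimum frequency to a small sample. It is a simplified illustration under stated assumptions, not a description of how MultiTrans works internally: the function name extract_candidates, the sample text and the small exclude set are all hypothetical, and the real exclude.en and exclude.fr files are not read here.

```python
from collections import Counter


def extract_candidates(tokens, exclude, max_len=6, min_freq=3):
    """Return (sequence, frequency) pairs, most frequent first.

    Assumed filters, mirroring the settings chosen in the steps above:
    - a sequence may not begin or end with a word form in `exclude`;
    - a sequence has between 2 and `max_len` word forms;
    - a sequence must occur at least `min_freq` times.
    """
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in exclude or gram[-1] in exclude:
                continue  # excluded word form at the beginning or end
            counts[" ".join(gram)] += 1
    kept = ((seq, c) for seq, c in counts.items() if c >= min_freq)
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical sample text and exclude list, for illustration only.
tokens = ("the prevalence of obesity is rising and the prevalence of obesity "
          "in children is high while the prevalence of obesity in adults "
          "varies").split()
exclude = {"the", "of", "in", "is", "and", "while"}

results = extract_candidates(tokens, exclude)
unfiltered = extract_candidates(tokens, set())

print(results)
print(len(unfiltered))
```

In this sample, the exclude list removes sequences such as "the prevalence" and "of obesity" while keeping "prevalence of obesity", because excluded word forms are still allowed inside a candidate. Try changing max_len and min_freq in the sketch to see how the number of candidates changes.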


VI. Wrapping up


  1. Close the TextBase (File > Close TextBase Search) and exit MultiTrans (File > Exit).
  2. To make a copy of your files as a backup or to transfer them to another computer:
    1. In My Computer or from the Start menu, find the sub-directory you created to store the files for this exercise.
    2. Make a compressed folder that contains this sub-directory. (For instructions, see Creating a compressed folder.)
    3. Copy this compressed folder to a USB key or, if it is smaller than 2 MB, e-mail it to yourself as an attachment.


NOTE 1: If you have not completed the MultiTrans TextBase Builder Tutorial, Level I, download the ready-made database (MultiTrans Bases – Extractor and TermBase Manager) and extract its contents to the sub-directory you created above. (For instructions, see Extracting files from a compressed folder.)


NOTE: If you do not see anything on the Terminology sub-tabs, you may have forgotten to check the Create Terminology Extractor File box when you created your TextBase, or to copy all of the files for the TextBase to your sub-directory. If this is the case, you can download the ready-made TextBase as described in the above note to do this exercise, and re-do your TextBase afterwards.


NOTE: If you cannot find these files, you can download the compressed folder MultiTrans TextBase Builder Level I files.


NOTE: To make it easier to compare the various versions of the extraction, you can save the list of candidate terms as a text-only file (.txt). To do this, right-click anywhere in the list of candidate terms (on the Term Count sub-tab of the Terminology tab) and choose the Enregistrer liste au format texte... option ("Save list in text format") from the contextual menu that appears. Choose the directory where you would like to save the new file, which you can then open in Word, another word processor, or even Excel.
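
Once you have saved two versions of the list, you could also compare them with a short script rather than by eye. The Python sketch below shows one possible approach. The layout of the saved file is an assumption here (one candidate term per line, optionally followed by a tab and its frequency), so check your own .txt files and adjust the parsing if they are organized differently. The function names and the sample data are hypothetical.

```python
def parse_candidates(text):
    """Return the set of candidate terms found in the text of a saved list.

    Assumed layout: one candidate per line, optionally followed by a tab
    and a frequency. Only the part before the first tab is kept.
    """
    terms = set()
    for line in text.splitlines():
        term = line.split("\t")[0].strip()
        if term:
            terms.add(term.lower())
    return terms


def compare_lists(first, second):
    """Show which candidates appear in only one list and which are in both."""
    a, b = parse_candidates(first), parse_candidates(second)
    return {
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
        "both": sorted(a & b),
    }


# Hypothetical contents of two saved lists, for illustration only.
default_run = "indice de masse corporelle\t5\nde la santé\t4\nsanté publique\t3\n"
custom_run = "indice de masse corporelle\t5\nsanté publique\t3\n"

diff = compare_lists(default_run, custom_run)
print(diff)
```

To use the sketch with your own files, read each file into a string first (for example with Python's pathlib.Path.read_text) and pass the two strings to compare_lists. You may need to specify the file encoding when reading.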


VII. Questions for reflection


  • As you went through this tutorial, what were your first impressions of the functions and functioning of MultiTrans? 
  • What could the MultiTrans term extractor help you to do? In what kinds of situations?
  • What criteria can be used to evaluate term extractors? (Note that some of them are discussed in the questions you were asked above.)
  • What kinds of word forms were identified by the term extractor? Are these the forms that would usually be included on term records? Why or why not? If you wanted to create term records from the results of the extraction, would you change the forms that were found? Why or why not? (You may want to consider this question again later on, in light of the results of the MultiTrans Tutorial: TermBase Agent, Level I.)
  • Did the MultiTrans term extractor identify any items other than terms that you thought would be interesting for a translator or terminologist to be aware of? If so, what kinds of items? How might they be useful?
  • How does MultiTrans compare to other extractors you have seen?
  • What do you think of the options MultiTrans offers a user to adjust the extraction process? Do you think they are useful? Why or why not? Try experimenting with these settings and comparing your results.
  • What are some of the advantages and disadvantages of using MultiTrans to extract single- and multi-word items (candidate terms)? Compared to a manual approach? Compared to using another tool?


Tutorial created and updated by the CERTT team. (2010-01-29)