WordSmith Tools WordList Tutorial, Level I

 


I. Introduction


 

 

WordSmith Tools, developed by Mike Scott, is a corpus analysis suite that includes three text analysis tools: a monolingual concordancer (Concord) and two wordlist extractors (WordList and KeyWords). This tutorial focuses on the basic features of WordList.

 

You can learn more about WordSmith Tools by consulting www.lexically.net. From the WordSmith Tools page on the www.lexically.net site, you can also download a demo version of WordSmith Tools which offers the same options as the commercial version, but which limits the number of occurrences that are displayed to 25. 

 

WordSmith Tools can process .html, .xml and .txt files. WordList is a tool that analyzes texts or groups of texts (corpora) to extract statistics about the words they contain. The results of this analysis are presented in wordlists (alphabetical lists or lists ordered by frequency of occurrence) and tables. Such information can be used for multiple purposes: studying the lexical characteristics of text types and genres, tracking changes in lexical usage over time, identifying plagiarism and extracting terminology. This tutorial will pay special attention to this last application and will show how WordList can be used as a term extractor.

 

II. Getting Ready


  1. Prepare the files you will need for the exercises:
    1. Create a sub-directory on the U: drive (also called My Documents).
    2. Download the English Wind Power Manual files.
    3. Extract the files to the sub-directory you created.
  2. Open WordSmith Tools.
  3. When WordSmith asks you if you’d like to enable basic functions only, choose No. By choosing No, you will have access to all functions.
    1. The program will open a main window from which the various functions can be accessed. The three principal functions — Concord (C), KeyWords (K) and WordList (W) — are available via buttons. Other functions can be accessed from the Utilities menu.

 

III. Choosing texts


 

  1. Choose the texts for processing:
    1. In the main WordSmith window, go to the File menu and select Choose Texts.
      1. In the Choose texts dialogue box, the files that have been selected for processing are displayed in the right-hand pane, and the directory structure appears in the left-hand pane.
    2. A demo file that comes with the software (a chapter from A Tale of Two Cities by Dickens) appears in the list on the right. Remove this file by selecting it and then pressing the Delete key. If you do not remove this file, your results will include occurrences coming from this text.
    3. From the drop-down menu in the upper left-hand corner of the Choose texts window, select the drive where you have stored the documents that you wish to analyze (U: or My Documents).
    4. You can also choose to see files of only a certain format (e.g. plain text or .txt files) by choosing the corresponding option from the drop-down menu located immediately to the right of the drive menu. (See note.)
    5. In the left-hand pane, browse through the directories on the U: drive to find the files that you wish to process. (Double-click on the yellow folder icons (see screenshot) to open the directories).
    6. Select all the files. (To select all the files at once, select the first one in the list, hold down the Shift key, and click on the final file in the list).
      1. Click on the long vertical button with two arrows that appears between the left-hand pane (the directory structure, labelled Files available) and the right-hand pane (the list of files, labelled Files selected). The files that you have selected should now appear in the right-hand pane. (See note.)
  2. Once you have selected the files you want to use, exit the Choose texts window by clicking on the green check-mark button (see screenshot).

 

IV. Generating a WordList


 

  1. In the WordSmith main window, click the WordList button. The WordList window appears in front of the Getting Started dialogue box.
  2. In the Getting Started dialogue box, click the Make a word list now button. The word list appears in the WordList window.
  3. Observe the information that is displayed. At the bottom of the window, you can see information organized by 5 tabs.
    1. Frequency
      1. The information in this tab is divided into 5 columns:

        • Word: the words that appear in the analyzed texts, ordered by decreasing frequency of occurrence.
        • Freq.: the number of times each word occurs in the set of texts.
        • % (after Freq.): the percentage of all the running words in the set of texts accounted for by each word.
        • Texts: the number of texts that contain that word.
        • % (after Texts): the percentage of the texts that contain that word.

 

      2. What are the most common words? Are there many terms among the highest-ranking words?
      3. Do all of the entries in the list correspond to one and only one word or meaning? Does every word that appears in the texts correspond to one and only one entry in the list? What does this tell you about how WordList works?
      4. Can you think of any strategy to filter the terms?
      5. Is the information found under the Texts column relevant for a terminologist? Why or why not?
    2. Alphabetical
      1. This tab shows the same information as the Frequency tab except that the words are shown in alphabetical order (A-Z).
      2. What is the advantage of studying the word list in this order? What types of words appear together?
    3. Statistics
      1. This tab shows statistics for the whole set of texts and for each individual file. The names of the rows are self-explanatory. (See note.)
    4. File names
      1. This tab lists the names of the files analyzed.
    5. Notes
      1. This tab allows you to make notes about the wordlist.
      2. When could it be useful to analyze a text with a word list? Would a human do a better job at extracting terms? How might a frequency wordlist help or hinder the terminologist?
  4. If you want to use this word list in the future, you can save it by opening the File menu in the WordList window and selecting the Save option (see screenshot). You will then have to select a sub-directory to store the file and give it a name. (See note.)
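To make the columns of the Frequency tab more concrete, the following Python sketch computes the same five values for a tiny invented corpus. It is only an illustration: the tokenizer, the function names and the two sample files are assumptions made for this example and do not reproduce WordSmith's own rules for deciding what counts as a word.

```python
import re
from collections import Counter

# Assumed, simplified tokenizer: runs of letters, with internal ' or - allowed.
TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def tokenize(text):
    """Lower-case the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def make_wordlist(texts):
    """Return rows (word, freq, freq_pct, n_texts, texts_pct), most frequent first.

    `texts` maps a file name to that file's content.
    """
    freq = Counter()        # occurrences of each word in the whole set of texts
    text_count = Counter()  # number of texts that contain each word
    for content in texts.values():
        tokens = tokenize(content)
        freq.update(tokens)
        text_count.update(set(tokens))
    total_tokens = sum(freq.values())
    rows = []
    # Sort by decreasing frequency, then alphabetically to break ties.
    for word, n in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        rows.append((word, n, 100 * n / total_tokens,
                     text_count[word], 100 * text_count[word] / len(texts)))
    return rows


# Hypothetical two-file corpus, invented for illustration.
corpus = {
    "a.txt": "The wind turbine converts wind energy.",
    "b.txt": "The rotor of the turbine turns.",
}
wordlist = make_wordlist(corpus)
for row in wordlist[:5]:
    print("%-10s %3d %6.2f %3d %6.2f" % row)
```

Even in this toy corpus, the grammatical word "the" comes out on top, which is the behaviour the questions above ask you to reflect on.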

 

V. Viewing and manipulating the results


 

  1. Clean the raw wordlist to reduce the number of grammatical words (e.g. articles, prepositions, auxiliary verbs, conjunctions), which are very frequent, due to their linking role in language, and appear at the top of any wordlist.
    1. Make sure you are on the Frequency tab in the WordList window. If you are not, return to it by clicking on the Frequency tab.
    2. The first word of the list appears in blue. In almost any list, this word will be the article the, as this is the most frequent word in written English. To remove it from your list, press the Del key on your keyboard. The word the turns grey and is struck through.
    3. Browse through the list with the arrow keys on your keyboard and delete all the words you do not want in your wordlist. Repeat the previous step as many times as needed. (See note.)
    4. Once you have gone through your entire list (if you are analyzing a short text), or through the most frequent words of the list (if you are working with a corpus or a very long text), and are sure you want to remove the deleted words permanently, go to the Edit menu and select the Zap option (see screenshot). Zapping will remove the deleted words and reorganize the wordlist based on the frequency of the remaining words. (See warning and note.)
  2. Generate a concordance in order to examine the context of a word and evaluate whether the word is worth keeping. Some words are easily identifiable as grammatical words that do not constitute a term. Others fall into a rather grey zone.
    1. Select the word you want to examine.
    2. Open the Compute menu and select the Concordance option (see screenshot). The software automatically launches a simple search for this word in its Concord tool. For more information on how to use Concord, see WordSmith Concord Tutorial and Exercises: Level I. (You can also consult WordSmith Tools Concord Tutorial and Exercise: Level II.)
  3. Sort the results.
    1. Re-sort the wordlist alphabetically, in increasing or decreasing order.
      1. Make sure you are on the Alphabetical tab in the WordList window. If you are not, return to it by clicking on the Alphabetical tab.
      2. Invert the alphabetical order (which is usually set at A-Z) so the words are listed in Z-A order: click the Word button at the top of the Word column, or open the Edit menu and select the Resort option (see screenshot). To return to A-Z order, click the Word button again.
      3. Re-sort the wordlist alphabetically by word ending: open the Edit menu, select the Other sorts option, and, in the new menu, select the Reverse word option (see screenshot).
    2. Re-sort the wordlist by word length.
      1. Open the Edit menu, select the Other sorts option, and, from this menu, select the Word length option (see screenshot).
    3. Re-sort the wordlist by word consistency (presence across texts). This option brings to the top the words that appear in the most texts. This is useful for separating words that are relevant to the whole field from words that are specific to a sub-area of specialization.
      1. Click on the Texts button at the top of the fourth column.
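The sort orders described in this section can be pictured with a short Python sketch. The rows below are invented, and the sort directions (for example, shortest word first) are assumptions made for the illustration; WordSmith's own defaults may differ.

```python
# Hypothetical wordlist rows: (word, frequency, number of texts containing it).
rows = [
    ("turbine", 40, 5),
    ("generation", 12, 4),
    ("rotor", 22, 2),
    ("rotation", 9, 3),
    ("generator", 18, 5),
]

# Alphabetical order, A-Z and Z-A.
alphabetical = sorted(rows, key=lambda r: r[0])
reverse_alpha = sorted(rows, key=lambda r: r[0], reverse=True)

# "Reverse word" sort: compare the words spelled backwards, so that words
# sharing an ending (e.g. -tion) end up next to each other.
by_ending = sorted(rows, key=lambda r: r[0][::-1])

# Word-length sort (shortest first here; ties broken alphabetically).
by_length = sorted(rows, key=lambda r: (len(r[0]), r[0]))

# Consistency sort: words found in the most texts first, then by frequency.
by_consistency = sorted(rows, key=lambda r: (-r[2], -r[1]))

for label, result in [("A-Z", alphabetical), ("Z-A", reverse_alpha),
                      ("ending", by_ending), ("length", by_length),
                      ("consistency", by_consistency)]:
    print(label.ljust(12), [r[0] for r in result])
```

In the sort by ending, "generation" and "rotation" become neighbours, which shows why this order is handy for spotting words built with the same suffix.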

 

VI. Generating an index 


 

An index records the position of each token and type in a text. It looks very much like a wordlist, but it is more powerful: because the positions of the words are stored, the user can compute word clusters, among other information.
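As a rough illustration of the idea, the Python sketch below builds a very simple positional index for two invented files. The function name, the sample corpus and the data layout are assumptions made for this example; they say nothing about the file format WordSmith actually uses.

```python
import re
from collections import defaultdict


def build_index(texts):
    """Map each type to the (file name, token position) pairs where it occurs."""
    index = defaultdict(list)
    for name, content in texts.items():
        tokens = re.findall(r"[a-z]+", content.lower())  # simplified tokenizer
        for position, token in enumerate(tokens):
            index[token].append((name, position))
    return index


# Hypothetical corpus, invented for illustration.
corpus = {
    "a.txt": "Wind power and wind farms",
    "b.txt": "Offshore wind power",
}
index = build_index(corpus)

print(index["wind"])       # every place where the type "wind" occurs
print(len(index["wind"]))  # its frequency can be recovered from the index
```

Because each occurrence is stored with its position, the words standing immediately before or after it can be looked up, which is what makes the computation of clusters possible.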

 

  1. Identify where you want to save your index.
    1. In the main WordSmith Tools window (the window where you access the three analysis tools, Concord, KeyWords and WordList), open the Settings menu and click on the Adjust settings option.
    2. Click on the Index tab.
    3. Under Index File, enter the path to the location where you want to store the index.
      1. Type it, or find it by clicking on the Browse button (see screenshot) that appears to the right.
    4. Validate your changes by clicking on the OK button at the top right of the window.
  2. Generate an index.
    1. In the main WordSmith Tools window, click the WordList button.
    2. From the Getting Started... dialogue box that appears, click the Make/Add to Index button.
      1. In the new dialogue box that appears, you can modify where the index will be stored. To do so, follow the procedure described in step 1.3.
      2. In the same dialogue box, you can also choose between (a) deleting an existing index and creating a new one, (b) backing-up an existing index and adding words to it, and (c) adding words to an existing index without backing it up. (See note.)
      3. Select your preferred location and saving option, and click the OK button. The main WordSmith Tools window will show the status of the process and an information box will appear, notifying you that the indexes have been correctly saved.
  3. Open the index.
    1. In the WordList window, open the File menu and select the Open option.
    2. The software will automatically look for the folder where the indexes are stored. If the folder is not found automatically, browse through your computer and locate the sub-directory you identified in step 1.3 or 2.2.1.
    3. Open the index directory and double-click on the file with the extension .tokens.
      1. A window that resembles a wordlist appears. You will know that you are working with an index because the name of the window is Index: main_index.

 

VII. Generating word-cluster lists


 

Terms can be single-word or multi-word units. Word-cluster lists identify groups of words that tend to appear together and that are therefore likely to be terms. However, not all groups of words that appear together are terms, and the user must use his or her judgement to filter the lists.

 

  1. Open your text or corpus index.
    1. If you have not yet generated an index, read the previous section, Generating an Index.
    2. If you have already generated an index, open the WordList tool from the WordSmith Tools window by clicking on the WordList button.
      1. In the WordList window, open the File menu and select the Open option.
      2. The software will automatically look for the folder where the indexes are stored. If the folder is not found automatically, browse through your computer and locate the sub-directory you identified in step 1.3 or 2.2.1 of the previous section.
      3. Open the index directory, and double-click on the file with the extension .tokens.
  2. Once the Index window is open, open the Compute menu and select the Clusters option (see screenshot).
    1. From the dialogue box that appears, configure the settings of your word clusters.
      1. Select whether to generate clusters for all the words in the list or for only a selection of them.
        • The all option will take longer and will cover all of the words appearing in the list.
        • The selection option can be very useful if you have previously identified words that are likely to be part of multi-word term units. To generate word clusters from selected words, you must first select them by clicking on them. If you want to select more than one word, click on the first one and then hold down the Ctrl key on your keyboard while you click on the rest of the words that you want to compute word clusters for.
      2. Set the size of your word clusters. Size refers to the number of words that the clusters will include. The minimum size is 2 words; the maximum is 8.
      3. Set the minimum frequency of the word cluster. This is the number of times that a word cluster must appear in the text in order to be included in the word-cluster list. The appropriate threshold depends on the size of the text or corpus analyzed: the larger the corpus, the higher the minimum frequency should be.
      4. Set the maximum frequency percentage. This will exclude clusters beginning with words whose percentage of the corpus exceeds the maximum set as a limit. This option is designed to eliminate clusters beginning with grammatical words such as the, a, and is, which are very frequent and would generate many noisy clusters. (See note.)
      5. Set where the software must stop. This criterion tells the software to ignore clusters that include punctuation, a paragraph break or a sentence break, since terminological units do not occur across such structures.
      6. Generate the word-cluster list by clicking the OK button.
      7. A new window appears, looking exactly like a wordlist, except that, instead of single words, it contains word clusters. (See note.)
  3. If you want to use this word-cluster list in the future, you can save it by opening the File menu in the wordlist window and selecting the Save option (see screenshot). You will then need to give the file a name and select a sub-directory to store the file in. (See note.)
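The effect of the cluster settings described above (size, minimum frequency, maximum frequency percentage and stopping at punctuation) can be approximated with the following Python sketch. It is a simplified illustration under stated assumptions: the function name, its parameters, the tokenizer and the sample text are all invented for this example, and WordSmith's own algorithm may behave differently.

```python
import re
from collections import Counter


def find_clusters(text, size=2, min_freq=2, max_pct=None):
    """Count clusters of `size` consecutive words that never cross punctuation."""
    # "Stop at" setting: cut the text at punctuation and line breaks so that
    # no cluster can span a sentence or clause boundary.
    segments = re.split(r"[.,;:!?()\n]+", text.lower())
    token_lists = [re.findall(r"[a-z]+(?:-[a-z]+)*", seg) for seg in segments]

    # Frequency of each single word, needed for the maximum-percentage filter.
    word_freq = Counter(tok for toks in token_lists for tok in toks)
    total = sum(word_freq.values())

    clusters = Counter()
    for toks in token_lists:
        for i in range(len(toks) - size + 1):
            first = toks[i]
            # Skip clusters that begin with a word that is too frequent.
            if max_pct is not None and 100 * word_freq[first] / total > max_pct:
                continue
            clusters[" ".join(toks[i:i + size])] += 1

    # Keep only the clusters that reach the minimum frequency.
    return [(c, n) for c, n in clusters.most_common() if n >= min_freq]


# Hypothetical sample text, invented for illustration.
sample = ("The wind turbine turns the rotor. The rotor drives the generator, "
          "and the wind turbine feeds the grid.")

print(find_clusters(sample, size=2, min_freq=2))
print(find_clusters(sample, size=2, min_freq=2, max_pct=20))
```

In the sample, "the" makes up a third of the running words, so setting the maximum percentage to 20 removes the noisy clusters "the wind" and "the rotor" and leaves only "wind turbine".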

NOTE: The default selection for this menu is the *.* option, which means that all the files located in the directory will be displayed, regardless of their name or extension. If you choose *.txt, for example, only the files with a .txt extension will be displayed, though the asterisk means that these files may have any name. Similarly, only Web pages will be displayed if you choose *.htm or *.html, and only .xml documents will be displayed if you choose *.xml.
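The wildcard patterns mentioned in this note work in the same way as the shell-style patterns supported by Python's standard fnmatch module. The short sketch below, with an invented list of file names and a hypothetical helper function, shows which files each pattern would keep.

```python
from fnmatch import fnmatch


def filter_files(names, pattern="*.*"):
    """Return the names matching the wildcard pattern, ignoring case."""
    return [n for n in names if fnmatch(n.lower(), pattern.lower())]


# Hypothetical directory listing, invented for illustration.
files = ["manual01.txt", "index.html", "glossary.xml", "notes.TXT"]

print(filter_files(files, "*.*"))    # every file in the list
print(filter_files(files, "*.txt"))  # plain text files only
print(filter_files(files, "*.xml"))  # .xml documents only
```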
 

NOTE: If you think that you will want to consult this group of files regularly, you can save the list by clicking on the Save favourites button (see screenshot). You can even add comments about these files in the lower pane of the window.
 
The files can then be added as a block the next time that you use WordSmith Tools. This can be done by opening the list using the Get favourites button (see screenshot). Loading the files can take a few minutes.

NOTE: Each occurrence of a word in a file is called a token, while each different word is referred to as a type. For example, if we analyzed a file that contained only the three words “bridge, bridge, bridge”, the file would contain 3 tokens and 1 type.
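The distinction can be checked with a few lines of Python. The function name and the simplified tokenizer are assumptions made for this illustration.

```python
import re


def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text."""
    tokens = re.findall(r"[a-z]+", text.lower())  # simplified tokenizer
    return len(tokens), len(set(tokens))


print(count_tokens_and_types("bridge, bridge, bridge"))  # prints (3, 1)
```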

NOTE: When you choose the Save option, WordSmith saves the wordlist in its own proprietary format (.lst), which means you will be able to read the file with WordSmith only. You can also export the wordlist to other formats such as plain text, XML or an Excel spreadsheet. To do so, select Save As (see screenshot) from the File menu in the WordList window.
 
NOTE: If at any time you delete a word by mistake or change your mind about deleting a word, you can re-insert it in the wordlist by selecting it and pressing the Ins key on your keyboard. 

 

WARNING: Zapping will eliminate the previously deleted words from your wordlist permanently. There is no undo option.

 

NOTE: This process can also be automated by means of a stoplist. To learn more about this, see WordSmith WordList Tutorial, Level II.
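In principle, a stoplist is simply a list of words to be left out of the wordlist. The Python sketch below illustrates the idea with an invented stoplist and invented frequencies; it does not show WordSmith's stoplist file format or settings, which are covered in the Level II tutorial.

```python
from collections import Counter

# Hypothetical stoplist; a real one would be much longer.
STOPLIST = {"the", "of", "a", "and", "is", "in", "to"}


def apply_stoplist(freq, stoplist=STOPLIST):
    """Return a copy of the frequency list without the stoplisted words."""
    return Counter({word: n for word, n in freq.items() if word not in stoplist})


# Hypothetical raw frequencies, invented for illustration.
raw = Counter({"the": 120, "turbine": 40, "of": 75, "rotor": 22, "and": 60})
cleaned = apply_stoplist(raw)

print(cleaned.most_common())  # prints [('turbine', 40), ('rotor', 22)]
```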

 

NOTE: The first time you create an index, however, WordSmith may not ask you to make this choice.

NOTE: To decide where to set this threshold, you can consult the frequency of these words in your wordlist.
 
NOTE: You can clean and re-sort this list as if it were a wordlist. For more information on how to do so, read the section Viewing and manipulating the results.

 

NOTE: When you choose the Save option, WordSmith saves the word list in its own proprietary format (.lst), which means you will be able to read the file with WordSmith only. You can also export the wordlist to other formats such as plain text, XML or an Excel spreadsheet. To do so, select the Save As (see screenshot) option from the File menu in the WordList window.

 

 

VIII. Questions for Reflection


  • After having used WordSmith WordList, what are your first impressions of the functions and functioning of the tool?
  • What are your first impressions of its interface and the searching options it offers?
  • What options could be more useful to you and why?
  • Compared to other corpus analysis tools or term extractors, what are the advantages and disadvantages of WordList?
  • What aspects of the tool present the greatest obstacles for identifying terms? Can you think of solutions that you would like WordList to offer?

Tutorial created by the CERTT Team. (2007-10-07)