GOLDSEARCH©
created by David Boas, Miriam Meyerhoff,
and Naomi Nagy
General description
Much data-driven linguistic research relies on coordinating data of two types:
- a linguistic corpus (a collection of speech or writing from a number of
sources or speakers) that has been tagged, or marked up, to allow
researchers to identify linguistic features of interest to them, and
- a record of the characteristics of the speakers or writers contributing to
the corpus (sometimes also including the context of the recording).
To discover the patterns of linguistic variation and language use in
the corpus, it is necessary to examine how the language in the corpus varies
according to the different individual, social, and linguistic conditions also
encoded in the corpus. To do so, we compare the frequencies of variants
of a dependent linguistic variable across the (putatively) independent
variables: speaker, context, and linguistic environment.
A common method for this purpose has been to extract each occurrence (token)
of the linguistic variable from the corpus one-by-one (or speaker-by-speaker),
and then list the codes associated with each speaker in a separate file. This
file can then be analyzed by the programs Varbrul or Goldvarb or MacVarb
(created by David Sankoff et al. and David Rand, and Gregory Guy & Stan Lipa,
respectively). These programs tally
up the number of occurrences of each combination of factors (cells). Varbrul
then allows univariate and multivariate analyses of the interactions of factors
coded in the data.
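As a rough illustration of what tallying cells involves, the following Python sketch counts the variants found for each combination of factors. The token format (one string per token, with the first character coding the variant and the remaining characters coding independent factors), the sample data, and the function name tally_cells are assumptions made for this example; they are not the documented Varbrul or Goldvarb file format.

```python
from collections import Counter

# Hypothetical token list: one factor string per token.
# Character 0 codes the variant of the dependent variable; the remaining
# characters are single-character codes for independent factors
# (here: speaker sex, then age group). This layout is an assumption.
token_lines = ["1fy", "0fy", "1mo", "1fy", "0mo"]


def tally_cells(lines):
    """Count, for each combination of independent factors (a cell),
    how many tokens of each variant were found."""
    cells = {}
    for line in lines:
        variant, factors = line[0], line[1:]
        cells.setdefault(factors, Counter())[variant] += 1
    return cells


cells = tally_cells(token_lines)
for factors, counts in sorted(cells.items()):
    total = sum(counts.values())
    print(factors, dict(counts), "total:", total)
```

Each key of cells is one cell, and its counter holds the number of tokens per variant, which is the kind of table a later statistical analysis would start from.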
Goldsearch automates the first (and most time-consuming) step by creating
lists of all occurrences of tokens illustrating each variant of the dependent
variable. It does this by treating the corpus file and the file containing
information about each contributor to the corpus as linked databases. The
feature unique to Goldsearch, which commercial database programs do not seem
to offer, is the ability to conduct iterative searches in one of the files
while maintaining an active link with the other. This means that when you
conduct a search in one file, every token that the program finds is referenced
to the information in the other file. This information is used to create two
new files. One is a list of all tokens matching the search string that were
found in the corpus. The other is a cumulative list of the independent factors
associated with those tokens; this second output file is a text file ready to
be analyzed by Varbrul. In addition, a raw count of the tokens found for each
contributor to the corpus is shown at the end of each search run.
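The linked search described above can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not Goldsearch's actual implementation: the corpus layout (each line written as "speaker ID: text" with tags in square brackets), the speaker table, the tag names t0 and t1, and the function name linked_search are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical speaker file: speaker ID -> string of factor codes.
speakers = {"A": "fy", "B": "mo"}

# Hypothetical tagged corpus: "speaker ID: text", with tags in brackets.
corpus = [
    "A: I was [t1]walkin home",
    "B: she is [t0]walking and [t1]talkin",
    "A: nothing tagged here",
]


def linked_search(corpus_lines, speaker_codes, pattern, variant_code):
    """Search the corpus and look up each hit's speaker in the speaker table.

    Returns the matched tokens, one factor string per token, and a raw
    count of tokens per speaker.
    """
    regex = re.compile(pattern)
    tokens, factor_strings, counts = [], [], Counter()
    for line in corpus_lines:
        speaker_id, _, text = line.partition(":")
        speaker_id = speaker_id.strip()
        for match in regex.finditer(text):
            tokens.append((speaker_id, match.group(0)))
            # Link the token to its speaker's codes in the other "file".
            factor_strings.append(variant_code + speaker_codes[speaker_id])
            counts[speaker_id] += 1
    return tokens, factor_strings, counts


# One search run for the variant tagged t1.
tokens, factor_strings, counts = linked_search(corpus, speakers, r"\[t1\]\w+", "1")

# Iterative searches: one run per variant, accumulating the factor strings.
all_factors = []
for pattern, code in [(r"\[t1\]\w+", "1"), (r"\[t0\]\w+", "0")]:
    _, run_factors, run_counts = linked_search(corpus, speakers, pattern, code)
    all_factors.extend(run_factors)
    print(code, dict(run_counts))  # raw count per speaker for this run
```

Here tokens plays the role of the list of matches, all_factors the cumulative list of factor strings, and the printed counter the per-contributor count shown after each run.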
To summarize, this application:
- performs a search of a (bracketed and tagged) corpus;
- records the occurrences of a certain type of token (those that match
the search string) for each speaker and setting;
- counts and codes the number of occurrences of each type of token (or
search string) for each speaker; and
- produces an output list of the matches, in the form of strings of
independent factors associated with each speaker or turn, ready to use as
a token file for Goldvarb.
Features of GOLDSEARCH
Getting started
Defining the search
Document requirements
System requirements
Applications
Input and output
How to get Goldsearch from the WWW
For further information, or to suggest improvements or report problems, contact:
Naomi Nagy
English Department
Hamilton Smith Hall
University of New Hampshire
Durham, NH 03824
You can send e-mail to ngn@unh.edu or
call (603) 862-2783.
To learn more about the creators of this application, look at the home pages
of David Boas, Miriam Meyerhoff, and Naomi Nagy.