A note on extended and composite tags

Muhammad Shakir edited this page May 1, 2023 · 5 revisions

The extended tags consist of features adapted from Biber (2006) and Biber et al. (1999). These features are semantic categories of verbs, nouns, adverbs, and adjectives, and each semantic subclass is a subset of an original tag in the simple tag set. For example, the semantic types of adjectives, such as evaluative adjectives (JJEVAL), are based on JJAT and JJPR. If you use a semantic type such as JJEVAL in your analysis, its counts overlap with those of JJAT and JJPR. In other words, JJEVAL is also counted in the simple tags JJAT and JJPR. To eliminate this double counting, use JJATother and JJPRother, which are mutually exclusive with the semantic subclasses of adjectives. The same logic applies to the semantic classes of adverbs and nouns and to one semantic subclass of prepositions (PrepNSTNC).

In short, adjectives, adverbs, nouns, and prepositions each have simple tags and corresponding "other" variants, which were added to avoid double counting (the same is true for THSC, THRC, and WHSC). A simple tag should therefore never be used together with its "other" variant: use either JJAT or JJATother, never both.
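The rule about simple tags and their "other" variants can be checked mechanically before running an analysis. The following is a minimal sketch, not part of the tagger itself: the function name `find_simple_other_clashes` is hypothetical, and it assumes only the naming convention described above, where the non-overlapping variant of a simple tag is the tag name followed by `other`.

```python
def find_simple_other_clashes(selected):
    """Return simple tags that are selected together with their 'other' variant.

    Assumes the naming convention from this page: the non-overlapping
    variant of a simple tag X is called X + "other" (e.g. JJAT / JJATother).
    """
    chosen = set(selected)
    return sorted(tag for tag in chosen if tag + "other" in chosen)


# Hypothetical feature selection: JJAT and JJATother are both present.
features = ["JJAT", "JJATother", "JJEVAL", "JJPRother"]
print(find_simple_other_clashes(features))  # ['JJAT']
```

An empty list means that no simple tag is paired with its "other" variant. It does not detect a simple tag used alongside its semantic subclasses, which needs a mapping from each simple tag to its subclasses.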

The composite tags have been added to facilitate a gradual process of feature selection. Composite tags generally end in "all". For example, WhVSTNCall is the sum of WhVATT, WhVFCT, WhVLIK, and WhVCOM. This option follows the suggestion of Egbert and Staples (2019). Put simply, you select the fine-grained features as a first step (in the example above, the four subtypes of WH clauses). If none of these features load on your dimensions or factors, the next step is to exclude the four subtypes, use only the combined version WhVSTNCall, and re-run the analysis. Alternatively, you can select the "all" variants from the very beginning and discard the subclasses if they are too infrequent. Either way, the important thing is to avoid overlap: never use WhVSTNCall together with the four subclasses it represents.
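The two-step selection described above can be sketched as follows. This is an illustration under stated assumptions: the helper names `composite_count` and `use_composite` are hypothetical, the counts are invented for the example, and only the tag names come from this page.

```python
# The four WH-clause subtypes and their composite, as named on this page.
WH_SUBTYPES = ["WhVATT", "WhVFCT", "WhVLIK", "WhVCOM"]
COMPOSITE = "WhVSTNCall"


def composite_count(counts, subtypes=WH_SUBTYPES):
    """Sum the subtype counts, which is how the composite tag is defined."""
    return sum(counts.get(tag, 0) for tag in subtypes)


def use_composite(selected, subtypes=WH_SUBTYPES, composite=COMPOSITE):
    """Drop the fine-grained subtypes and select the composite tag instead."""
    kept = [f for f in selected if f not in subtypes and f != composite]
    return kept + [composite]


# Invented counts for a single text, for illustration only.
counts = {"WhVATT": 3, "WhVFCT": 5, "WhVLIK": 1, "WhVCOM": 2}
print(composite_count(counts))  # 11

# Step 1: fine-grained features. Step 2: fall back to the composite.
step1 = ["JJEVAL"] + WH_SUBTYPES
step2 = use_composite(step1)
print(step2)  # ['JJEVAL', 'WhVSTNCall']
```

Because `use_composite` removes the subtypes before adding the composite, the resulting selection never contains both at once.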

Composite tags that are based on Biber (2006) are, like their individual siblings, counted from simple tags. For example, WhVSTNCall, like its siblings WhVATT, WhVFCT, WhVLIK, and WhVCOM, is a subset of WHSC. It is therefore not advisable to combine WHSC and WhVSTNCall in the same factor analysis or principal component analysis. To avoid overlap, use WHSCother instead of WHSC whenever you use the semantic subclasses or their composite.
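As a rough sketch of this substitution, the snippet below swaps a simple tag for its "other" variant whenever one of its subsets is also selected. The function name `swap_in_other` and the `SUBSETS_OF` mapping are hypothetical. The mapping lists only the WHSC relationships stated on this page and would need to be extended by hand for other tags.

```python
# Simple tag -> tags counted as subsets of it (only what this page states).
SUBSETS_OF = {
    "WHSC": {"WhVATT", "WhVFCT", "WhVLIK", "WhVCOM", "WhVSTNCall"},
}


def swap_in_other(selected, subsets_of=SUBSETS_OF):
    """Replace a simple tag with its 'other' variant if any subset is selected."""
    chosen = set(selected)
    result = []
    for tag in selected:
        if tag in subsets_of and chosen & subsets_of[tag]:
            result.append(tag + "other")  # e.g. WHSC -> WHSCother
        else:
            result.append(tag)
    return result


print(swap_in_other(["WHSC", "WhVSTNCall"]))  # ['WHSCother', 'WhVSTNCall']
print(swap_in_other(["WHSC"]))                # ['WHSC']
```

WHSC is left untouched when none of its subclasses are selected, since no double counting arises in that case.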

References

Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman Grammar of Spoken and Written English. Longman Publications Group.

Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Benjamins.

Egbert, J., & Staples, S. (2019). Doing multidimensional analysis in SPSS, SAS and R. In T. Berber-Sardinha & M. V. Pinto (Eds.), Multidimensional analysis: Research methods and current issues (pp. 125–144). Bloomsbury.
