

Towards a Methodology for Measuring Lexical Density in Arabic[1]


Ahmed Seddik Al-Wahy

Faculty of Languages (Al-Alsun),

Ain Shams University, Cairo, Egypt



تحاول هذه الدراسة تحقيق هدفين مترابطين. فهي تهدف أولا إلى بيان ما إذا كانت هناك فروق في الكثافة اللفظية بين مستويين من مستويات العربية، وهما الفصحى المعاصرة والفصحى الوسيطة، كما تسعى ثانيا إلى التوصل إلى الطريقة المناسبة لقياس الكثافة اللفظية في اللغة العربية، بما تتميز به من خصائص نحوية وصرفية وأعراف متفق عليها في طرق الكتابة. ولتحقيق هذين الهدفين، تقارن الدراسة بين الطريقتين الرئيسيتين لقياس الكثافة اللفظية، وهما طريقة جينيور وطريقة مايكل هاليداي، وتناقش ما يكتنف كلا منهما من صعوبات عند تطبيقها على العربية، مع محاولة اقتراح الحلول للتغلب عليها كلما أمكن ذلك. وبعد ذلك تطبق الدراسة كلا من الطريقتين على عدد من النصوص التي تمثل المستويين التاريخيين المذكورين، والتي تنتمي إلى جنس لغوي واحد وهو السرد التاريخي. وتشير نتائج المقارنة إلى وجود اختلاف في الكثافة اللفظية بين المستويين، وإن تعارضت الطريقتان فيما بينهما في تحديد أي المستويين أشد كثافة، كما تشير إلى أن طريقة هاليداي هي الأنسب لقياس الكثافة اللفظية في العربية، وأنها تظهر فروقا ذات مغزى بين الفصحى المعاصرة ونظيرتها الوسيطة في النصوص محل الدراسة.



The purpose of this study is two-fold. First, it aims to show whether there are differences in lexical density between two historical varieties of Arabic, namely, Late Middle Arabic and Modern Standard Arabic. Second, it seeks to find out the method of measuring lexical density that best suits the Arabic language, with its orthographic and morpho-syntactic peculiarities. To this end, it compares the two main methods for measuring lexical density, Ure’s and Halliday’s, and discusses the difficulties that arise when each of them is applied to Arabic, suggesting solutions where possible. Each method is then used to measure lexical density in a selection of texts representing the two varieties of Arabic and belonging to the historical narrative genre. The results of the comparison indicate that the two varieties display different degrees of lexical density, though the two methods of measurement yield opposing results. However, it is shown that Halliday’s method is more appropriate for Arabic and that it consistently reveals significant differences between the historical varieties as represented by the texts analysed.

Keywords: lexical density, grammatical intricacy, lexical variation, Late Middle Arabic, Modern Standard Arabic


  1. Introduction

Since its introduction by Ure (1971), the concept of lexical density has been applied to a wide range of languages, mainly to compare their varieties for descriptive or applied purposes. Few studies, however, have applied this concept to Arabic, in spite of its potential to cast new light on the differences between its genres and its social and historical varieties. This is probably due to the fact that Arabic is typologically different from English and most of the other languages that have been studied in terms of lexical density, which is manifest in its morpho-syntactic features and orthographic system. Measuring lexical density in Arabic may require modifying the existing models or adopting a functional rather than a traditional approach to Arabic grammar.

The present study is based on the hypothesis that one of the main differences between the modern and earlier varieties of Arabic has to do with the degrees of lexical density they display. To test this hypothesis, the study compares the two methods for measuring lexical density, namely, Ure’s (1971) and Halliday’s (1989, 1994; Halliday & Matthiessen, 2014), with respect to their applicability to Arabic and suggests solutions to the problems encountered in this respect. After that, each method is applied to texts belonging to two historical varieties of Arabic to see which method reveals more significant differences between the two varieties with respect to lexical density.

The importance of applying the concept of lexical density to Arabic lies in its potential for opening up new avenues of research and practical applications in many language-related fields. These include the study of language variation, genre analysis, translation studies, and language teaching. Lexical density can work as an explanatory tool for differences between historical and other varieties of Arabic and can provide new insights into the processes and directions of the development of the Arabic language. Another important area is the study of the readability of different Arabic texts, which can have a wide range of applications. For instance, it is useful in evaluating whether textbooks and other works are appropriate for the young or non-expert reader. This is based on the idea that texts with low lexical density are generally more accessible than those with high density (e.g., Halliday, 1989; Stubbs, 2004). The same idea can be useful for translation and translation evaluation. Translators into Arabic may choose to reduce the lexical density of their target texts to make them more accessible to the intended recipients, as in the case of translating an encyclopedia entry for children or a specialized text for the general reader. This technique can be regarded as a type of explicitation, which is assumed to be one of the universals of translation. In order to make the best use of such potential, it is necessary to adopt a method of measurement that reflects the realities of the Arabic language and suits its morphological and syntactic features. Only an appropriate method can add to the value of the studies based on lexical density and can ensure the reliability of their results.

The texts examined in this study belong to two historical varieties of Arabic. The first is Modern Standard Arabic (MSA), which is the contemporary standard variety generally used across the Arab world in formal writing, news bulletins, and formal political speeches. As Owens (2006, p. 5) describes it, MSA is “a largely standardized form of the Classical language … which is close to the language of contemporary journalism in the Arabic world.” The second variety is the written Arabic of the eighteenth and early nineteenth centuries. This variety falls under the general label “Middle Arabic,” a polysemous term that has been used to refer to the period of transition from Classical to Modern Arabic and also to the generally standard varieties of Arabic that incorporate elements of colloquial dialects (Owens, 2006, pp. 46–47). These two senses apply to the source chosen to represent this variety in the present study, and as such the second variety can well be described as “Middle Arabic.” It is more accurate, however, to use the term “Late Middle Arabic” (LMA), to represent the period of time to which the text belongs. LMA is thus the standard variety of Arabic which immediately precedes its first major contact with Western civilization in modern times, which begins with the French Expedition to Egypt (1798-1801). The LMA source examined here chronicles the events of the French Expedition, among many other events.

While both LMA and MSA follow the rules of Arabic grammar concerning case inflections, word endings, sentence structure, and word order (which is not the case, for example, with colloquial varieties of Arabic), MSA is more influenced by Western culture and civilization as well as modern European languages, especially English and French, which are the languages with which Arabic has had the most contact in modern times. Conversely, LMA is closer to Classical Arabic in terms of structural patterns, phraseology, and lexis. It may include local or obsolete words, but it is influenced neither by Western civilization nor by modern languages. This feature, which distinguishes LMA as a variety of Arabic, also applies to varying degrees to those contemporary Arabic texts that would be described as “Heritage Standard Arabic” (fuṣḥā al-turāth) by Arabic sociolinguists (e.g., Badawi, 1973, pp. 89–90), a variety which is used almost exclusively by Muslim scholars in the religious register.

One of the difficulties in this respect is to find comparable texts belonging to the same genre in LMA and MSA to use as data for comparison. Genres belonging to a given historical variety do not necessarily have counterparts in others, and this has various reasons, including incomparability (as in the case of canonical religious texts), disappearance of the genre in question (as in the case of maqāmāt, a rhymed prose literary genre common in Middle Arabic), or newness (as in the case of modern sciences and modern literary genres). One genre that exists across both varieties is that of history, which has therefore been chosen as the source of data for the present study. The texts chosen for analysis are drawn from cAjā’ib al-’Āthār fī al-Tarājim wa al-’Akhbār ‘Wonders of Traditions in Biographies and Events’ (1880/1997) by Abdurrahman al-Jabarti (1753-1825) for LMA, and Suqūṭ Niẓām ‘The Downfall of a Regime’ (2013) by Mohamed Hassanein Heikal (1923-2016) for MSA. Both books were written by Egyptian historians and deal with important periods of transition in the history of Egypt.

Given the lack of corpus tools that can accurately test lexical density in Arabic using both methods, the analysis is performed manually based on selected passages drawn from each book. A number of conditions have been applied to ensure the highest possible degree of consistency and reliability of measurement. For instance, proper names consisting of more than one word (e.g., Muhammad Nagīb and Yūsuf ibn ’Ayyūb) have been regarded as single words, since they refer to single entities. The same applies to numbers and years, which are written in the LMA texts as separate words (e.g., thamānin wakhamsīna wasitti mi’ah ‘six hundred and fifty-eight’), but written in figures in the MSA texts. In the analysis, these have been joined with hyphens so as to count them as single words (e.g., sanata thamānin-wa-khamsīna-wa-sittimi’ah ‘the year 658’). The passages selected from each variety are of approximately the same length (738 words for LMA and 744 words for MSA) and are mainly narrative, i.e., passages based on dialogue and lists of separate items have been avoided. In addition, passages including translations from English (which are abundant in Suqūṭ Niẓām) have been excluded to avoid any possible source language influence.
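The hyphen-joining convention described above can be illustrated with a minimal Python sketch. It is only an approximation of the manual procedure: the list of multiword units and the helper names are hypothetical, and the units are taken from the examples quoted in the paragraph. The joined form it produces for the number phrase differs in hyphenation from the form given above, but the resulting word count is the same.

```python
# Hypothetical list of multiword units to be counted as single words.
# The entries come from the examples quoted in the text; a real analysis
# would compile such a list by hand for each passage.
MULTIWORD_UNITS = [
    "Muhammad Nagīb",
    "Yūsuf ibn ’Ayyūb",
    "thamānin wakhamsīna wasitti mi’ah",
]


def join_multiword_units(text, units):
    """Replace the spaces inside each listed unit with hyphens."""
    # Longer units are handled first so that a short unit never splits a longer one.
    for unit in sorted(units, key=len, reverse=True):
        text = text.replace(unit, unit.replace(" ", "-"))
    return text


def count_words(text):
    """Count orthographic words, i.e. whitespace-delimited strings."""
    return len(text.split())


for unit in MULTIWORD_UNITS:
    joined = join_multiword_units(unit, MULTIWORD_UNITS)
    print(unit, "->", joined, ":", count_words(unit), "->", count_words(joined))
```

Each unit is counted as several words before joining and as one word afterwards, which is the effect the convention is meant to achieve.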


  2. Lexical Density and Related Terms

This section deals in some detail with the basic terms and concepts upon which the present study is based. First, it defines lexical density, and then it elaborates on the distinction between content words and function words, which is essential for its measurement, and discusses how it applies to Arabic. It also refers to some other terms that are related to, and sometimes confused with, lexical density, such as grammatical intricacy and lexical variation.


2.1 Lexical Density

Most definitions of lexical density focus on the quantitative aspect of the term, which is related to the frequency of content words in a text (e.g., Linnarud, 1977, p. 86; Laviosa, 1998, p. 10; Stubbs, 2002, p. 41, 2004, p. 122). A typical definition states that “lexical density is the term most often used to describe the proportion of content words (nouns, verbs, adjectives, and often also adverbs) to the total number of words” (Johansson, 2009, p. 146). Such characterizations, however, do not define the concept of lexical density, but rather state how it is measured. It is more revealing to define lexical density as the degree of richness of a text in terms of meanings, ideas, and information. Halliday (1989, p. 62) describes lexical density as “the density with which the information is presented.” Lexical density, therefore, is mainly the density of the informational and ideational load of texts, which is realized by content words, as opposed to function words.

The concept of lexical density has been used particularly to distinguish between written and spoken varieties of language, where written language has been shown to be lexically denser than spoken language (Ure, 1971; Halliday, 1989). One of Ure’s (1971) findings is that spoken English texts tend to have a lexical density of less than 40%, whereas written texts tend to have a lexical density higher than 40%. According to Halliday (1989, p. 80), “the lexical density of written language is likely to be of the order of twice as high as that for speech.” Lexical density is also inversely related to text readability; the denser a text is, the harder it is to process and understand (e.g., Harrison & Bakker, 1998; Stubbs, 2004; Castello, 2008).

Lexical density has also been used for the description and characterization of scientific and technical texts (Vande Kopple, 2003), for assessing the writing proficiency level of foreign language learners in comparison with that of native speakers (Linnarud, 1976), for comparing newspaper discourse over periods of time (Štajner & Mitkov, 2011), for comparing translated and non-translated texts (Laviosa, 1998; Xiao & Yue, 2009), and for comparing different registers within the same languages (Yates, 1996) and across languages (Neumann, 2014).

As far as Arabic is concerned, there is a clear lack of studies that deal with lexical density or provide a detailed discussion of its theoretical basis and applicability to Arabic. In addition, the few studies that have attempted to measure lexical density in certain genres of Arabic have generally used Ure’s method, which, as suggested below, does not suit the morphological and orthographic characteristics of the Arabic language. For instance, El-Farahaty (2015, p. 48, p. 149) refers to lexical density as one of the characteristics of legal Arabic, which she attributes to the recurrent listings of consecutive nouns joined by a coordinating conjunction, especially wa ‘and’ and ’aw ‘or’. El-Farahaty does not state the method she uses to measure lexical density, nor does she define the term itself, though her reference to Dickins et al. (2002) and her use of the terms “syndetic” (using connectives) and “asyndetic” (without connectives) suggest that she associates the term with lexical repetition. Dickins et al. (2002, p. 59) have used these terms to refer to the phenomenon of semantic repetition in Arabic, which is achieved through the use of synonyms or near synonyms. Lexical density, however, is much broader than lexical and semantic repetition, which is only one among several factors that contribute to the lexical density of a text. It is noted that El-Farahaty’s main concern is with legal translation between English and Arabic, which is probably why she does not elaborate on the theoretical basis for using the term, nor does she refer to other genres to establish the norm against which legal Arabic is judged to be lexically dense. However, El-Farahaty is quite right in her observation that coordination and the use of lists of nouns are among the factors that increase lexical density in Arabic.

In a different vein, Mat Daud et al. (2014), who subscribe to the view that lexical density is inversely proportional to readability, develop an index for Arabic text readability for pedagogical purposes, in which lexical density is one of the main factors. Like other researchers using corpus tools, they measure lexical density in terms of the ratio of content words to the overall number of words in the text, which is the method that lends itself more easily to corpus analysis, again without discussing the extent of its applicability to Arabic.

In a study published in Arabic, Al-Wahy (2014) examines lexical density in sociology texts from different historical varieties of Arabic, using different methods of measurement. The study discusses the difficulties that arise when Ure’s method is applied to Arabic and experiments with the idea of taking grammatical morphemes into account when measuring lexical density in Arabic. One of the findings of the study is that MSA is generally lexically denser than earlier varieties of Arabic in the sociology genre, which is corroborated by the present study with reference to history texts.


2.2 Content Words and Function Words

Measuring lexical density, irrespective of which method is used, depends on the theoretical distinction between content words and function words. This distinction is well-established in English linguistics and has been discussed under various labels, including “lexical items” and “grammatical items” (Halliday, 1989), and “open-set items” and “closed-set items” (Cruse, 2011). It goes back to the 19th-century grammarian Henry Sweet (Stubbs, 2002), and has also been used by Fries (1952), the American structural linguist, as the basis for his taxonomy of English word classes, where words are divided into four classes (roughly corresponding to nouns, verbs, adjectives, and adverbs, which are usually content words) and fifteen groups (representing function words). However, research into the characteristics of each type in Arabic and the word classes associated with it is rather lacking. It is necessary, before attempting to measure the lexical density of Arabic texts, to decide on the criteria for distinguishing between the two types.

Content words, to begin with, are words that express meanings that can be understood relatively independently of the verbal or non-verbal context; they carry the ideational load of any text. Function words are connected to the verbal or situational context in which they occur and they have little meaning outside this context. Words like huwa ‘he’, hādhā ‘this’, or alladhī ‘who/which’ do not have an independent semantic content, but depend on their referents in the context. They perform a grammatical rather than a semantic role. As Stubbs (2002, p. 39) puts it, “content words tell us what a text is about, and function words relate content words to each other.”

In the Arabic lexicographic tradition, many function words do not appear as headwords, a fact which reflects the nature of these words as devoid of independent semantic content compared with content words. In English lexicography, however, the dictionary is regarded not only as a reference book that defines the meanings of words, but also as a record of the vocabulary of the language (Jackson, 2002), which explains why it defines all its headwords, including common words and function words, though the latter are defined in terms of their usage. In Arabic lexicography, function words are defined rather vaguely (if they appear as headwords in dictionaries at all), as in the definition of hiya ‘she’ in Mukhtār al-Ṣiḥāḥ (Al-Razi, 1907/1995), where the word occurs under huwa ‘he’, and where both are defined scantily as “huwa is for masculine and hiya for feminine”, without even stating that they are pronouns. In al-Mucjam al-Wasīṭ, issued by the Arabic Language Academy in Cairo (Majmac al-Lughah al-cArabiyyah, 1985), and currently considered the standard dictionary of Modern Arabic, there are no entries for the pronouns huwa ‘he’ or hiya ‘she’, though there is one for ’anta/’anti ‘you (singular)’ (masculine and feminine, respectively).

In addition, content words are regarded as open classes, as opposed to function words, which are closed classes (e.g., Cruse, 2011). New nouns and verbs can enter the Arabic language, either by borrowing or by derivation, but a new pronoun or preposition is not expected to be coined in, or borrowed into, the Arabic language. Function words may become obsolete and disappear from current usage, just as some lexical words do, but such changes are gradual and may take centuries to occur. Changes in function word systems are usually associated with the colloquial varieties rather than the standard ones, as in the case of the disappearance of the dual forms from the pronominal system in colloquial Arabic varieties. Though limited in number, function words occur more frequently in discourse than content words if the proportion of each type in the language is taken into consideration. Generally, the words most frequently used in language are function words rather than content words.

If the traditional Arabic classification of the parts of speech (where the word classes are only three: nouns, verbs, and particles) is accepted, content words will be included under the classes of nouns and verbs, whereas function words will be included under particles. This, however, is rather a generalization, for not all nouns and verbs in Arabic are content words. For instance, demonstratives and most pronouns are included in traditional Arabic grammar under nouns, though they are function words in the light of the criteria listed above. The same applies to certain classes of verbs, such as kāna ‘be’ and “its sisters,”[2] which precede nominal sentences and assign the accusative case to their predicates. Therefore, the traditional Arabic part-of-speech taxonomy does not provide reliable grounds for distinguishing between content words and function words. It is more useful to depend on the above criteria, particularly whether the word has an independent semantic content and whether it belongs to an open set or a closed set of words.

There are two points that seem problematic in this respect. First, some words are ambiguous, in the sense that they are lexical in one sense and functional in another. This applies to verbs that are mainly functional but are also used lexically. For instance, a verb like kāna ‘be’ is usually used as a defective verb (ficl nāqiṣ), where it is simply a tense carrier, in which case it is a function word, but it is sometimes used as a full verb (ficl tāmm), as in wa-ka’anna shay’an lam yakun ‘as if nothing had happened’ (Text 4 MSA), in which case it is a content word. Another problem with the content-function word dichotomy is that it suggests that function words are devoid of semantic content, or that they only have grammatical meaning, which is contrary to the realities of language. As is the case with most linguistic taxonomies, there are borderline cases that are neither fully lexical nor fully functional. An example from English is modal auxiliary verbs. While modal auxiliaries represent a closed system, occur with, or assume, lexical verbs and are used as grammatical words in questions and negation, they are not totally devoid of semantic content; otherwise there would not be any differences in meaning among them. An example from Arabic is the category of verbs known as ’afcāl al-muqārabah wa-al-rajā’ wa-al-shurūc ‘verbs of appropinquation, hope, and beginning’. These resemble modal auxiliaries in English in that they represent a closed set and typically occur with fully lexical verbs to modify aspects of their meanings, such as kāda yabkī ‘he was about to cry’ or bada’at tatakallam ‘she started to talk’. Items in many closed systems perform a grammatical function and have semantic content at the same time.

This suggests that the difference between content and function words is rather a matter of degree. As Halliday (1989, p. 63) observes, “there is a continuum from lexis into grammar,” with the result that there are “intermediate cases” between lexical and grammatical items. Similarly, Cruse (2011, p. 268) explains that “in reality there is not a strict dichotomy between closed-set and open-set items, but rather a continuous scale of lexicality/grammaticality.” He roughly orders categories of words according to the semantic richness they display, starting with full content words, followed by prepositions, classifiers, and then other items, including “light verbs” (i.e., verbs that do not add much to the meaning, such as make in make a move), auxiliaries, articles, pronouns, and conjunctions (Cruse, 2011, p. 269). The idea of degrees of lexicality has also been raised in syntax. For example, while Corver and van Riemsdijk (2001, p. 10) recognize the usefulness of the content/function-word distinction, they believe that “there are content words with a degree of ‘functionalness’ and there are function word [sic] having a degree of ‘lexicalness’.” They argue for a third category, namely that of “semi-lexical” words, which combine features of both lexical and functional words.

The problem of ambiguity can be overcome by examining each possible case individually to decide its category before measurement. As for the borderline cases, it has to be decided whether to include them as content or function words in all cases. One might also consider the option of assigning an intermediate value to such words, though the results will still be approximate. The most important point in this regard is to remain consistent in all cases (Halliday, 1989). In this paper, I have regarded borderline cases as function words if they are members of closed sets.


2.3 Grammatical Intricacy

Grammatical intricacy is the type of complexity associated with spoken language, where sentences tend to consist of many clauses that are related to one another through parataxis and hypotaxis (Halliday & Matthiessen, 2014, p. 726). The information represented by content words is distributed among these clauses, resulting in the lower degree of lexical density that is typical of spoken language. In writing, sentences tend to consist of fewer clauses, each packed with a larger number of content words. The same information that is expressed by one clause or a few clauses in written language can be expressed by a larger number of grammatically related clauses in spoken language. This results in “lexical sparsity,” a term presented by Halliday (1989, p. 79) as the opposite of lexical density and as the direct result of grammatical intricacy.

Eggins (2004, p. 97) suggests a measure of grammatical intricacy based on dividing the number of clauses in the text by the number of sentences (or clause complexes). If, for instance, a given text has 20 sentences and 80 clauses, its grammatical intricacy will be 4. This method can be represented by the following equation:

\[ \text{Grammatical intricacy} = \frac{\text{number of clauses}}{\text{number of sentences (clause complexes)}} \]
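As a minimal sketch, the measure described above can also be written as a short Python function. The function name is hypothetical, and the clause and sentence counts are assumed to have been obtained by manual analysis; the figures used are those of the example in the text.

```python
def grammatical_intricacy(num_clauses, num_sentences):
    """Divide the number of clauses by the number of sentences (clause complexes)."""
    if num_sentences <= 0:
        raise ValueError("the text must contain at least one sentence")
    return num_clauses / num_sentences


# The example from the text: 80 clauses distributed over 20 sentences.
print(grammatical_intricacy(80, 20))  # 4.0
```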


The difficulty of measuring grammatical intricacy in Arabic lies in the lack of clear sentence boundaries in Arabic texts. Punctuation use in Arabic is not as strict as it is in English, and the use of coordinating conjunctions such as wa ‘and’ and fa ‘and so’ is rather ambiguous, as these can be used to join clauses within larger sentences or to introduce new sentences. This feature applies particularly to the earlier historical varieties of Arabic, where long stretches of text can be regarded as single sentences, but it is sometimes encountered in MSA as well.


2.4 Lexical Variation

Another related concept is that of lexical variation (also called “lexical variety” (e.g., Johansson, 2009) and “lexical diversity” (e.g., Jarvis, 2013)). Like lexical density, lexical variation has been the focus of many corpus-based studies that aim to measure text readability for a variety of purposes, ranging from examining different registers (e.g., Sotov, 2009) to the analysis of learner corpora (e.g., Linnarud, 1976). Lexical variation refers to the range of vocabulary used by the text writer; it is concerned with the degree of diversity of the content words used in the text, i.e., with “how many different words are used in a text” (Johansson, 2009, p. 141). In measuring lexical variation, it does not matter how many times a word is repeated in the text; what counts is whether or not the word is new, in the sense that it has not occurred in the preceding co-text.

Of particular relevance in this context is the distinction made in corpus linguistics between word-tokens and word-types (or tokens and types, for short). In a given text, the number of tokens is the total number of words, whereas the number of types is the number of different words. As a text proceeds, any word will count as a token, but only new words will count as types (Stubbs, 2002, p. 133). Lexical variation is calculated in terms of the type/token ratio, i.e., the ratio of the number of types to the number of tokens (Stubbs, 2002). Other things being equal, the higher the percentage of lexical variation, the less readable the text will be (Stubbs, 2004), since different words normally represent different concepts that add complexity to the text being processed. Conversely, low lexical variation indicates relative repetitiveness, which facilitates text readability. The measure of lexical variation can be represented by the following equation (cf. Linnarud, 1976, p. 46):

\[ \text{Lexical variation} = \frac{\text{number of types}}{\text{number of tokens}} \times 100 \]
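The type/token calculation described above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the token list is a made-up English example, and the input is assumed to be already tokenized (and, where the analysis requires it, restricted to content words).

```python
def lexical_variation(tokens):
    """Return the type/token ratio of a list of tokens as a percentage."""
    if not tokens:
        raise ValueError("the token list must not be empty")
    types = set(tokens)  # every distinct word counts once, however often it recurs
    return 100 * len(types) / len(tokens)


# Made-up example: five tokens, of which "the" is repeated, hence four types.
tokens = ["the", "king", "defeated", "the", "army"]
print(lexical_variation(tokens))  # 80.0
```

A text that repeats the same words often yields a lower figure, which corresponds to the relative repetitiveness mentioned above.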


There seems to be some confusion in the literature between lexical density and lexical variation. For instance, Crystal (2008, p. 276) defines lexical density as “a measure of the difficulty of a text, using the ratio of the number of different words in a text (the ‘word types’) to the total number of words in the text (the ‘word tokens’),” adding that it is “also called type/token ratio”. Similarly, in their study of lexical density in translated and native Chinese fiction, Xiao and Yue (2009, p. 253) state that “there are two common measures of lexical density,” the first being “the ratio between lexical words (i.e., content words) and the total number of words,” and the second being “the type-token ratio.” As has been seen, the content-word/running-word ratio is a measure of the informational load of a text, while lexical variation is a measure of its informational range. In spite of aspects of similarity between them, lexical variation and lexical density refer to two different concepts, and it is better to keep them distinct.


  3. Measuring Lexical Density in Arabic

This section reviews the two methods suggested for measuring lexical density by Ure (1971) and by Halliday (e.g., 1989) and discusses some problems that emerge when each is applied to the Arabic language, suggesting solutions where applicable. In the transliteration of the Arabic examples, I have used the Library of Congress (LOC) Romanization system. Admittedly, this system does not represent the natural pronunciation of words in connected speech, but it has the advantage of clearly showing the word boundaries and the exact number of words in the Arabic examples, which serves the purposes of the present study. The transliteration is followed by a glossing that reflects as closely as possible the relevant syntactic and morphological structure, and an idiomatic translation that shows the meaning of the text.


3.1 Ure’s Method

Ure’s (1971) method is based on counting the lexical words of a text and then calculating their percentage in relation to all the words of the text. This method has been adopted by many scholars, such as Linnarud (1976), Eggins (2004), and Johansson (2009). It is also the method favoured by corpus linguists, such as Stubbs (2002) and Castello (2008). Lexical density in this method is expressed as a percentage; it is the result of dividing the number of content words by the total number of words and multiplying the result by one hundred, as shown by the following equation:

\[ \text{Lexical density} = \frac{\text{number of content words}}{\text{total number of words}} \times 100 \]


The word in this model is understood in the orthographic sense, i.e., it is a unit of writing consisting of a sequence of letters (or sometimes a single letter) with spaces before and after it. This means that a solid compound would count as a single word, while an open compound would count as two words. For instance, Linnarud (1976, p. 46) counts phrasal lexemes such as turn up as two words, with turn as a content word and up as a function word.

If this method is applied to Arabic, a number of problems will emerge. First, in Arabic orthography, some particles (which represent a separate word class) are always attached to the following words, such as the coordinating conjunctions wa ‘and’ and fa ‘then’, the prepositions bi ‘with’ and ka ‘as’, and the preverbal li of purpose ‘in order to’, while others are written as separate words. This is an orthographic rule in Arabic, where the particles written as single letters are consistently attached to the following words. If Ure’s method is adopted, a word with an attached conjunction (e.g., huwa wa-hiya ‘he and she’) will be counted as one word, while the same word with a separate conjunction (e.g., huwa ’aw hiya ‘he or she’) will be counted as two words. The same applies to causative particles, such as li-yacīsh ‘in order for him to live’, which counts as one word, as opposed to kay yacīsh, which has the same meaning but counts as two words. These orthographic conventions are thus not related to the meaning or the use of the particle, though they affect lexical density in Ure’s method. If lexical density, as shown above, is essentially density of information, then the above expressions, which are similar in meaning and in form, should have the same lexical density.
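The effect of these orthographic conventions on word counts can be illustrated with a short Python sketch that counts orthographic words as whitespace-delimited strings. The Arabic-script spellings below are my own renderings of the transliterated examples and are given as assumptions; the function name is hypothetical.

```python
def count_orthographic_words(text):
    """Count orthographic words, i.e. strings delimited by whitespace."""
    return len(text.split())


# Assumed Arabic-script renderings of the transliterated examples in the text.
examples = {
    "huwa wa-hiya": "هو وهي",    # conjunction wa attached to the following word
    "huwa ’aw hiya": "هو أو هي",  # conjunction ’aw written as a separate word
    "li-yacīsh": "ليعيش",         # particle li attached to the verb
    "kay yacīsh": "كي يعيش",      # particle kay written as a separate word
}

for transliteration, arabic in examples.items():
    print(transliteration, count_orthographic_words(arabic))
```

The two coordinated phrases yield two and three orthographic words respectively, and the two purpose expressions yield one and two, although the members of each pair carry the same amount of information.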

Another problem is that Arabic is a highly synthetic language in which many grammatical morphemes are attached to the lexical word in writing, which is not the case with analytic languages such as English, where many such forms are typically written as separate words. Therefore, it is possible to find a full grammatical clause in Arabic realized by a single word. An example from the corpus is wa-hazama-hum ‘and-defeated-he-them’ (Text 2 LMA), which consists of a conjunction, a verb whose conjugation denotes a third-person singular masculine subject, and a bound pronoun functioning as object. Measured by Ure’s method, the lexical density of this one-word clause would be 100%, while its English translation ‘and he defeated them’ would have a density of 25% (one content word out of four), since in English translation the clause has to be broken down into its component morphemes, resulting in four orthographic words. Though the English and Arabic clauses have the same informational load, which is regarded here as the essence of lexical density, they vary considerably when measured by Ure’s method.
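The contrast can be reproduced with a minimal Python sketch of Ure’s calculation. The function name `ure_density` and the hand-made sets of content words are assumptions made purely for illustration; they stand in for the manual classification of words and are not part of any established tool.

```python
def ure_density(text, content_words):
    """Ure-style lexical density: content words as a percentage of all
    orthographic (whitespace-delimited) words."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    lexical = sum(1 for word in words if word in content_words)
    return 100 * lexical / len(words)


# Hypothetical hand-labelled content words for each version of the clause.
arabic = ure_density("wa-hazama-hum", {"wa-hazama-hum"})
english = ure_density("and he defeated them", {"defeated"})

print(arabic)   # 100.0
print(english)  # 25.0
```

Because the count depends only on where spaces fall, the same clause receives two very different values, even though nothing in its informational load has changed.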

This shows that if the measurement is based on orthographic words and on the ratio of content words to the total number of words, the results will not always reflect the true informational density of Arabic texts. One possible way to overcome this problem would be to break down Arabic orthographic words into their component grammatical morphemes, which can significantly reduce lexical density in Arabic texts, but it will not show significant differences between varieties of Arabic, since it will result in the reduction of lexical density in all cases (Al-Wahy, 2014). In addition, it is not the standard practice in measuring lexical density to include bound morphemes, even with agglutinative, polysynthetic languages (e.g., Stegen, 2007; Johansson, 2009).


3.2 Halliday’s Method

Problems of the kind shown above will not emerge if Halliday’s method is used. Halliday (1996/2007, p. 104) criticizes measuring lexical density as the ratio of lexical words to function words, arguing that in languages “where the ‘function’ elements more typically combine with the ‘content’ lexeme to form a single inflected word, such a measure would not easily apply.” The alternative method he suggests, which can presumably apply to all types of language, links lexical density with the number of clauses in the text, not with the total number of words. Meanings do not exist in a vacuum, but are expressed within linguistic frames, normally the clauses, that organize their presentation in the text. Words, as Halliday (1989, p. 66) explains, “are not packed inside other words; they are packaged in larger grammatical units – sentences and their component parts.” For Halliday, lexical density is represented by the number of lexical words in a ranking clause. In the case of whole texts, lexical density is the result of dividing the number of lexical words by the number of ranking clauses in the text, as shown in the following equation:

$$\text{Lexical density} = \frac{\text{number of lexical words}}{\text{number of ranking clauses}}$$

Lexical density here is not a percentage as it is in Ure’s method, but a figure whose value is proportional to the informational density of the text. In applying this method, it is not required to count the number of function words in the text, nor is there any need to count the number of grammatical morphemes attached to the word. All that is needed is the number of lexical words and the number of ranking clauses in the text. This point can be illustrated by the example discussed in 3.1 above (wa-hazama-hum ‘and-defeated-he-them’), where it is shown that Ure’s method would give strikingly different results for the Arabic sentence and its English translation. By contrast, Halliday’s method would give the same value for both the Arabic sentence and its English translation, namely, the value of 1, since the ranking clause in both languages contains one content word. This indicates that Halliday’s method measures density at the deeper level of information, even if it is expressed in a different language.
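A minimal Python sketch of this calculation is given here, assuming that the lexical words and the ranking clauses have already been counted by hand. The function name `halliday_density` is hypothetical and serves only to illustrate the arithmetic.

```python
def halliday_density(lexical_words, ranking_clauses):
    """Halliday-style lexical density: lexical words per ranking clause."""
    if ranking_clauses <= 0:
        raise ValueError("at least one ranking clause is required")
    return lexical_words / ranking_clauses


# Both versions of the example are one ranking clause with one content word.
arabic = halliday_density(lexical_words=1, ranking_clauses=1)   # wa-hazama-hum
english = halliday_density(lexical_words=1, ranking_clauses=1)  # and he defeated them

print(arabic, english)  # 1.0 1.0
```

Neither the number of function words nor the number of bound morphemes enters the calculation, which is why the two languages receive the same value.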

However, Halliday’s method involves difficulties of a different type when it is applied to Arabic. In particular, these have to do with the differences between traditional Arabic grammar and Systemic Functional Grammar (SFG) regarding certain clause types. In SFG (e.g., Halliday, 1994; Halliday & Matthiessen, 2014), clauses are divided into two main types: ranking clauses, which are counted when measuring lexical density, and embedded, or rankshifted, clauses, which are not included in the counting. Ranking clauses are those that do not perform a grammatical function typically performed by units lower in the rank scale like words and groups, such as the subject or object functions, which are typically performed by nouns or nominal groups. If a clause performs such a function in a larger clause, it is regarded as embedded. Ranking clauses fall into three categories: independent clauses, paratactic clauses (roughly corresponding to clauses joined by coordination in traditional grammar), and hypotactic clauses (roughly corresponding to clauses joined by subordination). In Arabic, these types are represented by examples (1)–(3), respectively:

(1)        ||| qaṣada         al-’Ifrinj          al-diyār           al-Miṣriyyah    fī

headed                       the-Franks       the-lands         the-Egyptian   in

            jaysh                    caẓīm |||

an-army          huge

‘The Franks headed for the Egyptian lands in a huge army.’ (Text 1, LMA)

(2)        ||| Fa-cinda      dhālika                        malaka                       al-Nāṣir           al-qaṣr

So-at               that                 seized              al-Nasir           the-palace

|| wa-ḍayyaqa                 calā      al-khalīfah |||

and-tightened-he        on        the-Caliph

‘And then, al-Nasir seized the palace and tightened it on the Caliph.’ (Text 1, LMA)

(3)                   ||| Wa-lammā   qutila               ||wallaw                      ibna-hū

And-when      was-killed-he  put-they-in-office       son-his

al-Muẓaffar      cAlī |||

al-Muzaffar      Ali

‘And when he was killed, they put in office his son, al-Muzaffar Ali.’ (Text 2, LMA)

There are two points where SFG differs from traditional Arabic grammar in this connection. First, when it comes to measuring lexical density, SFG distinguishes between defining and non-defining clauses, a distinction which has not been observed by traditional Arabic grammarians but which still exists in the Arabic language with its different varieties. Of the two types, defining clauses are regarded as rankshifted, and thus are not counted as separate clauses when measuring lexical density, while non-defining clauses are regarded as hypotactic clauses and are included in the measurement. Examples (4) and (5) below represent defining and non-defining relative clauses in Arabic, respectively.

(4)        … ’anna          al-mucāhadah [[allatī            waqqaca           calayhā]]

            … that                        the-treaty        which              signed-he        on-it

                        hiya     al-mumkin       fī          zamānihā

it          the-possible     in         time-its

‘… that the treaty which he signed was the possible at its time.’ (Text 3, MSA)

(5)        …wa-mucāhadat Virsay         ||allatī  rattabat           natā’iga-hā ||

…and-Treaty Versailles        which arranged-it      outcomes-its

‘… and the Treaty of Versailles, which arranged its outcomes.’ (Text 1, MSA)

Second, in SFG, projected clauses are regarded as ranking clauses, with quoted speech being paratactic and reported speech hypotactic, though in traditional Arabic grammar both types are regarded as embedded clauses functioning as object. For the sake of consistency with the SFG theory, this paper adopts the Hallidayan approach to such clause types, which is essential for the measurement of lexical density. If, for the sake of simplicity, all clauses were counted in the calculation of lexical density, whether they are ranking or rank shifted, content words would have to be counted twice, once as part of the rankshifted clause and once for the ranking clause in which it is embedded (Van de Kopple, 2003), which would obviously detract from the reliability of the results.

Most corpus-based studies of lexical density have used Ure’s method, not least because it is easier to apply automatically to large corpora. Halliday’s method seems to require much pre-editing of texts to prepare them for the use of corpus linguistic tools, to distinguish between ranking and embedded clauses and exclude the latter from the calculation. For instance, Castello (2008, p. 53) observes that Halliday’s method “is not ready-made, in that its calculation needs the same manual tagging that has to be carried out for measuring grammatical intricacy,” and therefore adopts Ure’s method instead. Similarly, in his corpus-based analysis of computer-mediated communication, Yates (1996) refers to Halliday for the differences in lexical density between spoken and written language, though in the application he adopts Ure’s method. In the same vein, while Neumann (2014, p. 76) admits that Halliday’s method “is certainly better suited” to the type of contrastive genre analysis she performs, she uses “the less accurate” method of measurement suggested by Ure because of its applicability to corpus analysis. As noted above, lexical density is measured manually in the present study, given the lack of corpus tools that can accurately apply the Hallidayan method, with its distinction between ranking and rankshifted clauses, to the Arabic language. Manual analysis is also more appropriate than electronic counting for the size of the texts (about 700 words for each variety) and can produce more reliable results.


4. Lexical Density in Sample LMA and MSA Texts

The two methods outlined above are applied here to sample texts from LMA and MSA, first to illustrate how each method can be practically applied to the Arabic language and, second, to see if they display significant differences between the two varieties regarding lexical density. Only one sample short text from each variety is shown in detail here, while the eight texts that have been examined are given in the Appendix. To avoid repetition, each text is written once using the conventions of delimiting clause boundaries used in SFG (e.g., Halliday & Matthiessen, 2014), which are required in Halliday’s method only. The content morphemes are printed in bold face.


4.1 Sample Text 1 (LMA)

(6)        |||Fa-cinda       dhālika                        malaka           al-Nāṣir          al-qaṣr

            So-at               that                 seized              al-Nasir           the-palace

            ||wa-ḍayyaqa  calā      al-khalīfah      ||wa-ḥabasa    ’aqāriba-

            and-tightened-he on   the-caliph        and-imprisoned-he his-relatives

            || wa-qatala     ’acyān             dawlati-hī        || wa-stawlā      calā     

            and-killed-he  nobles             rule-his            and-seized-he             on

            [[mā    fī          al-quṣūr          min      al-dhakhā’ir   wa-al-’amwāl

                        what    in         the-palaces      of        the-treasures   and-the-money

            wa-al-nafā’is]]           || bi-ḥaythu      istamarra al-bayc        fī-hi

            and-the-valuables       so-that                        continued the-selling in-it

            cashra sinīn    ghayra [[mā    iṣṭafā-hu          Ṣalāḥuddīn

            ten       years    apart-from       what    selected-it       Salahuddin

            li-nafsi-h]]. |||


‘And then, al-Nasir [Salahuddin] seized the palace, tightened it on the Caliph, imprisoned his relatives, killed the notables of his rule, and seized all the treasures, valuables, and money that the palaces contained, such that selling them continued for ten years, apart from what Salahuddin kept for himself.’ (Text 1, LMA)


4.1.1 Ure’s method.

The number of running words in Sample Text 1 is 33, of which 21 are content words. Since Ure’s method is based on the orthographic word, the bound morphemes are not counted separately. Thus the lexical density in Ure’s method is 64%, which is a high percentage.


4.1.2 Halliday’s method.

Halliday’s method requires breaking down the text into clauses, marking and excluding rankshifted clauses, and then dividing the number of content words by the number of ranking clauses. This is shown in (6) above, in which ranking and embedded clauses and clause complexes are represented differently. Sentence (6) is a single clause complex consisting of 8 clauses, of which 6 are ranking and 2 are embedded, both performing nominal group functions and beginning with ‘what’. In this method, the total number of words is irrelevant; what counts is the number of content words and the number of ranking clauses. With 21 content words and 6 ranking clauses, the resulting lexical density is 3.5, which is a median value in Halliday’s account.
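The clause counting described here can be sketched in Python, assuming the text has already been annotated by hand with the boundary marks used in (6): `|||` for clause-complex boundaries, `||` for clause boundaries, and `[[ ]]` for embedded clauses. The helper `count_clauses` is a hypothetical illustration, not an existing tool; it does not handle embedded clauses nested inside one another, and the figure of 21 content words is taken from the manual count reported above.

```python
import re

# Example (6) in the clause-boundary notation used above.
SAMPLE_1 = (
    "|||Fa-cinda dhālika malaka al-Nāṣir al-qaṣr "
    "||wa-ḍayyaqa calā al-khalīfah ||wa-ḥabasa ’aqāriba- "
    "|| wa-qatala ’acyān dawlati-hī || wa-stawlā calā "
    "[[mā fī al-quṣūr min al-dhakhā’ir wa-al-’amwāl wa-al-nafā’is]] "
    "|| bi-ḥaythu istamarra al-bayc fī-hi cashra sinīn ghayra "
    "[[mā iṣṭafā-hu Ṣalāḥuddīn li-nafsi-h]]. |||"
)


def count_clauses(annotated):
    """Return (ranking, embedded) clause counts for an annotated text."""
    embedded = annotated.count("[[")
    # Collapse each embedded clause to a placeholder so that it stays
    # inside the ranking clause that contains it.
    flattened = re.sub(r"\[\[.*?\]\]", "X", annotated)
    # Clause-complex boundaries (|||) also delimit ranking clauses.
    segments = re.split(r"\|{2,3}", flattened)
    ranking = sum(1 for segment in segments if segment.strip(" ."))
    return ranking, embedded


ranking, embedded = count_clauses(SAMPLE_1)
content_words = 21  # manual count reported for Sample Text 1

print(ranking, embedded)        # 6 2
print(content_words / ranking)  # 3.5
```

The sketch only automates the counting of marks that an analyst has already inserted; deciding which clauses are ranking and which are rankshifted remains a manual step.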


4.2 Sample Text 2 (MSA)

(7)        |||Kānat           bidāyat            al-Ḥarb al-cĀlamiyyah          al-Thāniyah

            Was-it beginning        the-War the-World                 the-Second

Yawma thalāthah      Sibtambir        ‘alf-wa-tiscimi’ah-wa-tiscah-wa-thalāthīn

day      three                September      thousand-and-nine-hundred-and-nine-and-thirty


||| wa-qad        nashabat         bayna  ’Almānya       al-Nāziyyah    min

and-indeed     erupted-it        between Germany      the-Nazi          on

            nāḥiyah          wa-bayna        Briṭānya         wa-Faransā    min

one-side          and-between   Britain             and-France      on

al-nāḥiyah      al-’ukhrā         |||Wa-sababu-hā          al-mubāshir

the-side           the-other.        And-cause-its the-direct

                        raghbat           Hitlar              (zacīm  ’Almānya       al-Nāziyyah)   fī

                        desire              Hitler              (leader Germany         the-Nazist)      in

                        isticādat           minṭaqat         Danzig                     fi             Bulandā         ||kay

restoration       area                 Danzig            in         Poland so-that

            tacūda  ’ilā       ’Almānya       ||bacda ’an       sulikhat           min-hā

return-it  to       Germany         after    that      was-stripped-it           from-it

                        natīgah           li-al-Ḥarb        al-cĀlamiyyah al-’Ūlā            wa-mucāhadat

result   of-the-War      the-World       the-First          and-Treaty

            Virsay ||allatī              rattabat           natā’iga-hā |||

Versailles which          arranged-it      outcomes-its

‘The beginning of the Second World War was on September 3rd, 1939. It erupted between Nazi Germany on the one hand and Britain and France on the other. Its direct cause was the desire of Hitler (the leader of Nazi Germany) to restore the area of Danzig in Poland, after it had been stripped from Germany as a result of the First World War and the Treaty of Versailles, which arranged its outcomes.’ (Text 1, MSA)


4.2.1 Ure’s method.

The total number of words in (7) is 52, of which 37 are content words. Thus, the lexical density of this text is 71%. Like the sample text in (6), this is a high percentage in Ure’s account.


4.2.2 Halliday’s method.

The text in (7) has 2 clause simplexes and 1 clause complex; the clause complex consists of 4 ranking clauses, including the final non-defining relative clause. Accordingly, the total number of ranking clauses in the text is 6. Since the text has 37 content words, its lexical density according to Halliday’s method is 6.2 (37 divided by 6), which is much higher than the density of the LMA text.
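The arithmetic for the two sample texts can be checked with a short Python sketch. The counts are those reported in Sections 4.1 and 4.2; the names `SAMPLES` and `densities` are hypothetical and chosen only for this illustration.

```python
# Counts as reported in Sections 4.1 and 4.2 (obtained by manual analysis).
SAMPLES = {
    "Sample Text 1 (LMA)": {"words": 33, "content": 21, "ranking_clauses": 6},
    "Sample Text 2 (MSA)": {"words": 52, "content": 37, "ranking_clauses": 6},
}


def densities(counts):
    """Return (Ure percentage, Halliday value) for one set of counts."""
    ure = 100 * counts["content"] / counts["words"]
    halliday = counts["content"] / counts["ranking_clauses"]
    return ure, halliday


for name, counts in SAMPLES.items():
    ure, halliday = densities(counts)
    print(f"{name}: Ure = {ure:.1f}%, Halliday = {halliday:.2f}")
# Sample Text 1 (LMA): Ure = 63.6%, Halliday = 3.50
# Sample Text 2 (MSA): Ure = 71.2%, Halliday = 6.17
```

Both sample texts happen to contain six ranking clauses, so the difference in their Halliday values comes entirely from the number of content words packed into those clauses (21 as against 37).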

The results obtained for the sample texts are in line with those yielded by measuring lexical density in the eight other texts taken from the same sources (see Appendix), four representing LMA and four representing MSA (Table 1).

Table 1

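Lexical density (LD) in LMA and MSA texts using Ure’s and Halliday’s methods

| Method | LMA Text 1 | LMA Text 2 | LMA Text 3 | LMA Text 4 | Mean LD in LMA | MSA Text 1 | MSA Text 2 | MSA Text 3 | MSA Text 4 | Mean LD in MSA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ure’s | 74% | 77% | 74% | 72% | 74% | 64% | 70% | 63% | 57% | 64% |
| Halliday’s | 3.6 | 2.7 | 3.2 | 2.8 | 3.0 | 6.7 | 8.5 | 7.2 | 10.5 | 8.2 |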



5. Discussion of the Results

A number of observations can be made based on the analysis of the texts and the results shown above. As Table 1 shows, Ure’s method yields high lexical density irrespective of the variety of the text, though the mean lexical density is about 10 percentage points higher in LMA than it is in MSA. This difference is not in itself significant, given that the density in both varieties is high, but the fact that the lexical density in LMA is almost consistently higher than that of MSA warrants some explanation. As shown above, Arabic is a synthetic, agglutinating language that attaches many grammatical items to content words, which is why the lexical density is high in Ure’s method in both varieties. However, the slightly higher values of lexical density in LMA suggest that this variety tends to attach more grammatical morphemes to lexical items than MSA. In other words, LMA uses fewer function words, and is therefore more synthetic, than MSA. An illustrative example is the final bound pronoun in the phrase wa-fī ṣabīḥat yawm al-sabt sādis cishrīna-h ‘and-on the-morning [of] Saturday its-twenty-sixth’ (Text 4, LMA). This structure is not used in MSA, but has to be rephrased more explicitly into something like wa-fī ṣabaḥ yawm al-sabt al-sādis wa-al-cishrīn min dhālika al-shahr ‘and-in the-morning [of] Saturday the-twenty-sixth of that month’, where the preposition min ‘of’ and the demonstrative dhālika ‘that’ replace the bound genitive morpheme. Since Ure’s method measures orthographic words rather than morphological units, it is only natural that it yields higher lexical density with more agglutinating varieties. Generally, however, Ure’s method yields only minor differences between LMA and MSA regarding lexical density, as shown in Figure 1.

Figure 1. Differences in lexical density between the LMA and MSA texts by Ure’s method

By contrast, Halliday’s method displays sharp differences between the two varieties, with LMA being much lower in lexical density than MSA (Figure 2). On average, Halliday’s method indicates that lexical density in MSA is more than double that of LMA, which is even greater than the ratio observed by Halliday for written and spoken English. The reason for this striking difference can be attributed to the fact that LMA uses more, but shorter, ranking clauses than MSA, usually joined by coordination. These clauses, which are usually verbal, are used as the divisor in calculating lexical density, and thus decrease the result obtained. On the other hand, MSA tends to use longer clauses and clause complexes, which are fewer in number than those of LMA but carry more content words. The content words are divided by fewer ranking clauses, resulting in higher lexical density.

Figure 2. Differences in lexical density between the LMA and MSA texts by Halliday’s method
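The size and direction of the two contrasts can be made explicit with a few lines of Python applied to the mean values reported in Table 1. The variable names are hypothetical, and the sketch uses the rounded means exactly as given in the table.

```python
# Mean values as reported in Table 1.
MEANS = {
    "Ure": {"LMA": 74.0, "MSA": 64.0},      # percentages
    "Halliday": {"LMA": 3.0, "MSA": 8.2},   # lexical words per ranking clause
}

ure_gap = MEANS["Ure"]["LMA"] - MEANS["Ure"]["MSA"]
halliday_ratio = MEANS["Halliday"]["MSA"] / MEANS["Halliday"]["LMA"]

print(f"Ure: LMA is higher by {ure_gap:.0f} percentage points")
print(f"Halliday: MSA is {halliday_ratio:.2f} times as dense as LMA")
# Ure: LMA is higher by 10 percentage points
# Halliday: MSA is 2.73 times as dense as LMA
```

The two measures thus point in opposite directions: Ure’s places LMA slightly above MSA, while Halliday’s places MSA well above LMA.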

It is significant that Ure’s method yields higher lexical density in LMA while in Halliday’s method it is quite the opposite. This apparent contradiction suggests that the two methods do not measure the same phenomenon, or at least that they measure two different kinds of lexical density. Ure’s measure is single-layered and rather static, as it is based only on the lexical distinction between content and function words, without showing how the ideas mentioned in individual clauses contribute to the density of the whole text. By contrast, Halliday’s method is multi-layered and more dynamic. It incorporates the lexical level as well as the grammatical level, with an additional deeper distinction between ranking and rankshifted clauses. It takes into consideration the role of smaller grammatical units by showing how meanings are presented in individual clauses and how they build up the lexical density of the text as a whole. The two methods may happen to give similar results in the case of English (though this claim may require further research), but, as has been seen, in a language like Arabic, the results are contradictory. The fact that only Halliday’s method reveals significant differences between the two historical varieties of Arabic suggests that it is more illuminating than Ure’s method. This supports Halliday’s (1996/2007) view that depending on the number of content words as a ratio against the total number of words fits only languages like English, where grammatical items are often lexicalized.

The examination of the texts suggests that there are different patterns of lexical density in LMA and MSA. Lexical density in LMA usually stems from the modification of nouns by the use of coordinating conjunctions, with the coordinated constituents being short and structurally similar to each other, reflecting the ornamental rhetorical device of parallelism (or ḥusn al-taqsīm ‘beautiful division’), a feature which is frequently encountered in LMA texts. In example (8), the coordination of nouns results in lexical density of 10 if calculated by Halliday’s method, which is remarkably higher than the other parts of the text.

(8)        ||| Wa-‘asarū               [[man bi-hā    min      jumhūr          

And-captured-they    who     in-it     from    nobles

            al-Muslimīn   wa-al-fuqahā’            wa-al-culamā’             wa-al-’a’immah

            the-Muslims    and-the-jurists and-the-scholars         and-the-imams

wa-al-qurrā’              wa-al-muḥaddithīn                wa-’akābir     

                        and-the-readers          and-the-hadith-scholars         and-chief

al-’awliyā’      wa-al-ṣaliḥīn]]            |||Wa-fīhā         khalīfat

the-devout      and-the-righteous       and-in-it          vicegerent

                        Rabb   al-cālamīn       wa-imām       al-muslimīn    wa-ibn

            Lord    the-worlds      and-imam        the-Muslims    and-son

                camm  Sayyid al-Mursalīn|||

‘They captured the Muslim nobles, jurists, scholars, imams, hadith scholars, and eminent devout and righteous people living in it. In it, there was the Caliph, the Vicegerent of the Lord of the Worlds, the Ruler of Muslims, and the cousin of the Master of Messengers.’ (Text 3, LMA)

It is observed that the coordination of verbs, as in example (9), which is more common in LMA, does not have a similar effect on lexical density, since any verb is necessarily an obligatory component (rukn) of a verbal clause, with an explicit or an implicit subject. A verb necessarily results in a new clause that is counted when measuring lexical density if it is a ranking clause, which lowers the lexical density of the text. The frequency of successive verbal clauses in the LMA texts can be the main reason for their lexical sparsity:

(9)        … bacda          ’an       kānū                || malakū         mucẓam          

            … after           that      had-they         seized-they     most

            al-macmūr      min      al-’arḍ || wa-qaharū               al-mulūk

            the-populated of        the-land          and-vanquished-they the-kings

            || wa-qatalū     al-cibād           || wa-akhrabū             al-bilād.||

and-killed-they the-people     and-damaged-they     the-lands.

‘… after they had conquered most of the populated parts of the earth, vanquished kings, killed people, and devastated lands.’ (Text 2, LMA)

If measured by Ure’s method, the lexical density of (9) will be 64%, which is a high percentage. If measured by Halliday’s method, it will be 2.3, which is a low figure. This conforms to Eggins’ (2004) view that lexical density mainly results from nominalization and noun modification, rather than the use of other content words, including verbs. Noun modification, she explains, adds more content words to the nominal group, such as those denoting the number, description, and class of the noun, in addition to relative clauses and prepositional phrases qualifying it, while most forms of verb modifications add only function words related to tense, aspect, voice, and modality (Eggins, 2004, pp. 96–97).

In MSA, where lexical density is always higher if measured by Halliday’s method, there are different factors that contribute to such density through loading clauses with content words. First, the MSA texts abound in examples of nominalization, such as ’iclān ḥālat al-ṭawāri’ ‘the declaration of the state of emergency’, isticmāl marāfiq Miṣr ‘the use of Egypt’s public utilities’, and cadam al-mushārakah fī mayādīn al-qitāl ‘non-participation in the battlefields’ (Text 2, MSA). Second, there is the heavy modification of nouns, whether by adjoining other nouns to make possessives or multiple possessives, as in isticādat minṭaqat Danzig ‘the restoration of the area of Danzig’ (Text 1, MSA) and siyāsat tagnīb Miṣr waylāt al-ḥurūb ‘the policy of saving Egypt the horrors of war’ (Text 2, MSA), both consisting of successive content words, or by relative clauses and tawābic (through coordination, apposition, and qualification (or post-modification)). This, however, occurs without attending to parallelism, as is the case with LMA. A third source of density is the use of parenthetical phrases, which are sometimes long and full of content words. An example is: —’ay bacda arbacat shuhūr min nushūb al-qitāl wa thalāthat shuhūr min suqūt Būlandā— ‘—that is, four months after the eruption of the fighting and three months after the fall of Poland—’ (Text 1, MSA). In addition, the frequency of embedded clauses increases the lexical density in MSA texts, since such clauses come loaded with content words but are not counted in the measurement of lexical density.


6. Concluding Remarks

The present study has shown that Ure’s and Halliday’s methods display more differences than similarities when applied to Arabic and that they possibly measure two different linguistic phenomena. Ure’s method presents a rather static, single-layered view of lexical density, while Halliday’s provides a more dynamic, multi-layered view that reveals how the framing of meanings within grammatical units contributes to the overall density of a text. Of the two methods, Ure’s is easier to apply and more suitable for corpus analysis. However, since it is based on the orthographic word, it is not appropriate for languages like Arabic, where many grammatical items are attached to content morphemes and are not counted as separate words. In addition, it does not show significant differences between LMA and MSA regarding lexical density. Halliday’s method, on the other hand, does not easily lend itself to corpus analysis, since it requires much manual pre-editing to distinguish between ranking and rankshifted clauses. It also requires adopting a functional approach to the analysis of Arabic sentences, which can be different from the traditional approach in many cases. However, Halliday’s method has advantages that outweigh these difficulties. First, it is universal and applicable to different language typologies. Second, it can reveal significant differences between historical varieties of Arabic. Third, it is based on the plausible idea that meanings do not exist in a vacuum but occur within the framework of syntactic structures that organize them in the text. It is hoped that this will encourage researchers to consider the Hallidayan method before pronouncing on lexical density in Arabic.

The study also indicates that in the cases where lexical density exists in the two varieties, this is usually due to different reasons. When LMA displays lexical density, this is mostly due to the high rate of use of noun coordination, which often aims to create parallelism as an ornamental rhetorical device. In MSA texts, lexical density stems from loading clauses with many content words and from the frequent use of embedded clauses and parenthetical phrases, in addition to nominalization and the qualification of nouns by adjectives, adjectival phrases, and defining relative clauses.

If, as most of the literature on lexical density suggests, lexical density is inversely proportional to text readability, then it can be claimed that, all other things being equal, LMA texts are more accessible to the reader than MSA texts. Similarly, if lexical density is a feature that is associated with writing rather than with speech, then the study leads to the conclusion that LMA, though still a written variety of Arabic, is closer to orality than MSA, which is more influenced by the patterns and structures of modern European languages. Before this conclusion can be generalized, however, it needs to be confirmed by further studies on a wider range of texts representing different genres of Arabic.

The results of the study can be applied to various purposes that involve the analysis of Arabic texts, as has been done with other languages. In addition to comparing and characterizing various genres and language varieties, both historical and otherwise, measuring lexical density can be used for pedagogical purposes, such as evaluating the appropriateness of textbooks for students at different stages, comparing the output of learners and native speakers of Arabic, or describing the development of learners’ writing skills. Measuring lexical density in Arabic is also particularly relevant to both translation theory and practice. It can be used for comparing translated and non-translated Arabic texts to see if there are divergences between them in terms of informational density. In addition, since lexical density affects text readability, translators can use it as a tool for making target texts more accessible to a certain readership by spreading the information over a larger number of clauses to make it more suitable for the intended reader. Reducing density can then be a means of explicitation, which is a common feature of target texts. This is another area in which there are possibilities for further research.




References

Al-Jabarti, A. H. (1880/1997). cAjā’ib al-’āthār fi al-tarājim wa al-’akhbār [Wonders of traditions in biographies and events]. Ed. A. A. Abdul-Rahim. Cairo: Dar al-Kutub.

Al-Razi, M. A. (1907/1995). Mukhtār al-Ṣiḥaḥ [Al-Ṣiḥaḥ lexicon abridged]. Ed. M. Khatir. Beirut: Librairie du Liban.

Al-Wahy, A. S. (2014). Al-kathafah al-lafẓiyyah bayna al-Fuṣḥā al-Mucāṣirah wa Fuṣḥā al-Turāth [Lexical density in Modern Standard Arabic and Heritage Standard Arabic]. In M. M. Al-Saidi (Ed.), Al-dirasāt al-cArabiyyah fi cālam mutaghayyir [Arabic Studies in a changing world]. Vol. 1 (pp. 157–182). Cairo: Ain Shams University.

Badawi, A. M. (1973). Mustawayāt al-cArabiyyah al-mucāṣirah fī Miṣr [Varieties of contemporary Arabic in Egypt]. Cairo: Dār al-Macārif.

Cachia, P. (1973). A dictionary of Arabic grammatical terms: Arabic-English, English-Arabic. London: Longman.

Castello, E. (2008). Text complexity and reading comprehension tests. Linguistic Insights: Studies in Language and Communication, No. 85. Bern: Peter Lang.

Corver, N. & van Riemsdijk, H. (2001). Semi-lexical categories. In N. Corver & H. van Riemsdijk (Eds.), Semi-lexical categories: The function of content words and the content of function words (pp. 1–22). Berlin: Mouton.

Cruse, A. (2011). Meaning in language: An introduction to semantics and pragmatics (3rd ed.). Oxford: Oxford University Press.

Crystal, D. (2008). A dictionary of linguistics and phonetics (6th ed.). Malden, MA: Blackwell.

Dickins, J., Hervey, S. & Higgins, I. (2002). Thinking Arabic translation: A course in translation method, Arabic to English. London: Routledge.

Eggins, S. (2004). An introduction to systemic functional linguistics (2nd ed.). London: Continuum.

El-Farahaty, H. (2015). Arabic-English-Arabic legal translation. London: Routledge.

Fries, C. (1952). The structure of English. New York: Harcourt.

Halliday, M. A. K. (1989). Spoken and written language (2nd ed.). Oxford: Oxford University Press.

Halliday, M. A. K. (1994). Introduction to functional grammar (2nd ed.). London: Routledge.

Halliday, M. A. K. (1996/2007). Literacy and linguistics: A functional perspective. In J. J. Webster (Series Ed.), The collected works of M. A. K. Halliday. Vol. 9, Language and education (pp. 97–129). London: Continuum.

Halliday, M. A. K. & Matthiessen, C. M. I. M. (2014). Halliday’s introduction to functional grammar (4th ed.). London: Routledge.

Harrison, S. & Bakker, P. (1998). Two new readability predictors for the professional writer: Pilot trials. Journal of Research in Reading, 21(2), 121–138.

Heikal, M. H. (2003). Suqūt niẓam: Limādha kānat Thawrat Yulyū 1952 lāzimah? [The downfall of a regime: Why was the July 1952 Revolution necessary?]. Cairo: Dār al-Shurūq.

Jackson, H. (2002). Lexicography: An introduction. London: Routledge.

Jarvis, S. (2013). Capturing the diversity in lexical diversity. Language Learning, 63, Supplement 1, 87–106. DOI: 10.1111/j.1467-9922.2012.00739.x

Johansson, V. (2009). Developmental aspects of text production in writing and speech. Travaux de l’Institut de Linguistique de Lund, 48. Lund: Lund University.

Laviosa, S. (1998). Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4), 557–570.

Linnarud, M. (1976). Lexical density and lexical variation: An analysis of the lexical texture of Swedish students’ written work. Studia Anglica Posnaniensia: An International Review of English Studies, 7, 45–52.

Linnarud, M. (1977). Some aspects of style in the source and in the target language. In J. Fisiak (Ed.), Papers and Studies in Contrastive Linguistics (Vol. 7, pp. 85–94). Poznań: Adam Mickiewicz University Press.

Majmac al-Lughah al-cArabiyyah [The Arabic Language Academy]. (1985). Al-mucjam al-wasīṭ [The middle-size dictionary] (3rd ed.). Cairo: Majmac al-Lughah al-cArabiyyah.

Mat Daud, N., Hassan, H., El-Tingari, S., & Abdul Aziz, N. (2014). Web-based Arabic text readability index. In L. Gómez Chova, A. López Martínez, & I. Candel Torres (Eds.), INTED 2014, 8th International Technology, Education and Development Conference: Conference proceedings (pp. 1574–1581). Valencia: IATED Academy.

Neumann, S. (2014). Contrastive register variation: A quantitative approach to the comparison of English and German. Berlin: Mouton.

Owens, J. (2006). A linguistic history of Arabic. Oxford: Oxford University Press.

Sotov, A. (2009). Lexical diversity in a literary genre: A corpus study of the Rgveda. Literary and Linguistic Computing, 24(4), 435–447.

Štajner, S. & Mitkov, R. (2011). Diachronic stylistic changes in British and American varieties of 20th century written English language. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage, Hissar, Bulgaria, 16 September 2011 (pp. 78–85). Retrieved from: ~in3168/papers/Stajner&Mitkov-11.pdf.

Stegen, O. (2007). Lexical density in oral versus written Rangi texts. SOAS Working Papers in Linguistics, 15, 173–184.

Stubbs, M. (2002). Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell.

Stubbs, M. (2004). Language corpora. In A. Davies & C. Elder (Eds.), The handbook of applied linguistics (pp. 106–132). Malden: Blackwell.

Ure, J. (1971). Lexical density and register differentiation. In G. E. Perren & J. L. M. Trim (Eds.), Applications of linguistics: Selected papers of the Second International Congress of Applied Linguistics, Cambridge 1969 (pp. 443–452). Cambridge: Cambridge University Press.

Van de Kopple, W. J. (2003). M. A. K. Halliday’s continuum of prose styles and the stylistic analysis of scientific texts. Style, 37(4), 369–381.

Xiao, R. & Yue, M. (2009). Using corpora in translation studies: The state of the art. In P. Baker (Ed.), Contemporary corpus linguistics (pp. 237–262). London: Continuum.

Yates, S. J. (1996). Oral and written linguistic aspects of computer conferencing: A corpus based study. In S. C. Herring (Ed.), Computer-mediated communication: Linguistic, social and cross-cultural perspectives (pp. 29–46). Amsterdam: John Benjamins.



The Arabic Texts Used in the Study

  1. LMA texts

Text 1

قصد الإفرنج الديار المصرية في جيش عظيم وملكوا بلبيس وكانت إذ ذاك مدينة حصينة ووقعت حروب بين الفريقين فكانت الغلبة فيها على المصريين وأحاطوا بالإقليم برا وبحرا وضربوا على أهله الضرائب.

ثم إن الوزير شاور أشار بحرق الفسطاط فأمر الناس بالجلاء عنها وأرسل عبيده بالشعل والنفوط فأوقدوا فيها النار فاحترقت عن آخرها واستمرت النار بها أربعة وخمسين يوما وأرسل الخليفة العاضد يستنجد نور-الدين وبعث إليه بشعور نسائه فأرسل إليه جندا كثيفا وعليهم أسد-الدين-شيركوه وابن أخيه صلاح-الدين-يوسف فارتحل الإفرنج عن البلاد وقبض أسد-الدين على الوزير شاور الذي أشار بحرق المدينة وصلبه وخلع العاضد على اسد-الدين الوزارة فلم يلبث أن مات بعد خمسة وستين يوما فولى العاضد مكانه ابن أخيه صلاح-الدين وقلده الأمور ولقبه الناصر فبذل لله همته وأعمل حيلته في إظهار السنة وإخفاء البدعة.

فثقل أمره على الخليفة العاضد فأبطن له فتنة أثارها في جنده ليتوصل بها إلى هزيمة الأكراد وإخراجهم من بلاده فتفاقم الأمر وانشقت العصا ووقعت حروب بين الفريقين أبلى فيها الناصر يوسف وأخوه شمس-الدولة بلاء حسنا وانجلت الحروب عن نصرتهما فعند ذلك ملك الناصر القصر وضيق على الخليفة وحبس أقاربه وقتل أعيان دولته واستولى على ما في القصور من الذخائر والأموال والنفائس بحيث استمر البيع فيه عشر سنين غير ما اصطفاه صلاح-الدين لنفسه.

وخطب للمستضيء العباسي بمصر وسير البشارة بذلك على بغداد ومات العاضد قهرا وأظهر الناصر يوسف الشريعة المحمدية وطهر الإقليم من البدع والتشيع والعقائد الفاسدة وأظهر عقائد أهل السنة والجماعة وهي عقائد الأشاعرة والماتريدية وبعث إليه أبو-حامد-الغزالي بكتاب ألفه له في العقائد فحمل الناس على العمل بما فيه ومحا من الإقليم مستنكرات الشرع وأظهر الهدي. ولما توفي نور-الدين الشهيد انضم إليه ملك الشام وواصل الجهاد وأخذ في استخلاص ما تغلب عليه الكفار من السواحل وبيت المقدس بعد ما أقام بيد الإفرنج نيفا وإحدى وتسعين سنة وأزال ما أحدثه الإفرنج من الآثار والكنائس.

ولم يهدم القيامة اقتداء بعمر رضي الله عنه وافتتح الفتوحات الكثيرة واتسع ملكه ولم يزل على ذلك إلى أن توفي سنة تسع وثمانين وخمسمائة ولم يترك إلا أربعين درهما وهو الذي أنشأ قلعة الجبل وسور القاهرة العظيم.  (Al-Jabarti 1880/1997, pp. 24–26)

Text 2

ولما انهزم الإفرنج ومات الصالح وتملك ابنه توران-شاه استوحش من مماليك أبيه واستوحشوا منه فتعصبوا عليه وقتلوه بفارسكور وقلدوا في السلطنة شجرة-الدر ثلاثة أشهر ثم خلعت وهي آخر الدولة الأيوبية ومدة ولايتهم إحدى-وثمانون سنة. ثم تولى سلطنة مصر عز-الدين-أيبك-التركماني-الصالحي سنة ثمان-وأربعين-وستمائة وهو أول الدولة التركية بمصر. ولما قتل ولوا ابنه المظفر-علي فلما وقعت حادثة التتار العظمى خلع المظفر لصغره وتولى الملك المظفر-قطز وخرج بالعساكر المصرية لمحاربة التتار فظهر عليهم وهزمهم ولم تقم لهم قائمة بعد ذلك بعد أن كانوا ملكوا معظم المعمور من الأرض وقهروا الملوك وقتلوا العباد وأخربوا البلاد. (Al-Jabarti 1880/1997, pp. 26–27)

Text 3

وفي سنة أربع-وخمسين-وستمائة ملكوا سائر بلاد الروم بالسيف وفي البحر‏. ‏

فلما فرغوا من ذلك نزل هولاكو-خان وهو ابن طولون-بن-جنكيز-خان على بغداد وذلك سنة ست-وخمسين وهي إذ ذاك كرسي مملكة الإسلام ودار الخلافة فملكها وقتلوا ونهبوا وأسروا من بها من جمهور المسلمين والفقهاء والعلماء والأئمة والقراء والمحدثين وأكابر الأولياء والصالحين وفيها خليفة رب العالمين وإمام المسلمين وأبن عم سيد المرسلين فقتلوه وأهله وأكابر دولته وجرى في بغداد ما لم يسمع بمثله في الآفاق.

ثم إن هولاكو-خان أمر بعد القتلى فبلغوا ألف-ألف-وثمانمائة-ألف وزيادة ثم تقدم التتار إلى بلاد الجزيرة واستولوا على حران والرها وديار-بكر في سنة سبع-وخمسين ثم جاوزوا الفرات ونزلوا على حلب في سنة ثمان-وخمسين-وستمائة واستولوا عليها وأحرقوا المساجد وجرت الدماء في الأزقة وفعلوا ما لم يتقدم مثله.

ثم وصلوا إلى دمشق وسلطانها الناصر يوسف-بن-أيوب فخرج هاربا وخرج معه أهل القدرة ودخل التتار إلى دمشق وتسلموها بالأمان ثم غدروا بهم وتعدوها فوصلوا إلى نابلس ثم إلى الكرك وبيت المقدس فخرج سلطان مصر بجيش الترك الذين تهابهم الأسود وتقل في أعينهم أعداد الجنود فالتقاهم عند عين جالوت فكسرهم وشردهم وولوا الأدبار وطمع الناس فيهم يتخطفونهم‏. ‏ووصلت البشائر بالنصر فطار الناس فرحا.  (Al-Jabarti 1880/1997, pp. 27–28)

Text 4

وفي صبيحة يوم السبت سادس-عشرينه خرج الفريقان إلى خارج القاهرة من باب قناطر-السباع واجتمعوا بالقرب من قصر-العيني ومعهم المدافع وآلات الحرب فتحارب الفريقان من ضحوة النهار إلى العصر وقتل من الفريقين من دنا أجله وأيوب-بك ومحمد-بك بالقصر.

ثم تراجع الفريقان إلى داخل البلد وتأخرت طائفة من العزب فأتى إليهم محمد-بك-الصعيدي واحتاط بهم وحاصرهم وبلغ الخبر قانصوه-بك فأرسل إليهم يوسف-بك ومحمد-بك وعثمان-بك فتقاتلوا مع محمد-بك-الصعيدي وهزموه وتبعوه على قنطرة السد وقد كان أيوب-بك داخل التكية المجاورة لقصر-العيني فلما رأى الحرب ركب جواده ونجا بنفسه فبلغ يوسف-بك بأنه بالتكية فقصدوه واحتاطوا بالقصر فأخبرهم الدراويش بذهابه فلم يصدقوهم ونهبوا القصر وخربوه وأحرقوه وعادوا إلى منازلهم.

وفي صبيحة يوم الأحد ذهب يوسف-بك-الجزار ونهب غيط إفرنج أحمد الذي بطريق بولاق. ثم اجتمعوا في محل الحرب، وتحاربوا ولم يزالوا على ذلك، وفي كل يوم يقتل منهم ناس كثير.

وفي ثاني أيام جمادى-الأولى اجتمع الأمراء الصناجيق بمنزل قائمقام وتنازعوا بسبب تطاول الحرب وامتداد الأيام ثم اتفقوا على أن ينادوا في المدينة بأن من اسمه في وجاق من الوجاقات السبعة ولم يحضر إلى بيت أغاته نهب ماله وقتل. (Al-Jabarti 1880/1997, pp. 85–86)

  2. MSA texts

Text 1

كانت بداية الحرب العالمية الثانية يوم 3 سبتمبر 1939، وقد نشبت بين “ألمانيا النازية” من ناحية وبين “بريطانيا” و”فرنسا” من الناحية الأخرى، وسببها المباشر رغبة “هتلر” (زعيم ألمانيا النازية) في استعادة منطقة “دانزيج” في بولندا كي تعود إلى ألمانيا بعد أن سلخت منها نتيجة للحرب العالمية الأولى (ومعاهدة فرساي التي رتبت نتائجها).

وتمكنت القوات الألمانية من احتلال “بولندا” كلها في سبعة عشر يوما ثم توقف القتال، وبدا أن الحرب نامت. لأن “ألمانيا” خطت الخطوة الأولى ونفذت بالسلاح ما أرادته في “بولندا”، ثم عجزت “بريطانيا” و”فرنسا” عن نجدة “بولندا” في الشرق، وتلا ذلك أن الجبهة الغربية (الفرنسية) مع ألمانيا بقيت خنادق متقابلة وساكنة، ومع أن السكون يتحول إلى صخب في بعض الأوقات حين تتبادل الجبهات وراء الخنادق ضربات المدافع، إلا أن ذلك بدا قصدا سياسيا قصاراه تذكير العالم بأن الحرب على الجبهة الغربية ما زالت دائرة!

ومع أوائل سنة 1940 – أي بعد أربعة شهور من نشوب القتال، وثلاثة شهور من سقوط “بولندا” – كان صوت الحرب خافتا لدرجة دعت إلى وصف هذه المرحلة بـ “الحرب الفارغة” لأن جيوش بريطانيا وفرنسا توقفت على ناحية من جبهة القتال، ووقفت القوات الألمانية على الناحية الأخرى. ثم إن الجيوش البريطانية التي عبرت بحر الشمال إلى الشواطئ الفرنسية ظلت هناك، لأن القيادة العليا الإمبراطورية رأت إبقاء قواتها في مقاطعة نورماندي شمال فرنسا حتى تظل مواصلاتها على البحر مفتوحة، قريبة وسالكة إلى قواعدها البريطانية – وفي كل ذلك فإن الحرب ليست وقوفا في الخطوط الدفاعية – سواء خط “ماجينو” الفرنسي أو خط “سيجفريد” الألماني – ولا انتظارا على الشاطئ الآخر من بحر الشمال. (Heikal, 2003, pp. 15–16)

Text 2

وبصرف النظر عن إعلان حالة الطوارئ وما يترتب عليها، فإن الملك “فاروق” ظل مترددا بسند وتشجيع من رجال أحاطوا بالقصر وقتها، وأولهم “علي-ماهر”-(باشا) الذي كان رئيسا للوزراء عندما قامت الحرب.

وضمن هذا التردد الملكي فإن وزارة “علي-ماهر”-(باشا) أعلنت ما سُمي وقتها، سياسة “تجنيب مصر ويلات الحرب”، ومؤدى هذه السياسة أن مصر تنفذ المطلوب منها بنصوص معاهدة سنة 1936 (وذلك تحقق بإعلان حالة الطوارئ) لكنها تحتفظ لنفسها بمسافة من ميادين القتال.

وكان الشيخ “محمد-مصطفى-المراغي” (شيخ الأزهر) قد عزز ذلك بشعار أطلقه يقول بأن ذلك الصراع في أوروبا “حرب لا ناقة لمصر فيها ولا جمل”.

ومع أن تلك بدت تعبيرات عن سياسة “تعاند الإنكليز” ولا تعارضهم، فإن لندن سكتت عليها؛ لأن هذه السياسة ساعتها كانت أكثر تحقيقا لمصالحها، فهي تتيح استعمال مرافق مصر لصالح المجهود الحربي (وفق معاهدة سنة 1936). وفي نفس الوقت فإن عدم المشاركة في ميادين القتال بمنطق “تجنيب مصر ويلات الحرب”، يفرض على الطليان والألمان نوعا من الحذر في نشاطهم الحربي ضد الأراضي المصرية حتى لا تتحول المشاعر والمواقف من نصف تبعية إلى تبعية بالكامل للاستراتيجية البريطانية! (Heikal, 2003, p. 21)

Text 3

كانت نهاية الثلاثينيات فترة اختبار قاس لحزب الوفد وزعيمه “مصطفى-النحاس”(باشا).

كان الحزب بلا جدال هو ممثل الأغلبية الوطنية في مصر، وموضع الثقة بالنسبة لجماهير شعبها في تلك الفترة، وكان زعيمه رمزا للمقاومة المصرية في مواجهة الاحتلال البريطاني، وفي مواجهة التجاوز الملكي للدستور، سواء بنزعات الاستبداد لدى الملك “فؤاد”، أو بمناورات القصر، وقد راحت تكرر نفسها في عهد ابنه الصبي الملك “فاروق”.

ومع أن وزارات الوفد لم تكن تجيء إلى الحكم إلا بإشارات بريطانية، فإن “النحاس”(باشا) لم يكن يعتبر تلك منحة من دولة الاحتلال بقدر ما هي احتياج إلى شرعية الوفد، خصوصا زمن الأزمات.

وعندما وضع النحاس(باشا) توقيعه على معاهدة 1936 كان يدرك أنها استقلال منقوص، لكنه يتفهم أن دولة الاحتلال لن تعطيه أكثر في ذلك الوقت، بينما نذر الحرب العالمية تظهر في أوربا (ومقدماتها قيام إيطاليا بغزو الحبشة واحتلالها سنة 1935) ثم ما تبع ذلك من تركيز الوجود الإيطالي أكثر في البحر الأبيض بتعزيز مواقعه في ليبيا.

ومع أن “مصطفى-النحاس”(باشا) تعرض لنقد شديد عند توقيع معاهدة سنة 1936 (حتى من بعض أنصاره وبينهم رئيس مجلس النواب الوفدي وقتها “أحمد-ماهر”(باشا))، فقد كان يقين “النحاس”(باشا) (وهو سليم) أن المعاهدة التي وقع عليها هي الممكن في زمانها، خصوصا عند دعمها في العام التالي بمعاهدة إلغاء الامتيازات الأجنبية خلال مؤتمر عقد لذلك الغرض (1937) في مونترو (سويسرا).

والحاصل أن “النحاس”(باشا) كان مرتاح الضمير مطمئنا. ولذلك فإن إقالة وزارته بعد أسابيع من عودته الظافرة من مؤتمر مونترو – ديسمبر 1937 – نزلت صدمة ثقيلة عليه. (Heikal, 2003, p. 51)

Text 4

وهكذا كان اللواء “محمد-نجيب” الذي جلست أمامه في بيته رجل الساعة (وسط ذروة الأزمة) ربما دون أن يقصد. ذلك أنه معروض عليه – أولا – من تنظيم الضباط الأحرار أن يقود خطتهم الجديدة بالسيطرة على الجيش (وقد عجلوا بها شهرا عما أبلغوه به لأنهم وجدوا الظرف السياسي مناسبا إلى جانب اعتبار أمني فرضه ما بلغهم من توصل أجهزة الأمن إلى قائمة بأسماء معظمهم). ثم أنه مطلوب منه – ثانيا – بتكليف من القصر الملكي والوزارة القائمة على الحكم وعلى لسان “مرتضى-المراغي” – “أن يبذل جهده لفض اعتصام أو عصيان قام به مجانين من ضباط الجيش نزلوا إلى الشوارع وإقناعهم بالعودة إلى بيوتهم وكأن شيئا لم يكن تجنبا لفضيحة أو مصيبة سوف تقع قبل أن يطلع الصبح!”(Heikal, 2003, p. 555)

[1] An earlier version of this study was presented at the First International Conference on Literature, Linguistics and Translation, held at the Faculty of Languages, Ain Shams University, in March 2016. I would like to thank my audience there for their stimulating questions and remarks. I would also like to thank two anonymous referees from IJAL for their careful reading and their comments on an earlier draft of the present paper. It goes without saying that I am solely responsible for any shortcomings that remain.

[2] For the English translation of Arabic grammatical terms, I have mainly depended on Cachia (1973).
