Approximating the distributions of runs and patterns

Brad C. Johnson (brad.johnson@umanitoba.ca) and James C. Fu (james.fu@umanitoba.ca)

Department of Statistics, University of Manitoba, Winnipeg, Canada

Journal of Statistical Distributions and Applications 2014, 1:5
http://www.jsdajournal.com/content/1/1/5
doi:10.1186/2195-5832-1-5

© 2014 Johnson and Fu; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.

Keywords: Finite Markov chain imbedding; Rate functions; Multi-state trials; Runs and patterns

Abstract

The distribution theory of runs and patterns has been successfully used in a variety of applications including, for example, nonparametric hypothesis testing, reliability theory, quality control, DNA sequence analysis, general applied probability and computer science. The exact distributions of the numbers of runs and patterns are often very hard to obtain or computationally problematic, especially when the pattern is complex and n is very large. Normal, Poisson and compound Poisson approximations are frequently used to approximate these distributions. In this manuscript, we (i) study the asymptotic relative error of the normal, Poisson, compound Poisson, finite Markov chain imbedding and large deviation approximations; and (ii) provide numerical studies comparing these approximations with the exact probabilities for moderately sized n. Both theoretical and numerical results show that, in the relative sense, the finite Markov chain imbedding approximation performs best in the left tail and the large deviation approximation performs best in the right tail.

AMS Subject Classification

Primary 60E05; Secondary 60J10

Introduction and notation

Let $\{X_i\}_{i=1}^{n}$ be a sequence of $m$-state trials ($m \ge 2$) taking values in the set $\mathcal{S} = \{s_1, \ldots, s_m\}$ of $m$ symbols. For simplicity, $\{X_i\}_{i=1}^{n}$ will be denoted $\{X_i\}$ and $n$ will be allowed to be $\infty$. A simple pattern $\Lambda = s_{i_1} s_{i_2} \cdots s_{i_\ell}$, of length $\ell$, is the juxtaposition of $\ell$ (not necessarily distinct) symbols from $\mathcal{S}$. Given a simple pattern $\Lambda$, we let $X_n(\Lambda)$ denote the number of either non-overlapping or overlapping occurrences of $\Lambda$ in the sequence $\{X_i\}_{i=1}^{n}$, where the method of counting will be made clear by the context. The waiting time $W(\Lambda, x)$ until the $x$'th occurrence of the simple pattern $\Lambda$ in $\{X_i\}_{i=1}^{n}$ is thus defined by

$$W(\Lambda, x) = \inf\{n \in \mathbb{N} : X_n(\Lambda) = x\},$$

and, by convention, the waiting time for the first occurrence is denoted $W(\Lambda) = W(\Lambda, 1)$. Finally, we define the inter-arrival times

$$W_i(\Lambda) = W(\Lambda, i) - W(\Lambda, i-1), \quad \text{for } i = 1, 2, \ldots,$$

where $W(\Lambda, 0) := 0$.

We say that two patterns $\Lambda_1$ and $\Lambda_2$ are distinct if neither $\Lambda_1$ appears in $\Lambda_2$ nor $\Lambda_2$ appears in $\Lambda_1$. If $\Lambda_1, \ldots, \Lambda_r$ are pairwise distinct simple patterns, we define the compound pattern $\Lambda = \bigcup_{i=1}^{r} \Lambda_i$, where an occurrence of any $\Lambda_i$ is considered an occurrence of $\Lambda$. For a compound pattern $\Lambda = \Lambda_1 \cup \cdots \cup \Lambda_r$, we similarly define

$$X_n(\Lambda) = \sum_{j=1}^{r} X_n(\Lambda_j).$$

The waiting times $W(\Lambda, x)$, $W(\Lambda)$ and $W_i(\Lambda)$ are then defined as above, and are often referred to as sooner waiting times.

From these definitions it is easy to see that, for any simple or compound pattern $\Lambda$ and any $x$ and $n$, the events $\{X_n(\Lambda) < x\}$ and $\{W(\Lambda, x) > n\}$ are equivalent and hence

$$P\{X_n(\Lambda) < x\} = P\{W(\Lambda, x) > n\}, \qquad (1)$$

which provides a convenient way of studying the exact and approximate distribution of $X_n(\Lambda)$ through the waiting time distribution of $W(\Lambda, x)$.

Throughout this paper, unless specified otherwise, we assume that the trials $\{X_i\}$ are either independent and identically distributed (i.i.d.) or first-order Markov dependent; that the pattern $\Lambda$ is either simple or compound; and that occurrences of $\Lambda$ are counted in a non-overlapping fashion.

The distribution of the number of runs and patterns in a sequence of multi-state trials or in random permutations of a set of integers has been successfully used in various fields of applied probability, statistics and discrete mathematics. Examples include reliability theory, quality control, DNA sequence analysis, psychology, ecology, astronomy, nonparametric tests, successions, and the Eulerian and Simon-Newcomb numbers (the latter three being defined for permutations). Two recent books, Balakrishnan and Koutras (2002) and Fu and Lou (2003), give some sense of the scope of the distribution theory of runs and patterns, and Martin et al. (2010) and Nuel et al. (2010) provide extensions to sets of sequences.

Given a pattern $\Lambda$, the exact distribution of $X_n(\Lambda)$ has traditionally been determined using combinatorial analysis on a case-by-case basis. The formulae for these distributions are often very complex and computationally problematic. Even for many simple patterns, their distributions in terms of combinatorial analysis remain unknown, especially when the $\{X_i\}$ are Markov dependent multi-state trials.

The waiting time $W(\Lambda)$ for the first occurrence of certain types of runs and patterns has been studied by many authors; see, for example, Blom and Thorburn (1982), Gerber and Li (1981), Schwager (1983), and Solov'ev (1966). More recently, Fu and Koutras (1994) developed a method for determining the exact distributions of $X_n(\Lambda)$ and $W(\Lambda)$ for any simple or compound $\Lambda$ in either i.i.d. or Markov dependent trials (see also Fu and Lou 2003). The method, referred to as the finite Markov chain imbedding (FMCI) technique, can be described as follows: given a simple or compound pattern $\Lambda$, there exists a finite Markov chain $\{Y_i\}$ defined on a finite state space, say $\Omega = \{1, \ldots, d, \alpha\}$, with an absorbing state $\alpha$ and transition probability matrix of the form

$$\mathbf{M} = \begin{pmatrix} \mathbf{N} & \mathbf{c} \\ \mathbf{0} & 1 \end{pmatrix}, \qquad (2)$$

where $\mathbf{c}$ is a column vector. The distribution of the waiting time for $\Lambda$ is given by

$$P\{W(\Lambda) = n\} = \boldsymbol{\xi}_0 \mathbf{N}^{n-1}(\mathbf{I} - \mathbf{N})\mathbf{1}', \qquad (3)$$

where $\boldsymbol{\xi}_0$ is the initial distribution, $\mathbf{N}$ is the essential transition probability matrix (i.e. the sub-stochastic matrix consisting of only the transient states of $\{Y_i\}$) as defined in (2), $\mathbf{I}$ is the $d \times d$ identity matrix and $\mathbf{1} = (1, 1, \ldots, 1)$ is a $1 \times d$ row vector. Furthermore, the random variable $X_n(\Lambda)$, the number of occurrences of $\Lambda$ in $\{X_i\}$, is also finite Markov chain imbeddable and its distribution is given by

$$P\{X_n(\Lambda) < x\} = P\{W(\Lambda, x) > n\} = \boldsymbol{\xi}_0 \mathbf{N}_x^{\,n} \mathbf{1}', \qquad (4)$$

where the essential transition probability matrix $\mathbf{N}_x$ has the block bidiagonal form

$$\mathbf{N}_x = \begin{pmatrix} \mathbf{N} & \mathbf{C} & & & \\ & \mathbf{N} & \mathbf{C} & & \\ & & \ddots & \ddots & \\ & \mathbf{0} & & \mathbf{N} & \mathbf{C} \\ & & & & \mathbf{N} \end{pmatrix}, \qquad (5)$$

the matrix $\mathbf{N}$ is given by (2), and the matrix $\mathbf{C}$ defines the "continuation" transition probabilities from one occurrence to the next and depends on $\mathbf{c}$ in (2).
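To make the mechanics concrete, the following minimal sketch implements (3)-(5) for the toy pattern $\Lambda = SS$ in i.i.d. Bernoulli($p$) trials with non-overlapping counting. The choice of pattern, the state labels and all variable names are ours, for illustration only.

```python
import numpy as np

# Transient states track the progress toward Lambda = SS: ("", "S").
p, q = 0.3, 0.7
N = np.array([[q, p],     # from "":  F keeps "", S moves to "S"
              [q, 0.0]])  # from "S": F resets to "", S completes the pattern
c = np.array([0.0, p])    # absorption (completion) probabilities in (2)
xi0 = np.array([1.0, 0.0])
one = np.ones(2)

# (3): P{W(Lambda) = n} = xi0 N^{n-1} (I - N) 1'
def p_waiting(n):
    return xi0 @ np.linalg.matrix_power(N, n - 1) @ (np.eye(2) - N) @ one

# (4)-(5): P{X_n(Lambda) < x} = xi0_x N_x^n 1' with block bidiagonal N_x;
# here C = c e_1' because counting restarts from scratch after an occurrence.
def p_count_less(n, x):
    d = len(c)
    C = np.outer(c, np.eye(d)[0])
    Nx = np.kron(np.eye(x), N) + np.kron(np.eye(x, x, 1), C)
    xi0x = np.zeros(x * d); xi0x[:d] = xi0
    return xi0x @ np.linalg.matrix_power(Nx, n) @ np.ones(x * d)

print(p_waiting(2))         # p^2 = 0.09
print(p_count_less(20, 2))  # P{X_20(SS) < 2}
```

For $x = 1$, $\mathbf{N}_x$ reduces to $\mathbf{N}$ and `p_count_less(n, 1)` is just $P\{W(\Lambda) > n\}$.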

If the pattern $\Lambda$ is long and complex and $n$ is very large, then the computation of $P\{X_n(\Lambda) = x\}$ can become problematic and, to overcome this problem, various asymptotic approximations have been developed for these probabilities.

In real applications, if the exact distribution is not available or is hard to compute, it is important to know which approximations perform well and are easy to compute. Furthermore, it is important to know how these approximations perform with respect to each other and to the exact distribution, from both a theoretical and a numerical standpoint. The aims of this manuscript are two-fold: (i) we first study the asymptotic relative error of the normal, Poisson (or compound Poisson), and FMCI approximations with respect to the exact distribution; and (ii) we then provide a numerical study of these three approximations against the exact probabilities in cases where $x$ is fixed and $n \to \infty$ and where $n$ is fixed and $x$ varies. As an important byproduct, the FMCI technique allows the normal and Poisson approximations to be applied in more cases, for example, to the distribution of compound patterns and of patterns in Markov dependent trials.

The approximations

Normal approximation

The normal approximation is one of the most popular in Statistics for approximating the distribution of the number of runs or patterns $X_n(\Lambda)$. In general, when $\Lambda$ is simple or compound, the trials are i.i.d., and the counting is non-overlapping, it has been shown, by appealing to (1) and renewal arguments, that $X_n(\Lambda)$ is asymptotically normally distributed (cf. Fu and Lou 2007; Karlin and Taylor 1975). The form of the approximation is

$$\lim_{n\to\infty} P\left\{\frac{X_n(\Lambda) - n/\mu_W}{\sqrt{n\sigma_W^2/\mu_W^3}} \le u\right\} = \Phi(u), \qquad (6)$$

where $\Phi(\cdot)$ denotes the standard normal distribution function and $\mu_W$ and $\sigma_W^2$ are the mean and variance of $W(\Lambda)$, respectively, which are given by

$$\mu_W = \boldsymbol{\xi}_0(\mathbf{I} - \mathbf{N})^{-1}\mathbf{1}', \qquad (7)$$

and

$$\sigma_W^2 = \boldsymbol{\xi}_0(\mathbf{I} + \mathbf{N})(\mathbf{I} - \mathbf{N})^{-2}\mathbf{1}' - \mu_W^2. \qquad (8)$$

Given a pattern $\Lambda$, it is well known that the mean $\mu_W$ and the variance $\sigma_W^2$ are difficult to obtain via combinatorial arguments, especially when $\Lambda$ is a compound pattern or the trials are Markov dependent. For example, as pointed out in Karlin (2005) and Kleffe and Borodovski (1992), approximate values of $\mu_W$ and $\sigma_W^2$ must sometimes be used. Since $W(\Lambda)$ is finite Markov chain imbeddable, (7) and (8) provide the exact values.

The limit in (6) is appropriate when the inter-arrival times $\{W_i(\Lambda)\}$ are i.i.d., which is the case for simple and compound patterns when the $\{X_i\}$ are i.i.d. and counting is non-overlapping. When occurrences of $\Lambda$ correspond to a delayed renewal process, which can occur for Markov dependent trials and/or overlapping counting, we could use the mean and variance of $W_2(\Lambda)$ for the normalizing constants, which are easily obtained by modifying $\boldsymbol{\xi}_0$ in (7) and (8). Even more general cases can be handled by making use of a functional central limit theorem for Markov chains (see, for example, Meyn and Tweedie (1993, §17.4) and Asmussen (2003, Theorem 7.2, pg. 30) for the details).
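Continuing the toy $\Lambda = SS$ example, (7) and (8) reduce to two matrix inversions, after which the normal approximation (6) is immediate. The sketch below uses our own names and SciPy's normal distribution function.

```python
import numpy as np
from scipy.stats import norm

# Same toy setup as above: Lambda = SS in i.i.d. Bernoulli(p) trials.
p, q = 0.3, 0.7
N = np.array([[q, p], [q, 0.0]])
xi0, one = np.array([1.0, 0.0]), np.ones(2)

M = np.linalg.inv(np.eye(2) - N)                          # (I - N)^{-1}
mu_W = xi0 @ M @ one                                      # (7): 1/p + 1/p^2
var_W = xi0 @ (np.eye(2) + N) @ M @ M @ one - mu_W**2     # (8)

# (6): P{X_n(Lambda) <= x} ~ Phi((x - n/mu_W) / sqrt(n sigma_W^2 / mu_W^3))
def normal_approx_cdf(n, x):
    return norm.cdf((x - n / mu_W) / np.sqrt(n * var_W / mu_W**3))
```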

Poisson and compound Poisson approximations

It is well known that, in a sequence of Bernoulli($p$) trials, if $np \to \lambda$ as $n \to \infty$, then the probability of $k$ successes in $n$ trials can be approximated by a Poisson probability with parameter $\lambda$, denoted $\mathcal{P}(\lambda)$. This idea has been extended to certain patterns $\Lambda$ and, under certain conditions, the distribution of $X_n(\Lambda)$ can be approximated by a Poisson distribution with parameter $\mu_n$ in the sense that

$$d_{TV}(\mathcal{L}(X_n(\Lambda)), \mathcal{P}(\mu_n)) < \varepsilon_n, \qquad (9)$$

where $\mathcal{L}(\cdot)$ denotes the distribution (law) of a random variable and $d_{TV}(\cdot,\cdot)$ denotes the total variation distance.

The primary tool used to obtain $\mu_n$ and the bound $\varepsilon_n$ is the Stein-Chen method (Chen 1975), which has been refined by various authors: Arratia et al. (1990), Barbour and Eagleson (1983; 1984; 1987), Barbour and Hall (1984), Godbole (1990a; 1990b; 1991), Godbole and Schaffner (1993), and Holst et al. (1988). The method has also been extended to compound Poisson approximations for the distributions of runs and patterns, and Barbour and Chryssaphinou (2001) provide an excellent theoretical review of these approximations.

In practice, $\mu_n = E X_n(\Lambda)$ or the expectation of a closely related run statistic is used (cf. Balakrishnan and Koutras 2002, §5.2.3) so that, in the former case,

$$P\{X_n(\Lambda) = x\} \approx \frac{(E X_n(\Lambda))^x}{x!}\exp\{-E X_n(\Lambda)\}. \qquad (10)$$

Finding $E X_n(\Lambda)$ and the bound $\varepsilon_n$ is usually done on a case-by-case basis. For the mathematical details, the books Barbour et al. (1992a) and Balakrishnan and Koutras (2002) are recommended.
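In code, this recipe is a one-liner once $\mu_W$ is known, since $E X_n(\Lambda) \approx n/\mu_W$ by the renewal theorem. Here $\mu_W$ is hard-coded for the toy $\Lambda = SS$ example with $p = 0.3$, where $\mu_W = 1/p + 1/p^2$ by elementary arguments; the names are ours.

```python
from scipy.stats import poisson

# Poisson approximation (10) with mu_n = E X_n(Lambda) ~ n / mu_W.
mu_W = 1 / 0.3 + 1 / 0.3**2     # = 14.44... for Lambda = SS, p = 0.3

def poisson_approx_pmf(n, x):
    return poisson.pmf(x, n / mu_W)
```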

Let $\mathcal{P}_c(\lambda, \nu)$ denote the compound Poisson distribution, that is, the distribution of the random variable $\sum_{j=1}^{M} Y_j$, where the random variable $M$ has a Poisson distribution with parameter $\lambda$ and the $Y_j$ are i.i.d. with distribution $\nu$. A compound Poisson distribution for approximating nonnegative random variables was suggested in Barbour et al. (1992b) (see also Barbour et al. (1995; 1996)). The approximation is formulated similarly to the Poisson approximation:

$$d_{TV}(\mathcal{L}(X_n(\Lambda)), \mathcal{P}_c(\lambda, \nu)) < \varepsilon_n. \qquad (11)$$

The distribution of $N_{n,k}$, the number of non-overlapping occurrences of $k$ consecutive successes in $n$ i.i.d. Bernoulli trials, is one of the most important in this area and one of the most studied in the literature. Reversing the roles of $S$ (success) and $F$ (failure), the reliability of a consecutive-$k$-out-of-$n$:F system, denoted $C(k, n : F)$, is given by $P\{N_{n,k} = 0\}$. Even in this simple case (i.e. $\Lambda = SS\cdots S$), there are several ways to apply the Poisson approximation techniques. For example, Godbole (1991, Theorem 2) shows that approximating $N_{n,k}$ with a $\mathcal{P}(E N_{n,k})$ distribution works well if certain conditions hold. Godbole and Schaffner (1993, pg. 340) suggest an improved Poisson approximation for word patterns.

The primary difficulty in applying the Poisson approximation is the determination of the optimal parameter $\mu_n$, which is highly dependent on the structure of the pattern $\Lambda$. In particular, if $\Lambda$ is long and has several uneven overlapping sub-patterns, then finding $\mu_n$ by their method can be very tedious. In the sequel, we show that even the (asymptotically) best choice of $\mu_n$ for the Poisson approximation does not perform well in the relative sense.

FMCI approximations

Approximations based on the FMCI approach depend on the spectral decomposition of the essential transition probability matrix $\mathbf{N}$.

Let $\mathbf{N}$ be a $w \times w$ essential transition probability matrix associated with a finite Markov chain $\{Y_n : n \ge 0\}$ corresponding to the distribution of the waiting time $W(\Lambda)$. Let $1 > \lambda_1 \ge |\lambda_2| \ge \cdots \ge |\lambda_w|$ denote the ordered eigenvalues of $\mathbf{N}$, repeated according to their algebraic multiplicities, with associated (right) eigenvectors $\boldsymbol{\eta}_1, \boldsymbol{\eta}_2, \ldots, \boldsymbol{\eta}_w$. When the geometric multiplicity of $\lambda_i$ is less than its algebraic multiplicity, we use vectors of 0's for the unspecified eigenvectors. The fact that $\lambda_1$ can be taken to be a positive real number and that $\boldsymbol{\eta}_1$ can be taken to be non-negative are consequences of the Perron-Frobenius theorem for non-negative matrices (cf. Seneta 1981).

Definition 1. We say that $\{Y_n : n \ge 0\}$, or equivalently $\mathbf{N}$, satisfies the FMCI approximation conditions if

(i) there exist constants $a_1, \ldots, a_w$ such that

$$\mathbf{1}' = \sum_{i=1}^{w} a_i \boldsymbol{\eta}_i, \qquad (12)$$

(ii) $\lambda_1$ has algebraic multiplicity $g$ and $\lambda_1 > |\lambda_j|$ for all $j > g$.

Verifying these conditions is usually straightforward. They certainly hold if $\mathbf{N}$ is irreducible and aperiodic, but they also hold in many other cases. For example, (12) requires only that $\mathbf{1}'$ lie in the linear space spanned by $\{\boldsymbol{\eta}_1, \boldsymbol{\eta}_2, \ldots, \boldsymbol{\eta}_w\}$, which can hold even when $\mathbf{N}$ is defective (not diagonalizable). Condition (ii) requires that the communication classes corresponding to $\lambda_1$ be aperiodic. That is, if $\Psi$ is a communication class and $\mathbf{N}[\Psi]$ denotes the substochastic matrix $\mathbf{N}$ restricted to the states in $\Psi$, with largest eigenvalue $\lambda_1[\Psi]$, then every $\Psi$ such that $\lambda_1[\Psi] = \lambda_1$ should be aperiodic. We also mention that the algebraic multiplicity of $\lambda_1$ is the number of communication classes $\Psi$ such that $\lambda_1[\Psi] = \lambda_1$.

Fu and Johnson (2009) give the following theorem.

Theorem 1. Let $\{X_i\}$ be a sequence of i.i.d. trials taking values in $\mathcal{S}$, let $\Lambda$ be a simple pattern of length $\ell$ with $d \times d$ essential transition probability matrix $\mathbf{N}$ and let $X_n(\Lambda)$ be the number of non-overlapping occurrences of $\Lambda$ in $\{X_i\}$. If $\mathbf{N}$ satisfies the FMCI approximation conditions then, for any fixed $x \ge 0$,

$$P\{X_n(\Lambda) = x\} \approx a^{x+1}\binom{n - x(\ell-1)}{x}(1 - \lambda_1)^x \lambda_1^{\,n - x\ell}, \qquad (13)$$

where $a = \sum_{j=1}^{g} a_j(\boldsymbol{\xi}_0\boldsymbol{\eta}_j)$. If $g = 1$, as is usually the case, then $a = a_1(\boldsymbol{\xi}_0\boldsymbol{\eta}_1)$.

Given a pattern $\Lambda$, the approximation in (13) requires finding the Markov chain imbedding associated with the waiting time $W(\Lambda)$ and the essential transition probability matrix $\mathbf{N}$, together with its eigenvalues and associated eigenvectors. Usually, these steps are rather simple and can be easily automated together with (13). Even for very large $n$ and $\ell$, say $n = 1{,}000{,}000$ and $\ell = 50$, the CPU time is negligible. Fu and Johnson (2009) also provide details on extending these results to compound patterns, overlapping counting and Markov dependent trials.
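Under the stated conditions (and assuming $g = 1$), the quantities in (13) can be computed directly from an eigen-decomposition of $\mathbf{N}$. The sketch below does this for the toy pattern $\Lambda = SS$ ($\ell = 2$) with $p = 0.3$; the pattern choice and all names are ours.

```python
import numpy as np
from math import comb

p, q = 0.3, 0.7
ell = 2
N = np.array([[q, p], [q, 0.0]])
xi0, one = np.array([1.0, 0.0]), np.ones(2)

# Order the eigenvalues/eigenvectors by modulus; g = 1 here since the
# two eigenvalues of N are real and distinct.
lam, V = np.linalg.eig(N)
order = np.argsort(-np.abs(lam))
lam, V = lam[order].real, V[:, order].real
lam1, eta1 = lam[0], V[:, 0]

coef = np.linalg.solve(V, one)   # the a_i in (12): 1' = sum_i a_i eta_i
a = coef[0] * (xi0 @ eta1)       # a = a_1 (xi0 eta_1) when g = 1

# (13): P{X_n(Lambda) = x} ~ a^{x+1} C(n-x(l-1), x) (1-lam1)^x lam1^{n-xl}
def fmci_approx_pmf(n, x):
    return (a ** (x + 1) * comb(n - x * (ell - 1), x)
            * (1 - lam1) ** x * lam1 ** (n - x * ell))
```

Note that the product $a = a_1(\boldsymbol{\xi}_0\boldsymbol{\eta}_1)$ is invariant to the arbitrary sign/scaling of the eigenvector returned by the solver, since the scaling cancels between `coef[0]` and `xi0 @ eta1`.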

For the purpose of comparing these approximations, we prefer to write (13) as

$$P\{X_n(\Lambda) = x\} \approx a^{x+1}\left(\frac{1 - \lambda_1}{\lambda_1^{\ell}}\right)^{x}\binom{n - x(\ell-1)}{x}\exp\{n\ln\lambda_1\}. \qquad (14)$$

Note that the approximation has three parts: a constant part; a polynomial in $n$ of degree $x$; and a third (dominant) part which converges to 0 exponentially fast as $n \to \infty$.

More precisely, the FMCI approximation in (13) may be written as

$$P\{X_n(\Lambda) = x\} = a^{x+1}\left(\frac{1 - \lambda_1}{\lambda_1^{\ell}}\right)^{x}\binom{n - x(\ell-1)}{x}\exp\{n\ln\lambda_1\}\left(1 + o\left(\left|\frac{\lambda_{g+1}}{\lambda_1}\right|^{n/(x+1) - \ell}\right)\right).$$

Since $|\lambda_{g+1}| < \lambda_1$, the term $|\lambda_{g+1}/\lambda_1|^{n/(x+1)-\ell}$ tends to 0 exponentially fast as $n \to \infty$ and hence is negligible if $n/(x+1) - \ell$ is moderate or large (say $\ge 50$).

Large deviation approximation

Fu et al. (2012) provide the following large deviation approximation for right-tail probabilities of the number of non-overlapping occurrences of a simple pattern $\Lambda$. The reasons for providing only the right-tail large deviation approximation are that (i) all of the above-mentioned approximations fail to approximate the extreme right-tail probabilities and (ii) the FMCI approximation provides an accurate approximation for left-tail probabilities.

Theorem 2. Let $\varepsilon = \mu_W - x\mu_W^2/(1 + x\mu_W)$ and let

$$\varphi_{W(\Lambda)}(t) = 1 + (e^t - 1)\,\boldsymbol{\xi}_0(\mathbf{I} - e^t\mathbf{N})^{-1}\mathbf{1}'$$

be the moment generating function of $W(\Lambda)$. Then

$$P\{X_n(\Lambda) \ge E X_n(\Lambda) + nx\} = e^{-n\beta(x,\Lambda)}\,\frac{1}{\sqrt{n}}\left(b_0 + \frac{b_1}{n} + \cdots + \frac{b_m}{n^m} + O(n^{-m-1})\right),$$

where

$$\beta(x,\Lambda) = \left(\frac{1}{\mu_W} + x\right)h(\varepsilon,\tau) = \left(\frac{1}{\mu_W} + x\right)\left(\frac{\tau\mu_W}{1 + x\mu_W} - \ln\varphi_{W(\Lambda)}(\tau)\right),$$

$h(\varepsilon, t) = \varepsilon t - \ln\varphi_{W(\Lambda)}(t)$, $\tau$ is the solution to $h'(\varepsilon,\tau) = 0$ (derivatives taken with respect to $t$), and

$$b_0 = \frac{1}{\sigma\tau\sqrt{2\pi(\mu_W^{-1} + x)}}, \quad b_1 = \frac{1}{\sigma\tau\sqrt{2\pi(\mu_W^{-1} + x)^3}}\left(\frac{1}{\sigma^2\tau^2} + \frac{h^{(3)}(\varepsilon,\tau)}{2\tau\sigma^4} - \frac{h^{(4)}(\varepsilon,\tau)}{8\sigma^4} - \frac{5\,(h^{(3)}(\varepsilon,\tau))^2}{24\sigma^6}\right), \quad \sigma = \sqrt{h''(\varepsilon,\tau)}.$$
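As a rough illustration of how Theorem 2 can be evaluated numerically, the sketch below computes only the leading exponential order $e^{-n\beta(x,\Lambda)}$ for the toy pattern $\Lambda = SS$ with $p = 0.3$, treating $\beta$ as a scaled Legendre transform of $\ln\varphi_{W(\Lambda)}$. This is our reading of the statement above (the polynomial prefactor $b_0, b_1, \ldots$ is omitted, and the bracketing interval for $\tau$ is chosen for this example), so treat it as an assumption-laden sketch rather than the authors' algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

p, q = 0.3, 0.7
N = np.array([[q, p], [q, 0.0]])
xi0, one = np.array([1.0, 0.0]), np.ones(2)
mu_W = xi0 @ np.linalg.inv(np.eye(2) - N) @ one

def log_mgf(t):
    # ln phi_{W(Lambda)}(t), using the matrix form of the mgf
    phi = 1 + (np.exp(t) - 1) * (xi0 @ np.linalg.inv(np.eye(2) - np.exp(t) * N) @ one)
    return np.log(phi)

def rate(x):
    # beta(x, Lambda) = (1/mu_W + x) * sup_t [eps*t - ln phi(t)], with
    # eps = mu_W/(1 + x*mu_W), which equals mu_W - x*mu_W^2/(1 + x*mu_W).
    eps = mu_W / (1 + x * mu_W)
    h = lambda t: eps * t - log_mgf(t)
    res = minimize_scalar(lambda t: -h(t), bounds=(-5, -1e-9), method="bounded")
    return (1 / mu_W + x) * h(res.x)

# P{X_n >= E X_n + n*x} decays like exp(-n * rate(x)), up to polynomial terms.
print(rate(0.05))
```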

Comparisons and relative error

For a given $n$, $x$ and pattern $\Lambda$, we define the relative error of an approximation with respect to the exact probability $P\{X_n(\Lambda) = x\}$ as

$$R(x : E, A) = \operatorname{sgn}(A - E)\left(\max\left\{\frac{E}{A}, \frac{A}{E}\right\} - 1\right),$$

where $A$ stands for the approximate probability and $E$ stands for the exact probability $P\{X_n(\Lambda) = x\}$. This quantity, $R(x : E, A)$, ranges from $-\infty$ to $\infty$ and treats the importance of overestimation the same as underestimation. It is clear that $R(x : E, A) > 0$ implies that the approximation overestimates the exact probability and that $R(x : E, A) < 0$ implies that the approximation underestimates the exact probability. Since, for fixed $x$, the probability $P\{X_n(\Lambda) = x\}$ converges to 0 exponentially fast as $n \to \infty$, it follows that $R(x : E, A) \to \pm\infty$ implies that the approximation tends to 0 at the wrong rate. If $R(x : E, A)$ is near 0 then the approximation is close to the exact probability $P\{X_n(\Lambda) = x\}$.
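For completeness, $R(x : E, A)$ transcribes directly into code (a direct transcription of the display above; `E` and `A` are the exact and approximate probabilities, assumed positive):

```python
import numpy as np

def rel_error(E, A):
    # R(x : E, A) = sgn(A - E) * (max(E/A, A/E) - 1)
    return np.sign(A - E) * (max(E / A, A / E) - 1)
```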

Note that $R(x : E, A)$ is a function of $x$, $n$ and the method of approximation used. The following theorem provides the asymptotic relative error for the normal approximation ($N$), the Poisson approximation ($\mathcal{P}(\mu_n)$) and the finite Markov chain imbedding approximation ($F$).

Theorem 3. Let $\{X_i\}$ be a sequence of i.i.d. multi-state trials taking values in $\mathcal{S}$ and let $\Lambda$ be a simple pattern defined on $\mathcal{S}$. Then, for every fixed $x$, we have

$$\text{(i)} \quad \lim_{n\to\infty} R(x : E, F) = 0;$$

$$\text{(ii)} \quad \lim_{n\to\infty} R(x : E, \mathcal{P}(\mu_n)) = \begin{cases} \infty, & \text{if } \limsup_{n} \mu_n/n < -\ln\lambda_1; \\ c(x), & \text{if } \lim_{n} \mu_n/n = -\ln\lambda_1; \\ -\infty, & \text{if } \liminf_{n} \mu_n/n > -\ln\lambda_1; \end{cases}$$

$$\text{(iii)} \quad \lim_{n\to\infty} R(x : E, N) = \begin{cases} \infty, & \text{if } \mu_W/2\sigma_W^2 \le -\ln\lambda_1; \\ -\infty, & \text{if } \mu_W/2\sigma_W^2 > -\ln\lambda_1; \end{cases}$$

where the exact probability is computed using (4) and

$$c(x) = a^{x+1}\left(\frac{1 - \lambda_1}{-\lambda_1^{\ell}\ln\lambda_1}\right)^{x} - 1.$$

Proof. Given a pattern $\Lambda$ and $x$, for the finite Markov chain imbedding approximation we have

$$\lim_{n\to\infty} \frac{P\{X_n(\Lambda) = x\}}{a^{x+1}\left(\frac{1-\lambda_1}{\lambda_1^{\ell}}\right)^{x}\binom{n-x(\ell-1)}{x}\exp\{n\ln\lambda_1\}} = 1,$$

and hence (i) follows immediately from the definition of R(x:E,A) and Theorem 1.

For the Poisson approximation we have, since $E/F \to 1$ by (i),

$$\frac{E}{\mathcal{P}(\mu_n)} = \frac{E}{F} \times \frac{F}{\mathcal{P}(\mu_n)} \sim \frac{F}{\mathcal{P}(\mu_n)}$$

and hence

$$\frac{E}{\mathcal{P}(\mu_n)} = \frac{P\{X_n(\Lambda) = x\}}{\frac{\mu_n^x}{x!}\exp\{-\mu_n\}} \sim \frac{a^{x+1}\left(\frac{1-\lambda_1}{\lambda_1^{\ell}}\right)^{x}\binom{n-x(\ell-1)}{x}\exp\{n\ln\lambda_1\}}{\frac{\mu_n^x}{x!}\exp\{-\mu_n\}}.$$

If $\liminf_n \mu_n/n > -\ln\lambda_1$ then $\exp\{n\ln\lambda_1 + \mu_n\}$ tends to $\infty$ exponentially fast, overriding the polynomial term, so that $E/\mathcal{P}(\mu_n) \to \infty$ and hence $R(x : E, \mathcal{P}(\mu_n)) \to -\infty$ as $n \to \infty$ for all fixed $x$. Similarly, if $\limsup_n \mu_n/n < -\ln\lambda_1$, then $R(x : E, \mathcal{P}(\mu_n)) \to \infty$ as $n \to \infty$ for all fixed $x$. Furthermore, if $\lim_n \mu_n/n = -\ln\lambda_1$, then the ratio yields

$$\lim_{n\to\infty} R(x : E, \mathcal{P}(-n\ln\lambda_1)) = a^{x+1}\left(\frac{1 - \lambda_1}{-\lambda_1^{\ell}\ln\lambda_1}\right)^{x} - 1,$$

and this completes the proof of (ii). Note also that, if $\limsup_n \mu_n/n > -\ln\lambda_1$ and $\liminf_n \mu_n/n < -\ln\lambda_1$, then $\lim_n R(x : E, \mathcal{P}(\mu_n))$ will not exist.

For the normal approximation we have that $X_n(\Lambda)$ is approximately normal with mean $n/\mu_W$ and variance $n\sigma_W^2/\mu_W^3$ and hence

$$P\{X_n(\Lambda) = x\} \approx N = \int_{x-1/2}^{x+1/2}\frac{1}{\sqrt{2\pi n\sigma_W^2/\mu_W^3}}\exp\left(-\frac{(t - n/\mu_W)^2}{2n\sigma_W^2/\mu_W^3}\right)dt.$$

Hence, provided $n > \mu_W(x + 1/2)$, we have

$$N \le \frac{1}{\sqrt{2\pi n\sigma_W^2/\mu_W^3}}\exp\left(-\frac{(x + 1/2 - n/\mu_W)^2}{2n\sigma_W^2/\mu_W^3}\right).$$

Therefore, as in the proof of (ii), we are interested in the asymptotics of $F/N$, which yields

$$\frac{F}{N} \sim \sqrt{2\pi n\sigma_W^2/\mu_W^3}\; a^{x+1}\left(\frac{1-\lambda_1}{\lambda_1^{\ell}}\right)^{x}\binom{n-x(\ell-1)}{x} \times \exp\left\{n\ln\lambda_1 + \frac{(x+1/2 - n/\mu_W)^2}{2n\sigma_W^2/\mu_W^3}\right\}.$$

We may rewrite the argument of the exponential function as

$$n\left[\ln\lambda_1 + \frac{\mu_W}{2\sigma_W^2}\left(\frac{\mu_W(x+1/2)}{n} - 1\right)^2\right],$$

making it clear that the ratio $F/N$ converges to $\infty$ if $\mu_W/2\sigma_W^2 > -\ln\lambda_1$ and to 0 otherwise. Therefore, $R(x : E, N) \to \infty$ if $\mu_W/2\sigma_W^2 \le -\ln\lambda_1$ and $R(x : E, N) \to -\infty$ if $\mu_W/2\sigma_W^2 > -\ln\lambda_1$, and the proof of (iii) is complete.

Theorem 3 (ii) implies that, asymptotically (for fixed $x$ as $n \to \infty$), the Poisson approximation performs poorly (in the relative sense) regardless of the value of $\mu_n$ used. When $\Lambda$ is simple and has no overlapping sub-patterns, taking $\mu_n = E X_n(\Lambda)$ is normally recommended for the Poisson approximation (cf. Arratia et al. 1990). In this case, non-overlapping and overlapping counting are equivalent. The following corollary shows that, for fixed $x$, the Poisson approximation will (asymptotically) always overestimate the exact probability in the following sense.

Corollary 1. Let $\Lambda$ be a simple pattern defined on an i.i.d. sequence of multi-state trials. For $\mu_n = E X_n(\Lambda)$, we have

$$\lim_{n\to\infty} R(x : E, \mathcal{P}(\mu_n)) = \infty$$

for all fixed x.

Proof. Recall that, in this case, $X_n(\Lambda)$ is a renewal process with i.i.d. inter-renewal times with mean $\mu_W = E W(\Lambda)$ and hence, by the elementary renewal theorem, we have $E X_n(\Lambda)/n \to 1/\mu_W$, so that $E X_n(\Lambda) \sim n/\mu_W$. Therefore, by Theorem 3 (ii), it is sufficient to show that $n/\mu_W < -n\ln\lambda_1$ for all sufficiently large $n$, or

$$e^{-1/\mu_W} > \lambda_1.$$

Now, since $0 < \lambda_1 < 1$ is a dominant eigenvalue of $\mathbf{N}$, it follows that: $0 < (1 - \lambda_1)^{-1}$ is a dominant eigenvalue of the matrix $(\mathbf{I} - \mathbf{N})^{-1} = \mathbf{A} = (a_{ij})$; $a_{ij} \ge 0$ with at least one $a_{ij} > 0$; and $\mathbf{A}\mathbf{1}' = (\mathbf{I} - \mathbf{N})^{-1}\mathbf{1}' \le \mu_W\mathbf{1}'$. Hence, by a simple corollary to the Perron-Frobenius theorem for nonnegative matrices (cf. Karlin and Taylor 1975, Corollary 2.2, pg. 551), we have

$$\frac{1}{1 - \lambda_1} = \limsup_{n\to\infty}\max_{i,j}\left|a_{ij}^{(n)}\right|^{1/n} \le \mu_W,$$

where $a_{ij}^{(n)} = (\mathbf{A}^n)_{ij}$. Therefore, provided $\mu_W < \infty$,

$$e^{-1/\mu_W} > 1 - \frac{1}{\mu_W} \ge \lambda_1,$$

which completes the proof.

Corollary 1 implies that, if $\mu_n \sim E X_n(\Lambda)$, then the Poisson approximation will always overestimate the exact probability as $n \to \infty$. Together with Theorem 3 (ii), this implies that using $\mu_n \sim -n\ln\lambda_1$ results in the best Poisson approximation as $n \to \infty$.

We also comment that, for the normal approximation, both $\mu_W/2\sigma_W^2 < -\ln\lambda_1$ and $\mu_W/2\sigma_W^2 > -\ln\lambda_1$ are possible. As a simple example, suppose we have a sequence of i.i.d. Bernoulli($p$) trials and $\Lambda = SSS$. If $p = 1/2$, we obtain

$$\mu_W = 14, \quad \sigma_W^2 = 142 \quad \text{and} \quad \lambda_1 = 0.9196434,$$

and

$$\frac{\mu_W}{2\sigma_W^2} = 0.04929577 < -\ln\lambda_1 = 0.08376932.$$

However, with $p = 0.9$, we obtain

$$\mu_W = 3.717421, \quad \sigma_W^2 = 2.145694 \quad \text{and} \quad \lambda_1 = 0.5419067,$$

and

$$\frac{\mu_W}{2\sigma_W^2} = 0.8662513 > -\ln\lambda_1 = 0.6126614.$$

Thus, $R(x : E, N) \to \pm\infty$ are both possible, depending on $x$, the pattern, and the probability structure of the $\{X_i\}$.
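These constants are easy to reproduce numerically from (7), (8) and the eigenvalues of $\mathbf{N}$. The following sketch uses our own imbedding of $\Lambda = SSS$, with transient states recording 0, 1 or 2 trailing successes, and recovers both cases.

```python
import numpy as np

for p in (0.5, 0.9):
    q = 1 - p
    N = np.array([[q, p, 0.0],      # 0 trailing successes
                  [q, 0.0, p],      # 1 trailing success
                  [q, 0.0, 0.0]])   # 2 trailing successes (S completes SSS)
    xi0, one = np.array([1.0, 0.0, 0.0]), np.ones(3)
    M = np.linalg.inv(np.eye(3) - N)
    mu_W = xi0 @ M @ one                                    # (7)
    var_W = xi0 @ (np.eye(3) + N) @ M @ M @ one - mu_W**2   # (8)
    lam1 = max(np.linalg.eigvals(N).real)
    print(p, mu_W, var_W, mu_W / (2 * var_W), -np.log(lam1))

# p = 0.5: mu_W = 14, var_W = 142, 0.04929... < 0.08376...
# p = 0.9: mu_W = 3.7174..., var_W = 2.1456..., 0.86625... > 0.61266...
```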

Numerical comparisons

In the previous section we showed that, for fixed $x$ as $n \to \infty$, the approximation based on the finite Markov chain imbedding technique outperforms the Poisson and normal approximations. In practice, however, one is interested in the performance of these approximations not only when $x$ is fixed and $n \to \infty$, but also when $n$ is fixed (at some moderate value) and $x$ varies. The reason we consider only large or moderate $n$ in our numerical study is that, for small $n$, the FMCI technique easily gives the exact results. In this section we present some numerical experiments to illustrate the advantages (and disadvantages) of the methods discussed.

The approximations we compare are: the finite Markov chain imbedding approximation in (13) (FMCI); the Poisson approximation with $\mu_n = n/\mu_W \approx E X_n(\Lambda)$, where $\mu_W$ is calculated using (7) (Poisson); the normal approximation given in (6) (Normal); and the large deviation approximation given in Theorem 2 (LD), which is only for right-tail probabilities.

Reliability of C(k,n : F) systems

A consecutive-$k$-out-of-$n$:F system is a system of $n$ independent and linearly connected components, each with common (continuous) lifetime distribution $F$, which fails if $k$ consecutive components fail. At a given time $t > 0$, the probability that a component is working is $p = 1 - F(t)$ and the probability that a single component has failed is $q = 1 - p$; hence the probability that the system has failed equals the probability that $k$ (or more) consecutive components have failed, which is the probability of $k$ consecutive failures in a sequence of $n$ Bernoulli trials with success probability $p$. Barbour et al. (1995) present a table of various bounds for system reliability based on a Poisson approximation and a compound Poisson approximation and compare these to bounds found in Fu (1985). Table 1 shows the exact probabilities and relative errors for the FMCI and Poisson approximations as well as the compound Poisson approximation in Barbour et al. (1995) (CP).
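As a sanity check on one entry of Table 1 below, the sketch computes the reliability of a $C(2, 50 : F)$ system with $q = 0.25$ directly as $P\{N_{50,2} = 0\} = \boldsymbol{\xi}_0\mathbf{N}^{50}\mathbf{1}'$, where $\mathbf{N}$ is our own imbedding for a failure run of length $k = 2$.

```python
import numpy as np

q, p = 0.25, 0.75
N = np.array([[p, q],     # 0 or 1 consecutive failed components
              [p, 0.0]])  # a second failure would absorb (system failure)
xi0, one = np.array([1.0, 0.0]), np.ones(2)
rel = xi0 @ np.linalg.matrix_power(N, 50) @ one
print(rel)   # ~0.07173, matching the Exact column of Table 1
```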

Table 1. Approximation errors for C(k,n : F) systems

       n   k      q     Exact      FMCI   Poisson        CP
       5   2   0.01   0.99960   0.00000  -0.00010   0.00000
       5   2   0.10   0.96309   0.00000  -0.00788   0.00119
       5   2   0.25   0.79980  -0.00002  -0.02697   0.04654
      10   2   0.01   0.99911   0.00000  -0.00010   0.00000
      10   2   0.10   0.91975   0.00000  -0.00728   0.00312
      10   2   0.25   0.61180   0.00000  -0.00869   0.12266
      10   4   0.01   1.00000   0.00000   0.00000   0.00000
      10   4   0.10   0.99936   0.00000  -0.00026   0.00000
      10   4   0.25   0.97855   0.00000  -0.00776   0.00038
      50   2   0.01   0.99516   0.00000  -0.00010   0.00000
      50   2   0.10   0.63633   0.00000  -0.00251   0.01871
      50   2   0.25   0.07173   0.00000   0.14441   0.96838
      50   4   0.01   1.00000   0.00000   0.00000   0.00000
      50   4   0.10   0.99577   0.00000  -0.00026   0.00000
      50   4   0.25   0.86897   0.00000  -0.00663   0.00312
     100   2   0.01   0.99024   0.00000  -0.00010   0.00000
     100   2   0.10   0.40151   0.00000   0.00343   0.03854
     100   2   0.25   0.00492   0.00000   0.36933   2.97133
     100   4   0.01   1.00000   0.00000   0.00000   0.00000
     100   4   0.10   0.99129   0.00000  -0.00026   0.00001
     100   4   0.25   0.74908   0.00000  -0.00523   0.00656
     500   4   0.20   0.52721   0.00000  -0.00086   0.00611
   1,000   4   0.20   0.27696   0.00000   0.00183   0.01232
  10,000   5   0.20   0.07710   0.00000   0.00183   0.00560

The FMCI approximation performs very well for the parameters tested here. As expected, the Poisson and compound Poisson approximations perform well when $nq^k$ is relatively small. When the reliability of the system is relatively low, the Poisson and compound Poisson approximations begin to degrade.

Approximating the distribution of $N_{n,k}$

Recall that $N_{n,k}$ is the number of non-overlapping occurrences of $k$ consecutive successes in $\{X_i\}$ (i.e. $N_{n,k} = X_n(\Lambda)$ with $\Lambda = SS\cdots S$ of length $k$). By reversing the roles of success and failure, the reliability of $C(k, n : F)$ systems can be related to the distribution of $N_{n,k}$. In this section we present some examples of approximating $P\{N_{n,k} = x\}$ with the FMCI, Normal, Poisson and LD approximations.

Figure 1 shows the relative error $R(x : E, A)$ of these approximations for (a) $N_{2000,4}$; (b) $N_{5000,4}$; and (c) $N_{250000,6}$ when the probability of success is $p = 0.3$. In all of the figures, the top axis is on a standard $z$-scale making use of the asymptotic mean and variance of $X_n(\Lambda)$, namely,

$$z = \frac{x - n/\mu_W}{\sqrt{n\sigma_W^2/\mu_W^3}}.$$

Figure 1. Relative errors of the FMCI, Normal, Poisson and LD approximations for $N_{2000,4}$, $N_{5000,4}$ and $N_{250000,6}$ with $p = 0.3$.

We notice that the finite Markov chain imbedding approximation (FMCI) performs very well in the left tail of the distribution in all cases. Its performance degrades as $x$ gets large, but it remains more consistent than both the Poisson and Normal approximations in this case. The large deviation approximation performs well in the right tail in all cases. In (c), the FMCI approximation performs very well throughout most of the support. The Poisson approximation also performs well over most of the $x$ considered. The normal approximation performs well in the neighbourhood of $E X_n(\Lambda)$ but not in the tails.

As the probability of success $p$ increases, the FMCI approximation still performs very well in the left tail, but its performance tends to degrade more quickly as $x$ increases. The Poisson approximation also degrades quickly as $p$ increases, since $E N_{n,k}$ increases. For larger $p$, the Normal approximation tends to work better near the mean. In the far left tail, the FMCI approximation is preferred and, in the far right tail, the LD approximation is preferred.

Biological sequences

Sequences of DNA nucleotides are of great interest (as are sequences of amino acids and other biological sequences). Figure 2 shows the relative errors for approximating $P\{X_n(\Lambda) = x\}$ with $\Lambda = ACG$ ($n = 1{,}000$ and $10{,}000$) and $\Lambda = CATTAG$ ($n = 500{,}000$). We see that the FMCI approximation again performs very well in the left tail, although, in (b), the performance degrades somewhat as $x$ gets large. The large deviation approximation performs very well in the right tail, especially when $x$ is greater than 3 standard deviations above the mean. While it is difficult to give a rule of thumb, the FMCI approximation seems to perform very well when $x = O(n^{1/2})$. The normal approximation works best within a few standard deviations of the mean and performs best in this region when $E X_n(\Lambda)$ is relatively large.

Figure 2. Relative errors of the FMCI, Normal, Poisson and LD approximations for the patterns $\Lambda = ACG$ with $n = 1{,}000$ and $10{,}000$ and $\Lambda = CATTAG$ with $n = 500{,}000$.

Discussion and conclusions

The finite Markov chain imbedding approximations (FMCI and LD) provide an alternative to the usual normal and Poisson approximations for the distributions of runs and patterns. While the FMCI approximation is simple, accurate and fast, it has one disadvantage relative to the normal and Poisson approximations: it requires the use of the FMCI technique, which is non-traditional and less known in the Statistics community, except in the area of system reliability (cf. Cui et al. 2010). On the other hand, the FMCI technique does not require the rather strong conditions necessary for the Poisson techniques, such as $np^k \to \lambda$. This condition is seldom satisfied in practical applications. For example, in DNA sequence analysis, the probabilities $p_A$, $p_C$, $p_G$ and $p_T$ do not tend to 0 as $n$ increases; they may not all be in the neighbourhood of $1/4$, but they are bounded away from 0.

For all of the numerical results in the previous section, the exact probabilities $P\{X_n(\Lambda) = x\}$ were obtained via the FMCI technique, and the CPU times ranged from a few seconds to under a minute, even in the case of $\Lambda = CATTAG$ and $n = 500{,}000$. Based on our experience, if the length of the pattern is less than 20 and $n$ is less than $1{,}000{,}000$, the exact probability should be computed.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BJ and JF contributed equally to the mathematical details. BJ performed the numerical comparisons and prepared the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada.

References

Arratia R, Goldstein L, Gordon L: Poisson approximation and the Chen-Stein method. Stat. Sci. 5(4), 403-434 (1990)
Asmussen S: Applied Probability and Queues. Springer, New York (2003)
Balakrishnan N, Koutras MV: Runs and Scans with Applications. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (2002)
Barbour AD, Eagleson GK: Poisson approximation for some statistics based on exchangeable trials. Adv. Appl. Probab. 15(3), 585-600 (1983)
Barbour AD, Eagleson GK: Poisson convergence for dissociated statistics. J. Roy. Statist. Soc. Ser. B 46(3), 397-402 (1984)
Barbour AD, Eagleson GK: An improved Poisson limit theorem for sums of dissociated random variables. J. Appl. Probab. 24(3), 586-599 (1987)
Barbour AD, Hall P: On the rate of Poisson convergence. Math. Proc. Cambridge Philos. Soc. 95(3), 473-480 (1984)
Barbour AD, Chryssaphinou O: Compound Poisson approximation: a user's guide. Ann. Appl. Probab. 11(3), 964-1002 (2001)
Barbour AD, Holst L, Janson S: Poisson Approximation. Oxford Studies in Probability. Oxford Science Publications, Oxford (1992a)
Barbour AD, Chen LHY, Loh W-L: Compound Poisson approximation for nonnegative random variables via Stein's method. Ann. Probab. 20(4), 1843-1866 (1992b)
Barbour AD, Chryssaphinou O, Roos M: Compound Poisson approximation in reliability theory. IEEE Trans. Reliab. 44(3), 398-402 (1995)
Barbour AD, Chryssaphinou O, Roos M: Compound Poisson approximation in systems reliability. Naval Res. Logist. 43(2), 251-264 (1996)
Blom G, Thorburn D: How many random digits are required until given sequences are obtained? J. Appl. Probab. 19(3), 518-531 (1982)
Chen LHY: Poisson approximation for dependent trials. Ann. Probab. 3(3), 534-545 (1975)
Cui L, Xu Y, Zhao X: Developments and applications of the finite Markov chain imbedding approach in reliability. IEEE Trans. Reliab. 59(4), 685-690 (2010)
Fu JC: Reliability of a large consecutive-k-out-of-n:F system. IEEE Trans. Reliab. R-34, 120-127 (1985)
Fu JC, Johnson BC: Approximate probabilities for runs and patterns in i.i.d. and Markov dependent multi-state trials. Adv. Appl. Probab. 41(1), 292-308 (2009)
Fu JC, Koutras MV: Distribution theory of runs: a Markov chain approach. J. Amer. Statist. Assoc. 89(427), 1050-1058 (1994)
Fu JC, Lou WYW: Distribution Theory of Runs and Patterns and Its Applications. World Scientific Publishing Co. Inc., River Edge (2003)
Fu JC, Lou WYW: On the normal approximation for the distribution of the number of simple or compound patterns in a random sequence of multi-state trials. Methodol. Comput. Appl. Probab. 9(2), 195-205 (2007)
Fu JC, Johnson BC, Chang Y-M: Approximating the extreme right-hand tail probability for the distribution of the number of patterns in a sequence of multi-state trials. J. Stat. Plan. Infer. 142(2), 473-480 (2012)
Gerber HU, Li S-YR: The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stochastic Process. Appl. 11(1), 101-108 (1981)
Godbole AP: Degenerate and Poisson convergence criteria for success runs. Statist. Probab. Lett. 10(3), 247-255 (1990a)
Godbole AP: Specific formulae for some success run distributions. Statist. Probab. Lett. 10(2), 119-124 (1990b)
Godbole AP: Poisson approximations for runs and patterns of rare events. Adv. Appl. Probab. 23(4), 851-865 (1991)
Godbole AP, Schaffner AA: Improved Poisson approximations for word patterns. Adv. Appl. Probab. 25(2), 334-347 (1993)
Holst L, Kennedy JE, Quine MP: Rates of Poisson convergence for some coverage and urn problems using coupling. J. Appl. Probab. 25(4), 717-724 (1988)
Karlin S: Statistical signals in bioinformatics. Proc. Natl. Acad. Sci. U.S.A. 102(38), 13355-13362 (2005)
Karlin S, Taylor HM: A First Course in Stochastic Processes. Academic Press, New York-London (1975)
Kleffe J, Borodovski M: First and second moment of counts of words in random text generated by Markov chains. Comput. Appl. Biosci. 8(4), 433-441 (1992)
Martin J, Regad L, Camproux A-C, Nuel G: Finite Markov chain embedding for the exact distribution of patterns in a set of random sequences. In: Skiadas CH (ed.) Advances in Data Analysis. Statistics for Industry and Technology. Birkhäuser, Boston (2010)
Meyn SP, Tweedie RL: Markov Chains and Stochastic Stability. Communications and Control Engineering Series. Springer, London (1993)
Nuel G, Regad L, Martin J, Camproux A-C: Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithms Mol. Biol. 5(1), 1-18 (2010)
Schwager SJ: Run probabilities in sequences of Markov-dependent trials. J. Amer. Statist. Assoc. 78(381), 168-180 (1983)
Seneta E: Non-negative Matrices and Markov Chains. Springer, New York (1981)
Solov'ev AD: A combinatorial identity and its application to the problem on the first occurrence of a rare event. Teor. Verojatnost. i Primenen. 11, 313-320 (1966)