Department of Statistics, University of Manitoba, Winnipeg, Canada

Abstract

The distribution theory of runs and patterns has been successfully used in a variety of applications including, for example, nonparametric hypothesis testing, reliability theory, quality control, DNA sequence analysis, general applied probability and computer science. The exact distributions of the number of runs and patterns are often very hard to obtain or computationally problematic, especially when the pattern is complex and

AMS Subject Classification

Primary 60E05; Secondary 60J10

Introduction and notation

Let _{
i
}} and _{
n
}(

and, by convention, the waiting time for the first occurrence is denoted

where

We say that two patterns _{1} and _{2} are distinct if neither _{1} appears in _{2} nor _{2} appears in _{1}. If _{1},…,_{
r
} are pairwise distinct simple patterns, we define the compound pattern _{
i
} is considered an occurrence of _{1}∪⋯∪_{
r
}, we similarly define

The waiting times _{
i
}(

From these definitions it is easy to see that, for any simple or compound pattern _{
n
}(

which provides a convenient way of studying the exact and approximate distribution of _{
n
}(

Throughout this paper, unless specified otherwise, we assume that the trials {_{
i
}} are either independent and identically distributed (i.i.d.) or first order Markov dependent; the pattern

The distribution of the number of runs and patterns in a sequence of multi-state trials or random permutations of a set of integers have been successfully used in various fields in applied probability, statistics and discrete mathematics. Examples include reliability theory, quality control, DNA sequence analysis, psychology, ecology, astronomy, nonparametric tests, successions, and the Eulerian and Simon-Newcomb numbers (the latter 3 being defined for permutations). Two recent books, Balakrishnan and Koutras (2002) and Fu and Lou (2003), provide some scope of the distribution theory of runs and patterns and Martin et al. (2010) and Nuel et al. (2010) provides some extensions to sets of sequences.

Given a pattern _{
n
}(_{
i
}} are Markov dependent multi-state trials.

The waiting time _{
n
}(_{
i
}} defined on a finite state space, say

where **
c
** is a column vector. The distribution of the waiting time for

where **
ξ
**

where the essential transition probability matrix **N**
_{
x
} has the form

the matrix **N** is given by (2), and the matrix **C** defines the “continuation” transition probabilities from one occurrence to the next and depends on **c** in (2).

If the pattern

In real applications, if the exact distribution is not available or is hard to compute, it is important to know which approximations perform well and are easy to compute. Furthermore, it is important to know how these approximations perform with respect to each other and the exact distribution from both a theoretical and numerical standpoint. The aims of this manuscript are two-fold: (i) we first study the asymptotic relative error of the normal, Poisson (or compound Poisson), and FMCI approximations with respect to the exact distribution; and (ii) we then provide a numerical study of these three approximations with the exact probabilities in cases where

The approximations

Normal approximation

The normal approximation is one of the most popular for approximating the distribution of the number of runs or patterns _{
n
}(_{
n
}(

where _{
W
} and

Given a pattern _{
W
} and the variance _{
W
} and

The limit in (6) is appropriate when the sequence of inter arrival times {_{
i
}(_{
i
}} are i.i.d. and counting is non-overlapping. When occurrences of _{2}(**
ξ
**

Poisson and compound poisson approximations

It is well known that, in a sequence of Bernoulli (_{
n
}(_{
n
} in the sense that

where _{TV}(·,·) denotes the total variation distance.

The primary tool used to obtain _{
n
} and the bound _{
n
} is the Stein-Chen method (Chen 1975), and this method has been refined by various authors Arratia et al. (1990), Barbour and Eagleson (1983), Barbour and Eagleson (1984), Barbour and Eagleson (1987), Barbour and Hall (1984), Godbole (1990a), Godbole (1990b), Godbole (1991), Godbole and Schaffner (1993), and Holst et al. (1988). This method has also been extended to compound Poisson approximations for the distributions of runs and patterns and Barbour and Chryssaphinou (2001) provides an excellent theoretical review of these approximations.

In practice,

Finding _{
n
} is usually done on a case by case basis. For the mathematical details, the books (Barbour et al. 1992a) and (Balakrishnan and Koutras 2002) are recommended.

Let _{
j
} are i.i.d. having distribution

The distribution of _{
n,k
}, the number of non-overlapping occurrences of _{
n,k
} with a

The primary difficulty in applying the Poisson approximation is the determination of the optimal parameter _{
n
}, which is higly dependent on the structure of the pattern _{
n
} by their method can be very tedious. In the sequel, we show that even the (asymptotic) best choice for _{
n
} for Poisson approximations does not perform well in the relative sense.

FMCI approximations

Approximations based on the FMCI approach depend on the spectral decomposition of the essential transition probability matrix **N**.

Let **N** be a _{
n
}:_{1}≥|_{2}|≥⋯≥|_{
w
}| denote the ordered eigenvalues of **N**, repeated according to their algebraic multiplicities, with associated (right) eigenvectors _{
i
} is less than its algebraic multiplicity, we will use vectors of 0’s for the unspecified eigenvectors. The fact that _{1} can be taken as a positive real number and that **
η
**

**Definition****1**. We will say that {_{
n
}:**N**, satisfies the

(i) there exists constants _{1},…,_{
w
} such that

(ii) _{1} has algebraic multiplicity _{1}>|_{
j
}| for all

Verifying these conditions is usually straightforward. They certainly hold if **N** is irreducible and aperiodic, but also hold in many other cases as well. For example, (12) requires only that **1**
^{′} is in the linear space spanned by **N** is defective (not diagonizable). Condition (ii) requires that the communication classes corresponding _{1} are aperiodic. That is, if **N**[**N** restricted to the states in _{1}[_{1}[_{1} should be aperiodic. We also mention that the algebraic multiplicity of _{1} is the number of communication classes _{1}[_{1}.

Fu and Johnson (2009) give the following theorem.

**Theorem ****1**. _{
i
}} **N **
_{
n
}(_{
i
}}. **N **

_{1}(**
ξ
**

Given a pattern **N** as well as its eigenvalues and associated eigenvectors. Usually, these steps are rather simple and can be easily automated together with (13). Even for very large

For the purpose of comparing these approximations, we prefer to write (13) as

Note that the approximation havs three parts: a constant part; a polynomial in

More precisely, the FMCI approximation in (13) may be written as

Since |_{
g+1}|<_{1}, the term |_{
g+1}/_{1}|^{
n/(x+1)−ℓ
} tends to 0 exponentially as

Large deviation approximation

Fu et al. (2012) provide the following large deviation approximation for right-tail probabilities for the number of non-overlapping occurrences for simple patterns

**Theorem ****2**.

^{′}(

Comparisons and relative error

For a given

where

Note that _{
n
})) and the finite Markov chain imbedding approximation (F).

**Theorem ****3**. _{
i
}}

and hence (i) follows immediately from the definition of

For the Poisson approximation we have, since

and hence

If _{1}+_{
n
}} tends to 0 exponentially fast which overrides the polynomial term and hence _{
n
}))→−_{
n
}))→

and this completes the proof of (ii). Note also that, if

For the normal approximation we have that _{
n
}(_{
W
} and variance

Hence, provided _{
W
}(

Therefore, as in the proof of (ii), we are interested in the asymptotics of

We may rewrite the argument of the exponential function as

making it clear that (24) converges to

Theorem 3 (ii) implies that asymptotically (for fixed _{
n
} used. When

**Corollary ****1**.

_{
n
}(_{
W
}<−_{1} for all sufficiently large

Now, since **N**, it follows that: **I**−**N**)^{−1}=**A**=(_{
i
j
}); _{
i
j
}≥0 with at least one _{
i
j
}>0; and **A****
1
**

where _{
W
}<

which completes the proof.

Corollary 1 implies that, if _{
n
}∼−_{1} results in the best Poisson approximation as

We also comment that, for the normal approximation, both

and

However, with

and

Thus, _{
i
}}.

Numerical comparisons

In the previous section we showed that, for fixed

The approximations we compare are: the finite Markov chain approximation in (13) (FMCI); the Poisson approximation with _{
W
} is calculated using (7) (Poisson); The normal approximation given in (6) (Normal); and the large deviation approximation given in Theorem 2 (LD), which is only for right-tail probabilities.

Reliability of **
C(k,n:F
**) systems

A consecutive-

**
n
**

**
k
**

**
q
**

**Exact**

**FMCI**

**Poisson**

**CP**

5

2

0.01

0.99960

0.00000

-0.00010

0.00000

5

2

0.10

0.96309

0.00000

-0.00788

0.00119

5

2

0.25

0.79980

-0.00002

-0.02697

0.04654

10

2

0.01

0.99911

0.00000

-0.00010

0.00000

10

2

0.10

0.91975

0.00000

-0.00728

0.00312

10

2

0.25

0.61180

0.00000

-0.00869

0.12266

10

4

0.01

1.00000

0.00000

0.00000

0.00000

10

4

0.10

0.99936

0.00000

-0.00026

0.00000

10

4

0.25

0.97855

0.00000

-0.00776

0.00038

50

2

0.01

0.99516

0.00000

-0.00010

0.00000

50

2

0.10

0.63633

0.00000

-0.00251

0.01871

50

2

0.25

0.07173

0.00000

0.14441

0.96838

50

4

0.01

1.00000

0.00000

0.00000

0.00000

50

4

0.10

0.99577

0.00000

-0.00026

0.00000

50

4

0.25

0.86897

0.00000

-0.00663

0.00312

100

2

0.01

0.99024

0.00000

-0.00010

0.00000

100

2

0.10

0.40151

0.00000

0.00343

0.03854

100

2

0.25

0.00492

0.00000

0.36933

2.97133

100

4

0.01

1.00000

0.00000

0.00000

0.00000

100

4

0.10

0.99129

0.00000

-0.00026

0.00001

100

4

0.25

0.74908

0.00000

-0.00523

0.00656

500

4

0.20

0.52721

0.00000

-0.00086

0.00611

1,000

4

0.20

0.27696

0.00000

0.00183

0.01232

10,000

5

0.20

0.07710

0.00000

0.00183

0.00560

The FMCI approximation performs very well for the parameters tested here. As expected, the Poisson and compound Poisson approximations perform well when ^{
k
} is relatively small. When the reliability of the system is relatively low, the Poisson and compound Poisson approximations begin to degrade.

Approximating the distribution of **
N
**

Recall that _{
n,k
} is the number of non-overlapping occurrences of _{
i
}} (i.e. _{
n,k
}=_{
n
}(_{
n,k
}. In this section we present some examples of approximating

Figure _{2000,4}; (b) _{5000,4}; and (c) _{250000,6} when the probability of success is _{
n
}(

Relative errors of the FMCI, Normal, Poisson and LD approximations _{2000,4}, _{5000,4} and _{250000,6} with

**Relative errors of the FMCI, Normal, Poisson and LD approximations**
**
N
**

We notice that the Finite Markov chain imbedding approximation (FMCI) performs very well in the left tail of the distribution in all cases. Its performance degrades as

As the probability of success

Biological sequences

Sequences of DNA nucleotides are of great interest (as are sequences of amino acids and other biological sequences). Figure

Relative errors of the FMCI, normal, Poisson and large deviation approximations for the patterns

**Relative errors of the FMCI, normal, Poisson and large deviation approximations for the patterns**
**
Λ
**

Discussion and conclusions

The finite Markov chain imbedding approximations (FMCI and LD) provide an alternative to the usual normal and Poisson approximations for the distributions of runs and patterns. While the FMCI approximation is simple, accurate and fast, it has one disadvantage over the normal and Poisson approximations — it requires the use of the FMCI technique, which is non-traditional and less known in the Statistics community, except in the area of system reliability (^{
k
}→_{
A
}, _{
C
}, _{
G
} and _{
T
} do not tend to 0 as

For all of the numeric results in the previous section, the exact probabilities

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BJ and JF contributed equally to the mathematical details. BJ performed the numerical comparisons and prepared the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada.