Design and Implementation of FPGA-based Systolic Array for LZ Data Compression

Hardware implementation of data compression algorithms is receiving increasing attention due to exponentially expanding network traffic and digital data storage usage. Among lossless data compression algorithms suited to hardware implementation, the Lempel-Ziv (LZ) algorithm is one of the most widely used. The main objective of this chapter is to enhance the efficiency of the systolic-array approach to implementing the LZ algorithm. The proposed implementation is area and speed efficient: the compression rate is increased by more than 40% and the design area is decreased by more than 30%. The effect of the selected buffer size on the compression ratio is analyzed. An FPGA implementation of the proposed design is carried out, verifying that data can be compressed and decompressed on-the-fly.


Introduction
Data compression is becoming an essential component of high-speed data communications and storage. Lossless data compression is the process of encoding ("compressing") a body of data into a smaller body of data which can, at a later time, be uniquely decoded ("decompressed") back to the original data. In lossy compression, the decompressed data contains some approximation of the original data. Hardware implementation of data compression algorithms is receiving increasing attention due to the exponential expansion in network traffic and digital data storage usage. Many lossless data compression techniques have been proposed in the past and are widely used, e.g., Huffman codes (Huffman, 1952); (Gallager, 1978); (Park & Prasanna, 1993), arithmetic codes (Bodden et al., 2004); (Said, 2004); (Said, 2003); (Howard & Vetter, 1992), run-length codes (Golomb, 1966), and Lempel-Ziv (LZ) algorithms (Ziv & Lempel, 1977); (Ziv & Lempel, 1978); (Welch, 1984); (Salomon, 2004). Among those, the LZ algorithms are the most popular when no prior knowledge or statistical characteristics of the data being compressed are available. The principle of the LZ algorithms is to find the longest match between the recently received string, which is stored in the input buffer, and the incoming string. Once this match is located, the incoming string is represented with a position tag and a length variable linking the new string to the old existing one. Since repeated data is linked to older data, a more concise representation is achieved and compression is performed. The latency of the compression process is defined by the number of clock cycles needed to produce a codeword (matching result). To fulfill real-time requirements, several hardware realizations of LZ and its variants have been presented in the literature. Different hardware architectures, including content addressable memory (CAM) (Lin & Wu, 2000); (Jones, 1992); (Lee & Yang, 1995), systolic arrays (Ranganathan & Henriques, 1993); (Jung & Burleson, 1998); (Hwang & Wu, 2001), and embedded processors (Chang et al., 1994), have been proposed in the past.

The microprocessor approach is not attractive for real-time applications, since it does not fully exploit hardware parallelism (Hwang & Wu, 2001). CAM has been considered one of the fastest architectures for searching for a given string in a long window, which is a necessary operation in LZ. A CAM-based LZ data compressor can process one input symbol per clock cycle, regardless of the buffer length and string length, and can thus achieve optimum compression speed. However, CAMs require highly complex hardware and dissipate high power. The CAM approach performs string matching through a fully parallel search, while the systolic-array approach exploits pipelining. Compared to CAM-based designs, systolic-array-based designs are slower, but better in hardware cost and testability (Hwang & Wu, 2001); (Hwang & Wu, 1995); (Hwang & Wu, 1997). A preliminary systolic-array design contained thousands of processing elements (PEs) (Ranganathan & Henriques, 1993). Higher-speed designs requiring only tens of PEs were reported later (Jung & Burleson, 1998); (Hwang et al., 2001).

A technique to enhance the efficiency of the systolic-array approach to implementing the Lempel-Ziv algorithm is described in this chapter. A parallel technique for LZ-based data compression is presented. The technique transforms a data-dependent algorithm into a data-independent algorithm. A control variable is introduced to indicate early completion, which improves the latency. The proposed implementation is area and speed efficient. The effect of the input buffer length on the compression ratio is analyzed. An FPGA implementation of the proposed technique is carried out. The implemented design verifies that data can be compressed and decompressed on-the-fly, which opens new areas of research in data compression.

The organization of this chapter is as follows. In Section 2, the LZ compression algorithm is explained, the results of some software simulations are discussed, and the dependency graph (DG) used to investigate the data dependency of every computation step in the algorithm is shown. The most recent systolic array architectures are described and an area and speed efficient architecture is proposed in Section 3. In Section 4, the proposed systolic array structure is compared with the most recent structures (Hwang et al., 2001) in terms of area and latency. An FPGA implementation of the proposed architecture showing the real-time operation is demonstrated in Section 5. Finally, conclusions are provided in Section 6.

Lempel-Ziv coding algorithm
The LZ algorithm was proposed by Ziv and Lempel (Ziv & Lempel, 1977). The relationship between n and Ls for optimal compression performance is briefly examined, and the data dependency of every computation step in the LZ compression algorithm is investigated.

The compression algorithm:
The LZ algorithm and its variants use a sliding window that moves along with the cursor. The window can be divided into two parts: the part before the cursor, called the dictionary, and the part starting at the cursor, called the look-ahead buffer. The lengths of these two parts are input parameters to the compression algorithm. The basic algorithm is very simple and loops over the following steps:

1. Find the longest match of a string starting at the cursor and completely contained in the look-ahead buffer to a string starting in the dictionary.

2. Output a triple (Ip, Lmax, S) containing the position Ip of the occurrence in the window, the length Lmax of the match, and the next symbol S past the match.

3. Move the cursor Lmax + 1 symbols forward.
Let us consider an example with window length n = 9 and look-ahead buffer length Ls = 3, shown in Fig. 1. Let the content of the window be denoted Xi, i = 0, 1, ..., n-1, and that of the look-ahead buffer Yj, j = 0, 1, ..., Ls-1 (i.e., Yj = X(j+n-Ls)). According to the LZ algorithm, the content of the look-ahead buffer is compared with the dictionary content starting from X0 up to X(n-Ls-1) to find the longest match length. If the best match in the window starts at position Ip and the match length is Lmax, then Lmax symbols are represented by a codeword (Ip, Lmax). The codeword length is Lc = l + p, where l = [log2 Ls] bits represent Lmax and p = [log2 (n-Ls)] bits represent Ip.
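For reference, the three steps of the algorithm can be sketched in software. This is a behavioral model only; the function and variable names are illustrative and not part of the chapter's hardware design:

```python
def lz_step(window, n, ls):
    """One LZ encoding step: find the longest match between the
    look-ahead buffer (last ls symbols of the window) and strings
    starting in the dictionary (the first n - ls symbols)."""
    dict_len = n - ls
    lookahead = window[dict_len:]                 # the Y_j symbols
    ip, lmax = 0, 0
    for i in range(dict_len):                     # candidate start X_i
        length = 0
        while length < ls and window[i + length] == lookahead[length]:
            length += 1
        if length > lmax:
            ip, lmax = i, length
    # S, the symbol past the match (None if the whole buffer matched)
    s = lookahead[lmax] if lmax < ls else None
    return ip, lmax, s

# Dictionary "aabbab", look-ahead "bab": the longest match starts at Ip = 3
print(lz_step("aabbabbab", 9, 3))   # (3, 3, None)
```

Note that a match starting near the end of the dictionary may extend into the look-ahead region, as in classic LZ77; the indices above never exceed the window bounds because i < n - Ls and length < Ls.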

Compression Algorithm Parameters Selection
Simulations of the performance of the LZ algorithm for different buffer lengths were performed using the Calgary corpus and the Silesia corpus (Deorowicz, 2003), as shown in Fig. 2 and Fig. 3, respectively. In these experiments, the codeword is up to 2 bytes long. From Fig. 2 and Fig. 3, the compression ratio decreases when n exceeds 1024. The above improvement in the compression ratio can be obtained only when Ls = 8, 16, or 32. Based on the results, the best Ls for a good compression ratio is 2^4 = 16. Increasing Ls beyond that requires a much faster-growing n (as well as hardware cost and computation time), with a saturating or even decreasing compression ratio. The reason is that repeating patterns tend to be short, and that increasing Ls and n also increases the codeword length (l + p). To achieve good performance for different data formats, Ls may range from 8 to 32, while n may range from 1K to 8K. The simulation results agree with those reported in (Arias et al., 2004).

Dependency graph:
A dependency graph (DG) is a graph that shows the dependence of the computations that occur in an algorithm. The DG of the LZ algorithm can be obtained as shown in Fig. 4. In the DG, L (match length) and E (match signal) are propagated from cell to cell. X (content of the window) and Y (content of the look-ahead buffer) are broadcast horizontally and diagonally to all cells, respectively. The DG shown in Fig. 4 is called a global DG, because it contains global signals. The global DG can be transformed into a localized DG by propagating the input data Y and X from cell to cell instead of broadcasting them. Processor assignment can be done by projection of the DG onto the surface normal to the selected projection vector. After processor assignment, the events are scheduled using a schedule vector.
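The cell computation in the DG can be modeled in software as a simple recurrence: along each row, the match signal E is ANDed with the result of one symbol comparison and the match length L is accumulated. The following sketch is illustrative only (names are not from the chapter):

```python
def dg_match_lengths(x, y):
    """Software model of the LZ dependency graph: cell (i, j) ANDs the
    running match signal E with (X[i+j] == Y[j]) and accumulates the
    match length L, so row i ends with L_i, the match length at
    dictionary position i."""
    lengths = []
    for i in range(len(x) - len(y)):       # dictionary positions 0..n-Ls-1
        e, length = True, 0                # E: match signal, L: match length
        for j in range(len(y)):
            e = e and (x[i + j] == y[j])
            if e:
                length += 1
        lengths.append(length)
    return lengths

# Window "aabbabbab" (n = 9) with look-ahead buffer "bab" (Ls = 3):
print(dg_match_lengths("aabbabbab", "bab"))   # [0, 0, 1, 3, 0, 1]
```

Each systolic design in the next section is one way of mapping the rows and columns of this recurrence onto processing elements and clock cycles.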

Systolic array architecture
Comparisons of hardware architectures for LZ data compression demonstrate that systolic array compressors are better in hardware cost and testability. The main strength of systolic arrays over other architectures is that they can operate at a higher clock rate (due to nearest-neighbor communication) and can easily be implemented and tested (due to regularity and homogeneity). In the following subsections, the most recent systolic array architectures are described, and then a high-performance architecture is proposed.

Design-1
This architecture was first proposed in (Ranganathan & Henriques, 1993). The space-time diagram and the final array architecture are given in Fig. 5, where D represents a unit delay on the signal line between two processing elements. As shown in Table 1, six sets of comparisons have to be done in sequence in order to find the maximum matching substring.

Table 1. The six sets of required comparisons

Let us consider six processing elements (PEs) in parallel, each performing one vertical set of comparisons. Each processing element requires 3 time units (Ls = 3) to complete its set of comparisons. As shown in Fig. 5, the delay blocks in each PE delay Y by two time steps and X by one time step. A space-time diagram is used to illustrate the sequence of comparisons performed by each PE. The data brought into PE0 are routed systolically through each processor from left to right. In the first time unit, X0 and Y0 are compared at PE0. In the second time unit, X1 and Y1 are compared, X0 flows to PE1, and Y0 is delayed by one cycle (time unit). In the third time unit, X2 and Y2 are compared at PE0; at this time, Y0 reaches PE1 along with X1, and PE1 performs its first comparison. At the third cycle, PE0 completes all its required comparisons and stores an integer specifying the number of successful comparisons in a register called Li. Another register, called Lmax, holds the maximum matching length obtained from the previous PEs. In the fourth time unit, PE0 compares the values of Lmax (which for PE0 is 0) and Li, and the greater of the two is sent to the Lmax register of the next PE. The result of the Li-Lmax comparison is sent to the next PE after a delay of one time unit for proper synchronization. Finally, the Lmax value emerging from the last PE (PE5 in this case) is the length of the longest matching substring. There is another register, the PE id, whose contents are passed along with the Lmax value to the next PE. Its contents indicate the id of the processing element where the Lmax value occurred, which becomes the pointer to the match. The functional block of the PE is shown in Fig. 6, in which the control circuit is not included. Two comparators are needed in the PE: one for the equality check of Yj and Xi, and the other, together with two multiplexers, for determining Lmax and Ip. If Yj and Xi are equal, a counter is incremented each time until an unsuccessful comparison occurs. The sequences Xi and Yj can be generated by the buffer shown in Fig. 7, which is organized in two levels: the upper level of the buffer holds the incoming symbols to be compressed, and its contents are copied into the lower level whenever the "load" line goes high. The lower level is used to provide data to the PEs in the correct sequence. The operation of the buffer is as follows. When the longest match length is found, the same number of symbols is shifted into the upper buffer from the source, and then the symbols in the upper buffer are copied to the lower buffer in parallel to generate the next sequence for the processor array. In the Design-1 array, the number of clock cycles needed to produce a codeword is 2(n-Ls), so the utilization rate of each PE is Ls/[2(n-Ls)], which is low since the PE is idle from the moment Li is determined until the time the codeword is produced. The reason is that it is impossible to compress subsequent input symbols before the present compression process is completed, because the number of input symbols to be shifted into the buffer equals the longest match length, which is not available before the completion of the present compression process. Therefore, any design with more than Ls pipeline stages must have some idle PEs before the present codeword is produced.
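The latency and utilization figures for Design-1 can be checked with a small helper. This is illustrative only; the formulas are the ones stated in the text:

```python
def design1_stats(n, ls):
    """Latency and PE utilization of the Design-1 array, using the
    formulas from the text: 2*(n - ls) clock cycles per codeword,
    with each PE doing useful work for only ls of them."""
    cycles = 2 * (n - ls)
    utilization = ls / cycles
    return cycles, utilization

# Running example from the text: n = 9, ls = 3
cycles, util = design1_stats(9, 3)
print(cycles, util)   # 12 0.25
```

For realistic parameters the utilization is far worse, e.g. n = 1024, Ls = 16 gives 16/2016, under 1%, which motivates the later designs.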

Design-2:
Design-2 was first proposed in (Hwang & Wu, 2001). The space-time diagram and its array architecture are given in Fig. 8. It consists of Ls processing elements. The match elements Yj stay in the PEs, while Xi and Li both flow leftwards with delays of 1 and 2 clock cycles, respectively. The first Li from the leftmost PE is obtained after 2Ls clock cycles. After that, one Li is obtained every clock cycle. The block diagram of the Design-2 PE is shown in Fig. 9.

Design-3:
Design-3 was proposed in (Ranganathan & Henriques, 1993). The space-time diagram and the resulting array are given in Fig. 10. The value of Yj stays in the PE. The buffer element Xi moves systolically from right to left with a delay of 2 clock cycles, and Li propagates from right to left with a delay of 1 clock cycle. The first Li from the leftmost PE is obtained after Ls clock cycles; after that, subsequent ones are obtained every clock cycle. The structure of the Design-3 PE is shown in Fig. 11. A match results block (MRB) is needed to determine Lmax and Ip.

The Interleaved Design:
From the dependency graph shown in Fig. 4, the interleaved design is obtained by projecting all the nodes in a particular row onto a single processing element. This design was first proposed in (Hwang & Wu, 2001). The space-time diagram and the resulting array of the interleaved design are given in Fig. 12. The Yj values, which need to be accessed in parallel, do not change during the encoding step. A special buffer is needed to generate the interleaved symbols Xi and X(i+(n-Ls)/2), as shown in Fig. 13. The first Li is obtained after Ls clock cycles from the leftmost PE, and subsequent ones are obtained every clock cycle. Before the encoding process, the Yj values are preloaded, which takes Ls extra cycles. During the encoding process, the time to preload new source symbols depends on how many source symbols were compressed in the previous compression step, Lmax. The block diagram of the processing element is shown in Fig. 14.
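The interleaving order produced by the special buffer can be sketched as follows, assuming it alternates positions from the two halves of the dictionary as described above. This is an illustrative model of the emission order, not the buffer circuit:

```python
def interleaved_positions(n, ls):
    """Order in which the special buffer would emit dictionary
    positions for the interleaved design: a position from the first
    half alternates with the corresponding position from the second
    half (offset (n - ls) / 2)."""
    half = (n - ls) // 2
    seq = []
    for i in range(half):
        seq.extend([i, i + half])
    return seq

# n = 9, ls = 3: dictionary positions 0..5, halves {0,1,2} and {3,4,5}
print(interleaved_positions(9, 3))   # [0, 3, 1, 4, 2, 5]
```

Feeding two dictionary positions per emitted pair is what lets the interleaved design halve the search time relative to scanning the dictionary serially.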

Proposed Design (Design-P)
From the dependency graph shown in Fig. 4, all the nodes in a particular row are projected onto a single processing element (Abd El Ghany, 2007). This produces an array of length Ls. The space-time diagram and the resulting array of Design-P are given in Fig. 17, where D represents a unit delay on the signal line between two processing elements. As shown in Fig. 17, the architecture consists of Ls processing elements, which are used for the comparisons, and an L-encoder, which is used to produce the matching length. Consequently, the look-ahead buffer symbols Yj, which do not change during the encoding step, stay in the PEs. As shown in Fig. 17, the maximum matching length is not produced by the L-encoder itself, so a match results block (MRB) is needed, as shown in Fig. 20, to determine Lmax among the serially produced Li values. Also, the PEs need not store their ids to record the positions of the Li values (Ip). Since p = [log2(n-Ls)] bits are required to represent Ip, only a p-bit counter is required to provide the position i associated with each Li, because the time when Li is produced corresponds to its position. The MRB uses a comparator to compare the current input Li with the present longest match length Lmax stored in a register. If the current input Li is larger than Lmax, then Li is loaded into the register and the content of the position counter is loaded into another register which stores the present Ip. Another comparator is used to determine whether the whole window has been searched: it compares the content of the position counter with n-Ls, and its output is used as the codeword-ready signal. During the searching process, Li might equal Ls when i < n-Ls, i.e., the content of the look-ahead buffer can be fully matched to a subset of the dictionary, and hence searching the whole window is not always necessary. An extra comparator is used to determine whether Lmax equals Ls, in which case the string matching process is complete and encoding a new set of data can start immediately. This reduces the average compression time. The number of clock cycles needed to produce a codeword is (n-Ls) + 1, so the utilization rate of each PE is (n-Ls)/[(n-Ls)+1], which is almost equal to one. This is because the PE remains busy from the moment Li is determined until the time at which the codeword is produced.
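The latency and utilization of Design-P, and the comparison against Design-1's 2(n-Ls) cycles, can be expressed as a small helper (illustrative only; the formulas are those stated in the text):

```python
def design_p_stats(n, ls):
    """Latency and PE utilization of Design-P, using the formulas
    from the text: (n - ls) + 1 clock cycles per codeword, with the
    PE busy for n - ls of them."""
    cycles = (n - ls) + 1
    return cycles, (n - ls) / cycles

# For n = 1024, ls = 16, Design-P needs roughly half the cycles of
# Design-1, which needs 2*(n - ls) cycles per codeword:
cycles_p, util_p = design_p_stats(1024, 16)
cycles_1 = 2 * (1024 - 16)
print(cycles_p, cycles_1, round(util_p, 4))
```

The early-completion comparator (Lmax = Ls) only improves on these worst-case figures, since a full look-ahead match lets the next encoding step start before the position counter reaches n-Ls.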
The proposed implementation is area and speed efficient. The compression rate is increased by more than 40% and the design area is decreased by more than 30%. The design can be integrated into real-time systems so that data can be compressed and decompressed on-the-fly.
1) Lc is fixed. Assume w bits are required to represent a symbol in the window, l = [log2 Ls] bits are required to represent Lmax, and p = [log2 (n-Ls)] bits are required to represent Ip. Then the compression ratio is (l + p) / (Lmax * w), where 0 ≤ Lmax ≤ Ls. Hence the compression ratio depends on the match situation.
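As a worked example of this formula (an illustrative helper, not part of the design):

```python
import math

def compression_stats(n, ls, lmax, w=8):
    """Codeword size and compression ratio per the formula above:
    l = ceil(log2 ls) bits for Lmax, p = ceil(log2 (n - ls)) bits
    for Ip, and ratio = (l + p) / (lmax * w)."""
    l = math.ceil(math.log2(ls))
    p = math.ceil(math.log2(n - ls))
    return l + p, (l + p) / (lmax * w)

# n = 1024, ls = 16, a full-length match (Lmax = 16), 8-bit symbols:
bits, ratio = compression_stats(1024, 16, 16)
print(bits, ratio)   # 14 0.109375
```

In the best case a 14-bit codeword replaces 128 bits of source data; short matches give ratios near or above 1, which is why the match statistics of the source data dominate overall compression performance.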

Fig. 1. Window of the LZ compressor example. The codeword design and the choice of window length are crucial in achieving maximum compression. The LZ technique involves the conversion of variable-length substrings into fixed-length codewords that represent the pointer and the length of the match. Hence, the selection of the values of n and Ls can greatly influence the compression efficiency of the LZ algorithm.

Fig. 2. The relationship between the compression ratio of the Calgary corpus and Ls for different values of n.

Fig. 4. The dependence graph of the LZ compression algorithm.

Fig. 5. Design-1 and its space-time diagram indicating the sequence of events in the six PEs.

Fig. 10. The space-time diagram and the resulting array of Design-3.

Fig. 14. The block diagram of the interleaved design PE. A match results block (MRB) is needed to determine Lmax among the serially produced Li values; the MRB is shown in Fig. 15. The PEs do not need to store their ids to record the positions of the Li values. A special counter is needed to generate the sequence which interleaves the positions of the first half of the Li values with the positions of the second half, as shown in Fig. 16. The compression time of the interleaved design is n clock cycles.

Fig. 20. The match results block (MRB) for Design-P. Parallel compression can be achieved by using an appropriate number of Design-P modules. For example, two modules of Design-P can be used, as shown in Fig. 21. The input sequence of the first module (Xi) is obtained from the first position of the buffer. The input sequence of the second module (X(i+(n-Ls)/2)) is obtained starting at position (n-Ls)/2 in the dictionary. Note that the MRB now needs to determine Lmax among the two lengths LI and LII that are produced at the same time, so the MRB must be modified accordingly. The speed is twice that of a single Design-P array.