Algorithm II
Review of Princeton Algorithms II course on Coursera.
My solution to the homework
Course Overview
topic  data structures and algorithms 

data types  stack, queue, bag, unionfind, priority queue 
sorting  quicksort, mergesort, heapsort 
searching  BST, redblack BST, hash table 
graphs  BFS, DFS, Prim, Kruskal, Dijkstra 
strings  radix sorts, tries, KMP, regexps, data compression 
advanced  Btree, suffix array, maxflow 
Undirected Graphs
Terminology:
 Path: sequence of vertices connected by edges
 Cycle: path whose first and last vertices are the same
Some graph problems:
 Path: Is there a path between \(s\) and \(t\)
 Shortest path: What is the shortest path between \(s\) and \(t\)
 Cycle: Is there a cycle in the graph
 Euler tour: Is there a cycle that uses each edge exactly once
 Hamilton tour: Is there a cycle that uses each vertex exactly once.
 Connectivity: Is there a way to connect all of the vertices?
 MST: What is the best way to connect all of the vertices?
 Biconnectivity: Is there a vertex whose removal disconnects the graph?
 Planarity: Can you draw the graph in the plane with no crossing edges
 Graph isomorphism: Do two adjacency lists represent the same graph?
API
1  public class Graph { 
For vertices, convert them between names and integers with symbol table
For edges, we have 3 choices:
 Maintain a list of the edges
 Maintain a 2D \(V\)by\(V\) boolean matrix
 Maintain a vertexindexed array of lists
Comparison:
Representation  Space  Add edge  Edge between v and w  Iterate over vertices adjacent to v 

list of edges  E  1  E  E 
adjacency matrix  V^{2}  1  1  V 
adjacency lists  E + V  1  degree(v)  degree(v) 
In practice, use adjacency lists because
 Algorithms based on iterating over vertices adjacent to \(v\)
 Realworld graphs tend to be sparse
1  public class Graph { 
DepthFirst Search
 Mark vertex \(v\) as visited
 Recursively visit all unmarked vertices adjacent to \(v\)
1  public class DepthFirstPaths { 
After DFS, can find vertices connected to \(s\) in constant time (all true vertices in marked
) and can find a path to \(s\) in time proportional to its length (edgeTo
is a parentlink representation of a tree rooted at \(s\)).
BreadthFirst Search
 Remove vertex \(v\) from queue
 Add to queue all unmarked vertices adjacent to \(v\) and mark them
1  public class BreadthFirstPaths { 
BFS examines vertices in increasing distance from \(s\). It can be used to find the shortest path.
Connected Components
Definition:
 Vertices \(v\) and \(w\) are connected if there is a path between them
 A connected component is a maximal set of connected vertices
Use DFS to partition vertices into connected components, then can answer whether \(v\) is connected to \(w\) in constant time
1  public class CC { 
Directed Graphs
API
Almost same as Graph
:
1  public class DiGraph { 
Digraph Search
Both DFS and BFS are digraph algorithms, code is identical to undirected graphs.
DFS search applications:
 Every program is a digraph. Vertex=basic block of instructions; edge=jump
 Deadcode elimination: find and move unreachable code
 Infiniteloop detection: whether exit is unreachable
 Every data structure is a digraph. Vertex=object; edge=reference
 Marksweep algorithm: collect unreachable objects as garbage
BFS search applications:
 Multiplesource shortest paths: initialize by enqueuing all source vertices
 Web crawler
Topological Sort
Precedence scheduling: Given a set of tasks to be completed with precedence constraints, in which order should we schedule the tasks?
 Represent data as digraph: vertex=task; edge=precedence constraint
 Represent the problem as topological sort: redraw directed acyclic graph (DAG) so all edges point upwards
Run DFS and return vertices in reverse postorder.
Strong Components
Definition:
 Vertices \(v\) and \(w\) are strongly connected if there is both a directed path from \(v\) to \(w\) and a directed path from \(w\) to \(v\)
 A Strong component is a maximal subset of stronglyconnected vertices
Applications:
 Food Web. Vertex=species; edge=from producer to consumer; strong component=subset of species with common energy flow
 Software module dependency graph. Vertex=software module; edge=from module to dependency; strong component=subset of mutually interacting modules
KosarajuSharir algorithm:
 Compute topological order (reverse postorder) in Reverse Graph \(G^R\)
 Run DFS in \(G\), visiting unmarked vertices in the order computed above
1  public class KosarajuSharirSCC { 
HW6 Wordnet
 WordNet digraph: build digraph from input files. WordNet.java
synsets.txt
saves all the id and corresponding words (one id can have one or more words, and one word can have one or more ids)hypernyms.txt
contains hypernym relationships
 Shortest ancestral path: given two vertices, find the directed paths to their common ancestor with the max total length. SAP.java
 Outcast detection: given a list of WordNet nouns, find the noun least related to the others. This can be measured by the sum of SAP distances to all the other vertices. Outcast.java
Minimum Spanning Trees
Definition: A spanning tree of \(G\) is a subgraph \(T\) that is both a tree (connected and acyclic) and spanning (includes all of the vertices)
Problem: Given undirected graph \(G\) with positive edge weights, find the min weight spanning tree.
EdgeWeighted Graph API
Edge abstraction needed for weighted edges.
1  public class Edge implements Comparable<Edge> { 
Edgeweighted graph representation:
1  public class EdgeWeightedGraph { 
Greedy Algorithm
Simplifications:
 Edge weights are distinct
 Graph is connected
Then MST exists and is unique.
Definitions:
 A cut in a graph is a partition of its vertices into two (nonempty) sets
 A crossing edge connects a vertex in one set with a vertex in the other.
Given any cut, the crossing edge of min weight is in the MST.
Algorithm:
 Start with all edges colored white
 Find cut with no black crossing edges; color its minweight edge black
 Repeat until \(V1\) edges are colored black
Remove simplifying assumptions:
 Greedy MST still correct if equal weights are present
 Compute MST of each component if graph is not connected
Efficient implementations: How to choose cut? How to find minweight edge?
Kruskal's Algorithm
 Consider edges in ascending order of weight
 Add next edge to tree \(T\) unless doing so would create a cycle
The challenge is how to examine adding edge \(vw\) to tree \(T\) create a cycle. Use unionfind:
 Maintain a set for each connected component in \(T\)
 If \(v\) and \(w\) are in same set, then adding \(vw\) would create a cycle
 To add \(vw\) to \(T\), merge sets containing \(v\) and \(w\)
1  public class KruskalMST { 
Prim's Algorithm
 Start with vertex \(0\) and greedily grow tree \(T\)
 Add to \(T\) the min weight edge with exactly one endpoint in \(T\)
 Repeat until \(V1\) edges
Lazy Implementation
Maintain a PQ of edges with (at least) one endpoint in \(T\), where the priority is the weight of edge
 Deletemin to determine next edge \(e=vw\) to add to \(T\)
 Disregard if both endpoints \(v\) and \(w\) are marked (both in \(T\))
 Otherwise, let \(w\) be the unmarked vertex (not in \(T\))
 add to PQ any edge incident to \(w\)
 add \(e\) to \(T\) and mark \(w\)
1  public class LazyPrimMST { 
Eager Implementation
Maintain a PQ of vertices connected by an edge to \(T\), where priority of vertex \(v\) = the weight of shortest edge connecting to \(T\)
 Delete min vertex \(v\) and add its associated edge \(vw\) to \(T\)
 Update PQ by considering all edges \(vx\) incident to \(v\)
 ignore if \(x\) is already in \(T\)
 add \(x\) to PQ if not already in it
 update priority of \(x\) if \(vx\) becomes shortest edge connecting \(x\) to \(T\)
1  public class PrimMST { 
And we need a IndexMinPQ
data structure
1  public class IndexMinPQ<Key extends Comparable<Key>> { 
Shortest Paths
Given an edgeweighted digraph, find the shortest path from \(s\) to \(t\).
Algorithm  Restriction  Typical case  Worst case  Extra space 

Topological sort  no directed cycles  E + V  E + V  V 
Dijkstra (binary heap)  no negative weights  E log V  E log V  V 
BellmanFord  no negative cycles  E V  E V  V 
BellmanFord (queuebased)  no negative cycles  E + V  E V  V 
API
Weighted directed edge API:
1  public class DirectedEdge { 
The Edgeweighted digraph API is the same as EdgeWeightedGraph
.
Singlesource shortest paths API:
1  public class SP { 
Edge Relaxation
Relax edge \(e=v \rightarrow w\)
distTo[v]
is length of shortest known path from \(s\) to \(v\)distTo[w]
is length of shortest known path from \(s\) to \(w\)edgeTo[w]
is last edge on shortest known path from \(s\) to \(w\) If \(e = v \rightarrow w\) gives shorter path to \(w\) through \(v\), update both
distTo[w]
andedgeTo[w]
Different ways to choose which edge to relax:
 Dijkstra's algorithm (nonnegative weights)
 Topological sort algorithm (no directed cycles)
 BellmanFord algorithm (no negative cycles)
Dijkstra's Algorithm
 Consider vertices in increasing order of distance from \(s\)
 Add vertex to tree and relax all edges pointing from that vertex
1  public class DijkstraSP { 
Prim's algorithm is essentially the same algorithm, distinct in the way to choose next vertex for the tree:
 Prim's algorithm choose the closest vertex to the tree (via undirected edge)
 Dijkstra's algorithm choose the closest vertex to the source (via a directed path)
Topological Sort
Suppose the edgeweighted digraph has no directed cycles
 Consider vertices in topological order
 Relax all edges pointing from that vertex
It is more efficient to get the SPT. Time is proportional to \(E + V\)
1  public class AcyclicSP { 
Applications:
 Seam carving: Resize an image without distortion
 Grid DAG: vertex = pixel; edge = from pixel to 3 downward neighbors
 Weight of pixel = energy function of 8 neighboring pixels
 Seam = shortest path (sum of vertex weights) from top to bottom
 Find longest paths in edgeweighted DAGs
 Topological sort algorithm works with negative weights.
 Negate all the weights, find shortest path, negate weights in result.
 Parallel job scheduling: Given a set of jobs with durations and precedence constraints, schedule the jobs (by finding a start time for each) so as to achieve the minimum completion time.
 Source and sink vertices. Two vertices for each job (begin and end).
 Three edges for each job. Begin to end (weighted by duration); source to begin (0 weight); end to sink (0 weight)
 One edge for each precedence constraint (0 weight)
 Use longest path from the source to schedule each job
BellmanFord
Dijkstra doesn't work with negative edge weights.
A SPT exists iff no negative cycles, A negative cycle is a directed cycle whose sum of edge weights is negative.
BellmanFord algorithm:
1  for (int i = 0; i < G.V(); i++) 
Dynamic programming algorithm can be used to compute SPT in any edgeweighted graph with no negative cycles in time proportional to \(E \times V\):
 If
distTo[v]
does not change during pass \(i\), no need to relax any edge pointing from \(v\) in pass \(i+1\)  Maintain a queue of vertices whose
distTo[]
changed
Negative cycle can be found by BellmanFord algorithm too: If any vertex \(v\) is updated in phase \(V\), there exists a negative cycle.
Negative cycle application: Arbitrage detection. Given table of exchange rates, is there an arbitrage opportunity?
 Vertex = currency; edge = transaction; weight = exchange rate
 Find a directed cycle whose product of edge weights > 1
 Take \(\ln\) for the weights so that multiplication > 1 turns to addition < 0
 Equivalent to find a negative cycle
HW7 Seam Carving
Seam Carving is a contentaware image resizing technique. Remove one row/column every time.
 Design energy function for pixels
 Find seam with minimal energy with one pixel every row/column, and every two adjacent seam points differ at most 1 column/row
 Delete seam
For this problem, dynamic programming is equivalent to treat the image as a graph and find the shortest path on it.
 Treat every pixel as a vertex \((column, row)\) and there are edges connecting from it to vertexes \((column  1, row + 1)\), \((column, row + 1)\) and \((column + 1, row + 1)\), unless the vertex is invalid.
 This graph is a DAG so the shortest path problem can be solved by topological sort. Moreover, since edges are connected layer by layer, we can sort it layer by layer too.
 To optimize the algorithm, energy can be precomputed and stored. When removing seams, only points near the removed location should be updated.
 Find/remove horizontal and vertical seam are equivalent, except for a transpose.
Solution: SeamCarver.java
Maximum Flow
Definition:
 A \(st\)cut is a partition of the vertices into two disjoint sets, with \(s\) in one set \(A\) and \(t\) is the other set \(B\)
 Its capacity is the sum of the capacities of the edges from \(A\) to \(B\)
 A \(st\)flow is a an assignment of values to the edges such that:
 Capacity constraints: \(0 \le\) edge's flow \(\le\) edge's capacity
 Local equilibrium: inflow = outflow at every vertex (except \(s\) and \(t\))
 The value of a flow is the inflow at \(t\)
Mincut problem: Find a cut of minimum capacity
Maxflow problem: Find a flow of maximum value
FordFulkerson Algorithm
For digraph with \(V\) vertices, \(E\) edges and integer capacities between \(1\) and \(U\):
 Start with \(0\) flow
 Fina an undirected path from \(s\) to \(t\) such that (Augmenting Path)
 Can increase flow on forward edges (not full)
 Can decrease flow on backward edge (not empty)
 Increase flow on that path by bottleneck capacity
FF performance depends on choice of augmenting paths:
Augmenting path  Number of paths  Implementation 

shortest path  \(\le 1/2 E V\)  queue (BFS) 
fattest path  \(\le E \ln (E U)\)  priority queue 
random path  \(\le E U\)  randomized queue 
DFS path  \(\le E U\)  stack (DFS) 
MaxflowMincut Theorem
Definition: The net flow across a cut \((A, B)\) is the sum of the flows on its edges from \(A\) to \(B\) minus the sum of the flows on its edges from \(B\) to \(A\)
Flowvalue lemma: Let \(f\) be any flow and let \((A, B)\) be any cut. The net flow across \((A, B)\) equals the value of \(f\)
Maxflowmincut theorem: Value of the maxflow = capacity of mincut
To compute mincut \((A, B)\) from maxflow \(f\): Compute \(A\) = set of vertices connected to \(s\) by an undirected path with no full forward or empty backward edges.
Implementation
Residual network:
 Use forward edge to represent residual capacity
 Use backward edge to represent flow
Augmenting path in original network is equivalent to directed path in residual network
Flow Edge API:
1  public class FlowEdge { 
Flow Network API:
1  public class FlowNetwork { 
FordFulkerson algorithm:
1  public class FordFulkerson { 
Applications
Bipartite Matching Problem
\(N\) students apply for \(N\) jobs, each gets several offers. Is there a way to match all students to jobs?
Equivalent to: Given a bipartite graph, find a perfect matching.
 Create \(s\), \(t\), one vertex for each student, and one vertex for each job
 Add edge from \(s\) to each student (capacity \(1\))
 Add edge from each job to \(t\) (capacity \(1\))
 Add edge from student to each job offered (capacity infinity)
Every perfect matching in bipartite graph corresponds to a maxflow of value \(N\)
Baseball Elimination Problem
Given each team's current wins and losses, and their games left to play, determine whether team \(i\) can be eliminated (cannot win).
Construct the graph:
 Create \(s\), \(t\), one vertex for each pair of teams other than \(i\): \(j \leftrightarrow k\), and one vertex for each team other than \(i\)
 Add edge from \(s\) to each pair of teams (capacity = number of games left between this pair)
 Add edge from each team \(j\) to \(t\) (capacity = \(w_i + r_i  w_j\) ) upperbound on the games that \(j\) can win
 Add edge from each pair of teams to two corresponding teams (capacity infinity)
Team \(i\) will not be eliminated iff all edges pointing from \(s\) are full in maxflow
HW8 Baseball Elimination
To check whether team \(x\) is eliminated, we consider two cases:
 Trivial elimination: The maximum number of games team \(x\) can win is less than the number of wins of some other team \(i\)
 Nontrivial elimination: Solve a maxflow problem as mentioned above
Solution: BaseballElimination.java
String Sorts
worst  average  extra space  stable?  

LSD  \(2NW\)  \(2NW\)  \(N + R\)  ✔ 
MSD  \(2NW\)  \(N \log_R N\)  \(N + DR\)  ✔ 
3way string quicksort  \(1.39 W N \lg R\)  \(1.39 N \lg R\)  \(\lg N + W\) 
String vs. StringBuilder
Operations:
 Length: Number of characters
 Indexing: Get the \(i\)th character
 Substring extraction: Get a contiguous subsequence of characters
 String concatenation: Append one character to end of another string
length()  charAt()  substring()  concat()  

String  1  1  1  N 
StringBuilder  1  1  N  1 
String:
 sequence of characters (immutable)
 Immutable
char[]
array, offset, and length
StringBuilder:
 Sequence of characters (mutable)
 Resizing
char[]
array and length
KeyIndexed Counting
Comparebased algorithms require \(\sim N\lg N\) compares.
We can do better if we don't depend on key compares.
Keyindexed counting:
 Assumption: keys are integers between \(0\) and \(R1\)
 Implication: can use key as an array index
Sort an array a[]
of \(N\) integers between \(0\) and \(R1\):
 Count frequencies of each letter using key as index
 Compute frequency cumulates which specify destinations
 Access cumulates using key as index to move items
 Copy back into original array
1  int N = a.length; 
LSD Radix Sort
LSD (LeastSignificantDigitfirst) string sort:
 Consider characters from right to left
 Stably sort using \(d^{th}\) character as the key (keyindexed counting)
1  public class LSD { 
MSD Radix Sort
MSD (MostSignificantDigitFirst) string sort:
 Partition array into \(R\) pieces according to first character
 Recursively sort all strings that start with each character
Treat variablelength strings as if they had an extra char at end (smaller than any char)
1  private static int charAt(String s, int d) { 
1  public static void sort(String[] a) { 
MSD String Sort vs. Quicksort
Disadvantages of MSD string sort:
 Extra space for
aux[]
 Extra space for
count[]
 Inner loop has a lot of instructions
 Accesses memory "randomly" (cache inefficient)
Disadvantage of quicksort:
 Linearithmic number of string compares (not linear)
 Has to rescan many characters in keys with long prefix matches
3Way Radix Quicksort
Do 3way partitioning on the \(d^{th}\) character
 Less overhead than \(R\)way partitioning in MSD string sort
 Does not reexamine characters equal to the partitioning char
1  private static void sort(String[] a) 
3Way String Quicksort vs. Standard Quicksort
Standard quicksort:
 Uses \(\sim 2 N \ln N\) string compares on average
 Costly for keys with long common prefixes (and this is a common case)
3way string quicksort:
 Uses \(\sim 2 N \ln N\) character compares on average
 Avoids recomparing long common prefixes
Suffix Arrays
Problem: Given a text of \(N\) characters, preprocess it to enable fast substring search (find all occurrences of query string context)
 suffix sort the text
 Binary search for query; scan until mismatch
Tries
String symbol table API:
1  public class StringST<Value> { 
Goal: Faster than hashing, more flexible than BSTs.
RWay Tries
 Store characters in nodes (not keys)
 Each node has \(R\) children, one for each possible character
Search: Follow links corresponding to each character in the key
 Search hit: node where search ends has a nonnull value
 Search miss: reach null link or node where search ends has null value
Insertion: Follow links corresponding to each character in the key
 Encounter a null link: create new node
 Encounter the last character of the key: set value in that node
Deletion:
 Find the node corresponding to the key and set value to null
 If node has null value and all null links, remove that node (and recur)
1  public class TriesST<Value> { 
Performance:
 Search hit: Need to examine all \(L\) characters for quality
 Search miss: Examine only a few characters for typical case
 Space: \(R\) null links at each leaf (sublinear if many short strings share common prefixes)
Ternary Search Tries
 Store characters and values in nodes (not keys)
 Each node has 3 children: smaller (left), equal (middle), larger (right)
Search: Follow links corresponding to each character in the key
 If less, take left link; if greater, take right link
 If equal, take the middle link and move to the next key character
Insertion: Follow links corresponding to each character in the key
 Encounter a null link: create new node
 Encounter the last character of the key: set value in that node
Deletion:
 Find the node corresponding to the key and set value to null
 If node has null value and all null links, remove that node (and recur)
1  public class TST<Value> { 
TST vs. Hashing
Hashing:
 Need to examine entire key
 Search hits and misses cost about the same
 Performance relies on hash function
 Does not support ordered symbol table operations
TST:
 Works only for strings (or digital keys)
 Only examines just enough key characters
 Search miss may involve only a few characters
 Support ordered symbol table operations (plus others)
Substring Search
Find pattern of length \(M\) in a text of length \(N\). Typically \(N \gg M\)
Brute force: Check for pattern starting at each text position.
 We want lineartime guarantee
 We want to avoid backup problem: treat input as stream of data
KnuthMorrisPratt
DFA
DFA (Deterministic finite state automaton):
 Finite number of states (including start and halt)
 Exactly one transition for each char in alphabet
 Accept if sequence of transitions leads to halt state
The DFA state after reading in text[i]
is the length of longest prefix of pattern[]
that is a suffix of text[0:i]
Constructing the DFA:
 Copy of
dfa[][X]
todfa[][j]
for mismatch case  Set
dfa[pat.charAt(j)][j]
toj+1
for match case  Update
X
1  public KMP(String pat) { 
KMP
Once we have the dfa
matrix, we can use this make transition without backup
1  public int search(In in) { 
BoyerMoore
 Scan characters in pattern from right to left
 Can skip as many as \(M\) text chars when finding one not in the pattern
Precompute index of rightmost occurrence of character in pattern to decide how much to skip:
1  right = new int[R]; 
BoyerMoore:
1  public int search(String txt) { 
RabinKarp
Use modular hashing:
 Compute a hash of pattern characters \(0\) to \(M1\)
 For each \(i\), compute a hash of text characters \(i\) to \(M+i1\)
 if pattern hash = text substring hash, check for a match
Modular hash function: use the notation \(t_i\) for txt.charAt(i)
, compute \[
\displaylines{x_{i}=t_{i} R^{M1}+t_{i+1} R^{M2}+\ldots+t_{i+M1} R^{0}(\bmod Q)
}
\] Horner's method: lineartime method to evaluate degree\(M\) polynomial
1  private long hash(String key, int M) { 
RabinKarp:
1  public class RabinKarp { 
HW9 Boggle
Find valid words in a board by connecting characters to its eight neighbors.
 Add dictionary to 26way Tries
 Use dfs to search the board. Early termination can be made when no prefix match
 dfs can be run with node expansion together to avoid starting from the root every time
Solution: BoggleSolver.java
Regular Expressions
 Substring search: Find a single string in text
 Pattern matching: Find one of a specified set of strings in text
Regular Expression Pattern:
operation  order  example RE  matches  dose not match 

concatenation  3  AABAAB  AABAAB  every other string 
or  4  AA  BAAB  AA BAAB 
every other string 
closure  2  AB*A  AA ABBBBBA 
AB ABABA 
parentheses  1  A(A  B)AAB  AAAAB ABAAB 
every other string 
parentheses  1  (AB)*A  A ABABABABA 
AA ABBA 
wildcard  .U.U.U.  CUMULUS JUGULUM 
SUCCUBUS TUMULTUOUS 

character class  [AZaz][az]*  word Captitalized 
camelCase 4illegal 

at least 1  A(BC)+DE  ABCDE ABCBCDE 
ADE BCDE 

exactly k  [09]{5}[09]{4}  085401321 190725541 
111111111 16654111 
NFA
Kleene's theorem:
 For any DFA, there exists a RE that describes the same set of strings
 For any RE, there exists a DFA that recognized the same set of strings
The DFA built from RE may have exponential number of states, so instead we use NFA (Nondeterministic finite state automata).
Regularexpressionmatching NFA:
 RE enclosed in parentheses
 One state per RE character (start=\(0\), accept=\(M\))
 Red \(\epsilon\) transition (change state, but don't scan text)
 Black match transition (change state and scan to next text char)
 Accept if any sequence of transitions end in accept state
The Nondeterminism comes from different transitions. The program has to consider all possible transition sequences.
Basic plan:
 Build NFA from RE
 Simulate NFA with text as input
NFA Simulation
NFA representation:
 State names: Integers from \(0\) to \(M\)
 Matchtransitions: Keep regular expression in array
re[]
 \(\epsilon\)transitions: Store in a digraph \(G\)
Simulation Steps:
 Run DFS from each source, without unmarking vertices
 Maintain set of all possible states that NFA could be in after reading in the first \(i\) characters
 When no more input characters, accept if any state in the set is an accept state
1  public class NFA { 
Time complexity: Determining whether an \(N\)character text is recognized by the NFA corresponding to an \(M\)character pattern takes time proportional to \(MN\) in the worst case.
NFA Construction
Construction process:
 States: Include a state for each symbol in the RE, plus an accept state.
 Concatenation: Add matchtransition edge from state corresponding to characters in the alphabet to next state.
 Parentheses: Add \(\epsilon\)transition edge from parentheses to next state.
 Closure: Add three \(\epsilon\)transition edges for each
*
operator.  Or: Add two \(\epsilon\)transition edges for each

operator
1  private Digraph buildEpsilonTransitionDigraph() { 
Time complexity: Building the NFA corresponding to an \(M\)character RE takes time and space proportional to \(M\)
Data Compression
Lossless compression and expansion:
 Message: Binary data \(B\) we want to compress
 Compress: Generates a "compressed" representation \(C(B)\)
 Expand: Reconstructs original bitstream \(B\)
Static model: Same model for all texts
 Fast
 Not optimal: different texts have different statistical properties
 Ex: ASCII, Morse code
Dynamic model: Generate model bases on text
 Preliminary pass needed to generate model
 Must transmit the model
 Ex: Huffman code
Adaptive model: Progressively learn and update model as you read text
 More accurate modeling produces better compression
 Decoding must start from beginning
 Ex: LZW
RunLength Encoding
Simple type of redundancy in a bitstream: Long runs of repeated bits
1  0000000000000001111111000000011111111111 
Representation: 4bit counts to represent alternating runs of 0s and 1s (15 0s, 7 1s, 7 0s, 11 s)
1  1111 0111 0111 1011 
Huffman Compression
Variablelength codes: Use different number of bits to encode different chars.
To avoid ambiguity, ensure that no codeword is a prefix of another.
Use binary trie to represent the prefixfree code.
Huffman trie node data type:
1  private static class Node implements Comparable<Node> { 
Expansion:
1  public void expand() { 
Read and write:
1  private static void writeTrie(Node x) { 
To find the best prefixfree code, use Huffman algorithm:
 Count freq for each char in input
 Start with one node for each char with weight equal to freq
 Select two tries with min weight and merge into single trie with cumulative weight
1  private static Node buildTrie(int[] freq) { 
LZW Compression
 Create ST associating \(W\)bit codewords with string keys
 Initialize ST with codewords for singlechar keys
 Find longest string \(s\) in ST that is a prefix of unscanned part of input
 Write the \(W\)bit codeword associated with \(s\)
 Add \(s+c\) to ST, where \(c\) is next char in the input
Use a trie that supports longest prefix match to represent LZW compression code table:
1  public static void compress() { 
HW10 Burrows–Wheeler
Implement the BurrowsWheeler data compression algorithm
 Burrows–Wheeler transform. Given a typical English text file, transform it into a text file in which sequences of the same character occur near each other many times. BurrowsWheeler.java, CircularSuffixArray.java
 Movetofront encoding. Given a text file in which sequences of the same character occur near each other many times, convert it into a text file in which certain characters appear much more frequently than others. MoveToFront.java
 Huffman compression. Given a text file in which certain characters appear much more frequently than others, compress it by encoding frequently occurring characters with short codewords and infrequently occurring characters with long codewords.