Lecture 18  Control Flow and Data Flow
 Introduction
 Control Edges and Data Edges
 Construction of the CFG and the DFG
 Dealing with loops
 Harder cases
 Other constructions
 Application: Converting C to Hardware
 Conclusions
Introduction
We will temporarily leave the world of HPS and Linux to discuss a fundamental topic in hardwaresoftware codesign. We will define control flow and data flow in the context of C programs and assembly programs. And we will show how the analysis of control flow and data flow of a program can reveal fundamental dependencies in your design. The analysis of control flow and data flow of a program leads to two data structures called a control flow graph and a data flow graph, respectively.
Control Edges and Data Edges
Control edges and data edges are two orthogonal representations of a C or assembly program. They describe different aspects. Control edges describe the sequence of instructions and/or statements. Data edges describe the flow of data troughout the program.
All control edges together form a graph called the Control Flow Graph (CFG). All data edges together form a graph called the Data Flow Graph (DFG). Neither control edges nor data edges are a complete representation of a C (or assembly) program. In particular, you need both to fully capture the meaning of the C (or assembly) program.
A C (or assembly) program = CFG + DFG
We next define control edges and data edges. The two concepts are closely related but not identical.
A Data Edge is a relation between two operations in a program, such that data produced by one operation is consumed by the other.
A Control Edge is a relation between two operations in a program, such that one operation will be executed after the other.
Why do we care about distinguishing control edges and data edges? Both of these describe different aspects of a program.

You can think of a Data Edge as a fundamental constraint of the program’s meaning. If you execute two operations on a manner that would violate a data edge relationship, then you change the meaning of the program.

A Control Edge on the other hand is an implementation effect. The fact that two operations have to execute sequentially is not a constraint of the program’s meaning. Rather, it is an implementation issue. The two operations have to execute sequentially because there is not enough parallel hardware available to run these operations in parallel.
It’s clear where you would go with such a concept. If you want to build hardware from a C program, then you have to understand the DFG in detail. The DFG combines all operations together and links them through the data they produce and/or consume. A CFG, on the other hand, is not a fundamental property. For a parallel implementation of a sequential program, we would very much like to get rid of the sequential features, and thus eliminate the CFG.
Construction of the CFG and the DFG
We now discuss a systematic procedure to draw a CFG and a DFG for
a given program. We start with a simple example in C. The following
max
function determines the largest of two inputs a
and b
.
int max(int a, int b) {
if (a > b)
r = a;
else
r = b;
return r;
First, the draw the CFG and the DFG we have to decide on the nodes of the graph. We will use the easy approximation that each C statement corresponds to one node. Hence, we could annotate the program with the node numbers as follows. The source node is the node that provides the input, while the sink node is the node that carries the result away.
int max(int a, int b) { // 1  source node
if (a > b) // 2
r = a; // 3
else
r = b; // 4
return r; // 5  sink node
Next, following the C semantics, we draw the control flow
graph, by interconnecting the nodes that have a sequential dependency.
For the max
function we find the following control flow graph. Note how decision nodes are at the source of more than one control node.
// control edges
int max(int a, int b) { // 1 > 2
if (a > b) // 2 > 3 or 2 > 4
r = a; // 3 > 5
else
r = b; // 4 > 5
return r; // 5  sink node
The data flow graph (DFG) must be derived after the control flow graph. The data flow graph has the same set of nodes as the control flow graph: there is one node per C statement. The data flow graph requires identification of the data dependencies between every node. To simplify this, we start by annotating, with each node, what variables are read and what variables are written.
// a b r
int max(int a, int b) { // 1 W W
if (a > b) // 2 R R
r = a; // 3 R W
else
r = b; // 4 R W
return r; // 5 R
Next, we draw each data dependency. A data dependency goes from a node that writes into a variable to another node that reads from the variable. To have a valid dependency, we must identify the correct ‘write’ node for each ‘read’ node. That is done as follows.

Start with a node that reads from a variable. For example, node 3 in the example reads variable
a
. That read operation is the endpoint of a data dependency. 
Next, walk backward in the control flow graph until you find a node that writes the same variable. That is the starting point of a data dependency. For example, going backward from node 3, we visit node 2, and then node 1. Only node 1 writes
a
. Therefore, the data dependency fora
goes from 1 to 3.
The search for data dependencies has to be performed breadthfirst
and consider every control path that leads to the endpoint of the
data dependency. For example, the read of r
in node 5 could be
done by node 4 or else by node 3. This is because there is a
control flow from node 3 to node 5 5 as well as from node 4 to node 5.
// data edges
// a b r
int max(int a, int b) { // 1 1>2 1>2
// 1>3 1>4
if (a > b) // 2
r = a; // 3 3>5
else
r = b; // 4 4>5
return r; // 5
The same analysis can also be done at assembly level. Here is the max
function in Nios assembly. Recall that this function follows the Nios ABI: the input arguments are in r4
and r5
, while the output argument is in r2
.
// CFG r2 r4 r5
max(r4, r5): // 1>2 W W
mov r2, r4 // 2>3 W R
bge r4, r5, .L2 // 3>4, 3>5 R R
mov r2, r5 // 4>5 W R
.L2:
ret (r2) // 5 R
Using the CFG and the read/write operations in registers, we can derive the DFG.
// data edges
// r2 r4 r5
max(r4, r5): // 1 1>2 1>3
// 1>3 1>4
mov r2, r4 // 2 2>5
bge r4, r5, .L2 // 3
mov r2, r5 // 4 2>5
.L2:
ret (r2) // 5
Dealing with loops
DFG and CFG contruction is interesting when the program becomes more complicated. For example, when there are loops, backtracing in the control flow graph can lead to unexpected data dependencies. This is an area where data flow analysis really shines, since it can alert you to read/write dependencies that would go unnoticed otherwise. In the second example we consider a GCD algorithm.
// control edges
unsigned gcd(unsigned a, // 1 1>2
unsigned b) {
while (a != b) // 2 2>3, 2>6
if (b > a) // 3 3>4, 3>5
b = b  a; // 4 4>2
else
a = a  b; // 5 5>2
return a; // 6
}
To determine the DFG, first collect the read/write operations for each node.
// a b
unsigned gcd(unsigned a, // 1 W W
unsigned b) {
while (a != b) // 2 R R
if (b > a) // 3 R R
b = b  a; // 4 R W,R
else
a = a  b; // 5 W,R R
return a; // 6 R
}
Next, we derive the data flow edges, per variable. We start with ‘a’. Every node that reads a
is the end point of a data dependency. The starting point is a corresponding write node that can be directly reached through a reverse control dependency. These dependencies go across loop boundaries, and can be nontrivial.
Consider, for example, the read of a
in node 5. There are two possible nodes that write a
: node 1 and node 5. To find which are try dependencies, follow the reverse path in the control graph.
From 5>3>2>1, we see that node 1 has a data dependency on node
5 for a
. Next, from 5>3>2>5, we see that node 5 also has a dependency on itself. The following program shows all data dependencies.
// data edges
// a b
unsigned gcd(unsigned a, // 1 1>2 1>2
unsigned b) { 1>3 1>3
1>4 1>4
1>5 1>5
1>6 1>6
while (a != b) // 2
if (b > a) // 3
b = b  a; // 4 4>4
// 4>3
// 4>2
// 4>5
else
a = a  b; // 5 5>5
// 5>6
// 5>2
// 5>3
// 5>4
return a; // 6
}
Harder cases
Control Flow Analysis and Data Flow Analysis work well if we can statically observe every dependency from the source code. That is not always the case.
 Pointers obscure data dependencies. Pointers can alias other variables so that a read/write from a dereferenced pointer really is a dependency into another variable (while at the same time being a readoperation for the pointer).
// CFG a b c DFG(a) DFG(b)
a = 5; // 1 1>2 W
b = &a; // 2 2>3 W
c = *b; // 3 3>exit R R W 1>3 2>3
 Arrays make data dependencies more complex, since a single
array reference can be any of the data members of the array.
There are two ways to look at a reference such as
a[i]
. The first is to look ata[i]
as a scalar variable. This requires knowledge ofi
 which may be hard, wheni
depends on the runtime state of the program. The second is to look at the arraya
as an atomic entity, and consider every access into the array a dependency on the complete array. The dependencies identified in each case may be different, as the following example shows.
// CFG a[3] i DFG(a) DFG(a[3])
i = 3; // 1 1>2 W
a[i] = 1; // 2 2>3 W R
a[0] = 0; // 3 3>4
b = a[3]; // 4 4>exit R 2>4 1>4
Our analysis technique is not quite advanced enough to handle arrays in a general manner. In the following, we will stick to the scalar case.
Other constructions
forloop
for (a=0; a<3; a++) {
b[a] = 2;
}
We break down the compound for loop as follows:
// CFG a DFG(a)
for (a=0; // 1 1>2 W
a<3; // 2 2>4 R 1>2, 3>2
2>exit
a++) { // 3 3>2 W,R 1>3, 3>3
b = b + a; // 4 4>3 R 1>4, 3>4
}
case
// CFG a DFG(a)
switch(a) { // 1 1>2 R initial>1
1>3
1>4
case '1': b = a; // 2 2>exit R initial>2
break;
case '2': c = a; // 3 3>exit R init>3
break;
default : b = 0; // 4 4>exit
}
dowhile loop
// CFG a DFG(a)
a = 1; // 1 1>2 W
do {
a = a  1; // 2 2>3 R,W 1>2, 2>2
} while (a != 0); // 3 3>2 R 2>3
// 3>exit
Application: Converting C to Hardware
In recent years, high level synthesis has rapidly grown in interest and importance. The need to design hardware faster, while stil delivering performance speedup due to parallellization, has made the use of automatic synthesis tools essential.
The following is an snippet of input code for Vivado Highlevel Synthesis, a compiler for Xilinx FPGA’s. Several other examples can be found on the Xilinx Github repository.
void diff_sq_acc(din_t a[N], din_t b[N], dout_t *dout) {
int i;
int acc= 0;
int a_reg, b_reg, sub, sub2;
for(i=0; i<N; i++) {
#pragma HLS PIPELINE II=1
a_reg = a[i];
b_reg = b[i];
sub = a_reg  b_reg;
sub2 = sub*sub;
acc += sub2;
}
*dout = acc;
}
Dataflow and controlflow analysis is one of the steps done in the conversion of a C program into a parallel implementation of it.
We will illustrate this by converting a small C function into hardware using a very (very) simple conversion approach, based on the following rules.
 Each single C statement will execute in a clock cycle
 Each expression (operation) becomes a datapath operation
 Each variable becomes a register
 The FSM controller is extracted from the CFG
The function computes a modulo179 operation. This particular design is interesting as it can achieve a mod179 by only using division and multiplication with powers of two. In hardware, such operations can be implemented efficiently using shift operations and bitwise and operations.
unsigned modk(unsigned x, unsigned k) { return (x & ((1 << k)  1)); }
unsigned divk(unsigned x, unsigned k) { return (x >> k); }
unsigned modulo(unsigned x) {
unsigned r, q, k, a, m, z;
m = 0xB3; // 179
k = 8;
a = (1 << k)  m;
r = modk(x, k);
q = divk(x, k);
do {
do {
r = r + modk(q * a, k);
q = divk(q * a, k);
} while (q != 0);
q = divk(r, k);
r = modk(r, k);
} while (q != 0);
z = (r >= m) ? r  m : r;
return z;
}
We will first simplify this program and substitute some of the variable that hold constants. Our conversion rules state that variables will become registers, and there’s no point storing a constant as a register!
unsigned modulo(unsigned x) {
unsigned r, q, z;
r = x & 255;
q = x >> 8;
do {
do {
r = r + (q * 77) & 255;
q = (q * 77) >> 8;
} while (q != 0);
q = r >> 8;
r = r & 255;
} while (q != 0);
z = (r >= m) ? r  m : r;
return z;
}
Next, we extract the CFG and the DFG.
// CFG x r q z
unsigned modulo(unsigned x) { // 1 1>2 W
unsigned r, q, z;
r = x & 255; // 2 2>3 R W
q = x >> 8; // 3 3>4 R W
do {
do {
r = r + (q * 77) & 255; // 4 4>5 W,R R
q = (q * 77) >> 8; // 5 5>6 W,R
} while (q != 0); // 6 6>7 R
6>4
q = r >> 8; // 7 7>8 R W
r = r & 255; // 8 8>9 W,R
} while (q != 0); // 9 9>10 R
9>4
z = (r >= m) ? r  m : r; // 10 10>11 R W
return z; // 11 11>exit
}
// CFG DFG(r) DFG(q)
unsigned modulo(unsigned x) { // 1 1>2
unsigned r, q, z;
r = x & 255; // 2 2>3
q = x >> 8; // 3 3>4
do {
do {
r = r + (q * 77) & 255; // 4 4>5 2>4 3>4
// 8>4 5>4
q = (q * 77) >> 8; // 5 5>6 3>5
// 5>5
} while (q != 0); // 6 6>7 5>6
6>4
q = r >> 8; // 7 7>8 4>7
r = r & 255; // 8 8>9 4>8
} while (q != 0); // 9 9>10 7>9
9>4
z = (r >= m) ? r  m : r; // 10 10>11 8>10
return z; // 11 11>exit 10>11
}
We can now implement the FSMD. If we apply our conversion rules, we end up with an 11cycle implementation that uses three registers r,q and z.
However, there is a trivial optimization possible, which we will implement as well. The rule is the following:
Two successive states in the CFG can be merged into one, if they have no data dependencies between them, and if the resulting cluster has a unique entry point and exit point.
It’s easy to see why that is the case. If two successive states have no data dependencies, that means that they write into different registers, so they can run in the same clock cycle. Second, if the cluster has a unique entry point and exit point, then the sucessive states will always run jointly.
The following annotation reflects this optimization. Notice also how we are now able to extract the FSM states and corresponding datapath expressions.
// FSM Datapath
unsigned modulo(unsigned x) { // 1
unsigned r, q, z;
r = x & 255; // 2 LOAD r <= x&255
q = x >> 8; // 3 q <= x>>8
  
do {
do {
r = r + (q * 77) & 255; // 4 COMPUTE1 r <= r + mul
//
q = (q * 77) >> 8; // 5 q <= mul>>8
//
  
} while (q != 0); // 6 INNER (q != 0)?
  
q = r >> 8; // 7 COMPUTE2 q <= r>>8
r = r & 255; // 8 r <= r&255
  
} while (q != 0); // 9 OUTER (q != 0)?
  
z = (r >= m) ? r  m : r; // 10 ADJ z = ...
//
return z; // 11
}
Finally, we can write a Verilog implementation for this design. This transformation is mechanical; all of the conversion work and analysis was done by identifying (and optimizing) control flow dependencies and data flow dependencies.
module mod179(
input wire clk,
input wire reset,
input wire [15:0] x,
input wire start,
output reg done,
output reg [7:0] z);
localparam [2:0] load = 3'b000,
compute1 = 3'b001,
inner = 3'b010,
compute2 = 3'b011,
outer = 3'b100,
adj = 3'b101;
reg [15:0] q_reg, q_next;
reg [15:0] r_reg, r_next;
reg [ 2:0] state_reg, state_next;
wire [15:0] mul;
always @(posedge clk, posedge reset)
begin
q_reg <= reset ? 16'b0 : q_next;
r_reg <= reset ? 16'b0 : r_next;
state_reg <= reset ? load : state_next;
end
assign mul = q_reg[7:0] * 8'd77;
always @*
begin
q_next = q_reg;
r_next = r_reg;
state_next = state_reg;
done = 1'd0;
z = 8'd0;
case (state_reg)
load:
if (start) begin
q_next = x[15:8];
r_next = x[7:0];
state_next = compute1;
end
compute1:
begin
r_next = r_reg + mul[7:0];
q_next = mul[15:8];
state_next = inner;
end
inner:
if (q_reg != 16'd0)
state_next = compute1;
else
state_next = compute2;
compute2:
begin
q_next = r_reg[15:8];
r_next = r_reg[7:0];
state_next = outer;
end
outer:
if (q_reg != 16'd0)
state_next = compute1;
else
state_next = adj;
adj:
begin
done = 1'd1;
z = (r_reg >= 8'd179) ? (r_reg  8'd179) : r_reg;
state_next = load;
end
default:
state_next = load;
endcase
end
endmodule
Conclusions
The analysis of dataflow and controlflow dependencies is an important element in building insight into the behavior of a design specification. Understanding the data flow is crucial to identify opportunities for parallel execution. Understanding the control flow helps to construct a hardware controller that drives and FSMD.