FPGA Architecture
Presentation Overview
Available choices for the digital designer
FPGA – A detailed look
Interconnection framework
FPGAs and CPLDs
Field programmability and programming technologies
SRAM, anti-fuse, EPROM and EEPROM
Design steps
Commercially available devices
Xilinx XC4000
Altera MAX 7000
Fixed Versus Programmable Logic
The circuits in a fixed logic device are permanent: they perform one function or set of functions, and once manufactured they cannot be changed.
Programmable logic devices (PLDs) are standard, off-the-shelf parts that offer customers a wide range of logic capacity, features, speed, and voltage characteristics - and these devices can be changed at any time to perform any number of functions.
Classifications
PLA: a Programmable Logic Array (PLA) is a relatively small FPD that contains two levels of logic, an AND-plane and an OR-plane, where both levels are programmable.
PAL: a Programmable Array Logic (PAL) is a relatively small FPD that has a programmable AND-plane followed by a fixed OR-plane.
SPLD: refers to any type of Simple PLD, usually either a PLA or a PAL.
CPLD: a more Complex PLD that consists of an arrangement of multiple SPLD-like blocks on a single chip.
FPGA: a Field-Programmable Gate Array is an FPD featuring a general structure that allows very high logic capacity.
Definitions
Field Programmable Device (FPD): a general term that refers to any type of integrated circuit used for implementing digital hardware, where the chip can be configured by the end user to realize different designs. Programming such a device often involves placing the chip into a special programming unit, but some chips can also be configured “in-system”. Another name for FPDs is programmable logic devices (PLDs).
Designer’s Choice
A digital designer has various options:
SSI (small-scale integrated circuit) or MSI (medium-scale integrated circuit) components: difficulties arise as the design size increases, because interconnections grow with complexity, resulting in a prolonged testing phase.
Simple programmable logic devices, such as PALs (programmable array logic) and PLAs (programmable logic arrays): the architecture is not scalable, because power consumption and delays become significant when it is extended to complex designs. Implementing larger designs leads to the same difficulty as with discrete components.
Simple Programmable Logic Devices
Simple two-level structure (PAL and PLA)
Allow high-speed implementations of circuits
Drawbacks: suited only to small logic circuits with a modest number of product terms; the interconnection structure grows impractically large as the number of product terms increases
MPGAs
[Figure: PLA with a programmable AND plane and a programmable OR plane; inputs X, Y, Z; product terms XY, YZ, XZ, XYZ; outputs XY+YZ and XZ+XYZ]
[Figure: PLA with programmable AND and OR planes; legend: programmable node (un-programmed, connect, disconnect); inputs X and Y; product terms such as XY; outputs O1 to O4]
[Figure: PAL with a programmable AND plane and a fixed OR plane; inputs X and Y; outputs O1 to O4]
[Figure: PAL with logic expanders: programmable AND plane, fixed OR plane and logic expanders]
PLA vs. PAL
PLAs are more flexible than PALs, since both the AND and OR planes are programmable in PLAs. Because both planes are programmable, PLAs are expensive to fabricate and have a large propagation delay. By using fixed OR gates, PALs are cheaper and faster than PLAs. Logic expanders increase the flexibility of PALs, but result in significant propagation delay. PALs usually contain D flip-flops connected to the outputs of the OR gates to implement sequential circuits. PLAs and PALs are usually referred to as SPLDs.
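The difference between a programmable and a fixed OR plane can be sketched behaviourally in Python. This is a minimal model, not a gate netlist: a product term is assumed to be a dict of required input values, and the product terms XY, YZ, XZ, XYZ are taken uncomplemented as an illustration. All function and variable names are hypothetical.

```python
from itertools import product

# Behavioural sketch: a product term is a dict {input_name: required_value};
# inputs missing from the dict are not connected to that AND gate.

def eval_term(term, inputs):
    """AND plane: a product term is 1 only if every connected literal matches."""
    return all(inputs[name] == value for name, value in term.items())

def eval_pla(and_plane, or_plane, inputs):
    """PLA: the OR plane is programmable, so each output ORs any chosen product terms."""
    terms = [eval_term(t, inputs) for t in and_plane]
    return {out: int(any(terms[i] for i in idxs)) for out, idxs in or_plane.items()}

def eval_pal(and_plane, terms_per_output, inputs):
    """PAL: the OR plane is fixed; output k always ORs its own group of product terms."""
    terms = [eval_term(t, inputs) for t in and_plane]
    n = terms_per_output
    return [int(any(terms[k:k + n])) for k in range(0, len(terms), n)]

# Hypothetical programming with the uncomplemented product terms XY, YZ, XZ, XYZ
AND_PLANE = [{"X": 1, "Y": 1}, {"Y": 1, "Z": 1}, {"X": 1, "Z": 1}, {"X": 1, "Y": 1, "Z": 1}]
OR_PLANE = {"f1": [0, 1], "f2": [2, 3]}  # f1 = XY + YZ, f2 = XZ + XYZ

for x, y, z in product((0, 1), repeat=3):
    inputs = {"X": x, "Y": y, "Z": z}
    print(x, y, z, eval_pla(AND_PLANE, OR_PLANE, inputs), eval_pal(AND_PLANE, 2, inputs))
```

With this grouping both devices compute the same two functions. The PLA's flexibility shows when an output wants to reuse a term already used elsewhere (for example `{"g": [0, 2]}`): the PLA just adds an OR-plane connection, whereas the PAL would need a duplicate AND gate in that output's fixed group.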
Programmable Logic
Programmable digital integrated circuits, available as standard off-the-shelf parts. The desired functionality is implemented by configuring on-chip logic blocks and interconnections.
Advantages (compared to an ASIC):
Low development costs
Short development cycle
The device can (usually) be reprogrammed
Types of programmable logic:
Complex PLDs (CPLDs)
Field-programmable gate arrays (FPGAs)
A Generic CPLD Structure
Multiple PLDs can be combined on a single chip by using programmable interconnect structures. Such devices are called CPLDs.
CPLD Architecture and Examples
PLD: sum of products
A programmable AND array followed by fixed fan-in OR gates.
[Figure: inputs A, B and C feed the AND plane through programmable switches or fuses; the product terms are ORed to form outputs f1 and f2]
PLD macrocell
Can implement combinational or sequential logic.
[Figure: macrocell with inputs A, B and C, an AND plane, an OR gate, a D flip-flop (Clock, Q) and a MUX with Select and Enable controls driving output f1]
CPLD structure
Integration of several PLD blocks with a programmable interconnect on a single chip.
[Figure: PLD blocks and I/O blocks arranged around a central interconnection matrix]
High Density Logic Overview
High-density or complex PLDs (HDPLDs or CPLDs):
Large logic building blocks
PLD-like architectures
Centralized interconnect
Fast, predictable performance
Good at “wide gating” functions, state machines and counters
Altera MAX CPLD
[Figure: Altera MAX chip: logic array blocks (LABs), each with a local array (LA) and macrocells, connected to I/O cells through a chip-wide interconnect]
Each LAB contains 16 macrocells.
CPLD Example - Altera MAX 7000
[Figure: EPM7000 series block diagram]
The MAX 7000 architecture includes:
Logic array blocks (LABs)
Macrocells
Expander product terms (shareable and parallel)
Programmable interconnect array (PIA)
I/O control blocks
Performance comes from linking high-performance, flexible LABs through the programmable interconnect array (PIA), a global bus fed by all dedicated inputs, I/O pins and macrocells.
Macrocells can be configured for combinational or sequential logic operations.
[Figure: EPM7000 series device macrocell]
Macrocell
Logic array: combinational logic is implemented in the logic array, which provides five product terms per macrocell.
Product-term select matrix: allocates these product terms as primary logic inputs to the OR and XOR gates (combinational logic), or as secondary inputs to the register's clear, preset, clock and clock enable control functions.
Programmable logic expanders:
Shareable expanders: each LAB has 16; they feed inverted product terms back into the logic array.
Parallel expanders: product terms borrowed from adjacent macrocells.
Registered functions
Macrocell FF: can be configured for D, T, JK or SR operation. The software selects the most efficient FF operation for each registered function to optimize resource utilization.
Programmable clock control: the FF can be clocked in three different modes:
By a global clock signal: fastest clock-to-output performance
By a global clock signal and an active-high clock enable: provides an enable on each FF while still achieving the fast clock-to-output performance of the global clock
By an array clock implemented with a product term: the FF is clocked by signals from buried macrocells or I/O pins
Each FF also supports asynchronous preset and clear functions. The product-term select matrix provides the product terms that control these functions; the control signals are active high, but active-low control can be derived by inverting the signal within the logic array.
Device power-up: each FF is cleared upon power-up.
I/O pins: a fast input path to the macrocell register bypasses the PIA and combinational logic.
Complex logic functions: each macrocell provides five product terms, and most logic functions can be designed using five product terms. When more are needed, another macrocell can be used to supply the required logic resources, or expander product terms can be used: shareable and parallel expanders provide additional product terms to macrocells within the same LAB.
CPLD performance
CPLDs have wide fan-in.
A single level of logic allows high frequency and low latency.
Very small functions waste (burn) macrocell logic.
[Figure: macrocell logic]
Design Methodologies
Custom ICs are created using unique masks for all layers during the manufacturing process. They require highly skilled and competent designers, a lengthy development time, and a high cost of design and testing.
Mask Programmable Gate Arrays
Generic masks are used for all layers except metallization. The generic masks define an array of modular functional blocks: modules of transistors arranged in rows separated by fixed-width channels. The designer's expertise is less critical, development time is shorter and development costs are lower. Channel-less gate arrays are known as sea-of-gates.
Standard Cells
Modules, or standard cells, are picked from a database, placed in rows and interconnected. Placement and routing are done automatically (removing the designers from the physical design process), so designs are less efficient in size and performance.
Gate arrays
A highly standardized means of implementing digital integrated circuit designs, manufactured as regular arrays of patterned blocks of transistors which can be interconnected to form logic elements such as gates, flip-flops and multiplexers. The manufacturer can pre-produce gate array wafers without interconnections in high volume. These are then configured in an additional process step in the factory: once a customer provides a definition of the logic block interconnections, one or more layers of metal are added to form these connections. Such devices are collectively known as MPGAs (Mask-Programmable Gate Arrays).
In sea-of-gates structures, the added metal interconnects have to be placed over particular transistors, rendering them unusable.
In regular gate arrays, blank routing space is provided at regular intervals in the transistor array.
As process technologies advance and feature sizes get smaller, it is becoming increasingly expensive to configure such devices.
Mask-Programmable Logic Devices
MPGA
Rows of transistors with specified interconnections:
Within the rows, to implement basic logic gates
Between the rows, to connect basic gates together
I/O circuitry
The manufacturer predefines all mask layers except the final metal layers.
The metal layers are customized to implement the desired circuit.
[Figure: MPGA]
MPGA drawbacks:
Large NRE cost: the metal mask layers must be generated and the chip manufactured
Longer time to market
MPGA advantage:
The general structure allows much larger designs to be implemented, due to the scalable interconnection structure, which scales proportionally with the logic
Field Programmable Logic Devices: FPGAs
FPGAs combine the programmability of a PLD with the scalable interconnection structure of an MPGA.
Designer’s Choice
In the quest for high capacity, two choices are available:
MPGA (Mask-Programmable Gate Array): customized during fabrication; expensive in low volumes; prolonged time-to-market and high financial risk
FPGA (Field-Programmable Gate Array): customized by the end user; implements multi-level logic functions; fast time to market and low risk
Designer’s Choice: FPGAs vs MPGAs
Disadvantages of FPGAs:
Lower speed of operation: the programmable switches add significant resistance and capacitance to the connections between logic blocks
Lower logic density: the programmable switches and programming circuitry require more area than an MPGA needs to implement the same amount of logic, so fewer chips fit on each wafer
Standard Cell-based Design
[Figure: standard-cell layout methodology: rows of logic cells and feedthrough cells separated by routing channels, plus functional modules (RAM, multiplier,…)]
Routing channel requirements are reduced by the presence of more interconnect layers.
[Figure: gate array (sea-of-gates): rows of uncommitted cells with a routing channel]
Why FPGAs?
Advantages of FPGAs:
Replacement of SSI and MSI chips
Availability of parts off the shelf
Rapid turnaround
Low risk
Reprogrammability
Limitations:
PLDs will operate faster than FPGAs for the same design implemented in both.
For FPGAs, the circuit delay depends on the design implementation tools.
FPGAs are less dense and operate at lower speed when compared to conventional gate arrays.
FPGA – A Quick Look
A two-dimensional array of customizable logic blocks placed in an interconnect array.
Like PLDs, FPGAs are programmable at the user's site.
Like MPGAs, they implement thousands of gates of logic in a single device.
They employ a logic and interconnect structure capable of implementing multi-level logic, which scales in proportion to the logic, removing many of the size limitations of the PLD-derived two-level architecture.
FPGAs offer the benefits of both MPGAs and PLDs!
FPGA – A Detailed Look
FPGAs are based on the principle of functional completeness: functionally complete elements (logic blocks) are placed in an interconnect framework.
The interconnection framework comprises wire segments and switches, and provides a means to interconnect the logic blocks.
Circuits are partitioned to logic-block size, mapped and routed.
[Figure: basic FPGA architecture]
[Figure: FPGA architecture with a multiplexer as the functionally complete basic building block]
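Why a multiplexer can serve as a functionally complete cell can be checked with a short sketch. It assumes, as is usual for this argument, that the constants 0 and 1 are available to tie mux inputs to; the function names are illustrative only.

```python
def mux2(sel: int, a: int, b: int) -> int:
    """2:1 multiplexer: output a when sel == 0, b when sel == 1."""
    return b if sel else a

# Tying the data inputs to constants or to another signal turns the mux into a gate.
def not_gate(x: int) -> int:
    return mux2(x, 1, 0)   # x = 0 -> 1, x = 1 -> 0

def and_gate(x: int, y: int) -> int:
    return mux2(x, 0, y)   # x = 0 -> 0, x = 1 -> y

def or_gate(x: int, y: int) -> int:
    return mux2(x, y, 1)   # x = 0 -> y, x = 1 -> 1

# Exhaustive check against Python's bitwise operators
for x in (0, 1):
    assert not_gate(x) == 1 - x
    for y in (0, 1):
        assert and_gate(x, y) == (x & y)
        assert or_gate(x, y) == (x | y)
print("NOT, AND and OR all realised with a single 2:1 mux")
```

Since NOT with AND (or OR) is already a complete gate set, any Boolean function can be built from enough of these multiplexer cells plus routing.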
Interconnection Framework
Granularity and interconnection structure have caused a split in the industry:
FPGA:
Fine grained
Variable-length interconnect segments
Programmable switches
Timing is in general not predictable; it is extracted after placement and routing
CPLD:
Coarse grained (SPLD-like blocks)
Programmable crossbar interconnect structure
The interconnect structure uses continuous metal lines
The switch matrix may or may not be fully populated
Timing is predictable if the switch matrix is fully populated
The architecture does not scale well
Field Programmability
Field programmability is achieved through switches (transistors controlled by memory elements or fuses). The switches control:
The interconnection among wire segments
The configuration of logic blocks
The distributed memory elements controlling the switches and the configuration of logic blocks are together called the “configuration memory”.
Technology of Programmable Elements
The technology varies from vendor to vendor, but all programmable elements share a common property: each can be configured in one of two positions, ‘ON’ or ‘OFF’. They can be classified into three categories:
SRAM based
Fuse based
EPROM/EEPROM/Flash based
Desired properties:
Minimum area consumption
Low ON resistance and high OFF resistance
Low parasitic capacitance to the attached wire
Reliability in volume production
SRAM Programming Technology
Employs SRAM (static RAM) cells to control transistors and/or transmission gates; SRAM cells control the configuration of the logic blocks as well.
Volatile: needs external storage and a power-on configuration mechanism
In-circuit re-programmable
Shorter configuration time
Occupies a relatively larger area
Anti-fuse Programming Technology
Though implementations differ, all anti-fuse programming elements share a common property: they use materials which normally reside in a high-impedance state, but which can be fused irreversibly into a low-impedance state by applying a high voltage.
Very low ON resistance (faster implementation of circuits)
Small anti-fuse elements, so interconnects occupy relatively less area
Offset: larger transistors are needed for programming
One-time programmable: cannot be re-programmed (design changes are not possible)
Retains its configuration after power-off
EPROM, EEPROM or Flash Based Programming Technology
EPROM Programming Technology
Two gates: floating and select.
Normal mode: with no charge on the floating gate, the transistor behaves as a normal n-channel transistor.
The floating gate is charged by applying a high voltage; the threshold of the transistor (as seen by the select gate) increases, and the transistor is turned off permanently (until erased).
Re-programmable by exposure to UV radiation.
Used as pull-down devices; consumes static power.
No external storage mechanism needed
Re-programmable (not all!)
Not in-system re-programmable
Re-programming is a time-consuming task
EEPROM Programming Technology
Two gates: floating and select.
Functionally equivalent to EPROM, but the construction and structure differ.
Electrically erasable: re-programmable by applying a high voltage (no UV radiation exposure!).
When un-programmed, the threshold (as seen by the select gate) is negative!
Re-programmable; in general, in-system re-programmable
Re-programming takes less time than with EPROM technology
Multiple voltage sources may be required
Area occupied is twice that of EPROM!
Programming Technologies
Basic architectures of FPGAs
To allow the implementation of practically any logic circuit, an FPGA device requires an area trade-off between a sufficient number of flexible configurable logic cells and enough interconnect resources to allow all connections between these cells.
However, the majority of circuits use only a small portion of the available routing and logic resources, resulting in a loss in speed (signals passing through redundant routing elements) and in logic density when compared to the same circuit implemented in dedicated logic.
One remedy is the grouping of different FPGA devices with related architectures into a family. Each member of a family would be physically tailored to a certain class of application, for example by replacing the switches in certain routes by hard shorts, or by hard-wiring the logical cells internally in a certain manner. Such a member may implement certain circuits more efficiently, but its reduced flexibility means that some circuits may not fit onto the device at all.
Implementation of a circuit is now a question of choosing the right device from the FPGA family.
Programming Skills vs. FPGAs
CPU model:
Single-threading: no synchronization; for/if/switch control
Incremental execution: one instruction at a time; results are immediate
Common parallelization: large units of work; costly communication
FPGA model:
Massive parallelism: visible timing relations; state machine/hardwired control
Pipelined execution: all operations active; visible dependencies
Parallelism model: fine grain (one ALU op); cheap on-chip communication
An Example
Modulo-4 counter: Specification
Modulo-4 counter: Logic Implementation
FPGA Implementation of Modulo-4 Counter
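The specification and logic figures for this example are not reproduced here, so the sketch below assumes the simplest form: a free-running up counter 0, 1, 2, 3, 0, ... with no enable or reset input. Under that assumption the next-state logic is Q0' = NOT Q0 and Q1' = Q1 XOR Q0, which maps onto two LUTs feeding two D flip-flops. The class and table names are hypothetical.

```python
def lut(truth_table, *inputs):
    """A k-input LUT: the inputs form an address into the stored truth table (MSB first)."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return truth_table[addr]

# Next-state truth tables, indexed by address (Q1, Q0) = 00, 01, 10, 11.
Q0_NEXT = [1, 0, 1, 0]  # Q0' = NOT Q0
Q1_NEXT = [0, 1, 1, 0]  # Q1' = Q1 XOR Q0

class Mod4Counter:
    """Two LUTs feeding two D flip-flops that share one clock."""

    def __init__(self):
        self.q1, self.q0 = 0, 0  # assumption: flip-flops start cleared

    def clock(self):
        # Both LUT outputs are computed from the *current* state, then captured
        # together, mimicking edge-triggered flip-flops.
        d1 = lut(Q1_NEXT, self.q1, self.q0)
        d0 = lut(Q0_NEXT, self.q1, self.q0)
        self.q1, self.q0 = d1, d0
        return 2 * self.q1 + self.q0

counter = Mod4Counter()
print([counter.clock() for _ in range(6)])  # [1, 2, 3, 0, 1, 2]
```

Each next-state function depends on only two variables, so a single 4-input LUT per bit is more than enough; the remaining LUT inputs could host an enable or synchronous reset if the real specification includes one.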
Design Steps Involved in Designing With FPGAs
Understand and define design requirements
Design description
Behavioural simulation (source code interpretation)
Synthesis
Functional or gate-level simulation
Implementation: fitting, or place and route
Timing or post-layout simulation
Programming, test and debug
Commercially Available Devices
Architecture differs from vendor to vendor and is characterized by:
The structure and content of the logic block
The structure and content of the routing resources
To examine these, we look at some of the available devices:
FPGA: Xilinx (XC4000)
CPLD: Altera (MAX 5K)
Xilinx FPGAs
Generic Xilinx Architecture
Symmetric array based: the array consists of CLBs with LUTs and D flip-flops. An N-input LUT can implement any N-input Boolean function. The array is embedded within a periphery of I/O blocks, and the array elements are interleaved with routing resources (wire segments, switch matrices and single connection points). Employs SRAM technology.
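The claim that an N-input LUT can implement any N-input Boolean function follows from treating the LUT as a tiny memory: the inputs form an address and the 2^N configuration bits are the stored truth table. A minimal sketch of that idea, with hypothetical class and method names:

```python
from itertools import product
from typing import Callable, Sequence

class LUT:
    """An n-input look-up table: 2**n configuration bits addressed by the inputs."""

    def __init__(self, n: int, config: Sequence[int]):
        if len(config) != 2 ** n:
            raise ValueError(f"a {n}-input LUT needs {2 ** n} configuration bits")
        self.n = n
        self.config = [bit & 1 for bit in config]

    @classmethod
    def program(cls, n: int, f: Callable[..., int]) -> "LUT":
        # "Configuring" the LUT: evaluate f for every input combination and store it.
        # product() enumerates 00..0, 00..1, ... so the first argument is the MSB.
        return cls(n, [f(*bits) for bits in product((0, 1), repeat=n)])

    def __call__(self, *inputs: int) -> int:
        if len(inputs) != self.n:
            raise ValueError(f"expected {self.n} inputs")
        addr = 0
        for bit in inputs:          # the inputs form the memory address, MSB first
            addr = (addr << 1) | (bit & 1)
        return self.config[addr]

# Two unrelated 4-input functions, each fitting in one 16-bit LUT
at_least_three = LUT.program(4, lambda a, b, c, d: int(a + b + c + d >= 3))
parity = LUT.program(4, lambda a, b, c, d: a ^ b ^ c ^ d)
print(at_least_three(1, 1, 0, 1), parity(1, 1, 0, 1))  # 1 1
```

The hardware cost is the same whatever function is stored, which is why LUT delay is largely independent of the logic it implements.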
What is an FPGA?
FPGAs contain the building blocks necessary to design a custom integrated circuit without having to turn to an outside foundry: logic blocks, interconnects and I/O blocks, all of which can be programmed to perform a particular function. They can be memory-based (SRAM or flash EEPROM) or anti-fuse based. A designer needs to develop a special program and have that program loaded into the FPGA, so FPGA work could be considered more of a software development than a hardware development effort. Intellectual property (IP) placed inside the FPGA can be developed either by the designer or by a third party.
Design Flow Approaches
Schematic capture: the most intuitive and visual, but the least flexible
Hardware description language: more portable
FPGA manufacturers
Xilinx (http://www.xilinx.com): SRAM-based FPGAs (from tens of thousands to millions of gates)
Altera (http://www.altera.com): SRAM-based FPGAs
Lattice Semiconductor (http://www.latticesemi.com)
Actel (http://www.actel.com)
QuickLogic (http://www.quicklogic.com)
Classification by Granularity
Logic block size correlates with the granularity of a device, which relates to the effort required to complete the wiring between the blocks (routing channels):
Fine granularity (sea-of-gates architecture)
Medium granularity (FPGA)
Large granularity (CPLD)
Underlying FPGA Fabric
Large numbers of relatively simple programmable logic block “islands” embedded in a “sea” of programmable interconnect.
Fine-grained architecture: each logic block can be used to implement only a very simple function, such as a 3-input function or a storage element. This suits glue logic and state machines, but requires a large number of connections into and out of each block.
Coarse-grained architecture: each logic block contains relatively more logic. For example, a logic block might contain four 4-input LUTs, four multiplexers, four D-latches and some fast carry logic.
MUX- vs LUT-based logic blocks: the discussion assumes that the LUT is formed from SRAM cells (but it could also be formed using antifuses, E2PROM or flash cells).
MUX-based logic block
Multiplexer-based CLB (configurable logic block)
Example from Actel 40MX: an 8-input, 1-output cell that implements basic logic functions (AND, OR, NOR, ...) with 2, 3 or 4 inputs.
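An 8-input, 1-output multiplexer cell like the Actel one above can be modelled in a few lines. The internal wiring here is an assumption modelled on the classic Actel ACT 1 logic module: two 2:1 input muxes (A0/A1 selected by SA, B0/B1 selected by SB) feed an output mux whose select is S0 OR S1. All function names are hypothetical.

```python
from itertools import product

def mux2(sel: int, a: int, b: int) -> int:
    return b if sel else a

def mux_cell(a0, a1, sa, b0, b1, sb, s0, s1):
    """Assumed 8-input, 1-output cell: two input muxes and an output mux selected by S0 OR S1."""
    upper = mux2(sa, a0, a1)
    lower = mux2(sb, b0, b1)
    return mux2(s0 | s1, upper, lower)

# "Programming" the cell = tying its inputs to constants or to logic signals
def and2(x, y):
    return mux_cell(0, 0, 0, y, y, 0, x, 0)    # x ? y : 0

def or2(x, y):
    return mux_cell(y, y, 0, 1, 1, 0, x, 0)    # x ? 1 : y

def or3(x, y, z):
    return mux_cell(z, z, 0, 1, 1, 0, x, y)    # (x | y) ? 1 : z

def sel4(w, x, y, z):
    return mux_cell(z, z, 0, w, w, 0, x, y)    # (x | y) ? w : z, a 4-input function

for x, y, z in product((0, 1), repeat=3):
    assert and2(x, y) == (x & y)
    assert or2(x, y) == (x | y)
    assert or3(x, y, z) == (x | y | z)
```

Not every 4-input function fits in one such cell (a 4-input AND, for instance, does not with this wiring), which is consistent with the cell being described as implementing *basic* functions of 2, 3 or 4 inputs.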
LUT based CLB
A commonly used technique is to use the inputs to select the desired SRAM cell through a cascade of transmission gates. If a transmission gate is enabled (active), it passes the signal seen on its input through to its output; if the gate is disabled, its output is electrically disconnected from the wire it is driving.
4-input LUTs are generally considered to offer the optimal balance of tradeoffs, but the recently introduced Virtex-5 family from Xilinx features 6-input LUTs. Altera has a fabric that combines two 4-LUTs and four 3-LUTs; in addition to allowing designers to form a 6-LUT, this also allows a 5-LUT and a 3-LUT, and many other combinations.
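The transmission-gate cascade can be sketched behaviourally: each input bit drives one level of gate pairs, and exactly one gate of each pair passes its signal on. In this model `None` stands for an electrically disconnected output, and input 0 is assumed to control the level nearest the SRAM cells; real silicon adds buffering that is ignored here. Names are hypothetical.

```python
from itertools import product

def tgate(enable: bool, signal):
    """Transmission gate: passes `signal` when enabled, otherwise disconnected (None)."""
    return signal if enable else None

def resolve(*drivers):
    """A wire driven by several gates: exactly one driver should be connected."""
    connected = [d for d in drivers if d is not None]
    if len(connected) != 1:
        raise RuntimeError("wire is floating or has contention")
    return connected[0]

def lut_select(sram, inputs):
    """Pick one SRAM cell through a cascade of transmission-gate pairs.

    inputs[0] controls the first level (next to the SRAM cells), so the selected
    cell index is inputs[0] + 2*inputs[1] + 4*inputs[2] + ...
    """
    level = list(sram)
    for bit in inputs:
        level = [resolve(tgate(not bit, level[i]), tgate(bool(bit), level[i + 1]))
                 for i in range(0, len(level), 2)]
    return level[0]

# 2-input XOR stored as SRAM contents, index = a + 2*b
xor_sram = [0, 1, 1, 0]
print([lut_select(xor_sram, [a, b]) for b, a in product((0, 1), repeat=2)])  # [0, 1, 1, 0]
```

An n-input LUT built this way needs 2^n SRAM cells and 2^(n+1) - 2 transmission gates, which is one reason LUT size cannot grow without limit.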
Look-up table (LUT) based CLB
In a LUT-based CLB, a predefined output value is assigned to each combination of the input word.
Memory implementation: the input values are the address of the memory, and the predefined values are the content of the memory.
Registered output based CLB
The output of the LUT may be registered or not, depending on the functional description (the selection is implemented via multiplexers).
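Selecting between a registered and a direct LUT output can be sketched as a LUT, a D flip-flop and one configuration bit driving the output multiplexer. The class and attribute names are hypothetical, and the flip-flop is assumed to start cleared.

```python
class ClbCell:
    """Hypothetical CLB cell: a LUT, a D flip-flop and an output-select multiplexer."""

    def __init__(self, truth_table, registered: bool):
        self.truth_table = list(truth_table)
        self.registered = registered  # configuration bit driving the output mux
        self.q = 0                    # flip-flop state, assumed cleared at start

    def lut(self, *inputs: int) -> int:
        addr = 0
        for bit in inputs:            # inputs form the LUT address, MSB first
            addr = (addr << 1) | bit
        return self.truth_table[addr]

    def output(self, *inputs: int) -> int:
        # Output mux: stored flip-flop value if registered, else the direct LUT value
        return self.q if self.registered else self.lut(*inputs)

    def clock_edge(self, *inputs: int) -> None:
        # On an active clock edge the flip-flop captures the current LUT output
        self.q = self.lut(*inputs)

AND_TABLE = [0, 0, 0, 1]
direct = ClbCell(AND_TABLE, registered=False)
registered = ClbCell(AND_TABLE, registered=True)
print(direct.output(1, 1), registered.output(1, 1))  # 1 0 (register not yet clocked)
registered.clock_edge(1, 1)
print(registered.output(0, 0))                       # 1 (holds the captured value)
```

The registered path changes only on clock edges, which is what makes the same LUT usable for both combinational logic and state machines.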
Look up Tables
The configuration memory holds the outputs of the truth table; internal signals connect to the control signals of multiplexers, which select the truth-table value for any given input value.
Synthesis mode: any logic function of up to 4 variables, in its registered or direct form.
Arithmetic mode: the LUT is split to provide any two logic functions of the same 3 variables. In arithmetic mode, the inputs A, B, C are the addends and the Carry-in, whilst the output functions are the Sum and the Carry-out.
Multiplier mode: this mode also implements an adder, with the addends this time being partial products and the Carry-in from the previous bit position. The partial product of A and B may be implemented with an AND gate.
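Arithmetic and multiplier modes can be illustrated by splitting a 16-entry LUT into two 8-entry halves that share the inputs A, B and Carry-in. The layout (Sum in entries 0 to 7, Carry-out in entries 8 to 15) and all names are assumptions made for this sketch; real devices may arrange the halves differently.

```python
from itertools import product

def table3(f):
    """8-entry truth table of a 3-input function; address = 4*a + 2*b + c."""
    return [f(a, b, c) for a, b, c in product((0, 1), repeat=3)]

# Arithmetic mode: one 16-bit LUT split into two 8-bit halves sharing inputs A, B, Cin.
# Hypothetical layout: entries 0-7 hold Sum, entries 8-15 hold Carry-out.
LUT16 = table3(lambda a, b, c: a ^ b ^ c) + table3(lambda a, b, c: (a & b) | (a & c) | (b & c))

def arithmetic_mode(a: int, b: int, cin: int) -> tuple[int, int]:
    addr = 4 * a + 2 * b + cin
    return LUT16[addr], LUT16[8 + addr]          # (Sum, Carry-out)

def multiplier_mode(a: int, b: int, addend: int, cin: int) -> tuple[int, int]:
    # The partial product A AND B (an AND gate in front of the LUT) replaces one addend
    return arithmetic_mode(a & b, addend, cin)

for a, b, c in product((0, 1), repeat=3):
    s, cout = arithmetic_mode(a, b, c)
    assert 2 * cout + s == a + b + c
```

Both halves see the same three inputs, so one physical LUT delivers a complete full-adder bit.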
[Figure: CLB with registered output]
Counter mode: the LUT provides two logic functions (counter Output and Carry-out) of the same 2 variables, which are a Carry-in and the previous Output. The loop that uses this output as an input is normally provided for within the CLC; it could also be implemented externally by connecting appropriate routes.
Multiplexer (2:1) mode: the LUT is configured to provide a logic function of 3 variables, where one input selects one of the other two. As an example, consider the case where C is the select line for A and B.
LUT based CLB
Example from Actel Varicore CLC: LUT based
A multiplexer is used to decrease the LUT size
Registered output selectable via a multiplexer
LUT output based CLB
Example from Xilinx XC3000: dual-output complex CLB
Registers selectable
Large combinational function with two outputs
Xilinx logic cell (LC)
An LC comprises a 4-input LUT (which can also act as a 16 × 1 RAM or a 16-bit shift register), a multiplexer and a register.
The register can be configured (programmed) to act as a flip-flop or as a latch. The polarity of the clock (rising-edge-triggered or falling-edge-triggered) can be configured, as can the polarity of the clock enable and set/reset signals (active-high or active-low).
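A 4-input LUT acting as a 16-bit shift register can be modelled by letting its 16 configuration cells form the shift chain while the four LUT inputs select which stage drives the output, giving a variable-length delay line. This is a behavioural sketch only; the class and method names are hypothetical and no vendor primitive is being reproduced.

```python
class ShiftRegisterLUT:
    """Hypothetical model of a 4-input LUT reused as a 16-bit shift register.

    The LUT's 16 configuration cells form the shift chain, and the 4 LUT inputs
    act as an address that selects which stage appears at the output.
    """

    def __init__(self) -> None:
        self.cells = [0] * 16

    def clock(self, data_in: int) -> None:
        # Shift by one position: the new bit enters cell 0, cell 15's bit is lost
        self.cells = [data_in & 1] + self.cells[:-1]

    def tap(self, addr: int) -> int:
        # addr 0 is the most recently shifted-in bit, addr 15 the oldest
        return self.cells[addr & 0xF]

srl = ShiftRegisterLUT()
for bit in [1, 0, 1, 1]:
    srl.clock(bit)
print([srl.tap(i) for i in range(4)])  # [1, 1, 0, 1]
```

Reusing the configuration cells this way gives a 16-stage delay without spending 16 flip-flops from the surrounding logic cells.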
[Figure: highly simplified view of a Xilinx logic cell (LC)]
The equivalent core "building block" in an FPGA from Altera is called a logic element (LE).
[Figure: a multi-faceted LUT]
[Figure: a "slice" containing two logic cells]
Slices are grouped into what Xilinx call a configurable logic block (CLB) and what Altera refer to as a logic array block (LAB). Some Xilinx FPGAs have two slices in each CLB, while others have four. Fast programmable interconnect within the CLB is used to connect neighboring slices.
The logic-block hierarchy (LC, then slice with two LCs, then CLB with four slices) is complemented by an equivalent hierarchy in the interconnect. Thus, there is fast interconnect between the LCs in a slice, then slightly slower interconnect between slices in a CLB, followed by the interconnect between CLBs. The idea is to achieve the optimum tradeoff between making it easy to connect things together and avoiding excessive interconnect-related delays.
[Figure: a CLB containing four slices]
All of the LUTs within a CLB can be configured together to implement the following:
Single-port 16 × 8 bit RAM
Single-port 32 × 4 bit RAM
Single-port 64 × 2 bit RAM
Single-port 128 × 1 bit RAM
Dual-port 16 × 4 bit RAM
Dual-port 32 × 2 bit RAM
Dual-port 64 × 1 bit RAM
Each 4-input LUT can be used as a 16-bit shift register, and the LUTs within a single CLB can be configured together to implement a shift register containing up to 128 bits, as required.
Fast carry chains
A key feature is the special logic and interconnect required to implement fast carry chains. In the context of the CLBs, each logic cell (LC) contains special carry logic. This is complemented by dedicated interconnect between the two LCs in each slice, between the slices in each CLB, and between the CLBs themselves. This special carry logic and dedicated routing boost the performance of logical functions such as counters and arithmetic functions such as adders. The availability of these fast carry chains, in conjunction with features like the shift register incarnations of LUTs and the embedded multipliers, makes FPGAs well suited to applications such as DSP.
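Why the carry path matters can be seen in a ripple-carry adder built by chaining one cell per bit: every bit must wait for the carry from the bit below, so the carry path sets the adder's speed. The sketch below models the dedicated carry logic as the standard generate/propagate expression; it illustrates the principle only and does not reproduce any vendor's carry circuitry. Names are hypothetical.

```python
def logic_cell(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One LC used for arithmetic: sum bit plus carry-out to the next cell."""
    total = a ^ b ^ carry_in                       # sum, computed in the LUT
    carry_out = (a & b) | (carry_in & (a ^ b))     # generate, or propagate the carry-in
    return total, carry_out

def carry_chain_add(x: int, y: int, width: int) -> tuple[int, int]:
    """Add two width-bit unsigned numbers by chaining cells from LSB to MSB."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = logic_cell((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry                            # (width-bit sum, final carry)

print(carry_chain_add(13, 7, 4))  # (4, 1): 13 + 7 = 20 = 16 + 4
```

The loop is inherently sequential in the carry, just like the hardware chain; routing that carry through dedicated fast connections rather than general-purpose interconnect is what shortens this critical path.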
FPGA families
Xilinx: low-cost: Spartan 3, Spartan 3E, Spartan 3L; high-performance: Virtex 4 LX / SX / FX, Virtex 5 LX
Altera: low-cost: Cyclone II; high-performance: Stratix II, Stratix II GX
Xilinx FPGA Families
Old families: XC3000, XC4000, XC5200
Old 0.5 µm, 0.35 µm and 0.25 µm technology; not recommended for modern designs.
Low-cost families:
Spartan/XL: derived from XC4000
Spartan-II: derived from Virtex
Spartan-IIE: derived from Virtex-E
Spartan-3, Spartan-3E, Spartan-3L
High-performance families:
Virtex (220 nm)
Virtex-E, Virtex-EM (180 nm)
Virtex-II, Virtex-II Pro (130 nm)
Virtex-4 (90 nm)
Virtex-5 (65 nm)
Xilinx XC3000 CLB
Granularity of FPGAs
Selection of an FPGA
An FPGA can be seen either as an evolution of PALs, where size is increased by an order of magnitude, or as a refinement of mask-programmed gate arrays, where the reprogramming time and cost are drastically reduced.
The main architectural choices are anti-fuse versus reprogrammable configuration, block-structured versus channel-structured routing, and lookup table (LUT) versus multiplexer versus sum-of-products logic.
The technology and architecture of the routing fabric is the most important factor in determining the effectiveness of an FPGA for a particular application.
Selecting an FPGA
Size
FPGA vendors measure density or size in different ways. Nonetheless, a designer will need a ballpark understanding of what type of FPGA product they require.
I/O pins
A designer needs to know how many pins they must share with the circuit outside of the FPGA. For example, serialization and de-serialization of signals can use up many pins.
Performance
If fast computations are essential, then higher-performance FPGAs would, of course, become mandatory, at a tradeoff against cost.
Unit price
A large FPGA may be able to squeeze in all of your required IP, but the resultant cost might break the project's budget. It may make more sense to only incorporate certain IP into the FPGA and use off-the-shelf components for the rest of the design.
Power consumption
This is critical for applications particularly sensitive to heat dissipation, and for those that require batteries.
FPGA routing enables (almost) arbitrary connections among logic blocks, but at the cost of tying up more area and incurring more delay than in a mask-programmed part. Likewise, the logic architectures of FPGAs are larger and slower than mask-defined gates, since their functionality must be programmable. But in comparison to the routing fabric, their delays tend to be more predictable and less of a limiting factor.
[Table: FPGAs compared by architecture, gate density, routing resources and programming method; entries include the Xilinx Logic Cell Array (LCA) and the Actel Configurable Technology (ACT)]
[Figure: FPGA capacity comparisons]
Some FPGAs offer dedicated adder blocks. One operation that is very common in DSP applications is the multiply-and-accumulate (MAC): this function multiplies two numbers together and adds the result into a running total stored in an accumulator.
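The MAC operation can be sketched as a loop in which each step feeds a multiplier into an adder whose result is written back to the accumulator. The 32-bit unsigned wrap-around width is an assumption chosen for illustration; real accumulators vary in width and may saturate instead of wrapping.

```python
def mac(pairs, acc_bits: int = 32) -> int:
    """acc <- acc + a * b for each (a, b) pair, wrapping like an unsigned
    acc_bits-wide hardware accumulator (the width is an assumption)."""
    mask = (1 << acc_bits) - 1
    acc = 0                            # accumulator register, cleared first
    for a, b in pairs:
        prod = a * b                   # multiplier
        acc = (acc + prod) & mask      # adder feeding back into the accumulator
    return acc

coefficients = [3, 1, 2]
samples = [4, 5, 6]
print(mac(zip(coefficients, samples)))  # 29
```

A dot product of filter coefficients and samples, as in this usage example, is the typical DSP workload that a hardware MAC processes one pair per clock.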
The majority of designs make use of microprocessors in one form or another, so high-end FPGAs contain one or more embedded microprocessors, known as microprocessor cores.
[Figure: the core functions forming a MAC]
A hard microprocessor core is one that is implemented as a dedicated, predefined block.
Moving all of the tasks that used to be performed by an external microprocessor into the internal core makes the board smaller and lighter.
There are two main approaches for integrating such a core into the FPGA: locate it in a strip to the side of the main FPGA fabric, or embed one or more microprocessor cores directly into the main FPGA fabric.
Soft microprocessor cores
A group of programmable logic blocks is configured to act as a microprocessor. Soft cores are simpler (more primitive) and slower than hard cores, but you only need to implement a core if you need it, and you can instantiate as many cores as you require.
Bird's-eye view of chip with embedded core outside of the main fabric
Clock trees
All of the synchronous elements inside an FPGA (the registers configured to act as flip-flops inside the programmable logic blocks) need to be driven by a clock signal. Such a clock signal typically originates in the outside world, comes into the FPGA via a special clock input pin, and is then routed through the device and connected to the appropriate registers.
Clock tree: the main clock signal branches again and again, and the flip-flops can be considered to be the "leaves" on the ends of the branches. The goal is for all of the flip-flops to see the clock signal as close together in time as possible.
Skew
If the clock was distributed as a single long track driving all of the flip-flops one after another, then the flip-flop closest to the clock pin would see the clock signal much sooner than the one at the end of the chain.
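The skew argument can be made concrete with a toy delay model. It assumes a fixed, illustrative delay per wire segment or buffer stage (50 ps here, not a real device figure) and an idealised balanced binary tree in which every flip-flop sits at the same depth; real clock trees only approximate this balance.

```python
import math

SEGMENT_DELAY_PS = 50  # assumed delay per wire segment / buffer stage (illustrative only)

def chain_arrivals(n_ffs: int) -> list[int]:
    """Single long track: flip-flop i sees the clock after i+1 segments."""
    return [(i + 1) * SEGMENT_DELAY_PS for i in range(n_ffs)]

def balanced_tree_arrivals(n_ffs: int) -> list[int]:
    """Idealised balanced binary tree: every leaf is the same number of levels deep."""
    depth = math.ceil(math.log2(n_ffs)) if n_ffs > 1 else 1
    return [depth * SEGMENT_DELAY_PS] * n_ffs

def skew(arrivals: list[int]) -> int:
    return max(arrivals) - min(arrivals)

for n in (4, 64):
    print(n, skew(chain_arrivals(n)), skew(balanced_tree_arrivals(n)))
```

In this model the chain's skew grows linearly with the number of flip-flops (150 ps for 4, 3150 ps for 64), while the balanced tree's skew stays at zero and its insertion delay grows only with log2 of the flip-flop count.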
The clock tree is implemented using special tracks and is separate from the general-purpose programmable interconnect. Devices may have multiple clock pins and multiple clock domains (clock trees).
Clock manager
A clock pin can drive a special hard-wired block called a clock manager, which generates a number of daughter clocks. These daughter clocks may be used to drive internal clock trees or external output pins that can be used to provide clocking services to other devices on the host circuit board. Each family of FPGAs has its own type of clock manager (there may be multiple clock manager blocks in a device), and different clock managers may support only a subset of the following features:
Jitter removal
Frequency synthesis
Phase shifting
Design Entry
Involves capturing the design using a high-level description language such as Verilog or VHDL. Alternatively, a schematic editor is used to enter the design at the basic logic level or by making use of generic blocks, which are in turn described in high-level languages; the design can also be entered using state diagrams. The CAD software provided by FPGA manufacturers includes libraries of standard circuits or macro-functions to quickly implement common circuits of varying complexity. The schematic or HDL description is then translated into a netlist describing the circuit in terms of logic gates and sequential elements.
Logic Synthesis
Optimizes the circuit by regrouping logic functions and/or removing redundancies according to design constraints or rules, such as minimizing area or maximizing speed. Once the optimized netlist is obtained, it has to be mapped onto the logic cells of the FPGA (LUT/flip-flop, PLA, ...).
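Mapping a function onto a LUT-based logic cell amounts to computing the LUT's truth table. A minimal sketch, assuming a 4-input LUT and an ordering where input 0 is the least significant address bit (vendors define their own ordering): `lut_init` computes the 16 configuration bits for a Boolean function, and `lut_eval` models how the stored bits are read back at run time. Both are hypothetical helpers, not a vendor API.

```python
def lut_init(func, k):
    """Return the 2**k configuration bits of a k-input LUT implementing func.

    Bit i of the result is func evaluated at the input pattern whose binary
    encoding is i, with input 0 as the least significant bit (an assumed
    convention).
    """
    bits = 0
    for i in range(2 ** k):
        inputs = [(i >> j) & 1 for j in range(k)]
        if func(*inputs):
            bits |= 1 << i
    return bits

def lut_eval(bits, inputs):
    """Model the LUT at run time: the inputs simply select one stored bit."""
    index = sum(bit << j for j, bit in enumerate(inputs))
    return (bits >> index) & 1

# f = (a AND b) OR (c XOR d) fits in a single 4-input LUT.
f = lambda a, b, c, d: (a & b) | (c ^ d)
init = lut_init(f, 4)
print(hex(init))  # 0x8ff8
```

Any function of up to four inputs costs the same single LUT, which is why synthesis tools try to regroup logic into k-input chunks.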
Floorplanning
The circuit to be designed is now divided into partitions, each of which is assigned to a particular area on an FPGA device. A partition usually corresponds to a large section of the circuit with a particular functionality, e.g., a multiplier or a filter bank. The total number of FPGA devices required is also determined.
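Determining how many devices the partitions need can be approximated as a bin-packing problem. The sketch below uses first-fit decreasing, a common heuristic swapped in here for illustration; real floorplanners also weigh placement area, timing and I/O. The partition names and sizes (in LUTs) and the 10,000-LUT device capacity are hypothetical.

```python
def assign_partitions(partitions, device_capacity):
    """First-fit decreasing: put each partition into the first device with room.

    partitions: dict mapping partition name -> estimated LUT count.
    Returns one list of partition names per device used.
    """
    devices = []  # each entry: [remaining_capacity, [partition names]]
    largest_first = sorted(partitions.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in largest_first:
        if size > device_capacity:
            raise ValueError(f"partition {name!r} ({size} LUTs) exceeds one device")
        for device in devices:
            if device[0] >= size:          # first device with enough room
                device[0] -= size
                device[1].append(name)
                break
        else:                              # no existing device fits: add one
            devices.append([device_capacity - size, [name]])
    return [names for _, names in devices]

# Hypothetical partitions (sizes in LUTs) and a hypothetical 10,000-LUT device.
partitions = {"multiplier": 6000, "filter_bank": 9000, "control": 1500, "uart": 800}
plan = assign_partitions(partitions, 10_000)
print(len(plan), plan)  # 2 [['filter_bank', 'uart'], ['multiplier', 'control']]
```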
FPGA Design Flow
Place and Route
A logic partition is now mapped onto an FPGA device by means of the placement tool, which assigns each function (LUT/flip-flop, PLA, ...) a physical location in the array of CLCs. Typical placement algorithms aim to minimize the total length of the interconnections in the final design, with the objective of maximizing the speed of the device. Routing algorithms then configure the routing elements to provide the required connections between logic elements. The primary aim of any routing algorithm is to ensure that 100% of the required routes can be realized; other goals include finding the shortest possible paths between elements. Because interconnection resources are limited, routing is the most restrictive step.
Layout Verification
This step involves extracting the physical layout of the design, simulating it with commercial simulators to obtain timing data, and checking design rules (DRC). If the interconnection delays in the prototype fulfill the delay constraints imposed by the design specifications, the device may be programmed; otherwise, the placement and routing steps have to be repeated until a satisfactory configuration is found.
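Routing as a shortest-path search can be illustrated with Lee's maze-routing algorithm, a breadth-first search over a grid of routing resources. This is a simplified sketch, assuming a uniform grid in which 1 marks a resource already used by another net; commercial FPGA routers work on a graph of wire segments and programmable switches and typically resolve congestion among many nets iteratively.

```python
from collections import deque

def lee_route(grid, source, target):
    """Breadth-first (Lee) maze routing on a grid of routing resources.

    grid[r][c] == 1 marks a resource already taken by another net.
    Returns the shortest list of (row, col) cells from source to target,
    or None if the net cannot be routed.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {source: None}          # also serves as the visited set
    frontier = deque([source])
    while frontier:
        cell = frontier.popleft()
        if cell == target:              # backtrace from target to source
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and nxt not in came_from):
                came_from[nxt] = cell
                frontier.append(nxt)
    return None

grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],   # row 1 is congested except its last column
    [0, 0, 0, 0],
]
path = lee_route(grid, (0, 0), (2, 0))
print(len(path) - 1)  # 8 wire segments, detouring through (1, 3)
```

The detour shows why limited interconnect makes routing restrictive: a blocked resource lengthens the path (and its delay) or, if no free cells remain, makes the net unroutable.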
Macro Integration
This involves providing all the files and data formats necessary for integrating the macro into the design flow of the whole chip. Once the circuit has been verified, the design configuration is output in a format that can be read as input by the FPGA device to be programmed. Programming the device can take a matter of minutes.
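Real configuration formats are vendor-specific and are not described here, so the sketch below invents a hypothetical format purely to show the idea: each logic cell's LUT contents and a flip-flop select bit are serialized in a fixed cell order, followed by a CRC-32 so that corrupted data is rejected before loading. `pack_config` and `unpack_config` are illustrative names, not a vendor tool.

```python
import struct
import zlib

# Hypothetical format: a 2-byte cell count, then per logic cell a 16-bit LUT
# init value and a 1-byte flag (bit 0 = register the output in the cell's
# flip-flop), then a CRC-32 over everything before it.

def pack_config(cells):
    body = struct.pack(">H", len(cells))
    for lut_init, use_ff in cells:
        body += struct.pack(">HB", lut_init, 1 if use_ff else 0)
    return body + struct.pack(">I", zlib.crc32(body))

def unpack_config(data):
    body = data[:-4]
    (crc,) = struct.unpack(">I", data[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch: configuration data is corrupted")
    (count,) = struct.unpack_from(">H", body, 0)
    cells = []
    for i in range(count):
        lut_init, flags = struct.unpack_from(">HB", body, 2 + 3 * i)
        cells.append((lut_init, bool(flags & 1)))
    return cells

cells = [(0x8FF8, True), (0x6996, False)]  # second LUT: 4-input XOR
stream = pack_config(cells)
print(len(stream), unpack_config(stream) == cells)  # 12 True
```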
The Design Cycle
ASIC Design Methodology
The design is verified by simulation at each stage of refinement.
Accurate simulators are slow; fast simulators trade away simulation accuracy.
ASIC designers use a battery of simulators across the speed-accuracy spectrum in an attempt to verify the design.
An FPGA designer can replace simulation with in-circuit verification, “simulating” the circuitry in real time with a prototype.
The path from design to prototype is short, allowing a designer to verify operation over a wide range of conditions at high speed and high accuracy, and to build proof-of-concept prototype designs easily.
Designs can be verified by trial rather than by reduction to first principles or by mental execution, demonstrating that the design works in the real system, not merely in a potentially erroneous simulation model of the system.
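As a small illustration of simulation-based functional verification, the sketch below models a 4-bit ripple-carry adder at gate level and checks every input combination against a reference model (x + y). Python stands in for an HDL simulator here, an assumption made for brevity. The number of cases grows as 4^width input pairs, which hints at why thorough simulation of large designs is slow.

```python
def full_adder(a, b, cin):
    """Gate-level full adder built from XOR, AND and OR operations."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add(x, y, width):
    """Chain `width` full adders; returns (sum_bits, carry_out)."""
    carry, total = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry

def simulate_exhaustively(width):
    """Compare the gate-level model with the reference x + y for all inputs."""
    mismatches = []
    for x in range(2 ** width):
        for y in range(2 ** width):
            total, carry = ripple_add(x, y, width)
            if ((carry << width) | total) != x + y:
                mismatches.append((x, y))
    return mismatches

print(simulate_exhaustively(4))  # [] -> all 256 input pairs match
```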
The Design Cycle
1. Entering the design in the form of schematic, netlist, logic expressions, or HDLs
2. Simulating the design for functional verification
3. Mapping the design into the FPGA architecture
4. Placing and Routing the FPGA design
5. Extracting delay parameters of the routed design
6. Resimulating for timing verification
7. Generating the FPGA device configuration format
8. Configuring or Programming the device
9. Testing the product for undesirable functional behavior
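Once delay parameters have been extracted from the routed design, timing can also be checked without simulation by static timing analysis, a different technique from the resimulation listed in the design cycle, named here plainly. The sketch computes the latest arrival time at each node of a combinational graph in topological order and compares the worst path with the clock period. The node names and delays are hypothetical, and setup time and clock skew are ignored for simplicity.

```python
from collections import defaultdict, deque

def arrival_times(edges):
    """Latest arrival time (ps) at each node of a combinational DAG.

    edges: (from_node, to_node, delay_ps) tuples extracted from the routed
    design; nodes with no incoming edge (e.g. register outputs) start at 0.
    """
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, delay in edges:
        succ[u].append((v, delay))
        indegree[v] += 1
        nodes.update((u, v))
    arrival = {n: 0 for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:                      # Kahn's topological order
        u = ready.popleft()
        visited += 1
        for v, delay in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + delay)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if visited != len(nodes):
        raise ValueError("combinational loop: no topological order")
    return arrival

def check_timing(edges, clock_period_ps):
    """Return (meets_constraint, slack_ps) for the worst path."""
    worst = max(arrival_times(edges).values())
    slack = clock_period_ps - worst
    return slack >= 0, slack

# Hypothetical extracted delays: register ff1 -> two LUTs -> register ff2.
routed = [("ff1", "lut1", 1200), ("lut1", "lut2", 900),
          ("lut2", "ff2", 1100), ("ff1", "ff2", 800)]
print(check_timing(routed, 4000))  # (True, 800): 250 MHz clock is met
print(check_timing(routed, 3000))  # (False, -200): ~333 MHz clock fails
```

A negative slack corresponds to the case in which placement and routing must be repeated.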
FPGA Configuration
In the case of re-programmable devices, interconnects are activated or deactivated by means of transistors or tri-state buffers. Memory units also store the configuration of the LUTs and static multiplexers in the CLC. If the memory used is EEPROM, the device is non-volatile, but its more difficult re-configuration mechanism limits the applications of the system. SRAM memory, on the other hand, loses the configuration once power is removed from the device (volatile), but it is simple and quick to configure. The use of SRAM allows dynamic re-configuration of the device, even during real-time operation. Small local SRAM blocks may also be used to store several configuration bits.
FPGA Capacity comparisons
[Figure: capacity comparison, SRAM FPGA vs. EEPROM FPGA]
Despite this volatility, most FPGAs still use SRAM for reasons of simplicity (when you need to reprogram it, it's easier to rewrite a small ROM chip than to reprogram a large FPGA chip), so count on having to use a separate boot ROM for the FPGA.
Use of an FPGA is broadly divided into two main stages. The first is configuration mode: the mode the FPGA is in when you first power it up. Configuration mode is, as you may have guessed, where you configure the FPGA;
Product – FPGA vs ASIC
Comparison: FPGA benefits vs ASICs:
- Design time: 9-month design cycle vs 2-3 years
- Cost: no $3-5M upfront (NRE) design cost; no $100-500K mask-set cost
- Volume: high initial ASIC cost recovered only in very high-volume products
Due to Moore’s law, many ASIC market requirements are now met by FPGAs
- E.g., Virtex-II Pro has 4 processors, 10 Mb memory, I/O
Resulting Market Shift: Dramatic decline in number of ASIC design starts:
- 11,000 in ’97
- 1,500 in ’02
FPGAs as a % of Logic market:
- Increase from 10 to 22% in past 3-4 years
FPGAs (or programmable logic) are the fastest-growing segment of the semiconductor industry!
FPGA/ASIC Crossover Changes
[Figure: cost vs. production volume for 200mm/150nm and 300mm/90nm ASICs and FPGAs, with regions labeled "FPGA Cost Advantage" and "ASIC Cost Advantage"]
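The crossover can be estimated with a simple linear cost model: total cost = upfront cost + unit price × volume, where the FPGA has no NRE or mask-set cost. The sketch uses mid-range values of the $3-5M NRE and $100-500K mask-set figures quoted earlier; the per-unit prices ($10 per ASIC, $50 per FPGA) are hypothetical. Because a new process node changes both upfront and unit costs, the crossover volume shifts from one generation to the next.

```python
import math

def total_cost(volume, upfront, unit_price):
    """Upfront (NRE + masks) plus per-unit cost for a production run."""
    return upfront + unit_price * volume

def crossover_volume(asic_upfront, asic_unit, fpga_unit, fpga_upfront=0):
    """Smallest volume at which the ASIC is no more expensive than the FPGA,
    or None if the FPGA unit price is not higher (ASIC never catches up)."""
    if fpga_unit <= asic_unit:
        return None
    return math.ceil((asic_upfront - fpga_upfront) / (fpga_unit - asic_unit))

# $4M NRE + $300K mask set; unit prices are hypothetical.
asic_upfront = 4_000_000 + 300_000
v = crossover_volume(asic_upfront, asic_unit=10, fpga_unit=50)
print(v)  # 107500
```

Below this volume the FPGA is cheaper overall; above it the ASIC's upfront cost has been recovered.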