Informatica ETL Standards and Guidelines
September 17, 2016
Version: 1.0
Document Control

Version History
Version | Date | Author | Modification Description

Reviewers
Date | Reviewer | Department
TABLE OF CONTENTS

1. Overview
   1.1 Purpose
2. Security Standards
   2.1 Informatica Roles & Responsibilities
   2.2 UNIX Security
   2.3 Informatica Security
   2.4 Informatica Authentication Security
   2.5 Informatica Authorization Security
3. Naming Conventions
   3.1 Informatica Server Components
   3.2 Database Connections
   3.3 Transformations
   3.4 Ports
   3.5 Mappings
   3.6 Mapplets
   3.7 Sessions
   3.8 Workflows
   3.9 Worklet
   3.10 Parameter Naming Standards
4. Version Control
   4.1 Informatica Team-based Deploy
5. Informatica Environment Standards
   5.1 Repository Folder Structure
   5.2 Informatica Server Structure
   5.3 Parameter File
   5.4 Reusable Objects
   5.5 Shared Folders
6. Data Quality
   6.1 Data Transformations
   6.2 Default Data
7. Informatica Development Guidelines
   7.1 Mapping Guidelines
   7.2 Mapping and Workflow Documentation
   7.3 Transformation Object Documentation
   7.4 Expression Ports Documentation
   7.5 Mapping Recovery
   7.6 Mapping Templates
   7.7 Source Qualifier
   7.8 Filter
   7.9 Joiner
   7.10 Router
   7.11 Sorter
   7.12 Aggregator
   7.13 Lookup
   7.14 Sequence Generator
   7.15 Update Strategy
   7.16 Target
   7.17 Mapplet
   7.18 Session
   7.19 Workflow/Worklet
8. Change Data Capture (CDC)
   8.1 Full Master
   8.2 Incremental
   8.3 CDC Techniques
9. Logging
   9.1 Informatica Session Logging
   9.2 Informatica Workflow Logging
   9.3 Non-Informatica Logging
10. Error/Exception Handling and Recovery
   10.1 Essential Error Trapping and Recovery
   10.2 Informatica Errors and Warnings
11. ETL Control
   11.1 ETL Control Data Model
   11.2 ETL Transform Rule Data Model
   11.3 Rolling ETL Control Forward
   11.4 Parameter File and ETL_CONTROL Synchronization
   11.5 ETL Start
   11.6 ETL End
   11.7 ETL Control Workflow
12. UNIX
   12.1 UNIX Data Directories
   12.2 UNIX Landing Directories
   12.3 UNIX Informatica Environment Variables
   12.4 UNIX Logging
   12.5 UNIX Naming
13. File Processing
   13.1 File Types
   13.2 File Format and Transfer Validation
   13.3 File Naming
   13.4 File Notification
   13.5 File Landing
   13.6 File Archiving
14. Scheduling
   14.1 Informatica Workflow
   14.2 Corporate Scheduler
   14.3 Data Warehouse Administration Console (DAC)
1. Overview

1.1 Purpose

The purpose of this document is to define the ETL standards and guidelines that will be used in the design and development of the Informatica processes for the Advantage 1.0 project. Because additional ETL standards may need to be incorporated over time, this document will continue to evolve. The initial version is based on standards and best practices already developed, practiced, and shared by Informatica (Velocity) and the client. Note that some guidelines will not apply to the pre-built Informatica mappings that come with the Oracle BI Applications.
2. Security Standards
2.1 Informatica Roles & Responsibilities
A number of different roles have been identified, each requiring different levels of access. It is recommended that the Group feature in Informatica's security model be leveraged to aid in these role assignments. It is understood that initially some individuals may serve multiple roles, as support will primarily be handled within the development groups at the outset.
Developer
  o Full access to assigned project folders in Development
  o Execute access to project specific database connections in Development
  o Read access to assigned project folders in QA
  o Read access to assigned project folders in Production
  o Update access to assigned deployment groups in Development
  o Read access to assigned project folders in QA
  o Execute access to project specific database connections in QA
  o Update access to assigned deployment groups in QA
Tester
  o Read and possibly execute access to assigned project folders in QA
  o Execute access to project specific database connections in QA
  o Update access to assigned deployment groups in QA
  o Read access to assigned project folders in Production
Support
  o Full access to Break/Fix (B/F) environment in Development
  o Read access to all project folders in Development
  o Update access to all B/F deployment groups in Development
  o Execute access to all database connections in Development
  o Full access to all project folders in QA
  o Update access to all B/F deployment groups in QA
  o Execute access to all database connections in QA
  o Read access to all project folders in Production
  o Execute access to all database connections in Production
Administrator
  o Full access to all environments including security administration
  o Administrators will perform configuration changes and software upgrades
Deployer
  o Must perform all deployments via deployment groups
2.2 UNIX Security
The Informatica installation resides on UNIX servers. Thus, in order for Informatica to be installed and to function properly, an Informatica system account must be created. This account must have full privileges to the directory in which Informatica will be installed. Additionally, it is recommended that the Informatica account have full privileges to any directory it may need to process data. This includes, but is not limited to:
  o Temp directories
  o File processing directories
  o Archive directories
  o Log directories
  o Bad file directories
  o Cache directories
In order to accommodate multiple projects with distinct parent folders in UNIX, it may be preferable to utilize the OS Profiles functionality in Informatica. This would require distinct UNIX ids for each project, each with its own set of system variables. The main Informatica environment would still run with the administrative id, but the workflows in each folder would be run under distinct UNIX ids.

2.3 Informatica Security
Informatica allows for the following types of authentication security, as described in the PowerCenter Administrator Guide, pages 55-57:

Native Authentication
"For native authentication, the Service Manager stores all user account information and performs all authentication within the PowerCenter domain. When a user logs in, the Service Manager uses the native security domain to authenticate the user name and password. By default, the PowerCenter domain contains a native security domain. The native security domain is created at installation and cannot be deleted. A PowerCenter domain can have only one native security domain. You create and maintain users of the native security domain in the Administration Console. The Service Manager stores details of the users, including passwords and groups, in the domain configuration database."

LDAP Authentication
LDAP authentication is not currently used for Informatica. However, "if you have users in an enterprise LDAP directory service that you want to give access to the PowerCenter applications, you can configure the PowerCenter domain to use LDAP authentication. Create an LDAP security domain and set up a filter to specify the users and groups in the LDAP directory service who can access PowerCenter and be included in the security domain. The Service Manager imports the users and groups from the LDAP directory service into an LDAP security domain. You can set up a schedule for the Service Manager to periodically synchronize the list of users and groups in the LDAP security domain with the list of users and groups in the LDAP directory service. During synchronization, the Service Manager imports new users and groups from the LDAP directory service and deletes any user or group that no longer exists in the LDAP directory service.
When a user in an LDAP security domain logs in to a PowerCenter repository or application, the Service Manager passes the user name and password to the LDAP directory service for authentication. If the LDAP server uses the SSL security protocol, the Service Manager sends the user name and password to the LDAP directory service using the appropriate SSL certificates. You can use the following LDAP directory services for LDAP authentication:
  o Microsoft Active Directory Service
  o Sun Java System Directory Service
  o Novell e-Directory Service
  o IBM Tivoli Directory Service
  o Open LDAP Directory Service
You create and manage LDAP users and groups in the LDAP directory service; assign roles, privileges, and permissions to users and groups in an LDAP security domain; and assign LDAP users to native groups to organize them based on their roles in the PowerCenter domain. You cannot use the Administration Console to create, edit, or delete users and groups in an LDAP security domain."
2.4 Informatica Authentication Security
The Informatica installation will use a combination of native and LDAP security. Native security is necessary in the event a connection to the Corporate LDAP cannot be established. For normal use, a security domain named "Corporate_LDAP" can be created. Developers, testers, and administrators will log in to this Informatica security domain using their Directory Services ID and password. This approach reduces the total number of passwords a user needs to maintain throughout the security landscape.
In order to connect to the Corporate LDAP, the appropriate values must be set in the LDAP Configuration. In order to add users to the "Corporate_LDAP" security domain, the "User Search Base" field must be set to "ou=person, o=??.com, c=us".
Additionally, specific syntax must be used to add each individual user. The entire string specified in the "User filter" field must first be enclosed in parentheses. The "|" (pipe) symbol must be used to indicate that the subsequent user IDs are to be combined (logical OR). Lastly, each user ID must be contained in parentheses and follow the "uid=" LDAP syntax. Finally, after adding new users, the administrator must click the "Synchronize Now" button to retrieve the user information from the Corporate LDAP.
In the future, an LDAP role may be created in the Corporate LDAP system, with the appropriate users assigned to the role in Corporate LDAP. This would mean that rather than maintaining a long string of user IDs in the "User filter" section, a single group can be specified in the "Group filter" section. Since there are currently only a handful of users in the system, the "User filter" approach is preferred and does not require the Corporate LDAP team to maintain another LDAP group.
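For illustration, a hedged sketch of what the "User filter" value might look like for three IDs; jdoe1, asmith2, and bjones3 are placeholders, not actual accounts:

   (|(uid=jdoe1)(uid=asmith2)(uid=bjones3))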
2.5 Informatica Authorization Security

Informatica comes with predefined roles for users. For further description of these roles, see the Informatica Administrator Guide, Chapter 5: Managing Privileges and Roles.
It is important to note that these roles only grant authorization as to what a user has the ability to do within the Informatica PowerCenter utilities. They do not grant authorization to specific objects (i.e., sources, targets, mappings, etc.) within the PowerCenter utilities. If a user wants to allow another user to have access to their objects, they must explicitly grant access to those objects from within the PowerCenter utility.
3. Naming Conventions
3.1 Informatica Server Components
All Informatica server components are prefixed with the Informatica Domain name, followed by "_". Additionally, since multiple nodes may be used to enable the use of Informatica as a service across the Enterprise, nodes have the additional designation of a numeric suffix. The following are the names of the Informatica server components:

Development
  Domain: INFADEV
  Node(s): INFADEV_node01
  Integration: INFADEV_integration
  Repository: INFADEV_repository
Quality Assurance
  Domain: INFAQA
  Node(s): INFAQA_node01
  Integration: INFAQA_integration
  Repository: INFAQA_repository
Production
  Domain: INFAPROD
  Node(s): INFAPROD_node01
  Integration: INFAPROD_integration
  Repository: INFAPROD_repository

3.2 Database Connections
The PowerCenter Designer utilizes ODBC connections to databases; as such, there is no standard for defining ODBC connections. The PowerCenter Workflow Manager, however, contains a Relational Connection Browser in which connections to relational databases are defined for the workflows. When naming these connections, spaces and special characters are not allowed, with the exception of the "_" (underscore) character. It is suggested that the name of the relational connection contain the user name, followed by the "_" (underscore) character, followed by the SID of the relational database. Below is the current connection:
  DB Connection Name: F2INFOD
  User: INFDDBR1
  SID: F2INFOD
3.3 Transformations
The following figure gives the standard naming conventions for transformations:

Transformation | Naming Convention
Aggregator | AGG_TransformationName
Application Source Qualifier | SQ_TransformationName
Custom | CT_TransformationName
Expression | EXP_TransformationName
External Procedure | EXT_TransformationName
Filter | FIL_TransformationName
HTTP | HTTP_TransformationName
Java | JTX_TransformationName
Joiner | JNR_TransformationName
Lookup | LKP_TransformationName
MQ Source Qualifier | SQ_MQ_TransformationName
Normalizer | NRM_TransformationName
Rank | RNK_TransformationName
Router | RTR_TransformationName
Sequence Generator | SEQ_TransformationName
Sorter | SRT_TransformationName
Source Qualifier | SQ_TransformationName
SQL | SQL_TransformationName
Stored Procedure | SP_TransformationName
Transaction Control | TC_TransformationName
Union | UN_TransformationName
Update Strategy | UPD_TransformationName
XML Generator | XG_TransformationName
XML Parser | XP_TransformationName
XML Source Qualifier | XSQ_TransformationName
Figure 1 - Transformation Naming Conventions
There may be instances in naming a Lookup transformation when the developer needs to distinguish how a lookup is performed (i.e.: via a key vs. a code). In these instances, a lookup may be named “LKP_TransformationName_KEY” or “LKP_TransformationName_CD”.
3.4 Ports
Ports are generally named for the field they represent. For example, if there is a column in a target table called "FIRST_NAME", the Informatica target definition should have a port of the same name. Thus, it is recommended, for data lineage as well as for developer best practices, that a port be named for the source field from which it comes or the target field that it will populate, as appropriate. Quite often, however, transformations may have ports not found in sources or targets. In these cases, the developer should use ALL CAPS for the field names, the "_" (underscore) character as a separator, and judiciousness when naming the field so as not to cause confusion for other developers. In circumstances when a transformation contains inputs that will not be output, the port should be prefixed with "IN_" indicating that it is an input-only port. Similarly, if the transformation contains output fields that were not inputs to the transformation, the prefix "OUT_" should be used to indicate they are output-only ports. When naming variable ports, the prefix "V_" should be used.
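As a small hypothetical sketch (not the original figure), ports in an Expression transformation might be named as follows:

   IN_CUST_ID      // input-only port
   IN_FIRST_NAME   // input-only port
   IN_LAST_NAME    // input-only port
   V_FULL_NAME     // variable port
   OUT_FULL_NAME   // output-only port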
Figure 2 - Port Naming Example
3.5 Mappings
A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. The recommended naming convention for mappings is m_MappingName, where MappingName is a reasonable description of the purpose of the mapping. This description should contain the target database schema and possibly the target load table or subject area. For example, a staging mapping may be called m_STG_TABLE_NAME, while an IDR mapping may be called m_FDW_TABLE_NAME.

3.6 Mapplets
Mapplets are reusable objects containing transformations. The purpose of a mapplet is to allow multiple mappings to use the same block of transformation logic. By creating the mapplet once, developers increase coding efficiency while decreasing diagnosis and maintenance effort. The recommended naming convention for Mapplets is mplt_MappletName, where MappletName is a reasonable description of the purpose of the Mapplet.

3.7 Sessions
Sessions are the Workflow representation of a mapping and all the instructions that surround that mapping. The recommended naming convention for Sessions is s_MappingName (the mapping name without the "m_" prefix). This is the default naming convention and indicates that a session is related to a specific mapping.

3.8 Workflows
A workflow is a set of instructions that tells the Integration Service how to run tasks such as sessions, Worklets, email notifications, and shell commands. The recommended naming convention for a Workflow is wfk_WorkflowName, where WorkflowName is a reasonable description of the purpose of the Workflow.

3.9 Worklet
A Worklet is a reusable object that contains a series of tasks. Each Worklet can be used in multiple Workflows similar to the way a Mapplet can be used in multiple Mappings. The recommended naming convention for a Worklet is wlkt_WorkletName where WorkletName is a reasonable description of the purpose of the Worklet.
3.10 Parameter Naming Standards
Informatica allows Mapping and Session parameters to be overridden at runtime through the use of parameter files. Mapping parameters should be defined in a consistent case, separated by underscores to improve readability. Mapping parameters are prefixed with $$. The ETL design documents should detail the standard parameters utilized by every mapping or by targeted mappings. Session parameters must match the formats defined in Informatica and are typically mixed case; these must follow the predefined naming conventions. Session parameters allow the overriding of source, target and log files as well as database connection names. Session parameters are prefixed with $. The following examples are provided:

Mapping Parameters:
   $$LAST_EXTRACT_DATE
   $$EFF_START_DATE
   $$ETL_CREATE_DATE

Session Parameters:
   $DBConnection_Target
   $InputFile1
   $PMSessionLogFile
4. Version Control
4.1 Informatica Team-based Deploy
Informatica contains built-in version control allowing developers to track their work as objects are built. Information about Informatica versioning can be found in the Informatica Repository Guide, Chapter 6: Working with Versioned Objects. The following guidelines are recommended to ensure object tracking and repository backups.
o Objects must be checked in before they can be made available to the "public" repository. If an object is not checked in, other developers and other parts of the Informatica PowerCenter suite will not be able to access that object. Similarly, existing checked-in objects must be checked out in order to make changes to them.
o Clear descriptions of the modifications must be entered into the "Description" box prior to check-in, to ensure proper communication of modifications.
o Prior to modifying an object that is checked in, it is recommended that developers use one or more of the functions in the "Analyze" menu of the Repository Manager to view dependencies of the object. This will allow the developer to understand and communicate the effects of the change they are about to make. If the developer does not have permissions to perform this function, they should request that this analysis be done by an administrator.
In addition to using the built-in object versioning of Informatica, it is recommended that the ETL Architect implement code-freeze checkpoints. During this time, the Informatica objects should be labeled, and the entire repository should be exported to XML, and checked-in to a separate version control system (i.e. ClearCase).
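As a sketch of how such a checkpoint could be scripted, the repository can also be backed up with pmrep. The repository, domain, account, and file names below are placeholders, and the exact options should be verified against the pmrep reference for the installed PowerCenter version; the XML object export itself can be produced from the Repository Manager or with pmrep objectexport.

   pmrep connect -r INFADEV_repository -d INFADEV -n <admin_account> -x <password>
   pmrep backup -o /{fmacdata}/backups/INFADEV_repository_backup.rep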
5. Informatica Environment Standards
5.1 Repository Folder Structure
The Informatica ETL tool will be used to support multiple applications across the Enterprise. The planned installation architecture will consist of multiple Informatica servers across each of the planned environments (Development, Test, and Production). Folders will be created in the repository for each key application. These folders must be created in ClearCase also. To reduce the issue of a cluttered repository, rather than creating personal folders, "sandbox" folders will be created for each project. The sandbox folder will have full permissions for everyone in that project group. Developers in the project can work in the sandbox folder, and once they are finished researching or experimenting, they can move their code to the main project folder. Each developer should clean up all objects in the sandbox folder once the developer is finished. A periodic purge process will be initiated to remove objects in the sandbox folders that have not been modified for 60 days. The following figure provides a list of the known applications that will be developed and the structures for the repository configurations. Additional folders for sharing may also be created. Each application should also have a folder for break/fix. This folder would represent the current production version of the application and should only be editable by the Informatica Team. The figure below shows each Informatica application and its break/fix counterpart. Note: the pre-built Informatica mappings are already organized in separate pre-built folders.
Application | Repository Folder Name
Expense Planning and Corporate Forecasting | FM_EPCF
Expense Planning and Corporate Forecasting – Break/Fix | FM_EPCF_BF
Expense Planning and Corporate Forecasting – Sandbox | FM_EPCF_SANDBOX
Financial Data Warehouse | FM_FDW
Financial Data Warehouse – Break/Fix | FM_FDW_BF
Financial Data Warehouse – Sandbox | FM_FDW_SANDBOX
Figure 3 - Repository Folders
5.2 Informatica Server Structure
Informatica provides a directory structure out of the box for development. The following figure lists the session variables that are provided with Informatica and describes their usage. Applications should adhere to these usage rules. Rather than defaulting to $PMRootDir with relative pathing off of root, it may be preferable to utilize the OS Profiles functionality, where a distinct set of system variables is set up for distinct users, with a different user per folder, allowing the base system variable names to be utilized directly in the sessions.

$PMRootDir: A root directory to be used by any or all other server variables. Informatica recommends you use the PowerCenter Server installation directory as the root directory. Instead, we maintain application specific folders separately from the root installation directory. As such, $PMRootDir should be set to a common application folder {/opt/Informatica/PowerCenter8.1.1/server/infa_shared}. In the production environment, this folder is logically shared in the UNIX file system across the two failover servers.

$PMSessionLogDir: Default directory for session logs. Defaults to $PMRootDir/SessLogs but should be set in the environment to $PMRootDir.

$PMBadFileDir: Default directory for reject files. Defaults to $PMRootDir/BadFiles but should be set in the environment to $PMRootDir.

$PMCacheDir: Default directory for the lookup cache, index and data caches, and index and data files. To avoid performance problems, always use a drive local to the PowerCenter Server for the cache directory. Do not use a mapped or mounted drive for cache files. Defaults to $PMRootDir/Cache but should be set in the environment to $PMRootDir.

$PMTargetFileDir: Default directory for target files. Defaults to $PMRootDir/TgtFiles but should be set in the environment to $PMRootDir.

$PMSourceFileDir: Default directory for source files. Defaults to $PMRootDir/SrcFiles but should be set in the environment to $PMRootDir.

$PMExtProcDir: Default directory for external procedures. Defaults to $PMRootDir/ExtProc but should be set in the environment to $PMRootDir.

$PMTempDir: Default directory for temporary files. Defaults to $PMRootDir/Temp but should be set in the environment to $PMRootDir.

$PMWorkflowLogDir: Default directory for workflow logs. Defaults to $PMRootDir/WorkflowLogs but should be set in the environment to $PMRootDir.

$PMLookupFileDir: Default directory for lookup files. Defaults to $PMRootDir/LkpFiles but should be set in the environment to $PMRootDir.

Figure 4 - Informatica Provided Server Variables
To support multiple applications and to separate application data at both a repository and UNIX level, the 'out of the box' Informatica server variables will not be used. In order to separate application data, all session variables will be defined as $PMRootDir, and all applications must override any session variables with the directories for their application. These directories must be defined relative to $PMRootDir to avoid hard coding full directory paths. For example, the cache directory for the IDR application should point to $PMRootDir/IDR/Cache for the $PMCacheDir value. Based on the table above, the following directory structures must be created to support the different applications. Rather than using the default server variable, the one $PMRootDir reference will be used with the path name appended and recorded in each session/workflow.

Default Server Variable | Workflow/Session Variable Override Path Name | Application UNIX Path Name
$PMSessionLogDir | $PMRootDir/EPCF/SessLogs | /{fmacdata}/EPCF/SessLogs
$PMBadFileDir | $PMRootDir/EPCF/BadFiles | /{fmacdata}/EPCF/BadFiles
$PMCacheDir | $PMRootDir/EPCF/Cache | /{fmacdata}/EPCF/Cache
$PMTargetFileDir | $PMRootDir/EPCF/TgtFiles | /{fmacdata}/EPCF/TgtFiles
$PMSourceFileDir | $PMRootDir/EPCF/SrcFiles | /{fmacdata}/EPCF/SrcFiles
$PMTempDir | $PMRootDir/EPCF/Temp | /{fmacdata}/EPCF/Temp
$PMExtProcDir | $PMRootDir/EPCF/ExtProc | /{fmacdata}/EPCF/ExtProc
$PMWorkflowLogDir | $PMRootDir/EPCF/WorkflowLogs | /{fmacdata}/EPCF/WorkflowLogs
$PMLookupFileDir | $PMRootDir/EPCF/LkpFiles | /{fmacdata}/EPCF/LkpFiles
TBD | $PMRootDir/EPCF/Parameters | /{fmacdata}/EPCF/Parameters
TBD | $PMRootDir/EPCF/DataIn | /{fmacdata}/EPCF/DataIn
TBD | $PMRootDir/EPCF/DataOut | /{fmacdata}/EPCF/DataOut
Figure 5 - Application Informatica UNIX Structures for EPCF
5.3 Parameter File
All mappings will use parameters to help drive processing and flow. See the ETL Control section for specifics relevant to ETL Control interaction with parameter file usage. Each mapping, depending on the style of mapping, will contain standard parameters. Following are some standards and guidelines regarding parameter usage.
1) Parameter files should be referenced at the workflow level only, versus the session level. Should a special parameter exist for a mapping, incorporate it in the workflow-referenced parameter file, tagged appropriately to the session. (See the Informatica parameter file configuration; a sketch of such a file follows this list.)
2) The following parameters are used in all mappings:
   a. $$EXT_FROM_TM - a timestamp
   b. $$EXT_THRU_TM - a timestamp
   c. $$LAST_BUSN_PROC_TM - a timestamp
   d. $$ETL_CREATE_TM - a timestamp
   e. $$SRC_SYS_CD - a string field
   f. $$TGT_SYS_CD - a string field
3) The following parameters are used (in addition to other parameters) in type 2 mappings:
   a. $$ETL_EFF_FROM_TM - a timestamp
   b. $$ETL_EFF_THRU_TM - a timestamp
   c. $$ETL_LAST_MOD_TM - a timestamp
   d. $$INFINITY_TM - a timestamp
4) Other parameters may be used to ensure consistency, such as:
   a. $$CURR_IND_Y, $$CURR_IND_N, $$ACTV_IND_Y, $$ACTV_IND_N - all string fields
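As referenced above, a minimal sketch of a workflow-level parameter file follows. The folder, workflow, session, and values are hypothetical placeholders; only the parameter names come from the standards above, and the section-header syntax should be verified against the Informatica parameter file documentation.

   [FM_FDW.WF:wfk_FDW_TABLE_NAME]
   $$EXT_FROM_TM=2016-09-16 00:00:00
   $$EXT_THRU_TM=2016-09-17 00:00:00
   $$LAST_BUSN_PROC_TM=2016-09-16 23:59:59
   $$ETL_CREATE_TM=2016-09-17 01:00:00
   $$SRC_SYS_CD=SRC1
   $$TGT_SYS_CD=FDW
   [FM_FDW.WF:wfk_FDW_TABLE_NAME.ST:s_FDW_TABLE_NAME]
   $DBConnection_Target=F2INFOD
   $PMSessionLogFile=s_FDW_TABLE_NAME.log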
5.4 Reusable Objects
Any mapping transformation and session can be marked reusable. However, there are key ETL components that should always be marked reusable:
  o Sequence generators
  o Lookups for foreign keys (i.e. reference tables, dimensions, hubs and links)
  o Sessions
Other mapping transformations should be marked reusable when the transformation can be used in more than one mapping or mapplet. Keep in mind the following:
1) Changes to any reusable object may affect everything that uses it. Before applying the change, determine the impact of the change by examining where it is used and how it is used.
2) Lookups - these should be configured to work in the exact same fashion in every case, for ports, return values and conditions. For example, since a lookup condition can be augmented or altered in a mapping, the reusable lookup condition should include only the conditions that are mandatory.

5.5 Shared Folders
Informatica allows the sharing of objects by using shared folders. Sharing objects via a shared folder results in increased maintenance and is managed centrally. Care must be taken when importing or re-importing shared objects in the repository, since many applications may be relying on these shared objects and any relationships that exist between those shared objects. Therefore, sharing of any objects is to be avoided unless approved centrally by the Informatica Administrator. The default for any object should be to place it in a non-shared folder. If an object has been approved for sharing, it should be placed in a shared folder for that project. Each project that uses shared objects should have a corresponding Informatica shared project folder. The name of that folder should include the word "shared" after the project name, for example: FDW_Shared. Shared folders may be appropriate when:
1) Downstream consumers shortcut into the source
2) Reusable objects are made available for downstream consumers
Shared folders are not recommended for the following reasons:
1) Upon deployment, issues often arise where otherwise unaffected mappings that reference shared objects become invalidated
2) The development environment and break/fix operations may conflict, since there may be pending development changes while potential production fixes are occurring
6. Data Quality
6.1 Data Transformations
Currently, no data quality tools have been purchased to assist with the Advantage project. Thus it becomes necessary for Business Analysts to ensure that all data quality related business rules are captured, and for the ETL designers to ensure these business rules are reflected in the ETL designs using the data transformations provided by Informatica. There are basic transformations that are applied regardless of requirements; these are typically common-sense applications of data quality. Some of those are:
1) Trimming of strings - assuming database string columns will be varchar in nature, it is standard to trim leading and trailing spaces. This is necessary to avoid issues with joins and data comparisons.
2) Data type conversion - convert incoming data into the target data type.
   a. For dates, examine the incoming fields for the various forms of dates possible.
   b. For numeric fields, remove any punctuation (commas, currency and percent signs).
3) Case - for certain domain types (codes, identifiers), capitalizing or placing in a consistent case may be warranted.
4) Truncated fields - this is problematic since there will be data loss. However, trimming may help.
5) Missing data - refer to section 6.2.
6) Domain value mapping - there may be instances where there is a known mapping of values into a common set of values. For example, a flag field may contain any of these values to indicate true/false or yes/no (0, 1, Y, y, N, n, X, space). Standardize the flag to either a true/false bit or a Y/N character field for consistency.
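To make a few of these rules concrete, the following are hedged sketches in Informatica expression syntax; the port names (IN_FIRST_NAME, IN_EFF_DT, IN_ACTV_FLG) and the date format are hypothetical:

   // 1) Trim leading and trailing spaces from a string port
   LTRIM(RTRIM(IN_FIRST_NAME))

   // 2a) Convert an incoming string to a date, assuming a known format
   TO_DATE(IN_EFF_DT, 'YYYY-MM-DD')

   // 6) Standardize a true/false flag to a Y/N character field
   IIF(UPPER(LTRIM(RTRIM(IN_ACTV_FLG))) = 'Y' OR LTRIM(RTRIM(IN_ACTV_FLG)) = '1', 'Y', 'N')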
6.2 Default Data

In many instances, there are aspects of data that either require a relationship to another piece of data or are a required field. In these events, fundamental solutions can be instituted to fill in the gaps or build the relationships so the data continues to maintain integrity and value and error conditions are avoided. Some of the options are:
1) Nulls not allowed - when a database column carries a not-null constraint, which requires a non-null value, a default value is often needed when there is no content. The best approach may be to alter the database column and allow nulls. If this is not possible, see the following for possible solutions:
   a. Date - set to some high or low date (e.g. 1900-01-01 or 9999-12-31)
   b. Text (Code) - any default value that is consistent with the domain, e.g. 'UNK'
   c. Text - any default value that is consistent with the domain, e.g. 'Unknown'
   d. Numeric - either 0 or some other value consistent with the domain. The danger here is that the non-null value may skew the expectation of the column meaning.
   e. Error - if the column is truly required and a default value is not acceptable, which will be the case for foreign keys, surrogate keys and ETL audit fields, then return an error condition.
2) Missing relative - this situation occurs when a foreign key constraint is violated. The typical design approach is to stop the process, since there may be an expectation that the relationship must exist. A couple of approaches are:
   a. Ensure the ETL mappings that load the parent or foreign table completely populate the table from the same file that is being used in the downstream mapping. This will ensure the foreign key relationship will exist.
   b. Fall back to a default foreign key or set the key to null. If the field is not null, a default row is identified in the relationship and that is substituted for the key. This may require seeding rows that contain the default key, e.g. 0 for a surrogate key.
3) Data does not fit - this situation occurs when the domain of the field does not match the target column. Should this occur, a default can be set (see Nulls not allowed). A note of caution: the data content will be lost, and if not reported, there will be no indication of the situation.
4) Data priming - if key relationships are required (i.e. loading basic 'default' data into a hub or link), this may require defaulting of a relationship in the event the lookup cannot find the relative. For example, a one-time SQL load can be run to load a hub or link where the surrogate key is some default value (e.g. 0 or -99999). When it is determined that a relative is not found and the relationship is mandatory, the foreign key reference will be set to the default surrogate key.
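A hedged sketch of these defaulting patterns in Informatica expression syntax; the port names, the 'UNK' default, and the -99999 surrogate key default are illustrative only:

   // 1b) Default a missing code value
   IIF(ISNULL(IN_STATUS_CD) OR LTRIM(RTRIM(IN_STATUS_CD)) = '', 'UNK', IN_STATUS_CD)

   // 2b / 4) Fall back to a default surrogate key when the lookup finds no relative
   IIF(ISNULL(LKP_CUST_KEY), -99999, LKP_CUST_KEY)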
7. Informatica Development Guidelines The following sections provide general programming guidelines and should be used as a guide. There may be cases when these guidelines are not appropriate and the ETL team should use its best judgment in exercising these rules. In addition, for those options, features and components not mentioned in these guidelines, reference the relevant Informatica help and product documentation. 7.1 Mapping Guidelines
When creating a mapping, there are key considerations and guidelines that foster ease of maintenance, restartability and performance. These guidelines are:
1) Write a single mapping per target when appropriate. This could be when a single source or multiple sources are needed to populate a target table. This may not make sense when a single source with high volume has to load multiple targets.
2) Try not to make overly complex mappings; instead, break them into multiple logical mappings. This will simplify maintenance and restartability. This may not be possible at all times, such as in cases where better performance is needed.
3) Limit the number of transformations when possible. There is overhead involved in moving data between transformations. Consider more shared memory for a large number of transformations; session shared memory between 12 and 40 MB should suffice.
4) Only connect what is used. Delete un-needed links between transformations. This minimizes the amount of data moved. This is very important in the source qualifier.
5) Avoid using stored procedures unless it is the only option. If the logic can be accomplished through Informatica, use Informatica.
Documentation should be included in all objects within a Mapping or Workflow. In addition, the Mapping or Workflow itself must have documentation that describes the Mapping/Workflow at a high level.
Figure 6 - Mapping Documentation
The mapping/workflow comments should include a high-level description of the mapping/workflow. Objects that should be documented include, but are not limited to, the following:
  o Transformation objects
  o Expression ports
  o Expression code
  o Workflow tasks
  o Workflows and Worklets
  o Sessions
7.3 Transformation Object Documentation
Every Transformation object in a Mapping should be documented. In the first tab of each Transformation object is a place for the description. Include a brief description of the object and its function in the mapping. Documentation should be simple and non-technical. Example:
Figure 7 - Transformation Documentation

7.4 Expression Ports Documentation
Expressions have a variety of ports: Input, Output, Variable, Lookup, etc. Transformations that may be difficult to decipher should have documentation added. Expression code is often self-explanatory; however, some expressions can be complex and not easily understood. In such circumstances, the developer will provide comments, and possibly pseudo-code, to help others understand the purpose and logic of the code. All documentation will follow the code, and each line of documentation will be prefaced with double forward slashes (//). Example:
Figure 8 - Expression Port Documentation
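Because the original example is an embedded screenshot, the following is a hedged textual sketch of a documented expression port (the ports and logic are hypothetical), following the // convention described above:

   IIF(ISNULL(IN_TERM_DT), $$INFINITY_TM, IN_TERM_DT)
   // If no termination date is supplied, substitute the $$INFINITY_TM parameter
   // so that the record remains the current version.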
7.5 Mapping Recovery
To eliminate the need for manual intervention, each mapping must be designed to support automated recoverability. When intervention is required, those exceptions should be clearly documented. The Workflow restart features allow sessions to recover simply by restarting the workflow or session.

7.6 Mapping Templates
The use of mapping templates is essential in supporting maintainable and consistent processing. Each of the templates will contain the standard transformations and flow that are expected for the particular type of mapping. The working templates should be copied by developers to begin the development of mappings that follow the template pattern. PowerCenter Data Stencil, which allows the design of ETL mappings in a Visio format, may also be used. These Visio flows can be reverse engineered into Informatica mappings in order to jumpstart the ETL development work. In some cases, mapplets may also be used to simplify processing or share common code; see the section related to Mapplets.

7.7 Source Qualifier
Informatica has extensive flexibility with its Source Qualifier transformation object. However, with that flexibility, there are guidelines that will help ensure efficiency and maintainability. Those are:
1) SQL Query - overrides are to be avoided; use one only when there is no better option to resolve a complex criteria situation that cannot be achieved other ways. If used, be sure that the columns in the SELECT clause match the order of the ports passed through the transformation. Avoid all non-standard SQL.
2) User Defined Join - when the source qualifier has more than one incoming source, use this to place the join predicates.
3) Source Filter - place any filtering of source data in this box to prevent unnecessary traffic moving across the network and through additional mapping logic.
4) Number of Sorted Ports - use this when there is a need to sort incoming data. Keep in mind that the outgoing ports should be ordered (in this transformation's port section) in the fashion expected.
5) Select Distinct - select this when there is a need to gain unique or distinct sets of data. Keep in mind that this can be an expensive database operation and may not gain the desired result if certain data is still to be transformed (e.g. trimmed).
6) Pre/Post SQL - when using this in the mapping, keep in mind it is embedded in the mapping. It is recommended that Pre/Post SQL be placed in the session to ensure optimal flexibility. When overriding table join logic, define all the source tables in the mapping even if they are not needed.
Figure 9 - Source Qualifier Transformation
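As an illustration of guidelines 1 and 3 above (the tables, columns, and filter are hypothetical), a SQL override keeps the SELECT column order aligned with the Source Qualifier ports, while a simple predicate belongs in the Source Filter, which is entered without the WHERE keyword:

   -- Hypothetical SQL override: SELECT order matches the Source Qualifier port order
   SELECT CUST.CUST_ID,
          CUST.FIRST_NAME,
          CUST.LAST_NAME,
          ADDR.POSTAL_CD
   FROM   CUST
   JOIN   ADDR ON ADDR.CUST_ID = CUST.CUST_ID

   -- Hypothetical Source Filter entry
   CUST.ACTV_IND = 'Y'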
7.8 Filter
Filter transformations are used to effectively narrow down incoming data, based upon a set of criteria, for processing. Some guidelines are:
1) Use filters when a SQL filtering condition is not feasible; otherwise, filter in the Source Qualifier when possible.
2) Keep in mind a filter is considered active, so all points in the processing flow will have to converge to the filter.
3) A filter is appropriate for a true/false condition on whether data is to be processed or ignored (see the sketch after this list).
4) When multiple outcomes or processing dependencies occur, consider a router. See the Router section for more specifics.
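As an illustration of item 3, a hedged sketch of a filter condition (port names are hypothetical); the single condition evaluates to TRUE for the rows that should continue downstream:

   NOT ISNULL(IN_CUST_ID) AND IN_ACTV_IND = 'Y'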
7.9 Joiner

Joiners may be necessary for a number of reasons: joining disparate sources, simplifying source qualifier joins, or addressing performance issues. A few guidelines are:
1) Joiners are expensive transformations; use them only when performance at the database level is prohibitive. This usually occurs when there are too many joins in a SQL statement.
2) Use Joiners to join disparate sources.
3) Prior to joining, ensure that the join predicates have been trimmed or transformed accordingly.
4) Be careful about mixing character types. When joining on a field with varying types and lengths, the join may not work.
5) When configuring the Joiner, the following guidelines are relevant to the 'Properties' tab:
   a. Join Type - this instructs the Joiner to perform an equal (normal) join, a full outer join, or left and right (master/detail outer) joins. Carefully read the Informatica help to select which join to use. Typically, a master outer join is one where all detail rows are kept and the master rows are matched against them. Informatica recommends "using the source with fewer rows as the master source"; hence a master outer join is a better performer than a detail outer join.
   b. Sorted Input - to boost performance, it is often better to presort the incoming data prior to the Joiner. Achieve this with a SQL sort or a Sorter transformation. This can significantly improve performance.
Figure 10 - Joiner Properties
7.10 Router
This is a transformation that directs logic into one or more paths. A router is a more efficient transformation than a filter when multiple pieces of decision logic must be performed. Hence, use a router when:
1) More than one path is taken depending on a filtering condition.
2) A decision path results in one or more true paths.
3) There is a need for a default decision path.

7.11 Sorter
Sorting is a standard need when processing and storing data. However, sorting is a performance-intensive process demanding attention to where the sort occurs and to sorting efficiency. When deciding on the sorting method, consider:
1) Leverage DBMS sorting features when possible. See the section on the Source Qualifier for selecting a sorting option. Sorting in the DBMS reduces Informatica resource usage, shifting that work to the DBMS.
2) Consider using a sorter prior to an aggregator to minimize the impact of resource demands incurred during an aggregation.
3) Sort only those fields necessary for the sort.
4) When configuring the sorter Properties, some settings to note:
   a. Case Sensitive - checked (the default) when case-sensitive sorting is necessary. Uncheck the box to increase efficiency.
   b. Distinct - only select this when unique rows are required. This may be preferable to using an Aggregator transformation.
   c. Null Treated Low - a DBMS will typically sort nulls first. If this is the desired behavior, then select this box to ensure Informatica sorts the data in that fashion.
Figure 11 - Sort Properties
7.12 Aggregator
It may be necessary to aggregate or summarize data, and Aggregator transformations are used to facilitate this type of processing. Often aggregate transformations are used for other purposes, such as de-duplicating records or taking the first, last, max or min values of a record. Aggregators take significant resources and should be used only when essential to processing. When deciding on the aggregation approach, consider:
1) Aggregators are active transformations, requiring all data headed toward a target instance to be passed through the transformation. Since this is the case, consider aggregating either upstream or downstream with few ports.
2) Reserve aggregating for points where the number of records is smaller. Hence, allow filtering and routing to complete before aggregating.
3) When configuring Ports, some settings to note:
   a. GroupBy - select only those fields that are used in the aggregation grouping. Do not select fields if they are not in any way a key. If no ports are selected, the transformation will bring back only one row.
b. Expression – this enables flexible inclusion of functions that are necessary during aggregation. If a field is not grouped, Informatica will default to the first value that is encountered, based upon the GroupBy keys. Use the proper function to acquire the desired result (e.g. max, min, last, first). The expression box is also used to perform aggregation functions, such as sum, avg, count, etc.
Figure 12 - Aggregator Ports
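As a hedged illustration of the deduplication/most-recent-value pattern mentioned above, an Aggregator might group on the business key and use aggregate functions in the non-grouped port expressions; the port names here are hypothetical.

-- GroupBy port: CUSTOMER_ID (the business key)
-- Expression on a non-grouped port, keeping the most recent date seen:
MAX(LAST_UPDATE_DT)
-- Expression keeping the last value read for a port within each group:
LAST(STATUS_CD)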
4) When configuring Properties, some settings to note:
a. Sorted Input – when the incoming keys arrive in sorted order, select this option to optimize performance. If possible, sort the incoming data in advance using either the Source Qualifier or a Sorter transformation.
b. Aggregator Data (Index) Cache Size – in general follow Informatica recommendations when sizing an Aggregator. Keep the default cache size unless performance becomes a concern.
Figure 13 - Aggregator Properties
7.13 Lookup
Lookups are useful transformations that foster change detection as well as other functions. Since most lookups access relational databases, even though flat file lookups are permissible, performance is a key consideration. Hence, Lookup transformations are used when data volumes are relatively small (less than 500 MB) or when no alternative is available. Small lookups can improve performance, while large volumes may necessitate performing SQL joins to gain efficiency. Refer to Informatica documentation for recommended settings; however, the following guidelines should be considered.
1) Only bring in ports that are relevant to the lookup. In other words, do not carry forward into the transformation ports that have no relation to looking up data.
2) When configuring the lookup, the following guidelines are relevant to the 'Ports' tab:
a. Reusable Lookups – make a lookup reusable if it is used in more than one mapping.
b. Ports – when creating a lookup, select only the ports that are required and cache only those columns that are required. See the following figure.
c. Disconnected/Connected Lookups – connected lookups are embedded in the flow of the mapping, while disconnected lookups are referenced in an expression of some sort. Use disconnected lookups when there is a need to return only one port and the lookup may be required multiple times in the mapping. Use connected lookups to return more than one port when it is natural to incorporate them in the logic flow. When using a disconnected lookup, ensure an 'R' return port is selected in the lookup. Select 'O' output for those ports that will be returned from the connected lookup.
Figure 14 – Lookup Transformation Ports
3) Lookup properties provide a range of options to facilitate lookup processing and efficiency. When configuring the lookup, the following guidelines are relevant to the 'Properties' tab:
a. Lookup SQL Override – overrides are to be avoided. Use one only when the lookup complexity is such that more than one table is required to support the lookup condition. If used, be sure that the columns in the SELECT clause match the order of the ports passed through the transformation.
b. Lookup Caching Enabled – select this to cache the table entirely. This is generally the best option for smaller tables; however, if the table is large and it is necessary to perform a lookup, then do not check this box, since caching large tables is very expensive. Refer to Informatica documentation for recommendations regarding cache size and performance configuration.
c. Lookup Policy on Multiple Match – generally take the default of 'Report Error'. If there is a potential to return more than one row and that is acceptable, select one of the other options (return first, last or any).
d. Lookup Condition – this is used as the selection criteria for the lookup. Ensure that the condition will result in no more than one row being returned; otherwise, see 'Lookup Policy on Multiple Match'.
e. Lookup Cache Persistent – check this when the lookup is used by multiple mappings and the source data is unlikely to change during the run.
f. Re-cache from Lookup Source – check this when this is expected to be the first mapping to use the lookup. This is enabled only when 'Lookup Cache Persistent' is enabled. All
subsequent mappings should not have this option checked. This will ensure that the data is refreshed from the source the first time the lookup is used.
Figure 15 - Lookup Transformation Properties
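For the rare case described in 3(a), where a lookup condition spans more than one table, a hedged sketch of an override is shown below; the table, column and port names are illustrative, and the SELECT column order must match the order of the lookup ports.

-- Illustrative Lookup SQL Override joining two tables (hypothetical names).
-- Column order must match the order of the ports in the Lookup transformation.
SELECT c.CUSTOMER_KEY,
       c.CUSTOMER_ID,
       a.POSTAL_CD
FROM   DIM_CUSTOMER c,
       DIM_ADDRESS  a
WHERE  a.CUSTOMER_KEY = c.CUSTOMER_KEY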
7.14 Sequence Generator
The best practice is to use a Sequence Generator transformation to generate surrogate keys rather than relying on the database to generate sequence numbers. The predominant reason is to ensure optimal maintenance and flexibility. A few guidelines are:
1) Make the transformation reusable in all cases. This is to support deployment and visibility into sequence generators.
2) When caching values, ensure the 'Number of Cached Values' is set to appropriately align with the number of rows to process. That is, for master data that may infrequently change and is of low volume, set the count to 10. If the volume is larger, say 10,000 average inserts in a run, set it to 10,000. This is a judgment call, but in no case leave it at 0, since this could cause a performance issue.
3) When multiple mappings can use the same sequence generator at the same time, it is required to cache the sequence values.
Figure 16 - Sequence Generator Properties
7.15 Update Strategy
The Update Strategy is used to describe the type of action ultimately occurring on the target, whether an insert, update or delete.
1) Place this transformation as the last transformation prior to the target.
2) Use this transformation in all mappings to ensure the mapping controls the target processing. Otherwise, the session must be configured to guide target behavior.
3) Constraint based loading may not operate properly when used in tandem with Update Strategies.
4) When configuring the transformation, the following guidelines are relevant to the 'Properties' tab:
a. Update Strategy Expression – rather than entering numeric values, enter 'DD_INSERT', 'DD_UPDATE' or 'DD_DELETE', for maintainability purposes.
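A hedged example of such an expression is shown below. It assumes an upstream lookup has returned the existing target key and a change-detection checksum; the port names are hypothetical.

-- Flag the row for insert when no target row was found,
-- for update when the checksum differs, otherwise reject it.
IIF(ISNULL(lkp_TARGET_KEY), DD_INSERT,
    IIF(in_CHECKSUM_TXT != lkp_CHECKSUM_TXT, DD_UPDATE, DD_REJECT))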
Figure 17 - Update Strategy
7.16 Target
The Target transformation is the descriptor of the target and also offers several opportunities for SQL override.
1) When configuring the target, the following guidelines are relevant to the 'Properties' tab:
a. Reject Truncated/Overflowed Rows – leave this unchecked, since rejection processing will occur in accordance with the reject/error handling standards and guidelines.
b. Update Override – do not use this unless absolutely necessary. It should be extremely rare that an override is used. If it is, the ETL architect should approve and ensure this is documented appropriately. Also, if an override is necessary, consider using it in a session rather than a mapping.
c. Table Name Prefix – rather than placing the prefix here, leverage DBMS synonyms or other features.
d. Pre/Post SQL – this can be used to facilitate DBMS actions that may aid in target processing, such as index drops, index creates, or running statistics. It is not recommended to place SQL in the mapping. To foster maintainability, place pre/post SQL in the session pre/post functions.
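A minimal sketch of the kind of pre/post-session SQL described in 1(d), assuming an Oracle target; the index and table names are illustrative only.

-- Pre-session SQL (hypothetical names): drop the index before a large load
DROP INDEX IDX_SAT_CUSTOMER_01;
-- Post-session SQL: recreate the index and refresh optimizer statistics
CREATE INDEX IDX_SAT_CUSTOMER_01 ON SAT_CUSTOMER (CUSTOMER_KEY, EFF_FROM_TM);
BEGIN DBMS_STATS.GATHER_TABLE_STATS(USER, 'SAT_CUSTOMER'); END;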
Figure 18 - Target Properties
7.17 Mapplet
The mapplet object is very useful in supporting reusability, reducing mapping complexity and fostering maintainability. These are situations where mapplets are appropriate:
1) Elements of ETL common to multiple mappings.
2) Complex logic that will simplify processing. When a mapping begins to drift from the standard templates, a mapplet is a mechanism to encapsulate the differing logic.
3) When multiple sources are required to populate a single target and it becomes necessary to separate the sources based upon timing or performance.
a. Each mapping will incorporate the source(s) and place the data into a common image to import into the mapplet.
b. The mapplet will contain the common processing that will occur regardless of the source. This will include transforming the data, performing a lookup in the target, and passing the information needed to determine the course of action (insert, insert/expire, update).
The following diagram depicts a 'Hub' mapping with multiple sources. Should the mapping require separation, based upon source, the transformations in the box can be separated out into a mapplet. What may occur is that any aggregation and union required due to processing multiple sources may be removed. Hence, ultimately the mapplet will contain a Mapping Input, Expression, Lookup, Expression and Mapping Output. The mapping itself will continue to contain any Filter or Router to ultimately process the target.
Figure 19 - Mapplet - Hub Example W/O Mapplet
The following figure depicts the removal of those common elements as well as those transformations, such as unions and aggregators, that may no longer be required.
Figure 20 - Mapplet - Hub Example with Mapplet
7.18 Session
Sessions are wrappers to mappings that provide the execution environment parameters, configuration and instructions to facilitate mapping execution. The initial setup of a session is essential to the long term operation and execution of a mapping. When configuring a session, follow these guidelines:
1) All sessions should be created as reusable and initially configured to be production ready. When testing, a copy of the session can be dropped into a worklet/workflow, ultimately preserving the pristine version.
2) When configuring the session 'Properties' tab:
a. Parameter Filename – keep this blank in all but rare cases. The parameter file should be referenced at the worklet/workflow levels.
b. Enable Test Load – for the pristine version, keep this unchecked.
c. Treat rows as – the setting for this depends on the type of mapping and database actions. For example, for Hub and Link mappings set to 'Insert' or 'Data Driven' (assuming there is an Update Strategy before each target); for Satellite mappings set to 'Data Driven'.
d. Commit Type – set to 'Target'.
e. Commit Interval – the default of 10000 may be sufficient; however, this should be tailored to the target load volume and altered to improve performance when deemed necessary. Generally, finding the right commit interval is a balance and testing may be required.
Figure 21 - Session Properties
3) When configuring the session 'Config Object' tab:
a. Constraint based load ordering – only check this when there is a need to sequence the order of loading targets in a mapping. Constraint based loading may not operate properly when used in tandem with Update Strategies.
b. Cache LOOKUP() Function – check this box in all cases unless the lookup is very large and the performance is acceptable.
c. Save session log for these runs – set this to 10.
d. Stop on errors – set this to 1. This should never be set to anything different, unless there is a rare instance where errors are acceptable.
e. Override tracing – keep this set to the default, which provides the normal mode of tracing.
Figure 22 - Session Config Object
4) When configuring the session 'Mapping' tab:
a. Connections – use the DB connection parameters present in the associated workflow parameter file. Do not directly reference the connection other than through the parameter.
b. Target Load Type – the options are 'Normal' or 'Bulk'. Use 'Bulk' for loading stage tables, transactions or tables that are significant in size.
c. Insert – for 'Insert' only mappings, do not check any of the other Update options, since a performance hit may occur.
d. Update as Update – this is typically selected for all Satellite mappings.
e. Update as Insert – not typically used.
f. Update else Insert – not typically used, since the norm is to use an Update Strategy for update mappings.
g. Delete – use only when deletes are required.
h. Truncate Target Table Option – Informatica allows the truncate of a table by checking the 'Truncate' option for the target table in the session properties. When Informatica cannot issue a truncate, it might issue a delete statement, which can cause performance issues. When using native database connections, use the 'Truncate' checkbox option. The user on the connection must have access to truncate the target table. If the connection owner does not have truncate/drop privileges on the target table, this method cannot be used. If the connection owner is not the owner of the tables, then the table owner must be specified in the session properties override. There may be additional restrictions imposed by the RDBMS that should be considered when using this option. For example, Oracle does not allow truncate via a synonym. In this case, the owner of the target table must be specified for the truncate to work. Verify through the session log that the truncate is working. When the use of this truncate option is not viable (for example when using ODBC connections), perform the table truncate via the session's pre-SQL property. When removing all data from a target table that might be large, 'delete' should be the final option.
i. Pre/Post SQL – use this sparingly; instead, placing this in a SQL task in the workflow/worklet will make it more visible.
Figure 23 - Session Mapping
5) When configuring the session 'Components' tab:
a. On Failure E-Mail – every session should reference a reusable task that is used to send out consistent emails. Upon failure the email should be sent out.
i. Name – reference the email account that is used in supporting the application. Do not put in a personal email.
ii. Subject – send the session name (%s) and text indicating a failure.
iii. Text – attach more detailed text that contains the session name (%s), session status (%e), number of rejected records (%r), session start and end times (%b, %c), session elapsed time (%i) and attach the session log (%g).
Figure 24 - Session Component Tab
Figure 25 - Failure Email
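An illustrative subject and body for the reusable failure email task, built only from the variables listed above; the exact wording is a suggestion.

Subject: ETL FAILURE: session %s
Body:    Session %s completed with status %e.
         Rejected rows: %r
         Start: %b   End: %c   Elapsed: %i
         The session log is attached (%g).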
7.19 Workflow/Worklet
Workflows and worklets are the mechanisms to orchestrate the execution of one or more sessions and tasks. Workflows have configuration parameters that define the execution environment. When configuring a workflow or worklet, follow these guidelines:
1) All workflow logs are written to the workflow log directory for that project.
2) All session logs are written to the session log directory for that project.
3) When configuring the worklet/session 'General' tab:
a. All worklets, sessions and tasks within workflows should have the "Fail parent if this task fails" option selected. This will cause the workflow/worklet to end in failure when any task within it fails.
Figure 29 – Worklet/Session General
4) When configuring the workflow 'Properties' tab:
a. Parameter Filename – the parameter file should be referenced here, rather than in individual sessions. All sessions and worklets that operate within the workflow will use the referenced parameter file unless overridden at a session level.
b. Save workflow log for these runs – keep the default of 10 unless there is a reason to alter it.
Figure 30 - Workflow Properties
5) In the workflow, ensure there are conditions set between worklets and sessions to verify that the prior session/worklet completed successfully. This ensures that subsequent sessions/worklets only process when that condition is met.
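An illustrative link condition (the session name is hypothetical); placed on the link between two tasks, it allows the next task to start only when the prior task succeeded.

$s_m_ODS_LOAD_HUB_CUSTOMER.Status = SUCCEEDED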
Figure 26 - Worklet Condition
8. Change Data Capture (CDC)
Change data capture (CDC) is a technique to determine the effect of a change in the source, allowing history to be captured. The change detection process will determine if the source data is new, has changed, was removed or is missing. There are several options in detecting CDC, which depend on the type of the source:
1) Full master files – snapshot images of the data. Deletes are implicit.
2) Incremental delta files – each image of data is tagged as either new, changed or removed (explicit deletes).
8.1 Full Master
A full master file depicts the snapshot image of a data source. Since there would not be any transaction codes or indication regarding the event that added, changed or removed data, the ETL process would have to compare the source with the current target data in order to determine the changes. Due to performance, full comparisons may not be possible with high volume data because they would require a full outer join of the source and target. To ensure the ETL mappings are consistent in style and approach, use the standard templates. See the section on Mapping Templates.
8.2 Incremental
An incremental file should explicitly indicate the event that occurred with the data. Hence, there should be some sort of transaction code that indicates whether the data is new, changed or removed. The ETL approach to determining changes is straightforward since the transaction determines the outcome. The process to determine changes is:
1) New (insert) – essentially insert the data into the database. If the data already exists, then error out the row and report the error. This situation only occurs if the data has already been processed or the source system presented the data in error.
2) Changed (update) – take the incoming key fields and look up the corresponding data in the target table using a Lookup transformation. If the result of the transformation indicates the row is found, then apply the changes. If the data is not found, then error out the row and report the error.
3) Removed (delete) – take the incoming key fields and look up the corresponding data in the target table using a Lookup transformation. If the result of the transformation indicates the row is found, then apply the delete. If the data is not found, then error out the row and report the error.
Another option might be to perform an upsert process.
8.3 CDC Techniques
To achieve CDC processing, Informatica offers a few options. For information purposes the following describes a couple of techniques, with the preferred one listed first.
1) Checksum Function – calculating the checksum value of data is an effective mechanism for detecting change. Rather than comparing incoming fields to prior fields, which can lead to errors and inefficient processing, storing the checksum value is the preferred approach. What will occur is:
a. Calculate a checksum of the key incoming data using a function (like MD5) and store it with the data in the target table.
b. When looking up the data in the database, the lookup will return only the checksum value and the minimal columns needed to support CDC. This will minimize the size of the lookup.
c. Should the CDC values differ (incoming vs. lookup), the row is targeted for change; otherwise it is bypassed.
d. When calculating the checksum, always include the surrogate key, foreign keys, and satellite business data. Do not include the ODS ETL columns (effective dates, current indicator, active indicator, last modify date, create date, created by id and modified by id).
2) Dynamic Lookup – this type of lookup enables the inserts and updates of the data to be applied to the lookup cache. After the row is added or altered in the cache, the result will be returned to the ETL flow allowing for subsequent insert/update processing. Additional logic is necessary to synchronize the lookup cache with the target table. This approach is useful if it is necessary to keep the cache up to date during processing. See Informatica documentation.
3) Timestamp Detection – use a timestamp maintained on the source to determine source changes.
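A hedged sketch of the checksum expression from step 1a, using the PowerCenter MD5() and TO_CHAR() string functions; the port names are illustrative, and a delimiter is concatenated between ports so adjacent values cannot run together.

-- Illustrative change-detection checksum over business data (hypothetical ports).
-- Excludes the ODS ETL columns, per guideline d above.
MD5( TO_CHAR(CUSTOMER_KEY)     || '|' ||
     TO_CHAR(ADDRESS_KEY)      || '|' ||
     CUSTOMER_NM               || '|' ||
     TO_CHAR(CREDIT_LIMIT_AMT) )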
9. Logging
A typical solution that uses Informatica and non-Informatica applications will have several layers of logging:
1) Informatica session logging
2) Informatica workflow logging
3) Non-Informatica logging
9.1 Informatica Session Logging
Informatica will log data at run time at the session level for every session. Informatica's default is to give the log file the same name as the session with a .log extension and to overwrite the log file every time the session runs. Informatica also has the ability to archive log files; in this case, the old log files are stored in the same directory as the current log files but with a different name. Informatica can store log files for a specified number of runs or based on timestamp (the lowest level of granularity being a minute). To store log files for a specified number of runs, Informatica adds a sequential number to the end of the log file name (0 representing the current log). This setting is available at the server level (affecting all applications) or at the individual session level (requiring more control and maintenance). To store log files by timestamp, the setting is configured at the individual session level (requiring more control and maintenance). The standard is to allow Informatica to maintain log files and history. One advantage of letting Informatica generate log files is illustrated by this example: if a session processes twice in a schedule, it may be advantageous to have access to the log data from both runs; this would be lost if a single log were retained. The standard for session logging is to use Informatica's versioning of log files with a default setting of keeping 10 logs. With each run, log 9 is removed, log 8 becomes log 9, log 7 becomes log 8, and so on.
Informatica maintains workflow log details in a similar manner to session log details. The standards for workflow logging are the same as for session logging: use Informatica's versioning of log files and retain 10 versions of the log.
Non-Informatica applications (for example Unix scripts) should log status data to flat files. This can be used for troubleshooting purposes. These logs should be stored by application to allow easier maintenance (archiving). These files will be generated with dates and times and maintained in a similar manner to Informatica session and workflow logs. Run time statistics for non-Informatica processes may be captured in a relational table. This can be used to monitor run time statistics.
10. Error/Exception Handling and Recovery
Informatica has features that will assist in the handling of errors and recovery. When combined with standard error and recovery practices, this provides a solid framework.
10.1 Essential Error Trapping and Recovery
This approach contains the basic, essential features that are inherent in Informatica and the design of the ETL mappings. It consists of minimal data handling to avoid potential errors, plus session, worklet and workflow configuration. In many instances, there are aspects of data that either require a relationship to another piece of data or are required fields. In these cases, fundamental solutions can be instituted to fill in the gaps or build the relationships so the data continues to maintain integrity and value and error conditions are avoided. The configuration of sessions, worklets and workflows affects the operation of data loads. The configuration is flexible, allowing error handling to be tightened or loosened. If an enterprise scheduler is also required, then the failure and recovery of jobs can be controlled at both levels. Some guidelines:
1) All sessions should have an error threshold of 1. Hence, the session will fail when the first error is encountered.
2) All sessions and worklets should be configured to fail the parent worklet/workflow when they fail.
Figure 27 - Session Runtime Configuration
3) Conditions should be set between sessions or worklets that evaluate the status of the prior session or worklet.
10.2 Informatica Errors and Warnings
Fatal errors often occur due to a violation of an environment condition (hardware, infrastructure services, and ETL services), database or ETL logic/function. In addition, errors due to data anomalies may occur, but those are typically manifested in an ETL function or database constraint error condition. Warnings in the context of ETL are those conditions where data situations or issues are encountered and handled according to a predefined set of rules. Encountering these types of conditions may require alteration of the pre-defined set of rules or a fall back to a default result. Data quality programs and best practices will define the approach to and handling of data aberrations. The following table depicts the typical conditions and the action typically taken (Condition / Warning or Error / Action Taken):
- Informatica mapping transformation or function failure / Error / Developer must correct the failing function or transformation.
- SQL failure condition (key constraint violation, null constraint violation, data type violation, foreign key violation, general SQL failure) / Error / Developer or DBA must correct the condition.
- Workflow/worklet flow failure / Error / Developer must correct the failing workflow or worklet.
- Data type of incoming data does not match target data / Warning / The target field is defaulted to a standard value.
- Foreign key violation / Warning / Default the foreign key to a standard key value.
- Null constraint violation / Warning / The target field is defaulted to a standard value.
Figure 28 - Warning and Failure Conditions
11. ETL Control
An ETL control process needs to be defined in order to manage ETL execution and record runtime statistics. Since the acquisition of data is often time driven, the ETL process may require a set of dates and/or time intervals by which data is filtered for extraction. This section includes a sample ETL Control solution. The actual ETL Control process will be designed based on Freddie Mac business requirements.
ETL Control Table – this database table contains information to help drive ETL processing for a specific source. This may include dates and/or times that indicate the scope of data extracted from a source.
ETL Transform Rule – since hard-coding codes or other attributes is not recommended, this database table is used to take the place of hard-coded values or filtering conditions in ETL. For ETL control purposes, this will contain rules to help drive the generation of the ETL Control.
Parameter File – these are parameters used within Informatica components to drive the ETL process. In terms of ETL control, key dates or times will be placed into the parameter file to give the Informatica components (mappings) a means to filter or narrow the scope of incoming data from a particular source.
11.1 ETL Control Data Model
The ETL_CONTROL table is used to assist in the automation of the ETL process. It essentially contains dates for a specific source/target combination used to drive the scope of the incoming data, as well as to record the process event. The following lists the columns in ETL_CONTROL.
ETL_CONTROL_KEY (INTEGER) – surrogate key.
SOURCE_SYS_CD (VARCHAR2(20)) – the code that represents the source system.
TARGET_SYS_CD (VARCHAR2(20)) – the code that represents the target system.
EXTRACT_FROM_TM (TIMESTAMP) – the date from which date-sensitive data is extracted. It is used to populate effective-from dates in versioned tables.
EXTRACT_THRU_TM (TIMESTAMP) – the date through which date-sensitive data is extracted (inclusive).
LAST_BUSN_PROC_TM (TIMESTAMP) – the date and/or time through which the data is valid. Typically, this is yesterday in the EDW, yet could be the end of the month for a data mart. Used to populate the ETL create and last update dates.
ETL_START_TM (TIMESTAMP) – the date and time the ETL run began.
ETL_END_TM (TIMESTAMP) – the date and time the ETL run completed.
ETL_PROC_COMPLETE_FLG (INTEGER) – a 0/1 flag that indicates: 0 – run in process or halted, 1 – successful completion.
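For reference, a DDL sketch of the table as described above (Oracle syntax); the constraint name and physical design details are illustrative only.

-- Illustrative DDL for the ETL control table described above.
CREATE TABLE ETL_CONTROL (
    ETL_CONTROL_KEY        INTEGER       NOT NULL,
    SOURCE_SYS_CD          VARCHAR2(20)  NOT NULL,
    TARGET_SYS_CD          VARCHAR2(20)  NOT NULL,
    EXTRACT_FROM_TM        TIMESTAMP,
    EXTRACT_THRU_TM        TIMESTAMP,
    LAST_BUSN_PROC_TM      TIMESTAMP,
    ETL_START_TM           TIMESTAMP,
    ETL_END_TM             TIMESTAMP,
    ETL_PROC_COMPLETE_FLG  INTEGER,
    CONSTRAINT PK_ETL_CONTROL PRIMARY KEY (ETL_CONTROL_KEY)
);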
Some key notes about the population of this table are:
- Once the ETL_CONTROL is initially configured (set upon deployment), it is completely maintained by the ETL process.
- Each time a workflow commences execution, the ETL process will roll the ETL_CONTROL forward based upon a rule established in the ETL_TRANSFORM_RULE table.
- At successful completion of the workflow, the appropriate row will be closed by updating the completion flag and ETL end times.
- The ETL_CONTROL dates are subsequently placed into the appropriate parameter file. See the following sections.
11.2 ETL Transform Rule Data Model
For purposes of supporting ETL control processing, the ETL_TRANSFORM_RULE table is defined as follows:
ETL_TRANSFORM_RULE_KEY (NUMBER) – surrogate key.
RULE_PRIMARY_ID (VARCHAR2(30)) – the primary identifier or key.
RULE_SECONDARY_ID (VARCHAR2(30)) – the secondary identifier or key.
RULE_VALUE_NBR (NUMBER) – a numeric value that is used in the rule, e.g. 1.
RULE_VALUE_DT (DATE) – a date that is used in a rule.
RULE_VALUE_TXT (VARCHAR2(255)) – the text value that is used for a rule. For ETL control purposes, it will be HOURLY, DAILY, MONTHLY, YESTERDAY (or any other type).
RULE_DESC (VARCHAR2(2056)) – a detailed description and purpose of the rule.
VALID_FROM_DT (DATE) – the effective start date for the rule, e.g. 01-JAN-1900.
VALID_THRU_DT (DATE) – the last day the rule is valid, e.g. 31-DEC-9999.
11.3 Rolling ETL Control Forward
Rolling the ETL Control forward is critical to the automated load process. The ETL process will read the ETL_CONTROL table for the applicable source and target combination and roll the effective time forward, based upon rules set forth in the ETL_TRANSFORM_RULE table. Since the ETL_TRANSFORM_RULE table contains the rules regarding how to increment the effective dates, refer to the following for the possible increment situations.
1) Get the latest (most recent EXTRACT_FROM_TM and completed successfully) ETL_CONTROL row based upon the source/target combination (SOURCE_SYS_CD & TARGET_SYS_CD).
2) Get from the ETL_TRANSFORM_RULE table the rule for incrementing by matching the SOURCE_SYS_CD & TARGET_SYS_CD to RULE_PRIMARY_ID & RULE_SECONDARY_ID respectively for the valid date range.
3) Increment based upon RULE_VALUE_TXT, which contains the increment cycle for the source/target. When set to:
'HOURLY' – the timestamp will roll to the next time period that is not in the future
1. ETL_CONTROL.EXTRACT_FROM_TM is set to the most recent ETL_CONTROL.EXTRACT_THRU_TM + 1 second
2. ETL_CONTROL.EXTRACT_THRU_TM is set to the most recent ETL_CONTROL.EXTRACT_THRU_TM + xx hours (xx = RULE_VALUE_NBR)
3. ETL_CONTROL.LAST_BUSN_PROC_DT is set to EXTRACT_THRU_TM
'DAILY' – the timestamp will roll to the next time period that is not in the future
1. ETL_CONTROL.EXTRACT_FROM_TM is set to the most recent ETL_CONTROL.EXTRACT_THRU_TM + 1 second
2. ETL_CONTROL.EXTRACT_THRU_TM is set to the most recent ETL_CONTROL.EXTRACT_THRU_TM + xx day(s) (xx = RULE_VALUE_NBR)
3. ETL_CONTROL.LAST_BUSN_PROC_DT is set to EXTRACT_THRU_TM
'YESTERDAY' – the timestamp will roll through yesterday
1. ETL_CONTROL.EXTRACT_FROM_TM is set to the most recent ETL_CONTROL.EXTRACT_THRU_TM + 1 second
2. ETL_CONTROL.EXTRACT_THRU_TM is set to yesterday's date
3. ETL_CONTROL.LAST_BUSN_PROC_DT is set to EXTRACT_THRU_TM
'MONTHLY' – the timestamp will roll to the next month that is not in the future
1. ETL_CONTROL.EXTRACT_FROM_TM is set to the first day of the month obtained by adding one month to the most recent ETL_CONTROL.EXTRACT_FROM_TM
2. ETL_CONTROL.EXTRACT_THRU_TM is set to the end of the month for the new EXTRACT_FROM_TM
3. If the dates are in the future, the ETL will ABORT since it is attempting to process future data
Examples:
HOURLY – the increment frequency is set to 'HOURLY', incrementing by 4 hours (recent row -> new row):
SOURCE_SYS_CD: PSFT -> PSFT
TARGET_SYS_CD: STG -> STG
EXTRACT_FROM_TM: 15-OCT-2008 00:00:00 -> 15-OCT-2008 04:00:00
EXTRACT_THRU_TM: 15-OCT-2008 03:59:59 -> 15-OCT-2008 07:59:59
LAST_BUSN_PROC_DT: 15-OCT-2008 03:59:59 -> 15-OCT-2008 07:59:59
DAILY – the increment frequency is set to 'DAILY', incrementing by 1 day:
SOURCE_SYS_CD: PSFT -> PSFT
TARGET_SYS_CD: STG -> STG
EXTRACT_FROM_TM: 15-JUN-2008 00:00:00 -> 16-JUN-2008 00:00:00
EXTRACT_THRU_TM: 15-JUN-2008 23:59:59 -> 16-JUN-2008 23:59:59
LAST_BUSN_PROC_DT: 15-JUN-2008 23:59:59 -> 16-JUN-2008 23:59:59
MONTHLY – the increment frequency is set to 'MONTHLY', incrementing by 2 months:
SOURCE_SYS_CD: PSFT -> PSFT
TARGET_SYS_CD: IDR -> IDR
EXTRACT_FROM_TM: 01-MAY-2008 00:00:00 -> 01-JUL-2008 00:00:00
EXTRACT_THRU_TM: 30-JUN-2008 23:59:59 -> 31-AUG-2008 23:59:59
LAST_BUSN_PROC_DT: 30-JUN-2008 23:59:59 -> 31-AUG-2008 23:59:59
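As one way to visualize the DAILY rule, a minimal SQL sketch follows; in practice this logic lives in the ETL Control start mapping, and the source/target codes, sequence name and 1-day increment shown here are illustrative.

-- Illustrative DAILY roll-forward: create the next ETL_CONTROL row from the
-- most recently completed one (ETL_CONTROL_SEQ is a hypothetical key sequence).
INSERT INTO ETL_CONTROL
       (ETL_CONTROL_KEY, SOURCE_SYS_CD, TARGET_SYS_CD,
        EXTRACT_FROM_TM, EXTRACT_THRU_TM, LAST_BUSN_PROC_TM,
        ETL_START_TM, ETL_PROC_COMPLETE_FLG)
SELECT ETL_CONTROL_SEQ.NEXTVAL,
       c.SOURCE_SYS_CD,
       c.TARGET_SYS_CD,
       c.EXTRACT_THRU_TM + INTERVAL '1' SECOND,
       c.EXTRACT_THRU_TM + NUMTODSINTERVAL(1, 'DAY'),   -- xx days (xx = RULE_VALUE_NBR)
       c.EXTRACT_THRU_TM + NUMTODSINTERVAL(1, 'DAY'),
       SYSTIMESTAMP,
       0
FROM   ETL_CONTROL c
WHERE  c.SOURCE_SYS_CD = 'PSFT'
AND    c.TARGET_SYS_CD = 'STG'
AND    c.ETL_PROC_COMPLETE_FLG = 1
AND    c.EXTRACT_FROM_TM = (SELECT MAX(x.EXTRACT_FROM_TM)
                            FROM   ETL_CONTROL x
                            WHERE  x.SOURCE_SYS_CD = c.SOURCE_SYS_CD
                            AND    x.TARGET_SYS_CD = c.TARGET_SYS_CD
                            AND    x.ETL_PROC_COMPLETE_FLG = 1);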
11.4 Parameter File and ETL_CONTROL Synchronization
Any date sensitive parameter is updated after the ETL_CONTROL has rolled forward for a run (see prior sections) and prior to any session processing. The parameter dates are populated from the ETL_CONTROL using the parameter file initialization mapping (m_PARAMETER_FILE_INITIALIZATION). The synchronization occurs as follows:
1) The ETL_CONTROL is rolled forward (see mapping s_m_EDW_ETL_CONTROL_START) based upon the parameter source/target values. Note – if new date sensitive parameters are required, the mapping must be changed to incorporate the new parameter.
2) A copy of the session that was dropped into a worklet points to its parameter file.
3) The mapping will read the parameter file, detect those strings that contain a date and substitute the dates that exist in the ETL_CONTROL. See the following table (parameter – value taken from ETL_CONTROL after it is rolled forward), and the sample parameter file sketched after this list.
$$EXT_FROM_TM – EXTRACT_FROM_TM. Note – upon table initialization, this will be set to '1900-01-01 00:00:00'.
$$EXT_THRU_TM – EXTRACT_THRU_TM.
$$ETL_EFF_FROM_TM – EXTRACT_FROM_TM.
$$ETL_EFF_THRU_TM – EXTRACT_FROM_TM - 1 second. Note – upon inserts the ETL logic will use the $$INFINITY_TM parameter, whereas for an update of a row this value ($$ODS_EFF_THRU_TM) is used.
$$LAST_BUSN_PROC_TM – LAST_BUSN_PROC_TM.
$$ETL_LAST_MOD_TM – system time.
$$ETL_CREATE_TM – system time.
$$INFINITY_TM – set to '9999-12-31 00:00:00'.
Figure 29 - Parameter and ETL_CONTROL Sync
4) A temporary file is created and subsequently renamed by a command task to the expected parameter file name.
5) The next worklet in the execution path will pick up the updates. Note – the workflow will continue to have the old values of the parameter file, whereas the worklets will properly pick up the values in the updated file.
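For illustration, a fragment of the kind of parameter file the initialization mapping rewrites might look like the following; the folder, workflow, connection and date values are hypothetical, and the section heading follows standard PowerCenter parameter file conventions.

[ODS_LOAD.WF:wf_ODS_LOAD_CUSTOMER]
$DBConnection_SRC=CONN_PSFT
$DBConnection_TGT=CONN_ODS
$$EXT_FROM_TM=2008-06-16 00:00:00
$$EXT_THRU_TM=2008-06-16 23:59:59
$$ETL_EFF_FROM_TM=2008-06-16 00:00:00
$$ETL_EFF_THRU_TM=2008-06-15 23:59:59
$$LAST_BUSN_PROC_TM=2008-06-16 23:59:59
$$INFINITY_TM=9999-12-31 00:00:00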
11.5 ETL Start
The initial ETL Control mapping is key to ensuring ETL processing executes properly. This mapping fulfills the following needs:
1) Acquires the most recent ETL_CONTROL row for the indicated source and target combination where the completion is successful:
ETL_CONTROL.SOURCE_SYS_CD = '$$SRC_SYS_CD' AND ETL_CONTROL.TARGET_SYS_CD = '$$TGT_SYS_CD' AND ETL_CONTROL.ETL_PROC_COMPLETE_FLG = 1
2) It performs a lookup in ETL_TRANSFORM_RULE to acquire the increment rule (see the section on Rolling ETL Control Forward).
3) The process will stop processing when any of the following conditions are met:
a. Could not find a matching ETL_CONTROL row
b. Found an ETL_CONTROL row already existing for the increment start date
c. The increment start date is in the future
d. The increment end date is before the increment start date
4) The dates are incremented and inserted as a new row into ETL_CONTROL
Figure 30 - Start ETL Control Mapping
11.6 ETL End
The ending ETL Control mapping simply closes out the workflow run by recording the relevant completion code and timing:
1) Acquires the most recent ETL_CONTROL row for the indicated source and target combination where the run is still in process:
ETL_CONTROL.SOURCE_SYS_CD = '$$SRC_SYS_CD' AND ETL_CONTROL.TARGET_SYS_CD = '$$TGT_SYS_CD' AND ETL_CONTROL.ETL_PROC_COMPLETE_FLG = 0
2) The process will stop processing when any of the following conditions are met:
a. Could not find a matching ETL_CONTROL row
b. Multiple rows are returned
3) The relevant completion time and status are updated in the ETL_CONTROL row.
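A minimal SQL sketch of step 3; the mapping performs the equivalent update, and the source/target codes shown are illustrative.

-- Illustrative close-out of the in-process ETL_CONTROL row.
UPDATE ETL_CONTROL c
SET    c.ETL_END_TM            = SYSTIMESTAMP,
       c.ETL_PROC_COMPLETE_FLG = 1
WHERE  c.SOURCE_SYS_CD         = 'PSFT'   -- $$SRC_SYS_CD
AND    c.TARGET_SYS_CD         = 'STG'    -- $$TGT_SYS_CD
AND    c.ETL_PROC_COMPLETE_FLG = 0;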
Figure 31 - End ETL Control Mapping
11.7 ETL Control Workflow
The ETL Control components in the workflows orchestrate the flow of incoming data. Essentially, there are three events that drive the process: first, start the ETL process rolling forward the ETL_CONTROL; second, initialize the parameter file based upon ETL_CONTROL; and finally, close out the ETL_CONTROL recording completion. The following worklet diagram depicts the execution of the ETL roll forward and parameter initialization process.
[Worklet diagram: Start -> s_m_ODS_LOAD_ETL_CONTROL_START -> s_m_ODS_LOAD_PARAMETER_FILE_INITIALIZATION -> c_ODS_Move_New_Control_File]
Figure 32 - Sample Worklet to Roll Forward ETL_CONTROL & Initialize Parameter File
The following diagram depicts the worklet that marks the completion of the ETL process.
[Worklet diagram: Start -> s_m_ODS_LOAD_ETL_CONTROL_END]
Figure 33 - Sample Worklet for ETL_CONTROL Completion
Finally, the following workflow depicts the start and end worklets that drive the ETL process.
[Workflow diagram: Start -> wl_ODS_START -> wl_ODS_1 -> wl_ODS_2 -> wl_ODS_3 -> wl_ODS_END, with status-based link conditions between the worklets]
Figure 34 - Sample Workflow Depicting ETL Flow
12. UNIX
12.1 UNIX Data Directories
The Informatica installation (INFA_HOME) is in "/opt/Informatica/PowerCenter8.1.1". The project specific folders and Unix scripts will reside under "$INFA_HOME/server/infa_shared". The Informatica installation and working directories could reside on a disk mounted separately from the sftp directories. This will ensure that if there is a disk failure on the SFTP drive, Informatica can continue processing any files that have been moved to the working directory. Additionally, an archive directory will also exist on the SFTP drive. After an Informatica file has been successfully processed through an entire ETL cycle it will be moved to the archive directory and compressed. It should not be archived before this time, as until a cycle is complete there is no certainty that a file or file set is valid. Only after it has been processed successfully can it be classified as a good file, and only good files should be archived and compressed.
The shared folder structure of the Informatica installation directory contains folders that, when referenced with Informatica parameters, will house data used in ETL processing. The server will have a base set of folders, and each project, corresponding to an Informatica folder, will have its own set of folders as well. The following is the listing of the base set of directories:
TgtFiles, BWParam, Backup, LkpFiles, SrcFiles, BadFiles, WorkflowLogs, Cache, Storage, Log, Temp, SessLogs
In the Production and QA environments these folder structures will need to be restricted to the user id and/or group under which the id executing the Informatica code runs. In the development environment, each developer will need read/write/execute (7) level access to these folders in order to accommodate development and testing scenarios. To accommodate this, each developer will need a user id in the group which owns the folder (if 775 is used as above) or the access will need to be updated to 777.
12.2 UNIX Landing Directories
The UNIX landing zones are directories in which ftp files from the source systems will be stored. There are two classifications of landing zones: there will be inbound and outbound directories for each application. There must be specific ftp users created for performing the sftp of files.
1. SFTP Inbound: /fmacdata/{project}/datain
2. SFTP Outbound: /fmacdata/{project}/dataout/
12.3 UNIX Informatica Environment Variables
The .profile has the variables and paths, such as INFA_HOME, set to allow Informatica system commands to run. Manual startup and shutdown of the Informatica domain and services is performed with the infaservice.sh script: $INFA_HOME/server/tomcat/bin/infaservice.sh
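For illustration, the script is typically invoked with a startup or shutdown argument (verify the exact usage against the installed version):

# Start the Informatica domain and services (assumes INFA_HOME is set in .profile)
$INFA_HOME/server/tomcat/bin/infaservice.sh startup
# Stop the Informatica domain and services
$INFA_HOME/server/tomcat/bin/infaservice.sh shutdown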
12.4 UNIX Logging
Custom script logs will reside under the ".../scripts/logs" folder.
12.5 UNIX Naming
Unless there is a standard shell used, all shell scripts should have a ".sh" extension. Example: fdw_archive.sh
13. File Processing
13.1 File Types
It is expected that the content of the data extracts provided by various sources will reflect the data in terms of timing, subject and extraction criteria. There are typically two differing extract formats. Note – once the file formats have been determined, a detailed description of that format should be communicated (file name, format, field characteristics, field description, etc.).
Master file – master or reference data, such as WBS
Transaction file – transactions, such as financials
13.2 File Format and Transfer Validation
In order to perform validation for transmitted files, the following items are helpful:
Master File
- As of Date – the end period for which the reference data is applicable
- Row Count – count of master rows in the file or checksum of the file
- Checksum – checksum of the file
Transaction File
- Date or Date Range – the period for which the transactions are applicable
- Row Count – count of transactions in the file
- Balance Total – a grand total field that represents a summation of a key value for all records in the file
- Checksum – checksum of the file
There are a couple of options that can be utilized to fulfill these requirements:
1) Use of header, detail and trailer records in one file
2) Two separate files: a summary trigger/signal file consisting of the 'as of' or range date(s), row count and hash total, and a data file that contains the actual business data.
The actual format will be determined depending on the capability of the extraction toolset.
13.3 File Naming
The naming of the incoming file is critical to understanding the source of that file. File names should contain a timestamp when possible. They should also have an appropriate extension. For example:
1) Data file – if a relational source, name the file with the table name appended with a '.dat' extension (e.g. source.dat). For a non-relational source, use a meaningful name appended with an appropriate extension.
2) Trigger file – if a relational source, the source table name appended with an '.ind' extension (e.g. source.ind). When the trigger file represents a series of files, the name should be recognizable.
13.4 File Notification
The fact that the source file has successfully landed in the landing zone should trigger an event that notifies the load process. There are several options for triggering this event:
1) An event periodically looks for the presence of a file
2) A scheduled time at which the file is expected to be there
The preferred option is to be immediately notified of the file's presence to promote efficient and timely processing rather than wait. The event can be triggered in a couple of ways:
1) Scheduling tool – the source system scheduling tool will trigger an event or process that can be recognized by the Informatica environment to subsequently launch a workflow. An integrated corporate scheduling tool can notify a server to launch an Informatica workflow (via a UNIX command line).
2) Informatica event – an Informatica workflow will contain an event that continually monitors the landing zone for the presence of a file. The workflow is constantly polling and once the file shows up, the workflow will commence processing. After successful completion, the Informatica Workflow Manager will schedule the workflow to again await the presence of the file.
a. Trigger file – the event can poll for a file that contains no business data, but is used only as an indication that data is ready for processing. This is useful when waiting for a collection of files. The trigger file will show up only when all files are ready for processing.
b. Rename of the incoming data file – this is simply the source system renaming the extension of the data file upon successful completion. For example, the file may be named *.tmp during transfer and renamed to *.dat when complete.
13.5 File Landing
When files are pushed from the source system, each file should be placed in the appropriate landing directory. The subsequent Informatica processes will acquire the file and move it to a location on the Informatica server for load processing.
13.6 File Archiving
Every incoming file will be archived for recovery and audit purposes. Based upon the design requirements, each file is retained for a period of time. Essentially the file is acquired, moved to the working folder, processed and then archived. Files may also be tar'd and compressed (gzip). The archived files are typically organized in a similar fashion to the incoming sources; hence, there would be an archive folder for each source. When archiving each file or directory, the name should reflect the file or directory and the timeframe for which the files are relevant. For example:
1) File – [file name][timestamp].ext (e.g. sbo_2008110113:30:45.tar)
2) Directory – [directory name][timestamp] (e.g. lms_20081103144500.tar)
The process by which a file is processed and then archived should consist of:
1) Move the file from the landing zone into the Informatica working directory
2) At the end of the workflow, a UNIX script will archive all of the files within the scope of the workflow:
a. Keep the most recent files in the working directory
b. TAR and/or ZIP the referenced files, placing them in the appropriate directory
A hedged sketch of such an archive step is shown below.
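This is a minimal shell sketch only, assuming illustrative project and directory names; an actual script (e.g. fdw_archive.sh) would be parameterized per project and invoked at the end of the workflow.

#!/bin/sh
# Illustrative archive of a cycle's processed files (paths are hypothetical).
WORK_DIR=$INFA_HOME/server/infa_shared/SrcFiles/myproject
ARCH_DIR=/fmacdata/myproject/archive
STAMP=`date +%Y%m%d%H%M%S`

# Bundle the processed files for this cycle and compress the bundle.
tar -cf $ARCH_DIR/myproject_$STAMP.tar -C $WORK_DIR .
gzip $ARCH_DIR/myproject_$STAMP.tar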
14. Scheduling
14.1 Informatica Workflow
The Informatica Workflow is a self-contained scheduler that integrates the ETL repository and processing within its own scheduling function. Depending on enterprise standards, the Workflow may be used to schedule the entire ETL process. At a minimum, all mappings must belong to at least one workflow.
14.2 Corporate Scheduler
The client's corporate standard is AutoSys. Since it is an enterprise scheduler, which can integrate jobs across distributed servers, it may be used to encapsulate the Informatica workflows. The Informatica 'pmcmd' command can be wrapped in a shell script for further control of the job. This may also provide more insight to production personnel who are exclusively monitoring production jobs via AutoSys.
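A hedged sketch of such a wrapper script, assuming illustrative domain, service, folder and workflow names and that the password is supplied via an environment variable:

#!/bin/sh
# Illustrative AutoSys wrapper around pmcmd (all names are hypothetical).
# Exits with pmcmd's return code so the scheduler can detect failure.
pmcmd startworkflow \
  -sv INT_SVC_DEV -d Domain_DEV \
  -u etl_user -pv INFA_PASSWD \
  -f ODS_LOAD -wait wf_ODS_LOAD_CUSTOMER
exit $?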
14.3 Data Warehouse Administration Console (DAC)
The DAC is a centralized console that allows configuration, administration, loading, and monitoring of the Oracle Business Analytics Warehouse. This includes execution plans that run pre-built Informatica mappings through individual workflows. The DAC will be used to schedule the pre-built and custom Informatica mappings and any customization made for the Oracle Business Analytics Warehouse. The execution plans can be executed via a corporate scheduler (AutoSys).