Overview

This Snap reads any type of data from various sources (such as SLDB, HTTP, S3, SFTP, HDFS, etc.) and produces a binary data stream at the output.

Snap Type

The File Reader Snap is a Read type Snap.

Prerequisites

IAM Roles for Amazon EC2

The 'IAM_CREDENTIAL_FOR_S3' feature lets the Snap access S3 files from an EC2 Groundplex without an Access-key ID and Secret key in the AWS S3 account. The IAM credential stored in the EC2 metadata is used to gain access rights to the S3 buckets. To enable this feature, add the following line to global.properties and restart the JCC (node):
jcc.jvm_options = -DIAM_CREDENTIAL_FOR_S3=TRUE

This feature is supported only in an EC2-type Groundplex.

For more information on IAM Roles, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html

Support for Ultra Pipelines 

Works in Ultra Pipelines

Limitations

  • For most file protocols, the Snap behaves the same in a Cloudplex and a Groundplex. However, the HDFS protocol works only in a Groundplex, and the Hadoop cluster must be accessible from the Groundplex server instance without any authentication.

  • When reading a file over HTTP, the File Reader Snap displays an error if the number of bytes consumed does not match the Content-Length header value present in the response.
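The Content-Length limitation above can be illustrated with a minimal Python sketch. The function name `verify_content_length` and the plain dictionary of headers are hypothetical and chosen for illustration; this is not the Snap's implementation, only one way such a check could look.

```python
from typing import Mapping, Optional


def verify_content_length(headers: Mapping[str, str], bytes_consumed: int) -> Optional[int]:
    """Raise ValueError if bytes_consumed differs from the Content-Length header.

    Returns the declared length, or None when the header is absent
    (for example, when the response uses chunked transfer encoding).
    """
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    declared = normalized.get("content-length")
    if declared is None:
        return None
    expected = int(declared)
    if bytes_consumed != expected:
        raise ValueError(
            f"Content-Length mismatch: expected {expected} bytes, read {bytes_consumed}"
        )
    return expected
```

A response without a Content-Length header passes through unchecked in this sketch, because there is no declared value to compare against.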

Known Issues

This Snap fails for an SMB file path with the error: unable to create new native thread.

Snap Views

Input

  • Format: Document

  • Number of Views: Min: 0, Max: 1

  • Examples of Upstream Snaps: Upstream Snap is optional. Any Snap with a document output view can be connected upstream.

  • Description: Input may contain value(s) to evaluate the JavaScript expression in the File property.

Output

  • Format: Binary

  • Number of Views: Min: 1, Max: 2

  • Examples of Downstream Snaps: File Writer, CSV Parser, JSON Parser, XML Parser

  • Description: Binary data read from the source specified in the File property, with header information about the binary stream.

An example of the output preview on the File property value of "http://www.facebook.com" is as follows:

Code Block
[ {
  "": "Preview binary0...",
  "content-type": "text/html; charset=utf-8",
  "x-frame-options": "DENY",
  "connection": "keep-alive",
  "transfer-encoding": "chunked",
  "date": "Thu, 23 Oct 2014 00:24:40 GMT",
  "content-location": "https://www.facebook.com",
  "pragma": "no-cache",
  "p3p": "CP=\"Facebook does not have a P3P policy. Learn why here: http://fb.me/p3p\"",
  "cache-control": "private, no-cache, no-store, must-revalidate",
  "x-xss-protection": "0",
  "x-content-type-options": "nosniff",
  "x-fb-debug": "N6wiHWAvz9kzpPUoM5vTm+yZzCZyiSrHXFXumHQixfMd0Qi+VDm514PkrrmQu2ISuuMTTFtUTqDZgDVG4blPTw==",
  "expires": "Sat, 01 Jan 2000 00:00:00 GMT",
  "set-cookie": "reg_ext_ref=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=.facebook.com"
} ]


Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the Pipeline by choosing one of the following options from the When errors occur list under the Views tab:

  • Stop Pipeline Execution: Stops the current pipeline execution when the Snap encounters an error.

  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.

  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.


Snap Settings


Field    Field Type    Description

Label*


Default Value: File Reader
Example
File Reader

String


Specify a unique name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your Pipeline.


File*


Default Value: N/A
Example


String/Expression

Specify the URL of the file to read. The URL must begin with a file protocol. The supported file protocols are:

  • http:

  • https:

  • s3:

  • ftp:

  • ftps:

  • sftp: 

  • hdfs:

  • sldb: 

  • smb:

  • file: (only for use with a Groundplex)

  • wasb:

  • wasbs:

  • gs:

  • adl:

You can also upload a file using the Upload icon and preview the uploaded file using the Preview icon. Learn more about Previewing File.

This Snap supports S3 Virtual Private Cloud (VPC) endpoints.

Info

Reading files from Project and Shared Project Spaces

  • If a Pipeline is created in a project other than the shared project and you want to read the "asset.json" file from the same project, enter "asset.json" or "sldb:///asset.json".

  • If a Pipeline is created in the shared project and you want to read the "asset.json" file from the shared project, enter "asset.json" or "sldb:///asset.json".

  • If a Pipeline is created in a project other than the shared project and you want to read the "asset.json" file from the shared project, enter "shared/asset.json" or "sldb:///shared/asset.json".

  • Ensure that the file name, folder name, and file path do not contain the '?' character. It is not fully supported, and the Snap might fail when it is present.
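The path rules above can be sketched in Python as follows. The helper name `to_sldb_url` is hypothetical; it only normalizes a project-relative file name to the equivalent `sldb:///` form described in the list and rejects the '?' character. It does not model how the Snap resolves Projects.

```python
def to_sldb_url(path: str) -> str:
    """Return the sldb:/// form of a project-relative file path."""
    # The '?' character is not fully supported in file names or paths.
    if "?" in path:
        raise ValueError(f"path must not contain '?': {path!r}")
    # Paths that already use the sldb protocol are returned unchanged.
    if path.startswith("sldb:///"):
        return path
    return "sldb:///" + path.lstrip("/")


# Same project (or shared project reading its own file):
same_project = to_sldb_url("asset.json")
# Another project reading from the shared project:
from_shared = to_sldb_url("shared/asset.json")
```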


Info

File value as an Expression

The File value can be a JavaScript expression, which is evaluated with values from the input view document and the Pipeline parameters. The syntax for the File value is: [protocol]://[host][:port]/[path]. An expression can reference the following:

  • $filename (The value of the $filename is obtained from the input document and the document should have an entry with the "filename" key.)

  • _filename (A key/value pair with "filename" key should be defined as a pipeline parameter.)
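As a rough model of the two reference styles above, the following Python sketch looks up a `$name` reference in the input document and a `_name` reference in the Pipeline parameters. The function `resolve_reference` and the sample dictionaries are hypothetical; the Snap evaluates a full JavaScript expression, which this sketch does not attempt to reproduce.

```python
from typing import Mapping


def resolve_reference(
    reference: str,
    document: Mapping[str, str],
    pipeline_params: Mapping[str, str],
) -> str:
    """Resolve a single $key or _key reference to its value."""
    key = reference[1:]
    if reference.startswith("$"):
        # $filename: the input document must have an entry with this key.
        return document[key]
    if reference.startswith("_"):
        # _filename: the key must be defined as a Pipeline parameter.
        return pipeline_params[key]
    raise ValueError(f"not a document or parameter reference: {reference!r}")
```

A missing key raises `KeyError` in this sketch, which mirrors the requirement that the document or the Pipeline parameters must define the referenced key.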


Note

The File value should be an absolute path for all protocols except SLDB. For files in SLDB, the Snap can read only files in the same Project Directory or the Shared Project Directory; it cannot access files from other Projects. Typically, the file names in Reader Snaps are read from the incoming document, which might have a structure different from the relative path. For optimal results, we recommend that you build absolute paths to your Projects and then add the file name.

When you provide a file path that contains more than five entities (for example, entity1/entity2/entity3/entity4/file1.json), the Snap displays a Lint Warning in your Pipeline.


Info
  • "://" separates the file protocol from the rest of the URL. The host name and the port number must appear between "://" and "/". The host name and port number are omitted in the SLDB and S3 protocols. If the port number is omitted, the default port for the protocol is used.

  • The file:/// protocol is supported only on a Groundplex. In Cloudplex configurations, use SLDB or other file protocols. When you use the file:/// protocol, files are accessed with the permissions of the user assigned to or associated with the Snaplex (by default, Snapuser). Use file system access with caution; it is the customer's responsibility to ensure that the file system is cleaned up after use.
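The `[protocol]://[host][:port]/[path]` layout can be explored with Python's standard `urllib.parse.urlsplit`. In this sketch the function `split_file_url` and the `DEFAULT_PORTS` table are assumptions made for illustration; the default ports the Snap actually uses are not listed in this article.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit

# Assumed well-known default ports, for illustration only.
DEFAULT_PORTS: Dict[str, int] = {"http": 80, "https": 443, "ftp": 21, "sftp": 22}


def split_file_url(url: str) -> Dict[str, Optional[object]]:
    """Split a file URL into protocol, host, port, and path."""
    if "://" not in url:
        raise ValueError(f"missing '://' separator: {url!r}")
    parts = urlsplit(url)
    # Fall back to the assumed default when the URL carries no port.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,  # None when the host is omitted, as in sldb:///
        "port": port,
        "path": parts.path,
    }
```

For an `sldb:///` URL, both the host and the port come back as `None`, which matches the note that they are omitted for that protocol.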



Prevent URL encoding


Default Value: Deselected

Checkbox

When enabled, this setting prevents the Snap from automatically URL encoding the file path (including the query string, if it exists). Enable this setting to use the file path value as-is.

When disabled, the following are some of the common characters that are automatically encoded by the Snap: 

Character name    Character    URL Encoded value
backslash         \            %5C
pound             #            %23
space             (space)      %20
percent           %            %25
left-angle        <            %3C
right-angle       >            %3E
left-square       [            %5B
right-square      ]            %5D
left-curly        {            %7B
right-curly       }            %7D

And these are some of the characters that are not automatically encoded by the Snap:

Character name    Character    URL Encoded value
semi-colon        ;            %3B
question mark     ?            %3F
forward slash     /            %2F
colon             :            %3A
ampersand         &            %26
equals            =            %3D
plus              +            %2B
dollar            $            %24
comma             ,            %2C
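The two character lists above can be approximated with Python's `urllib.parse.quote`, passing the non-encoded characters as the `safe` set. This is a sketch under that assumption, not the Snap's encoder: the function name `encode_file_path` is hypothetical, and characters that appear in neither list may be handled differently by the Snap.

```python
from urllib.parse import quote

# Characters listed above as NOT automatically encoded.
NOT_ENCODED = ";?/:&=+$,"


def encode_file_path(path: str, prevent_url_encoding: bool = False) -> str:
    """Percent-encode a file path unless encoding is prevented."""
    if prevent_url_encoding:
        # Mirrors the 'Prevent URL encoding' checkbox: use the value as-is.
        return path
    # quote() leaves letters, digits, and "_.-~" untouched, plus the safe set.
    return quote(path, safe=NOT_ENCODED)
```

Because the percent sign itself is encoded as %25, passing an already encoded path through this function encodes it a second time; that is the situation in which using the path as-is is useful.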


Enable staging


Default Value: Deselected

Checkbox

If selected, the Snap downloads the source file into a local temporary file. When the download is complete, it streams the data from the temporary file to the output view. This property prevents the Snap from being blocked by a slow downstream Pipeline. The local disk must have at least as much free space as the expected file size.

Note

Some Snaps may take a long time to process large amounts of data. This, in turn, could lead to connection timeouts, causing the pipeline to fail. Selecting this property saves the data on your local disk, enabling you to avoid such timeouts.
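The staging behavior can be pictured with the following Python sketch, which copies the whole source into a temporary file before yielding any data downstream. The function `stage_then_stream` and its `chunk_size` parameter are hypothetical names used for illustration; the Snap's internal buffering is not documented here.

```python
import tempfile
from typing import Iterable, Iterator


def stage_then_stream(source: Iterable[bytes], chunk_size: int = 8192) -> Iterator[bytes]:
    """Download everything to a temporary file, then stream it in chunks."""
    with tempfile.TemporaryFile() as staged:
        # Phase 1: drain the source completely, independent of the consumer.
        for chunk in source:
            staged.write(chunk)
        # Phase 2: replay the staged bytes at whatever pace the consumer reads.
        staged.seek(0)
        while True:
            block = staged.read(chunk_size)
            if not block:
                break
            yield block
        # Leaving the with-block deletes the temporary file.
```

The source connection is only needed during the first phase, which is why a slow consumer cannot cause a timeout on the source side in this model.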


Number of retries


Default Value: 0
Example:
3

Integer/Expression

Specify the maximum number of retry attempts that the Snap must make in case there is a network failure, and the Snap is unable to read the target file.

If the value is larger than 0, the Snap first downloads the target file into a temporary local file. If any error occurs during the download, the Snap waits for the time specified in the Retry interval and attempts to download the file again from the beginning. When the download is successful, the Snap streams the data from the temporary file to the downstream Pipeline. All temporary local files are deleted when they are no longer needed.

Info

Ensure that the local drive has sufficient free disk space to store the temporary local file.

Minimum value: 0
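The retry behavior described for this field can be sketched in Python as follows. The names `read_with_retries`, `open_source`, and the injectable `sleep` are hypothetical. The sketch also assumes that network failures surface as `OSError`, and it always stages to a temporary file, whereas the Snap does so only when the value is larger than 0.

```python
import tempfile
import time
from typing import BinaryIO, Callable, Iterable


def read_with_retries(
    open_source: Callable[[], Iterable[bytes]],
    retries: int = 0,
    retry_interval: float = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> BinaryIO:
    """Download to a temporary file, restarting from the beginning on failure."""
    attempt = 0
    while True:
        staged = tempfile.TemporaryFile()
        try:
            for chunk in open_source():
                staged.write(chunk)
        except OSError:
            # Discard the partial download; the next attempt starts over.
            staged.close()
            if attempt >= retries:
                raise
            attempt += 1
            sleep(retry_interval)
            continue
        # Success: rewind so the caller can stream the data downstream.
        staged.seek(0)
        return staged
```

The caller is responsible for closing the returned file, which also deletes it.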


Retry interval (seconds)


Default Value: 1
Example:
3

Integer/Expression

Specify the minimum number of seconds for which the Snap must wait before attempting recovery from a network failure.

Minimum value: 1


Advanced properties


Use this field set to define specific settings for polling files. Click to add a new row for defining an advanced property. This field set contains the following fields:
  • Properties
  • Values
Properties

Dropdown list

The URI of the Shared Access Signature (SAS) to be accessed. Supported SAS types are:

  • Service SAS on container
  • Service SAS on blob
  • Account SAS


Values


Default Value: N/A

Example: https://myaccount.blob.core.windows.net/sascontainer/sasblob.txt?sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D

String/Expression

Specify the value for the SAS URI.


Note

Ensure that the URI is specified in the format described here

If the SAS URI value is provided in the Snap settings, then the settings provided in the account (if any account is attached) are ignored.
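To see what a SAS URI contains, the following Python sketch splits one into its host, resource path, and decoded query parameters using the standard `urllib.parse` module. The function name `parse_sas_uri` is hypothetical, and the sketch only inspects the URI; it does not validate the signature or contact Azure.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def parse_sas_uri(uri: str) -> Dict[str, object]:
    """Split a SAS URI into host, resource path, and decoded parameters."""
    parts = urlsplit(uri)
    # parse_qs returns a list per key; SAS parameters appear once each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"host": parts.hostname, "resource_path": parts.path, "params": params}


example = parse_sas_uri(
    "https://myaccount.blob.core.windows.net/sascontainer/sasblob.txt"
    "?sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z"
    "&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https"
    "&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D"
)
# example["params"]["sr"] is "b", which marks a service SAS on a blob.
```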


Snap Execution


Default Value:
Execute only
Example:
Validate & Execute

Dropdown list


Note
  • As of the Fall 2015 release, only File Reader/Writer Snaps support the Azure Storage protocol (azure:///).

  • As of the Fall 2017 release, the File Reader Snap shows the progress message during the Pipeline runtime.

Previewing File

To preview a file, click the Preview icon in the File field.

The Preview Type contains the following options:

  • Hex: Displays the preview data in hexadecimal format.
  • Text: Displays the preview data in text format.
  • Render text with whitespace: Renders whitespaces as dots "." and tabs as underscores "_" in the preview data.
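The three preview types can be imitated with a short Python sketch. The function `render_preview` is hypothetical, it assumes UTF-8 text, and it treats only the space character as whitespace to be shown as a dot; the actual preview dialog may differ on both points.

```python
def render_preview(data: bytes, preview_type: str = "Text") -> str:
    """Render binary data in one of the three preview styles."""
    if preview_type == "Hex":
        # Two hex digits per byte, separated by spaces (Python 3.8+).
        return data.hex(" ")
    text = data.decode("utf-8", errors="replace")
    if preview_type == "Text":
        return text
    if preview_type == "Render text with whitespace":
        # Spaces become dots and tabs become underscores.
        return text.replace(" ", ".").replace("\t", "_")
    raise ValueError(f"unknown preview type: {preview_type!r}")
```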


Examples

HDFS

For hdfs:// file access, use a SnapLogic on-premises Groundplex. Make sure that the Groundplex instance is within the Hadoop cluster and that SSH authentication has already been established. You can access HDFS files in the same way as other file protocols in the File Reader and File Writer Snaps. You do not need to use an account in the Snap.

Note

HDFS 2.4.0 is supported for the hdfs protocol.


The HDFS file URL has the following format:

Code Block
hdfs://<hostname>:<port number>/<path to folder>/<filename>

If Cloudera Hadoop Namenode is installed in AWS EC2 and its hostname is "ec2-54-198-212-134.compute-1.amazonaws.com" and its port number is 8020, then you would enter:

hdfs://ec2-54-198-212-134.compute-1.amazonaws.com:8020/user/john/input/sample.csv

SFTP File Read

An example Pipeline for an SFTP file read is shown below:


Note
  • The 'IAM_CREDENTIAL_FOR_S3' feature lets the Snap access S3 files from an EC2 Groundplex without an Access-key ID and Secret key in the AWS S3 account. The IAM credential stored in the EC2 metadata is used to gain access rights to the S3 buckets. To enable this feature, add the following line to global.properties and restart the JCC:

     jcc.jvm_options = -DIAM_CREDENTIAL_FOR_S3=TRUE

  • This feature is supported in the EC2-type Groundplex only.

Sample for AWS S3 Support


See Also

Binary Snap Pack