Setting up API Keys
API keys are not required to use the API, but you can configure the Pipeline to enforce their use. First, set the useApiKeys value to "true" in the Helm or project resource application configuration, as shown below.
```yaml
...
schedule:
  checkFileNotifierQueue: "5000"
  findJobsToStart: "5000"
environment: idata-poc
useApiKeys: "true"
aws:
...
```
Next, configure the API keys in AWS Secrets Manager. You can set up one or many API keys for use by the Pipeline API, stored as key/value pairs. When a Pipeline API request is made, the key/value pairs are read from the Secrets Manager secret and used to validate the request. The key names are ignored; only the values are used for validation.
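As a sketch, a secret in this shape could be created with the AWS CLI. The secret name and the two key values below are placeholders, and the key names (team-a, team-b) are arbitrary, since only the values are validated:
```shell
# Create a Secrets Manager secret holding two API keys as key/value pairs.
# Key names ("team-a", "team-b") are arbitrary placeholders; only the values matter.
aws secretsmanager create-secret \
  --name my-api-keys-secret-name \
  --secret-string '{"team-a": "d0322db4-f8ac-11ec-b939-0242ac120002", "team-b": "1847626a-5b46-4d43-827c-25f323d9201b"}'
```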

Save the secret, then set the apiKeysSecretName value to the name of the saved secret, as shown below.
```yaml
aws:
  region: us-east-1
  secretsManager:
    apiKeysSecretName: my-api-keys-secret-name
```
Version API
1. Retrieve the Pipeline Version
| Header | Value |
|---|---|
| x-api-key | api-key |
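The endpoint URL for this call is not shown above; a minimal curl sketch, assuming the path is /version (hostname and api-key are placeholders):
```shell
# Hypothetical example; the /version path is an assumption, not confirmed by this document.
curl --location --request GET 'http://hostname/version' \
  --header 'x-api-key: api-key'
```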
Dataset Configuration API
This API enables the registration, retrieval and deletion of dataset JSON configurations.
1. Registering a Dataset Configuration
| Header | Value |
|---|---|
| x-api-key | api-key |
| Content-Type | application/json |
Body
```json
{
  [dataset_configuration_here]
}
```
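A hedged curl sketch for registration, assuming the endpoint is a POST to /dataset (the retrieval and deletion calls below use that path) and that the configuration JSON is saved in a local file:
```shell
# Assumed endpoint: POST /dataset. dataset_configuration.json is a placeholder file
# containing the dataset JSON configuration to register.
curl --location --request POST 'http://hostname/dataset' \
  --header 'x-api-key: api-key' \
  --header 'Content-Type: application/json' \
  --data '@dataset_configuration.json'
```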
2. Retrieving a Dataset Configuration
GET http://hostname/dataset?dataset=dataset_name_here
| Header | Value |
|---|---|
| x-api-key | api-key |
| Content-Type | application/json |
| Parameter | Description |
|---|---|
| dataset | The name of the dataset to retrieve |
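A minimal curl sketch for this call (hostname, api-key, and the dataset name are placeholders):
```shell
curl --location --request GET 'http://hostname/dataset?dataset=stock_price' \
  --header 'x-api-key: api-key' \
  --header 'Content-Type: application/json'
```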
3. Delete a Dataset Configuration
DELETE http://hostname/dataset?dataset=dataset_name_here
| Header | Value |
|---|---|
| x-api-key | api-key |
| Content-Type | application/json |
| Parameter | Description |
|---|---|
| dataset | The name of the dataset to delete |
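A matching curl sketch (placeholders as above):
```shell
curl --location --request DELETE 'http://hostname/dataset?dataset=stock_price' \
  --header 'x-api-key: api-key' \
  --header 'Content-Type: application/json'
```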
4. Retrieve all Registered Dataset Configurations
| Header | Value |
|---|---|
| x-api-key | api-key |
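The endpoint URL for this call is not shown above; a hypothetical sketch, assuming a GET to /datasets:
```shell
# Hypothetical; the /datasets path is an assumption, not confirmed by this document.
curl --location --request GET 'http://hostname/datasets' \
  --header 'x-api-key: api-key'
```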
Notification Subscription API
When a dataset arrives at its final destination, whether that is S3, Snowflake, or Redshift, a notification is sent to the Dataset Notification SNS Topic. If you would like to attach an SQS queue or another type of resource to these notifications, you can do so using the Notification Subscription API.
1. Create a Subscription to a Dataset Notification
POST http://hostname/subscription
| Header | Value |
|---|---|
| x-api-key | api-key |
| Content-Type | application/json |
Body:
```json
{
  "endpointArn": "arn:aws:sqs:us-east-1:196014872813:idata-dev-stock-price",
  "protocol": "sqs",
  "filterPolicy": "{\"prefixKey\": [\"yahoo/finance\"]}"
}
```
| Field | Required | Description |
|---|---|---|
| endpointArn | yes | The ARN of the destination of the notification message |
| protocol | yes | The protocol for the ARN above. Valid values include sqs, http, https, email, email-json, sms, application, lambda or firehose |
| filterPolicy | no | If no filterPolicy is included, the endpointArn will receive notifications for all datasets delivered to their destination(s). Using a filterPolicy, you can filter what notifications you would like to receive at the endpoint. If included, this field must contain an escaped JSON string. Examples are below. |
filterPolicy Examples
- To receive a notification for a specific dataset: "filterPolicy": "{\"dataset\": [\"stock_price\"]}"
- To receive notifications for more than one specific dataset: "filterPolicy": "{\"dataset\": [\"yahoo\", \"finance\"]}"
- To receive notifications for a family of datasets, use the prefixKey in the filterPolicy: "filterPolicy": "{\"prefixKey\": [\"yahoo/finance\"]}"
Response:
```json
{
  "owner": "196014872813",
  "subscriptionArn": "arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification:cb33899e-c115-4dba-8529-3a76b80aaa93",
  "topicArn": "arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification",
  "endpointArn": "arn:aws:sqs:us-east-1:196014872813:idata-dev-rimes",
  "protocol": "sqs",
  "filterPolicy": "{\"prefixKey\": [\"yahoo/finance\"]}"
}
```
A note on permissions to your endpoint: make sure the endpoint grants the Pipeline Dataset Notification SNS topic permission to write to it. For example, if your endpoint is an SQS queue, the SNS topic must have permission to send messages to that queue, as in the access policy below.
SQS Access Policy
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MySQSPolicy001",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-east-1:196014872813:idata-dev-rimes",
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification"
        }
      }
    }
  ]
}
```
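As a sketch, you could attach this policy to the queue with the AWS CLI. The queue URL is a placeholder, and attributes.json is assumed to wrap the policy above as a JSON-escaped string under a "Policy" key:
```shell
# attributes.json is assumed to contain: {"Policy": "<the access policy above, JSON-escaped>"}
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/196014872813/idata-dev-rimes \
  --attributes file://attributes.json
```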
2. Get a subscription
GET http://hostname/subscription?subscriptionarn=[subscriptionArn]
| Header | Value |
|---|---|
| x-api-key | api-key |
| Parameter | Description |
|---|---|
| subscriptionarn | The ARN of the subscription to retrieve |
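A curl sketch for this call, using the subscription ARN returned above (hostname and api-key are placeholders):
```shell
curl --location --request GET 'http://hostname/subscription?subscriptionarn=arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification:cb33899e-c115-4dba-8529-3a76b80aaa93' \
  --header 'x-api-key: api-key'
```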
Response:
```json
{
  "owner": "196014872813",
  "subscriptionArn": "arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification:cb33899e-c115-4dba-8529-3a76b80aaa93",
  "topicArn": "arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification",
  "endpointArn": "arn:aws:sqs:us-east-1:196014872813:idata-dev-rimes",
  "protocol": "sqs",
  "filterPolicy": "{\n  \"prefixKey\": [\"yahoo/finance\"]\n}"
}
```
3. Get all subscriptions
GET http://hostname/subscriptions
| Header | Value |
|---|---|
| x-api-key | api-key |
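A curl sketch (hostname and api-key are placeholders):
```shell
curl --location --request GET 'http://hostname/subscriptions' \
  --header 'x-api-key: api-key'
```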
Response:
```json
[
  {
    "owner": "196014872813",
    "subscriptionArn": "arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification:cb33899e-c115-4dba-8529-3a76b80aaa93",
    "topicArn": "arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification",
    "endpointArn": "arn:aws:sqs:us-east-1:196014872813:idata-dev-rimes",
    "protocol": "sqs",
    "filterPolicy": "{\n  \"prefixKey\": [\"yahoo/finance\"]\n}"
  }
]
```
4. Delete a subscription
DELETE http://hostname/subscription?subscriptionarn=[subscriptionArn]
| Header | Value |
|---|---|
| x-api-key | api-key |
| Parameter | Description |
|---|---|
| subscriptionarn | The ARN of the subscription to delete |
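A matching curl sketch (placeholders as above):
```shell
curl --location --request DELETE 'http://hostname/subscription?subscriptionarn=arn:aws:sns:us-east-1:196014872813:idata-dev-dataset-notification:cb33899e-c115-4dba-8529-3a76b80aaa93' \
  --header 'x-api-key: api-key'
```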
Data Ingestion API
To ingest raw data into the Pipeline, you can make use of the Data Ingestion API.
1. Ingest a Data File into the Pipeline
This endpoint employs a multi-part file upload.
POST http://hostname/dataset/upload
| Header | Value |
|---|---|
| x-api-key | api-key |
| Content-Type | multipart/form-data |
| Parameter | Required | Description |
|---|---|---|
| file | yes | The multipart data file to upload |
| dataset | yes | The dataset associated with the data file being uploaded |
| publishertoken | no | Optionally, you can pass a UUID as this value. If none is passed, the Pipeline will automatically generate one. |
Here’s a generic curl example:
```shell
curl --location --request POST 'https://hostname/dataset/upload?dataset=stock_price' \
  --header 'x-api-key: d0322db4-f8ac-11ec-b939-0242ac120002' \
  --form 'file=@"stock_price.20190303.dataset.csv"'
```
Dataset Generation API
You can use this API to infer a dataset configuration and its schema from a raw data file. Note: the inference is most accurate when the file contains a large amount of representative data.
1. Infer a Dataset with Schema using the REST multi-part upload endpoint
POST http://hostname/dataset/generate
| Header | Value |
|---|---|
| x-api-key | api-key |
| Content-Type | multipart/form-data |
| Parameter | Required | Description |
|---|---|---|
| file | yes | multipart file |
| dataset | yes | The name of the dataset configuration to generate |
| delimiter | no | The delimiter of the file to be inferred (e.g. a comma) |
| header | no | If this value is true, the file has a header |
Here’s an example using curl to generate a dataset with a schema from a data file.
```shell
curl --location --request POST 'http://localhost:8080/dataset/generate?dataset=stock_price' \
  --header 'x-api-key: 1847626a-5b46-4d43-827c-25f323d9201b' \
  --form 'file=@"./pipeline-server/test/files/stock_price.20170104.dataset.csv"'
```
Dataset Status API
You can use this API to build your own dataset monitoring user-interface.
1. Retrieve the dataset summary page
This method will retrieve a list of dataset status records sorted by time, most recent first.
GET http://hostname/dataset/status
| Header | Value |
|---|---|
| x-api-key | api-key |
| Parameter | Required | Description |
|---|---|---|
| page | no | If no page number is provided, the response defaults to page 1 |
Here’s a curl example to make the call:
```shell
curl --location --request GET 'http://localhost:8080/dataset/status' \
  --header 'x-api-key: 1847626a-5b46-4d43-827c-25f323d9201b' | json_pp
```
Sample response:
```json
[
  {
    "createdAtTimestamp": "2023-02-06 14:05:35.886",
    "createdAt": 1675710335886,
    "updatedAt": 1675710343926,
    "dataset": "stock_price",
    "pipelineToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "process": "SnowflakeLoader",
    "startTime": "02-06-2023 14:05:35 EST",
    "endTime": "02-06-2023 14:05:43 EST",
    "totalTime": "8 sec",
    "status": "success"
  },
  {
    "createdAtTimestamp": "2023-02-03 15:35:29.666",
    "createdAt": 1675456529666,
    "updatedAt": 1675456545083,
    "dataset": "cusips",
    "pipelineToken": "29c4de30-52e7-5112-8999-21c298ab34b1",
    "process": "RedshiftLoader",
    "startTime": "02-03-2023 15:35:29 EST",
    "endTime": "02-03-2023 15:35:45 EST",
    "totalTime": "15 sec",
    "status": "success"
  },
  {
    "createdAtTimestamp": "2023-02-03 14:53:06.591",
    "createdAt": 1675453986591,
    "updatedAt": 1675454005096,
    "dataset": "rimes_idx_std",
    "pipelineToken": "588b6bda-ea7a-51d2-89c9-085ae56dc5a0",
    "process": "ObjectStoreLoader",
    "startTime": "02-03-2023 14:53:06 EST",
    "endTime": "02-03-2023 14:53:25 EST",
    "totalTime": "timed out",
    "status": "error"
  },
  ...
]
```
NOTE: You can use the "pipelineToken" returned in the call above to make the next dataset status call, which returns the detail for that Pipeline job.
2. Retrieve the dataset status detail for a specific dataset ingestion
Each time a dataset job is triggered, the Pipeline generates a unique pipeline token for the job. This pipelineToken can be used to query the detail of a specific dataset job.
GET http://hostname/dataset/status
| Header | Value |
|---|---|
| x-api-key | api-key |
| Parameter | Required | Description |
|---|---|---|
| pipelinetoken | yes | The pipeline token of the job whose detail you wish to query |
```shell
curl --location --request GET 'http://localhost:8080/dataset/status?pipelinetoken=92adbf1a-94de-5188-8e1d-607233b7a9d2' \
  --header 'x-api-key: 1847626a-5b46-4d43-827c-25f323d9201b'
```
Sample response:
```json
[
  {
    "id": 0,
    "dateTime": "02-06-2023 14:05:35 EST",
    "dataset": "stock_price",
    "processName": "FileNotifier",
    "publisherToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "pipelineToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "filename": "stock_price.2023-02-06.14-05-33-271.1675710333271.dataset.csv",
    "state": "begin",
    "code": "info",
    "description": "Data received, bucket: idata-poc-raw, key: temp/stock_price/stock_price.2023-02-06.14-05-33-271.1675710333271.dataset.csv",
    "epoch": 1675710335886
  },
  {
    "id": 0,
    "dateTime": "02-06-2023 14:05:36 EST",
    "dataset": "stock_price",
    "processName": "FileNotifier",
    "publisherToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "pipelineToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "filename": "stock_price.2023-02-06.14-05-33-271.1675710333271.dataset.csv",
    "state": "processing",
    "code": "info",
    "description": "Total file size: 9344845",
    "epoch": 1675710336498
  },
  {
    "id": 0,
    "dateTime": "02-06-2023 14:05:36 EST",
    "dataset": "stock_price",
    "processName": "FileNotifier",
    "publisherToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "pipelineToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "filename": "stock_price.2023-02-06.14-05-33-271.1675710333271.dataset.csv",
    "state": "end",
    "code": "info",
    "description": "Process completed successfully",
    "epoch": 1675710336623
  },
  {
    "id": 0,
    "dateTime": "02-06-2023 14:05:36 EST",
    "dataset": "stock_price",
    "processName": "JobRunner",
    "publisherToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "pipelineToken": "92adbf1a-94de-5188-8e1d-607233b7a9d2",
    "filename": "stock_price.2023-02-06.14-05-33-271.1675710333271.dataset.csv",
    "state": "begin",
    "code": "info",
    "description": "Process started",
    "epoch": 1675710336749
  },
  ...
]
```
