How to automatically add timestamp to documents and find the latest document in Elasticsearch?

Published by Big Data In Real World on March 27, 2023

What is a pipeline?

An ingest pipeline defines a series of processors that run on each incoming document, in the order in which they are declared, before the document is indexed.

Think of a processor as a single step that applies one change to the document, such as setting, renaming or removing a field.

In this post we are going to create a pipeline to add a field named doc_timestamp to all the documents that are added to the index.

Creating a pipeline

With the PUT request below we create a pipeline named doc_timestamp. The pipeline has a single set processor, which adds a field named doc_timestamp and sets its value to the time at which the document is ingested, that is, added to the index.

curl -X PUT 'http://localhost:9200/_ingest/pipeline/doc_timestamp?pretty' -H 'Content-Type: application/json' -d '
{
  "description": "pipeline to add timestamp to documents",
  "processors": [
    {
      "set": {
        "field": "_source.doc_timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}'

{
  "acknowledged" : true
}
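
Before attaching the pipeline to an index, it can help to check what it does to a sample document. The following is a minimal sketch that uses the simulate API, assuming the same local cluster on localhost:9200. The function names simulate_body and simulate_pipeline and the sample document are illustrative and not part of the original setup.

```shell
# Illustrative helper (hypothetical name): prints a request body holding
# one made-up sample document for the simulate API.
simulate_body() {
  cat <<'EOF'
{
  "docs": [
    { "_source": { "account_number": 1, "firstname": "Sample" } }
  ]
}
EOF
}

# Runs the doc_timestamp pipeline against the sample document without
# indexing anything. Assumes Elasticsearch is reachable on localhost:9200.
simulate_pipeline() {
  curl -s -X POST 'http://localhost:9200/_ingest/pipeline/doc_timestamp/_simulate?pretty' \
    -H 'Content-Type: application/json' \
    -d "$(simulate_body)"
}

# Usage: simulate_pipeline
```

If the pipeline behaves as intended, the response should show the sample document with a doc_timestamp field added to its _source. Nothing is written to any index.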

Attach the pipeline to an index

Here we attach the doc_timestamp pipeline to the account_v2 index by setting it as the default_pipeline for the index.

curl -X PUT 'http://localhost:9200/account_v2/_settings?pretty' -H 'Content-Type: application/json' -d '
{
  "index.default_pipeline": "doc_timestamp"
}'

{
  "acknowledged" : true
}

Now that the pipeline is attached to the index, anytime a document is added to the index a new field doc_timestamp will be added to the document. This doesn’t affect any of the existing documents in the index.

Let’s look up an existing document. As expected, it does not have the doc_timestamp field.

curl -X GET 'localhost:9200/account_v2/_doc/735?pretty'

{
  "_index" : "account_v2",
  "_type" : "_doc",
  "_id" : "735",
  "_version" : 1,
  "_seq_no" : 344,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "account_number" : 735,
    "balance" : 3984,
    "firstname" : "Loraine",
    "lastname" : "Willis",
    "age" : 32,
    "gender" : "F",
    "address" : "928 Grove Street",
    "employer" : "Gadtron",
    "email" : "lorainewillis@gadtron.com",
    "city" : "Lowgap",
    "state" : "NY"
  }
}
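
If existing documents also need the field, one possible approach is to run them through the same pipeline with the update by query API. This is only a sketch under assumptions: the function names backfill_body and backfill_timestamps are illustrative, and the query restricts the update to documents that do not yet have doc_timestamp. Note that the value written is the time of the backfill, not the time the document was originally indexed, so backfilled documents will sort as if they were recent.

```shell
# Illustrative helper (hypothetical name): prints a query that matches
# only documents that do not yet have a doc_timestamp field.
backfill_body() {
  cat <<'EOF'
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "doc_timestamp" } }
    }
  }
}
EOF
}

# Re-indexes the matching documents in place through the doc_timestamp
# pipeline. Assumes Elasticsearch is reachable on localhost:9200.
backfill_timestamps() {
  curl -s -X POST 'http://localhost:9200/account_v2/_update_by_query?pipeline=doc_timestamp&pretty' \
    -H 'Content-Type: application/json' \
    -d "$(backfill_body)"
}

# Usage: backfill_timestamps
```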

Add a new document to the index

Let’s add a new document to the index with id 2000.

curl -X PUT 'http://localhost:9200/account_v2/_doc/2000?pretty' -H 'Content-Type: application/json' -d '{
    "account_number": 2000,
    "balance": 16418,
    "firstname": "Elinor",
    "lastname": "Ratliff",
    "age": 36,
    "gender": "M",
    "address": "282 Kings Place",
    "employer": "Scentric",
    "email": "elinorratliff@scentric.com",
    "city": "Ribera",
    "state": "WA"
}'

{
  "_index" : "account_v2",
  "_type" : "_doc",
  "_id" : "2000",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 993,
  "_primary_term" : 1
}

Now that the document is added, let’s look it up. The new doc_timestamp field is present and holds the time at which the document was added to the index.

curl -X GET 'localhost:9200/account_v2/_doc/2000?pretty'

{
  "_index" : "account_v2",
  "_type" : "_doc",
  "_id" : "2000",
  "_version" : 1,
  "_seq_no" : 993,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "account_number" : 2000,
    "firstname" : "Elinor",
    "address" : "282 Kings Place",
    "gender" : "M",
    "city" : "Ribera",
    "lastname" : "Ratliff",
    "balance" : 16418,
    "employer" : "Scentric",
    "state" : "WA",
    "age" : 36,
    "email" : "elinorratliff@scentric.com",
    "doc_timestamp" : "2020-11-19T20:39:33.639398617Z"
  }
}

Latest document added to the index

Now that newly added documents carry the ingestion timestamp, we can find the latest one with a simple search: sort on the doc_timestamp field in descending order and return only the top result.

curl -X GET 'localhost:9200/account_v2/_search?pretty' -H 'Content-Type: application/json' -d'
{
   "size": 1,
   "sort": { "doc_timestamp": "desc"},
   "query": {
      "match_all": {}
   }
}'

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 994,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "account_v2",
        "_type" : "_doc",
        "_id" : "2000",
        "_score" : null,
        "_source" : {
          "account_number" : 2000,
          "firstname" : "Elinor",
          "address" : "282 Kings Place",
          "gender" : "M",
          "city" : "Ribera",
          "lastname" : "Ratliff",
          "balance" : 16418,
          "employer" : "Scentric",
          "state" : "WA",
          "age" : 36,
          "email" : "elinorratliff@scentric.com",
          "doc_timestamp" : "2020-11-19T20:39:33.639398617Z"
        },
        "sort" : [
          1605818373639
        ]
      }
    ]
  }
}
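
The same search can be generalized to return the N most recently added documents and to trim the response to a couple of fields. The following is a minimal sketch, assuming the same local cluster. The function names latest_docs_body and latest_docs are illustrative, and the choice of fields in _source is only an example.

```shell
# Illustrative helper (hypothetical name): prints a search body that
# returns the N most recently ingested documents (N defaults to 1),
# keeping only two fields from each document's _source.
latest_docs_body() {
  size="${1:-1}"
  cat <<EOF
{
  "size": ${size},
  "_source": ["account_number", "doc_timestamp"],
  "sort": [ { "doc_timestamp": { "order": "desc" } } ],
  "query": { "match_all": {} }
}
EOF
}

# Sends the search to the account_v2 index.
# Assumes Elasticsearch is reachable on localhost:9200.
latest_docs() {
  curl -s -X GET 'http://localhost:9200/account_v2/_search?pretty' \
    -H 'Content-Type: application/json' \
    -d "$(latest_docs_body "$1")"
}

# Usage: latest_docs 5
```

Building the body in a separate function keeps the query easy to inspect on its own before it is sent to the cluster.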

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
