How to automatically add timestamp to documents and find the latest document in Elasticsearch? - Big Data In Real World


Elasticsearch used to offer a _timestamp metadata field that stored the ingestion timestamp of each document added to an index. Unfortunately, this field was deprecated in Elasticsearch 2.0 and removed in 5.0.

Building a separate ingestion process just to stamp every document with a timestamp is not realistic, but thankfully we can solve this problem with Elasticsearch’s ingest pipeline functionality.

What is a pipeline?

A pipeline is a definition of a series of processors that are executed in the order in which they are declared.

Think of a processor as a single step that applies one change to a document, such as setting, renaming or removing a field, before the document is indexed.

In this post we are going to create a pipeline to add a field named doc_timestamp to all the documents that are added to the index.
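
To make the ordering concrete, here is a minimal sketch that uses the ingest simulate API with an inline pipeline, so nothing is stored in the cluster and no document is indexed. The greeting field, the function names and the ES_URL variable are made up for illustration, and the sketch assumes an unsecured cluster reachable at http://localhost:9200.

```shell
# ES_URL and the function names are illustrative assumptions, not part of Elasticsearch.
ES_URL="${ES_URL:-http://localhost:9200}"

# Request body for the simulate API: an inline pipeline with two processors
# and one test document. The second processor reads the field written by the
# first one, so the declaration order matters.
ordered_pipeline_body() {
  cat <<'EOF'
{
  "pipeline": {
    "processors": [
      { "set": { "field": "greeting", "value": "hello" } },
      { "uppercase": { "field": "greeting" } }
    ]
  },
  "docs": [
    { "_source": { "account_number": 1 } }
  ]
}
EOF
}

# Sends the body to the simulate endpoint and prints the transformed document.
simulate_ordered_pipeline() {
  ordered_pipeline_body | curl -s -X POST "$ES_URL/_ingest/pipeline/_simulate?pretty" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

Calling simulate_ordered_pipeline against a running cluster should show the test document with the greeting field set and then uppercased. If the two processors were swapped, the uppercase processor would run first and fail, because the greeting field would not exist yet.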

Creating a pipeline

With the PUT request below we create a pipeline named doc_timestamp. The pipeline has a single set processor, which adds a field named doc_timestamp and sets its value to the timestamp at which the document is ingested, that is, added to the index.

curl -X PUT 'http://localhost:9200/_ingest/pipeline/doc_timestamp?pretty' -H 'Content-Type: application/json' -d '
{
  "description": "pipeline to add timestamp to documents",
  "processors": [
    {
      "set": {
        "field": "_source.doc_timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}'

{
  "acknowledged" : true
}
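
Before attaching the pipeline to an index, it can be tried out on a sample document. The sketch below assumes the doc_timestamp pipeline created above already exists and that the cluster is reachable at http://localhost:9200; the function names, the ES_URL variable and the sample document are illustrative choices, not part of the original setup.

```shell
# ES_URL and the function names are illustrative assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# One sample document. The simulate API runs it through the pipeline
# without indexing it anywhere.
sample_docs_body() {
  cat <<'EOF'
{
  "docs": [
    { "_source": { "account_number": 2000, "firstname": "Elinor" } }
  ]
}
EOF
}

# Runs the sample document through the stored doc_timestamp pipeline.
simulate_doc_timestamp() {
  sample_docs_body | curl -s -X POST "$ES_URL/_ingest/pipeline/doc_timestamp/_simulate?pretty" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

If the pipeline works as intended, calling simulate_doc_timestamp should return the sample document with a doc_timestamp field added to its _source.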

Attach the pipeline to an index

Here we attach the doc_timestamp pipeline to the account_v2 index by setting it as the default_pipeline for the index.

curl -X PUT 'http://localhost:9200/account_v2/_settings?pretty' -H 'Content-Type: application/json' -d '
{
  "index.default_pipeline": "doc_timestamp"
}'

{
  "acknowledged" : true
}

Now that the pipeline is attached to the index, every document added to the index from this point on gets a new doc_timestamp field. Documents that already exist in the index are not affected.

Let’s look up an existing document. It has no doc_timestamp field, which is expected.

curl -X GET 'localhost:9200/account_v2/_doc/735?pretty'

{
  "_index" : "account_v2",
  "_type" : "_doc",
  "_id" : "735",
  "_version" : 1,
  "_seq_no" : 344,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "account_number" : 735,
    "balance" : 3984,
    "firstname" : "Loraine",
    "lastname" : "Willis",
    "age" : 32,
    "gender" : "F",
    "address" : "928 Grove Street",
    "employer" : "Gadtron",
    "email" : "lorainewillis@gadtron.com",
    "city" : "Lowgap",
    "state" : "NY"
  }
}
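
If the existing documents also need the field, one possible approach is to run them through the same pipeline with the update by query API. This is only a sketch under assumptions: the function names and the ES_URL variable are hypothetical, and the backfilled documents get the time of the backfill as their doc_timestamp, not the time at which they were originally indexed.

```shell
# ES_URL and the function names are illustrative assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Selects only the documents that do not have a doc_timestamp field yet.
missing_timestamp_query() {
  cat <<'EOF'
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "doc_timestamp" } }
    }
  }
}
EOF
}

# Reindexes the selected documents in place, passing each one through
# the doc_timestamp pipeline.
backfill_doc_timestamp() {
  missing_timestamp_query | curl -s -X POST \
    "$ES_URL/account_v2/_update_by_query?pipeline=doc_timestamp&pretty" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

Because the backfilled timestamps reflect the backfill time, documents indexed between attaching the pipeline and running backfill_doc_timestamp would end up with older timestamps than the backfilled ones, so this only makes sense when that ordering does not matter.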

Add a new document to the index

Let’s add a new document to the index with id 2000.

curl -X PUT 'http://localhost:9200/account_v2/_doc/2000?pretty' -H 'Content-Type: application/json' -d '{
    "account_number": 2000,
    "balance": 16418,
    "firstname": "Elinor",
    "lastname": "Ratliff",
    "age": 36,
    "gender": "M",
    "address": "282 Kings Place",
    "employer": "Scentric",
    "email": "elinorratliff@scentric.com",
    "city": "Ribera",
    "state": "WA"
}'

{
  "_index" : "account_v2",
  "_type" : "_doc",
  "_id" : "2000",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 993,
  "_primary_term" : 1
}

Now that the document is added, let’s look it up. The document has a new doc_timestamp field that holds the timestamp at which it was added to the index.

curl -X GET 'localhost:9200/account_v2/_doc/2000?pretty'

{
  "_index" : "account_v2",
  "_type" : "_doc",
  "_id" : "2000",
  "_version" : 1,
  "_seq_no" : 993,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "account_number" : 2000,
    "firstname" : "Elinor",
    "address" : "282 Kings Place",
    "gender" : "M",
    "city" : "Ribera",
    "lastname" : "Ratliff",
    "balance" : 16418,
    "employer" : "Scentric",
    "state" : "WA",
    "age" : 36,
    "email" : "elinorratliff@scentric.com",
    "doc_timestamp" : "2020-11-19T20:39:33.639398617Z"
  }
}
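
Sorting by time only works if doc_timestamp is mapped as a date. With dynamic mapping the ISO 8601 string is normally detected as a date, but it can be worth checking. The sketch below uses the get field mapping API; the function names and the ES_URL variable are hypothetical, and the check is a simple text match on the response rather than a real JSON parse.

```shell
# ES_URL and the function names are illustrative assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Fetches the mapping of the doc_timestamp field only.
get_doc_timestamp_mapping() {
  curl -s -X GET "$ES_URL/account_v2/_mapping/field/doc_timestamp?pretty"
}

# Reads a mapping response on stdin and succeeds only if it contains
# a field of type "date" (a plain text match, not a JSON parser).
is_date_mapping() {
  grep -Eq '"type"[[:space:]]*:[[:space:]]*"date"'
}
```

A possible way to use it is: get_doc_timestamp_mapping | is_date_mapping && echo "doc_timestamp is mapped as a date"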

Latest document added to the index

Now that our documents have a field with the ingestion timestamp, we can find the latest document with a simple search: sort on the doc_timestamp field in descending order and return only the top result.

curl -X GET 'localhost:9200/account_v2/_search?pretty' -H 'Content-Type: application/json' -d'
{
   "size": 1,
   "sort": { "doc_timestamp": "desc"},
   "query": {
      "match_all": {}
   }
}'

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 994,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "account_v2",
        "_type" : "_doc",
        "_id" : "2000",
        "_score" : null,
        "_source" : {
          "account_number" : 2000,
          "firstname" : "Elinor",
          "address" : "282 Kings Place",
          "gender" : "M",
          "city" : "Ribera",
          "lastname" : "Ratliff",
          "balance" : 16418,
          "employer" : "Scentric",
          "state" : "WA",
          "age" : 36,
          "email" : "elinorratliff@scentric.com",
          "doc_timestamp" : "2020-11-19T20:39:33.639398617Z"
        },
        "sort" : [
          1605818373639
        ]
      }
    ]
  }
}
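
Documents indexed before the pipeline was attached have no doc_timestamp field. A slightly stricter variant of the search, sketched below, only considers documents that carry the field and returns just two fields of the hit. The function names, the ES_URL variable and the choice of returned fields are assumptions made for illustration.

```shell
# ES_URL and the function names are illustrative assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Search body: newest document first, restricted to documents that
# actually have a doc_timestamp, returning only two source fields.
latest_doc_body() {
  cat <<'EOF'
{
  "size": 1,
  "_source": ["account_number", "doc_timestamp"],
  "sort": [ { "doc_timestamp": { "order": "desc" } } ],
  "query": { "exists": { "field": "doc_timestamp" } }
}
EOF
}

# Runs the search against the account_v2 index.
find_latest_doc() {
  latest_doc_body | curl -s -X GET "$ES_URL/account_v2/_search?pretty" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

With the exists filter, an index that contains no timestamped documents yet returns zero hits instead of an arbitrary document.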
Big Data In Real World