Designing Index Structure for Large Volumes of Data in Elasticsearch

December 11, 2024 · 3 min read

Elasticsearch is a powerful distributed search and analytics engine, but handling large volumes of data requires careful index structure design. A poorly designed index can lead to performance degradation, increased storage costs, and reduced query efficiency.

Understand Your Data and Use Case

Before creating an index structure, analyze:
  • Data Volume: How much data will be ingested daily?
  • Data Retention: How long will you keep the data?
  • Query Patterns: What types of searches or aggregations will you run?
Key Considerations:
  • For time-series data, use time-based indices to enable efficient rollover and deletion.
  • For static or categorical datasets, use single indices with optimized mappings.
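For time-series data, one common pattern is to write each day's events into a date-stamped index so that old data can be removed by deleting whole indices. A minimal sketch, assuming a hypothetical daily logs index (the rollover approach shown later automates this):
POST logs-2024.12.11/_doc
{
  "timestamp": "2024-12-11T10:15:00Z",
  "user_id": "u-123",
  "message": "user login succeeded"
}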

Optimize Index and Shard Size

Why It Matters:
  • Each shard in Elasticsearch is a Lucene index and requires memory and disk resources.
  • Over-sharding leads to wasted resources, while under-sharding limits scalability.
Recommendations:
  • Aim for 20-50 GB per shard.
  • Use the _cat/indices API to monitor shard sizes.
  • Adjust shard count based on expected data volume.
number_of_shards: 3  # Example for moderate data volumes
number_of_replicas: 1
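A minimal sketch of applying these settings at index creation time, using a hypothetical index name (the shard count cannot be changed afterwards without reindexing or the shrink/split APIs):
PUT my_logs
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}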

Use Rollover for Time-Based Data

Why It Matters:
  • A single large index becomes unwieldy to manage and query.
  • Time-based indices allow efficient management and cleanup.
Implementation: use Index Lifecycle Management (ILM) to automate index rollover:
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
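For the rollover action to take effect, the policy must be attached to an index template that defines a rollover alias, and the first index must be bootstrapped with that alias as its write index. A minimal sketch, assuming indices named logs-* and a write alias called logs:
PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}

PUT logs-000001
{
  "aliases": {
    "logs": { "is_write_index": true }
  }
}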

Map Fields Efficiently

Why It Matters:
  • Dynamic mapping is convenient but can lead to excessive resource use.
  • Defining explicit mappings ensures better control over index size and performance.
Best Practices:
  • Disable dynamic mapping so that unexpected fields are not indexed automatically:
dynamic: false
  • Use appropriate field types:
    • keyword for exact matches.
    • text for full-text search.
    • date for time-based queries.
  • Avoid storing large arrays or nested fields unnecessarily.
Example Mapping:
PUT my_index
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "user_id": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}
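To tie this to the dynamic-mapping recommendation above, dynamic can be set directly in the mapping: false keeps unknown fields in _source without indexing them, while strict rejects documents that contain them. A minimal sketch:
PUT my_index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "timestamp": { "type": "date" },
      "user_id": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}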

Index Only What You Need

Why It Matters:
  • Indexing every field increases storage and processing overhead.
Recommendations:
  • Use enabled: false for fields that do not require indexing.
  • Store raw data in _source but exclude it from indexing if it’s not queried.
"properties": {
  "raw_data": {
    "type": "object",
    "enabled": false
  }
}
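A minimal sketch of a full mapping request combining both options, with hypothetical field names: enabled: false skips parsing and indexing of an entire object, while index: false keeps a field in _source without making it searchable.
PUT my_index
{
  "mappings": {
    "properties": {
      "raw_data":   { "type": "object",  "enabled": false },
      "debug_info": { "type": "keyword", "index": false }
    }
  }
}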

Leverage Compression and Storage Optimizations

Why It Matters:
  • Compression reduces disk usage without significantly affecting performance.
Best Practices:
  • Use best_compression for less frequently queried indices:
index.codec: best_compression
  • Minimize the number of replicas for indices that do not require high availability.
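Note that index.codec can only be set when an index is created (or while it is closed), so compression is typically applied to new or archived indices. A minimal sketch with a hypothetical archive index that also drops replicas:
PUT archived_logs-2024.11
{
  "settings": {
    "index.codec": "best_compression",
    "number_of_replicas": 0
  }
}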

Monitor and Tune Shard Allocation

Why It Matters:
  • Uneven shard distribution can cause cluster imbalances.
Recommendations:
  • Use the _cat/allocation API to monitor shard allocation.
  • Set shard allocation awareness to distribute shards across availability zones or racks:
cluster.routing.allocation.awareness.attributes: rack_id
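Awareness requires each node to expose the attribute (for example, node.attr.rack_id in elasticsearch.yml); the cluster-level setting can then be applied dynamically. A minimal sketch using the cluster settings API:
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "rack_id"
  }
}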

Implement Query and Indexing Throttling

Why It Matters:
  • High query or indexing rates can overwhelm the cluster.
Best Practices:
  • Keep bulk requests to a manageable size and enable the indexing slow log to spot operations that strain the cluster:
curl -XPUT "localhost:9200/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "indexing.slowlog.threshold.index.warn": "10s"
  }
}'
  • Optimize queries to use filters and avoid expensive wildcard searches.
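For the last point, wrapping exact-match and range conditions in a bool filter clause lets Elasticsearch cache the results and skip scoring. A minimal sketch against the example mapping above:
GET my_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "user_id": "u-123" } },
        { "range": { "timestamp": { "gte": "now-24h" } } }
      ]
    }
  }
}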

Test and Validate Index Structure

Key Steps:
  • Load test the index with realistic data and query patterns.
  • Use tools like Rally or Kibana’s Dev Tools to benchmark performance.
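A minimal sketch of a Rally run against an existing cluster, assuming Rally is installed and using one of its bundled tracks (substitute a custom track built from your own data for realistic results):
esrally race --track=http_logs --target-hosts=localhost:9200 --pipeline=benchmark-only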

Regularly Monitor and Maintain

Metrics to Watch:
  • Shard sizes (_cat/shards).
  • Query latency and resource usage (_nodes/stats).
  • Cluster health (_cluster/health).
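These metrics can be pulled directly from the cluster; a minimal sketch of the corresponding requests:
GET _cat/shards?v
GET _nodes/stats
GET _cluster/health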
Use Kibana or external tools like Metricbeat and Grafana for visualization.
For more details, refer to the official Elasticsearch documentation.

Join SOC Prime's Detection as Code platform to improve visibility into threats most relevant to your business. To help you get started and drive immediate value, book a meeting now with SOC Prime experts.
