Terraform Data Sources

Introduction

When working with Terraform, you'll often need to reference resources or information that already exists in your infrastructure or that's defined outside of your Terraform configuration. This is where data sources come in.

Data sources in Terraform allow you to fetch or compute values that can be used elsewhere in your configuration. Unlike resources that create and manage infrastructure, data sources only read information. Think of them as read-only queries that bring external information into your Terraform project.

What Are Data Sources?

Data sources are a way to query external systems and fetch data without actually creating or modifying anything. They help you:

  • Reference existing infrastructure not managed by Terraform
  • Fetch information from your cloud provider
  • Query attributes of resources managed in other Terraform configurations
  • Import data from external systems or APIs

Basic Syntax

The basic syntax for a data source is:

```hcl
data "provider_type" "name" {
  [CONFIG ...]
}
```

Where:

  • provider_type is the data source type (such as aws_ami or azurerm_resource_group); its prefix identifies the provider it belongs to
  • name is a unique local identifier you choose for this data source
  • [CONFIG ...] represents the configuration arguments specific to that data source type

Accessing Data Source Attributes

Once you've declared a data source, you can reference its attributes using the syntax:

```hcl
data.provider_type.name.attribute
```

Common Data Source Examples

Example 1: Finding the Latest Amazon Machine Image (AMI)

One of the most common uses of data sources is to find the latest Amazon Machine Image for EC2 instances:

```hcl
data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical's AWS account ID
}

resource "aws_instance" "web_server" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"

  tags = {
    Name = "WebServer"
  }
}
```

What's happening here?

  1. We define a data source of type aws_ami named "ubuntu"
  2. We set filters to find Ubuntu 20.04 images with HVM virtualization
  3. We specify that we want the most recent image
  4. We use the AMI ID from this data source when creating our EC2 instance

Output: When you run terraform apply, Terraform will:

  1. Query AWS to find the latest Ubuntu image matching our criteria
  2. Store all the image attributes in the data source
  3. Use the image ID when creating the EC2 instance

Example 2: Reading Environment Variables

The external data source (from HashiCorp's external utility provider) runs a program and captures its JSON output, which makes it possible to read environment variables:

```hcl
data "external" "env" {
  # The program must print a single flat JSON object of string values,
  # so the matched variables are folded into one object with jq's from_entries
  program = ["bash", "-c", "env | grep '^TF_' | jq -R 'split(\"=\") | {key: .[0], value: (.[1:] | join(\"=\"))}' | jq -s 'from_entries'"]
}

output "environment_variables" {
  value     = data.external.env.result
  sensitive = true
}
```

What's happening here?

  1. We use the external data source to run a command
  2. The command emits the TF_-prefixed environment variables as a single flat JSON object of strings, which is the format the external data source requires
  3. We capture these in an output variable
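
Because result is a map of strings, individual entries can be read with lookup() or index syntax. A small sketch (TF_LOG is just an illustrative key):

```hcl
output "tf_log_level" {
  # Returns the TF_LOG entry if present, otherwise the default "unset"
  value = lookup(data.external.env.result, "TF_LOG", "unset")
}
```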

Example 3: Fetching IP Ranges for a Service

You can use data sources to get IP ranges for services like AWS:

```hcl
data "aws_ip_ranges" "ec2" {
  regions  = ["us-east-1", "us-west-2"]
  services = ["ec2"]
}

resource "aws_security_group" "from_ec2" {
  name = "from_ec2"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = data.aws_ip_ranges.ec2.cidr_blocks
  }
}
```

What's happening here?

  1. We fetch all IP ranges used by EC2 in specific regions
  2. We use these IP ranges to create security group rules

How Data Sources Work

Let's break down the lifecycle of a data source:

When you run terraform apply:

  1. Terraform identifies all data sources in your configuration
  2. For each data source, Terraform makes API calls to fetch the requested information
  3. The data is stored in the Terraform state file
  4. Resources that reference the data source can access its attributes
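
This lifecycle is easy to observe with a minimal configuration, sketched here under the assumption that AWS credentials are configured:

```hcl
# Read-only query; nothing is created or modified
data "aws_availability_zones" "all" {}

output "zones" {
  value = data.aws_availability_zones.all.names
}
```

After terraform apply, the fetched zone names are recorded in state, and terraform state list shows the query as data.aws_availability_zones.all alongside any managed resources.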

Advanced Usage Patterns

Computed Data Sources

Some data sources can accept arguments that come from other resources:

```hcl
resource "aws_vpc" "example" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "example-vpc"
  }
}

data "aws_vpc_endpoint_service" "s3" {
  service      = "s3"
  service_type = "Gateway"
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.example.id
  service_name = data.aws_vpc_endpoint_service.s3.service_name
}
```

What's happening here?

  1. We create a VPC resource
  2. We query for the S3 endpoint service details
  3. We create a VPC endpoint for S3 using both the VPC ID and the service name from the data source

Filtering and Selecting

Many data sources allow you to filter results. (The aws_subnet_ids data source that appears in older guides was deprecated in v4 of the AWS provider; aws_subnets is its replacement.)

```hcl
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [aws_vpc.main.id]
  }

  tags = {
    Tier = "Private"
  }
}

resource "aws_lb" "internal" {
  name               = "internal-lb"
  internal           = true
  load_balancer_type = "application"
  subnets            = data.aws_subnets.private.ids
}
```

What's happening here?

  1. We're finding all subnets in a VPC that have the tag Tier = "Private"
  2. We're using these subnet IDs for an internal load balancer

Local Data Sources

Terraform also has data sources that don't interact with providers:

```hcl
data "local_file" "config" {
  filename = "${path.module}/config.json"
}

output "config_content" {
  value = jsondecode(data.local_file.config.content)
}
```

What's happening here?

  1. We read a local file's content
  2. We decode it as JSON and output it
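
The decoded value can then feed resource arguments. A sketch, assuming config.json contains ami_id and instance_type keys (both hypothetical):

```hcl
locals {
  config = jsondecode(data.local_file.config.content)
}

resource "aws_instance" "from_config" {
  # ami_id and instance_type are assumed keys in config.json
  ami           = local.config.ami_id
  instance_type = local.config.instance_type
}
```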

Common Data Sources by Provider

AWS

  • aws_ami: Find Amazon Machine Images
  • aws_availability_zones: List available AZs
  • aws_region: Get current region details
  • aws_vpc: Find an existing VPC
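
Two of these can be combined in a short sketch that looks up an existing VPC by its Name tag (the tag value here is an assumption) along with the current region:

```hcl
data "aws_region" "current" {}

data "aws_vpc" "shared" {
  # "shared-vpc" is a hypothetical Name tag on an existing VPC
  tags = {
    Name = "shared-vpc"
  }
}

output "vpc_context" {
  value = "VPC ${data.aws_vpc.shared.id} in ${data.aws_region.current.name}"
}
```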

Azure

  • azurerm_resource_group: Reference an existing resource group
  • azurerm_virtual_network: Get an existing VNet
  • azurerm_subscription: Get details about the current subscription

Google Cloud

  • google_compute_image: Find a compute image
  • google_compute_zones: List available zones
  • google_project: Get details about a project

Best Practices

  1. Use data sources for read-only operations

    • Data sources should only read information, not modify it
  2. Handle changes gracefully

    • Data sources might return different values over time (e.g., latest AMI)
    • Consider using version constraints when applicable
  3. Cache data when appropriate

    • Some data sources make API calls that count against rate limits
    • For static data, you might want to store the values in variables instead
  4. Use count or for_each with data sources

    • A data source's results can drive count or for_each to fan out resources:

```hcl
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "primary" {
  count             = length(data.aws_availability_zones.available.names)
  vpc_id            = aws_vpc.main.id
  availability_zone = data.aws_availability_zones.available.names[count.index]
  cidr_block        = "10.0.${count.index}.0/24"
}
```
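
For points 2 and 3, a common pattern is an escape hatch that lets you freeze a looked-up value. A sketch that assumes the aws_ami.ubuntu data source from Example 1 (the variable name is ours):

```hcl
variable "pinned_ami_id" {
  description = "Set to a known-good AMI ID to freeze the image; leave empty to use the latest lookup"
  type        = string
  default     = ""
}

resource "aws_instance" "app" {
  # Prefer the pinned ID when provided, otherwise fall back to the dynamic lookup
  ami           = var.pinned_ami_id != "" ? var.pinned_ami_id : data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
}
```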

Common Pitfalls

  1. Data source depends on a resource that doesn't exist yet

    • Solution: Use depends_on to create explicit dependencies
  2. Data source returns too many or too few results

    • Solution: Refine your filters or use more specific queries
  3. Values change between plan and apply

    • Solution: For critical values that shouldn't change, consider hardcoding them
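
Pitfall 1 usually appears when a data source queries by tag rather than by attribute reference, so Terraform sees no dependency between the two blocks. A sketch of the depends_on fix:

```hcl
resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"

  tags = {
    Tier = "Private"
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [aws_vpc.main.id]
  }

  tags = {
    Tier = "Private"
  }

  # Nothing above references the subnet resource directly,
  # so the ordering must be made explicit
  depends_on = [aws_subnet.private]
}
```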

Summary

Data sources are a powerful feature in Terraform that allow you to query and use information from existing infrastructure or external systems. They help you:

  • Reference resources created outside of Terraform
  • Fetch dynamic information from your cloud provider
  • Make your configurations more flexible and reusable
  • Integrate with existing infrastructure

By mastering data sources, you can build more dynamic and adaptable Terraform configurations that work seamlessly with both managed and unmanaged resources.

Exercises

  1. Use the aws_ami data source to find the latest Amazon Linux 2 AMI in your region.
  2. Create a configuration that uses data sources to fetch all availability zones in your region, and then creates a subnet in each one.
  3. Use the http data source to fetch information from a public API and use it in your Terraform configuration.
  4. Create a data source that queries for existing security groups with specific tags, and then references them in a new EC2 instance.


If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)