Merged
2 changes: 1 addition & 1 deletion .github/workflows/build-and-test.yml
@@ -135,7 +135,7 @@ jobs:
conda config --env --add pinned_packages python=$PYTHON_VERSION
conda config --env --add pinned_packages pandas==$PANDAS_VERSION
conda config --env --add pinned_packages pyarrow==$PYARROW_VERSION
conda install -c conda-forge --yes pandas==$PANDAS_VERSION pyarrow==$PYARROW_VERSION
conda install -c conda-forge --yes pandas==$PANDAS_VERSION pyarrow==$PYARROW_VERSION pip
sed -i -e "/pandas/d" -e "/pyarrow/d" python/requirements-dev.txt
conda install -c conda-forge --yes --file python/requirements-dev.txt
conda list
32 changes: 32 additions & 0 deletions README.md
@@ -436,6 +436,38 @@ Download the pre-built package `delta-sharing-server-x.y.z.zip` from [GitHub Rel
- Make changes to your yaml file. You may also need to update some server configs for special requirements.
- To add Shared Data, add reference to Delta Lake tables you would like to share from this server in this config file.

### Optional Share Egress Access Logging

The server can emit share-attributed structured access log entries for each query and CDF request.

Fields emitted in each log line:

- `share`
- `schema`
- `table`
- `requestType` (`query` or `cdf_stream`)
- `egressBytes`
- `pricingTier`
- `timestampMs`
- `clientRegion` (if available)

To enable this feature, configure the `accessLogging` block in the server yaml.

Example yaml:

```yaml
shares:
- name: "share1"
  schemas:
  - name: "schema1"
    tables:
    - name: "table1"
      location: "gs://my-bucket/my-table"

accessLogging:
  enabled: true
```

## Config the server to access tables on cloud storage

We support sharing Delta Lake tables on S3, Azure Blob Storage and Azure Data Lake Storage Gen2.
209 changes: 209 additions & 0 deletions docs/PER_SHARE_EGRESS_MONITORING.md
@@ -0,0 +1,209 @@
# Per-Share Egress Monitoring

## Overview

This feature emits structured access logs for egress tracking and automatically classifies each request into a GCP pricing tier.

When enabled, each data access emits JSON logs with:
- Share/schema/table identification
- Total egress bytes
- Pricing tier (e.g., `internet_to_na_eu`, `interregion_na_to_eu`)
- Client region code

---

## Pricing Tiers

### 1. Free/Internal Traffic

| Tier | Description | Approximate Cost |
|------|-------------|------------------|
| `same_zone` | Traffic within the same GCP zone | Free |
| `same_region` | Traffic within the same region or Kubernetes cluster | ~Free |

### 2. Inter-Region GCP Traffic

Traffic between GCP services (e.g., between regions via service mesh):

| Tier | Route | Approximate Cost |
|------|-------|------------------|
| `interregion_na_to_na` | North America → North America | $0.02/GiB |
| `interregion_eu_to_eu` | Europe → Europe | $0.02/GiB |
| `interregion_na_to_eu` | North America → Europe | $0.05/GiB |
| `interregion_eu_to_na` | Europe → North America | $0.05/GiB |
| `interregion_to_apac` | Any region → Asia Pacific | $0.08/GiB |
| `interregion_to_oceania` | Any region → Australia/Oceania | $0.10/GiB |
| `interregion_to_latam` | Any region → Latin America | $0.14/GiB |

### 3. Internet Egress (Premium Tier)

Traffic leaving GCP to external clients:

| Tier | Destination | Approximate Cost |
|------|-------------|------------------|
| `internet_to_na_eu` | North America or Europe | $0.12/GiB |
| `internet_to_apac` | Asia Pacific | $0.12/GiB |
| `internet_to_latam` | Latin America | $0.19/GiB |
| `internet_to_oceania` | Australia/Oceania | $0.15/GiB |

### Special Cases

| Tier | Description |
|------|-------------|
| `unknown` | Unable to determine pricing tier |
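
As a rough illustration of how the tier names and approximate costs above could be combined, the following Python sketch estimates a cost from a tier and a byte count. The rate table is copied from this document and is approximate; the function name `estimate_cost_usd` and the choice to treat `same_region` ("~Free") as zero are assumptions made for the example, not part of the server.

```python
GIB = 1024 ** 3  # the tables above quote prices per GiB

# Approximate USD per GiB, copied from the tables in this document.
# These are illustrative values, not a pricing source of truth.
APPROX_USD_PER_GIB = {
    "same_zone": 0.0,
    "same_region": 0.0,  # listed as "~Free"; treated as zero in this sketch
    "interregion_na_to_na": 0.02,
    "interregion_eu_to_eu": 0.02,
    "interregion_na_to_eu": 0.05,
    "interregion_eu_to_na": 0.05,
    "interregion_to_apac": 0.08,
    "interregion_to_oceania": 0.10,
    "interregion_to_latam": 0.14,
    "internet_to_na_eu": 0.12,
    "internet_to_apac": 0.12,
    "internet_to_latam": 0.19,
    "internet_to_oceania": 0.15,
}


def estimate_cost_usd(pricing_tier, egress_bytes):
    """Return an approximate cost in USD, or None when the tier has no known rate."""
    rate = APPROX_USD_PER_GIB.get(pricing_tier)
    if rate is None:
        # Covers the `unknown` tier and any tier name not listed above.
        return None
    return egress_bytes / GIB * rate
```

Returning `None` for `unknown` keeps unclassified traffic visible instead of silently counting it as free.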

---

## How Pricing Tiers Are Resolved

### Step 1: Determine Egress Type

```
┌─────────────────────────────────────────────────────────────────┐
│ Egress Type Detection │
├─────────────────────────────────────────────────────────────────┤
│ 1. Client IP is private (10.x, 172.16-31.x, 192.168.x)? │
│ → SAME_REGION (internal cluster traffic) │
│ │
│ 2. GCP IP Range Lookup (from cloud.json) finds client IP? │
│ a. Client GCP region matches sourceRegion? │
│ → SAME_REGION │
│ b. Client GCP region differs from sourceRegion? │
│ → INTER_REGION (with exact destination region) │
│ │
│ 3. Client IP looks like GCP (34.x/35.x) but not in ranges? │
│ → INTER_REGION (conservative fallback) │
│ │
│ 4. Non-GCP IP with valid country code? │
│ → INTERNET │
│ │
│ 5. No region header? │
│ → UNKNOWN │
└─────────────────────────────────────────────────────────────────┘
```
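
The decision flow above can be sketched in Python as follows. This is a minimal illustration, not the Scala implementation: the function name, the `lookup_gcp_region` callable (an IP string in, a GCP region or `None` out), and the behaviour when the client IP is missing or unparsable (fall through to the country-code check) are assumptions.

```python
import ipaddress

# The three private ranges named in step 1 of the diagram.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def detect_egress_type(client_ip, country_code, source_region, lookup_gcp_region):
    """Classify a request as SAME_REGION, INTER_REGION, INTERNET or UNKNOWN."""
    addr = None
    if client_ip:
        try:
            addr = ipaddress.ip_address(client_ip)
        except ValueError:
            addr = None  # assumption: an unparsable IP is treated as missing

    if addr is not None:
        # 1. Private address: internal cluster traffic.
        if any(addr in net for net in PRIVATE_NETWORKS):
            return "SAME_REGION"
        # 2. Address found in the published GCP ranges.
        region = lookup_gcp_region(client_ip)
        if region is not None:
            return "SAME_REGION" if region == source_region else "INTER_REGION"
        # 3. Looks like GCP (34.x / 35.x) but is not in the ranges.
        if client_ip.startswith(("34.", "35.")):
            return "INTER_REGION"

    # 4. Non-GCP client with a country code.
    if country_code:
        return "INTERNET"
    # 5. Nothing to go on.
    return "UNKNOWN"
```

For example, `detect_egress_type("10.1.2.3", None, "us-central1", lambda ip: None)` takes the first branch and yields `SAME_REGION`.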

The system fetches GCP's published IP ranges from `https://www.gstatic.com/ipranges/cloud.json`
and uses a CIDR trie for efficient IP-to-region lookup (refreshed every 24 hours).
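
To make the IP-to-region lookup concrete, here is a small Python sketch. It uses a linear longest-prefix scan rather than a CIDR trie, which is simpler but slower, and it works on an inline sample instead of fetching the URL. It assumes the file has a top-level `prefixes` list whose entries carry `ipv4Prefix` or `ipv6Prefix` plus a `scope` region; the sample prefixes are documentation-only placeholders, not real GCP ranges, and the function names are hypothetical.

```python
import ipaddress

# Placeholder data in the assumed cloud.json layout (NOT real GCP ranges).
SAMPLE_CLOUD_JSON = {
    "prefixes": [
        {"ipv4Prefix": "198.51.100.0/24", "scope": "us-central1"},
        {"ipv4Prefix": "198.51.100.128/25", "scope": "europe-west3"},
        {"ipv6Prefix": "2001:db8::/32", "scope": "asia-east1"},
    ]
}


def parse_ranges(cloud_json):
    """Turn the prefixes list into (network, region) pairs."""
    ranges = []
    for entry in cloud_json.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        scope = entry.get("scope")
        if cidr and scope:
            ranges.append((ipaddress.ip_network(cidr, strict=False), scope))
    return ranges


def lookup_region(ranges, client_ip):
    """Return the region of the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(client_ip)
    best = None
    for net, scope in ranges:
        # Membership is False when the IP versions differ, so v4/v6 can be mixed.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, scope)
    return best[1] if best else None


RANGES = parse_ranges(SAMPLE_CLOUD_JSON)
```

With the sample data, `lookup_region(RANGES, "198.51.100.200")` prefers the more specific `/25` entry over the enclosing `/24`.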

### Step 2: Determine Continents

Continents are derived from the source region (server configuration) and the destination (detected from headers):

**GCP Regions → Continent Mapping:**
- `us-*`, `northamerica-*` → North America (NA)
- `europe-*` → Europe (EU)
- `asia-*` → Asia Pacific (APAC)
- `southamerica-*` → Latin America (LATAM)
- `australia-*` → Oceania (OCEANIA)

**Country Codes → Continent Mapping:**
- US, CA, MX → NA
- GB, IE, DE, FR, NL, BE, CH, AT, ES, PT, IT, SE, NO, DK, FI, PL, CZ, HU, RO, etc. → EU
- JP, KR, CN, HK, TW, SG, MY, ID, TH, VN, PH, IN, PK, BD → APAC
- AU, NZ → OCEANIA
- BR, AR, CL, CO, PE, VE, EC, etc. → LATAM
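
The two mappings above could be expressed roughly as below. This Python sketch only includes the prefixes and country codes spelled out in the lists (the "etc." entries are not filled in), and the helper names are hypothetical.

```python
# GCP region prefix -> continent, in the order listed above.
REGION_PREFIX_TO_CONTINENT = [
    ("us-", "NA"),
    ("northamerica-", "NA"),
    ("europe-", "EU"),
    ("asia-", "APAC"),
    ("southamerica-", "LATAM"),
    ("australia-", "OCEANIA"),
]


def _codes(csv, continent):
    """Expand a comma-separated list of country codes into a dict."""
    return {code.strip(): continent for code in csv.split(",")}


# Only the codes named explicitly in this document; the real lists are longer.
COUNTRY_TO_CONTINENT = {
    **_codes("US, CA, MX", "NA"),
    **_codes("GB, IE, DE, FR, NL, BE, CH, AT, ES, PT, IT, SE, NO, DK, FI, PL, CZ, HU, RO", "EU"),
    **_codes("JP, KR, CN, HK, TW, SG, MY, ID, TH, VN, PH, IN, PK, BD", "APAC"),
    **_codes("AU, NZ", "OCEANIA"),
    **_codes("BR, AR, CL, CO, PE, VE, EC", "LATAM"),
}


def continent_for_region(region):
    """Map a GCP region such as 'europe-west3' to a continent code, or None."""
    lowered = region.lower()
    for prefix, continent in REGION_PREFIX_TO_CONTINENT:
        if lowered.startswith(prefix):
            return continent
    return None


def continent_for_country(country_code):
    """Map a two-letter country code to a continent code, or None."""
    return COUNTRY_TO_CONTINENT.get(country_code.upper())
```

Both helpers return `None` for inputs outside the listed mappings, which a caller would likely translate into the `unknown` tier.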

### Step 3: Calculate Pricing Tier

Based on egress type and continent pair:

| Egress Type | Calculation Method |
|-------------|-------------------|
| `SAME_ZONE` | Returns `same_zone` |
| `SAME_REGION` | Returns `same_region` |
| `INTER_REGION` | Based on source→destination continent pair |
| `INTERNET` | Based on destination continent only |
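
A hedged Python sketch of this step, using the tier names from the Pricing Tiers tables. The function name is hypothetical, and returning `unknown` for continent pairs the tables do not cover (for example APAC to NA) is an assumption of the example.

```python
# Internet egress depends on the destination continent only.
INTERNET_TIERS = {
    "NA": "internet_to_na_eu",
    "EU": "internet_to_na_eu",
    "APAC": "internet_to_apac",
    "LATAM": "internet_to_latam",
    "OCEANIA": "internet_to_oceania",
}

# Inter-region tiers that apply to any source continent.
INTER_REGION_BY_DEST = {
    "APAC": "interregion_to_apac",
    "OCEANIA": "interregion_to_oceania",
    "LATAM": "interregion_to_latam",
}

# Inter-region tiers keyed by (source, destination) continent.
INTER_REGION_PAIRS = {
    ("NA", "NA"): "interregion_na_to_na",
    ("EU", "EU"): "interregion_eu_to_eu",
    ("NA", "EU"): "interregion_na_to_eu",
    ("EU", "NA"): "interregion_eu_to_na",
}


def resolve_pricing_tier(egress_type, source_continent=None, dest_continent=None):
    """Map an egress type and continent pair to a pricing tier name."""
    if egress_type == "SAME_ZONE":
        return "same_zone"
    if egress_type == "SAME_REGION":
        return "same_region"
    if egress_type == "INTER_REGION":
        if dest_continent in INTER_REGION_BY_DEST:
            return INTER_REGION_BY_DEST[dest_continent]
        # Assumption: pairs not listed in the tables fall back to `unknown`.
        return INTER_REGION_PAIRS.get((source_continent, dest_continent), "unknown")
    if egress_type == "INTERNET":
        return INTERNET_TIERS.get(dest_continent, "unknown")
    return "unknown"
```

For a server in `us-central1` and an internet client in the US, `resolve_pricing_tier("INTERNET", "NA", "NA")` gives `internet_to_na_eu`.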

---

## Configuration

```yaml
accessLogging:
  enabled: true
  sourceRegion: "us-central1"             # GCP region where server runs
  detectGcpTraffic: true                  # Enable GCP IP range lookup
  clientRegionHeader: "x-client-region"   # Header with country code
  clientIpHeader: "x-forwarded-for"       # Header with client IP chain
```

### GCP Load Balancer Headers

Configure custom request headers on your backend:
```
X-Client-Region: {client_region}
X-Client-Region-Subdivision: {client_region_subdivision}
```

---

## Log Output

**ACCESS_LOG** — Emitted for each request with non-zero egress:
```json
{
  "logType": "ACCESS_LOG",
  "share": "myshare",
  "schema": "myschema",
  "table": "mytable",
  "egressBytes": 1048576,
  "pricingTier": "internet_to_na_eu",
  "timestampMs": 1717502400000,
  "requestType": "query",
  "clientRegion": "US"
}
```

**PRICING_CONTEXT** — Debug entry with detection details:
```json
{
  "logType": "PRICING_CONTEXT",
  "clientIp": "203.0.113.45",
  "isGcpIp": false,
  "egressType": "internet",
  "sourceRegion": "us-central1",
  "destinationContinent": "NA",
  "pricingTier": "internet_to_na_eu"
}
```

**REQUEST_HEADERS** — All headers for debugging (keys lowercased).
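
One way a consumer might use `ACCESS_LOG` entries is to total egress bytes per share and pricing tier. The Python sketch below assumes each log line is a bare JSON object (no logger prefix), which depends on how the `delta.sharing.access` logger is configured; the function name is hypothetical.

```python
import json
from collections import defaultdict


def aggregate_egress(lines):
    """Sum egressBytes per (share, pricingTier) over ACCESS_LOG lines."""
    totals = defaultdict(int)
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict) or entry.get("logType") != "ACCESS_LOG":
            continue  # skip PRICING_CONTEXT, REQUEST_HEADERS and anything else
        key = (entry.get("share"), entry.get("pricingTier"))
        totals[key] += int(entry.get("egressBytes", 0))
    return dict(totals)
```

Keying on the pair rather than on the share alone keeps the totals usable for cost attribution, since one share can be read from clients in several tiers.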

---

## Header Detection Priority

**Region Headers:**
1. Configured `clientRegionHeader` (default: `x-client-region`)
2. `x-appengine-country`
3. `cf-ipcountry`
4. `cloudfront-viewer-country`

**Client IP Headers:**
1. Configured `clientIpHeader` (default: `x-forwarded-for`)
2. `x-envoy-external-address`
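
The priority order above amounts to "first non-empty header wins". A minimal Python sketch follows; the helper names are hypothetical, and taking the leftmost entry of the `x-forwarded-for` chain as the client IP is an assumption, since which entry can be trusted depends on the proxies in front of the server.

```python
REGION_FALLBACK_HEADERS = ["x-appengine-country", "cf-ipcountry", "cloudfront-viewer-country"]
IP_FALLBACK_HEADERS = ["x-envoy-external-address"]


def first_header(headers, names):
    """Return the first non-empty value among `names`, matching case-insensitively."""
    lowered = {key.lower(): value for key, value in headers.items()}
    for name in names:
        value = lowered.get(name.lower(), "").strip()
        if value:
            return value
    return None


def resolve_client_region(headers, configured="x-client-region"):
    """Configured region header first, then the documented fallbacks."""
    return first_header(headers, [configured] + REGION_FALLBACK_HEADERS)


def resolve_client_ip(headers, configured="x-forwarded-for"):
    """Configured IP header first, then the documented fallback."""
    value = first_header(headers, [configured] + IP_FALLBACK_HEADERS)
    if value is None:
        return None
    # Assumption: the leftmost address in a comma-separated chain is the client.
    return value.split(",")[0].strip()
```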

---

## Implementation

| File | Purpose |
|------|---------|
| `GcpPricingTier.scala` | Continent mapping, egress type detection, pricing calculation |
| `GcpIpRangeLookup.scala` | GCP IP range fetching and CIDR trie lookup |
| `AccessLogEmitter.scala` | Log entry models and JSON emission |
| `DeltaSharingService.scala` | Integration for query/CDF endpoints |
| `ServerConfig.scala` | `AccessLoggingConfig` model |

### Egress Bytes Calculation

Sum of `size` from all file actions: `AddFile`, `AddFileForCDF`, `AddCDCFile`.
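
As a small illustration of that sum, the Python sketch below represents each action as a hypothetical `(type, size)` pair; the real server works on its own Scala action classes.

```python
# Action types whose `size` contributes to egress, as listed above.
COUNTED_ACTIONS = {"AddFile", "AddFileForCDF", "AddCDCFile"}


def total_egress_bytes(actions):
    """Sum the sizes of the counted file actions; other action types are ignored."""
    return sum(size for kind, size in actions if kind in COUNTED_ACTIONS)
```

A result of zero would correspond to a request that is not logged.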

---

## Notes

- Logs emitted to `delta.sharing.access` logger
- Zero-byte requests not logged
- GCP IP ranges refreshed every 24 hours
- Pricing tiers follow GCP's published network pricing; the per-GiB costs listed above are approximate
6 changes: 6 additions & 0 deletions manifests/base/configmap.yaml
@@ -26,3 +26,9 @@ data:
    queryTablePageSizeLimit: 10000
    queryTablePageTokenTtlMs: 259200000
    refreshTokenTtlMs: 3600000
    accessLogging:
      enabled: true
      sourceRegion: "us-central1"
      detectGcpTraffic: true
      clientRegionHeader: "x-client-region"
      clientIpHeader: "x-forwarded-for"
11 changes: 9 additions & 2 deletions manifests/base/deployment.yaml
@@ -27,8 +27,8 @@ spec:
- -c
- |
echo "Merging base config and shares config..."
# Replace bearer token placeholder in base config
sed "s/BEARER_TOKEN_PLACEHOLDER/${BEARER_TOKEN}/g" /base-config/base-config.yaml > /tmp/base.yaml
# Replace placeholders in base config
sed -e "s/BEARER_TOKEN_PLACEHOLDER/${BEARER_TOKEN}/g" -e "s/GCP_PROJECT_ID_PLACEHOLDER/${GCP_PROJECT_ID}/g" /base-config/base-config.yaml > /tmp/base.yaml
# Merge base config with shares config
yq eval-all 'select(fileIndex == 0) * select(fileIndex == 1)' /tmp/base.yaml /shares-config/shares-config.yaml > /shared-config/delta-sharing-server-config.yaml
echo "Config merged successfully"
@@ -39,6 +39,12 @@ spec:
secretKeyRef:
name: delta-sharing-auth
key: bearer-token
- name: GCP_PROJECT_ID
valueFrom:
configMapKeyRef:
name: project-common
key: PROJECT_ID
optional: true
volumeMounts:
- name: base-config
mountPath: /base-config
@@ -54,6 +60,7 @@ spec:
configMapKeyRef:
name: project-common
key: PROJECT_ID
optional: true
- name: ZC_API_PROXY_DL_SHARING_BEARER
valueFrom:
secretKeyRef:
29 changes: 29 additions & 0 deletions manifests/zcloud-emea/configmap.yaml
@@ -0,0 +1,29 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: delta-sharing-base-config
  namespace: default
data:
  base-config.yaml: |
    version: 1
    authorization:
      bearerToken: "BEARER_TOKEN_PLACEHOLDER"
    host: "0.0.0.0"
    port: 8080
    endpoint: "/delta-sharing"
    preSignedUrlTimeoutSeconds: 3600
    deltaTableCacheSize: 100
    stalenessAcceptable: false
    evaluatePredicateHints: false
    evaluateJsonPredicateHints: true
    evaluateJsonPredicateHintsV2: true
    requestTimeoutSeconds: 180
    queryTablePageSizeLimit: 10000
    queryTablePageTokenTtlMs: 259200000
    refreshTokenTtlMs: 3600000
    accessLogging:
      enabled: true
      sourceRegion: "europe-west3"
      detectGcpTraffic: true
      clientRegionHeader: "x-client-region"
      clientIpHeader: "x-forwarded-for"
2 changes: 2 additions & 0 deletions manifests/zcloud-emea/kustomization.yaml
@@ -2,3 +2,5 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../base
patchesStrategicMerge:
- configmap.yaml
6 changes: 6 additions & 0 deletions manifests/zcloud-prod/configmap.yaml
@@ -21,3 +21,9 @@ data:
    queryTablePageSizeLimit: 10000
    queryTablePageTokenTtlMs: 259200000
    refreshTokenTtlMs: 3600000
    accessLogging:
      enabled: true
      sourceRegion: "us-central1"
      detectGcpTraffic: true
      clientRegionHeader: "x-client-region"
      clientIpHeader: "x-forwarded-for"
6 changes: 6 additions & 0 deletions manifests/zcloud-prod2/configmap.yaml
@@ -21,3 +21,9 @@ data:
    queryTablePageSizeLimit: 10000
    queryTablePageTokenTtlMs: 259200000
    refreshTokenTtlMs: 3600000
    accessLogging:
      enabled: true
      sourceRegion: "us-west4"
      detectGcpTraffic: true
      clientRegionHeader: "x-client-region"
      clientIpHeader: "x-forwarded-for"
29 changes: 29 additions & 0 deletions manifests/zcloud-prod3/configmap.yaml
@@ -0,0 +1,29 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: delta-sharing-base-config
  namespace: default
data:
  base-config.yaml: |
    version: 1
    authorization:
      bearerToken: "BEARER_TOKEN_PLACEHOLDER"
    host: "0.0.0.0"
    port: 8080
    endpoint: "/delta-sharing"
    preSignedUrlTimeoutSeconds: 3600
    deltaTableCacheSize: 100
    stalenessAcceptable: false
    evaluatePredicateHints: false
    evaluateJsonPredicateHints: true
    evaluateJsonPredicateHintsV2: true
    requestTimeoutSeconds: 180
    queryTablePageSizeLimit: 10000
    queryTablePageTokenTtlMs: 259200000
    refreshTokenTtlMs: 3600000
    accessLogging:
      enabled: true
      sourceRegion: "australia-southeast1"
      detectGcpTraffic: true
      clientRegionHeader: "x-client-region"
      clientIpHeader: "x-forwarded-for"
2 changes: 2 additions & 0 deletions manifests/zcloud-prod3/kustomization.yaml
@@ -2,3 +2,5 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../base
patchesStrategicMerge:
- configmap.yaml