I Tracked 10K Blog Visitors Without Cookies — Here's My Privacy-First Stack
Last month, my Jekyll blog crossed 10,000 monthly visitors. Great news, right? Except I had no idea who they were, where they came from, or what they actually read.
I could’ve installed Google Analytics in 5 minutes. But I didn’t want to:
- Ask for cookie consent (kills conversion)
- Share user data with third parties
- Slow down my site with tracking scripts
- Violate GDPR/CCPA by default
So I built my own analytics stack. Zero cookies. Zero JavaScript. Zero consent banners.
And it turns out, server logs tell you everything you need — if you know how to parse them.
What I Actually Needed to Know
I’m not running an ad-driven media empire. I needed 5 metrics:
- Page views — which posts are getting traffic?
- Referrers — where’s the traffic coming from?
- Geographic distribution — am I reaching my target audience?
- Device/browser breakdown — is my site mobile-friendly enough?
- Top landing pages — what’s bringing people in?
That’s it. No user profiling, no session tracking, no behavioral cohorts.
The Stack
Here’s what I ended up with:
# tech_stack.yml
platform: GitHub Pages (Jekyll)
web_server: GitHub's Fastly CDN
log_source: Cloudflare (proxied DNS)
parser: GoAccess (open-source log analyzer)
storage: SQLite
dashboard: Custom Python + Flask
cost: $0/month
Wait, GitHub Pages doesn’t give you server logs? Correct. That’s why I proxy through Cloudflare — more on that below.
Step 1: Get Your Server Logs
GitHub Pages doesn’t expose logs. But Cloudflare does, and it’s free.
Cloudflare Setup (5 minutes)
- Sign up for Cloudflare (free tier)
- Add your domain
- Update your DNS nameservers
- Enable “Proxied” (orange cloud) for your domain
Now Cloudflare sits between visitors and GitHub Pages. Every request goes through their CDN, and you get access to logs.
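A quick way to confirm the proxy is actually in the path (rather than requests still going straight to GitHub Pages) is to look for Cloudflare's cf-ray response header. A minimal sketch, with example.com standing in for your own domain:
# check_proxy.py -- sanity-check that the site is being served via Cloudflare
import requests

URL = "https://example.com/"  # replace with your domain

response = requests.get(URL, timeout=10)

# Cloudflare-proxied responses carry a "cf-ray" header (and usually
# "server: cloudflare"); if it's missing, the orange cloud is probably off.
if "cf-ray" in response.headers:
    print("Proxied through Cloudflare:", response.headers["cf-ray"])
else:
    print("No cf-ray header found; check that the DNS record is proxied.")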
Downloading Logs via API
Cloudflare’s free tier doesn’t include log downloads in the dashboard. But the API works:
# download_cloudflare_logs.py
import os
import requests
from datetime import datetime, timedelta, timezone

ZONE_ID = os.getenv("CLOUDFLARE_ZONE_ID")
API_TOKEN = os.getenv("CLOUDFLARE_API_TOKEN")

def fetch_logs(start_time, end_time):
    """Fetch HTTP logs from Cloudflare's Logpull API."""
    url = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/logs/received"
    headers = {
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    }
    params = {
        "start": int(start_time.timestamp()),
        "end": int(end_time.timestamp()),
        "fields": "ClientIP,ClientRequestURI,EdgeStartTimestamp,EdgeResponseStatus,ClientCountry,ClientDeviceType,ClientRequestReferer,ClientRequestUserAgent",
    }
    response = requests.get(url, headers=headers, params=params, stream=True)
    if response.status_code == 200:
        return response.text
    raise Exception(f"API error: {response.status_code} - {response.text}")

# Download yesterday's logs (timezone-aware so .timestamp() is correct
# even when the machine isn't running in UTC)
end = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
start = end - timedelta(days=1)
logs = fetch_logs(start, end)

with open(f"logs_{start.date()}.txt", "w") as f:
    f.write(logs)
I run this daily via cron. 30 days = ~300MB of logs for 10K visitors.
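For reference, the cron entry itself is a one-liner; roughly what mine looks like (the path is a placeholder, adjust to your machine):
# crontab -e
# Run daily at 00:15 (server local time), once the previous day's logs are complete
15 0 * * * cd /path/to/blog-analytics && python3 download_cloudflare_logs.py >> cron.log 2>&1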
Step 2: Parse Logs with GoAccess
GoAccess is a real-time log analyzer. Think Nginx stats, but prettier.
Installation
# macOS
brew install goaccess
# Ubuntu/Debian
sudo apt install goaccess
Config for Cloudflare Logs
GoAccess expects standard web server formats (Apache, Nginx). Cloudflare’s JSON format needs a custom config:
# ~/.goaccessrc
# GoAccess has no built-in Cloudflare preset, but it can parse JSON logs
# when log-format is itself a JSON template whose values are GoAccess
# format specifiers, mapped onto the Logpull field names.
#
# The date/time formats below assume EdgeStartTimestamp comes back as a
# unix-seconds epoch (Logpull's "timestamps" option); adjust them if you
# keep the default nanosecond timestamps or switch to RFC 3339.
date-format %s
time-format %s
log-format {"EdgeStartTimestamp":"%x","ClientIP":"%h","ClientRequestURI":"%U","EdgeResponseStatus":"%s","ClientRequestReferer":"%R","ClientRequestUserAgent":"%u"}
Generate Report
goaccess logs_2026-02-17.txt -p ~/.goaccessrc -o report.html
Boom. You get an HTML dashboard with:
- Requests per hour/day
- Top pages
- Referrers
- Geographic distribution
- Browser/OS breakdown
All from server logs. No cookies. No JavaScript.
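If you point GoAccess at a log file that's still growing, it can also keep that HTML report updating itself via its built-in WebSocket server; same command, one extra flag:
goaccess logs_2026-02-17.txt -p ~/.goaccessrc -o report.html --real-time-html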
Step 3: Automate + Archive
I wanted historical data, not just yesterday’s snapshot. So I built a SQLite pipeline:
# parse_to_sqlite.py
import json
import sqlite3
from datetime import datetime

def init_db():
    conn = sqlite3.connect("analytics.db")
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS page_views (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp DATETIME,
            path TEXT,
            country TEXT,
            referer TEXT,
            device_type TEXT,
            status INTEGER
        )
    """)
    conn.commit()
    return conn

def parse_log_line(line):
    """Parse a single Cloudflare JSON log line into a flat record."""
    data = json.loads(line)
    return {
        # EdgeStartTimestamp resolution depends on the Logpull "timestamps"
        # setting; this assumes milliseconds, so change the divisor if your
        # logs use nanoseconds (the default) or plain seconds.
        "timestamp": datetime.fromtimestamp(data["EdgeStartTimestamp"] / 1000),
        "path": data["ClientRequestURI"],
        "country": data["ClientCountry"],
        "referer": data.get("ClientRequestReferer", "direct"),
        "device_type": data.get("ClientDeviceType", "unknown"),
        "status": data["EdgeResponseStatus"],
    }

def import_logs(log_file):
    conn = init_db()
    cursor = conn.cursor()
    with open(log_file) as f:
        for line in f:
            if not line.strip():
                continue
            record = parse_log_line(line)
            cursor.execute("""
                INSERT INTO page_views (timestamp, path, country, referer, device_type, status)
                VALUES (?, ?, ?, ?, ?, ?)
            """, (record["timestamp"], record["path"], record["country"],
                  record["referer"], record["device_type"], record["status"]))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    import sys
    import_logs(sys.argv[1])
Now I can query historical data:
-- Top 10 posts this month
SELECT path, COUNT(*) as views
FROM page_views
WHERE timestamp >= date('now', 'start of month')
AND status = 200
AND path LIKE '/%.html'
GROUP BY path
ORDER BY views DESC
LIMIT 10;
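Month-over-month growth falls out of the same table; roughly the query I use (same filtering as above):
-- Page views per month
SELECT strftime('%Y-%m', timestamp) AS month,
       COUNT(*) AS views
FROM page_views
WHERE status = 200
  AND path LIKE '/%.html'
GROUP BY month
ORDER BY month;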
What I Learned (30 Days of Data)
Here’s what the logs revealed that Google Analytics never told me:
1. Bot Traffic is HUGE
Out of 10,432 requests:
- 7,218 (69%) were bots (crawlers, scrapers, monitoring tools)
- Only 3,214 (31%) were real humans
Google Analytics filters bots by default. But server logs show the full picture. Turns out, my “viral” post was just getting hammered by Googlebot.
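If you want a human-only count, the crude but effective fix is to drop requests whose User-Agent contains an obvious crawler marker before they go into SQLite. A minimal sketch (the marker list is illustrative, not exhaustive):
# bot_filter.py -- naive User-Agent based bot detection
import json

BOT_MARKERS = ("bot", "crawler", "spider", "curl", "python-requests", "headless")

def looks_like_bot(user_agent):
    """True if the User-Agent contains an obvious bot marker."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)

# Usage against a raw Cloudflare log line, before inserting into SQLite:
# data = json.loads(line)
# if looks_like_bot(data.get("ClientRequestUserAgent", "")):
#     continue  # skip bot traffic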
2. GitHub Pages Serves Static Assets Separately
I thought my homepage was my most popular page. Nope. The most-requested URLs were:
/assets/css/style.css 1,243 requests
/assets/js/main.js 982 requests
/favicon.ico 891 requests
Filter out static assets or your analytics will be useless.
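One way to do that is a small path filter applied at import time; a sketch (the prefixes and extensions below are illustrative, tune them for your site):
# asset_filter.py -- skip obvious static-asset requests before they hit SQLite
ASSET_PREFIXES = ("/assets/", "/images/")
ASSET_SUFFIXES = (".css", ".js", ".ico", ".png", ".jpg", ".svg", ".woff2", ".xml")

def is_static_asset(path):
    """True for requests that are assets rather than pages."""
    # Strip any query string before matching on the path
    path = path.split("?", 1)[0]
    return path.startswith(ASSET_PREFIXES) or path.endswith(ASSET_SUFFIXES)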
3. Referrer Data is Mostly Garbage
Expected top referrers:
- Dev.to
- Hacker News
Actual top referrers:
- (direct) — 58%
- https://www.google.com — 22%
- https://t.co (Twitter's link shortener) — 11%
- Everything else — 9%
Why? Because:
- HTTPS → HTTP referrer headers get stripped
- Shortened URLs (t.co, bit.ly) hide the original source
- Mobile apps don’t send referrers
Lesson: Use UTM parameters for campaigns. Server logs alone won’t tell you which tweet/post drove traffic.
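The upside: once you do tag links, pulling the campaign back out of the stored request URIs is a few lines of urllib. A minimal sketch (assumes the stored path still carries its query string, and only handles utm_source):
# utm.py -- extract campaign tags from stored request URIs
from urllib.parse import urlsplit, parse_qs

def utm_source(uri):
    """Return the utm_source value from a request URI, or None."""
    params = parse_qs(urlsplit(uri).query)
    return params.get("utm_source", [None])[0]

# Example
print(utm_source("/posts/analytics.html?utm_source=twitter&utm_medium=social"))  # twitter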
4. Mobile is 67% of Traffic
I optimized for desktop. Oops.
Desktop: 33%
Mobile: 67%
Tablet: 0% (seriously, nobody uses tablets)
My next task: improve mobile reading experience.
5. Peak Hours Matter
Traffic distribution by hour (UTC):
00:00-06:00 12% (Asia/Pacific waking up)
06:00-12:00 31% (Europe working hours)
12:00-18:00 38% (US East Coast + Europe overlap)
18:00-24:00 19% (US West Coast evening)
I was publishing posts at 9 AM KST (midnight UTC). Terrible timing. Now I publish at 2 PM KST (5 AM UTC) to hit European mornings.
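The hourly breakdown above is just a strftime grouping over the same page_views table; roughly:
-- Requests per hour of day (UTC), this month
SELECT strftime('%H', timestamp) AS hour,
       COUNT(*) AS views
FROM page_views
WHERE timestamp >= date('now', 'start of month')
  AND status = 200
GROUP BY hour
ORDER BY hour;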
The Dashboard (Custom Flask App)
GoAccess is great, but I wanted a custom dashboard with:
- Month-over-month growth
- Post performance comparison
- Geographic heatmap
So I built a simple Flask app:
# app.py
from flask import Flask, render_template
import sqlite3
import pandas as pd

app = Flask(__name__)

@app.route("/")
def dashboard():
    conn = sqlite3.connect("analytics.db")

    # Top posts this month
    df = pd.read_sql("""
        SELECT path, COUNT(*) as views
        FROM page_views
        WHERE timestamp >= date('now', 'start of month')
          AND status = 200
          AND path LIKE '/%.html'
        GROUP BY path
        ORDER BY views DESC
        LIMIT 10
    """, conn)

    # Traffic by country
    countries = pd.read_sql("""
        SELECT country, COUNT(*) as views
        FROM page_views
        WHERE timestamp >= date('now', 'start of month')
          AND status = 200
        GROUP BY country
        ORDER BY views DESC
        LIMIT 5
    """, conn)

    conn.close()
    return render_template("dashboard.html",
                           posts=df.to_dict("records"),
                           countries=countries.to_dict("records"))

if __name__ == "__main__":
    app.run(debug=True)
Template (templates/dashboard.html):
<!DOCTYPE html>
<html>
<head>
  <title>Blog Analytics</title>
  <style>
    body { font-family: monospace; max-width: 800px; margin: 50px auto; }
    table { width: 100%; border-collapse: collapse; }
    th, td { text-align: left; padding: 10px; border-bottom: 1px solid #ddd; }
  </style>
</head>
<body>
  <h1>📊 Blog Analytics</h1>

  <h2>Top Posts (This Month)</h2>
  <table>
    <tr><th>Path</th><th>Views</th></tr>
    {% for post in posts %}
    <tr><td>{{ post['path'] }}</td><td>{{ post['views'] }}</td></tr>
    {% endfor %}
  </table>

  <h2>Traffic by Country</h2>
  <table>
    <tr><th>Country</th><th>Views</th></tr>
    {% for row in countries %}
    <tr><td>{{ row['country'] }}</td><td>{{ row['views'] }}</td></tr>
    {% endfor %}
  </table>
</body>
</html>
Run it:
python app.py
# Visit http://localhost:5000
Cost Breakdown
| Service | Cost | Notes |
|---|---|---|
| GitHub Pages | $0 | Free tier |
| Cloudflare | $0 | Free tier (up to 100K req/day) |
| GoAccess | $0 | Open source |
| SQLite | $0 | No hosting needed |
| Python/Flask | $0 | Run locally |
| Total | $0/month | vs. Google Analytics 360: $150K/year |
Privacy Wins
What I don’t collect:
- ❌ Cookies
- ❌ User IDs
- ❌ Session tracking
- ❌ Mouse movements
- ❌ Form inputs
- ❌ Cross-site tracking
What I do collect:
- ✅ Page views (URL + timestamp)
- ✅ HTTP referrer (if present)
- ✅ Country (derived from the IP; the IP itself never goes into the database)
- ✅ Device type (from User-Agent)
GDPR-compliant by default. No consent banner needed.
When This Approach Fails
This stack isn’t for everyone. You can’t get:
- User journeys — which pages did a visitor read in sequence?
- Time on page — server logs only record requests, not engagement
- Scroll depth — how far down did users read?
- A/B testing — you need JavaScript for that
- Real-time dashboards — log parsing takes time
If you need those, use Plausible ($9/month) or Fathom ($14/month). They’re privacy-first and GDPR-compliant.
But if you just want to know what people are reading, server logs are enough.
Next Steps
I’m planning to:
- Add a geographic heatmap (using D3.js)
- Track referral traffic with UTM parameters
- Build automated weekly reports (emailed to me every Monday)
- Open-source the full analytics pipeline
Want the code? I’m packaging this as a Jekyll Analytics Starter Kit on Gumroad (launching next week). It’ll include:
- Full Python scripts
- GoAccess config
- Flask dashboard template
- Docker setup for easy deployment
Built by Jackson Studio — because developers deserve better analytics.
Got questions? Drop them in the comments. I’ll update this post with answers.
🔗 Related Resources in This Series
📖 I Built a Self-Correcting Blog Pipeline (and it saved 15 hours/week) — Automation architecture for zero-downtime deployments
📖 I Built a Jekyll Blog That Deploys in 8 Seconds — Complete CI/CD setup with GitHub Actions
📖 I Built a Blog Performance Dashboard With Python + GitHub Actions — Real-time monitoring for your blog metrics
🛠️ Ready to build your own? → Blog Analytics + Automation Starter Kit on Gumroad — Full production code, 30-day support included.
Next in the Blog Ops series: How I automated my content calendar using cron + AI (spoiler: I haven’t written a post manually in 2 weeks).