Adham's Portfolio

End-to-End Property Data Solutions

A production-style Python pipeline for scraping, normalizing, and serving Properti123 listing data

View Documentation

Project Introduction

This project delivers an end-to-end automation flow for property listing data from Properti123. Running on an Ubuntu VPS, Python scripts collect listing URLs, scrape detailed property attributes, normalize address fields, and store everything in PostgreSQL tables. The same dataset is then exposed through Flask API endpoints for downstream analysis or visualization.

The implementation focuses on practical reliability: HTTP retry with exponential backoff, status flags in property_link.available_data, batch-style processing scripts, fuzzy address matching with TheFuzz, and notification integration via messaging bots. The codebase demonstrates applied backend/data engineering skills using Python, SQL, and Linux scheduling.

Below is the schematics graph for reference:

Key Features

  • 1 Automated Web Scraping
  • 2 URL discovery and deduplication into property_link
  • 3 Detailed listing extraction into property_detil
  • 4 Address normalization into property_address
  • 5 Telegram/LINE notification for filtered listings
  • 6 Flask API endpoints for data consumption

Pipeline Stages

The pipeline in this repository follows concrete Python modules and SQL operations from link crawling until API-ready data.

1

Data Collection

Retrieve listing URLs from https://properti123.com using PropertyLinkUpdater.py.
  • Requests + BeautifulSoup crawl paginated listing pages
  • Duplicate check against public.property_link
  • Insert new links with default available_data=0
  • Stop early when one page is already fully known
2

Data Validation & Cleaning

Use status codes and parsing fallbacks to keep failed records traceable and retryable.

  • URL-level status tracking in property_link.available_data
  • Retry with backoff for transient request failures
  • Missing-field fallback to Not Found!
  • Stored procedure call to normalize page date fields
3

Data Transformation

Convert raw listing attributes and non-standard addresses into structured records.

  • Extract property metadata from page JSON/script blocks
  • Parse price fields and listing attributes
  • Fuzzy match province/kabupaten/kecamatan from ref_address
  • Save matching confidence score to property_address.score
4

Data Storage

Persist each pipeline stage in dedicated PostgreSQL tables for traceability.

  • property_link for discovered URLs and processing status
  • property_detil for detailed listing attributes
  • property_address for normalized location fields
  • ref_address as reference dictionary for matching
5

Analytics & Reporting

Serve filtered property outputs to notification channels and API consumers.

  • Filter listings by location/type/price criteria
  • Send summary notifications to Telegram/LINE
  • Expose API endpoints from Flask for 5,000-row snapshots
  • Support downstream chart/report generation

Database Structure

The schema mirrors the production PostgreSQL database: property_link stores every discovered URL, property_detil stores the full scraped listing keyed by URL, property_address stores normalized geography keyed by propertyid, and ref_address provides the master list of Indonesian admin areas used for fuzzy matching.

property_link url (PK) status available_data created_at property_detil url (PK, FK property_link.url) propertyid (FK propertyid) price_real address title, currency, contact_person, ... property_address propertyid (PK, FK propertyid) address_nonstandard province kabupatenkota kecamatan score ref_address provinsi kabupatenkota kecamatan desakel

Database schema showing key tables and their relationships

property_link

Tracks every listing URL discovered from Properti123

Column Type Description
url TEXT Primary key, unique listing URL
status TEXT Listing type (e.g., JUAL)
available_data INTEGER Scraping status / completeness flag for this URL
created_at TIMESTAMP When the URL was first discovered

property_detil

Stores the full scraped attributes for each property

Column Type Description
url TEXT Primary key, foreign key to property_link.url (one-to-one per URL)
propertyid TEXT Property identifier on Properti123, foreign key from property_address.propertyid
propertylistingid INTEGER Internal listing ID from Properti123
propertytipe TEXT Property type (e.g., RUMAH DIJUAL)
title TEXT Listing title as shown on Properti123
currency TEXT Currency code for the listing price
price_real BIGINT Parsed numeric price in base currency
price_short TEXT Human-readable price label (e.g., "1,2 M")
price_monthly TEXT Monthly price (for rental listings), if available
price_psm TEXT Price per square meter, if available
address TEXT Raw address string from the listing
contact_person TEXT Contact person name on the listing
phone_number TEXT Contact phone number
agency TEXT Agency or broker name, if present
kondisi_bangunan TEXT Building condition (e.g., "Baru", "Bekas")
luas_bangunan TEXT Building area text as scraped
luas_tanah TEXT Land area text as scraped
jumlah_lantai TEXT Number of floors
floor_loc TEXT Floor location (for apartments), if applicable
certificate TEXT Certificate type (e.g., SHM, HGB)
interior TEXT Interior condition (e.g., Furnished, Semi Furnished)
main_bedroom TEXT Main bedroom count
bathroom TEXT Bathroom count
secondary_bedroom TEXT Secondary bedroom count
saluran_telepon TEXT Phone line availability
listrik TEXT Electrical power specification
air_pam TEXT PAM (municipal water) availability
air_tanah TEXT Groundwater / well availability
jalur_mobil TEXT Car access (e.g., 1 mobil / 2 mobil)
garasi TEXT Garage availability / capacity
carport TEXT Carport availability / capacity
direction TEXT Building facing direction
about TEXT Free-text description ("about this property" section)
domain TEXT Source domain (e.g., properti123.com)
page_created_at TEXT Listing page created-at text as scraped
created_at TIMESTAMP When the record was written to the database
page_date TIMESTAMP Normalized page date parsed from the text
page_date_accuration TEXT How accurate the parsed page date is (e.g., "day", "month")

property_address

Normalized geographic components derived from raw addresses

Column Type Description
propertyid TEXT Primary/unique key, foreign key to property_detil.propertyid
address_nonstandard TEXT Original, non-standardized address string
province TEXT Matched province name
kabupatenkota TEXT Matched city / regency
kecamatan TEXT Matched district
score REAL Fuzzy matching confidence score (0-100)

ref_address

Reference master data for Indonesian administrative areas

Column Type Description
negara TEXT Country (e.g., INDONESIA)
provinsi TEXT Province name
kabupatenkota TEXT City / regency name
kecamatan TEXT District name
desakel TEXT Village / urban village name
addressdetil TEXT Additional address detail or concatenated components

Key Relationships

  • property_link to property_detil

    Each discovered URL can have at most one detailed record once it has been scraped

    One-to-One
    property_detil.url (PK, FK) property_link.url (PK)
  • property_detil to property_address

    Each scraped property can have at most one normalized address row

    One-to-One
    property_address.propertyid (UNIQUE, FK) property_detil.propertyid
  • property_address to ref_address

    Normalized address components map back to the reference master data

    Many-to-One
    (province, kabupatenkota, kecamatan, desakel) ref_address rows
  • End-to-end flow

    URLs flow from discovery, to detailed scrape, to standardized geography, ready for analytics and alerting

    One-to-Many
    property_link property_detil property_address ref_address

Database Integrity

Operational consistency in this project is maintained through SQL rules and processing-state updates:

  • 1 URL deduplication with conflict-safe insert on property_link.url
  • 2 Per-URL processing status in available_data for retry and backfill control
  • 3 Address insert protection with conflict handling on property_address.propertyid
  • 4 Controlled extraction fallback for missing page fields
  • 5 Stored procedure call for page-date normalization after detail inserts

Technology Stack

This section reflects the tools and libraries currently used in the implementation.

PostgreSQL

Primary relational database for links, details, normalized addresses, and reference data

property_link

Queue-like table that tracks discovered URLs and processing status codes

property_detil

Main table storing detailed listing attributes from scraper output

property_address + ref_address

Normalized address output and master reference for fuzzy matching

Python

Core language for scraping, transformation, matching, and orchestration scripts

Flask

Project web integration layer in this repository; scraper core itself is script-driven

requests + BeautifulSoup

HTTP retrieval and HTML parsing for Properti123 listing pages

psycopg2

Direct SQL execution for inserts, updates, fetches, and stored procedure calls

HTML/CSS/JavaScript

Template-driven portfolio pages and static interaction components

Jinja Templates

Server-rendered pages integrated into the Flask application

Font Awesome

Icon set used for visual communication in sections and cards

Custom CSS + JS

Project-specific layout, tabs, and accordion behavior

Ubuntu VPS

Runtime host for scheduled scraper scripts and Flask API service

crontab

Scheduler for recurring ingestion and update jobs

Public Domain Hosting

Deployed web/API access through husein-workpace.my.id

Repository Variants

Parallel script versions in main, github, and project reincarnation folders

TheFuzz

Fuzzy matching for province, kabupaten/kota, and kecamatan normalization

SQL Joins and Filters

Selection logic for alerts based on price, property type, and target locations

Python Data Parsing

JSON/script block extraction and field transformation from listing pages

Message-ready Summaries

Formatted text payloads for notification channels from filtered query results

Retry + Backoff

Network resilience for scraper requests to reduce transient failures

Status Codes

Track success/failure states in available_data for reprocessing

Batch Backfill Scripts

run_scraper variants support partial offset processing and recovery

Messaging Alerts

Automated outbound notifications to Telegram/LINE channels

Challenges & Solutions

The implementation prioritizes practical scraper reliability, database consistency, and repeatable processing.

Challenges Overview

Building an end-to-end property data solution presented several significant technical and operational challenges. Below are the key challenges we faced and how we addressed them.

Data Quality and Consistency

Solved Property data from different sources often had inconsistent formats, missing values, and...

Challenge:

Listing pages are not perfectly uniform, and some attributes are absent or incomplete.

Solution:

Implemented defensive extraction with field-level fallback values and URL status updates to mark parse failures for later retry.

Impact:

Pipeline remains stable even when some listings are malformed, without halting full batch execution.

Processing at Scale

Solved The system needed to process millions of property records daily while maintaining performance...

Challenge:

Large URL volumes can make one-pass scraping slow and difficult to recover after interruption.

Solution:

Added batch scripts with LIMIT/OFFSET processing and bulk insert/update patterns to process data in controlled chunks.

Impact:

Backfill and rerun operations are easier to resume and monitor from intermediate offsets.

Real-time Data Requirements

Solved Certain use cases required near real-time data updates, which conflicted with our batch...

Challenge:

Frequent network instability can cause intermittent scraping failures during long-running jobs.

Solution:

Implemented request retry with exponential backoff and periodic cooldown intervals after each processing block.

Impact:

Improved resilience against transient failures without adding distributed infrastructure complexity.

Data Privacy and Compliance

Solved Property data often contains sensitive information subject to various regulations...

Challenge:

Address text from listings is non-standard and difficult to aggregate for location-based analysis.

Solution:

Built a fuzzy matching step against ref_address to produce standardized province, kabupaten/kota, and kecamatan values.

Impact:

Location-level grouping and filtering becomes consistent and usable for downstream queries.

Integration with Legacy Systems

In Progress Needed to integrate with several legacy systems that lacked modern APIs...

Challenge:

Scraped data needs to reach end users quickly with practical filtering rules.

Solution:

Added notification modules that select candidate listings from SQL and push summaries to Telegram/LINE.

Impact:

Users receive curated listing updates without manually querying the database.

Complex Analytical Requirements

In Progress Users needed to perform complex spatial and temporal analyses...

Challenge:

Downstream tools require stable access to processed records, not raw scraper internals.

Solution:

Exposed Flask API endpoints for property detail and address datasets with controlled query limits.

Impact:

Front-end and external consumers can retrieve ready-to-use JSON from a single service layer.

Automation and Scheduling

Automation is organized around script-based jobs that can be scheduled with crontab on the VPS.

Link Discovery Job

Scheduled via crontab

Runs PropertyLinkUpdater.fetch_new_links() to crawl listing pages and insert unseen URLs.

JOB OUTPUT

Source properti123.com/properti-jual
Insert Target public.property_link
Status Flag available_data = 0

HTTP Resilience

Applied each request

All fetches use retry and exponential backoff before marking retrieval failure.

REQUEST BEHAVIOR

Retries 3 attempts
Backoff 0.5 * 2^attempt
Timeout 15 seconds

Incremental Stop Logic

During link crawl

Crawler stops when a page is fully known, reducing unnecessary requests.

CONTROL LOGIC

Page Strategy Ascending page loop
Duplicate Check IN query against existing URLs
Early Exit Break when all links already exist

Detail Scraping Job

Scheduled via crontab

Runs PropertyDetilUpdater.update_property_details() to scrape and insert full listing attributes.

JOB OUTPUT

Insert Target public.property_detil
Status Update available_data = 1 (success)
Date Normalization CALL public.ad_update_pagedate()

Address Matching Job

Scheduled via crontab

Runs AddressMatching.py to normalize non-standard addresses using fuzzy matching.

MATCHING OUTPUT

Source property_detil.address
Reference public.ref_address
Output public.property_address + score

Orchestrated Pipeline Run

Triggered script flow

PropertyCompiler.py chains link update, detail scrape, notification, and address processing.

EXECUTION FLOW

Step 1 fetch_new_links(5)
Step 2 update_property_details(new_links)
Step 3+ notification_message + address run

Backfill Script Variants

Manual or scheduled

run_scraper scripts provide offset-based recovery and batch-size tuning for failed ranges.

MAINTENANCE CONTROLS

Batch Insert execute_values for property_detil
Batch Status Update executemany to property_link
Offset Control Resume from selected range

Status-Based Retry Queue

Continuous process pattern

Failed or partial records can be selected by status and retried in dedicated runs.

QUEUE PATTERN

Source Field property_link.available_data
Retry Targets status 0, 7, or custom set
Loop Behavior Process until no rows remain

Connection Recovery

On psycopg2 error

Address processor catches DB errors, waits, reconnects, and resumes processing.

RECOVERY BEHAVIOR

Error Type psycopg2 database exceptions
Cooldown 60-second wait before reconnect
Action Reconnect and continue loop

Pipeline Failure Alerts

On job failure

Script logs and status updates identify failed URLs and failed parsing batches.

ALERT SIGNALS

Failure Marker available_data status transition
Console Trace Per-URL processing logs
Recovery Path Retry with backfill scripts

Property Push Notification

After detail update

Send filtered listing summaries to Telegram/LINE based on SQL criteria.

MESSAGE FILTERS

Price Rule price_real < 700000000
Type Rule selected sale categories
Location Rule Jakarta/Bogor/Depok/Tangerang/Bekasi keywords

API Consumption Layer

On demand

Processed records can be consumed from Flask JSON endpoints in fixed-size snapshots.

ENDPOINT OUTPUT

Endpoint A /api/property_detil
Endpoint B /api/property_address
Query Limit LIMIT 5000 per endpoint