A CLI tool for pulling Singapore Parliament Hansard data into structured CSV/JSON files — built for those who want to work with parliamentary speech data without wading through the Hansard website or setting up cloud infrastructure.
Singapore's Parliament publishes Hansard transcripts online, but they're not easy to work with at scale. If you want to know what a minister said about housing across 20 sittings, or compare how much airtime different MPs get on a topic, you're stuck clicking through pages one by one.
parleh-mate/singapore-parliament-speeches solved the scraping problem but built it as a cloud pipeline — BigQuery, Google Drive, Docker, Cloud Functions. This fork strips all of that out and gives you a single command that dumps the data to your laptop.
Run the scraper on a sitting date and you get four CSV files:
| File | What's in it |
|---|---|
speeches.csv |
Every speech paragraph — who said it, party, gender, what they said, word/syllable/sentence counts, chairing flag, appointment flag, noise flag |
topics.csv |
What was discussed — oral answers (OA), bills (BI), motions (MO), etc. |
attendance.csv |
Which MPs showed up and which didn't, with party and gender. Corrected using speech data. |
sittings.csv |
Sitting metadata — parliament number, session, start/end time, duration |
The scraper cleans up the raw Hansard HTML: it identifies speakers, strips procedural boilerplate, standardises MP names (so "The Minister for Health (Mr Ong Ye Kung)" becomes "Ong Ye Kung"), and merges consecutive paragraphs from the same speaker into single entries.
Each speech also gets basic text metrics — word count, character count, sentence count, syllable count — which you can use for readability analysis or just to gauge how much someone said.
Party and gender data comes from seeds/member.csv, which tracks each MP per parliament term. Coverage is comprehensive for the 12th–15th Parliaments (2011 onwards) and selective for earlier parliaments (opposition MPs, key PAP figures, NMPs).
| Column | Example |
|---|---|
date |
2024-05-07 |
speech_id |
2024-05-07-T-001-S-00002 |
topic_id |
2024-05-07-T-001 |
member_name_original |
The Minister for Health (Mr Ong Ye Kung) |
member_name |
Ong Ye Kung |
text |
Mr Speaker, Sir, my response today will also address... |
num_words |
186 |
num_sentences |
9 |
party |
PAP |
gender |
M |
is_chairing |
False |
is_noise |
False |
is_appointment |
True |
pip install -r requirements.txt# Single sitting date
python -m sgparl --date 2024-05-07
# Multiple dates
python -m sgparl --date 2024-05-07 2024-02-05
# Date range (uses seeds/dates.csv for known sitting dates)
python -m sgparl --from 2024-01-01 --to 2024-03-31
# Save to a specific folder (default: data/)
python -m sgparl --date 2024-05-07 --output my-data/
# Output as JSON instead of CSV, or both
python -m sgparl --date 2024-05-07 --format json
python -m sgparl --date 2024-05-07 --format both
# Update the list of known sitting dates (needed for --from/--to)
python -m sgparl --update-seedsWhen scraping multiple dates, all results are combined into a single set of files.
After improving the name parser (sgparl/utils.py:get_mp_name) or the role-title resolver (sgparl/enrich.py), re-apply to existing data:
# Reparse the main dataset
python -m sgparl.reparse
# Reparse a specific file
python -m sgparl.reparse data/all/speeches_post2012.csv
# Reparse all speech CSVs and rebuild speeches_all.csv
python -m sgparl.reparse --allThis re-runs get_mp_name() on every row's member_name_original column (which is preserved from the original scrape) and applies role-title resolution. No API calls.
The main CLI (python -m sgparl) only works for post-2012 sittings. For pre-2012 data (1955-2011), use:
python -m sgparl.pre2012This uses the same getHansardReport endpoint, but parses the htmlFullContent field (raw HTML of the full sitting report) instead of the structured JSON that post-2012 returns. Uses HTML comments (<!-- MP_NAME:Name -->) as the primary speaker source, with bold tags as fallback. Outputs to data/all/pre2012_v2/. Takes ~5 hours for all 1,333 dates. Checkpoints every 20 dates.
Why a separate scraper? The Parliament API uses the same endpoint for all dates, but the response format changed in September 2012. Post-2012 returns structured JSON (metadata, attendanceList, takesSectionVOList). Pre-2012 returns a single htmlFullContent field containing the entire sitting as concatenated HTML blocks. The parsers are different because the data format is different.
--from/--to uses seeds/dates.csv (a list of known sitting dates going back to 1955) to figure out which days Parliament actually sat. To bring this list up to date:
python -m sgparl --update-seedsThis scans every weekday from the last known date to today, checks the API, and appends any new sitting dates it finds.
Both tracks call the same API endpoint (getHansardReport/?sittingDate=) but parse different response formats:
seeds/dates.csv
(1,747 sitting dates, 1955-2026)
|
+--------------------+--------------------+
| |
Post-2012 dates Pre-2012 dates
(Version 2, 414 dates) (Version 1, 1,333 dates)
| |
python -m sgparl python -m sgparl.pre2012
| |
getHansardReport returns: getHansardReport returns:
- metadata (JSON) - htmlFullContent (raw HTML)
- attendanceList (JSON array) Contains ALL topics as
- takesSectionVOList (JSON array) concatenated <html> blocks,
with per-topic content HTML plus PRESENT/ABSENT attendance
| |
sgparl/parse.py sgparl/pre2012.py
(structured JSON parser) (HTML parser)
| |
data/all/ data/all/pre2012_v2/
speeches_post2012.csv speeches.csv
topics_post2012.csv topics.csv
attendance_post2012.csv attendance.csv
sittings_post2012.csv sittings.csv
| |
+--------------------+--------------------+
|
python -m sgparl.reparse --all
(shared post-processing)
|
+--------+--------+
| |
get_mp_name() resolve_role_titles()
(sgparl/utils.py) (sgparl/enrich.py)
| |
+--------+--------+
|
speeches_all.csv (merged, cleaned)
topics_all.csv
After scraping, both tracks go through the same cleaning:
1. Name parsing -> sgparl/utils.py:get_mp_name()
Handles 20+ prefixes (Mr/Dr/BG/Tun/Maj/etc.)
Strips titles, constituencies, initials
2. Role resolution -> sgparl/enrich.py:resolve_role_titles()
"The Prime Minister" -> Lee Kuan Yew (by date)
Flags is_appointment, is_chairing, is_noise
3. Multi-speaker -> Split blocks with 2+ inline speakers
splitting Stitch orphan fragments to previous speaker
4. Deduplication -> Remove exact (date + speaker + text) duplicates
5. Party enrichment -> Match against seeds/member.csv
Verified against parliament.gov.sg official list
6. Section type -> Normalise pre-2012 codes (OR->OA, WR->WA, etc.)
normalisation via section_type_normalised column
Then merge and re-enrich:
python -m sgparl.reparse --all
-> re-applies get_mp_name + resolve_role_titles
-> outputs speeches_all.csv, topics_all.csv
| File | Purpose |
|---|---|
sgparl/api.py |
API client — fetches Hansard reports from sprs.parl.gov.sg |
sgparl/parse.py |
Parses API responses into DataFrames (sittings, attendance, topics, speeches) |
sgparl/utils.py |
Name cleaning (get_mp_name), syllable counting, text metrics |
sgparl/enrich.py |
Post-processing: resolves role titles to names by date, flags appointment-capacity speeches |
sgparl/cli.py |
CLI entry point, member enrichment, attendance correction |
sgparl/reparse.py |
Re-applies name parsing + enrichment to existing CSVs without rescraping |
sgparl/pre2012.py |
Pre-2012 scraper using getHansardReport/htmlFullContent (full coverage, includes attendance) |
seeds/dates.csv |
Known sitting dates (1955-2026) |
seeds/member.csv |
MP party/gender lookup by parliament term |
The get_mp_name() function in sgparl/utils.py handles:
- Standard prefixes: Mr, Mrs, Ms, Mdm, Miss, Dr, Prof
- Military/honorary: BG, RAdm, Tun, Dato, Tuan, Er, Ir
- Gendered: Madam, Encik, Inche, Cik
- Academic: Assoc. Prof.
- Initialled names: "Mr D. S. Marshall (Cairnhill)" → "D. S. Marshall"
- Ministerial titles: "The Minister for Health (Mr Ong Ye Kung)" → "Ong Ye Kung"
- Constituency stripping: "Mr Pritam Singh (Aljunied GRC)" → "Pritam Singh"
- Two-speaker merges: "Mr Wee Toon Boon and Dr Toh Chin Chye" → "Wee Toon Boon"
Role-title resolution (sgparl/enrich.py) additionally maps:
- "The Prime Minister" → Lee Kuan Yew / Goh Chok Tong / Lee Hsien Loong / Lawrence Wong (by date)
- "The Chief Minister" → David Marshall / Lim Yew Hock (pre-independence, by date)
- All ministerial/appointment speeches flagged with
is_appointment = True
- All dates use
YYYY-MM-DDformat. --from/--todate ranges rely onseeds/dates.csv. Run--update-seedsto bring it up to date.- Pre-2012 HTML uses a different format (inline
<b>tags rather than<p><strong>blocks). After multi-speaker splitting, orphan stitching, and HTML refetch recovery, only 0.03% of total words remain unrecoverable — mostly Ministry Addenda, Budget section headers, and short fragments. - Attendance is corrected using speech data. The Hansard attendance list is a roll call at the start of the sitting. Ministers who arrive late are marked absent even if they speak later — 22% of all "absent" records are contradicted by speech data. The scraper overrides
is_presenttoTruefor any MP who spoke that day. - Deputy Speaker chairing speeches are flagged. When an MP chairs proceedings as Deputy Speaker, Hansard records their procedural utterances under their personal name with a
[Deputy Speaker (Mr X) in the Chair]tag. These are flagged withis_chairing = True. Without this, Christopher de Souza's word count is inflated by 42K words (15%), Charles Chong's by 53K (76%). - "The Chairman" speeches are flagged but not name-resolved. During Committee of Supply debates, "The Chairman" is whoever is chairing — typically the Deputy Speaker or an appointed MP, NOT the Speaker of Parliament. These are flagged
is_chairing = Trueandis_appointment = Truebutmember_nameis left empty because the role rotates and we lack per-sitting chair data. - Only leading colons are stripped from speech text. The colon after a speaker's name (the separator) is removed, but colons within the text content (times like "1:20 pm", ratios) are preserved.
- Anonymous interjections are flagged as noise. "An hon. Member", "Some hon. Members" = unidentified shouts from the chamber. "ADJOURNMENT", "(Motion)", "ANNUAL BUDGET STATEMENT" = procedural markers. These are flagged with
is_noise = True. Filter withspeeches[~speeches['is_noise']]. - Multi-speaker blocks are split and stitched. The pre-2012 HTML parser often lumped entire debates into a single row. These are detected by finding 2+ inline speaker patterns (e.g. "Mr Foo (West Coast):") in the text, then split into individual speeches. Split speech IDs use a
-SPLIT-{N}suffix. Orphan text fragments produced by imperfect splitting (where the regex cut at a quoted reference like "The Prime Minister said..." rather than a new speaker) are stitched back to the previous speaker — 98,311 fragments (13M words) recovered this way. - Pre-2012 data uses full HTML content. The
pre2012scraper bypasses the API's broken search pagination by using thehtmlFullContentendpoint, which returns the complete sitting report as raw HTML. This provides full coverage including all topics per sitting.
Built on the parsing logic from parleh-mate/singapore-parliament-speeches by Jeremy Chia (@jeremychia), Jon Foong (@jonfoong), and Royce Hoe (@roycehoe). See the upstream repo for the full cloud pipeline, dbt models, and research references.