This project develops a Retrieval-Augmented Generation (RAG) system for querying Taiwanese occupational safety and health laws. Occupational safety regulations are highly specialized and complex, which makes them difficult for frontline workers to understand, and many workers cannot afford professional legal consultation. The result is a high barrier to essential regulatory knowledge. The primary goal is to provide an intelligent assistant that automatically parses regulations and answers questions with the corresponding legal references, promoting information equity for frontline workers.
The system encompasses several key components:
- Web Crawling: Automatically extracts legal documents, including structured articles from `law.moj.gov.tw` and related PDF documents, ensuring a comprehensive and up-to-date knowledge base.
- Data Processing: Utilizes advanced text splitting techniques and state-of-the-art `SentenceTransformer` models (specifically `intfloat/multilingual-e5-large`) to convert raw legal text into semantic vector embeddings.
- Vector Database: Stores these processed text chunks along with their high-dimensional vector representations in a PostgreSQL database, enhanced with the `pgvector` extension for efficient similarity search.
- Retrieval System: Implements a semantic search mechanism that, given a user query, retrieves the most relevant legal provisions by comparing the query's vector embedding with the stored law chunk embeddings.
- API and Chatbot Integration: Provides various user-friendly interfaces, including a FastAPI-based API and integrations with popular messaging platforms like LINE Bot and Telegram Bot, enabling interactive querying and information retrieval.
- Evaluation and Learning Support: Includes components for demonstrating and evaluating the system's performance, with a potential application in preparing for occupational safety basic examinations.
To get a local copy up and running, follow these steps.
- Python 3.11 or higher: The primary programming language for the project.
- uv: A fast Python package installer and resolver. Install it via `pip install uv`.
- .env file: A `.env` file is needed in `src/laws_database/` to configure the PostgreSQL connection details. Example `.env` content:
  - `PG_HOST=localhost`
  - `PG_PORT=5432`
  - `PG_DATABASE=lawdb`
  - `PG_USER=postgres`
  - `PG_PASSWORD=postgres`
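As a rough illustration of how the `.env` values above could be turned into PostgreSQL connection settings, here is a minimal sketch. The variable names come from the example `.env`; the helper names `parse_env_text` and `build_dsn`, and the use of the example values as fallbacks, are assumptions for illustration and not necessarily how the project's scripts read their configuration.

```python
from typing import Mapping

# Keys match the example .env above; using them as fallback defaults is an assumption.
DEFAULTS = {
    "PG_HOST": "localhost",
    "PG_PORT": "5432",
    "PG_DATABASE": "lawdb",
    "PG_USER": "postgres",
    "PG_PASSWORD": "postgres",
}


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def build_dsn(env: Mapping[str, str]) -> str:
    """Build a libpq-style connection string (hypothetical helper)."""
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return (
        f"host={cfg['PG_HOST']} port={cfg['PG_PORT']} "
        f"dbname={cfg['PG_DATABASE']} user={cfg['PG_USER']} "
        f"password={cfg['PG_PASSWORD']}"
    )
```

A string of this shape is accepted by common PostgreSQL drivers; for example, `build_dsn(parse_env_text(open(".env").read()))` would produce one from a local file.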
- Clone the repository: `git clone https://github.com/dddanielliu/DSP_Project.git`, then `cd DSP_Project`.
- Navigate to the `src/laws_database` directory and set up the database: refer to `src/laws_database`.
- Install Python dependencies for all modules: use `uv` to install dependencies with `uv sync --extra rocm`.
- Crawl Legal Data: run `cd src/web_crawl`, then `python crawler.py`. This script crawls laws from `law.moj.gov.tw` based on the URLs in `links.txt`, creating the `laws` (CSV files of structured articles) and `pdfs` (downloaded PDF documents) directories within `src/web_crawl`.
- Create Vector Embeddings and Populate the Database: run `cd ../laws_database`, then `python create_vector.py`. This script processes the crawled `.csv` and `.pdf` files, generates semantic vector embeddings for the text chunks, and inserts them into the PostgreSQL `law_chunks` table.
- Run a Similarity Search Demo: run `python demo_similarity_search.py`. You will be prompted to enter queries, and the system will return the most semantically relevant law chunks from the database.
- Run Evaluation/API/Chatbot (Optional): Refer to the specific documentation or scripts within the `src/evaluation` directory for instructions on running the API or chatbot integrations (e.g., `main.py`, `apidemo.py`, `line_bot.py`, `telegram_bot.py`).
The project is organized into several key directories:
- `laws_database/`:
  - `laws/`: Contains raw crawled `.csv` files of Taiwanese occupational safety and health regulations.
  - `pdfs/`: Contains raw crawled `.pdf` documents related to the laws.
- `src/`: Main source code directory, organized by functionality.
  - `src/web_crawl/`:
    - Purpose: Scripts for web crawling legal documents from `law.moj.gov.tw`.
    - Key Files:
      - `crawler.py`: The main script responsible for parsing web pages and extracting law content into CSVs and downloading PDFs.
      - `links.txt`: A plain text file containing URLs of legal documents to be crawled.
  - `src/laws_database/`:
    - Purpose: Scripts for processing crawled data, generating vector embeddings, and managing the PostgreSQL database.
    - Key Files:
      - `create_vector.py`: Script to generate vector embeddings from the crawled data and populate the `pgvector`-enabled PostgreSQL database.
      - `demo_similarity_search.py`: A demonstration script to perform semantic similarity searches against the populated database.
      - `init.sql`: SQL script to initialize the PostgreSQL database schema, including the `law_chunks` table and `pgvector` extension.
      - `pyproject.toml`: Python project configuration and dependency management for this module.
  - `src/evaluation/`:
    - Purpose: Contains scripts for evaluating the system's performance, API integration, and chatbot implementations.
    - Key Files:
      - `main.py`, `apidemo.py`, `demo.py`: Likely main entry points for API services or system demonstrations.
      - `line_bot.py`, `telegram_bot.py`: Implementations for integrating the search functionality with LINE and Telegram chatbots.
      - `test.ipynb`, `view_result.ipynb`: Jupyter notebooks for testing and visualizing evaluation results.
      - `pyproject.toml`: Python project configuration and dependency management for this module.
  - `src/question_crawl/`:
    - Purpose: Potentially for crawling and processing legal questions or further extracting information from PDF documents.
    - Key Files:
      - `crawl.py`: Script related to crawling or extracting data for questions.
      - `loadpdf.py`: Script for loading and processing PDF content.
      - `pyproject.toml`: Python project configuration and dependency management for this module.
- `DSP Project (shared).pdf`: This document provides the project vision, outlines the problem of occupational safety incidents, and includes sample test questions related to occupational safety regulations, suggesting the project's application in educational or assessment contexts.
The core analysis method in this project revolves around semantic similarity search using vector embeddings.
- Text Preprocessing: Legal documents, sourced from the Occupational Safety and Health Section of the Tainan City Government, are first processed to break down lengthy articles and PDF content into smaller, semantically coherent "chunks." The text is split into chunks of 500 characters with a 200-character overlap, so that each embedding covers a focused piece of information.
- Embedding Generation: Each text chunk is then transformed into a 1024-dimensional dense vector embedding using the `intfloat/multilingual-e5-large` model from the `sentence_transformers` library, with separate "query" and "passage" prefixes. This model was chosen for its effectiveness in multilingual text understanding and its ability to capture the semantic meaning of the text.
- Vector Database and Indexing: The generated embeddings, along with their corresponding text chunks and metadata (law name, chapter, article number), are stored in a PostgreSQL database configured with the `pgvector` extension. An HNSW (Hierarchical Navigable Small World) index is created on the embedding column (`CREATE INDEX ON law_chunks USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);`) to enable efficient approximate k-nearest-neighbor (k-NN) search.
- Semantic Search: When a user inputs a query, it is first embedded into a vector using the same `intfloat/multilingual-e5-large` model. This query embedding is then used to perform a similarity search against the stored law chunk embeddings in the database. The `pgvector` extension ranks the stored embeddings by their L2 (Euclidean) distance to the query embedding and returns the `top_k` closest law chunks.
- Filtering: The search results are filtered to exclude entries marked as deleted (`content <> '(刪除)'`, where `(刪除)` means "(deleted)") and are restricted to actual content chunks (`chunk_index IS NOT NULL`), ensuring the relevance and quality of the retrieved information.
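The chunking and prefixing steps described above can be sketched as follows. This is a plain sliding-window splitter using the stated 500-character size and 200-character overlap; the project's actual splitter may be more sophisticated (for example, boundary-aware), so the function names and the fixed-window strategy here are assumptions. The `"query: "` and `"passage: "` prefixes are the input format the E5 model family expects.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 300 characters after the last
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def as_passage(chunk: str) -> str:
    """Prefix a stored law chunk before embedding."""
    return f"passage: {chunk}"


def as_query(question: str) -> str:
    """Prefix a user question before embedding."""
    return f"query: {question}"
```

The prefixed strings would then be passed to the embedding model's encode step. A large overlap like this trades extra storage for a lower chance that a provision is cut in half at a chunk boundary.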
This approach allows the system to understand the semantic intent of user queries, rather than just keyword matching, and retrieve legal provisions that are conceptually related, even if they don't share exact wording.
To evaluate the system, the model's accuracy was tested using occupational safety exam questions (Source: http://www.osh-soeasy.com/exam.html). The system achieved a 73% answer accuracy in this evaluation.
The results indicate that the RAG-based approach is significantly more effective than a non-RAG approach, which supports its feasibility for professional query scenarios and its ability to answer most occupational safety regulation inquiries.
This study successfully developed a regulation-oriented intelligent query system powered by RAG technology. In the future, we plan to expand the coverage of regulatory sources and further enhance model performance. We also aim to explore the use of knowledge graphs to represent regulatory relationships, allowing articles, definitions, responsibilities, and penalties to be structured more clearly. This will strengthen the interconnectedness of legal provisions and improve the interpretability and visualization of regulatory knowledge.
- 劉宸均
- 黃柏淵
- 李承祐
- 徐鍵睿
DSP Chia-Kai Liu
NCCU Chung-pei Pien
http://www.osh-soeasy.com/exam.html
https://law.moj.gov.tw/LawClass/LawAll.aspx?PCODE=N0060010
https://law.moj.gov.tw/LawClass/LawAll.aspx?PCODE=N0060027
https://law.moj.gov.tw/LawClass/LawAll.aspx?PCODE=N0060065
https://law.moj.gov.tw/LawClass/LawAll.aspx?PCODE=N0060066
https://law.moj.gov.tw/LawClass/LawAll.aspx?PCODE=N0070017
https://law.moj.gov.tw/Law/LawSearchResult.aspx?cur=Ln&ty=LAW&kw=%E5%8B%9E%E5%B7%A5%E4%BF%9D%E9%9A%AA%E6%A2%9D%E4%BE%8B


