This guide provides detailed instructions for installing and configuring IPFS Datasets Python in various environments.
- System Requirements
- Basic Installation
- Development Installation
- Installing with Optional Dependencies
- Docker Installation
- IPFS Setup
- Configuration
- Troubleshooting
Minimum requirements:
- Python 3.7 or higher
- pip (Python package manager)
- 4GB RAM
- 2GB free disk space
Recommended:
- Python 3.9 or higher
- 8GB RAM
- 20GB free disk space
- IPFS daemon (version 0.12.0 or higher)
- CUDA-compatible GPU for faster vector operations (optional)
Supported operating systems:
- Linux (Ubuntu 18.04+, Debian 10+, CentOS 7+)
- macOS (10.15 Catalina or newer)
- Windows 10 (with Windows Subsystem for Linux recommended)
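The Python version and free disk space figures listed above (Python 3.7, 2GB) can be checked before installing. The following is a minimal sketch using only the standard library; the function name `check_requirements` is hypothetical and not part of IPFS Datasets Python, and RAM is left out because the standard library has no portable way to query it.

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 7)       # minimum Python version from the list above
MIN_FREE_DISK_GB = 2      # minimum free disk space from the list above


def check_requirements(path="."):
    """Return simple pass/fail checks for the current machine (illustrative helper)."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "disk_ok": free_gb >= MIN_FREE_DISK_GB,
        "os": platform.system(),  # e.g. "Linux", "Darwin", "Windows"
        "free_disk_gb": round(free_gb, 1),
    }


if __name__ == "__main__":
    for name, value in check_requirements().items():
        print(f"{name}: {value}")
```

Pass a different `path` to check the drive where you plan to keep cached data.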
The simplest way to install IPFS Datasets Python is via pip:
pip install ipfs-datasets-py
To verify the installation:
python -c "import ipfs_datasets_py; print(ipfs_datasets_py.__version__)"
For development or to use the latest features:
# Clone the repository
git clone https://github.com/your-organization/ipfs_datasets_py.git
cd ipfs_datasets_py
# Install in development mode
pip install -e .
# Install development dependencies
pip install -r requirements-dev.txt
IPFS Datasets Python offers several optional dependency groups for specific functionality:
pip install ipfs-datasets-py[vector]
This includes:
- faiss-cpu (or faiss-gpu for CUDA support)
- sentence-transformers
- numpy
- scipy
pip install ipfs-datasets-py[graphrag]
This includes:
- spacy
- networkx
- huggingface-hub
- transformers
- torch
pip install ipfs-datasets-py[webarchive]
This includes:
- archivenow
- warcio
- requests
- beautifulsoup4
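To see which of the optional packages listed above are available in the current environment, a short check like the one below may help. It is a sketch that assumes the usual import names for these packages (for example `bs4` for beautifulsoup4 and `faiss` for faiss-cpu or faiss-gpu); the `OPTIONAL_GROUPS` mapping and `missing_modules` helper are illustrative and not provided by IPFS Datasets Python.

```python
from importlib.util import find_spec

# Import names for the packages listed in each optional group above.
OPTIONAL_GROUPS = {
    "vector": ["faiss", "sentence_transformers", "numpy", "scipy"],
    "graphrag": ["spacy", "networkx", "huggingface_hub", "transformers", "torch"],
    "webarchive": ["archivenow", "warcio", "requests", "bs4"],
}


def missing_modules(groups=OPTIONAL_GROUPS):
    """Map each group name to the modules that cannot be located."""
    return {
        group: [name for name in modules if find_spec(name) is None]
        for group, modules in groups.items()
    }


if __name__ == "__main__":
    for group, missing in missing_modules().items():
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{group}: {status}")
```

`find_spec` only locates each module without importing it, so the check stays fast even when heavy packages such as torch are installed.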
pip install ipfs-datasets-py[all]
For GPU-accelerated vector operations:
# For CUDA 11.x
pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install faiss-gpu
# For CUDA 10.x
pip install torch==1.10.0+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html
pip install faiss-gpu
For a containerized setup, you can use the provided Docker image:
# Pull the Docker image
docker pull yourorga/ipfs-datasets-py:latest
# Run a container
docker run -it --name ipfs-datasets-py \
-v $(pwd)/data:/data \
-p 8080:8080 \
yourorga/ipfs-datasets-py:latest
To build your own Docker image:
git clone https://github.com/your-organization/ipfs_datasets_py.git
cd ipfs_datasets_py
docker build -t ipfs-datasets-py:custom .
docker run -it --name ipfs-datasets-py-custom \
-v $(pwd)/data:/data \
-p 8080:8080 \
ipfs-datasets-py:custom
While IPFS Datasets Python can work without a local IPFS daemon, having one enables full functionality.
# Download the latest release
wget https://dist.ipfs.io/go-ipfs/v0.12.0/go-ipfs_v0.12.0_linux-amd64.tar.gz
tar -xvzf go-ipfs_v0.12.0_linux-amd64.tar.gz
# Install
cd go-ipfs
sudo bash install.sh
# Initialize IPFS repository
ipfs init
- Download the Windows binary from IPFS Downloads
- Extract the archive
- Add the extracted directory to your PATH
- Open Command Prompt or PowerShell and run:
ipfs init
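After installing on any platform, it can be useful to confirm that the `ipfs` executable is actually on your PATH. The sketch below uses only the standard library and the `ipfs --version` command; the helper names `find_binary` and `ipfs_version` are hypothetical.

```python
import shutil
import subprocess


def find_binary(name="ipfs"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def ipfs_version(binary):
    """Run `<binary> --version` and return its trimmed standard output."""
    completed = subprocess.run(
        [binary, "--version"],
        capture_output=True,
        text=True,
        check=True,
        timeout=10,
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    path = find_binary()
    if path is None:
        print("ipfs not found on PATH")
    else:
        print(path, "->", ipfs_version(path))
```

If the script reports that `ipfs` is not found, re-check the PATH step above and open a new terminal so the change takes effect.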
To start the IPFS daemon:
ipfs daemon
For background running (Linux/macOS):
nohup ipfs daemon > ipfs.log 2>&1 &
To enable API access:
ipfs config Addresses.API /ip4/127.0.0.1/tcp/5001
ipfs config --json API.HTTPHeaders.Access-Control-Allow-Origin '["*"]'
ipfs config --json API.HTTPHeaders.Access-Control-Allow-Methods '["PUT", "GET", "POST"]'
IPFS Datasets Python uses a configuration file for customization.
- Linux/macOS: ~/.ipfs_datasets/config.toml
- Windows: %USERPROFILE%\.ipfs_datasets\config.toml
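Both locations above are the same path relative to the user's home directory, so a script can resolve it portably. The following sketch is independent of the library's own configuration loader; `default_config_path` and `load_config` are hypothetical helpers, and the parsing step assumes Python 3.11 or newer, where `tomllib` is in the standard library.

```python
from pathlib import Path

try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    tomllib = None


def default_config_path():
    """Build the config path under the home directory (normally %USERPROFILE% on Windows)."""
    return Path.home() / ".ipfs_datasets" / "config.toml"


def load_config(path=None):
    """Parse the TOML file into a dict; return {} if the file does not exist."""
    path = Path(path) if path is not None else default_config_path()
    if not path.exists():
        return {}
    if tomllib is None:
        raise RuntimeError("Parsing TOML in this sketch needs Python 3.11+ (tomllib)")
    with path.open("rb") as handle:
        return tomllib.load(handle)


if __name__ == "__main__":
    print("Config file:", default_config_path())
    print("Sections found:", sorted(load_config()))
```

An empty list of sections means that no configuration file exists yet at the default location.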
Create a configuration file with your preferred settings:
mkdir -p ~/.ipfs_datasets
cat > ~/.ipfs_datasets/config.toml << EOF
[ipfs]
api_endpoint = "/ip4/127.0.0.1/tcp/5001"
gateway_url = "http://localhost:8080/ipfs/"
pin = true
[storage]
cache_dir = "~/.ipfs_datasets/cache"
temp_dir = "/tmp/ipfs_datasets"
max_cache_size_gb = 10
[vector_index]
default_dimension = 768
default_metric = "cosine"
index_location = "~/.ipfs_datasets/indexes"
use_memory_mapping = true
[embedding_models]
default = "sentence-transformers/all-MiniLM-L6-v2"
[security]
encryption_enabled = true
require_authentication = false
EOF
You can also configure settings programmatically:
from ipfs_datasets_py.config import set_config_value, save_config
# Set individual values
set_config_value("vector_index.default_dimension", 1024)
set_config_value("embedding_models.default", "sentence-transformers/all-mpnet-base-v2")
# Save configuration
save_config()
Issue: ImportError: No module named 'xxx'
Solution:
pip install ipfs-datasets-py[all]
# Or for specific dependency
pip install xxx
Issue: ConnectionRefusedError: [Errno 111] Connection refused
Solutions:
- Ensure the IPFS daemon is running: ipfs daemon
- Check the API endpoint configuration
- Verify firewall settings
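The checks above can be partly automated. The sketch below first tests whether anything accepts TCP connections on the API address used in this guide (127.0.0.1:5001), then asks the daemon for its version through the IPFS HTTP RPC endpoint `/api/v0/version`, which expects a POST request. The helper names are illustrative, and the address is an assumption you should adjust if your `Addresses.API` setting differs.

```python
import json
import socket
import urllib.request


def is_port_open(host="127.0.0.1", port=5001, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def daemon_version(api_base="http://127.0.0.1:5001", timeout=5.0):
    """Query the daemon's HTTP RPC API for its version string."""
    request = urllib.request.Request(f"{api_base}/api/v0/version", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["Version"]


if __name__ == "__main__":
    if not is_port_open():
        print("Nothing is listening on 127.0.0.1:5001; is `ipfs daemon` running?")
    else:
        try:
            print("IPFS API reachable, version", daemon_version())
        except (OSError, ValueError, KeyError) as error:
            print("Port is open but the API call failed:", error)
```

A closed port usually points to a stopped daemon or a firewall rule, while an open port with a failing API call suggests that another service is using that port or that the endpoint is configured differently.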
Issue: ImportError: libcudart.so.xx.x: cannot open shared object file
Solution:
# Install CUDA toolkit
# Then reinstall with correct CUDA version
pip uninstall torch faiss-gpu
pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install faiss-gpu
If you encounter issues not covered here:
- Check the GitHub Issues for similar problems
- Read the FAQ for common questions
- Join the Community Discussion
- File a new issue with detailed information about your problem
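When filing an issue, including basic environment details tends to speed up diagnosis. The following sketch gathers a few of them with the standard library (`importlib.metadata` requires Python 3.8 or newer); `collect_report` is a hypothetical helper, and the package list is only a suggestion based on the packages mentioned in this guide.

```python
import platform
from importlib import metadata  # available in Python 3.8+


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def collect_report(dist_names=("ipfs-datasets-py", "torch", "faiss-cpu", "faiss-gpu")):
    """Gather basic environment facts that are useful in a bug report."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in dist_names:
        report[name] = installed_version(name) or "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

Paste the printed lines into the issue together with the full error message and the command that triggered it.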