A robotic system for visually localizing QWERTY keyboard keys and pressing them with an SO-101 robotic arm, combining computer vision, VLM/OCR-based localization, inverse kinematics, spline trajectory generation, and gravity-compensated control.
Example of a task:
small_showcase.mp4
SO-101 Keyboard Typing Robot is an applied robotics project designed to automate physical keyboard interaction using a serial robotic manipulator. The system captures images from a camera, localizes target keys, estimates their 3D position in the robot world frame, and generates smooth trajectories to reach and press each selected key.
The project provides an end-to-end pipeline for:
- initial key localization using cloud vision-language models, such as OpenAI or Gemini, or local OCR;
- visual tracking of keyboard targets while the robot is moving;
- 3D key position estimation through camera-ray and keyboard-plane intersection;
- forward and inverse kinematics based on the SO-101 URDF model;
- cubic-spline trajectory generation for `home -> hover -> press -> home` motions;
- outer-loop PID control with gravity feed-forward compensation;
- execution of predefined tasks, individual words, or text files containing multiple typing runs.
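
The camera-ray and keyboard-plane intersection step can be illustrated with a short sketch. It is not the implementation in `src/tracker.py`; it assumes an undistorted pinhole camera, a known camera pose in the world frame, and a keyboard plane at constant world height `z = plane_z`. The function name and all parameters are hypothetical.

```python
def pixel_to_world_on_plane(u, v, fx, fy, cx, cy, R_wc, t_wc, plane_z):
    """Intersect the viewing ray of pixel (u, v) with the plane z = plane_z.

    R_wc (3x3 nested lists) and t_wc (3 values) are assumed to map camera
    coordinates into the world frame: p_world = R_wc @ p_cam + t_wc.
    """
    # Ray direction in the camera frame (pinhole model, no lens distortion).
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the ray direction into the world frame.
    d_world = [sum(R_wc[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    if abs(d_world[2]) < 1e-9:
        raise ValueError("ray is parallel to the keyboard plane")
    # Solve t_wc.z + s * d_world.z = plane_z for the ray parameter s.
    s = (plane_z - t_wc[2]) / d_world[2]
    if s <= 0:
        raise ValueError("keyboard plane is behind the camera")
    return [t_wc[i] + s * d_world[i] for i in range(3)]
```

For a camera 0.30 m above the world origin and looking straight down, a pixel 60 px to the right of the principal point (with `fx = 600`) lands roughly 28 mm from the optical axis on a plane at 0.02 m.
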
The repository is intended as a publishable research prototype for robot learning, manipulator control, visual servoing, and human-interface automation experiments on real hardware.
| Component | Description | Notes |
|---|---|---|
| Robotic arm | SO-101 follower arm | Controlled through LeRobot/Feetech |
| Actuators | Feetech STS3215 or compatible servos | Position-controlled motors |
| RGB camera | USB/OpenCV camera | Camera index configured in cfg/main_pipeline.yaml |
| Keyboard | Physical QWERTY keyboard | Modeled as a planar surface in world coordinates |
| Workstation | Linux/WSL recommended | Micromamba environment: rl-project |
| Calibration | Robot calibration, camera intrinsics, and camera-to-robot transform | Must be generated for the specific hardware setup |
Before running on the real robot, verify the serial port, clear the workspace, calibrate the connected robot, set a safe home pose for the local keyboard placement, and confirm that the camera intrinsics plus hand-eye/nonlinear refinement belong to that exact camera/gripper setup.
| Module | File/Directory | Responsibility |
|---|---|---|
| Main pipeline | `main_pipeline.py` | Task parsing, robot initialization, tracking, and key pressing |
| Configuration | `cfg/main_pipeline.yaml` | Camera, robot, VLM, trajectory, and clustering parameters |
| Kinematics | `src/kinematics.py` | LeRobot FK/IK and Pinocchio dynamics |
| Control | `src/controller.py` | SO-101 hardware interface and gravity-compensated PID |
| Trajectory generation | `src/traj_generation.py` | Cubic splines, hover/press motions, and home return |
| 3D tracking | `src/tracker.py` | Pixel localization, visual tracking, and world-frame estimation |
| VLM/OCR localization | `src/gemini_keyboard_localizer.py`, `src/ocr_keyboard_localizer.py` | OpenAI/Gemini/EasyOCR localization backends |
| Target clustering | `src/keyboard_cluster.py` | Nearby-key tracking, freezing, and retracking logic |
| Calibration | `camera_calib/` | Camera and hand-eye calibration scripts/results |

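
Gravity-compensated PID, listed for `src/controller.py`, follows a common pattern: a feedback term on the joint error plus a model-based term that cancels the gravity load. The sketch below shows that pattern for a single joint. It is an illustration under stated assumptions: the class, the gains, and the point-mass gravity model are hypothetical, whereas the project obtains dynamics from Pinocchio and drives position-controlled servos, so the mapping from controller output to servo command is not represented here.

```python
import math


class GravityCompensatedPID:
    """PID on the joint error plus a model-based gravity feed-forward term."""

    def __init__(self, kp, ki, kd, gravity_fn, integral_limit=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.gravity_fn = gravity_fn          # q -> holding torque at angle q
        self.integral_limit = integral_limit  # clamp used as simple anti-windup
        self.integral = 0.0
        self.prev_error = None

    def update(self, q_ref, q_meas, dt):
        error = q_ref - q_meas
        # Accumulate the integral and clamp it to limit windup.
        self.integral += error * dt
        self.integral = max(-self.integral_limit,
                            min(self.integral_limit, self.integral))
        # No derivative on the first call, when there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        feedback = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Feed-forward: torque needed to hold the joint against gravity.
        return feedback + self.gravity_fn(q_meas)


def single_link_gravity(q, mass=0.2, length=0.1, g=9.81):
    """Holding torque for a point mass on a massless link; q = 0 is horizontal."""
    return mass * g * length * math.cos(q)
```

With zero tracking error the output reduces to the gravity term alone, which is the point of the feed-forward: the PID part only has to correct model error and disturbances.
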
The complete runtime pipeline is summarized below, from task setup and visual key localization to calibrated 3D estimation, trajectory planning, feedback control, and repeated physical key presses.
- Ubuntu/Linux or WSL2
- Python 3.12
- Micromamba or Conda
- Git
- USB camera accessible through OpenCV
- Serial permissions for real robot control
```bash
sudo apt-get update
sudo apt-get install -y git build-essential ffmpeg
sudo usermod -a -G dialout $USER
```

After adding the user to the dialout group, log out and back in, or restart the session.
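
Because group membership only takes effect in a new session, it can help to confirm access to the serial device before starting the pipeline. The helper below is a hypothetical convenience check, not part of the repository; the default port is the one used in the examples of this README.

```python
import os


def serial_port_status(port="/dev/ttyACM0"):
    """Return 'missing', 'no-permission', or 'ok' for a serial device path."""
    if not os.path.exists(port):
        return "missing"        # robot unplugged or enumerated under another name
    if os.access(port, os.R_OK | os.W_OK):
        return "ok"             # current user can read and write the device
    return "no-permission"      # dialout membership is probably not active yet


if __name__ == "__main__":
    print(serial_port_status())
```
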
| Category | Packages |
|---|---|
| Robotics | lerobot, pinocchio |
| Computer vision | opencv-python |
| Scientific computing | numpy, scipy, matplotlib |
| VLM/OCR | openai, google-genai, optional easyocr |
| Configuration | pyyaml |
To use cloud-based localization, configure at least one provider:
```bash
export OPENAI_API_KEY="<your_openai_api_key>"
export GOOGLE_CLOUD_PROJECT="<your_google_cloud_project>"
export GOOGLE_CLOUD_LOCATION="global"
```

For Gemini, Google Cloud application-default credentials may also be required:

```bash
gcloud auth application-default login
```

Clone the repository:

```bash
git clone <repository_url>
cd robot_learning_group_task
```

The recommended installation path is through setup/environment.yml, which defines the project environment and Python dependencies.

```bash
micromamba env create -f setup/environment.yml
micromamba activate rl-project
```

If the environment already exists:

```bash
micromamba env update -f setup/environment.yml --prune
micromamba activate rl-project
```

Quick verification:

```bash
python -c "import cv2, pinocchio, lerobot, openai; print('Environment OK')"
```

Review and update cfg/main_pipeline.yaml before running:
```yaml
tasks:
  1:
    provider: openai
    model: gpt-5.5
    list_path: key_sequence/task_1.txt
  2:
    provider: gemini
    model: gemini-3-flash-preview
    list_path: key_sequence/task_2.txt
  3:
    provider: openai
    model: gpt-5.5
    list_path: key_sequence/task_3.txt
robot:
  port: /dev/ttyACM0
  calibration_path: cfg/<your_follower_name>.json  # or cfg/calibration/follower/<robot_id>.json
camera:
  index: 5
  backend: auto
  keyboard_height: 0.02
kinematics:
  urdf_path: cfg/arm_model/so101_new_calib.urdf
  press_ee_frame: key_contact_frame_link
tracking:
  disable_klt_for: [SPACE]
cluster:
  excluded_letters: [SPACE]
```

Make sure that:
- the serial port matches the connected robot;
- `robot.calibration_path` points to the calibration file for the connected SO-101 follower;
- `home_position_deg` is set for the local setup: use `src/utils/read_joints.py` to read a safe pose that keeps the wrist-mounted camera looking at the keyboard and leaves the full keyboard area reachable;
- the URDF file is available under `cfg/arm_model/`;
- `camera.index`, `camera.backend`, and `camera.keyboard_height` match the local camera and keyboard placement;
- camera intrinsics and the refined camera-to-robot transform in `camera_calib/calibrations/` were produced for this exact camera/gripper setup, including the nonlinear refinement step;
- the keyboard is placed in the calibrated workspace.
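
Part of the checklist above can be automated before the arm is powered. The following is a minimal sketch, not a script from the repository: it takes the configuration as an already-loaded dictionary (for example the result of `yaml.safe_load`), and the function name, the list of required keys, and the nesting of `kinematics.urdf_path` are assumptions drawn from the example configuration.

```python
import os

# Keys that the checklist treats as mandatory (assumed nesting).
REQUIRED_KEYS = [
    ("robot", "port"),
    ("robot", "calibration_path"),
    ("camera", "index"),
    ("camera", "keyboard_height"),
    ("kinematics", "urdf_path"),
]
# Keys whose values should point at existing files.
PATH_KEYS = [("robot", "calibration_path"), ("kinematics", "urdf_path")]


def preflight_problems(cfg, path_exists=os.path.exists):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for section, key in REQUIRED_KEYS:
        if (cfg.get(section) or {}).get(key) is None:
            problems.append(f"missing {section}.{key}")
    for section, key in PATH_KEYS:
        value = (cfg.get(section) or {}).get(key)
        if value is not None and not path_exists(value):
            problems.append(f"{section}.{key} not found: {value}")
    return problems
```

A check like this cannot confirm that the calibration matches the physical camera/gripper setup; it only catches missing entries and wrong paths.
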
This repository is currently structured as a Python-first robotics project. No compilation step is required for the main pipeline.
Optionally, byte-compile the sources as a quick syntax check:

```bash
python -m compileall main_pipeline.py src camera_calib
```

Activate the environment:

```bash
micromamba activate rl-project
```

Run task 1 from the file configured in cfg/main_pipeline.yaml:

```bash
python main_pipeline.py --config cfg/main_pipeline.yaml --task-1
```

Task key files live under key_sequence/:

```text
key_sequence/task_1.txt
key_sequence/task_2.txt
key_sequence/task_3.txt
```

Type a word or a sequence of letters:
```bash
python main_pipeline.py \
  --config cfg/main_pipeline.yaml \
  --word C A T \
  --camera 5 \
  --robot-port /dev/ttyACM0 \
  --provider openai \
  --model gpt-5.5
```

Run a task from a text file:

```bash
python main_pipeline.py \
  --task 2 \
  --list-path key_sequence/task_2.txt \
  --provider gemini \
  --model gemini-3-flash-preview
```

Numbered tasks default to their configured files under key_sequence/.
The provider/model defaults for each task come from cfg/main_pipeline.yaml; command-line flags still override them.
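
The precedence described here (per-task YAML defaults, overridden by any flag given on the command line) is a common pattern that can be sketched with `argparse`. This is an illustration only: `resolve_task_settings` is a hypothetical helper, and the actual argument handling in `main_pipeline.py` may differ.

```python
import argparse


def resolve_task_settings(argv, config):
    """Merge per-task config defaults with explicit command-line overrides."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", type=int)
    # Shortcut flag that behaves like "--task 1".
    parser.add_argument("--task-1", dest="task", action="store_const", const=1)
    parser.add_argument("--provider")
    parser.add_argument("--model")
    parser.add_argument("--list-path")
    args = parser.parse_args(argv)

    # Start from the defaults of the selected task (empty if no task is given).
    settings = dict(config.get("tasks", {}).get(args.task, {}))
    # Flags left unset stay None and therefore do not override the defaults.
    for key in ("provider", "model", "list_path"):
        value = getattr(args, key)
        if value is not None:
            settings[key] = value
    return settings
```

Leaving every optional flag at a default of `None` is what makes it possible to tell "not given" apart from "given", so the YAML value survives unless the user explicitly replaces it.
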
Use local OCR when available:
```bash
python main_pipeline.py \
  --word HELLO \
  --ocr \
  --camera 5
```

| Parameter | Description | Default |
|---|---|---|
| `--config` | Pipeline YAML configuration file | `cfg/main_pipeline.yaml` |
| `--camera` | OpenCV camera index | `5` |
| `--robot-port` | SO-101 serial port | `/dev/ttyACM0` |
| `--provider` | Cloud localization provider | `openai` |
| `--model` | Vision-language model | `gpt-5.5` |
| `--urdf-path` | Robot URDF path | `cfg/arm_model/so101_new_calib.urdf` |
| `--hover-height` | Offset above the target key | Configured in YAML |
| `--press-depth` | Key press depth | Configured in YAML |
| `--disable-klt-for` | Keys held after initial localization instead of KLT tracking | `tracking.disable_klt_for` |
| `--cluster-excluded-letters` | Keys handled alone instead of grouped into clusters | `cluster.excluded_letters` |
| `--hover-offset-xy` | XY hover offset before pressing | `trajectory.hover_offset_xy` |
| `--task-1` | Shortcut for `--task 1` | `tasks.1.list_path` |

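
To make the roles of `--hover-height` and `--press-depth` concrete, the sketch below builds the Cartesian waypoints for one key press and interpolates between two of them. It is a simplified stand-in for `src/traj_generation.py`, not a copy of it: each segment is an independent rest-to-rest cubic rather than a spline through all waypoints, the XY hover offset is ignored, and both function names are hypothetical.

```python
def press_waypoints(home_xyz, key_xyz, hover_height, press_depth):
    """Waypoints for home -> hover -> press -> home, in world coordinates."""
    x, y, z = key_xyz
    hover = (x, y, z + hover_height)   # directly above the key
    press = (x, y, z - press_depth)    # slightly below the key surface
    return [tuple(home_xyz), hover, press, tuple(home_xyz)]


def cubic_segment(p0, p1, duration, t):
    """Rest-to-rest cubic between p0 and p1, evaluated at time t."""
    tau = min(max(t / duration, 0.0), 1.0)   # normalized time in [0, 1]
    s = 3 * tau**2 - 2 * tau**3              # zero velocity at both ends
    return tuple(a + s * (b - a) for a, b in zip(p0, p1))


if __name__ == "__main__":
    wps = press_waypoints((0.10, 0.0, 0.15), (0.20, 0.05, 0.02), 0.03, 0.004)
    # Sample the hover -> press segment at its midpoint.
    print(cubic_segment(wps[1], wps[2], duration=1.0, t=0.5))
```

Stopping at every waypoint is slower than a single smooth spline, but it keeps the example short and makes the effect of each parameter easy to see.
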
The example calibration values in this repository are not portable across
robots. Each hardware setup must provide its own robot calibration, camera
intrinsics, hand-eye transform, nonlinear hand-eye refinement, keyboard height,
and home_position_deg before running the eval scripts.
See camera_calib/CALIBRATION_SETUP.md for the step-by-step calibration
checklist.
The scripts under camera_calib/ support camera calibration and hand-eye
transform refinement:
```bash
python camera_calib/camera_calibration.py
python camera_calib/hand_eye_calibration.py
python camera_calib/refine_handeye_from_keyboard.py
```

Relevant files:

| File | Purpose |
|---|---|
| `cfg/calibration/follower/<robot_id>.json` | SO-101 follower calibration for the connected robot |
| `camera_calib/calibrations/camera_calibration.npz` | Camera intrinsics for the mounted camera |
| `camera_calib/calibrations/rigid_nonlinear_refined.npy` | Refined camera-to-robot transform for that same camera/gripper setup |
| `camera_calib/stats/nonlinear_handeye_report.txt` | Calibration report |

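
The reason the hand-eye transform must match the physical setup becomes clear when a camera-frame point is mapped into the world frame. The sketch below assumes, purely for illustration, that the calibration yields a 4x4 homogeneous camera-to-end-effector matrix and that forward kinematics yields a 4x4 end-effector-to-world matrix; the actual layout of `rigid_nonlinear_refined.npy` is not documented here, and both helper functions are hypothetical.

```python
def matmul4(A, B):
    """Product of two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T to a 3D point p."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3]
                 for i in range(3))


if __name__ == "__main__":
    # Assumed example values: pure translations, identity rotations.
    T_world_ee = [[1, 0, 0, 0.20], [0, 1, 0, 0.00], [0, 0, 1, 0.30], [0, 0, 0, 1]]
    T_ee_cam = [[1, 0, 0, 0.00], [0, 1, 0, 0.05], [0, 0, 1, 0.00], [0, 0, 0, 1]]
    T_world_cam = matmul4(T_world_ee, T_ee_cam)
    print(transform_point(T_world_cam, (0.0, 0.0, 0.10)))
```

Any error in the camera-to-end-effector matrix is carried through this product into every estimated key position, which is why calibration files from another robot cannot be reused.
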
Contributions are welcome. To propose a change:
- Fork the repository.
- Create a dedicated branch.
- Implement the change while keeping the existing structure and style.
- Test on hardware when applicable.
- Open a Pull Request with a clear description, test results, and safety notes.

```bash
git checkout -b feature/<feature-name>
git commit -m "Add <description>"
git push origin feature/<feature-name>
```