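feat(stock): trade-alert cooldown duplicate suppression + ticker name resolution

- Cooldown (TRADE_ALERT_COOLDOWN_HOURS, default 6h): when the same ticker and condition oscillates between cleared and re-fired, the repeat notification is suppressed (set_alert_firing with mark_fired=False keeps the alert firing without updating its fired-at time, and the alert is counted as suppressed).
- Ticker name: even when the worker's firing payload has no name, the NAS resolves it via watchlist→portfolio→krx_master and includes it in notifications and history.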
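
The cooldown rule described above can be sketched roughly as follows. This is a minimal illustration only, not the code in this commit: `should_notify`, `process_edge`, and the in-memory `state` dict are hypothetical names, and the assumption that a re-fire exactly at the cooldown boundary is sent is mine.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the TRADE_ALERT_COOLDOWN_HOURS default named in the commit message.
COOLDOWN_HOURS = 6


def should_notify(last_fired_at, now, cooldown_hours=COOLDOWN_HOURS):
    """Return True when a newly firing (ticker, condition) edge may be sent.

    Hypothetical helper: an edge that has never fired is always sent; otherwise
    it is sent only once the cooldown window has fully elapsed (assumption).
    """
    if last_fired_at is None:
        return True
    return now - last_fired_at >= timedelta(hours=cooldown_hours)


def process_edge(state, key, now):
    """Decide 'sent' or 'suppressed' for one re-fire of `key`.

    `state` maps (ticker, condition) -> last fired-at time. On suppression the
    stored time is left untouched, analogous to the mark_fired=False behavior
    described in the commit message.
    """
    if should_notify(state.get(key), now):
        state[key] = now  # only a real notification refreshes the fired-at time
        return "sent"
    return "suppressed"
```

Because a suppressed re-fire does not refresh the stored time, an alert that keeps oscillating is still re-sent once the window measured from the last real notification has passed, rather than being pushed back indefinitely.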
feat(agent-office): add a one-line per-condition 'why buy/sell' rationale (💡) to trade alerts
2026-07-03 16:14:51 +09:00 · 2026-07-03 16:14:51 +09:00 · 2026-07-03 11:01:24 +09:00 · 2026-07-03 10:48:17 +09:00 · 2026-07-03 10:45:45 +09:00 · 2026-07-03 10:37:57 +09:00
71 changed files with 7686 additions and 216 deletions
--- a/.mcp.json
+++ b/.mcp.json
@@ -0,0 +1,9 @@
+{
+  "mcpServers": {
+    "co-gahusb": {
+      "type": "http",
+      "url": "https://gahusb.synology.me/api/co/mcp",
+      "headers": { "Authorization": "Bearer ${CO_BUS_KEY}" }
+    }
+  }
+}
--- a/CHECK_POINT.md
+++ b/CHECK_POINT.md
@@ -1,209 +1,121 @@
 # web-backend CHECK_POINT

-> NAS Docker 11 컨테이너(9 백엔드 + frontend + deployer). Synology Celeron J4025 (2C 2.0GHz) 18GB.
-> 2026-05-18 작성 — uvicorn CPU 폭주 진단 결과 정리.
-
-## 🔴 즉시 (오늘, 총 1시간 5분)
-
-### 1. 09:00 cron 5분 스태거링 ⭐ 가장 큰 효과
-
-**파일**: `agent-office/app/scheduler.py:72-76`
-```python
-# 변경 전 — 09:00 동시 실행 (CPU 폭주 원인 #1)
-scheduler.add_job(_run_insta_trends_collect, "cron", hour=9, minute=0)
-scheduler.add_job(_run_lotto_schedule, "cron", day_of_week="mon", hour=9, minute=0)
-scheduler.add_job(_run_youtube_research, "cron", hour=9, minute=0)
-
-# 변경 후 — 5분 스태거링
-scheduler.add_job(_run_insta_trends_collect, "cron", hour=9, minute=0,  id="insta_trends")
-scheduler.add_job(_run_lotto_schedule,       "cron", day_of_week="mon", hour=9, minute=5, id="lotto_curate")
-scheduler.add_job(_run_youtube_research,     "cron", hour=9, minute=10, id="youtube_research")
-```
-
-**파일**: `realestate-lab/app/main.py:51`
-```python
-# 변경 전
-scheduler.add_job(scheduled_collect, "cron", hour=9, minute=0, id="collect")
-
-# 변경 후
-scheduler.add_job(scheduled_collect, "cron", hour=9, minute=15, id="collect")
-```
-
- [x] agent-office scheduler.py 수정 (2026-05-18)
- [x] realestate-lab main.py 수정 (2026-05-18)
- [ ] git commit + push (Gitea Webhook 자동 빌드)
+> NAS Docker (Synology Celeron J4025 2C 2.0GHz, 18GB). 16+ 컨테이너(14 서비스 + Redis + frontend + deployer).
+> 2026-06-12 갱신 — 5/18 CPU 진단·NAS↔Windows 분산부터 6/12 음악 파이프라인 신뢰성까지 반영.
+> 운영 세부(DB·스케줄러·env·함정)는 `memory/service_<name>.md`가 authoritative. 이 파일은 **무엇이 끝났고 다음에 뭘 하나**의 보드.

 ---

-### 2. insta-lab Playwright Semaphore(1) ⭐
+## ✅ 완료 타임라인 (5/18 → 6/12)

-**파일**: `insta-lab/app/main.py` (모듈 레벨 추가)
-```python
-import asyncio
+### 5/18~22 — CPU 진단 + NAS↔Windows 분산 + 로또 자율화
+- **CPU 폭주 즉시 5건**: 09:00 cron 5분 스태거링(insta/lotto/youtube/realestate) · lotto Monte Carlo 08:30 이동 · insta Playwright Semaphore(1) · healthcheck 60s · uvicorn `--workers 1` · realestate 수집 병렬화
+- **Redis 분산** (박재오 7결정): Redis 컨테이너 신설(7-alpine 256MB AOF) · insta/music/video-lab을 `queue:*-render` push 게이트웨이로 전환(렌더는 Windows web-ai 워커) · internal webhook + nginx 3-layer 차단 · stock webai_cache TTLCache
+- **video-lab 신설** (18801) — Windows video-render의 NAS 짝 (sora/veo/kling/seedance)
+- **로또 능동 시그널 v1** — lotto_signals/baselines, z-score, urgent/digest 텔레그램, cron 4종
+- **weight-evolver 자율 학습 v2** — weight_trials/auto_picks, 주간 generate→apply→evaluate 루프

-# 모듈 레벨에 한 번만 선언
-RENDER_SEMAPHORE = asyncio.Semaphore(1)  # Chromium 동시 실행 1개로 제한
+### 5/25~26 — tarot/saju 분리·신설 + UI
+- **tarot-lab 분리** (18250) — agent-office에서 독립, Claude 3-card
+- **saju-lab 신설** (18300) — saju-web TS→Python 포팅, lunar↔solar 내장, 궁합 포함
+- **saju UI v1 + v2 리디자인** + fortune_scores/lucky/monthly_flow 추가
+- image-lab public gateway + `/media/image/` 정적 서빙 · tarot max_tokens 2800 truncation fix

-# 카드 렌더 백그라운드 함수에 감싸기
-async def _bg_render(task_id: str, slate_id: int):
-    async with RENDER_SEMAPHORE:
-        await card_renderer.render_slate(slate_id, ...)
-```
+### 5/28 — 공유 로그 인프라
+- **`_shared/access_log` 공용 모듈** (lotto/stock/music/insta/realestate 5종) — ring buffer + middleware + `/logs/recent`
+- agent-office `/agents/{id}/logs`가 서비스 로그 merge · 매일 03:00 agent_logs 90일 retention

- [x] card_renderer.render_slate를 Semaphore(1)로 감쌈 (2026-05-18, lazy init)
- [ ] 동시 2개 요청 테스트 (curl 동시 2회 → 순차 처리되는지 확인)
+### 5/31 — 자율 인텔리전스 2종 (스마트에이전트 1·2번)
+- **로또 자가학습 백테스트·캘리브레이션 v3** — backtest_runs/winner_calibration, forward 가상구매 3전략, ε-게이팅 lift 학습, 일요회고 cron. 역대 캘리브레이션 백필 1197/1197 (6/11)
+- **주식 보유종목 인텔리전스** — holdings_signals, market_events/news_issues/portfolio_health, decide_action 매트릭스, EOD(16:50)+브리핑(08:30) cron
+
+### 6/01~06 — 보안 + 인스타 카드뉴스
+- nginx CVE 대응 (CVE-2026-42945 · CVE-2026-9256 → 1.30.2)
+- **인스타 카드뉴스 품질 고도화 v2** + zip 패키지(10 PNG + caption.txt) + 글자수 가이드
+
+### 6/11 — 자율 발급 + 오버사이트 (스마트에이전트 3번)
+- **인스타 자율 카드 발급** — 4신호 선별(selection.py) + Claude Haiku 카드가치 판단 + 승인 게이트 + 발행 상태머신. 텔레그램 issue_approve/reject/regen 콜백. **autonomous_issue 기본 OFF**
+- **에이전트 횡단 오버사이트(백엔드)** — `GET /api/agent-office/activity` 통합 피드 + 필터(agent_id/type/status/days). main `2c2828c` 배포
+- CLAUDE.md 카탈로그 슬림화(966→484, 서비스별 메모리 분담) · packs jti SQLite 영속화 · lotto deep CuratorError fallthrough fix
+
+### 6/12 — 음악 파이프라인 신뢰성·복구 (직전 작업)
+- **자동 재시도**: orchestrator step 3회 backoff 재시도(publish 제외 — 업로드 비멱등)
+- **수동 재개**: `POST /api/music/pipeline/{id}/retry` — 실패 step 판별·재개, retrying 레이스 가드, publish+업로드완료 시 409
+- **실패 알림**: agent-office youtube_publisher가 신규 failed 감지 → 텔레그램 `⚠️실패` + `[🔄재시도]` 인라인 버튼 → music-lab retry 프록시
+- 커밋·push·자동배포 완료 (main = origin/main)
+
+> **스마트에이전트 3종 전부 가동**: stock(보유종목) · insta(자율발급) · lotto(진화). CEO 오버사이트(통합 활동 피드) 백엔드 완료.

 ---

-### 3. healthcheck interval 60s
+## 🔴 즉시 — 진행 중 / 대기

-**파일**: `docker-compose.yml` (모든 9 컨테이너)
-```yaml
-# 변경 전
-healthcheck:
-  interval: 30s
+### 1. ✅ agent oversight 프론트 NAS 배포 — 완료 (2026-06-12)
+- web-ui `ActivityTimeline`(AgentOffice 우측 기본 패널) main 머지(`d0bf5fd`) → NAS 라이브 반영·검증 완료 (index.html 갱신 + AgentOffice 번들 nginx 200)
+- **배포 방법**: Z: 매핑이 `!` TTY로 안 돼서 **SSH 직접 배포**(`bgg8988@gahusb.synology.me:2300`, tar + `scp -O` → assets 교체). Synology SFTP off라 `scp -O` 필수, images/videos는 불변이라 미러 제외. 상세 → `memory/feedback_windows_frontend_ssh_deploy.md`

-# 변경 후
-healthcheck:
-  interval: 60s
-```
-
- [x] docker-compose.yml 10개 healthcheck 일괄 변경 (9 백엔드 + frontend, 2026-05-18)
- [ ] `docker compose up -d` 재기동
- [ ] `docker stats` 로 CPU 5% 정도 감소 확인
+### 2. 운영 검증 (분산·자율 학습)
+- [ ] Redis 분산 E2E (NAS push → Windows 워커 → webhook 전체 흐름)
+- [ ] lotto weight-evolver 주간 사이클(월 generate+apply → 토 evaluate) 정상 동작 + evolution report 텔레그램(토 22:15)

 ---

-### 4. uvicorn --workers 1 명시
+## 🟡 미완성 큰 기능

-**모든 Dockerfile CMD**:
-```dockerfile
-CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
-```
-
-영향 9 파일 (모두 2026-05-18 적용):
- [x] lotto/Dockerfile
- [x] stock/Dockerfile
- [x] music-lab/Dockerfile
- [x] insta-lab/Dockerfile
- [x] realestate-lab/Dockerfile
- [x] agent-office/Dockerfile
- [x] personal/Dockerfile
- [x] packs-lab/Dockerfile
- [x] travel-proxy/Dockerfile
-
-→ `docker compose build --no-cache` 후 재기동.
+### Video Studio 프론트 `/studio` — 백엔드 완료, UI 미구현
+- **백엔드 완료·배포**: image-lab(NAS 18802) ✅ + image-render(Windows web-ai) ✅ + video-lab(기존) ✅ (`plans/2026-05-23-video-studio-backend.md` 전부)
+- **빠진 것**: web-ui React Flow 노드 캔버스(ImageGenNode → ImageToVideoNode). 백엔드 plan이 "프론트는 Plan 2"로 미뤘으나 Plan 2 미생성
+- spec: `docs/superpowers/specs/2026-05-23-video-studio-node-canvas-design.md` (untracked — 커밋 필요)
+- 목적: 무신사·우리카드 AI 영상 공모전 실전 제작 도구

 ---

-### 5. lotto Monte Carlo 08:05 → 08:30
+## 🟡 후속 (직전 작업 범위 밖)

-**파일**: `lotto/app/main.py:86`
-```python
-# 변경 전 — stock 08:00과 5분 차이로 겹침
-scheduler.add_job(_run_simulation_job, "cron", hour="0,4,8,12,16,20", minute=5)
-
-# 변경 후 — 25분 분리
-scheduler.add_job(_run_simulation_job, "cron", hour="0,4,8,12,16,20", minute=30)
-```
-
- [x] lotto/app/main.py 수정 (2026-05-18)
+### music 파이프라인 stuck 감지
+- 6/12 신뢰성 작업이 명시적으로 남긴 갭: `*_running` hang · `*_pending` 방치 · retrying 중 컨테이너 재시작 시 stuck(현 retry 가드가 state=failed라 재retry 불가)
+- 상세: `memory/service_music.md` "파이프라인 신뢰성/복구 — 범위 밖"

 ---

-## 🟡 중기 (1~2주)
+## 🟢 백로그 아이디어

-### 6. Chromium Browser Pool 재설계 (insta-lab) ✅ 2026-05-18
- 매번 launch X → 1개 인스턴스 재사용
- 카드 10장 렌더 시간 30% 단축 기대
- [x] `card_renderer.py` 내부에 모듈 레벨 `_PLAYWRIGHT`/`_BROWSER` + `init_browser`/`shutdown_browser` 함수 (별도 모듈 분리 안 함, 같은 파일에 인접 배치)
- [x] `_render_slate_locked` 본체에서 `_get_browser()` 재사용 (crashed 시 lazy 재초기화)
- [x] `main.py` startup hook에서 `init_browser()`, shutdown hook에서 `shutdown_browser()`
+- **Redis 큐 통합 모니터링** — agent-office에 `queue:*-render`/`queue:paused` 길이·상태 패널 (NAS↔Windows 작업 흐름 가시화)
+- **weight-evolver 성과 대시보드** — auto_picks 적중 추이 + weight_base 진화 그래프 (자율 학습 실효성 검증)
+- **lotto-signals 패턴 확장** — adaptive baseline + z-score + urgent 텔레그램을 stock(이상치)·realestate(경쟁률 급변)에 재사용
+- **nginx internal 차단 표준화** — insta/music/video/image 3-layer 차단을 공통 include로 추출
+- **agent-office 레거시 정리** — tarot_readings 테이블 잔존(tarot-lab 분리 후), seed "blog" 죽은 에이전트

-### 7. stock 뉴스 스크랩 비동기화 — ⚠️ 보류 2026-05-18
- **재진단**: stock은 `BackgroundScheduler` 사용 중 → main loop 블로킹 없음 (이미 별도 thread)
- `fetch_market_news`의 4개 동기 `requests.get`은 network I/O wait라 CPU 거의 사용 안 함
- `to_thread`로 wrap해도 BackgroundScheduler 환경에서 사실상 의미 없음
- 진짜 효과를 보려면 AsyncIOScheduler 전환 + scraper.py 4개 fetch를 `aiohttp` 병렬로 — **큰 리팩토링 vs 효과 불명확**
- [ ] 박재오 판단: 큰 리팩토링 진행 여부
-
-### 8. realestate 수집 병렬화 ✅ 2026-05-18
- **파일**: `realestate-lab/app/main.py:scheduled_collect`
- `collect_all()` + `delete_old_completed_announcements()` 병렬
- BackgroundScheduler 환경이라 `asyncio.gather` 대신 `ThreadPoolExecutor(max_workers=2)` 사용 (효과 동일)
- 매칭은 순차 유지 (DB 일관성)
- [x] ThreadPoolExecutor 적용
-
-### 9. lotto Monte Carlo 시뮬레이션 빈도 검토
- 현재 6회/일 (00·04·08·12·16·20)
- 실제 필요 빈도 박재오 결정 — 3회/일(아침·점심·저녁)로 줄이면 CPU 50% 감소
- [ ] 박재오 의사결정 후 cron 변경
-
---
-
-## 🟢 장기 (1개월+)
-
-### 10. 무거운 작업 Windows AI 서버로 이전 ✅ 이미 적용 상태 (2026-05-18 확인)
- **확인 결과**: NAS `.env`가 이미 `LLM_PROVIDER=claude` + `OLLAMA_URL=http://192.168.45.59:11435`로 설정됨
- 실 운영은 Anthropic Claude (원격 API) — NAS Celeron에서 LLM 추론 안 함
- Ollama fallback 사용 시에도 Windows AI 서버로 통일
- stock 외 다른 컨테이너에 ollama/qwen 호출 코드 없음
- 결론: 코드/설정 변경 불필요
-
-### 11. 컨테이너 리소스 제한 — ❌ 진행 금지 (박재오 명시 2026-05-18)
- J4025 2C 환경에서 cpus 0.5 제한은 오히려 throughput 손해
- 향후 작업자 무심코 도입하지 말 것
-
-### 12. NAS 업그레이드 검토 — ⏸️ 보류 (박재오 명시 2026-05-18)
- 현재: Celeron J4025 (2C 2.0GHz)
- 대안: Ryzen N5105 (4C 2.0GHz) NAS — 4코어로 병렬성 2배
- 자금·우선순위 결정 대기
-
---
-
-## ✅ 최근 완료 (참고)
-
- 2026-05-15: insta-lab 신설 (포트 18700, Jinja2 + Playwright + Claude Sonnet)
- 2026-05-16: insta-lab Playwright 1080×1350 PNG 렌더 완성
- 2026-05-17: agent-office random idle 제거, ADMIN_API_KEY 강화 (stock)
- 2026-05-17: insta-lab minimal theme + design_importer 추가
- 2026-05-17: blog-lab 트랙 완전 폐기 (docker-compose에 없음, 위키 정정 완료)
- 2026-05-18: 🔴 즉시 5건 일괄 적용 — 09:00 cron 스태거링(insta/lotto/youtube/realestate), lotto Monte Carlo 08:30, insta-lab Semaphore(1), healthcheck 60s, uvicorn --workers 1 명시 (사용자 push + NAS deployer 재기동 대기)
- 2026-05-18: 🟡 중기 2건 적용 — #6 insta-lab Chromium Browser Pool (lifecycle hook), #8 realestate ThreadPoolExecutor 병렬 (collect/delete). #7 stock async는 BackgroundScheduler 사용 중이라 재진단 후 보류 (효과 미미). #9 Monte Carlo 빈도는 박재오 결정 대기.
- 2026-05-18: 🟢 장기 진단·결정 — #10은 이미 적용 상태 확인 (LLM_PROVIDER=claude, OLLAMA_URL=Windows AI). #11 컨테이너 리소스 제한 박재오 진행 금지. #12 NAS 업그레이드 보류. web-ai V1(:8000)+V2(:8001) 4개 process 종료 — NAS API polling 부담 즉시 감소.
+### 보류 유지 (박재오 판단 대기)
+- stock 뉴스 스크랩 비동기화 — BackgroundScheduler I/O wait라 CPU 미미, 큰 리팩토링 vs 효과 불명확
+- lotto Monte Carlo 빈도(6→3회/일) — CPU 50%↓ vs 자율 학습 정확도 trade-off
+- 컨테이너 리소스 제한 — ❌ 박재오 금지(J4025 2C throughput 손해) · NAS 업그레이드 ⏸️ 보류(Redis 분산으로 우선순위↓)

 ---

 ## 🔧 진단 커맨드 (NAS bash)

 ```bash
-# 실시간 CPU 사용 (상위 15)
-top -b -n 1 | head -25
-
-# 프로세스별 CPU 정렬
-ps aux --sort=-%cpu | head -15
-
-# uvicorn·chromium·python 프로세스만
-ps aux | grep -E "uvicorn|chromium|python" | grep -v grep
-
-# 스케줄러 실행 로그 (최근 50)
+top -b -n 1 | head -25                                  # CPU 상위
+docker stats --no-stream                                # 컨테이너별 CPU/메모리
+docker exec redis redis-cli PING                        # Redis 헬스
+docker exec redis redis-cli KEYS 'queue:*'              # 큐 키 목록
+docker exec redis redis-cli LLEN queue:insta-render     # 큐 길이
 docker logs agent-office 2>&1 | grep -E "APScheduler|executing" | tail -50
-
-# insta-lab Chromium 프로세스 개수
-docker exec insta-lab ps aux | grep chromium | wc -l
-
-# 컨테이너별 CPU/메모리 실시간
-docker stats --no-stream
+docker exec insta-lab ps aux | grep chromium | wc -l    # (분할 후 0이어야 정상)
 ```

 ---

 ## 📚 참고

- 진단 풀 보고서: `C:\Users\jaeoh\Documents\Obsidian Vault\raw\2026-05-18-NAS-uvicorn-CPU-진단-개선안.md`
- 위키 페이지: [[사업-개인-웹-플랫폼]] (CPU 부하 진단 섹션 + 컨테이너 표)
- docker-compose.yml: 본 디렉토리 루트
+- 메모리 인덱스: `memory/MEMORY.md` (14 서비스 × `service_<name>.md` authoritative)
+- Windows 워커 짝: web-ai 레포 (insta/music/video/image-render)
+- spec/plan: `docs/superpowers/specs|plans/`
+- docker-compose.yml: 루트

 ## 변경 이력

- 2026-05-18: 페이지 신설. 즉시 5건 + 중기 4건 + 장기 3건. 진단 커맨드.
+- 2026-05-18: 페이지 신설. CPU 진단 즉시 5건 + 7결정 분산 가이드.
+- 2026-05-22: 분산·자율화 구현 반영. Redis 분할·lotto 능동시그널·weight-evolver.
+- 2026-06-12: **5/25~6/12 전체 작업 반영** — tarot/saju 분리·신설, _shared 로그, lotto v3 백테스트, stock 보유종목 인텔, nginx CVE, insta 카드뉴스 v2 + 자율발급, 에이전트 오버사이트, music 파이프라인 신뢰성. 미완성 큰 기능(Video Studio 프론트) + 후속(music stuck 감지) + 백로그 재편. 현재 트랙(oversight 프론트 배포) 명시.
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -21,7 +21,7 @@
 ## 1. 프로젝트 개요

 Synology NAS 기반의 개인 웹 플랫폼 백엔드 모노레포.
-- **서비스 14개**: lotto, stock, music-lab, video-lab, image-lab, insta-lab, realestate-lab, agent-office, tarot-lab, saju-lab, personal, packs-lab, travel-proxy, deployer
+- **서비스 15개**: lotto, stock, music-lab, video-lab, image-lab, insta-lab, realestate-lab, agent-office, tarot-lab, saju-lab, personal, packs-lab, travel-proxy, co-gahusb, deployer
 - **공유 인프라**: `_shared/access_log` 모듈 (5개 서비스 공유), `redis` (music/video/image/insta-lab 큐 공유)
 - **렌더/생성 위임**: music/video/image/insta의 무거운 생성·렌더는 **Windows AI 워커**(`web-ai` 별도 레포)가 담당. NAS 서비스는 Redis 큐 push + 결과 webhook 수신만 한다.
 - **프론트엔드**: 별도 레포 (React + Vite SPA), 빌드 산출물만 NAS에 배포
@@ -80,7 +80,8 @@ Synology NAS 기반의 개인 웹 플랫폼 백엔드 모노레포.
 | `packs-lab` | 18950 | NAS 자료 다운로드 자동화 (DSM 공유 링크 + 5GB 업로드, Vercel SaaS와 HMAC 통신) |
 | `personal` | 18850 | 개인 서비스 (포트폴리오·블로그·투두 통합) |
 | `travel-proxy` | 19000 | 여행 사진 API + 썸네일 생성 |
-| `redis` | 6379 | 비동기 큐 (music/video/image/insta-lab 공유) |
+| `co-gahusb` | 18920 | 세션 간 협업 팀 버스 (FastMCP streamable-http + Redis, Bearer `CO_BUS_KEY`, DNS-rebinding 보호 off) |
+| `redis` | 6379 | 비동기 큐 (music/video/image/insta-lab + co-gahusb 공유) |
 | `frontend` (nginx) | 8080 | 정적 SPA 서빙 + API 리버스 프록시 |
 | `webpage-deployer` | 19010 | Gitea Webhook 수신 → 자동 배포 |

@@ -106,6 +107,7 @@ Synology NAS 기반의 개인 웹 플랫폼 백엔드 모노레포.
 | `/api/blog/` | `personal:8000` | 블로그 API |
 | `/api/profile/` | `personal:8000` | 포트폴리오 API |
 | `/api/agent-office/` | `agent-office:8000` | AI 에이전트 오피스 API + WebSocket (86400s) |
+| `/api/co/` | `co-gahusb:8000/` | MCP 팀 버스 (trailing-slash strip → `/mcp`, `Authorization` forward, `proxy_buffering off`, 3600s) |
 | `/api/packs/upload` | `packs-lab:8000` | 5GB multipart 업로드 (`client_max_body_size 5G`, `proxy_request_buffering off`, **1800s** timeout) |
 | `/api/packs/` | `packs-lab:8000` | 다운로드/list |
 | `/api/internal/insta/` | `insta-lab:8000` | Windows 워커 webhook (nginx IP 화이트리스트 + 앱 `X-Internal-Key`) |
@@ -244,6 +246,10 @@ docker compose up -d
 | GET | `/api/portfolio/snapshot/history` | 스냅샷 이력 (`days`) |
 | GET/POST | `/api/portfolio/sell-history` | 매도 내역 조회/저장 |
 | PUT/DELETE | `/api/portfolio/sell-history/{id}` | 매도 기록 수정/삭제 |
+| GET/POST/DELETE | `/api/stock/watchlist` (+ `/{ticker}`) | 실시간 매수 알람 관심종목 CRUD |
+| GET | `/api/stock/trade-alerts` | 매매 알람 이력 (`days`) |
+| GET | `/api/webai/trade-alert/monitor-set` | (워커) 감시대상 조립 = watchlist∪screener∪보유 + session/params (X-WebAI-Key) |
+| POST | `/api/webai/trade-alert/report` | (워커) 발화집합 수신 → edge diff → 신규만 텔레그램 push (X-WebAI-Key) |

 ### music-lab (music-lab/)
 듀얼 프로바이더 음악 생성(Suno + MusicGen) + YouTube 영상 자동화 파이프라인 + 시장 트렌드.
@@ -265,7 +271,8 @@ docker compose up -d
 | POST/GET | `/api/music/generate-batch` | 배치 생성 |
 | POST/GET | `/api/music/compile` (+ `/compiles/{id}/export`) | 컴파일 |
 | POST/GET/DELETE | `/api/music/video-project` (+ `/{id}/render`, `/export`) | 영상 프로젝트 |
-| ALL | `/api/music/pipeline` (생성/start/feedback/cancel/publish/telegram-msg/lookup) | YouTube 자동화 파이프라인 |
+| ALL | `/api/music/pipeline` (생성/start/feedback/cancel/publish/retry/telegram-msg/lookup) | YouTube 자동화 파이프라인. `POST /{id}/retry`=실패 step 재개(publish+업로드완료 시 409) |
+| DELETE | `/api/music/pipeline/{id}` | 파이프라인 행 하드 삭제(자식 jobs/feedback 포함, 전체 목록에서 제거). 없으면 404 |
 | GET/PUT | `/api/music/setup` | 파이프라인 설정 |
 | GET | `/api/music/youtube/auth-url`, `/callback`, `/status`; POST `/disconnect` | YouTube OAuth |
 | GET/POST/PUT/DELETE | `/api/music/revenue` (+ `/dashboard`) | 수익 기록 |
@@ -345,7 +352,7 @@ docker compose up -d

 ### agent-office (agent-office/)
 AI 에이전트 가상 오피스 — 기존 서비스 API를 프록시로 호출, 실시간 WebSocket + 텔레그램 봇.
-- 핵심 파일: `main.py`, `db.py`, `config.py`, `websocket_manager.py`, `service_proxy.py`, `telegram_bot.py`, `scheduler.py`, `agents/`(stock/music/realestate/youtube/youtube_publisher/lotto/base)
+- 핵심 파일: `main.py`, `db.py`, `config.py`, `websocket_manager.py`, `service_proxy.py`, `telegram_bot.py`, `scheduler.py`, `node_monitor.py`(분산 워커 관측 집계+경보), `agents/`(stock/music/realestate/youtube/youtube_publisher/lotto/base)
 - 에이전트 7종 레지스트리. 명령 API body 필드명 → `reference_agent_office_command_api.md`
 - 📌 상세(DB 9테이블·FSM·전체 cron 목록·AGENT_CONTAINER_MAP·텔레그램 캐싱·env): **`service_agent_office.md`**

@@ -360,11 +367,13 @@ AI 에이전트 가상 오피스 — 기존 서비스 API를 프록시로 호출
 | POST | `/api/agent-office/telegram/webhook` | 텔레그램 Webhook (realestate_bookmark_* 콜백 포함) |
 | POST | `/api/agent-office/realestate/notify` | realestate-lab 전용 push 수신 → 텔레그램 |
 | GET | `/api/agent-office/states` | 전체 에이전트 상태 |
+| GET | `/api/agent-office/nodes` | 분산 워커(NAS↔Windows) 관측 — heartbeat 생사+큐깊이+dead-letter 집계 (web-ui `/infra` Three.js 시각화 소비). 상세 → `infra_distributed_workers.md` |
 | GET | `/api/agent-office/activity` | 전 에이전트 통합 활동 피드 (tasks+logs UNION). 필터 `agent_id`/`type`(task\|log)/`status`/`days` + `limit`/`offset` |
 | GET | `/api/agent-office/conversation/stats` | 텔레그램 대화 토큰·캐시 통계 (`days`) |
 | POST/GET | `/api/agent-office/youtube/research` (+ `/status`) | YouTube 트렌드 수집 트리거/상태 |
 | GET | `/api/agent-office/lotto/signals`, `/lotto/baselines` | 로또 시그널 이력·baseline |
 | POST | `/api/agent-office/lotto/signal-check` | 로또 시그널 평가 트리거 (light/sim/deep) |
+| POST | `/api/agent-office/stock/trade-alert` | stock에서 push된 매매 알람 → 텔레그램(너+아내). 봇 명령 `/watch`·`/unwatch`·`/watchlist`로 watchlist 관리 |

 ### tarot-lab (tarot-lab/)
 타로 카드 해석 (Claude Sonnet, agent-office에서 2026-05-25 독립).
@@ -483,5 +492,17 @@ Gitea Webhook 수신 → 자동 배포. HMAC SHA256 검증(`X-Gitea-Signature`
 - **공휴일 목록**: `stock/app/holidays.json` 매년 수동 갱신 (KRX 기준)
 - **Windows AI 서버 IP**: `192.168.45.59` (DHCP 고정 예약). Tailscale은 Synology userspace 모드라 TCP 불가 → 로컬 IP 사용
 - **렌더/생성 워커 분리**: music/video/image/insta 무거운 작업은 Windows `web-ai` 워커. NAS 코드의 `*_provider.py`/`card_renderer.py`가 DEPRECATED stub면 실 로직은 web-ai 쪽이 authoritative
+- **[팀 규칙] 모든 WSL(docker) 워커는 `/infra`에서 관측 가능해야 한다**: 새 워커 추가 시 필수 3단계 — ① 워커가 `worker:<name>:heartbeat`(EX45, ~15초) 발신 ② BE가 `agent-office/app/node_monitor.py`의 `WORKER_REGISTRY`에 `{name,kind,queue}` 등재 ③ → `/api/agent-office/nodes`·web-ui `/infra` 노출 + 다운/복구/dead-letter 텔레그램 경보. 미준수 = "사일런트 사망"(insta-render 2주 무관측 사고) 재발 위험. 워커 신규/변경 PR 머지 게이트. web-ai/web-ui repo CLAUDE.md에도 동일 규칙 명시 필요. 상세는 `infra_distributed_workers.md` 메모리(관측 계약 2)
 - **Playwright Dockerfile**: bookworm 고정 + 수동 chromium deps, `--with-deps` 금지 (`feedback_playwright_dockerfile.md`)
 - **lab 네이밍**: `-lab`은 개발/연구 단계에만, 정식 서비스엔 미사용 (`feedback_lab_naming.md`)
+
+---
+
+## 협업 팀 버스 (co-gahusb) — 이 세션의 역할: **BE**
+
+이 세션은 백엔드(BE) 역할이다. co-gahusb MCP 툴로 다른 세션(FE/AI/Producer)과 협업한다.
+- **소유권**: 이 세션은 `web-backend` repo만 쓴다(FE=web-ui, AI=web-ai).
+- **공유 리소스 변경 전 반드시 `acquire_lock(resource, "BE")`**: 대상 = `nas-deploy`, `stock-db-schema`, `lotto-db-schema`, `memory-mirror`, `nginx-conf`, `compose`. 점유 중이면 대기, 긴 작업은 `heartbeat_lock`, 끝나면 `release_lock`.
+- **모든 툴 호출에 `role="BE"`** (또는 `from_role`/`created_by`에 BE).
+- **수신**: `/loop`로 주기적으로 `read_inbox("BE", after_id=<last>)` + `list_tasks(assignee_role="BE")` 확인.
+- 키 `CO_BUS_KEY`는 환경변수로 주입(커밋 금지).
--- a/README.md
+++ b/README.md
@@ -115,6 +115,7 @@ curl http://localhost:18500/health
 - **실계좌**: Windows AI 서버(192.168.45.59:8000) 프록시 → KIS Open API (잔고/주문)
 - **포트폴리오**: 종목·예수금·매도 히스토리 관리, 현재가 자동 조회
 - **자산 스냅샷**: 평일 15:40 자동 저장 (KRX 공휴일 판별, `holidays.json` 매년 갱신)
+- **실시간 매매 알람** (2026-07-02): 장중(+시간외) 1분 폴링으로 매수(watchlist ∪ 스크리너 후보, TA 시그널)·매도(보유종목, exit 룰 + 트레일링 스톱) 조건 충족 시 텔레그램(본인+아내) 알람. **TA 계산은 Windows `trade-monitor` WSL2 docker 워커**, NAS는 감시대상 조립 + edge 중복판정(영속) + 발송 담당. 관심종목은 `/api/stock/watchlist` CRUD 또는 텔레그램 `/watch` 봇 명령. webai 계약: `GET /api/webai/trade-alert/monitor-set` · `POST /report`. 워커/프론트 탭은 web-ai/web-ui repo (설계: `docs/superpowers/specs/2026-07-02-realtime-trade-alerts-design.md`)

 **LLM provider 전환** — `LLM_PROVIDER` 환경변수
 - `claude` (기본): Anthropic Messages API (`claude-haiku-4-5`)
@@ -169,6 +170,8 @@ AI 에이전트 가상 오피스 — 2D 픽셀아트 사무실에서 4명의 에
 - **텔레그램 연동**: 양방향 알림 + 인라인 키보드 승인
  - 봇이 작업 결과를 텔레그램으로 푸시, 명령은 텔레그램에서 바로 에이전트에 전달
  - Webhook 검증 후 `chat.id` 기준 라우팅
+  - **실시간 매매 알람 수신**: `POST /api/agent-office/stock/trade-alert` (stock이 edge 판정한 알람 push) → 텔레그램 본인+아내 발송. 봇 명령 `/watch`·`/unwatch`·`/watchlist`로 관심종목 관리
+- **분산 워커 관측**: `GET /api/agent-office/nodes`가 `worker:<name>:heartbeat`를 집계 → web-ui `/infra` 시각화 + 다운/복구/dead-letter 텔레그램 경보. WSL docker 워커는 `node_monitor.WORKER_REGISTRY` 등재 필수(위 주의사항 팀 규칙)

 #### 에이전트 구성

@@ -283,11 +286,11 @@ git push → Gitea → X-Gitea-Signature (HMAC SHA256)
 | DB | 소유 서비스 | 주요 테이블 |
 |----|------------|-----------|
 | `lotto.db` | lotto | draws, recommendations, simulation_runs/candidates, best_picks, purchase_history, strategy_performance/weights, weekly_reports, lotto_briefings |
-| `stock.db` | stock | articles, portfolio, broker_cash, asset_snapshots, sell_history |
+| `stock.db` | stock | articles, portfolio, broker_cash, asset_snapshots, sell_history, holdings_signals, news_sentiment, **watchlist, trade_alert_state, trade_alert_history** (실시간 매매 알람) |
 | `music.db` | music-lab | music_tasks, music_library (provider, lyrics, image_url, suno_id, file_hash, cover_images, wav_url, video_url, stem_urls), video_projects, revenue_records, market_trends, trend_reports |
 | `insta.db` | insta-lab | news_articles, trending_keywords (source 컬럼), card_slates, card_assets, generation_tasks, prompt_templates, account_preferences |
 | `realestate.db` | realestate-lab | announcements, announcement_models, user_profile, match_results, collect_log |
-| `agent_office.db` | agent-office | agent_config, agent_tasks, agent_logs, telegram_state, conversation_messages |
+| `agent_office.db` | agent-office | agent_config, agent_tasks, agent_logs, telegram_state, conversation_messages, youtube_research_jobs, lotto_signals/baselines, notified_failed_pipelines (파이프라인 실패 알림 dedup) |
 | `personal.db` | personal | profile, careers, projects, skills, introductions, todos, blog_posts |
 | `travel.db` | travel-proxy | photos (album, filename, mtime, has_thumb), album_covers |
 | `pack_files` (외부 Supabase) | packs-lab | filename, host_path, mime, byte_size, sha256, deleted_at |
@@ -384,6 +387,52 @@ PORTFOLIO_EDIT_PASSWORD=
 - **Suno CDN** — `cdn1.suno.ai` URL은 임시 만료 → 생성 즉시 로컬 다운로드 필수
 - **LLM provider 롤백** — Claude API 장애 시 `.env`의 `LLM_PROVIDER=ollama`로 전환 후 `docker compose up -d`
 - **시뮬레이션 교체 방식** — `best_picks`는 교체형 (`is_active=0` 비활성화 후 신규 입력)
+- **[팀 규칙] 모든 WSL docker 워커는 `/infra` 관측 필수** — 새 워커는 ① `worker:<name>:heartbeat`(EX45) 발신 ② BE가 `agent-office/app/node_monitor.py`의 `WORKER_REGISTRY`에 등재 ③ → `/api/agent-office/nodes`·web-ui `/infra` 노출 + 다운/복구/dead-letter 경보. 미준수 = 사일런트 사망 재발(insta-render 2주 사고). 워커 PR 머지 게이트
+- **Alpine + tzdata 함정** — stock 컨테이너는 `python:3.12-alpine` + tzdata 미설치라 `TZ=Asia/Seoul`이 무효 → `date.today()`가 UTC. KST 날짜는 `_today_kst()`(=`utcnow()+9h`) 명시 변환 필수 (아침 스케줄 리포트 하루 밀림 방지)
+
+---
+
+## 하네스 엔지니어링 (Claude Code 제어)
+
+이 레포는 Claude Code 세션의 동작을 `.claude/` 설정으로 **제어(harness engineering)** 한다. 모든 산출물은 git 추적되어 이 체크아웃의 모든 세션(co-gahusb 팀버스의 BE 역할 포함)에 공유된다.
+
+### 제어 표면 (무엇을 통제하는가)
+
+| 레이어 | 메커니즘 | 위치 | 역할 |
+|--------|---------|------|------|
+| 컨텍스트 주입 | CLAUDE.md 계층 + 서비스 메모리 | `CLAUDE.md`, `memory/service_*.md` | 항상 로딩되는 카탈로그(불변) ↔ 관련 시 recall(가변) 2계층 |
+| 권한 가드 | permissions allow/deny/ask | `.claude/settings.json` | 읽기전용 명령 무프롬프트 / 시크릿·DB 차단 / push·reset 확인 |
+| 행동 강제 | PreToolUse·PostToolUse·SessionStart hook | `.claude/hooks/` | CLAUDE.md 주석 규칙을 하네스가 실제 차단·환기 |
+| 반복 워크플로우 | slash commands | `.claude/commands/` | `/co-inbox`, `/svc`, `/harness-audit` |
+| 전문 역할 | subagents | `.claude/agents/` | `be-developer`, `evaluator` |
+| 협업 버스 | MCP 서버 | `.mcp.json` | co-gahusb 팀버스(세션 간 메시지·작업·락) |
+
+### 적용된 가드 (hook)
+
+| hook | 이벤트 / matcher | 동작 | 근거 |
+|------|-----------------|------|------|
+| `pretooluse-guard.sh` | PreToolUse · `Bash\|PowerShell` | **차단** 로컬 docker 변경(`up/down/build/restart/exec…`; ps·logs·config·images는 허용) | `feedback_docker_nas` |
+| 〃 | 〃 | **차단** `git commit --amend` · `git push --force`(`--force-with-lease`는 허용) | `feedback_concurrent_session_git_collision` |
+| 〃 | 〃 | **차단** PowerShell `>`/`>>` 파일 리다이렉트(UTF-16 BOM; `2>$null`·`> $null`은 허용) | `feedback_powershell_redirect_encoding` |
+| `posttooluse-memory.sh` | PostToolUse · `Edit\|Write` | 서비스 `db.py`/`models.py`/스케줄러/`.sql` 편집 시 `service_<name>.md` 갱신 환기(비차단) | 메모리 디스플린 |
+| `session-start.sh` | SessionStart · `startup\|resume` | BE 역할 + 수신함/락 넛지 주입 | 협업 버스 프로토콜 |
+
+차단 판단 로직은 `.claude/hooks/_guard.py`(Python). 래퍼는 파서 부재 시 **fail-open**(통과)하고, 출력은 UTF-8로 고정한다.
+
+### slash commands
+
+| 커맨드 | 용도 |
+|--------|------|
+| `/co-inbox` | co-gahusb 팀버스 BE 수신함(inbox + tasks + locks) 일괄 확인 |
+| `/svc <name>` | 해당 `service_<name>.md` 메모리 + 핵심 파일 위치를 즉시 로드 |
+| `/harness-audit` | 서브에이전트 fan-out으로 CLAUDE.md 카탈로그 ↔ 실제 코드 드리프트 감사 |
+
+### 확장 / 유지보수
+
+- **hook 경로는 이 머신 기준 절대경로**(`/c/Users/jaeoh/Desktop/workspace/web-backend/.claude/hooks/…`)다. 레포를 다른 경로로 클론하면 `settings.json`의 3개 hook command 경로를 갱신해야 한다.
+- 가드 패턴 추가/수정은 `_guard.py`만 고치면 된다(설정 변경 불필요).
+- hook은 새 세션에서 자동 로드된다. 진행 중 세션에 즉시 반영하려면 `/hooks` 메뉴를 열거나 재시작한다.
+- 메모리 디스플린: 코드 구조가 바뀌면 **CLAUDE.md(불변 카탈로그)** 가 아니라 **`service_*.md`(가변 상세)** 를 갱신한다.

 ---

@@ -391,3 +440,4 @@ PORTFOLIO_EDIT_PASSWORD=

 - `CLAUDE.md` — Claude Code 작업용 상세 컨텍스트 (API 전체 목록, 테이블 스키마 등)
 - `docs/` — 서비스별 기획·설계 문서
+- `.claude/` — 하네스 설정(settings·hooks·commands·agents). 위 "하네스 엔지니어링" 섹션 참조
--- a/agent-office/app/agents/youtube_publisher.py
+++ b/agent-office/app/agents/youtube_publisher.py
@@ -4,7 +4,12 @@ import logging
 from .base import BaseAgent
 from . import classify_intent
 from .. import service_proxy
-from ..db import add_log
+from ..db import (
+    add_log,
+    get_notified_failed_pipelines,
+    add_notified_failed_pipeline,
+    prune_notified_failed_pipelines,
+)
 from ..telegram.messaging import send_raw

 logger = logging.getLogger("agent-office.youtube_publisher")
@@ -25,6 +30,8 @@ class YoutubePublisherAgent(BaseAgent):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
+        # 진행 중(*_pending) 승인 요청 dedup — 인메모리 유지(의도적).
+        # 재시작 시 살아있는 파이프라인 승인 재알림은 유용한 리마인더라 스팸 아님.
        self._notified_state_per_pipeline: dict[int, tuple] = {}

    async def poll_state_changes(self) -> None:
@@ -48,6 +55,35 @@ class YoutubePublisherAgent(BaseAgent):
                    await self._notify_step(p)
                    self._notified_state_per_pipeline[pid] = key

+        try:
+            failed = await service_proxy.list_failed_pipelines()
+        except Exception as e:
+            # 일시적 폴링 실패를 "failed 없음"으로 오해하면 원장을 비워 재알림 스팸이 남.
+            # → 원장을 건드리지 않고 조용히 종료(다음 폴링에서 재시도).
+            logger.warning("failed 폴링 실패: %s", e)
+            return
+        notified = get_notified_failed_pipelines()
+        for p in failed:
+            pid = p.get("id")
+            if pid is None:
+                continue
+            if pid not in notified:
+                await self._notify_failed(p)
+                add_notified_failed_pipeline(pid)
+        # 재개되어 failed에서 벗어난 파이프라인은 재알림 가능하도록 원장에서 제거
+        failed_ids = {p.get("id") for p in failed if p.get("id") is not None}
+        prune_notified_failed_pipelines(failed_ids)
+
+    async def _notify_failed(self, p: dict) -> None:
+        reason = p.get("failed_reason") or "?"
+        step = reason.split(":", 1)[0].strip()
+        title = p.get("track_title") or f"Pipeline #{p['id']}"
+        text = f"⚠️ [{title}] 파이프라인 #{p['id']} '{step}' 실패\n사유: {reason}"
+        kb = {"inline_keyboard": [[{"text": "🔄 재시도", "callback_data": f"ytpub_retry_{p['id']}"}]]}
+        sent = await send_raw(text=text, reply_markup=kb)
+        if sent.get("ok"):
+            add_log(self.agent_id, f"pipeline {p['id']} 실패 알림", "warning")
+
    async def _notify_step(self, pipeline: dict) -> None:
        state = pipeline["state"]
        title_name, step = _STEP_TITLES[state]
--- a/agent-office/app/config.py
+++ b/agent-office/app/config.py
@@ -51,3 +51,9 @@ AGENT_CONTAINER_MAP: dict[str, tuple[str, int, _re.Pattern]] = {
    "insta":      ("insta-lab",      8000, _re.compile(r"^/api/insta")),
    "realestate": ("realestate-lab", 8000, _re.compile(r"^/api/realestate")),
 }
+
+# Redis (node monitor)
+REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379")
+NODE_ALERT_DEADLETTER_THRESHOLD = int(os.getenv("NODE_ALERT_DEADLETTER_THRESHOLD", "1"))
+# heartbeat TTL(45s)의 2배 — 키가 남아있어도 age>90s면 dead 판정
+NODE_STALE_THRESHOLD_SEC = int(os.getenv("NODE_STALE_THRESHOLD_SEC", "90"))
--- a/agent-office/app/db.py
+++ b/agent-office/app/db.py
@@ -158,6 +158,12 @@ def init_db() -> None:
            CREATE INDEX IF NOT EXISTS idx_tarot_favorite
            ON tarot_readings(favorite, created_at DESC)
        """)
+        conn.execute("""
+            CREATE TABLE IF NOT EXISTS notified_failed_pipelines (
+                pipeline_id INTEGER PRIMARY KEY,
+                notified_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
+            )
+        """)
        # Seed default agent configs
        for agent_id, name in [
            ("stock", "주식 트레이더"),
@@ -826,6 +832,47 @@ def get_all_baselines() -> List[Dict[str, Any]]:
    return out


+# --- notified_failed_pipelines (파이프라인 실패 알림 dedup 원장, 재시작 지속) ---
+
+def get_notified_failed_pipelines() -> set:
+    """이미 실패 알림을 발송한 pipeline_id 집합."""
+    with _conn() as conn:
+        rows = conn.execute(
+            "SELECT pipeline_id FROM notified_failed_pipelines"
+        ).fetchall()
+    return {r["pipeline_id"] for r in rows}
+
+
+def add_notified_failed_pipeline(pipeline_id: int) -> None:
+    with _conn() as conn:
+        conn.execute(
+            "INSERT OR IGNORE INTO notified_failed_pipelines(pipeline_id) VALUES(?)",
+            (pipeline_id,),
+        )
+
+
+def prune_notified_failed_pipelines(active_failed_ids) -> None:
+    """현재 failed 목록에 없는 pipeline_id를 원장에서 제거.
+
+    재개되어 failed에서 벗어난 파이프라인이 다시 실패하면 재알림 가능하도록 함.
+    (기존 인메모리 `_notified_failed &= failed_ids`의 영속 버전)
+    """
+    keep = set(active_failed_ids)
+    with _conn() as conn:
+        existing = {
+            r["pipeline_id"]
+            for r in conn.execute(
+                "SELECT pipeline_id FROM notified_failed_pipelines"
+            ).fetchall()
+        }
+        stale = existing - keep
+        for pid in stale:
+            conn.execute(
+                "DELETE FROM notified_failed_pipelines WHERE pipeline_id=?",
+                (pid,),
+            )
+
+
 def get_tasks_by_agent_date_kind(agent_id: str, date_iso: str, task_type: str) -> List[Dict[str, Any]]:
    """같은 (agent, date, task_type)으로 이미 생성된 task 조회. 멱등 guard."""
    with _conn() as conn:
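The ledger helpers above can be exercised in isolation. The following is a minimal sketch, assuming an in-memory SQLite database and a hypothetical `poll_once` caller that prunes before recording new alerts; the real caller is not part of this hunk, so that ordering is an assumption.

```python
import sqlite3
from typing import Iterable, List


def make_ledger() -> sqlite3.Connection:
    """Create an in-memory copy of the notified_failed_pipelines table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notified_failed_pipelines ("
        " pipeline_id INTEGER PRIMARY KEY,"
        " notified_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')))"
    )
    return conn


def poll_once(conn: sqlite3.Connection, failed_ids: Iterable[int]) -> List[int]:
    """Hypothetical poll step: return the ids that need an alert this cycle."""
    keep = set(failed_ids)
    known = {
        row[0]
        for row in conn.execute("SELECT pipeline_id FROM notified_failed_pipelines")
    }
    # Prune: ids that left the failed list may alert again on a later failure.
    for pid in known - keep:
        conn.execute(
            "DELETE FROM notified_failed_pipelines WHERE pipeline_id=?", (pid,)
        )
    # Record: INSERT OR IGNORE keeps the write idempotent.
    to_alert = sorted(keep - known)
    for pid in to_alert:
        conn.execute(
            "INSERT OR IGNORE INTO notified_failed_pipelines(pipeline_id) VALUES(?)",
            (pid,),
        )
    conn.commit()
    return to_alert
```

Under these assumptions, a pipeline alerts once while it stays failed, and alerts again only after it has dropped out of the failed list at least once.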
--- a/agent-office/app/main.py
+++ b/agent-office/app/main.py
@@ -187,6 +187,11 @@ async def telegram_webhook(data: dict):
 def all_states():
    return {"agents": get_all_agent_states()}

+@app.get("/api/agent-office/nodes")
+async def nodes_status():
+    from .node_monitor import collect_status
+    return await collect_status()
+
 @app.get("/api/agent-office/agents/{agent_id}/token-usage")
 def agent_token_usage(agent_id: str, days: int = 1):
    from .db import get_token_usage_stats
@@ -273,3 +278,19 @@ async def trigger_signal_check(source: str = "light"):
    if not agent:
        raise HTTPException(status_code=503, detail="lotto agent not registered")
    return await agent.run_signal_check(source=source)
+
+
+# --- Trade Alert Notify Endpoint ---
+
+class TradeAlertBody(BaseModel):
+    alerts: List[Dict[str, Any]] = []
+
+
+@app.post("/api/agent-office/stock/trade-alert")
+async def stock_trade_alert(body: TradeAlertBody):
+    from .notifiers.telegram_trade import send_trade_alerts
+    from .db import add_log
+    res = await send_trade_alerts(body.alerts)
+    for a in body.alerts:
+        add_log("stock", f"매매알람 {a.get('kind')} {a.get('ticker')} {a.get('condition')}", "info")
+    return res
--- a/agent-office/app/node_monitor.py
+++ b/agent-office/app/node_monitor.py
@@ -0,0 +1,148 @@
+"""Aggregate distributed worker status (read-only). Builds the schema of Global Constraints contract 2."""
+from __future__ import annotations
+import datetime as dt, json, logging
+import redis.asyncio as aioredis
+from .config import REDIS_URL, NODE_ALERT_DEADLETTER_THRESHOLD, NODE_STALE_THRESHOLD_SEC
+
+logger = logging.getLogger("agent-office.node_monitor")
+
+_node_state: dict[str, bool] = {}   # name -> previous alive value
+_dl_notified: dict[str, int] = {}   # name -> dead_letter count last alerted on
+
+WORKER_REGISTRY = [
+    {"name": "music-render", "kind": "render", "queue": "queue:music-render"},
+    {"name": "video-render", "kind": "render", "queue": "queue:video-render"},
+    {"name": "image-render", "kind": "render", "queue": "queue:image-render"},
+    {"name": "insta-render", "kind": "render", "queue": "queue:insta-render"},
+    {"name": "task-watcher",  "kind": "watcher", "queue": None},
+    {"name": "ai_trade",      "kind": "trader",  "queue": None},
+    {"name": "trade-monitor", "kind": "trader",  "queue": None},
+]
+
+_redis = None
+def _get_redis():
+    global _redis
+    if _redis is None:
+        _redis = aioredis.from_url(REDIS_URL, decode_responses=False)
+    return _redis
+
+
+def _beat_age(ts_str, now):
+    try:
+        beat = dt.datetime.fromisoformat(ts_str.replace("Z", "+00:00"))
+        return max(0, int((now - beat).total_seconds()))
+    except Exception:
+        return None
+
+
+def _render_link_status(w):
+    if not w["alive"]:
+        return "down"
+    if w["state"] == "paused":
+        return "paused"
+    if w["dead_letter"] > 0:
+        return "degraded"
+    return "healthy"
+
+
+async def collect_status(redis=None) -> dict:
+    r = redis or _get_redis()
+    now = dt.datetime.now(dt.timezone.utc)
+    out = {"redis_ok": True, "paused": False, "paused_reason": None,
+           "generated_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
+           "workers": [], "links": []}
+    try:
+        out["paused"] = (await r.get("queue:paused")) == b"1"
+    except Exception:
+        logger.exception("redis 접근 실패")
+        out["redis_ok"] = False
+        return out
+
+    for w in WORKER_REGISTRY:
+        try:
+            info = {"name": w["name"], "kind": w["kind"], "alive": False, "state": None,
+                    "last_beat_age_s": None, "queue_depth": 0, "dead_letter": 0,
+                    "processing": 0, "jobs_done": 0, "jobs_failed": 0, "last_job_at": None}
+            raw = await r.get(f"worker:{w['name']}:heartbeat")
+            if raw:
+                try:
+                    hb = json.loads(raw)
+                    age = _beat_age(hb.get("ts") or "", now)
+                    info["last_beat_age_s"] = age
+                    info["alive"] = age is not None and age <= NODE_STALE_THRESHOLD_SEC
+                    info["state"] = hb.get("state")
+                    info["jobs_done"] = hb.get("jobs_done", 0)
+                    info["jobs_failed"] = hb.get("jobs_failed", 0)
+                    info["last_job_at"] = hb.get("last_job_at")
+                    if w["kind"] == "watcher" and hb.get("mode"):
+                        out["paused_reason"] = hb["mode"]
+                except (json.JSONDecodeError, UnicodeDecodeError):
+                    logger.warning("heartbeat JSON 파싱 실패 name=%s", w["name"])
+            if w["queue"]:
+                info["queue_depth"] = await r.llen(w["queue"])
+                info["dead_letter"] = await r.llen(f"dead_letter:{w['queue']}")
+                proc = 0
+                async for key in r.scan_iter(match=f"processing:{w['queue']}:*"):
+                    proc += await r.llen(key)
+                info["processing"] = proc
+            out["workers"].append(info)
+        except Exception:
+            logger.exception("워커 상태 수집 실패 name=%s", w["name"])
+            out["redis_ok"] = False
+            break
+
+    for w in out["workers"]:
+        if w["kind"] == "trader":
+            out["links"].append({"from": w["name"], "to": "nas-stock", "type": "http-pull",
+                                 "status": "healthy" if w["alive"] else "down"})
+        elif w["kind"] == "render":
+            out["links"].append({"from": "nas", "to": w["name"], "type": "redis-queue",
+                                 "status": _render_link_status(w)})
+    if out["paused"] and not out["paused_reason"]:
+        out["paused_reason"] = "trading"
+    return out
+
+
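+async def check_and_alert(status=None) -> list[str]:
+    """Check worker status and send Telegram alerts on down / recovered / dead-letter transitions.
+
+    No alert on the first observation (prev=None), which avoids false alarms at boot.
+    Returns the list of alert texts that were actually sent (for tests).
+    """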
+    from .telegram.messaging import send_raw
+    from .db import add_log
+    try:
+        st = status or await collect_status()
+    except Exception:
+        logger.exception("collect_status 예외")
+        return []
+    sent: list[str] = []
+    for w in st["workers"]:
+        name = w["name"]
+        alive = w.get("alive", False)
+        prev = _node_state.get(name)
+        transition_send_failed = False
+        if prev is True and not alive:
+            text = f"🔴 [{name}] 워커 다운"
+            if (await send_raw(text=text)).get("ok"):
+                add_log("node_monitor", f"{name} 다운", "warning"); sent.append(text)
+            else:
+                transition_send_failed = True
+        elif prev is False and alive:
+            text = f"🟢 [{name}] 워커 복구"
+            if (await send_raw(text=text)).get("ok"):
+                add_log("node_monitor", f"{name} 복구", "info"); sent.append(text)
+            else:
+                transition_send_failed = True
+        if not transition_send_failed:
+            _node_state[name] = alive
+        dl = w.get("dead_letter", 0)
+        if dl >= NODE_ALERT_DEADLETTER_THRESHOLD and dl != _dl_notified.get(name, 0):
+            text = f"❌ [{name}] 실패 누적 {dl}건 (dead-letter)"
+            if (await send_raw(text=text)).get("ok"):
+                add_log("node_monitor", f"{name} dead-letter {dl}", "warning")
+                sent.append(text)
+                _dl_notified[name] = dl
+        elif dl == 0:
+            _dl_notified.pop(name, None)
+    return sent
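The alive/dead transition rule in `check_and_alert` can be shown as a pure function. This is a simplified sketch, not the module's code: `detect_transitions` and its `send_ok` flag are hypothetical stand-ins for the Telegram call, and dead-letter handling is left out.

```python
from typing import Dict, List, Optional


def detect_transitions(
    prev_state: Dict[str, bool], workers: List[dict], send_ok: bool = True
) -> List[str]:
    """Return transition events and update prev_state in place.

    send_ok simulates whether the alert delivery succeeded (assumption).
    """
    events: List[str] = []
    for w in workers:
        name = w["name"]
        alive = bool(w.get("alive", False))
        prev: Optional[bool] = prev_state.get(name)
        event = None
        if prev is True and not alive:
            event = f"{name}:down"
        elif prev is False and alive:
            event = f"{name}:recovered"
        if event is not None and not send_ok:
            # Delivery failed: keep the old state so the same transition
            # is detected again on the next cycle.
            continue
        if event is not None:
            events.append(event)
        # First observation (prev is None) only records a baseline.
        prev_state[name] = alive
    return events
```

Leaving the stored state untouched after a failed send is what turns the next cycle into a retry, at the cost of the state lagging behind reality until delivery succeeds.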
--- a/agent-office/app/notifiers/telegram_trade.py
+++ b/agent-office/app/notifiers/telegram_trade.py
@@ -0,0 +1,61 @@
+"""Format and send trade alerts over Telegram (to the user and wife, one message each)."""
+import logging
+from typing import Any, Dict, List
+
+from ..telegram.messaging import send_raw
+from ..config import TELEGRAM_CHAT_ID, TELEGRAM_WIFE_CHAT_ID
+
+logger = logging.getLogger("agent-office")
+
+_KIND_LABEL = {"buy": "🟢 매수", "sell": "🔴 매도"}
+_COND_LABEL = {
+    "buy_ma20_pullback": "지지선 되돌림", "buy_breakout": "돌파", "buy_rsi_bounce": "RSI 과매도 반등",
+    "sell_stop_loss": "손절", "sell_ma_break": "이평 이탈", "sell_take_profit": "익절",
+    "sell_climax": "급등 소진", "sell_trailing_stop": "트레일링 스톱",
+}
+# One-line rationale per condition: why buy or sell at this point
+_COND_REASON = {
+    "buy_ma20_pullback": "상승추세 중 MA20 지지선 눌림목 반등 — 저가 진입 기회",
+    "buy_breakout":      "전고점·저항 돌파 + 거래량 증가 — 추세 상승 진입 신호",
+    "buy_rsi_bounce":    "RSI 과매도(30↓)에서 반등 — 단기 낙폭과대 되돌림",
+    "sell_stop_loss":    "평단 대비 손절선 도달 — 추가 하락 리스크 차단",
+    "sell_ma_break":     "주요 이평선(MA50/200) 이탈 — 추세 훼손, 보유 재검토",
+    "sell_take_profit":  "목표 수익 도달 — 이익 실현 구간",
+    "sell_climax":       "거래량 급증 + 윗꼬리(고점 대비 하락 마감) — 분산·소진 의심",
+    "sell_trailing_stop":"보유기간 고점 대비 하락 — 수익 반납 방어(트레일링 스톱)",
+}
+
+
+def format_trade_alert(a: Dict[str, Any]) -> str:
+    kind = _KIND_LABEL.get(a["kind"], a["kind"])
+    cond = _COND_LABEL.get(a["condition"], a["condition"])
+    reason = _COND_REASON.get(a["condition"], "")
+    name = a.get("name") or a["ticker"]
+    price = a.get("price")
+    price_s = f"{int(price):,}원" if price else "-"
+    lines = [f"{kind} 알람", f"<b>{name}</b> ({a['ticker']})", f"조건: {cond}"]
+    if reason:
+        lines.append(f"💡 {reason}")
+    lines.append(f"현재가: {price_s}")
+    return "\n".join(lines)
+
+
+async def send_trade_alerts(alerts: List[Dict[str, Any]]) -> dict:
+    """For each alert, call send_raw once per chat_id (user and wife). Keeps going on failure."""
+    sent = 0
+    all_ok = True
+    chat_ids = [c for c in (TELEGRAM_CHAT_ID, TELEGRAM_WIFE_CHAT_ID) if c]
+    for a in alerts:
+        text = format_trade_alert(a)
+        for cid in chat_ids:
+            try:
+                r = await send_raw(text, chat_id=cid)
+            except Exception as e:
+                logger.warning(f"[telegram_trade] send failed (chat_id={cid}): {e}")
+                all_ok = False
+                continue
+            if r.get("ok"):
+                sent += 1
+            else:
+                all_ok = False
+    return {"sent": sent, "ok": all_ok}
--- a/agent-office/app/scheduler.py
+++ b/agent-office/app/scheduler.py
@@ -4,6 +4,7 @@ from apscheduler.schedulers.asyncio import AsyncIOScheduler

 from .agents import AGENT_REGISTRY
 from .db import delete_old_logs
+from . import node_monitor

 scheduler = AsyncIOScheduler(timezone="Asia/Seoul")

@@ -98,6 +99,9 @@ async def _poll_pipelines():
    if agent:
        await agent.poll_state_changes()

+async def _run_node_health_check():
+    await node_monitor.check_and_alert()
+
 def _cleanup_old_logs():
    n = delete_old_logs(days=90)
    if n:
@@ -142,5 +146,6 @@ def init_scheduler():
    scheduler.add_job(_run_youtube_research,     "cron", hour=9, minute=10, id="youtube_research")
    scheduler.add_job(_send_youtube_weekly_report, "cron", day_of_week="mon", hour=8, minute=0, id="youtube_weekly_report")
    scheduler.add_job(_poll_pipelines, "interval", seconds=30, id="pipeline_poll")
+    scheduler.add_job(_run_node_health_check, "interval", seconds=60, id="node_health_check", replace_existing=True)
    scheduler.add_job(_cleanup_old_logs, "cron", hour=3, minute=0, id="cleanup_old_logs", replace_existing=True)
    scheduler.start()
--- a/agent-office/app/service_proxy.py
+++ b/agent-office/app/service_proxy.py
@@ -111,6 +111,29 @@ async def stock_holdings_brief() -> Dict[str, Any]:
    return resp.json()


+# --- stock watchlist (real-time trade alerts) ---
+
+async def watchlist_add(ticker: str) -> Dict[str, Any]:
+    """Add a ticker to the stock watchlist (POST; idempotently updates if it already exists)."""
+    resp = await _client.post(f"{STOCK_URL}/api/stock/watchlist", json={"ticker": ticker})
+    resp.raise_for_status()
+    return resp.json()
+
+
+async def watchlist_remove(ticker: str) -> Dict[str, Any]:
+    """Remove a ticker from the stock watchlist."""
+    resp = await _client.delete(f"{STOCK_URL}/api/stock/watchlist/{ticker}")
+    resp.raise_for_status()
+    return resp.json()
+
+
+async def watchlist_list() -> Dict[str, Any]:
+    """Fetch the stock watchlist → {"watchlist": [...]}."""
+    resp = await _client.get(f"{STOCK_URL}/api/stock/watchlist")
+    resp.raise_for_status()
+    return resp.json()
+
+
 async def generate_music(payload: dict) -> Dict[str, Any]:
    resp = await _client.post(f"{MUSIC_LAB_URL}/api/music/generate", json=payload)
    resp.raise_for_status()
@@ -352,6 +375,25 @@ async def list_active_pipelines() -> list[dict]:
        return resp.json().get("pipelines", [])


+async def list_failed_pipelines() -> list[dict]:
+    async with httpx.AsyncClient(timeout=10) as client:
+        resp = await client.get(f"{MUSIC_LAB_URL}/api/music/pipeline?status=failed")
+        resp.raise_for_status()
+        data = resp.json()
+        return data if isinstance(data, list) else data.get("items", data.get("pipelines", []))
+
+
+async def pipeline_retry(pid: int) -> dict:
+    async with httpx.AsyncClient(timeout=15) as client:
+        resp = await client.post(f"{MUSIC_LAB_URL}/api/music/pipeline/{pid}/retry")
+        out = {"status_code": resp.status_code}
+        try:
+            out.update(resp.json())
+        except Exception:
+            pass
+        return out
+
+
 async def get_pipeline(pid: int) -> dict:
    async with httpx.AsyncClient(timeout=15) as client:
        resp = await client.get(f"{MUSIC_LAB_URL}/api/music/pipeline/{pid}")
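`list_failed_pipelines` accepts several response shapes from music-lab. The sketch below isolates that normalization as a pure function; `extract_pipelines` is a hypothetical name, and the three shapes are simply the ones the added code checks for, not a documented contract.

```python
from typing import Any, List


def extract_pipelines(data: Any) -> List[dict]:
    """Normalize a pipeline-list response to a plain list.

    Accepts a bare list, {"items": [...]}, or {"pipelines": [...]}.
    Any other dict yields an empty list.
    """
    if isinstance(data, list):
        return data
    return data.get("items", data.get("pipelines", []))
```

Note that a payload that is neither a list nor a dict (for example `null`) would raise `AttributeError` here, just as in the original expression.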
--- a/agent-office/app/telegram/webhook.py
+++ b/agent-office/app/telegram/webhook.py
@@ -1,6 +1,7 @@
 """텔레그램 Webhook 이벤트 처리."""
 from typing import Optional

+from .. import service_proxy
 from ..db import get_telegram_callback, mark_telegram_responded
 from .client import _enabled, api_call

@@ -23,12 +24,43 @@ async def handle_webhook(data: dict, agent_dispatcher=None) -> Optional[dict]:
    if message:
        chat = message.get("chat", {})
        print(f"[TG-WEBHOOK] chat.id={chat.get('id')} type={chat.get('type')} text={message.get('text')!r}", flush=True)
+    if message and message.get("text"):
+        if await handle_watch_command(message):
+            return None
    if message and message.get("text") and agent_dispatcher is not None:
        return await _handle_message(message, agent_dispatcher)

    return None


+async def handle_watch_command(message: dict) -> bool:
+    """Handle the /watch, /unwatch and /watchlist commands by proxying to the stock watchlist API.
+
+    Returns True if the command was handled (including sending the reply), False if the text did not match."""
+    text = (message.get("text") or "").strip()
+    chat_id = message.get("chat", {}).get("id")
+    parts = text.split()
+    cmd = parts[0].lower() if parts else ""
+
+    if cmd == "/watch" and len(parts) >= 2:
+        await service_proxy.watchlist_add(parts[1])
+        reply = f"관심종목 추가: {parts[1]}"
+    elif cmd == "/unwatch" and len(parts) >= 2:
+        await service_proxy.watchlist_remove(parts[1])
+        reply = f"관심종목 삭제: {parts[1]}"
+    elif cmd == "/watchlist":
+        res = await service_proxy.watchlist_list()
+        items = res.get("watchlist", [])
+        reply = "관심종목:\n" + (
+            "\n".join(f"- {w.get('name') or ''} ({w['ticker']})" for w in items) or "(없음)"
+        )
+    else:
+        return False
+
+    await api_call("sendMessage", {"chat_id": chat_id, "text": reply})
+    return True
+
+
 async def _handle_callback(callback_query: dict) -> Optional[dict]:
    """승인/거절 및 realestate 북마크 콜백 처리."""
    callback_id = callback_query.get("data", "")
@@ -43,6 +75,9 @@ async def _handle_callback(callback_query: dict) -> Optional[dict]:
    if callback_id.startswith("issue_"):
        return await _handle_insta_issue(callback_query, callback_id)

+    if callback_id.startswith("ytpub_retry_"):
+        return await _handle_ytpub_retry(callback_query, callback_id)
+
    cb = get_telegram_callback(callback_id)
    if not cb:
        return None
@@ -169,6 +204,30 @@ async def _handle_insta_issue(callback_query: dict, callback_id: str) -> dict:
        return {"ok": False, "error": str(e)}


+async def _handle_ytpub_retry(callback_query: dict, callback_id: str) -> dict:
+    """ytpub_retry_{pipeline_id} callback → proxy to the music-lab pipeline retry."""
+    from .. import service_proxy
+    from .messaging import send_raw
+
+    await api_call(
+        "answerCallbackQuery",
+        {"callback_query_id": callback_query["id"], "text": "재시도 요청 중..."},
+    )
+
+    try:
+        pid = int(callback_id.removeprefix("ytpub_retry_"))
+    except (ValueError, AttributeError):
+        return {"ok": False, "error": "invalid_callback_data"}
+
+    res = await service_proxy.pipeline_retry(pid)
+    sc = res.get("status_code")
+    if sc in (200, 202):
+        await send_raw(text=f"🔄 파이프라인 #{pid} 재개: {res.get('retrying_step', '?')}")
+    else:
+        await send_raw(text=f"⚠️ 재개 불가 (#{pid}): {res.get('detail', sc)}")
+    return {"ok": True}
+
+
 async def _handle_message(message: dict, agent_dispatcher) -> Optional[dict]:
    """슬래시 명령 메시지 처리."""
    from .router import parse_command, resolve_agent_command, HELP_TEXT
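The matching rules of `handle_watch_command` can be separated from the network calls. This is a minimal sketch under that assumption: `parse_watch_command` is a hypothetical helper that mirrors the same split/lowercase logic and returns `None` where the handler would return `False`.

```python
from typing import Optional, Tuple


def parse_watch_command(text: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (command, ticker) for a watch command, or None if not matched."""
    parts = (text or "").strip().split()
    cmd = parts[0].lower() if parts else ""
    if cmd in ("/watch", "/unwatch") and len(parts) >= 2:
        # Only the first argument is used; extra tokens are ignored.
        return cmd, parts[1]
    if cmd == "/watchlist":
        return cmd, None
    # Includes "/watch" with no ticker, which falls through unhandled.
    return None
```

Keeping the parse step pure makes the edge cases (missing ticker, mixed case, empty text) testable without mocking `service_proxy` or `api_call`.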
--- a/agent-office/app/test_db.py
+++ b/agent-office/app/test_db.py
@@ -18,9 +18,11 @@ from app.db import (
 def test_init_and_seed():
    init_db()
    agents = get_all_agents()
-    assert len(agents) == 2, f"Expected 2 agents, got {len(agents)}"
    ids = {a["agent_id"] for a in agents}
-    assert ids == {"stock", "music"}, f"Unexpected agent ids: {ids}"
+    # Verify that the seeded core agents exist. To stay robust as the registry grows (insta/lotto/realestate/youtube, etc.),
+    # assert a subset rather than a fixed count/set (the earlier len==2/{stock,music} assertion had gone stale).
+    assert {"stock", "music"} <= ids, f"core agents missing: {ids}"
+    assert len(agents) >= 2
    print("  [PASS] test_init_and_seed")


--- a/agent-office/requirements.txt
+++ b/agent-office/requirements.txt
@@ -7,3 +7,4 @@ respx>=0.21
 pytest-asyncio>=0.23
 google-api-python-client>=2.100.0
 pytrends>=4.9.2
+redis>=5.0
--- a/agent-office/tests/test_node_monitor.py
+++ b/agent-office/tests/test_node_monitor.py
@@ -0,0 +1,208 @@
+# agent-office/tests/test_node_monitor.py
+import datetime as dt
+import json, pytest
+from app import node_monitor
+import app.node_monitor as nm
+
+class FakeRedis:
+    """Mimics worker heartbeat get + queue llen + scan_iter."""
+    def __init__(self, kv=None, lists=None):
+        self._kv = kv or {}           # key(str) -> bytes
+        self._lists = lists or {}     # key(str) -> length(int)
+    async def get(self, key):
+        return self._kv.get(key)
+    async def llen(self, key):
+        return self._lists.get(key, 0)
+    async def scan_iter(self, match=None):
+        prefix = match.rstrip("*")
+        for k in list(self._lists):
+            if k.startswith(prefix):
+                yield k
+
+def _hb(name, kind, state, ts=None, **extra):
+    """Build a heartbeat payload. ts defaults to the current time (a fresh heartbeat)."""
+    if ts is None:
+        ts = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+    return json.dumps({"name": name, "kind": kind, "state": state, "ts": ts,
+                       "last_job_at": None, "jobs_done": 0, "jobs_failed": 0, **extra}).encode()
+
+@pytest.mark.asyncio
+async def test_alive_worker_healthy_link():
+    r = FakeRedis(kv={"worker:image-render:heartbeat": _hb("image-render","render","idle")})
+    st = await node_monitor.collect_status(redis=r)
+    img = next(w for w in st["workers"] if w["name"] == "image-render")
+    assert img["alive"] is True and img["state"] == "idle"
+    link = next(l for l in st["links"] if l["to"] == "image-render")
+    assert link["status"] == "healthy" and link["type"] == "redis-queue"
+
+@pytest.mark.asyncio
+async def test_missing_heartbeat_is_dead_and_down():
+    r = FakeRedis()  # no heartbeat
+    st = await node_monitor.collect_status(redis=r)
+    img = next(w for w in st["workers"] if w["name"] == "image-render")
+    assert img["alive"] is False
+    link = next(l for l in st["links"] if l["to"] == "image-render")
+    assert link["status"] == "down"
+
+@pytest.mark.asyncio
+async def test_dead_letter_makes_degraded():
+    r = FakeRedis(kv={"worker:video-render:heartbeat": _hb("video-render","render","idle")},
+                  lists={"dead_letter:queue:video-render": 2})
+    st = await node_monitor.collect_status(redis=r)
+    vid = next(w for w in st["workers"] if w["name"] == "video-render")
+    assert vid["dead_letter"] == 2
+    link = next(l for l in st["links"] if l["to"] == "video-render")
+    assert link["status"] == "degraded"
+
+@pytest.mark.asyncio
+async def test_paused_reason_from_watcher():
+    r = FakeRedis(kv={"queue:paused": b"1",
+                      "worker:task-watcher:heartbeat": _hb("task-watcher","watcher","trading",mode="trading")})
+    st = await node_monitor.collect_status(redis=r)
+    assert st["paused"] is True and st["paused_reason"] == "trading"
+
+@pytest.mark.asyncio
+async def test_trader_http_pull_link():
+    r = FakeRedis(kv={"worker:ai_trade:heartbeat": _hb("ai_trade","trader","market_open")})
+    st = await node_monitor.collect_status(redis=r)
+    link = next(l for l in st["links"] if l["from"] == "ai_trade")
+    assert link["type"] == "http-pull" and link["status"] == "healthy"
+
+@pytest.mark.asyncio
+async def test_trade_monitor_registered_and_own_link():
+    """The WSL worker trade-monitor is in the registry, so it is exposed at /nodes, and the link's `from`
+    must be its own name (trade-monitor) rather than a hardcoded ai_trade (to tell multiple traders apart)."""
+    r = FakeRedis(kv={"worker:trade-monitor:heartbeat": _hb("trade-monitor", "trader", "market_open")})
+    st = await node_monitor.collect_status(redis=r)
+    tm = next(w for w in st["workers"] if w["name"] == "trade-monitor")
+    assert tm["alive"] is True and tm["kind"] == "trader"
+    link = next(l for l in st["links"] if l["from"] == "trade-monitor")
+    assert link["type"] == "http-pull" and link["to"] == "nas-stock" and link["status"] == "healthy"
+
+
+@pytest.mark.asyncio
+async def test_paused_no_watcher_heartbeat_fallback_reason():
+    """If paused=True but there is no watcher heartbeat, paused_reason falls back to 'trading'."""
+    r = FakeRedis(kv={"queue:paused": b"1"})  # no watcher heartbeat
+    st = await node_monitor.collect_status(redis=r)
+    assert st["paused"] is True
+    assert st["paused_reason"] == "trading"
+
+@pytest.mark.asyncio
+async def test_processing_count_image_render():
+    """If processing:<queue>:<worker_id> lists exist, their lengths are summed into the processing field."""
+    worker_id = "abc123"
+    proc_key = f"processing:queue:image-render:{worker_id}"
+    r = FakeRedis(
+        kv={"worker:image-render:heartbeat": _hb("image-render", "render", "busy")},
+        lists={proc_key: 3},
+    )
+    st = await node_monitor.collect_status(redis=r)
+    img = next(w for w in st["workers"] if w["name"] == "image-render")
+    assert img["processing"] == 3
+
+@pytest.mark.asyncio
+async def test_llen_exception_returns_redis_ok_false():
+    """If llen raises inside the worker loop, the exception is not propagated and redis_ok=False is returned (blocker regression)."""
+    class BrokenLlenRedis(FakeRedis):
+        async def llen(self, key):
+            raise ConnectionError("Redis 연결 끊김")
+
+    r = BrokenLlenRedis(
+        kv={"worker:music-render:heartbeat": _hb("music-render", "render", "idle")}
+    )
+    st = await node_monitor.collect_status(redis=r)
+    assert st["redis_ok"] is False
+
+
+@pytest.mark.asyncio
+async def test_alert_on_alive_to_dead(monkeypatch):
+    sent = []
+    async def fake_send_raw(text, **kw): sent.append(text); return {"ok": True}
+    monkeypatch.setattr("app.telegram.messaging.send_raw", fake_send_raw)
+    monkeypatch.setattr("app.db.add_log", lambda *a, **k: None)
+    nm._node_state.clear(); nm._dl_notified.clear()
+    alive = {"workers": [{"name":"image-render","alive":True,"dead_letter":0}], "links": []}
+    dead =  {"workers": [{"name":"image-render","alive":False,"dead_letter":0}], "links": []}
+    await nm.check_and_alert(status=alive)   # first observation: no alert
+    assert sent == []
+    await nm.check_and_alert(status=dead)    # alive→dead transition
+    assert any("다운" in t for t in sent)
+
+@pytest.mark.asyncio
+async def test_alert_on_dead_letter_growth(monkeypatch):
+    sent = []
+    async def fake_send_raw(text, **kw): sent.append(text); return {"ok": True}
+    monkeypatch.setattr("app.telegram.messaging.send_raw", fake_send_raw)
+    monkeypatch.setattr("app.db.add_log", lambda *a, **k: None)
+    nm._node_state.clear(); nm._dl_notified.clear()
+    s = {"workers": [{"name":"video-render","alive":True,"dead_letter":2}], "links": []}
+    await nm.check_and_alert(status=s)
+    assert any("dead-letter" in t for t in sent)
+
+@pytest.mark.asyncio
+async def test_dl_notified_not_updated_on_telegram_failure(monkeypatch):
+    """On Telegram failure (ok=False) _dl_notified is not updated, so the next cycle retries."""
+    calls = []
+    async def fake_send_raw(text, **kw):
+        calls.append(text)
+        if len(calls) == 1:
+            return {"ok": False}   # first call: Telegram down
+        return {"ok": True}        # second call: success
+    monkeypatch.setattr("app.telegram.messaging.send_raw", fake_send_raw)
+    monkeypatch.setattr("app.db.add_log", lambda *a, **k: None)
+    nm._node_state.clear(); nm._dl_notified.clear()
+    s = {"workers": [{"name": "video-render", "alive": True, "dead_letter": 2}], "links": []}
+    # First call: Telegram down → ok=False → _dl_notified not updated
+    result1 = await nm.check_and_alert(status=s)
+    assert result1 == []
+    assert nm._dl_notified.get("video-render", 0) == 0
+    # Second call: same dl=2 → condition holds again because _dl_notified was not updated → alert is resent
+    result2 = await nm.check_and_alert(status=s)
+    assert any("dead-letter" in t for t in result2)
+    assert nm._dl_notified.get("video-render") == 2
+
+
+# ── I1: new tests for the staleness check ─────────────────────────────────────
+
+@pytest.mark.asyncio
+async def test_stale_heartbeat_is_dead():
+    """Even when the heartbeat key exists, alive=False if ts is more than 90s old (staleness check)."""
+    stale_ts = (dt.datetime.now(dt.timezone.utc) - dt.timedelta(seconds=300)).strftime(
+        "%Y-%m-%dT%H:%M:%SZ"
+    )
+    r = FakeRedis(kv={"worker:image-render:heartbeat": _hb("image-render", "render", "idle", ts=stale_ts)})
+    st = await node_monitor.collect_status(redis=r)
+    img = next(w for w in st["workers"] if w["name"] == "image-render")
+    assert img["alive"] is False
+    link = next(l for l in st["links"] if l["to"] == "image-render")
+    assert link["status"] == "down"
+
+
+# ── I2: regression test for retry after a failed transition send ──────────────
+
+@pytest.mark.asyncio
+async def test_transition_send_failure_retries_next_cycle(monkeypatch):
+    """If send_raw fails on an alive→dead transition, _node_state is not updated, so the next cycle retries."""
+    calls = []
+    async def fake_send_raw(text, **kw):
+        calls.append(text)
+        if len(calls) == 1:
+            return {"ok": False}   # first call: Telegram down
+        return {"ok": True}        # second call: success
+    monkeypatch.setattr("app.telegram.messaging.send_raw", fake_send_raw)
+    monkeypatch.setattr("app.db.add_log", lambda *a, **k: None)
+    nm._node_state.clear(); nm._dl_notified.clear()
+    alive = {"workers": [{"name": "music-render", "alive": True, "dead_letter": 0}], "links": []}
+    dead  = {"workers": [{"name": "music-render", "alive": False, "dead_letter": 0}], "links": []}
+    # First observation: sets the baseline (no transition)
+    await nm.check_and_alert(status=alive)
+    assert nm._node_state.get("music-render") is True
+    # alive→dead transition, send_raw fails → _node_state not updated
+    result1 = await nm.check_and_alert(status=dead)
+    assert result1 == []                                      # no alert sent
+    assert nm._node_state.get("music-render") is True        # still True
+    # Second cycle: same dead status, send_raw succeeds → alert sent
+    result2 = await nm.check_and_alert(status=dead)
+    assert any("다운" in t for t in result2)
+    assert nm._node_state.get("music-render") is False       # now updated
--- a/agent-office/tests/test_nodes_endpoint.py
+++ b/agent-office/tests/test_nodes_endpoint.py
@@ -0,0 +1,18 @@
+# agent-office/tests/test_nodes_endpoint.py
+import pytest
+from fastapi.testclient import TestClient
+
+@pytest.fixture
+def client(monkeypatch):
+    from app import main
+    async def fake_collect(redis=None):
+        return {"redis_ok": True, "paused": False, "paused_reason": None,
+                "generated_at": "2026-06-29T00:00:00Z", "workers": [], "links": []}
+    monkeypatch.setattr("app.node_monitor.collect_status", fake_collect)
+    return TestClient(main.app)
+
+def test_nodes_endpoint_returns_contract(client):
+    resp = client.get("/api/agent-office/nodes")
+    assert resp.status_code == 200
+    body = resp.json()
+    assert set(["redis_ok","paused","workers","links"]).issubset(body)
--- a/agent-office/tests/test_pipeline_polling.py
+++ b/agent-office/tests/test_pipeline_polling.py
@@ -40,6 +40,9 @@ async def test_poll_notifies_once_per_state():
    with patch(
        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
        new=AsyncMock(return_value=pipelines),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=AsyncMock(return_value=[]),
    ), patch(
        "app.agents.youtube_publisher.send_raw",
        new=AsyncMock(return_value={"ok": True, "message_id": 99}),
@@ -63,6 +66,8 @@ async def test_poll_renotifies_on_reject_regen(monkeypatch):
                      "track_title": "Test", "feedback_count_per_step": {"cover": 1}}]
    list_mock = AsyncMock(side_effect=[pipelines_v1, pipelines_v2])
    with patch("app.agents.youtube_publisher.service_proxy.list_active_pipelines", list_mock), \
+         patch("app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+                new=AsyncMock(return_value=[])), \
         patch("app.agents.youtube_publisher.send_raw",
                new=AsyncMock(return_value={"ok": True, "message_id": 99})), \
         patch("app.agents.youtube_publisher.service_proxy.save_pipeline_telegram_msg",
@@ -83,7 +88,7 @@ async def test_on_telegram_reply_approve_calls_feedback():
        new=AsyncMock(),
    ) as mock_fb, patch(
        "app.agents.youtube_publisher.send_raw",
-        new=AsyncMock(),
+        new=AsyncMock(return_value={"ok": True, "message_id": 1}),
    ):
        a = YoutubePublisherAgent()
        await a.on_telegram_reply(pipeline_id=42, step="cover", user_text="승인")
@@ -99,7 +104,7 @@ async def test_on_telegram_reply_reject_with_feedback():
        new=AsyncMock(),
    ) as mock_fb, patch(
        "app.agents.youtube_publisher.send_raw",
-        new=AsyncMock(),
+        new=AsyncMock(return_value={"ok": True, "message_id": 1}),
    ):
        a = YoutubePublisherAgent()
        await a.on_telegram_reply(pipeline_id=43, step="meta", user_text="반려, 제목 짧게")
--- a/agent-office/tests/test_trade_alert_notify.py
+++ b/agent-office/tests/test_trade_alert_notify.py
@@ -0,0 +1,67 @@
+import os
+import sys
+import tempfile
+
+_fd, _TMP = tempfile.mkstemp(suffix=".db")
+os.close(_fd)
+os.unlink(_TMP)
+os.environ["AGENT_OFFICE_DB_PATH"] = _TMP
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import pytest
+from unittest.mock import AsyncMock, patch
+
+
+@pytest.fixture(autouse=True)
+def _init_db(monkeypatch):
+    import gc
+    gc.collect()
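+    # config.DB_PATH is fixed once at first import, so when run together with other test files
+    # db may use a path other than this file's _TMP. Force db.DB_PATH to this file's own path
+    # to deterministically block leakage of persistent tables between tests.
+    import app.db as _db
+    monkeypatch.setattr(_db, "DB_PATH", _TMP)
+    # The WAL sidecars (-wal/-shm) must be removed too, or persistent state is left behind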
+    for suffix in ("", "-wal", "-shm"):
+        p = _TMP + suffix
+        if os.path.exists(p):
+            os.remove(p)
+    _db.init_db()
+    yield
+    gc.collect()
+
+
+@pytest.mark.asyncio
+async def test_send_trade_alerts_to_user_and_wife():
+    from app.notifiers import telegram_trade
+    alerts = [{"ticker": "005930", "name": "삼성전자", "kind": "buy",
+               "condition": "buy_breakout", "price": 71500, "detail": {}}]
+    with patch("app.notifiers.telegram_trade.send_raw",
+               new=AsyncMock(return_value={"ok": True})) as m, \
+         patch("app.notifiers.telegram_trade.TELEGRAM_CHAT_ID", "U"), \
+         patch("app.notifiers.telegram_trade.TELEGRAM_WIFE_CHAT_ID", "W"):
+        res = await telegram_trade.send_trade_alerts(alerts)
+    assert res["ok"] is True
+    chat_ids = {c.kwargs.get("chat_id") for c in m.await_args_list}
+    assert chat_ids == {"U", "W"}      # sent to both
+
+
+@pytest.mark.asyncio
+async def test_format_trade_alert_has_direction():
+    from app.notifiers.telegram_trade import format_trade_alert
+    txt = format_trade_alert({"ticker": "005930", "name": "삼성전자", "kind": "sell",
+                              "condition": "sell_stop_loss", "price": 60000, "detail": {}})
+    assert "매도" in txt and "삼성전자" in txt
+
+
+def test_format_trade_alert_includes_reason_line():
+    """The message includes a one-line, per-condition reason (💡) explaining why to buy or sell."""
+    from app.notifiers.telegram_trade import format_trade_alert
+    for cond in ("buy_breakout", "sell_stop_loss", "sell_trailing_stop"):
+        txt = format_trade_alert({"ticker": "005930", "name": "삼성전자", "kind": cond.split("_")[0],
+                                  "condition": cond, "price": 60000, "detail": {}})
+        assert "💡" in txt, f"{cond}: missing one-line reason"
+        # The reason line must carry an actual explanation, not just repeat the condition label
+        reason_line = next(l for l in txt.split("\n") if l.startswith("💡"))
+        assert len(reason_line) > 6
--- a/agent-office/tests/test_watch_commands.py
+++ b/agent-office/tests/test_watch_commands.py
@@ -0,0 +1,93 @@
+import os
+import sys
+import tempfile
+
+_fd, _TMP = tempfile.mkstemp(suffix=".db")
+os.close(_fd)
+os.unlink(_TMP)
+os.environ["AGENT_OFFICE_DB_PATH"] = _TMP
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import pytest
+from unittest.mock import AsyncMock, patch
+
+
+@pytest.fixture(autouse=True)
+def _init_db(monkeypatch):
+    import gc
+    gc.collect()
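+    # config.DB_PATH is fixed once on first import, so when run together with other test
+    # files, db may use a path other than this file's _TMP. Force db.DB_PATH to this file's
+    # path to deterministically prevent persistent tables from leaking between tests.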
+    import app.db as _db
+    monkeypatch.setattr(_db, "DB_PATH", _TMP)
+    for suffix in ("", "-wal", "-shm"):
+        p = _TMP + suffix
+        if os.path.exists(p):
+            os.remove(p)
+    _db.init_db()
+    yield
+    gc.collect()
+
+
+@pytest.mark.asyncio
+async def test_watch_command_calls_add():
+    from app.telegram import webhook
+    msg = {"chat": {"id": 1}, "text": "/watch 005930"}
+    with patch("app.telegram.webhook.service_proxy.watchlist_add",
+               new=AsyncMock(return_value={"ok": True})) as m, \
+         patch("app.telegram.webhook.api_call", new=AsyncMock(return_value={"ok": True})):
+        handled = await webhook.handle_watch_command(msg)
+    assert handled is True
+    m.assert_awaited_once_with("005930")
+
+
+@pytest.mark.asyncio
+async def test_non_watch_text_ignored():
+    from app.telegram import webhook
+    msg = {"chat": {"id": 1}, "text": "안녕"}
+    assert await webhook.handle_watch_command(msg) is False
+
+
+@pytest.mark.asyncio
+async def test_unwatch_command_calls_remove():
+    from app.telegram import webhook
+    msg = {"chat": {"id": 1}, "text": "/unwatch 005930"}
+    with patch("app.telegram.webhook.service_proxy.watchlist_remove",
+               new=AsyncMock(return_value={"ok": True})) as m, \
+         patch("app.telegram.webhook.api_call", new=AsyncMock(return_value={"ok": True})) as sent:
+        handled = await webhook.handle_watch_command(msg)
+    assert handled is True
+    m.assert_awaited_once_with("005930")
+    sent.assert_awaited_once()
+
+
+@pytest.mark.asyncio
+async def test_watchlist_command_calls_list_and_formats_items():
+    from app.telegram import webhook
+    msg = {"chat": {"id": 1}, "text": "/watchlist"}
+    items = {"watchlist": [{"ticker": "005930", "name": "삼성전자"}]}
+    with patch("app.telegram.webhook.service_proxy.watchlist_list",
+               new=AsyncMock(return_value=items)) as m, \
+         patch("app.telegram.webhook.api_call", new=AsyncMock(return_value={"ok": True})) as sent:
+        handled = await webhook.handle_watch_command(msg)
+    assert handled is True
+    m.assert_awaited_once_with()
+    text = sent.await_args.args[1]["text"]
+    assert "005930" in text and "삼성전자" in text
+
+
+@pytest.mark.asyncio
+async def test_watch_command_reaches_handle_webhook_before_slash_dispatch():
+    """handle_webhook must intercept /watch before agent_dispatcher is called."""
+    from app.telegram import webhook
+    data = {"message": {"chat": {"id": 1}, "text": "/watch 005930"}}
+    dispatcher = AsyncMock(side_effect=AssertionError("agent_dispatcher must not be called"))
+    with patch("app.telegram.webhook.service_proxy.watchlist_add",
+               new=AsyncMock(return_value={"ok": True})) as m, \
+         patch("app.telegram.webhook.api_call", new=AsyncMock(return_value={"ok": True})):
+        result = await webhook.handle_webhook(data, agent_dispatcher=dispatcher)
+    assert result is None
+    m.assert_awaited_once_with("005930")
+    dispatcher.assert_not_awaited()
--- a/agent-office/tests/test_youtube_publisher_retry.py
+++ b/agent-office/tests/test_youtube_publisher_retry.py
@@ -0,0 +1,287 @@
+import os
+import sys
+import tempfile
+
+_fd, _TMP = tempfile.mkstemp(suffix=".db")
+os.close(_fd)
+os.unlink(_TMP)
+os.environ["AGENT_OFFICE_DB_PATH"] = _TMP
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import pytest
+from unittest.mock import AsyncMock, patch
+
+
+@pytest.fixture(autouse=True)
+def _init_db(monkeypatch):
+    import gc
+    gc.collect()
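+    # config.DB_PATH is fixed once on first import, so when run together with other test
+    # files, db may use a path other than this file's _TMP. Force db.DB_PATH to this file's path to
+    # deterministically prevent persistent tables (notified_failed_pipelines etc.) from leaking between tests.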
+    import app.db as _db
+    monkeypatch.setattr(_db, "DB_PATH", _TMP)
+    # Also remove the WAL sidecars (-wal/-shm) so no persistent state is left behind
+    for suffix in ("", "-wal", "-shm"):
+        p = _TMP + suffix
+        if os.path.exists(p):
+            os.remove(p)
+    _db.init_db()
+    yield
+    gc.collect()
+
+
+@pytest.mark.asyncio
+async def test_failed_pipeline_notified_with_retry_button():
+    from app.agents.youtube_publisher import YoutubePublisherAgent
+
+    agent = YoutubePublisherAgent()
+    failed_pipeline = {
+        "id": 7,
+        "state": "failed",
+        "failed_reason": "video: boom",
+        "track_title": "T",
+    }
+    sent = AsyncMock(return_value={"ok": True, "message_id": 1})
+
+    with patch(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        new=AsyncMock(return_value=[]),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=AsyncMock(return_value=[failed_pipeline]),
+    ), patch(
+        "app.agents.youtube_publisher.send_raw",
+        new=sent,
+    ):
+        await agent.poll_state_changes()
+
+    assert sent.await_count == 1
+    _, kwargs = sent.await_args
+    assert "실패" in (kwargs.get("text") or "")
+    assert kwargs["reply_markup"]["inline_keyboard"][0][0]["callback_data"] == "ytpub_retry_7"
+
+
+@pytest.mark.asyncio
+async def test_failed_pipeline_no_duplicate_notification():
+    """The same failed pipeline is not notified again on the second poll."""
+    from app.agents.youtube_publisher import YoutubePublisherAgent
+
+    agent = YoutubePublisherAgent()
+    failed_pipeline = {
+        "id": 7,
+        "state": "failed",
+        "failed_reason": "video: boom",
+        "track_title": "T",
+    }
+    sent = AsyncMock(return_value={"ok": True, "message_id": 1})
+
+    with patch(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        new=AsyncMock(return_value=[]),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=AsyncMock(return_value=[failed_pipeline]),
+    ), patch(
+        "app.agents.youtube_publisher.send_raw",
+        new=sent,
+    ):
+        await agent.poll_state_changes()
+        await agent.poll_state_changes()
+
+    # Dedup: notify only once for the same failed pipeline
+    assert sent.await_count == 1
+
+
+@pytest.mark.asyncio
+async def test_failed_pipeline_renotify_after_recovery():
+    """A pipeline that left the failed state and then fails again is notified again."""
+    from app.agents.youtube_publisher import YoutubePublisherAgent
+
+    agent = YoutubePublisherAgent()
+    failed_pipeline = {
+        "id": 7,
+        "state": "failed",
+        "failed_reason": "video: boom",
+        "track_title": "T",
+    }
+    sent = AsyncMock(return_value={"ok": True, "message_id": 1})
+
+    # First poll: failed pipeline present → notify
+    with patch(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        new=AsyncMock(return_value=[]),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=AsyncMock(return_value=[failed_pipeline]),
+    ), patch(
+        "app.agents.youtube_publisher.send_raw",
+        new=sent,
+    ):
+        await agent.poll_state_changes()
+
+    assert sent.await_count == 1
+
+    # Second poll: gone from the failed list (resumed) → removed from _notified_failed
+    with patch(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        new=AsyncMock(return_value=[]),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=AsyncMock(return_value=[]),
+    ), patch(
+        "app.agents.youtube_publisher.send_raw",
+        new=sent,
+    ):
+        await agent.poll_state_changes()
+
+    assert sent.await_count == 1  # no additional notification yet
+
+    # Third poll: failed again → may be notified again
+    with patch(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        new=AsyncMock(return_value=[]),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=AsyncMock(return_value=[failed_pipeline]),
+    ), patch(
+        "app.agents.youtube_publisher.send_raw",
+        new=sent,
+    ):
+        await agent.poll_state_changes()
+
+    assert sent.await_count == 2  # re-notified
+
+
+@pytest.mark.asyncio
+async def test_handle_ytpub_retry_calls_proxy():
+    from app import service_proxy
+    from app.telegram import webhook
+
+    retry = AsyncMock(return_value={"status_code": 202, "ok": True, "retrying_step": "video"})
+    fake_send = AsyncMock(return_value={"ok": True})
+    fake_api_call = AsyncMock(return_value={"ok": True})
+
+    with patch.object(service_proxy, "pipeline_retry", retry), \
+         patch("app.telegram.messaging.send_raw", fake_send), \
+         patch("app.telegram.webhook.api_call", fake_api_call):
+        res = await webhook._handle_ytpub_retry({"id": 1}, "ytpub_retry_7")
+
+    retry.assert_awaited_once_with(7)
+    assert res["ok"] is True
+
+
+@pytest.mark.asyncio
+async def test_handle_ytpub_retry_invalid_data():
+    from app.telegram import webhook
+
+    fake_send = AsyncMock(return_value={"ok": True})
+    fake_api_call = AsyncMock(return_value={"ok": True})
+
+    with patch("app.telegram.messaging.send_raw", fake_send), \
+         patch("app.telegram.webhook.api_call", fake_api_call):
+        res = await webhook._handle_ytpub_retry({"id": 1}, "ytpub_retry_abc")
+
+    assert res["ok"] is False
+
+
+@pytest.mark.asyncio
+async def test_failed_poll_exception_is_silent():
+    """If list_failed_pipelines raises, the poll moves on silently (active notifications are unaffected)."""
+    from app.agents.youtube_publisher import YoutubePublisherAgent
+
+    agent = YoutubePublisherAgent()
+    active_pipeline = {
+        "id": 1,
+        "state": "cover_pending",
+        "cover_url": "/x.jpg",
+        "track_title": "Track",
+        "feedback_count_per_step": {},
+    }
+    sent = AsyncMock(return_value={"ok": True, "message_id": 1})
+
+    with patch(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        new=AsyncMock(return_value=[active_pipeline]),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=AsyncMock(side_effect=Exception("network error")),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.save_pipeline_telegram_msg",
+        new=AsyncMock(),
+    ), patch(
+        "app.agents.youtube_publisher.send_raw",
+        new=sent,
+    ):
+        await agent.poll_state_changes()
+
+    # The active notification is still sent
+    assert sent.await_count == 1
+
+
+@pytest.mark.asyncio
+async def test_failed_notification_persists_across_restart():
+    """Even after a container restart (a new agent instance), an already-notified failed pipeline is not notified again."""
+    from app.agents.youtube_publisher import YoutubePublisherAgent
+
+    failed_pipeline = {
+        "id": 3,
+        "state": "failed",
+        "failed_reason": "video: timeout",
+        "track_title": "beat music v2",
+    }
+    sent = AsyncMock(return_value={"ok": True, "message_id": 1})
+
+    with patch(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        new=AsyncMock(return_value=[]),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=AsyncMock(return_value=[failed_pipeline]),
+    ), patch(
+        "app.agents.youtube_publisher.send_raw",
+        new=sent,
+    ):
+        agent1 = YoutubePublisherAgent()
+        await agent1.poll_state_changes()
+        # Simulate a container restart: a brand-new instance (in-memory state lost)
+        agent2 = YoutubePublisherAgent()
+        await agent2.poll_state_changes()
+
+    # The DB ledger prevents duplicates across restarts → notified only once
+    assert sent.await_count == 1
+
+
+@pytest.mark.asyncio
+async def test_transient_failed_poll_keeps_ledger():
+    """If the failed poll transiently raises, the ledger is not cleared, so the next poll does not re-notify."""
+    from app.agents.youtube_publisher import YoutubePublisherAgent
+
+    failed_pipeline = {
+        "id": 3,
+        "state": "failed",
+        "failed_reason": "video: timeout",
+        "track_title": "beat music v2",
+    }
+    list_failed = AsyncMock(
+        side_effect=[[failed_pipeline], Exception("boom"), [failed_pipeline]]
+    )
+    sent = AsyncMock(return_value={"ok": True, "message_id": 1})
+
+    with patch(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        new=AsyncMock(return_value=[]),
+    ), patch(
+        "app.agents.youtube_publisher.service_proxy.list_failed_pipelines",
+        new=list_failed,
+    ), patch(
+        "app.agents.youtube_publisher.send_raw",
+        new=sent,
+    ):
+        agent = YoutubePublisherAgent()
+        await agent.poll_state_changes()  # first notification for #3
+        await agent.poll_state_changes()  # exception → ledger must be kept (no premature cleanup)
+        await agent.poll_state_changes()  # #3 still failed → must not re-notify
+
+    assert sent.await_count == 1
--- a/co-gahusb/.gitignore
+++ b/co-gahusb/.gitignore
@@ -0,0 +1,3 @@
+.venv/
+__pycache__/
+*.pyc
--- a/co-gahusb/CLIENT_SETUP.md
+++ b/co-gahusb/CLIENT_SETUP.md
@@ -0,0 +1,19 @@
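+# co-gahusb client setup
+
+## Common
+1. Set the `CO_BUS_KEY` environment variable on each machine (same value as in the server's `.env`).
+2. Register the co-gahusb HTTP MCP server in `.mcp.json` at the root of that repo (see this repo's example).
+3. Follow the `/loop` polling convention in the role block of CLAUDE.md.
+
+## web-ai (another machine)
+Create the `.mcp.json` below at the repo root on the web-ai machine; role = **AI**: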
+```json
+{ "mcpServers": { "co-gahusb": {
+  "type": "http",
+  "url": "https://gahusb.synology.me/api/co/mcp",
+  "headers": { "Authorization": "Bearer ${CO_BUS_KEY}" } } } }
+```
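+Add a role block to the web-ai CLAUDE.md (role="AI", ownership = web-ai repo, same lock convention).
+
+## Producer (orchestrator session)
+Coordinates without a repo of its own: monitors all activity with `team_log()`, distributes work with `create_task`, and serializes cross-cutting work with `acquire_lock`.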
--- a/co-gahusb/Dockerfile
+++ b/co-gahusb/Dockerfile
@@ -0,0 +1,12 @@
+FROM python:3.12-slim-bookworm
+ENV PYTHONUNBUFFERED=1
+
+WORKDIR /app
+
+COPY requirements.txt .
+RUN pip install --no-cache-dir --timeout 600 --retries 5 -r requirements.txt
+
+COPY . .
+
+EXPOSE 8000
+CMD ["uvicorn", "app.server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
--- a/co-gahusb/app/__init__.py
+++ b/co-gahusb/app/__init__.py
--- a/co-gahusb/app/config.py
+++ b/co-gahusb/app/config.py
@@ -0,0 +1,21 @@
+# co-gahusb/app/config.py
+import os
+
+REDIS_URL = os.environ.get("REDIS_URL", "redis://redis:6379")
+CO_BUS_KEY = os.environ.get("CO_BUS_KEY", "")
+
+# Collaboration roles (1:1 with sessions)
+ROLES = ("FE", "BE", "AI", "Producer")
+
+# Cross-cutting resources covered by advisory locks (other names can be locked too; these are the ones named by convention)
+LOCKABLE_RESOURCES = (
+    "nas-deploy",
+    "stock-db-schema",
+    "lotto-db-schema",
+    "memory-mirror",
+    "nginx-conf",
+    "compose",
+)
+
+DEFAULT_LOCK_TTL = 300
+TEAM_LOG_MAXLEN = 500
--- a/co-gahusb/app/locks.py
+++ b/co-gahusb/app/locks.py
@@ -0,0 +1,66 @@
+# co-gahusb/app/locks.py
+from redis.exceptions import WatchError
+
+LOCK_PREFIX = "co:lock:"
+
+
+async def acquire_lock(r, resource, role, ttl_sec=300):
+    key = LOCK_PREFIX + resource
+    ok = await r.set(key, role, nx=True, ex=ttl_sec)
+    if ok:
+        return {"acquired": True}
+    held_by = await r.get(key)
+    ttl = await r.ttl(key)
+    return {"acquired": False, "held_by": held_by, "ttl_remaining": max(ttl, 0)}
+
+
+async def release_lock(r, resource, role):
+    key = LOCK_PREFIX + resource
+    async with r.pipeline() as pipe:
+        while True:
+            try:
+                await pipe.watch(key)
+                owner = await pipe.get(key)
+                if owner != role:
+                    await pipe.unwatch()
+                    return {"released": False, "held_by": owner}
+                pipe.multi()
+                pipe.delete(key)
+                await pipe.execute()
+                return {"released": True}
+            except WatchError:
+                continue
+
+
+async def heartbeat_lock(r, resource, role, ttl_sec=300):
+    key = LOCK_PREFIX + resource
+    async with r.pipeline() as pipe:
+        while True:
+            try:
+                await pipe.watch(key)
+                owner = await pipe.get(key)
+                if owner != role:
+                    await pipe.unwatch()
+                    return {"renewed": False, "held_by": owner}
+                pipe.multi()
+                pipe.expire(key, ttl_sec)
+                await pipe.execute()
+                return {"renewed": True}
+            except WatchError:
+                continue
+
+
+async def list_locks(r):
+    keys = await r.keys(LOCK_PREFIX + "*")
+    out = []
+    for key in keys:
+        held_by = await r.get(key)
+        if held_by is None:
+            continue
+        ttl = await r.ttl(key)
+        out.append({
+            "resource": key[len(LOCK_PREFIX):],
+            "held_by": held_by,
+            "ttl_remaining": max(ttl, 0),
+        })
+    return {"locks": out}
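Reviewer note on `locks.py`: the semantics above (first writer wins via `SET NX EX`, only the owner may release, an expired lock becomes acquirable again) can be modelled without Redis. The sketch below is a hypothetical in-memory stand-in, not part of this change; the class name `InMemoryLocks` and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class InMemoryLocks:
    """Hypothetical stand-in for Redis: maps resource -> (owner, expiry time)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._held = {}

    def _live_owner(self, resource):
        # Return the current owner, dropping the entry if its TTL has passed.
        entry = self._held.get(resource)
        if entry is None:
            return None
        owner, expires_at = entry
        if self._clock() >= expires_at:
            del self._held[resource]
            return None
        return owner

    def acquire(self, resource, role, ttl_sec=300):
        # Mirrors SET key role NX EX ttl: succeeds only when nobody holds the key.
        owner = self._live_owner(resource)
        if owner is not None:
            return {"acquired": False, "held_by": owner}
        self._held[resource] = (role, self._clock() + ttl_sec)
        return {"acquired": True}

    def release(self, resource, role):
        # Only the recorded owner may delete the lock.
        owner = self._live_owner(resource)
        if owner != role:
            return {"released": False, "held_by": owner}
        del self._held[resource]
        return {"released": True}
```

As in the Redis version, a second `acquire` by the same role is also refused while the lock is live; renewing is the job of a separate heartbeat call.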
--- a/co-gahusb/app/server.py
+++ b/co-gahusb/app/server.py
@@ -0,0 +1,138 @@
+# co-gahusb/app/server.py
+import logging
+
+import redis.asyncio as aioredis
+from mcp.server.fastmcp import FastMCP
+from mcp.server.transport_security import TransportSecuritySettings
+from starlette.applications import Starlette
+from starlette.middleware import Middleware
+from starlette.middleware.base import BaseHTTPMiddleware
+from starlette.responses import JSONResponse
+from starlette.routing import Mount, Route
+
+from app import config, locks, store
+
+log = logging.getLogger("co-gahusb")
+_auth_failed_logged = False
+
+_redis = aioredis.from_url(config.REDIS_URL, decode_responses=True)
+
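+# DNS-rebinding protection is disabled: the real security boundary is Bearer auth in front at nginx (401 before reaching MCP).
+# With a remote HTTPS + static-key model, a Host allowlist adds ~0 security value and breaks again whenever the domain changes.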
+mcp = FastMCP(
+    "co-gahusb",
+    transport_security=TransportSecuritySettings(enable_dns_rebinding_protection=False),
+)
+
+
+# ---- Messages ----
+@mcp.tool()
+async def post_message(from_role: str, to_role: str, body: str, thread_id: str = "") -> dict:
+    """Send a message to another role's inbox."""
+    res = await store.post_message(_redis, from_role, to_role, body, thread_id or None)
+    await store.log_event(_redis, "message", f"{from_role}→{to_role}: {body[:60]}")
+    return res
+
+
+@mcp.tool()
+async def read_inbox(role: str, after_id: int = 0, mark_read: bool = False) -> dict:
+    """Read my role's inbox using a cursor."""
+    return await store.read_inbox(_redis, role, after_id, mark_read)
+
+
+# ---- Tasks ----
+@mcp.tool()
+async def create_task(title: str, assignee_role: str, created_by: str, detail: str = "") -> dict:
+    """Create a task and assign it to a specific role."""
+    res = await store.create_task(_redis, title, assignee_role, created_by, detail or None)
+    await store.log_event(_redis, "task", f"{created_by} created '{title}' → {assignee_role}")
+    return res
+
+
+@mcp.tool()
+async def claim_task(task_id: int, role: str) -> dict:
+    """Claim an open task (set it to in_progress). Rejected if already claimed."""
+    res = await store.claim_task(_redis, task_id, role)
+    if res.get("ok"):
+        await store.log_event(_redis, "task", f"{role} claimed task#{task_id}")
+    return res
+
+
+@mcp.tool()
+async def update_task(task_id: int, status: str, role: str, note: str = "") -> dict:
+    """Update a task's status (open/in_progress/blocked/done)."""
+    res = await store.update_task(_redis, task_id, status, role, note or None)
+    await store.log_event(_redis, "task", f"{role} set task#{task_id} → {status}")
+    return res
+
+
+@mcp.tool()
+async def list_tasks(status: str = "", assignee_role: str = "") -> dict:
+    """List tasks (filter by status/assignee)."""
+    return await store.list_tasks(_redis, status or None, assignee_role or None)
+
+
+# ---- Locks ----
+@mcp.tool()
+async def acquire_lock(resource: str, role: str, ttl_sec: int = config.DEFAULT_LOCK_TTL) -> dict:
+    """Acquire an advisory lock before changing a shared resource. Returns acquired=false if it is held."""
+    res = await locks.acquire_lock(_redis, resource, role, ttl_sec)
+    if res.get("acquired"):
+        await store.log_event(_redis, "lock", f"{role} acquired {resource}")
+    return res
+
+
+@mcp.tool()
+async def release_lock(resource: str, role: str) -> dict:
+    """Release a lock you own."""
+    res = await locks.release_lock(_redis, resource, role)
+    if res.get("released"):
+        await store.log_event(_redis, "lock", f"{role} released {resource}")
+    return res
+
+
+@mcp.tool()
+async def heartbeat_lock(resource: str, role: str, ttl_sec: int = config.DEFAULT_LOCK_TTL) -> dict:
+    """Renew a lock's TTL during long-running work (owner only)."""
+    return await locks.heartbeat_lock(_redis, resource, role, ttl_sec)
+
+
+@mcp.tool()
+async def list_locks() -> dict:
+    """List all locks currently held."""
+    return await locks.list_locks(_redis)
+
+
+# ---- Visibility ----
+@mcp.tool()
+async def team_log(after_id: int = 0) -> dict:
+    """Get the team-wide feed of recent activity (messages, tasks, locks)."""
+    return await store.read_team_log(_redis, after_id)
+
+
+# ---- Bearer auth middleware ----
+class BearerAuth(BaseHTTPMiddleware):
+    async def dispatch(self, request, call_next):
+        global _auth_failed_logged
+        if request.url.path.startswith("/health"):
+            return await call_next(request)
+        expected = f"Bearer {config.CO_BUS_KEY}"
+        if not config.CO_BUS_KEY or request.headers.get("authorization") != expected:
+            if not _auth_failed_logged:
+                log.error("co-gahusb auth failed (further identical logs suppressed)")
+                _auth_failed_logged = True
+            return JSONResponse({"error": "unauthorized"}, status_code=401)
+        return await call_next(request)
+
+
+async def _health(request):
+    return JSONResponse({"status": "ok"})
+
+
+_mcp_app = mcp.streamable_http_app()
+
+app = Starlette(
+    routes=[Route("/health", _health), Mount("/", app=_mcp_app)],
+    middleware=[Middleware(BearerAuth)],
+    lifespan=_mcp_app.router.lifespan_context,
+)
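Reviewer note on `BearerAuth`: the middleware compares the `Authorization` header to the expected value with `!=`. A possible hardening, sketched here under the assumption that the same "empty key rejects everything" behaviour should be kept, is a constant-time comparison with the standard library's `hmac.compare_digest`. The helper name `is_authorized` is hypothetical and does not exist in this change.

```python
import hmac


def is_authorized(authorization_header, expected_key):
    """Return True only if the header is exactly 'Bearer <expected_key>'."""
    # An unset key or a missing header rejects the request, as in the middleware.
    if not expected_key or not authorization_header:
        return False
    expected = f"Bearer {expected_key}"
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(authorization_header.encode(), expected.encode())
```

Inside `dispatch`, the existing condition could then become `if not is_authorized(request.headers.get("authorization"), config.CO_BUS_KEY):`, leaving the rest of the flow unchanged.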
--- a/co-gahusb/app/store.py
+++ b/co-gahusb/app/store.py
@@ -0,0 +1,157 @@
+# co-gahusb/app/store.py
+import json
+import time
+
+from app.config import TEAM_LOG_MAXLEN
+
+MSG_SEQ = "co:msgseq"
+INBOX_PREFIX = "co:inbox:"      # list of message ids per role
+MSG_PREFIX = "co:msg:"          # hash per message
+READ_PREFIX = "co:read:"        # last-read cursor per role
+
+
+def _now_iso():
+    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+
+
+async def post_message(r, from_role, to_role, body, thread_id=None):
+    mid = await r.incr(MSG_SEQ)
+    payload = {
+        "id": str(mid),
+        "from_role": from_role,
+        "to_role": to_role,
+        "body": body,
+        "thread_id": thread_id or "",
+        "ts": _now_iso(),
+    }
+    await r.set(MSG_PREFIX + str(mid), json.dumps(payload))
+    await r.rpush(INBOX_PREFIX + to_role, mid)
+    return {"message_id": mid}
+
+
+async def read_inbox(r, role, after_id=0, mark_read=False):
+    ids = await r.lrange(INBOX_PREFIX + role, 0, -1)
+    ids = [int(x) for x in ids if int(x) > int(after_id)]
+    messages = []
+    for mid in ids:
+        raw = await r.get(MSG_PREFIX + str(mid))
+        if raw:
+            d = json.loads(raw)
+            d["id"] = int(d["id"])
+            messages.append(d)
+    cursor = ids[-1] if ids else int(after_id)
+    if mark_read and ids:
+        await r.set(READ_PREFIX + role, cursor)
+    return {"messages": messages, "cursor": cursor}
+
+
+TASK_SEQ = "co:taskseq"
+TASK_PREFIX = "co:task:"        # hash per task
+TASK_SET = "co:tasks"           # set of task ids
+
+VALID_STATUS = ("open", "in_progress", "blocked", "done")
+
+
+async def create_task(r, title, assignee_role, created_by, detail=None):
+    tid = await r.incr(TASK_SEQ)
+    task = {
+        "id": str(tid),
+        "title": title,
+        "assignee_role": assignee_role,
+        "status": "open",
+        "detail": detail or "",
+        "created_by": created_by,
+        "note": "",
+        "ts": _now_iso(),
+    }
+    await r.hset(TASK_PREFIX + str(tid), mapping=task)
+    await r.sadd(TASK_SET, tid)
+    return {"task_id": tid}
+
+
+async def _get_task(r, task_id):
+    d = await r.hgetall(TASK_PREFIX + str(task_id))
+    if not d:
+        return None
+    d["id"] = int(d["id"])
+    return d
+
+
+async def claim_task(r, task_id, role):
+    key = TASK_PREFIX + str(task_id)
+    async with r.pipeline() as pipe:
+        while True:
+            try:
+                await pipe.watch(key)
+                status = await pipe.hget(key, "status")
+                if status is None:
+                    await pipe.unwatch()
+                    return {"ok": False, "error": "not_found"}
+                if status != "open":
+                    held = await pipe.hget(key, "assignee_role")
+                    await pipe.unwatch()
+                    return {"ok": False, "held_by": held}
+                pipe.multi()
+                pipe.hset(key, mapping={"status": "in_progress", "assignee_role": role})
+                await pipe.execute()
+                return {"ok": True, "task": await _get_task(r, task_id)}
+            except Exception as e:
+                from redis.exceptions import WatchError
+                if isinstance(e, WatchError):
+                    continue
+                raise
+
+
+async def update_task(r, task_id, status, role, note=None):
+    if status not in VALID_STATUS:
+        raise ValueError(f"invalid status: {status}")
+    key = TASK_PREFIX + str(task_id)
+    if not await r.exists(key):
+        return {"ok": False, "error": "not_found"}
+    mapping = {"status": status}
+    if note is not None:
+        mapping["note"] = note
+    await r.hset(key, mapping=mapping)
+    return {"ok": True, "task": await _get_task(r, task_id)}
+
+
+async def list_tasks(r, status=None, assignee_role=None):
+    ids = sorted(int(x) for x in await r.smembers(TASK_SET))
+    tasks = []
+    for tid in ids:
+        t = await _get_task(r, tid)
+        if t is None:
+            continue
+        if status and t["status"] != status:
+            continue
+        if assignee_role and t["assignee_role"] != assignee_role:
+            continue
+        tasks.append(t)
+    return {"tasks": tasks}
+
+
+LOG_SEQ = "co:logseq"
+LOG_LIST = "co:log"             # list of event ids (capped)
+LOG_PREFIX = "co:logitem:"
+
+
+async def log_event(r, kind, text):
+    eid = await r.incr(LOG_SEQ)
+    item = {"id": eid, "kind": kind, "text": text, "ts": _now_iso()}
+    await r.set(LOG_PREFIX + str(eid), json.dumps(item))
+    await r.rpush(LOG_LIST, eid)
+    await r.ltrim(LOG_LIST, -TEAM_LOG_MAXLEN, -1)
+    return {"event_id": eid}
+
+
+async def read_team_log(r, after_id=0, limit=100):
+    ids = [int(x) for x in await r.lrange(LOG_LIST, 0, -1)]
+    ids = [i for i in ids if i > int(after_id)]
+    ids = ids[-limit:]
+    events = []
+    for eid in ids:
+        raw = await r.get(LOG_PREFIX + str(eid))
+        if raw:
+            events.append(json.loads(raw))
+    cursor = ids[-1] if ids else int(after_id)
+    return {"events": events, "cursor": cursor}
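Reviewer note on `read_inbox` / `read_team_log`: both return a `cursor` that the caller is expected to pass back as `after_id` on the next poll. The sketch below models that contract on a plain list so it can be read without Redis; `read_after` is a hypothetical helper and the sample messages are made up for illustration.

```python
def read_after(messages, after_id=0):
    """Return messages with id > after_id, plus the cursor for the next poll.

    `messages` is a list of dicts with increasing integer "id" values.
    """
    fresh = [m for m in messages if m["id"] > after_id]
    # With nothing new, the cursor stays where the caller left it.
    cursor = fresh[-1]["id"] if fresh else after_id
    return {"messages": fresh, "cursor": cursor}


inbox = [{"id": 1, "body": "first"}, {"id": 2, "body": "second"}]
first = read_after(inbox)

# A new message arrives between polls; only it is returned next time.
inbox.append({"id": 3, "body": "third"})
second = read_after(inbox, after_id=first["cursor"])
```

Because the cursor never moves backwards, a client that stores it can resume after a restart without re-reading old messages.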
--- a/co-gahusb/pytest.ini
+++ b/co-gahusb/pytest.ini
@@ -0,0 +1,3 @@
+[pytest]
+asyncio_mode = auto
+testpaths = tests
--- a/co-gahusb/requirements.txt
+++ b/co-gahusb/requirements.txt
@@ -0,0 +1,7 @@
+mcp>=1.2.0
+starlette>=0.37
+uvicorn[standard]==0.34.0
+redis>=5.0
+pytest>=8.0
+pytest-asyncio>=0.24
+fakeredis>=2.21
--- a/co-gahusb/tests/__init__.py
+++ b/co-gahusb/tests/__init__.py
--- a/co-gahusb/tests/conftest.py
+++ b/co-gahusb/tests/conftest.py
@@ -0,0 +1,11 @@
+# co-gahusb/tests/conftest.py
+import pytest_asyncio
+import fakeredis.aioredis
+
+
+@pytest_asyncio.fixture
+async def r():
+    client = fakeredis.aioredis.FakeRedis(decode_responses=True)
+    await client.flushall()
+    yield client
+    await client.aclose()
--- a/co-gahusb/tests/test_locks.py
+++ b/co-gahusb/tests/test_locks.py
@@ -0,0 +1,51 @@
+# co-gahusb/tests/test_locks.py
+from app import locks
+
+
+async def test_acquire_succeeds_then_blocks_other(r):
+    res = await locks.acquire_lock(r, "nas-deploy", "BE", ttl_sec=300)
+    assert res["acquired"] is True
+
+    res2 = await locks.acquire_lock(r, "nas-deploy", "FE", ttl_sec=300)
+    assert res2["acquired"] is False
+    assert res2["held_by"] == "BE"
+    assert res2["ttl_remaining"] > 0
+
+
+async def test_release_only_by_owner(r):
+    await locks.acquire_lock(r, "compose", "BE", ttl_sec=300)
+
+    bad = await locks.release_lock(r, "compose", "FE")
+    assert bad["released"] is False
+
+    ok = await locks.release_lock(r, "compose", "BE")
+    assert ok["released"] is True
+
+    again = await locks.acquire_lock(r, "compose", "FE", ttl_sec=300)
+    assert again["acquired"] is True
+
+
+async def test_heartbeat_only_by_owner_renews_ttl(r):
+    await locks.acquire_lock(r, "nginx-conf", "BE", ttl_sec=10)
+
+    bad = await locks.heartbeat_lock(r, "nginx-conf", "FE", ttl_sec=300)
+    assert bad["renewed"] is False
+
+    ok = await locks.heartbeat_lock(r, "nginx-conf", "BE", ttl_sec=300)
+    assert ok["renewed"] is True
+    assert await r.ttl("co:lock:nginx-conf") > 100
+
+
+async def test_expired_lock_is_reacquirable(r):
+    await locks.acquire_lock(r, "memory-mirror", "AI", ttl_sec=1)
+    await r.delete("co:lock:memory-mirror")
+    res = await locks.acquire_lock(r, "memory-mirror", "FE", ttl_sec=300)
+    assert res["acquired"] is True
+
+
+async def test_list_locks(r):
+    await locks.acquire_lock(r, "nas-deploy", "BE", ttl_sec=300)
+    await locks.acquire_lock(r, "compose", "FE", ttl_sec=300)
+    listed = await locks.list_locks(r)
+    held = {l["resource"]: l["held_by"] for l in listed["locks"]}
+    assert held == {"nas-deploy": "BE", "compose": "FE"}
--- a/co-gahusb/tests/test_messages.py
+++ b/co-gahusb/tests/test_messages.py
@@ -0,0 +1,47 @@
+# co-gahusb/tests/test_messages.py
+from app import store
+
+
+async def test_post_and_read_ordering(r):
+    id1 = (await store.post_message(r, "Producer", "BE", "first"))["message_id"]
+    id2 = (await store.post_message(r, "Producer", "BE", "second"))["message_id"]
+    assert id2 > id1
+
+    res = await store.read_inbox(r, "BE")
+    bodies = [m["body"] for m in res["messages"]]
+    assert bodies == ["first", "second"]
+    assert res["cursor"] == id2
+
+
+async def test_read_inbox_after_id(r):
+    id1 = (await store.post_message(r, "Producer", "BE", "first"))["message_id"]
+    await store.post_message(r, "Producer", "BE", "second")
+    res = await store.read_inbox(r, "BE", after_id=id1)
+    assert [m["body"] for m in res["messages"]] == ["second"]
+
+
+async def test_inboxes_isolated_per_role(r):
+    await store.post_message(r, "Producer", "BE", "for-be")
+    await store.post_message(r, "Producer", "FE", "for-fe")
+    be = await store.read_inbox(r, "BE")
+    fe = await store.read_inbox(r, "FE")
+    assert [m["body"] for m in be["messages"]] == ["for-be"]
+    assert [m["body"] for m in fe["messages"]] == ["for-fe"]
+
+
+async def test_mark_read_advances_cursor(r):
+    await store.post_message(r, "Producer", "BE", "first")
+    res = await store.read_inbox(r, "BE", mark_read=True)
+    last = res["cursor"]
+    await store.post_message(r, "Producer", "BE", "second")
+    res2 = await store.read_inbox(r, "BE", after_id=last)
+    assert [m["body"] for m in res2["messages"]] == ["second"]
+
+
+async def test_message_fields(r):
+    await store.post_message(r, "Producer", "BE", "hi", thread_id="t1")
+    res = await store.read_inbox(r, "BE")
+    m = res["messages"][0]
+    assert m["from_role"] == "Producer"
+    assert m["thread_id"] == "t1"
+    assert "ts" in m and "id" in m
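Review note: the message tests above rely on a non-destructive, cursor-based read: each role has its own inbox, ids increase monotonically, and `after_id` returns only newer messages. A minimal in-memory sketch of that behaviour follows. `InMemoryInbox` is a hypothetical name, and it uses integer ids for simplicity, whereas the spec stores messages in a Redis Stream whose ids have a different format.

```python
class InMemoryInbox:
    """Hypothetical model of per-role inboxes with cursor-based reads."""

    def __init__(self):
        self._seq = 0        # monotonically increasing message id
        self._by_role = {}   # to_role -> list of messages in arrival order

    def post(self, from_role, to_role, body, thread_id=None):
        self._seq += 1
        msg = {"id": self._seq, "from_role": from_role, "body": body, "thread_id": thread_id}
        self._by_role.setdefault(to_role, []).append(msg)
        return {"message_id": self._seq}

    def read(self, role, after_id=0):
        # Reading never removes messages; the caller advances its own cursor.
        msgs = [m for m in self._by_role.get(role, []) if m["id"] > after_id]
        cursor = msgs[-1]["id"] if msgs else after_id
        return {"messages": msgs, "cursor": cursor}
```

Because reads are non-destructive, a session that crashes between polls can re-read from its last known cursor without losing messages.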
--- a/co-gahusb/tests/test_server.py
+++ b/co-gahusb/tests/test_server.py
@@ -0,0 +1,54 @@
+# co-gahusb/tests/test_server.py
+import os
+os.environ["CO_BUS_KEY"] = "test-key"
+
+# config.CO_BUS_KEY is read once at import time, so if another test module imports app.config
+# first it is frozen as an empty value. Force the module attribute directly, independent of import order.
+from app import config
+config.CO_BUS_KEY = "test-key"
+
+from starlette.testclient import TestClient
+from app.server import app
+
+
+def test_health_open_without_auth():
+    client = TestClient(app)
+    res = client.get("/health")
+    assert res.status_code == 200
+    assert res.json()["status"] == "ok"
+
+
+def test_mcp_requires_bearer():
+    client = TestClient(app)
+    res = client.post("/mcp", json={})
+    assert res.status_code == 401
+
+
+def test_mcp_wrong_key_rejected():
+    client = TestClient(app)
+    res = client.post("/mcp", json={}, headers={"Authorization": "Bearer wrong"})
+    assert res.status_code == 401
+
+
+def test_mcp_valid_auth_passes_dns_host_check():
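+    # A valid key must pass the auth gate and must not be blocked by the MCP DNS-rebinding Host check.
+    # TestClient's default Host="testserver" is not localhost, so the response is 421 if the protection is on.
+    # Use the client as a context manager so the lifespan (session manager task group) starts and the request reaches the MCP handler.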
+    with TestClient(app) as client:
+        res = client.post(
+            "/mcp",
+            headers={
+                "Authorization": "Bearer test-key",
+                "Content-Type": "application/json",
+                "Accept": "application/json, text/event-stream",
+            },
+            json={
+                "jsonrpc": "2.0", "id": 1, "method": "initialize",
+                "params": {
+                    "protocolVersion": "2024-11-05", "capabilities": {},
+                    "clientInfo": {"name": "smoke", "version": "0"},
+                },
+            },
+        )
+    assert res.status_code != 401  # authentication passed
+    assert res.status_code != 421  # must not be blocked by the Host check
--- a/co-gahusb/tests/test_tasks.py
+++ b/co-gahusb/tests/test_tasks.py
@@ -0,0 +1,56 @@
+# co-gahusb/tests/test_tasks.py
+import pytest
+from app import store
+
+
+async def test_create_and_list(r):
+    res = await store.create_task(r, "deploy FE", "FE", created_by="Producer", detail="ship it")
+    tid = res["task_id"]
+    listed = await store.list_tasks(r)
+    t = [t for t in listed["tasks"] if t["id"] == tid][0]
+    assert t["title"] == "deploy FE"
+    assert t["assignee_role"] == "FE"
+    assert t["status"] == "open"
+    assert t["created_by"] == "Producer"
+
+
+async def test_claim_then_duplicate_claim_rejected(r):
+    tid = (await store.create_task(r, "x", "FE", created_by="Producer"))["task_id"]
+    ok = await store.claim_task(r, tid, "FE")
+    assert ok["ok"] is True
+    assert ok["task"]["status"] == "in_progress"
+
+    dup = await store.claim_task(r, tid, "BE")
+    assert dup["ok"] is False
+    assert dup["held_by"] == "FE"
+
+
+async def test_update_status(r):
+    tid = (await store.create_task(r, "x", "FE", created_by="Producer"))["task_id"]
+    await store.claim_task(r, tid, "FE")
+    res = await store.update_task(r, tid, "done", "FE", note="finished")
+    assert res["ok"] is True
+    assert res["task"]["status"] == "done"
+    assert res["task"]["note"] == "finished"
+
+
+async def test_list_filters(r):
+    t1 = (await store.create_task(r, "a", "FE", created_by="Producer"))["task_id"]
+    await store.create_task(r, "b", "BE", created_by="Producer")
+    await store.claim_task(r, t1, "FE")
+    fe = await store.list_tasks(r, assignee_role="FE")
+    assert [t["title"] for t in fe["tasks"]] == ["a"]
+    in_prog = await store.list_tasks(r, status="in_progress")
+    assert [t["title"] for t in in_prog["tasks"]] == ["a"]
+
+
+async def test_invalid_status_rejected(r):
+    tid = (await store.create_task(r, "x", "FE", created_by="Producer"))["task_id"]
+    with pytest.raises(ValueError):
+        await store.update_task(r, tid, "bogus", "FE")
+
+
+async def test_update_nonexistent_task_returns_not_found(r):
+    res = await store.update_task(r, 999, "done", "FE")
+    assert res["ok"] is False
+    assert res["error"] == "not_found"
--- a/co-gahusb/tests/test_teamlog.py
+++ b/co-gahusb/tests/test_teamlog.py
@@ -0,0 +1,25 @@
+# co-gahusb/tests/test_teamlog.py
+from app import store
+
+
+async def test_log_event_and_read(r):
+    await store.log_event(r, "message", "Producer→BE: hi")
+    await store.log_event(r, "lock", "BE acquired nas-deploy")
+    res = await store.read_team_log(r)
+    msgs = [e["text"] for e in res["events"]]
+    assert msgs == ["Producer→BE: hi", "BE acquired nas-deploy"]
+
+
+async def test_team_log_after_id(r):
+    e1 = (await store.log_event(r, "message", "a"))["event_id"]
+    await store.log_event(r, "message", "b")
+    res = await store.read_team_log(r, after_id=e1)
+    assert [e["text"] for e in res["events"]] == ["b"]
+
+
+async def test_team_log_capped(r):
+    for i in range(10):
+        await store.log_event(r, "message", f"m{i}")
+    res = await store.read_team_log(r, limit=3)
+    assert len(res["events"]) == 3
+    assert res["events"][-1]["text"] == "m9"
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -221,6 +221,25 @@ services:
      timeout: 5s
      retries: 3

+  co-gahusb:
+    build:
+      context: ./co-gahusb
+    container_name: co-gahusb
+    restart: unless-stopped
+    ports:
+      - "18920:8000"
+    environment:
+      - TZ=${TZ:-Asia/Seoul}
+      - REDIS_URL=${REDIS_URL:-redis://redis:6379}
+      - CO_BUS_KEY=${CO_BUS_KEY:-}
+    depends_on:
+      - redis
+    healthcheck:
+      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
+      interval: 60s
+      timeout: 5s
+      retries: 3
+
  agent-office:
    build:
      context: ./agent-office
@@ -249,6 +268,7 @@ services:
      - CONVERSATION_HISTORY_LIMIT=${CONVERSATION_HISTORY_LIMIT:-20}
      - CONVERSATION_RATE_PER_MIN=${CONVERSATION_RATE_PER_MIN:-6}
      - YOUTUBE_DATA_API_KEY=${YOUTUBE_DATA_API_KEY:-}
+      - REDIS_URL=${REDIS_URL:-redis://redis:6379}
    volumes:
      - ${RUNTIME_PATH:-.}/data/agent-office:/app/data
    depends_on:
@@ -256,6 +276,7 @@ services:
      - music-lab
      - insta-lab
      - realestate-lab
+      - redis
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 60s
@@ -443,7 +464,7 @@ services:
      - "6379:6379"
    volumes:
      - ${RUNTIME_PATH}/redis-data:/data
-    command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
+    command: redis-server --appendonly yes --save "" --stop-writes-on-bgsave-error no --maxmemory 256mb --maxmemory-policy allkeys-lru
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 60s
--- a/docs/superpowers/plans/2026-06-12-co-gahusb-team-bus.md
+++ b/docs/superpowers/plans/2026-06-12-co-gahusb-team-bus.md
--- a/docs/superpowers/plans/2026-06-12-music-pipeline-reliability.md
+++ b/docs/superpowers/plans/2026-06-12-music-pipeline-reliability.md
@@ -0,0 +1,556 @@
+# music/YouTube Pipeline Reliability and Recovery Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Absorb transient pipeline step failures with automatic retries (publish excluded), leave permanent failures in the terminal `failed` state, and allow a manual resume from the failed step via the Telegram [🔄재시도] (retry) button.
+
+**Architecture:** Add a bounded retry loop to music-lab's `orchestrator.run_step`, a `POST /pipeline/{id}/retry` resume endpoint, and `db.get_last_failed_step`. agent-office's `youtube_publisher` detects `failed` and sends a Telegram alert with a button; `webhook` proxies the `ytpub_retry_{pid}` callback to the music-lab retry endpoint.
+
+**Tech Stack:** Python 3.12 / FastAPI / SQLite / asyncio / pytest. Existing patterns: `orchestrator.run_step` (BackgroundTask), the `main.py` pipeline endpoints (404/409 + `_db_module`), `service_proxy` (httpx + `MUSIC_LAB_URL`), `telegram/webhook.py` (callback-prefix dispatch).
+
+**Spec:** `docs/superpowers/specs/2026-06-12-music-pipeline-reliability-design.md`
+
+> **Test fixture note**: First check how each `tests/conftest.py` in music-lab and agent-office isolates the DB (monkeypatching `db.DB_PATH` + `init_db`), and adapt the fixtures in the tests below to that convention. The code below assumes the standard pattern of monkeypatching `db.DB_PATH` to a tmp path.
+
+---
+
+## File Structure
+
+| File | Change | Responsibility |
+|------|------|------|
+| `music-lab/app/db.py` | Modify | Add `get_last_failed_step(pid)` |
+| `music-lab/app/pipeline/orchestrator.py` | Modify | Extract `_dispatch_step` + retry loop in `run_step` |
+| `music-lab/app/main.py` | Modify | `POST /api/music/pipeline/{pid}/retry` |
+| `music-lab/tests/test_pipeline_retry.py` | Create | db + orchestrator + endpoint tests |
+| `agent-office/app/service_proxy.py` | Modify | `pipeline_retry(pid)`, `list_failed_pipelines()` |
+| `agent-office/app/agents/youtube_publisher.py` | Modify | Detect `failed` → Telegram alert + button |
+| `agent-office/app/telegram/webhook.py` | Modify | `ytpub_retry_` dispatch |
+| `agent-office/tests/test_youtube_publisher_retry.py` | Create | Alert + callback tests |
+| `web-backend/CLAUDE.md` + `memory/service_music.md` | Modify | API table + memory |
+
+---
+
+## Task 1: music-lab db — `get_last_failed_step`
+
+**Files:** Modify `music-lab/app/db.py`; Test `music-lab/tests/test_pipeline_retry.py` (Create)
+
+- [ ] **Step 1: Write the failing test**
+
+`music-lab/tests/test_pipeline_retry.py` (adjust the fixture to the music-lab conftest convention):
+```python
+import pytest
+from app import db
+
+
+@pytest.fixture(autouse=True)
+def _tmp_db(tmp_path, monkeypatch):
+    monkeypatch.setattr(db, "DB_PATH", str(tmp_path / "music.db"))
+    db.init_db()
+
+
+def _make_pipeline_with_failed_step(step: str) -> int:
+    pid = db.create_pipeline(track_id=1)  # match the real signature after checking conftest/db
+    job = db.create_pipeline_job(pid, step)
+    db.update_pipeline_job(job, status="failed", error="boom")
+    db.update_pipeline_state(pid, "failed", failed_reason=f"{step}: boom")
+    return pid
+
+
+def test_get_last_failed_step_returns_step():
+    pid = _make_pipeline_with_failed_step("video")
+    assert db.get_last_failed_step(pid) == "video"
+
+
+def test_get_last_failed_step_none_when_no_failure():
+    pid = db.create_pipeline(track_id=1)
+    db.create_pipeline_job(pid, "cover")  # default status (running/succeeded), not failed
+    assert db.get_last_failed_step(pid) is None
+```
+
+- [ ] **Step 2: Confirm the failure**
+
+Run: `cd music-lab && PYTHONPATH=.. python -m pytest tests/test_pipeline_retry.py::test_get_last_failed_step_returns_step -v`
+Expected: FAIL because `db.get_last_failed_step` does not exist yet. (If the `create_pipeline` signature differs, adjust the helper to db's actual creation function.)
+
+- [ ] **Step 3: Implement**
+
+Add to the pipeline_jobs section of `music-lab/app/db.py` (near `list_pipeline_jobs`):
+```python
+def get_last_failed_step(pid: int) -> Optional[str]:
+    """Return the step of the pipeline's most recent pipeline_job with status='failed', or None."""
+    with _connect() as conn:   # match music-lab's actual connection helper name
+        row = conn.execute(
+            "SELECT step FROM pipeline_jobs "
+            "WHERE pipeline_id = ? AND status = 'failed' "
+            "ORDER BY id DESC LIMIT 1",
+            (pid,),
+        ).fetchone()
+    return row["step"] if row else None
+```
+(Check the top of db.py for the actual connection context manager name, e.g. `_connect` or `_conn`, and match it.)
+
+- [ ] **Step 4: Confirm it passes**
+
+Run: `cd music-lab && PYTHONPATH=.. python -m pytest tests/test_pipeline_retry.py -v -k get_last_failed`
+Expected: 2 PASS.
+
+- [ ] **Step 5: Commit**
+```bash
+git add music-lab/app/db.py music-lab/tests/test_pipeline_retry.py
+git commit -m "feat(music-lab): add get_last_failed_step to identify the failed step for pipeline resume
+
+Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
+```
+
+---
+
+## Task 2: orchestrator automatic retry
+
+**Files:** Modify `music-lab/app/pipeline/orchestrator.py`; Test `music-lab/tests/test_pipeline_retry.py`
+
+- [ ] **Step 1: Write the failing tests** (append to test_pipeline_retry.py)
+
+```python
+import asyncio
+from app.pipeline import orchestrator
+
+
+@pytest.fixture(autouse=True)
+def _no_backoff(monkeypatch):
+    monkeypatch.setattr(orchestrator, "STEP_RETRY_BACKOFF_SEC", [0, 0])
+
+
+@pytest.mark.asyncio
+async def test_retryable_step_retries_then_succeeds(monkeypatch):
+    pid = db.create_pipeline(track_id=1)
+    calls = {"n": 0}
+
+    async def flaky(step, p, ctx, feedback):
+        calls["n"] += 1
+        if calls["n"] < 3:
+            raise RuntimeError("transient")
+        return {"next_state": "video_pending", "fields": {}}
+
+    monkeypatch.setattr(orchestrator, "_dispatch_step", flaky)
+    monkeypatch.setattr(orchestrator, "_resolve_input", lambda p: {"genre": "x", "title": "t", "moods": [], "tracks": [], "audio_path": "", "duration_sec": 0})
+
+    await orchestrator.run_step(pid, "cover")
+    assert calls["n"] == 3
+    assert db.get_pipeline(pid)["state"] == "video_pending"
+
+
+@pytest.mark.asyncio
+async def test_retryable_step_exhausts_to_failed(monkeypatch):
+    pid = db.create_pipeline(track_id=1)
+
+    async def always_fail(step, p, ctx, feedback):
+        raise RuntimeError("permanent")
+
+    monkeypatch.setattr(orchestrator, "_dispatch_step", always_fail)
+    monkeypatch.setattr(orchestrator, "_resolve_input", lambda p: {"genre": "x", "title": "t", "moods": [], "tracks": [], "audio_path": "", "duration_sec": 0})
+
+    await orchestrator.run_step(pid, "cover")
+    assert db.get_pipeline(pid)["state"] == "failed"
+
+
+@pytest.mark.asyncio
+async def test_publish_not_retried(monkeypatch):
+    pid = db.create_pipeline(track_id=1)
+    calls = {"n": 0}
+
+    async def fail_publish(step, p, ctx, feedback):
+        calls["n"] += 1
+        raise RuntimeError("upload error")
+
+    monkeypatch.setattr(orchestrator, "_dispatch_step", fail_publish)
+    monkeypatch.setattr(orchestrator, "_resolve_input", lambda p: {"genre": "x", "title": "t", "moods": [], "tracks": [], "audio_path": "", "duration_sec": 0})
+
+    await orchestrator.run_step(pid, "publish")
+    assert calls["n"] == 1          # no retry
+    assert db.get_pipeline(pid)["state"] == "failed"
+```
+
+- [ ] **Step 2: Confirm the failure**
+
+Run: `cd music-lab && PYTHONPATH=.. python -m pytest tests/test_pipeline_retry.py -v -k "retry or publish_not"`
+Expected: FAIL because `_dispatch_step`/`STEP_RETRY_BACKOFF_SEC` do not exist yet.
+
+- [ ] **Step 3: Implement: extract `_dispatch_step` + add the retry loop**
+
+Add constants at the top of `orchestrator.py`:
+```python
+STEP_MAX_RETRIES = 2          # extra retries (total attempts = this + 1)
+STEP_RETRY_BACKOFF_SEC = [5, 15]
+NON_RETRY_STEPS = {"publish"}
+```
+
+Extract the existing if/elif branch (currently lines 32-45 inside `run_step`) into a helper:
+```python
+async def _dispatch_step(step: str, p: dict, ctx: dict, feedback: str) -> dict:
+    if step == "cover":
+        return await _run_cover(p, ctx, feedback)
+    if step == "video":
+        return await _run_video(p, ctx)
+    if step == "thumb":
+        return await _run_thumb(p, ctx, feedback)
+    if step == "meta":
+        return await _run_meta(p, ctx, feedback)
+    if step == "review":
+        return await _run_review(p, ctx)
+    if step == "publish":
+        return await _run_publish(p, ctx)
+    raise ValueError(f"unknown step: {step}")
+```
+
+Replace the try block in `run_step` (the step execution part) with a retry loop:
+```python
+    try:
+        ctx = _resolve_input(p)
+    except ValueError as e:
+        db.update_pipeline_job(job_id, status="failed", error=str(e))
+        db.update_pipeline_state(pipeline_id, "failed", failed_reason=f"{step}: {e}")
+        return
+
+    attempts = 1 if step in NON_RETRY_STEPS else (STEP_MAX_RETRIES + 1)
+    last_err = None
+    for i in range(attempts):
+        try:
+            result = await _dispatch_step(step, p, ctx, feedback)
+            db.update_pipeline_job(job_id, status="succeeded")
+            db.update_pipeline_state(pipeline_id, result["next_state"], **result.get("fields", {}))
+            return
+        except Exception as e:
+            last_err = e
+            logger.exception("step %s 실패 (pipeline %s, attempt %d/%d)", step, pipeline_id, i + 1, attempts)
+            if i < attempts - 1:
+                await asyncio.sleep(STEP_RETRY_BACKOFF_SEC[min(i, len(STEP_RETRY_BACKOFF_SEC) - 1)])
+    db.update_pipeline_job(job_id, status="failed", error=str(last_err))
+    db.update_pipeline_state(pipeline_id, "failed", failed_reason=f"{step}: {last_err}")
+```
+(`asyncio` is already imported.)
+
+- [ ] **Step 4: Confirm it passes**
+
+Run: `cd music-lab && PYTHONPATH=.. python -m pytest tests/test_pipeline_retry.py -v -k "retry or publish_not"`
+Expected: 3 PASS.
+
+- [ ] **Step 5: Commit**
+```bash
+git add music-lab/app/pipeline/orchestrator.py music-lab/tests/test_pipeline_retry.py
+git commit -m "feat(music-lab): automatic orchestrator step retry (publish excluded)
+
+Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
+```
+
+---
+
+## Task 3: retry endpoint
+
+**Files:** Modify `music-lab/app/main.py`; Test `music-lab/tests/test_pipeline_retry.py`
+
+- [ ] **Step 1: Write the failing tests**
+
+```python
+from fastapi.testclient import TestClient
+
+
+@pytest.fixture
+def client(monkeypatch):
+    from app.main import app
+    return TestClient(app)
+
+
+def test_retry_failed_pipeline_retriggers(client, monkeypatch):
+    pid = db.create_pipeline(track_id=1)
+    job = db.create_pipeline_job(pid, "video")
+    db.update_pipeline_job(job, status="failed", error="boom")
+    db.update_pipeline_state(pid, "failed", failed_reason="video: boom")
+
+    called = {}
+    from app.pipeline import orchestrator
+    async def fake_run(p, step, *a):
+        called["pid"], called["step"] = p, step
+    monkeypatch.setattr(orchestrator, "run_step", fake_run)
+
+    r = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r.status_code in (200, 202)
+    assert r.json()["retrying_step"] == "video"
+
+
+def test_retry_non_failed_409(client):
+    pid = db.create_pipeline(track_id=1)  # state='created'
+    r = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r.status_code == 409
+
+
+def test_retry_publish_with_video_id_rejected(client):
+    pid = db.create_pipeline(track_id=1)
+    job = db.create_pipeline_job(pid, "publish")
+    db.update_pipeline_job(job, status="failed", error="x")
+    db.update_pipeline_state(pid, "failed", failed_reason="publish: x", youtube_video_id="abc123")
+    r = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r.status_code == 409
+```
+
+- [ ] **Step 2: Confirm the failure**
+
+Run: `cd music-lab && PYTHONPATH=.. python -m pytest tests/test_pipeline_retry.py -v -k retry_`
+Expected: FAIL because the route returns 404.
+
+- [ ] **Step 3: Implement**
+
+Add below `cancel_pipeline` in `music-lab/app/main.py`:
+```python
+@app.post("/api/music/pipeline/{pid}/retry", status_code=202)
+async def retry_pipeline(pid: int, bg: BackgroundTasks):
+    p = _db_module.get_pipeline(pid)
+    if not p:
+        raise HTTPException(404)
+    if p["state"] != "failed":
+        raise HTTPException(409, f"재개 불가 (state={p['state']})")
+    failed_step = _db_module.get_last_failed_step(pid)
+    if not failed_step:
+        # Fallback: parse the step from the "{step}: ..." prefix of failed_reason
+        reason = p.get("failed_reason") or ""
+        failed_step = reason.split(":", 1)[0].strip() if ":" in reason else None
+    if not failed_step:
+        raise HTTPException(409, "실패 step을 판별할 수 없음")
+    if failed_step == "publish" and p.get("youtube_video_id"):
+        raise HTTPException(409, "이미 업로드됨 (중복 방지)")
+    bg.add_task(orchestrator.run_step, pid, failed_step)
+    return {"ok": True, "retrying_step": failed_step}
+```
+
+- [ ] **Step 4: Confirm it passes + full regression**
+
+Run: `cd music-lab && PYTHONPATH=.. python -m pytest tests/test_pipeline_retry.py -v` → all PASS
+Run: `cd music-lab && PYTHONPATH=.. python -m pytest tests/ -q` → 0 regressions
+
+- [ ] **Step 5: Commit**
+```bash
+git add music-lab/app/main.py music-lab/tests/test_pipeline_retry.py
+git commit -m "feat(music-lab): POST /pipeline/{id}/retry to manually resume from the failed step
+
+Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
+```
+
+---
+
+## Task 4: agent-office service_proxy — pipeline_retry + list_failed
+
+**Files:** Modify `agent-office/app/service_proxy.py`
+
+> **Check first**: whether `GET /api/music/pipeline?status=active`, which `list_active_pipelines` calls, includes failed pipelines. If not, check whether music-lab's pipeline list endpoint also supports `status=failed`; if it does not, add a failed filter to that endpoint (a separate small change) or add 'failed' to the `status` whitelist.
+
+- [ ] **Step 1: Add helpers**, following the existing `list_active_pipelines`/`post_pipeline_feedback` pattern (async with httpx.AsyncClient + MUSIC_LAB_URL) as-is:
+```python
+async def list_failed_pipelines() -> list[dict]:
+    async with httpx.AsyncClient(timeout=10) as client:
+        resp = await client.get(f"{MUSIC_LAB_URL}/api/music/pipeline?status=failed")
+        resp.raise_for_status()
+        data = resp.json()
+        return data if isinstance(data, list) else data.get("items", data.get("pipelines", []))
+
+
+async def pipeline_retry(pid: int) -> dict:
+    async with httpx.AsyncClient(timeout=15) as client:
+        resp = await client.post(f"{MUSIC_LAB_URL}/api/music/pipeline/{pid}/retry")
+        # do not raise on 409 (cannot resume / duplicate) so the body can be returned
+        return {"status_code": resp.status_code, **(resp.json() if resp.headers.get("content-type","").startswith("application/json") else {})}
+```
+(If `list_active_pipelines` already includes failed pipelines, skip `list_failed_pipelines` and filter the active list for state=='failed' in Task 5.)
+
+- [ ] **Step 2: import sanity** — `cd agent-office && PYTHONPATH=.. python -c "from app import service_proxy; print('OK')"` → OK
+
+- [ ] **Step 3: Commit**
+```bash
+git add agent-office/app/service_proxy.py
+git commit -m "feat(agent-office): service_proxy pipeline_retry + list_failed_pipelines
+
+Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
+```
+
+---
+
+## Task 5: youtube_publisher failed detection + Telegram alert/button
+
+**Files:** Modify `agent-office/app/agents/youtube_publisher.py`; Test `agent-office/tests/test_youtube_publisher_retry.py` (Create)
+
+- [ ] **Step 1: Write the failing test**
+
+`agent-office/tests/test_youtube_publisher_retry.py` (the DB fixture follows the agent-office conftest convention):
+```python
+import pytest
+from unittest.mock import AsyncMock
+from app.agents.youtube_publisher import YoutubePublisherAgent
+
+
+@pytest.mark.asyncio
+async def test_failed_pipeline_notified_with_retry_button(monkeypatch):
+    agent = YoutubePublisherAgent()
+    monkeypatch.setattr(
+        "app.agents.youtube_publisher.service_proxy.list_active_pipelines",
+        AsyncMock(return_value=[
+            {"id": 7, "state": "failed", "failed_reason": "video: boom", "track_title": "T"}
+        ]),
+    )
+    sent = AsyncMock(return_value={"ok": True, "message_id": 1})
+    monkeypatch.setattr("app.agents.youtube_publisher.send_raw", sent)
+
+    await agent.poll_state_changes()
+    assert sent.await_count == 1
+    args, kwargs = sent.await_args
+    text = kwargs.get("text") or (args[0] if args else "")
+    assert "실패" in text
+    # callback_data of the inline retry button
+    rm = kwargs.get("reply_markup") or {}
+    cb = rm["inline_keyboard"][0][0]["callback_data"]
+    assert cb == "ytpub_retry_7"
+
+    # Dedup: polling the same failed pipeline again sends nothing
+    await agent.poll_state_changes()
+    assert sent.await_count == 1
+```
+(Note: check in messaging whether `send_raw` supports `reply_markup`. If it does not, this task also includes adding a `reply_markup` argument to messaging.send_raw. insta used send_photo, but here the alert is text + button, so send_raw needs reply_markup.)
+
+- [ ] **Step 2: Confirm the failure**: `cd agent-office && PYTHONPATH=.. python -m pytest tests/test_youtube_publisher_retry.py -v` → FAIL (failed is not handled yet)
+
+- [ ] **Step 3: Implement** by adding a failed branch to `poll_state_changes`:
+```python
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._notified_state_per_pipeline: dict[int, tuple] = {}
+        self._notified_failed: set[int] = set()
+```
+Inside the `poll_state_changes` loop, after the `*_pending` handling:
+```python
+            if state == "failed" and pid not in self._notified_failed:
+                await self._notify_failed(p)
+                self._notified_failed.add(pid)
+            if state != "failed":
+                self._notified_failed.discard(pid)  # re-alert if it fails again after a resume
+```
+New method:
+```python
+    async def _notify_failed(self, p: dict) -> None:
+        reason = p.get("failed_reason") or "?"
+        step = reason.split(":", 1)[0].strip()
+        title = p.get("track_title") or f"Pipeline #{p['id']}"
+        text = f"⚠️ [{title}] 파이프라인 #{p['id']} '{step}' 실패\n사유: {reason}"
+        kb = {"inline_keyboard": [[{"text": "🔄 재시도", "callback_data": f"ytpub_retry_{p['id']}"}]]}
+        await send_raw(text=text, reply_markup=kb)
+        add_log(self.agent_id, f"pipeline {p['id']} 실패 알림", "warning")
+```
+Check the `send_raw` signature in `agent-office/app/telegram/messaging.py` and extend it to accept `reply_markup` (leave it as-is if already supported).
+
+- [ ] **Step 4: Confirm it passes**: `cd agent-office && PYTHONPATH=.. python -m pytest tests/test_youtube_publisher_retry.py -v` → PASS + full regression
+
+- [ ] **Step 5: Commit**
+```bash
+git add agent-office/app/agents/youtube_publisher.py agent-office/app/telegram/messaging.py agent-office/tests/test_youtube_publisher_retry.py
+git commit -m "feat(agent-office): youtube_publisher Telegram alert + retry button on pipeline failure
+
+Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
+```
+
+---
+
+## Task 6: webhook ytpub_retry dispatch
+
+**Files:** Modify `agent-office/app/telegram/webhook.py`; Test `agent-office/tests/test_youtube_publisher_retry.py`
+
+> **Check first**: the prefix-branch structure of `_handle_callback` + the pattern existing handlers (`_handle_insta_issue`, etc.) use to call service_proxy and reply.
+
+- [ ] **Step 1: Add the failing test**
+```python
+@pytest.mark.asyncio
+async def test_handle_ytpub_retry_calls_proxy(monkeypatch):
+    from app.telegram import webhook
+    retry = AsyncMock(return_value={"status_code": 202, "ok": True, "retrying_step": "video"})
+    monkeypatch.setattr("app.telegram.webhook.service_proxy.pipeline_retry", retry, raising=False)
+    monkeypatch.setattr("app.telegram.webhook.send_raw", AsyncMock(), raising=False)
+    res = await webhook._handle_ytpub_retry({"id": 1}, "ytpub_retry_7")
+    retry.assert_awaited_once_with(7)
+```
+(Match the import path and the `send_raw` location to the actual webhook.py.)
+
+- [ ] **Step 2: Confirm the failure** → FAIL (`_handle_ytpub_retry` does not exist yet)
+
+- [ ] **Step 3: Implement** by adding a branch to `_handle_callback`:
+```python
+    if callback_id.startswith("ytpub_retry_"):
+        return await _handle_ytpub_retry(callback_query, callback_id)
+```
+Handler:
+```python
+async def _handle_ytpub_retry(callback_query: dict, callback_id: str) -> dict:
+    try:
+        pid = int(callback_id.removeprefix("ytpub_retry_"))
+    except (ValueError, AttributeError):
+        return {"ok": False, "error": "invalid_callback_data"}
+    res = await service_proxy.pipeline_retry(pid)
+    sc = res.get("status_code")
+    if sc in (200, 202):
+        await send_raw(text=f"🔄 파이프라인 #{pid} 재개: {res.get('retrying_step','?')}")
+    else:
+        await send_raw(text=f"⚠️ 재개 불가 (#{pid}): {res.get('detail', sc)}")
+    return {"ok": True}
+```
+(Import `service_proxy`/`send_raw` the way webhook.py already does.)
+
+- [ ] **Step 4: Confirm it passes** + full agent-office regression
+
+- [ ] **Step 5: Commit**
+```bash
+git add agent-office/app/telegram/webhook.py agent-office/tests/test_youtube_publisher_retry.py
+git commit -m "feat(agent-office): proxy the ytpub_retry Telegram callback to music-lab retry
+
+Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
+```
+
+---
+
+## Task 7: Docs + deploy + memory
+
+**Files:** Modify `web-backend/CLAUDE.md`, `memory/service_music.md`
+
+- [ ] **Step 1: Add to the music API table in CLAUDE.md**
+```
+| POST | `/api/music/pipeline/{id}/retry` | 실패 파이프라인 실패 step부터 재개 (publish+업로드완료 시 409) |
+```
+
+- [ ] **Step 2: Full regression**
+```bash
+cd music-lab && PYTHONPATH=.. python -m pytest tests/ -q
+cd ../agent-office && PYTHONPATH=.. python -m pytest tests/ -q
+```
+Expected: all PASS (excluding pre-existing stale failures).
+
+- [ ] **Step 3: Commit + push (NAS deploy)**
+```bash
+cd C:/Users/jaeoh/Desktop/workspace/web-backend
+git add CLAUDE.md docs/superpowers/plans/2026-06-12-music-pipeline-reliability.md
+git commit -m "docs(music): pipeline retry API docs + implementation plan
+
+Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
+git push origin main
+```
+
+- [ ] **Step 4: Update memory**: add reliability/recovery (automatic retry excluding publish + manual retry endpoint + youtube_publisher failure alert) to `service_music.md`.
+
+- [ ] **Step 5: Production check (lightweight)**: after deploy, `POST /api/music/pipeline/<nonexistent-id>/retry` → 404; if a real failed pipeline exists, confirm retry works. (Otherwise the unit tests stand in.)
+
+---
+
+## Self-Review
+
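+**Spec coverage:**
+- Automatic retry (publish excluded, _resolve_input excluded) → Task 2 ✓
+- Manual resume (failed step, publish+video_id guard) → Task 1 (step detection) + Task 3 ✓
+- Failure alert + [🔄재시도] → Task 5 ✓
+- Retry callback → Task 4 (proxy) + Task 6 (dispatch) ✓
+- Stuck detection excluded (YAGNI) → not in the plan ✓
+
+**Placeholder scan:** Every code step is concrete. "Check the conftest convention" and "check whether list_active includes failed" are deliberate verification instructions that respect ownership of the existing code (not placeholders).
+
+**Type consistency:** `get_last_failed_step(pid)` matches between Task 1 and Task 3. `_dispatch_step(step,p,ctx,feedback)` matches between the Task 2 definition and the test mocks. The `run_step(pid, step)` signature matches the existing one. The callback `ytpub_retry_{pid}` matches between Task 5 (creation) and Task 6 (parsing). `pipeline_retry(pid)` matches between Task 4 and Task 6. The retry response fields `retrying_step`/`status_code` match across Task 3, Task 4, and Task 6.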
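Review note: the retry policy planned in Task 2 (one attempt for `publish`, up to `STEP_MAX_RETRIES + 1` attempts for everything else, with a clamped backoff index) can be exercised in isolation. The sketch below is a standalone model under stated assumptions, not the orchestrator code: `run_with_retry`, `make_flaky`, and `demo` are hypothetical names, there is no DB, and the backoff is set to zero so it runs instantly (the plan uses `[5, 15]`).

```python
import asyncio

STEP_MAX_RETRIES = 2              # extra retries, as in the plan
STEP_RETRY_BACKOFF_SEC = [0, 0]   # zeroed for the sketch; the plan uses [5, 15]
NON_RETRY_STEPS = {"publish"}


async def run_with_retry(step, action):
    """Return (final_state, attempts_used); `action` stands in for _dispatch_step."""
    attempts = 1 if step in NON_RETRY_STEPS else STEP_MAX_RETRIES + 1
    for i in range(attempts):
        try:
            await action()
            return "succeeded", i + 1
        except Exception:
            if i < attempts - 1:
                # Clamp the index so a short backoff list never raises IndexError.
                delay = STEP_RETRY_BACKOFF_SEC[min(i, len(STEP_RETRY_BACKOFF_SEC) - 1)]
                await asyncio.sleep(delay)
    return "failed", attempts


def make_flaky(fail_times):
    """Build an async action that raises on its first `fail_times` calls."""
    calls = {"n": 0}

    async def action():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise RuntimeError("transient")

    return action


def demo():
    return (
        asyncio.run(run_with_retry("cover", make_flaky(2))),    # recovers on the 3rd attempt
        asyncio.run(run_with_retry("cover", make_flaky(5))),    # exhausts all 3 attempts
        asyncio.run(run_with_retry("publish", make_flaky(1))),  # never retried
    )
```

Keeping `publish` out of the retry set matches the plan's intent of avoiding a duplicate upload when a failure happens after the video has already been sent.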
--- a/docs/superpowers/plans/2026-06-29-worker-observability.md
+++ b/docs/superpowers/plans/2026-06-29-worker-observability.md
--- a/docs/superpowers/plans/2026-07-02-realtime-trade-alerts.md
+++ b/docs/superpowers/plans/2026-07-02-realtime-trade-alerts.md
--- a/docs/superpowers/specs/2026-06-12-co-gahusb-team-bus-design.md
+++ b/docs/superpowers/specs/2026-06-12-co-gahusb-team-bus-design.md
@@ -0,0 +1,127 @@
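+# co-gahusb: Cross-Session Collaboration Team Bus Design
+
+Date: 2026-06-12
+Target repos: `web-backend` (server) + `web-ui`/`web-ai` (client wiring)
+Purpose: Let four independently running Claude Code sessions (FE/BE/AI/Producer) take roles and communicate and collaborate asynchronously, while preventing concurrent writes to shared DBs/resources.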
+
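+## Background
+
+The web-ui / web-backend / web-ai sessions are separate processes and cannot see each other's context. To collaborate they need a shared message bus reachable from all three places (including different machines). The user chose option B (a standalone MCP server) and named separation of concurrent writes to sensitive shared areas as the core requirement.
+
+## Decisions (finalized in brainstorming)
+
+- Hosting: a new standalone container **`co-gahusb`** on the NAS, port **18920** (next to agent-office on 18900, confirmed unused).
+- Transport/auth: **HTTP streamable MCP** + a static **Bearer key** (reuses [[reference_webai_auth_pattern]]). nginx `/api/co/` → `co-gahusb:18920`, forwarding `Authorization`.
+- Backend: **Redis** (the existing shared container, `redis://redis:6379`). Every operation is atomic, which avoids the SQLite multi-writer trap ([[reference_sqlite_concurrency]]).
+- Concurrent-write separation: **ownership partitioning + advisory locks**.
+- Roles: web-ui=FE, web-backend=BE, web-ai=AI, this session=Producer.
+- Receiving: each session polls via **/loop** (`read_inbox` + `list_tasks`).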
+
+## Architecture
+
+```
+[FE session web-ui]   [BE session web-backend]   [AI session web-ai (other machine)]   [Producer session]
+       \                  |                        /                         /
+        \                 |                       /                         /
+         ──────── .mcp.json HTTP + Bearer ───────────────────────────────
+                                   │
+                    nginx /api/co/ (Authorization forward)
+                                   │
+                       co-gahusb:18920  (FastMCP streamable-http)
+                                   │
+                          Redis (atomic operations)
+```
+
+Server implementation: **Python `mcp` SDK (FastMCP) + streamable-http transport** (consistent with the FastAPI/Python stack every lab uses). Split into single-responsibility modules:
+- `app/server.py`: FastMCP instance + tool registration + ASGI app (streamable-http) + Bearer auth middleware
+- `app/store.py`: Redis data access layer (messages/tasks/locks), every function atomic
+- `app/locks.py`: lock Lua scripts (release/heartbeat after verifying the owner)
+- `app/models.py`: input/output dataclasses/schemas
+- `app/config.py`: env (REDIS_URL, CO_BUS_KEY, port)
+
+## MCP tool surface (MVP, YAGNI)
+
+| Category | Tool | Signature → Returns |
+|------|-----|------|
+| Message | `post_message` | `(from_role, to_role, body, thread_id?)` → `{message_id}` |
+| Message | `read_inbox` | `(role, after_id?, mark_read?=false)` → `{messages:[{id, from_role, body, thread_id, ts}], cursor}` |
+| Task | `create_task` | `(title, assignee_role, detail?, created_by)` → `{task_id}` |
+| Task | `claim_task` | `(task_id, role)` → `{ok, task}` (if already claimed: `{ok:false, held_by}`) |
+| Task | `update_task` | `(task_id, status, role, note?)` → `{ok, task}` (status ∈ open/in_progress/blocked/done) |
+| Task | `list_tasks` | `(status?, assignee_role?)` → `{tasks:[...]}` |
+| Lock | `acquire_lock` | `(resource, role, ttl_sec=300)` → `{acquired, held_by?, ttl_remaining?}` |
+| Lock | `release_lock` | `(resource, role)` → `{released}` (non-owner: `{released:false}`) |
+| Lock | `heartbeat_lock` | `(resource, role, ttl_sec=300)` → `{renewed}` (owner only) |
+| Lock | `list_locks` | `()` → `{locks:[{resource, held_by, ttl_remaining}]}` |
+| Visibility | `team_log` | `(after_id?)` → `{events:[...], cursor}` (recent activity feed) |
+
+## Redis data model (all atomic)
+
+- **Messages**: `co:inbox:{role}` = Redis **Stream**. `post_message` = XADD, `read_inbox` = XREAD (`after_id` cursor, non-destructive). `mark_read` stores the last id in the `co:read:{role}` key.
+- **Tasks**: `co:task:{id}` Hash (title/assignee/status/detail/created_by/ts), `co:tasks` Set (id list), ids from `INCR co:taskseq`. `claim_task`/`update_task` run as **Lua scripts** so the read-modify-write is atomic (prevents duplicate claims and races).
+- **Locks**: acquire = `SET co:lock:{resource} {role} NX EX {ttl}` (atomic). `release_lock`/`heartbeat_lock` = **Lua** that `GET`s the key, confirms the owner matches, then runs `DEL`/`EXPIRE` (atomic check-and-act → nobody can touch another role's lock).
+- **Activity log**: `co:log` = capped Stream (`XADD ... MAXLEN ~ 500`). Records message, task, and lock events → Producer oversight.
+
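+## Concurrent-write separation (core requirement)
+
+**Tier 1: static ownership partitions** (natural separation, no locks needed):
+- Only FE writes to `web-ui`, only BE to `web-backend`, only AI to `web-ai`. Each session edits only its own repo → git conflicts are ruled out at the source.
+
+**Tier 2: advisory locks on cross-cutting resources** (only the sensitive areas several roles may touch):
+- Reserved resource names: `nas-deploy`, `stock-db-schema`, `lotto-db-schema`, `memory-mirror` (web-ui↔web-ai mirror), `nginx-conf`, `compose`.
+- Rule: `acquire_lock` is mandatory before changing any of these resources. If the lock is held, the call returns `{acquired:false, held_by, ttl_remaining}` → wait. **TTL auto-release prevents deadlock when a session dies**; long jobs renew with `heartbeat_lock`.
+- Advisory (cooperative): the bus does not forcibly lock the filesystem → the rule is enforced by writing "shared resource = lock first" into each session's CLAUDE.md.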
+
+## Client wiring
+
+- `.mcp.json` in each repo:
+  ```json
+  { "mcpServers": { "co-gahusb": {
+      "type": "http",
+      "url": "https://gahusb.synology.me/api/co/mcp",
+      "headers": { "Authorization": "Bearer ${CO_BUS_KEY}" } } } }
+  ```
+  (Never commit the key; inject it from each machine's env/local config. `.mcp.json` holds only the placeholder; the real key lives in `.env`/an environment variable.)
+- Add a role block to each repo's CLAUDE.md: "You are role X / pass role=X to every co-gahusb tool / acquire_lock before changing shared resources / poll inbox and tasks with `/loop`".
+- web-ai runs on a different machine → apply `.mcp.json` on that machine (the spec documents the procedure).
+
+## Infrastructure registration (mandatory places when adding a container: [[reference_nas_url_routing]], [[reference_deploy_nas_services_whitelist]])
+
+1. `docker-compose.yml`: `co-gahusb` service (build, `REDIS_URL`, `depends_on: redis`, `CO_BUS_KEY` env; no `${RUNTIME_PATH}` volume needed since state lives in Redis).
+2. nginx `default.conf`: add a **public `location /api/co/`** (the 7th registration rule; `/api/internal/` not needed).
+3. Add `co-gahusb` to the deploy script's SERVICES whitelist.
+4. `${RUNTIME_PATH}` absolute path: this service has no persistent volume (Redis backend), so only the code directory.
+5. frontend `depends_on`: not needed (backend-only service).
+6. `.env`: add `CO_BUS_KEY` (never commit).
+
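+## Errors / edge cases
+
+- Auth failure → 401; log ERROR once, then stay quiet ([[reference_webai_auth_pattern]]).
+- Lock acquisition failure → not an exception; return `{acquired:false, held_by, ttl_remaining}` normally.
+- Expired lock → removed automatically by the Redis TTL (no separate GC needed).
+- Unknown role/resource → explicit error message.
+- Redis connection failure → 503 + a clear message.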
+
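+## Tests (TDD, pytest + fakeredis)
+
+- **Locks**: two roles acquire the same resource → the second is rejected / acquire succeeds after TTL expiry / release and heartbeat by a non-owner are rejected / ttl increases after a heartbeat renewal.
+- **Messages**: read in XADD order via the `after_id` cursor / excluded on re-read after mark_read / inboxes isolated per role.
+- **Tasks**: create→claim (duplicate claim rejected)→update status transitions / list filters.
+- **Auth**: matching key passes / mismatch returns 401.
+- **team_log**: events recorded + MAXLEN cap.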
+
+## Implementation order (phases)
+
+1. Scaffold: directories/Dockerfile/requirements/config (mirror the existing lab layout)
+2. `store.py` + `locks.py` (TDD, fakeredis): locks → messages → tasks → team_log
+3. `server.py`: FastMCP tool registration + Bearer auth + ASGI
+4. Infrastructure registration in the 6 places (compose/nginx/deploy/env)
+5. Client wiring: web-ui and web-backend `.mcp.json` + CLAUDE.md role blocks (web-ai: document the procedure)
+6. Deploy (Gitea push → webhook) + smoke tests (health/auth/lock contention)
+
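+## Out of scope (YAGNI)
+
+- Real-time push (Telegram): later. /loop polling first.
+- SQLite audit log: the capped Redis stream is enough.
+- Web dashboard: may later merge with agent-office oversight.
+- FS-level lock enforcement: advisory is enough (sessions are cooperative).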
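
The lock semantics in the co-gahusb spec above (atomic acquire with a TTL, owner-checked release and heartbeat, automatic expiry) can be sketched without Redis. This is a minimal in-memory model under stated assumptions, not the spec's `app/locks.py`: the `AdvisoryLocks` class and its injectable `clock` are hypothetical names used only for illustration, and a plain dict stands in for the `co:lock:{resource}` keys.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class AdvisoryLocks:
    """In-memory stand-in for the Redis lock keys (hypothetical helper)."""

    clock: Callable[[], float] = time.monotonic
    # resource -> (owner role, expiry time on the clock)
    _held: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def _live(self, resource: str) -> Optional[Tuple[str, float]]:
        """Return the current holder, dropping the entry if its TTL has passed."""
        entry = self._held.get(resource)
        if entry is not None and entry[1] <= self.clock():
            del self._held[resource]  # mimics Redis TTL expiry
            return None
        return entry

    def acquire_lock(self, resource: str, role: str, ttl_sec: int = 300) -> dict:
        entry = self._live(resource)
        if entry is not None:  # like SET ... NX failing on an existing key
            held_by, expires_at = entry
            return {
                "acquired": False,
                "held_by": held_by,
                "ttl_remaining": int(expires_at - self.clock()),
            }
        self._held[resource] = (role, self.clock() + ttl_sec)
        return {"acquired": True}

    def release_lock(self, resource: str, role: str) -> dict:
        entry = self._live(resource)
        if entry is None or entry[0] != role:  # only the owner may release
            return {"released": False}
        del self._held[resource]
        return {"released": True}

    def heartbeat_lock(self, resource: str, role: str, ttl_sec: int = 300) -> dict:
        entry = self._live(resource)
        if entry is None or entry[0] != role:  # only the owner may renew
            return {"renewed": False}
        self._held[resource] = (role, self.clock() + ttl_sec)
        return {"renewed": True}
```

Passing a fake `clock` makes TTL expiry testable without sleeping. In the real service the check-and-act steps would have to run atomically inside Redis, as the spec requires; this model is single-threaded and only shows the expected return shapes.
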
--- a/docs/superpowers/specs/2026-06-12-music-pipeline-reliability-design.md
+++ b/docs/superpowers/specs/2026-06-12-music-pipeline-reliability-design.md
@@ -0,0 +1,105 @@
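+# music/YouTube pipeline reliability and recovery: design
+
+> Written 2026-06-12. Automatically retry (transient) step failures in the YouTube automation pipeline, and let permanent failures be resumed manually from the failed step (Telegram [🔄재시도], the retry button). This is the **reliability/recovery** slice of the "music/YouTube pipeline improvements".
+
+## 1. Goal
+
+When a pipeline step (`cover→video→thumb→meta→review→publish`) fails, ① absorb transient failures with automatic retries, and ② for permanent failures, leave the pipeline terminal `failed` and allow **resuming from the failed step with earlier outputs preserved**. Today a single step failure makes the whole pipeline terminal `failed` with no recovery path, so it has to be rebuilt from scratch.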
+
+## 2. Background (current behavior)
+
+- `orchestrator.run_step(pipeline_id, step, feedback)`: creates a `pipeline_jobs` row → runs the step → on success `update_pipeline_state(next_state)`; on exception sets `pipeline_jobs.status='failed'` + pipeline `state='failed'` + `failed_reason="{step}: {e}"`. **No retry or resume.**
+- Always invoked as a BackgroundTask via `bg.add_task(orchestrator.run_step, pid, step, ...)` (start_pipeline→cover, feedback→next_step, publish_pipeline→publish).
+- Outputs of earlier steps (`cover_url`/`video_url`/`thumbnail_url`/`metadata_json`/`review_json`) are **preserved** on the pipeline row → the structure already allows continuing by re-running only the failed step.
+- `state_machine`: STEPS, `_APPROVE_NEXT`, TERMINAL_STATES={published, cancelled, **failed**, awaiting_manual}.
+- `agent-office youtube_publisher.poll_state_changes`: sends a Telegram notification only on a new entry into `*_pending`. **`failed` is silent**, so the user never learns about the failure.
+
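+## 3. Requirements (finalized)
+
+- **Automatic retry**: on step failure, retry with backoff up to `STEP_MAX_RETRIES` (default 2 → 3 attempts total). Terminal `failed` once exhausted.
+  - `_resolve_input` errors (input/config) are not retried (retrying cannot fix them).
+  - **The `publish` step is excluded from automatic retry**: a YouTube upload is non-idempotent (risk of duplicate uploads). One attempt; on failure, terminal immediately.
+  - Retried steps = `cover/video/thumb/meta/review`.
+- **Manual resume**: re-run a terminal `failed` pipeline from the failed step. Earlier outputs are preserved.
+  - Publish resume guard: if `youtube_video_id` is already set, refuse to resume (the original upload may have succeeded → prevent duplicates).
+- **Failure notification**: on permanent failure, send a Telegram notification with an inline `[🔄재시도]` (retry) button (closes the current silent gap).
+- **Out of scope (YAGNI)**: stuck detection (*_running hang / *_pending left unattended). Excluded from this slice because manual retry can recover.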
+
+## 4. Architecture
+
+Three components:
+```
+[music-lab orchestrator] run_step: run the step in a retry loop (except publish) → failed once exhausted
+[music-lab API]          POST /api/music/pipeline/{id}/retry → re-trigger run_step from the failed step
+[agent-office]           youtube_publisher: detect failed → Telegram notification + [🔄재시도]
+                         webhook: ytpub_retry_{pid} → service_proxy.pipeline_retry → music-lab retry
+```
+
+## 5. music-lab details
+
+### 5.1 Automatic retry (`pipeline/orchestrator.py`)
+- Constants: `STEP_MAX_RETRIES = 2`, `STEP_RETRY_BACKOFF_SEC = [5, 15]` (wait between attempts), `NON_RETRY_STEPS = {"publish"}`.
+- Turn the step-execution part of `run_step` (currently the try at lines 31-47) into a loop:
+  ```
+  attempts = 1 if step in NON_RETRY_STEPS else (STEP_MAX_RETRIES + 1)
+  for i in range(attempts):
+      try:
+          result = await _dispatch_step(step, p, ctx, feedback)
+          update_pipeline_job(job_id, status="succeeded")
+          update_pipeline_state(pipeline_id, result["next_state"], **fields)
+          return
+      except Exception as e:
+          last = e
+          if i < attempts - 1:
+              # record "retry {i+1}" via add_log / a pipeline_job note
+              await asyncio.sleep(STEP_RETRY_BACKOFF_SEC[min(i, len(STEP_RETRY_BACKOFF_SEC) - 1)])
+  # retries exhausted
+  update_pipeline_job(job_id, status="failed", error=str(last))
+  update_pipeline_state(pipeline_id, "failed", failed_reason=f"{step}: {last}")
+  ```
+- A `_resolve_input` failure returns early before entering the loop (unchanged; no retry).
+- Make retry attempts visible: record each attempt in `pipeline_jobs` (or put "attempt n/N" in the error message).
+
+### 5.2 Resume endpoint (`main.py`)
+- `POST /api/music/pipeline/{id}/retry`:
+  - 404 if the pipeline is not found.
+  - `state != "failed"` → 409 "재개 불가 (state=...)" (cannot resume).
+  - Identify the failed step: `db.get_last_failed_step(pipeline_id)` (the latest step with status='failed' in pipeline_jobs). If none, fall back to `failed_reason.split(":")[0].strip()`.
+  - If the failed step is `publish` and `youtube_video_id` is already set → 409 "이미 업로드됨 (중복 방지)" (already uploaded; duplicate prevention).
+  - Re-trigger with `bg.add_task(orchestrator.run_step, pid, failed_step)`. Returns `{ok: true, retrying_step}`.
+- New helper `db.get_last_failed_step(pipeline_id) -> str | None`.
+
+## 6. agent-office details
+
+### 6.1 Failure notification (`agents/youtube_publisher.py`)
+- `poll_state_changes`: after handling `_STEP_TITLES` (*_pending), also check pipelines with `state == "failed"`.
+  - For a newly failed pipeline (dedupe with `self._notified_failed: set[int]`, or store ('failed', reason_hash) in the existing dict), send a Telegram message:
+    `⚠️ [{track_title}] 파이프라인 #{id} '{step}' 실패\n사유: {failed_reason}` + inline `[🔄 재시도]` (callback_data `ytpub_retry_{id}`).
+  - Record it as notified after sending.
+- Check whether `service_proxy.list_active_pipelines()` includes failed pipelines; if not, extend it to return them as well (or query separately). (To be confirmed in the plan.)
+
+### 6.2 Retry callback (`telegram/webhook.py`)
+- In `_handle_callback`, add a `callback_id.startswith("ytpub_retry_")` branch → `_handle_ytpub_retry`.
+- `_handle_ytpub_retry`: `pid = int(callback_id.removeprefix("ytpub_retry_"))` → `service_proxy.pipeline_retry(pid)` → reply with the result on Telegram ("재개: {step}" on resume / the rejection reason).
+- New `service_proxy.pipeline_retry(pid)`: `POST {MUSIC_LAB_URL}/api/music/pipeline/{pid}/retry`.
+
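+## 7. Error handling / edge cases
+
+- Container restart during retry backoff → that step's work is lost and the pipeline is stuck in a non-terminal state. Out of scope, but recoverable with a manual [🔄재시도] (safety net).
+- Resume with state≠failed → 409 (prevents duplicate resume and races). Repeated taps on the Telegram [🔄재시도] button are also rejected idempotently.
+- No failed row in pipeline_jobs but state is failed → fall back to the `failed_reason` prefix.
+- Publish resume + `youtube_video_id` present → 409 (prevents duplicate upload).
+- Duplicate notifications: the notified record ensures the same failure is sent only once.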
+
+## 8. Tests
+
+- **orchestrator (retry)**: step fails twice then succeeds → reaches next_state (3 attempts). Fails every time → failed. publish goes to failed right after 1 attempt (no retry). `_resolve_input` failure → failed without retry.
+- **API retry**: failed → run_step re-triggered (verified with a mock) + retrying_step returned. Non-failed → 409. publish + youtube_video_id → 409.
+- **db**: `get_last_failed_step` returns the step of the latest failed job, or None if there is none.
+- **agent-office**: poll sends a Telegram message for a newly failed pipeline (deduplicated). `_handle_ytpub_retry` → calls service_proxy.pipeline_retry + parses pid.
+
+## 9. Affected files
+
+- music-lab: `app/pipeline/orchestrator.py` (retry loop + extract `_dispatch_step`), `app/main.py` (retry endpoint), `app/db.py` (`get_last_failed_step`), `tests/`.
+- agent-office: `app/agents/youtube_publisher.py` (failed notification), `app/telegram/webhook.py` (ytpub_retry dispatch), `app/service_proxy.py` (`pipeline_retry`, plus failed in `list_active_pipelines` if needed), `tests/`.
+- Update the music API table in web-backend/CLAUDE.md + the `service_music.md` memory.
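
The retry policy in the music pipeline spec above (bounded retries with a backoff table, and no retry for the non-idempotent `publish` step) can be isolated from the database calls. The following is a rough sketch, assuming the spec's constant names; `run_with_retry`, its `action` callback, and the injectable `sleep` are hypothetical and are not the orchestrator's real interface.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence

STEP_MAX_RETRIES = 2                # 2 retries → 3 attempts total
STEP_RETRY_BACKOFF_SEC = [5, 15]    # wait before attempt 2, then attempt 3
NON_RETRY_STEPS = {"publish"}       # uploads are non-idempotent


async def run_with_retry(
    step: str,
    action: Callable[[], Awaitable[Any]],
    backoff: Sequence[float] = STEP_RETRY_BACKOFF_SEC,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> dict:
    """Run `action` under the step's retry budget (illustrative helper)."""
    attempts = 1 if step in NON_RETRY_STEPS else STEP_MAX_RETRIES + 1
    last: Exception | None = None
    for i in range(attempts):
        try:
            result = await action()
            return {"ok": True, "result": result, "attempts": i + 1}
        except Exception as exc:  # any step error counts as a failed attempt
            last = exc
            if i < attempts - 1:
                # Clamp the index so a short backoff table is reused at its end.
                await sleep(backoff[min(i, len(backoff) - 1)])
    # Budget exhausted: mirror the spec's "{step}: {error}" failed_reason format.
    return {"ok": False, "failed_reason": f"{step}: {last}", "attempts": attempts}
```

Injecting `sleep` lets a test record the waits instead of actually pausing, which is one way to check the "fails twice then succeeds" case from the test list quickly.
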
--- a/docs/superpowers/specs/2026-06-29-distributed-worker-observability-design.md
+++ b/docs/superpowers/specs/2026-06-29-distributed-worker-observability-design.md
@@ -0,0 +1,207 @@
+# Distributed Worker Observability: design document
+
+> Date: 2026-06-29 · Authoring session: BE (owns web-backend)
+> Three target repos: `web-ai` (workers) · `web-backend` (NAS aggregation/alerts) · `web-ui` (Three.js dashboard)
+
+---
+
+## 1. Problem
+
+Music/video/image/Instagram generation in the NAS backend **delegates the heavy work to WSL2 Docker workers on the Windows AI machine (192.168.45.59)**. When a NAS gateway (`music/video/image/insta-lab`) pushes a job onto a Redis queue (`queue:<svc>-render`), a Windows worker pops it with BLMOVE, processes it, and reports the result via the `/api/internal/<svc>/update` webhook. The trading bot `ai_trade` (:8001) separately does HTTP pulls from NAS stock (:18500).
+
+**Core problem: neither the NAS nor the user can tell whether these distributed workers are alive.**
+- Each worker has a local `/health` endpoint, but it is reachable only from inside the Windows machine.
+- Real incident: the `insta-render` worker was **silently dead for about two weeks, 2026-05-22 to 06-08**, because of a blocking Redis read bug (all slate drafts stopped), and nobody noticed. The root cause was having no way to tell "idle" (no work) from "dead".
+
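+## 2. Goals / Non-goals
+
+**Goals (Phase 1)**
+- G1. The NAS knows the liveness and state of 6 workers (`music/video/image/insta-render` + `task-watcher` + `ai_trade`).
+- G2. Aggregate queue depth, failures (dead-letter), orphaned jobs (processing), and the paused state.
+- G3. Automatically alert on state transitions (down/recovered/accumulating failures) via Telegram.
+- G4. **Visualize the NAS↔Windows pipeline with Three.js** on a new web-ui page `/infra`: an animation of flowing traffic when healthy, and the affected segment shown broken/red on failure.
+
+**Non-goals (deferred to Phase 2 or later)**
+- Remote control (worker restart, queue pause/resume, dead-letter reprocessing): requires controlling the Windows machine, so security and implementation complexity are high.
+- GPU utilization (VRAM) monitoring, automatic stuck-task detection, WebSocket live push.
+- Multi-node scaling (currently one Windows node).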
+
+## 3. Architecture & topology
+
+```
+        web-backend (NAS, 192.168.45.54)              Windows node (192.168.45.59)
+   ┌──────────────────────────────────┐          ┌────────────────────────────────────┐
+   │  music-lab ─┐                     │  ① job   │  WSL2 Docker:                       │
+   │  video-lab ─┤                     │  push    │   ┌─ music-render                   │
+   │  image-lab ─┼─► [Redis queue bus]═╪══════════╪══►├─ video-render  (ReliableQueue)  │
+   │  insta-lab ─┘   queue:*-render    │          │   ├─ image-render                   │
+   │                 queue:paused      │◄═════════╪═══├─ insta-render                   │
+   │                                   │ ② webhook│   └─ task-watcher (paused toggle)   │
+   │  agent-office                     │◄─────────╪──   workers → worker:<name>:heartbeat│
+   │   ├─ node_monitor (aggregator)    │◄─heartbeat   (Redis SET, TTL 45s)              │
+   │   └─ scheduler (1-min alert cron) │          │                                     │
+   │                                   │          │  Windows host (outside WSL):        │
+   │  stock (:18500) ◄── HTTP pull ────╪──────────╪──  ai_trade (:8001) ─ heartbeat ───►│
+   └──────────────┬───────────────────┘          └────────────────────────────────────┘
+                  │ GET /api/agent-office/nodes  (FE polls every 2~3 s)
+                  ▼
+        web-ui  /infra  ←  Three.js pipeline visualization
+```
+
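+**Design basis (assets that already exist)**
+- The workers already connect to the NAS Redis (`redis://192.168.45.54:6379`) for BLMOVE → if the heartbeat is SET in the same Redis, no firewall change or inbound port is needed. The heartbeat keeps beating even under `queue:paused`, so "paused but alive" can be told apart from "dead".
+- `_shared/reliable_queue.py` (ReliableQueue) already leaves `processing:queue:<svc>-render:<worker_id>` lists and `dead_letter:queue:<svc>-render` lists in Redis → the aggregator can read queue depth, failures, and orphaned jobs **with no new worker code**.
+
+**Alternatives not adopted**
+- Placing the aggregator in one of the gateways → "which gateway owns the state of all nodes" is semantically awkward. `agent-office` is the ops brain (it holds Telegram, the scheduler, WebSocket, and service log collection), so it is the semantically right home.
+- NAS→worker HTTP `/health` polling → requires exposing a port per worker and inbound NAS→Windows connections. The Redis heartbeat is one-way (worker→Redis), so it is simpler.
+- Live updates over WebSocket → 2~3 second polling is enough (and simple) for Phase 1. WebSocket is a Phase 2 enhancement.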
+
+## 4. Component design
+
+### 4.1 web-ai: heartbeat producers (owned by the AI session)
+
+**4.1.1 Four render workers (`services/*-render/`)**
+- New shared module `services/_shared/heartbeat.py`:
+  - `async def heartbeat_loop(redis, name, stats, interval=15, ttl=45)`: every `interval` seconds, `SET ... EX ttl` a JSON value on the `worker:<name>:heartbeat` key.
+  - See §5.1 for the value schema. If the worker dies the key expires by TTL → the aggregator judges "missing = dead".
+- Each worker's `main.py` lifespan spawns a `heartbeat_loop` task alongside `worker_loop`.
+- Deriving `state`: `paused` if `queue:paused` is set, `busy` if a job is currently being processed, otherwise `idle`. The in-progress flag and counters (`jobs_done`/`jobs_failed`/`last_job_at`) are tracked in a module-level `stats` object updated by `poll_once`.
+- TTL=45s = 3× the interval (15s) → 1 or 2 missed beats are not misjudged as dead.
+
+**4.1.2 task-watcher (`services/task-watcher/`)**
+- Add the same heartbeat to `watcher_loop`. Publish `state` plus the current `mode` (`trading`/`free`) on `worker:task-watcher:heartbeat` → the dashboard can show the **reason** for paused ("작업중(트레이딩)", i.e. busy: trading).
+
+**4.1.3 ai_trade (`ai_trade/`): a different runtime**
+- ai_trade runs uvicorn directly on the Windows **host** (not WSL Docker) and is not connected to the NAS Redis queues (today it only does HTTP pulls from NAS stock).
+- Change: add the `redis.asyncio` dependency → add a heartbeat task to the `main.py` lifespan → SET `worker:ai_trade:heartbeat` on the same NAS Redis (`192.168.45.54:6379`).
+  - Redis is already reachable from the Windows machine (the render workers BLMOVE from the same host).
+  - The heartbeat logic is ~10 lines, so keep it as a small helper inside `ai_trade` (avoids depending on the `_shared` import path: render workers reach `_shared` through the container PYTHONPATH, while ai_trade runs on the host with a different path). As long as **the contract (key schema) is identical**, no code sharing is needed.
+- `state` means something different: not the render workers' idle/busy/paused but `market_open` (poll_loop active, generating signals) / `market_closed` (market holiday or off-hours idle). **Unrelated to task-watcher's `queue:paused`** (trading is not a pause target).
+- Topology: shown separately as an **HTTP pull pipeline** (ai_trade ⇄ NAS stock :18500), not on the Redis queue bus.
+
+### 4.2 web-backend / agent-office: aggregator + alerts (owned by this BE session)
+
+**4.2.1 Add a Redis client**
+- `agent-office` does not use Redis today → add `redis>=5.0` (asyncio) to `requirements.txt`, and add the `REDIS_URL` environment variable + `depends_on: redis` to the agent-office block in `docker-compose.yml`.
+
+**4.2.2 New `app/node_monitor.py`**
+- Worker registry (constants): each worker's `name`, associated `queue` (if any), `internal webhook` path, and topology link type (`redis-queue` | `http-pull`).
+- `async def collect_status() -> dict`:
+  - Per worker: `GET worker:<name>:heartbeat` → if present, `alive=True` + parse the JSON + `last_beat_age_s = now - ts`; if absent, `alive=False` (dead).
+  - Per render queue: `LLEN queue:<svc>-render` (depth), `LLEN dead_letter:queue:<svc>-render`, and the in-flight count from scanning `processing:queue:<svc>-render:*` keys.
+  - `GET queue:paused` + TTL → paused flag + reason (the mode from the task-watcher heartbeat).
+  - Redis connection failure → `redis_ok=False` (every segment degrades).
+  - Derive link status (§5.2).
+- See §5.2 for the response schema.
+
+**4.2.3 Endpoint**
+- `GET /api/agent-office/nodes` → `collect_status()`. nginx already routes `/api/agent-office/` → **no nginx change needed**.
+
+**4.2.4 Alert cron (scheduler)**
+- `_run_node_health_check` (APScheduler, every minute):
+  - Compare with the previous state `_node_state` (in-memory dict):
+    - `alive → dead`: 🔴 `<name> 워커 다운 (last beat Xs ago)` (worker down)
+    - `dead → alive`: 🟢 `<name> 워커 복구` (worker recovered)
+    - `dead_letter` count newly exceeds the threshold (`NODE_ALERT_DEADLETTER_THRESHOLD`, default 1): ❌ `<queue> 실패 누적 N건` (N accumulated failures)
+  - Prevent spam with the `_notified` pattern (reused from the existing `youtube_publisher.poll_state_changes`); use a set difference so a re-alert is possible after recovery.
+  - Telegram delivery reuses the existing agent-office bot.
+
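+### 4.3 web-ui: Three.js dashboard (owned by the FE session)
+
+- New dependencies: `three` + `@react-three/fiber` + `@react-three/drei` (the codebase is React, so r3f is the idiomatic choice).
+- New route `/infra` (Router.jsx) + Nav entry.
+- `pages/infra/InfraMonitor.jsx`:
+  - r3f `<Canvas>` topology: NAS on the left (gateway sub-nodes) / Redis queue bus in the center (glowing core) / Windows node on the right (worker sub-nodes). ai_trade gets its own HTTP-pull pipeline.
+  - Pipelines (tubes) between nodes + per-state materials/animations (§6).
+  - `useNodeStatus` hook: polls `GET /api/agent-office/nodes` every 2~3 seconds → maps the status to a visual state (add a helper to `src/api.js`).
+  - **2D fallback**: a toggle to a card/table summary view for browsers without WebGL and for mobile.
+  - Activate the `designer` skill at actual implementation time (forbidden during brainstorming).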
+
+## 5. Contracts to lock
+
+> For the three sessions to work independently in parallel, only these two schemas need to be fixed.
+
+### 5.1 Heartbeat key schema
+
+- **Key**: `worker:<name>:heartbeat` (name ∈ `music-render`, `video-render`, `image-render`, `insta-render`, `task-watcher`, `ai_trade`)
+- **Value** (JSON string), `SET ... EX 45`:
+```json
+{
+  "name": "image-render",
+  "kind": "render",          // "render" | "watcher" | "trader"
+  "state": "idle",           // render: idle|busy|paused / watcher: trading|free / trader: market_open|market_closed
+  "ts": "2026-06-29T12:34:56Z",   // UTC ISO8601 (time the heartbeat was sent)
+  "last_job_at": "2026-06-29T12:30:00Z",  // nullable
+  "jobs_done": 42,
+  "jobs_failed": 1,
+  "mode": "free"             // task-watcher only (reason for paused); may be omitted otherwise
+}
+```
+
+### 5.2 `/api/agent-office/nodes` response schema
+```json
+{
+  "redis_ok": true,
+  "paused": false,
+  "paused_reason": "trading",     // task-watcher mode when queue:paused is set
+  "generated_at": "2026-06-29T12:34:57Z",
+  "workers": [
+    {
+      "name": "image-render", "kind": "render",
+      "alive": true, "state": "idle", "last_beat_age_s": 3,
+      "queue_depth": 0, "dead_letter": 0, "processing": 0,
+      "jobs_done": 42, "jobs_failed": 1, "last_job_at": "2026-06-29T12:30:00Z"
+    }
+  ],
+  "links": [
+    { "from": "nas", "to": "image-render", "type": "redis-queue", "status": "healthy" },
+    { "from": "ai_trade", "to": "nas-stock", "type": "http-pull", "status": "healthy" }
+  ]
+}
+```
+- `link.status` ∈ `healthy` | `paused` | `down` | `degraded`. Derivation: worker dead → `down`; paused → `paused`; dead_letter>0 → `degraded`; redis_ok=false → every link `down`.
+
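+## 6. Visualization states (Three.js)
+
+| State | Pipeline (tube) | Node |
+|------|------------------|------|
+| **Healthy idle** | Cyan/green; particles flow slowly in a NAS→worker→NAS loop | Green glow + queue depth/processed count HUD |
+| **Healthy busy** | Particles flow fast | "처리 중 N" (processing N) |
+| **Paused** | Amber; particles slow down/stop | "⏸ 작업중(트레이딩)" label (busy: trading) |
+| **Failure: dead / link down** | Red; flow stops; sparks/break at the failure point | Red + ⚠ warning, "last beat Xs ago" |
+| **Accumulated failures, dead-letter>0** | ❌ badge on that tube | Dead-letter count highlighted |
+| **Redis/aggregator down** | Entire central bus red | "집계 서버 연결 끊김" overlay (aggregator connection lost) |
+
+- ai_trade's HTTP-pull pipeline is drawn with particles in the pull direction (ai_trade→NAS stock) rather than as a queue flow. `market_closed` uses the same tone as healthy idle (a closed market is not a failure).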
+
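+## 7. Error handling
+
+- Heartbeat TTL expiry = dead (the authoritative signal). Even when the queue is empty and there is no work, a live heartbeat is correctly judged alive (prevents a repeat of the two-week silent incident).
+- Redis down → `/nodes` returns `redis_ok=false` (not a 500) → the dashboard shows every segment as degraded.
+- agent-office down → FE polling fails → "집계 서버 연결 끊김" (aggregator connection lost) overlay.
+- The aggregator is read-only (never writes to Redis) → zero impact on worker behavior.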
+
+## 8. Tests
+
+- **web-ai**: unit tests for `heartbeat.py` (fakeredis/mock): send interval, TTL, state transitions, counters. Separate tests for the ai_trade heartbeat.
+- **web-backend**: `node_monitor.collect_status` tests (mock redis: key present/expired/queue depth/dead-letter cases) + alert transition tests (alive→dead→alive, dead-letter increase). TDD.
+- **web-ui**: the `InfraMonitor` component renders with mock status + unit tests for the status→color mapping (r3f at render-smoke level only).
+
+## 9. Phasing
+
+- **Phase 1 (this entire spec)**: heartbeats for 6 workers (4 render + task-watcher + ai_trade) / `/nodes` API / Telegram alerts / Three.js `/infra` dashboard.
+- **Phase 2 (follow-up)**: GPU utilization (visualizing contention for the 16GB VRAM), stuck-task detection, WebSocket live push, remote control (worker restart, pause/resume, dead-letter reprocessing).
+
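+## 10. Session split & collaboration (co-gahusb)
+
+- **Ownership**: BE (this session)=web-backend, AI session=web-ai, FE session=web-ui. Each commits only to its own repo.
+- **Gate first**: finalize and share the two contracts in §5 (heartbeat key schema + `/nodes` response schema) first → then the three sessions proceed in parallel.
+- **Shared-resource locks**: agent-office dependency/compose changes take the `compose` lock; nginx is unchanged (no lock needed). Deploys take the `nas-deploy` lock.
+- BE work: add redis to agent-office + `node_monitor.py` + `/nodes` + alert cron + record this memory. AI/FE work is handed out as co-gahusb tasks.
+
+## 11. Memory update plan
+
+- Write a new cross-cutting memory `infra_distributed_workers.md`: queue contract / webhook contract / ReliableQueue keys / heartbeat key schema / task-watcher paused / node_monitor, `/nodes`, alerts. Register it in the `MEMORY.md` index.
+- Cross-link the added heartbeat/observability facts from the related service memories (`service_video/image/music/insta`).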
+```
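
The `link.status` derivation and the `last_beat_age_s` field from the observability spec above can be illustrated with two small pure functions. This is only a sketch: the spec lists the rules but not their precedence, so the order used here (Redis down or worker dead first, then paused, then dead-letter) is an assumption, as is skipping `paused` for `http-pull` links on the grounds that the spec says `queue:paused` does not apply to `ai_trade`. The function names are hypothetical.

```python
from datetime import datetime, timezone


def beat_age_s(ts: str, now: datetime) -> int:
    """Seconds since the heartbeat's UTC ISO8601 `ts` (e.g. ...T12:34:56Z)."""
    # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
    beat = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return int((now - beat).total_seconds())


def link_status(
    worker: dict,
    *,
    redis_ok: bool,
    paused: bool,
    link_type: str = "redis-queue",
) -> str:
    """Derive healthy | paused | down | degraded for one worker's link."""
    if not redis_ok or not worker.get("alive", False):
        return "down"        # no Redis, or heartbeat key missing/expired
    if paused and link_type == "redis-queue":
        return "paused"      # assumed: queue:paused only affects queue links
    if worker.get("dead_letter", 0) > 0:
        return "degraded"    # alive, but failures have accumulated
    return "healthy"
```

Keeping these as pure functions over the already-collected status makes the alive→dead→alive and dead-letter cases in the test list straightforward to cover without a Redis mock.
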
--- a/docs/superpowers/specs/2026-07-02-realtime-trade-alerts-design.md
+++ b/docs/superpowers/specs/2026-07-02-realtime-trade-alerts-design.md
@@ -0,0 +1,220 @@
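+# Real-time Trade Alerts: design spec
+
+- Date: 2026-07-02
+- Status: design approved (awaiting user review)
+- Sessions involved: BE (web-backend, leads this spec) · AI (web-ai worker) · FE (web-ui tab)
+
+## 1. Goal
+
+While the market is open (**including extended hours**), analyze prices against thresholds in real time and, when a condition is met, send a **buy/sell alert** over Telegram to both **the user and the user's wife**. Technical analysis (TA) runs in a **docker worker on the Windows PC**.
+
+Until now these decisions ran only at EOD (once a day):
+- Buy candidates = screener (weekdays 16:30) · sell/hold advisory = holdings_intel (08:30/16:50).
+
+The core of this work = **move the same decisions to a 1-minute real-time cycle during market hours (+ extended hours), with an alert as soon as a condition is met**.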
+
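+## 2. Confirmed requirements (user decisions)
+
+| Item | Decision |
+|------|------|
+| Buy universe | **watchlist (user-managed) ∪ the day's screener candidates** |
+| Buy trigger | **Automatic TA signals** (no manual price targets) |
+| Sell trigger | **Existing exit rules + trailing stop** |
+| Monitoring interval/sessions | **1-minute polling** · pre-market extended hours 08:30–09:00 · regular session 09:00–15:30 · after-hours single-price session 16:00–18:00 |
+| Deduplication | **State transition (edge-triggered)**: alert only on a false→true transition, no alert while it stays true, then re-arm |
+| Watchlist management | **Both Telegram bot commands and a web-ui tab** |
+| Recipients | **Both the user and the user's wife** (buy and sell) |
+| TA compute location | **New Windows WSL2 docker worker** |
+| Trailing stop default | **−10%** from the high during the holding period (parameterized) |
+| Buy signals | Support pullback (MA20/50) · breakout (prior high/52-week) · RSI oversold bounce |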
+
+## 3. Architecture
+
+```
+[Windows WSL2 docker] trade-monitor worker  (web-ai · AI session)
+  1-minute loop (KST session gating)
+   ① GET  NAS  /api/webai/trade-alert/monitor-set   (X-WebAI-Key)
+   ② KIS real-time/extended-hours quotes + minute/daily bars → TA computation
+   ③ Evaluate conditions → current firing set F = {(ticker, kind, condition)}
+   ④ POST NAS  /api/webai/trade-alert/report {firing: F}   (X-WebAI-Key)
+   ⑤ heartbeat: worker:trade-monitor:heartbeat (EX45, joins observability)
+        │
+        ▼
+[NAS] stock (:18500 · web-backend · BE)
+   • watchlist · alert_state (edge dedup, persistent) · alert_history · holding high-water
+   • Assemble monitor-set (watchlist ∪ screener candidates ∪ holdings) + session/holiday gating
+   • Receive report → edge diff (F vs previously firing) → push new edges to agent-office
+        │ (alert_state is updated only when the Telegram send succeeds)
+        ▼
+[NAS] agent-office (:18900 · web-backend · BE)
+   • POST /api/agent-office/stock/trade-alert → Telegram (user + wife)
+   • Bot commands /watch /unwatch /watchlist → stock watchlist CRUD
+   • Alerts join the activity feed
+
+[web-ui] watchlist tab (FE session): watchlist CRUD + alert history view
+```
+
+**Design principles**
+- TA/condition evaluation = Windows (requirement). **Edge-dedup state = persisted on the NAS** → no re-alert spam even when the worker restarts (the youtube_publisher lesson applied again).
+- The worker holds **no** dedup state. Each cycle it only reports "the full current firing set" → the NAS diffs it (single source of truth).
+- The worker's only external channel is **NAS stock** (reuses the `X-WebAI-Key` of the existing ai_trade↔stock link). Telegram delivery is a stock→agent-office push (the existing realestate→agent-office/notify pattern).
+
+## 4. DB schema (stock.db)
+
+```sql
+-- Buy-side watchlist (user-managed)
+CREATE TABLE IF NOT EXISTS watchlist (
+  ticker      TEXT PRIMARY KEY,
+  name        TEXT,
+  note        TEXT,
+  params_json TEXT NOT NULL DEFAULT '{}',   -- per-ticker condition overrides (optional)
+  added_at    TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
+);
+
+-- Edge-dedup state (persistent; the key to preventing restart spam)
+CREATE TABLE IF NOT EXISTS trade_alert_state (
+  ticker           TEXT NOT NULL,
+  kind             TEXT NOT NULL,   -- 'buy' | 'sell'
+  condition        TEXT NOT NULL,   -- e.g. buy_ma20_pullback, sell_trailing_stop
+  currently_firing INTEGER NOT NULL DEFAULT 0,
+  first_fired_at   TEXT,
+  last_fired_at    TEXT,
+  last_seen_at     TEXT,
+  PRIMARY KEY (ticker, kind, condition)
+);
+
+-- Alert history
+CREATE TABLE IF NOT EXISTS trade_alert_history (
+  id          INTEGER PRIMARY KEY AUTOINCREMENT,
+  ticker      TEXT NOT NULL,
+  name        TEXT,
+  kind        TEXT NOT NULL,
+  condition   TEXT NOT NULL,
+  price       REAL,
+  detail_json TEXT,
+  fired_at    TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
+);
+CREATE INDEX IF NOT EXISTS idx_tah_fired ON trade_alert_history(fired_at DESC);
+```
+
+The holding-period high (for the trailing stop), i.e. the high-water mark, is either computed as a lookback max from the existing `krx_daily_prices` or kept in a separate column; the implementation plan will decide (v1: max of daily highs since the position was first observed, otherwise the last N days).
+
+## 5. Contracts: to be locked across repos
+
+### 5.1 NAS stock ↔ Windows worker (X-WebAI-Key)
+
+`GET /api/webai/trade-alert/monitor-set`
+```json
+{
+  "session": "pre | regular | after | closed",
+  "as_of": "2026-07-02T09:01:00+09:00",
+  "buy_targets":  [{"ticker":"005930","name":"삼성전자","source":"watch|screener","params":{}}],
+  "sell_targets": [{"ticker":"000660","name":"SK하이닉스","avg_price":180000,"qty":10,
+                    "holding_high":210000,"params":{}}],
+  "buy_params":  {"rsi_oversold":30,"breakout_vol_mult":1.5,"pullback_pct":0.02},
+  "exit_params": {"stop_pct":0.08,"take_pct":0.25,"trailing_pct":0.10}
+}
+```
+- When `session=closed` the worker sleeps without calling KIS.
+
+`POST /api/webai/trade-alert/report`
+```json
+{ "as_of":"2026-07-02T09:01:00+09:00",
+  "firing":[ {"ticker":"005930","kind":"buy","condition":"buy_ma20_pullback",
+              "price":71500,"detail":{"ma20":71200,"rsi":34}} ] }
+```
+Response: `{ "new_alerts": <int>, "cleared": <int> }`
+- The NAS diffs `firing` against the `trade_alert_state` rows with `currently_firing=1` → only new edges go to Telegram.
+
+### 5.2 stock → agent-office (internal)
+
+`POST /api/agent-office/stock/trade-alert`
+```json
+{ "alerts":[ {"ticker":"005930","name":"삼성전자","kind":"buy",
+              "condition":"buy_ma20_pullback","price":71500,
+              "detail":{...},"fired_at":"..."} ] }
+```
+→ agent-office sends a Telegram message to the user and the user's wife. (realestate/notify pattern)
+
+### 5.3 stock watchlist CRUD (web-ui + agent-office bot)
+- `GET /api/stock/watchlist`
+- `POST /api/stock/watchlist` `{ticker, note?}`
+- `DELETE /api/stock/watchlist/{ticker}`
+- `GET /api/stock/trade-alerts?days=N` (history, for web-ui)
+
+### 5.4 Worker heartbeat (joins observability)
+`worker:trade-monitor:heartbeat` EX45, JSON value `{name:"trade-monitor",kind:"trader",state:"market_open|market_closed|idle",ts,last_alert_at,...}`. Added to workers[] in `/api/agent-office/nodes`.
+
+## 6. Alert conditions (computed by the Windows worker)
+
+**Buy** (buy_targets):
+- `buy_ma20_pullback`: MA20>MA50>MA200 alignment + the low comes within `pullback_pct` of MA20/50, then the close rebounds
+- `buy_breakout`: close > (prior N-day high or 52-week high) + volume > `breakout_vol_mult` × 20-day average
+- `buy_rsi_bounce`: RSI(14) drops below `rsi_oversold` and then crosses back above it **within the bar series** (an upward cross of 30 on recent bars). The worker is stateless: it computes the cross from bar data every cycle (no cross-cycle memory needed)
+
+**Sell** (sell_targets):
+- `sell_stop_loss`: (price−avg)/avg ≤ −`stop_pct`
+- `sell_ma_break`: close < MA50 (severe: < MA200)
+- `sell_take_profit`: (price−avg)/avg ≥ `take_pct`
+- `sell_climax`: exhaustion after a sharp run-up (ported from the holdings_intel climax logic)
+- `sell_trailing_stop`: price ≤ holding_high × (1 − `trailing_pct`)
+
+## 7. Data flow: edge dedup (NAS)
+
+```
+On each 1-minute report:
+  F        = the set in report.firing
+  prev     = SELECT (ticker,kind,condition) FROM trade_alert_state WHERE currently_firing=1
+  new_edge = F − prev
+  cleared  = prev − F
+  for e in new_edge:
+      ok = agent_office.send_trade_alert(e)          # Telegram
+      if ok:
+          INSERT trade_alert_history(e)
+          UPSERT trade_alert_state(e, currently_firing=1, first_fired_at/last_fired_at=now)
+      # on failure the state is not updated → retried next cycle
+  for c in cleared:
+      UPDATE trade_alert_state SET currently_firing=0 WHERE (ticker,kind,condition)=c   # re-arm
+  UPDATE last_seen_at for all F
+```
+- Persistent `trade_alert_state` → no re-alert spam across worker or NAS restarts.
+- On Telegram failure the edge is not marked firing → retry is guaranteed (the node_monitor "update only on success" idiom).
+
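+## 8. Session/holiday gating
+
+The NAS decides the `monitor-set.session` field from KST time + `holidays.json` (`is_market_open`):
+- pre 08:30–09:00 / regular 09:00–15:30 / after 16:00–18:00 → any other time or a holiday = closed.
+- The worker sleeps when `closed`. (Blocks unnecessary KIS calls and alerts)
+
+## 9. Error handling
+
+- Worker: KIS failure → skip that cycle and retry the next minute; failures are isolated per ticker. Liveness is exposed via heartbeat.
+- NAS: worker auth via `X-WebAI-Key`. Telegram failure → state not updated. `report` is idempotent (resending the same F is harmless).
+- If the worker goes down, alerts stop → detected by the node_monitor alerts (existing observability).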
+
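+## 10. Test strategy (BE, TDD)
+
+- watchlist CRUD (add/duplicate/delete/list)
+- monitor-set assembly (watchlist ∪ screener ∪ holdings, session gating, holidays)
+- **Edge diff logic**: alert only on new edges / no alert while it stays true / re-alert on re-fire after clearing / persistence across restarts (persistent state)
+- State not updated when the Telegram send fails (retry)
+- alert_history recording / trade-alerts query
+- agent-office: /watch, /unwatch, /watchlist bot commands → stock CRUD; trade-alert notify → Telegram format (user + wife)
+- webai contract endpoints (monitor-set/report): schema and auth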
+
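+## 11. Work split
+
+| repo | Session | Deliverables | Status |
+|------|------|--------|------|
+| **web-backend** (stock + agent-office) | **BE (this session)** | DB, watchlist, edge dedup, webai contract, Telegram, bot | Implemented now |
+| **web-ai** (`services/trade-monitor/` WSL2 docker) | AI session | 1-minute loop, KIS, TA, condition evaluation, report, heartbeat | Contract handed over |
+| **web-ui** (watchlist tab) | FE session | watchlist CRUD, conditions, history view | Contract handed over |
+
+- Lock the contracts (§5) via co-gahusb, then run the 3 sessions in parallel.
+- Worker rebuilds run on the local docker (by the user): `wsl -d Ubuntu-24.04 -- docker compose up -d --build trade-monitor`.
+
+## 12. Out of scope (YAGNI / follow-up)
+- Automatic execution of real orders (alerts only, no KIS orders).
+- KIS WebSocket real-time ticks (1-minute polling is enough).
+- Per-ticker manual price targets (automatic TA only this time).
+- Backtesting/performance tracking (follow-up slice).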
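
The edge-triggered dedup in §7 of the trade-alert spec above reduces to two set differences plus an "update state only when the send succeeds" rule. Below is a rough sketch over in-memory sets, assuming edges are `(ticker, kind, condition)` tuples; `apply_report` and the `send` callback are hypothetical names, and the real implementation would persist the state in `trade_alert_state` rather than return it.

```python
from typing import Callable, Set, Tuple

Edge = Tuple[str, str, str]  # (ticker, kind, condition)


def apply_report(
    prev_firing: Set[Edge],
    firing: Set[Edge],
    send: Callable[[Edge], bool],
) -> Tuple[Set[Edge], dict]:
    """Return the next firing state and the report response counters."""
    new_edge = firing - prev_firing   # false → true transitions
    cleared = prev_firing - firing    # true → false transitions (re-arm)

    state = set(prev_firing) - cleared
    sent = 0
    for edge in sorted(new_edge):     # sorted only for a deterministic order
        if send(edge):                # e.g. the Telegram push succeeded
            state.add(edge)
            sent += 1
        # On failure the edge stays out of `state`, so the next cycle
        # sees it as new again and retries the alert.
    return state, {"new_alerts": sent, "cleared": len(cleared)}
```

Because the function takes the previous state as input and returns the next one, resending the same report produces no new alerts, which matches the spec's statement that `report` is idempotent.
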
--- a/music-lab/app/db.py
+++ b/music-lab/app/db.py
@@ -1100,6 +1100,19 @@ def get_pipeline(pid: int) -> Optional[Dict[str, Any]]:
    return _parse_pipeline_row(row)


+def delete_pipeline(pid: int) -> bool:
+    """Hard-delete a pipeline and its child rows (pipeline_feedback, pipeline_jobs).
+
+    SQLite FKs are not enforced here, so child rows are deleted explicitly first.
+    Returns True if the pipeline existed, False otherwise.
+    """
+    with _conn() as conn:
+        conn.execute("DELETE FROM pipeline_feedback WHERE pipeline_id = ?", (pid,))
+        conn.execute("DELETE FROM pipeline_jobs WHERE pipeline_id = ?", (pid,))
+        cur = conn.execute("DELETE FROM video_pipelines WHERE id = ?", (pid,))
+        return cur.rowcount > 0
+
+
 def update_pipeline_state(pid: int, state: str, **fields) -> None:
    """파이프라인 state를 갱신하고 옵션 컬럼을 함께 업데이트한다.

@@ -1135,6 +1148,21 @@ def list_pipelines(active_only: bool = False) -> List[Dict[str, Any]]:
    return [_parse_pipeline_row(r) for r in rows]


+def list_pipelines_by_state(state: str) -> List[Dict[str, Any]]:
+    """Return only pipelines in the given state (e.g. 'failed')."""
+    sql = """
+        SELECT vp.*, ml.title AS track_title, cj.title AS compile_title
+        FROM video_pipelines vp
+        LEFT JOIN music_library ml ON ml.id = vp.track_id
+        LEFT JOIN compile_jobs cj ON cj.id = vp.compile_job_id
+        WHERE vp.state = ?
+        ORDER BY vp.created_at DESC
+    """
+    with _conn() as conn:
+        rows = conn.execute(sql, (state,)).fetchall()
+    return [_parse_pipeline_row(r) for r in rows]
+
+
 def increment_feedback_count(pid: int, step: str) -> int:
    """Atomically increment feedback_count_per_step.<step> by 1 and return the new value.

@@ -1220,6 +1248,18 @@ def list_pipeline_jobs(pid: int) -> List[Dict[str, Any]]:
    return [dict(r) for r in rows]


+def get_last_failed_step(pid: int) -> Optional[str]:
+    """Step of the pipeline's most recent pipeline_job with status='failed', or None."""
+    with _conn() as conn:
+        row = conn.execute(
+            "SELECT step FROM pipeline_jobs "
+            "WHERE pipeline_id = ? AND status = 'failed' "
+            "ORDER BY id DESC LIMIT 1",
+            (pid,),
+        ).fetchone()
+    return row["step"] if row else None
+
+
 def get_youtube_setup() -> Dict[str, Any]:
    """Return the default single youtube_setup row; seed it and re-query if missing."""
    with _conn() as conn:
--- a/music-lab/app/main.py
+++ b/music-lab/app/main.py
@@ -1030,7 +1030,12 @@ def create_pipeline(req: PipelineCreate):

@app.get("/api/music/pipeline")
 def list_pipelines_endpoint(status: str = "all"):
-    pipelines = _db_module.list_pipelines(active_only=(status == "active"))
+    if status == "active":
+        pipelines = _db_module.list_pipelines(active_only=True)
+    elif status == "failed":
+        pipelines = _db_module.list_pipelines_by_state("failed")
+    else:
+        pipelines = _db_module.list_pipelines(active_only=False)
    return {"pipelines": pipelines}


@@ -1128,6 +1133,39 @@ def cancel_pipeline(pid: int):
    return {"ok": True}


+@app.delete("/api/music/pipeline/{pid}")
+def delete_pipeline_endpoint(pid: int):
+    """Hard-delete the pipeline row (removed from the full list entirely). 404 if missing."""
+    if not _db_module.delete_pipeline(pid):
+        raise HTTPException(404)
+    return {"ok": True, "deleted": pid}
+
+
+@app.post("/api/music/pipeline/{pid}/retry", status_code=202)
+async def retry_pipeline(pid: int, bg: BackgroundTasks):
+    from .pipeline.state_machine import STEPS
+    p = _db_module.get_pipeline(pid)
+    if not p:
+        raise HTTPException(404)
+    if p["state"] != "failed":
+        raise HTTPException(409, f"재개 불가 (state={p['state']})")
+    failed_step = _db_module.get_last_failed_step(pid)
+    if not failed_step:
+        reason = p.get("failed_reason") or ""
+        failed_step = reason.split(":", 1)[0].strip() or None
+    if not failed_step:
+        raise HTTPException(409, "실패 step을 판별할 수 없음")
+    # Fix 3: return 409 if failed_step is not one of the known STEPS
+    if failed_step not in STEPS:
+        raise HTTPException(409, "실패 step 판별 불가")
+    if failed_step == "publish" and p.get("youtube_video_id"):
+        raise HTTPException(409, "이미 업로드됨 (중복 방지)")
+    # Fix 1: move the state to 'retrying' right before bg.add_task so a concurrent retry gets 409
+    _db_module.update_pipeline_state(pid, "retrying")
+    bg.add_task(orchestrator.run_step, pid, failed_step)
+    return {"ok": True, "retrying_step": failed_step}
+
+
@app.post("/api/music/pipeline/{pid}/publish", status_code=202)
 async def publish_pipeline(pid: int, bg: BackgroundTasks):
    p = _db_module.get_pipeline(pid)
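
The failed-step resolution in `retry_pipeline` above can be read as a small pure function. This is a minimal sketch under stated assumptions: `resolve_failed_step` and `KNOWN_STEPS` are hypothetical names, and the step list is taken from the names handled by `_dispatch_step` in this change rather than from `state_machine.STEPS`, whose contents are not shown here.

```python
from typing import Optional, Sequence

# Assumption: mirrors the step names dispatched in orchestrator._dispatch_step.
KNOWN_STEPS = ("cover", "video", "thumb", "meta", "review", "publish")


def resolve_failed_step(
    last_failed_job_step: Optional[str],
    failed_reason: Optional[str],
    steps: Sequence[str] = KNOWN_STEPS,
) -> Optional[str]:
    """Pick the step to retry, or None when it cannot be determined.

    Prefers the step of the latest failed job row; otherwise falls back to the
    prefix of a reason shaped like "<step>: <error>". Anything that is not a
    known step is rejected, which is the case the endpoint maps to 409.
    """
    step = last_failed_job_step
    if not step:
        step = (failed_reason or "").split(":", 1)[0].strip() or None
    return step if step in steps else None
```

A reason such as `"ValueError: ..."` has a prefix that is not a step name, so it resolves to `None` instead of being passed on to the orchestrator.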
--- a/music-lab/app/pipeline/orchestrator.py
+++ b/music-lab/app/pipeline/orchestrator.py
@@ -11,6 +11,10 @@ from .gradient import make_gradient_with_title

 logger = logging.getLogger("music-lab.orchestrator")

+STEP_MAX_RETRIES = 2          # extra retries (total attempts = this + 1)
+STEP_RETRY_BACKOFF_SEC = [5, 15]
+NON_RETRY_STEPS = {"publish"}
+

 async def run_step(pipeline_id: int, step: str, feedback: str = "") -> None:
    """Run a step, persist the result to the DB, then move to *_pending or the next step.
@@ -28,27 +32,35 @@ async def run_step(pipeline_id: int, step: str, feedback: str = "") -> None:
        db.update_pipeline_state(pipeline_id, "failed", failed_reason=f"{step}: {e}")
        return

+    attempts = 1 if step in NON_RETRY_STEPS else (STEP_MAX_RETRIES + 1)
+    last_err = None
+    for i in range(attempts):
        try:
-        if step == "cover":
-            result = await _run_cover(p, ctx, feedback)
-        elif step == "video":
-            result = await _run_video(p, ctx)
-        elif step == "thumb":
-            result = await _run_thumb(p, ctx, feedback)
-        elif step == "meta":
-            result = await _run_meta(p, ctx, feedback)
-        elif step == "review":
-            result = await _run_review(p, ctx)
-        elif step == "publish":
-            result = await _run_publish(p, ctx)
-        else:
-            raise ValueError(f"unknown step: {step}")
+            result = await _dispatch_step(step, p, ctx, feedback)
            db.update_pipeline_job(job_id, status="succeeded")
            db.update_pipeline_state(pipeline_id, result["next_state"], **result.get("fields", {}))
+            return
        except Exception as e:
-        logger.exception("step %s failed for pipeline %s", step, pipeline_id)
-        db.update_pipeline_job(job_id, status="failed", error=str(e))
-        db.update_pipeline_state(pipeline_id, "failed", failed_reason=f"{step}: {e}")
+            last_err = e
+            logger.exception(
+                "step %s 실패 (pipeline %s, attempt %d/%d)", step, pipeline_id, i + 1, attempts
+            )
+            if i < attempts - 1:
+                backoff = STEP_RETRY_BACKOFF_SEC[min(i, len(STEP_RETRY_BACKOFF_SEC) - 1)] if STEP_RETRY_BACKOFF_SEC else 0
+                await asyncio.sleep(backoff)
+    db.update_pipeline_job(job_id, status="failed", error=str(last_err))
+    db.update_pipeline_state(pipeline_id, "failed", failed_reason=f"{step}: {last_err}")
+
+
+async def _dispatch_step(step: str, p: dict, ctx: dict, feedback: str) -> dict:
+    """Dispatch to the runner function for the given step name."""
+    if step == "cover":   return await _run_cover(p, ctx, feedback)
+    if step == "video":   return await _run_video(p, ctx)
+    if step == "thumb":   return await _run_thumb(p, ctx, feedback)
+    if step == "meta":    return await _run_meta(p, ctx, feedback)
+    if step == "review":  return await _run_review(p, ctx)
+    if step == "publish": return await _run_publish(p, ctx)
+    raise ValueError(f"unknown step: {step}")


 def _resolve_input(p: dict) -> dict:
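
The retry loop added to `run_step` can be isolated as follows. This is a minimal sketch, not the orchestrator itself: `run_with_retry` and `backoff_for` are hypothetical helpers, the constants only mirror `STEP_MAX_RETRIES` and `STEP_RETRY_BACKOFF_SEC`, and the DB updates are left out.

```python
import asyncio

MAX_RETRIES = 2        # extra attempts after the first one (assumed, mirrors STEP_MAX_RETRIES)
BACKOFF_SEC = [5, 15]  # delay before retry 1 and retry 2 (assumed, mirrors STEP_RETRY_BACKOFF_SEC)


def backoff_for(i, schedule):
    """Delay after failed attempt index i; clamps to the last entry, 0 for an empty schedule."""
    if not schedule:
        return 0
    return schedule[min(i, len(schedule) - 1)]


async def run_with_retry(action, *, retryable=True, schedule=BACKOFF_SEC, sleep=asyncio.sleep):
    """Await action() up to MAX_RETRIES + 1 times, or exactly once when not retryable."""
    attempts = MAX_RETRIES + 1 if retryable else 1
    last_err = None
    for i in range(attempts):
        try:
            return await action()
        except Exception as e:  # broad on purpose, like the loop it illustrates
            last_err = e
            if i < attempts - 1:  # no sleep after the final attempt
                await sleep(backoff_for(i, schedule))
    raise last_err
```

Passing `retryable=False` corresponds to the `NON_RETRY_STEPS` case: a step such as `publish` has side effects, so it gets a single attempt and then fails.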
--- a/music-lab/pytest.ini
+++ b/music-lab/pytest.ini
@@ -1,4 +1,4 @@
 [pytest]
 testpaths = tests
-pythonpath = .
+pythonpath = . ..
 asyncio_mode = auto
--- a/music-lab/tests/test_pipeline_endpoints.py
+++ b/music-lab/tests/test_pipeline_endpoints.py
@@ -52,6 +52,19 @@ def test_list_pipelines_active_filter(client):
    assert all(p["state"] != "published" for p in r.json()["pipelines"])


+def test_list_pipelines_failed_filter(client):
+    """The status=failed filter returns only pipelines with state='failed'."""
+    # create a failed pipeline
+    pid_f = client.post("/api/music/pipeline", json={"track_id": 1}).json()["id"]
+    db.update_pipeline_state(pid_f, "failed", failed_reason="cover: oops")
+    r = client.get("/api/music/pipeline?status=failed")
+    assert r.status_code == 200
+    pipelines = r.json()["pipelines"]
+    assert len(pipelines) == 1
+    assert pipelines[0]["state"] == "failed"
+    assert pipelines[0]["id"] == pid_f
+
+
 def test_feedback_reject_records_feedback_and_increments_count(client):
    pid = client.post("/api/music/pipeline", json={"track_id": 1}).json()["id"]
    db.update_pipeline_state(pid, "cover_pending")
@@ -92,6 +105,29 @@ def test_cancel_pipeline(client):
    assert db.get_pipeline(pid)["state"] == "cancelled"


+def test_delete_pipeline_removes_from_db(client):
+    pid = client.post("/api/music/pipeline", json={"track_id": 1}).json()["id"]
+    r = client.request("DELETE", f"/api/music/pipeline/{pid}")
+    assert r.status_code == 200
+    assert r.json()["ok"] is True
+    assert db.get_pipeline(pid) is None
+    all_ids = [p["id"] for p in client.get("/api/music/pipeline?status=all").json()["pipelines"]]
+    assert pid not in all_ids
+
+
+def test_delete_pipeline_not_found_returns_404(client):
+    r = client.request("DELETE", "/api/music/pipeline/99999")
+    assert r.status_code == 404
+
+
+def test_delete_pipeline_removes_child_jobs(client):
+    pid = client.post("/api/music/pipeline", json={"track_id": 1}).json()["id"]
+    db.create_pipeline_job(pid, "cover")
+    assert len(db.list_pipeline_jobs(pid)) == 1
+    client.request("DELETE", f"/api/music/pipeline/{pid}")
+    assert db.list_pipeline_jobs(pid) == []
+
+
 def test_setup_get_returns_defaults(client):
    r = client.get("/api/music/setup")
    assert r.status_code == 200
--- a/music-lab/tests/test_pipeline_retry.py
+++ b/music-lab/tests/test_pipeline_retry.py
@@ -0,0 +1,174 @@
+import pytest
+from fastapi.testclient import TestClient
+from app import db
+from app.pipeline import orchestrator
+
+
+@pytest.fixture
+def fresh_db(monkeypatch, tmp_path):
+    db_path = tmp_path / "music.db"
+    monkeypatch.setattr(db, "DB_PATH", str(db_path))
+    db.init_db()
+    return db_path
+
+
+@pytest.fixture(autouse=True)
+def _no_backoff(monkeypatch):
+    monkeypatch.setattr(orchestrator, "STEP_RETRY_BACKOFF_SEC", [0, 0])
+
+
+def test_get_last_failed_step_returns_step(fresh_db):
+    pid = db.create_pipeline(track_id=1)
+    job_id = db.create_pipeline_job(pid, "video")
+    db.update_pipeline_job(job_id, status="failed", error="boom")
+    db.update_pipeline_state(pid, "failed", failed_reason="video: boom")
+    assert db.get_last_failed_step(pid) == "video"
+
+
+def test_get_last_failed_step_none_when_no_failure(fresh_db):
+    pid = db.create_pipeline(track_id=1)
+    db.create_pipeline_job(pid, "cover")
+    assert db.get_last_failed_step(pid) is None
+
+
+async def test_retryable_step_retries_then_succeeds(fresh_db, monkeypatch):
+    pid = db.create_pipeline(track_id=1)
+    calls = {"n": 0}
+
+    async def flaky(step, p, ctx, feedback):
+        calls["n"] += 1
+        if calls["n"] < 3:
+            raise RuntimeError("transient")
+        return {"next_state": "video_pending", "fields": {}}
+
+    monkeypatch.setattr(orchestrator, "_dispatch_step", flaky)
+    monkeypatch.setattr(
+        orchestrator, "_resolve_input",
+        lambda p: {"genre": "x", "title": "t", "moods": [], "tracks": [], "audio_path": "", "duration_sec": 0},
+    )
+    await orchestrator.run_step(pid, "cover")
+    assert calls["n"] == 3
+    assert db.get_pipeline(pid)["state"] == "video_pending"
+
+
+async def test_retryable_step_exhausts_to_failed(fresh_db, monkeypatch):
+    pid = db.create_pipeline(track_id=1)
+
+    async def always_fail(step, p, ctx, feedback):
+        raise RuntimeError("permanent")
+
+    monkeypatch.setattr(orchestrator, "_dispatch_step", always_fail)
+    monkeypatch.setattr(
+        orchestrator, "_resolve_input",
+        lambda p: {"genre": "x", "title": "t", "moods": [], "tracks": [], "audio_path": "", "duration_sec": 0},
+    )
+    await orchestrator.run_step(pid, "cover")
+    assert db.get_pipeline(pid)["state"] == "failed"
+
+
+async def test_publish_not_retried(fresh_db, monkeypatch):
+    pid = db.create_pipeline(track_id=1)
+    calls = {"n": 0}
+
+    async def fail_publish(step, p, ctx, feedback):
+        calls["n"] += 1
+        raise RuntimeError("upload error")
+
+    monkeypatch.setattr(orchestrator, "_dispatch_step", fail_publish)
+    monkeypatch.setattr(
+        orchestrator, "_resolve_input",
+        lambda p: {"genre": "x", "title": "t", "moods": [], "tracks": [], "audio_path": "", "duration_sec": 0},
+    )
+    await orchestrator.run_step(pid, "publish")
+    assert calls["n"] == 1
+    assert db.get_pipeline(pid)["state"] == "failed"
+
+
+# ── Task 3: retry endpoint tests ─────────────────────────────────────────────
+
+@pytest.fixture
+def client(fresh_db):
+    from app.main import app
+    return TestClient(app)
+
+
+def test_retry_failed_pipeline_retriggers(fresh_db, client, monkeypatch):
+    pid = db.create_pipeline(track_id=1)
+    job = db.create_pipeline_job(pid, "video")
+    db.update_pipeline_job(job, status="failed", error="boom")
+    db.update_pipeline_state(pid, "failed", failed_reason="video: boom")
+    called = {}
+
+    async def fake_run(p, step, *a):
+        called["pid"], called["step"] = p, step
+
+    monkeypatch.setattr(orchestrator, "run_step", fake_run)
+    r = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r.status_code in (200, 202)
+    assert r.json()["retrying_step"] == "video"
+
+
+def test_retry_non_failed_409(fresh_db, client):
+    pid = db.create_pipeline(track_id=1)  # state='created'
+    r = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r.status_code == 409
+
+
+def test_retry_publish_with_video_id_rejected(fresh_db, client):
+    pid = db.create_pipeline(track_id=1)
+    job = db.create_pipeline_job(pid, "publish")
+    db.update_pipeline_job(job, status="failed", error="x")
+    db.update_pipeline_state(pid, "failed", failed_reason="publish: x",
+                             youtube_video_id="abc123")
+    r = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r.status_code == 409
+
+
+# ── Fix 2: verify fake_run arguments ─────────────────────────────────────────
+
+def test_retry_failed_pipeline_retriggers_with_correct_args(fresh_db, client, monkeypatch):
+    """Verify that fake_run is called with (pid, failed_step)."""
+    pid = db.create_pipeline(track_id=1)
+    job = db.create_pipeline_job(pid, "video")
+    db.update_pipeline_job(job, status="failed", error="boom")
+    db.update_pipeline_state(pid, "failed", failed_reason="video: boom")
+    called = {}
+
+    async def fake_run(p, step, *a):
+        called["pid"], called["step"] = p, step
+
+    monkeypatch.setattr(orchestrator, "run_step", fake_run)
+    r = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r.status_code in (200, 202)
+    assert called["pid"] == pid
+    assert called["step"] == "video"
+
+
+# ── Fix 1: 'retrying' transition makes a duplicate retry return 409 ──────────
+
+def test_retry_twice_second_is_409(fresh_db, client, monkeypatch):
+    """The first retry moves the state to 'retrying', so the second retry gets 409."""
+    pid = db.create_pipeline(track_id=1)
+    job = db.create_pipeline_job(pid, "video")
+    db.update_pipeline_job(job, status="failed", error="boom")
+    db.update_pipeline_state(pid, "failed", failed_reason="video: boom")
+
+    async def fake_run(p, step, *a):
+        pass
+
+    monkeypatch.setattr(orchestrator, "run_step", fake_run)
+    r1 = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r1.status_code in (200, 202)
+    r2 = client.post(f"/api/music/pipeline/{pid}/retry")  # already retrying → 409
+    assert r2.status_code == 409
+
+
+# ── Fix 3: unknown step prefix → 409 ─────────────────────────────────────────
+
+def test_retry_unparseable_failed_reason_409(fresh_db, client):
+    """409 if the failed_reason prefix is not one of the known STEPS."""
+    pid = db.create_pipeline(track_id=1)
+    # no failed job row; only state=failed plus a reason with a non-step prefix
+    db.update_pipeline_state(pid, "failed", failed_reason="ValueError: track 1 없음")
+    r = client.post(f"/api/music/pipeline/{pid}/retry")
+    assert r.status_code == 409
--- a/nginx/default.conf
+++ b/nginx/default.conf
@@ -400,6 +400,20 @@ server {
    proxy_pass http://$saju_backend$request_uri;
  }

+  # co-gahusb — FastMCP streamable-http bus
+  # Authorization forward required (Bearer key auth), no buffering, long read timeout
+  # trailing slash on proxy_pass strips /api/co/ prefix: /api/co/mcp → /mcp
+  location /api/co/ {
+    proxy_pass http://co-gahusb:8000/;
+    proxy_http_version 1.1;
+    proxy_set_header Host $host;
+    proxy_set_header X-Real-IP $remote_addr;
+    proxy_set_header Authorization $http_authorization;
+    proxy_set_header Connection "";
+    proxy_buffering off;
+    proxy_read_timeout 3600s;
+  }
+
  # agent-office API + WebSocket
  location /api/agent-office/ {
    resolver 127.0.0.11 valid=10s;
--- a/scripts/deploy-nas.sh
+++ b/scripts/deploy-nas.sh
@@ -2,7 +2,7 @@
 set -euo pipefail

 # ── Service list (managed in one place only) ──
-SERVICES="lotto travel-proxy deployer stock music-lab insta-lab realestate-lab agent-office personal packs-lab video-lab image-lab tarot-lab saju-lab nginx scripts _shared"
+SERVICES="lotto travel-proxy deployer stock music-lab insta-lab realestate-lab co-gahusb agent-office personal packs-lab video-lab image-lab tarot-lab saju-lab nginx scripts _shared"

 # 1. Auto-detect: are we running inside a Docker container?
 if [ -d "/repo" ] && [ -d "/runtime" ]; then
--- a/scripts/deploy.sh
+++ b/scripts/deploy.sh
@@ -15,13 +15,13 @@ flock -n 200 || { echo "Deploy already running, skipping"; exit 0; }

 # ── Service list (managed in one place only) ──
 # docker compose service names (deployer excluded: rebuilding itself would abort the script)
-BUILD_TARGETS="lotto travel-proxy stock music-lab insta-lab realestate-lab agent-office personal packs-lab video-lab image-lab tarot-lab saju-lab frontend"
+BUILD_TARGETS="lotto travel-proxy stock music-lab insta-lab realestate-lab co-gahusb agent-office personal packs-lab video-lab image-lab tarot-lab saju-lab frontend"
 # Container names (for orphan cleanup; blog-lab is retired but stays in the cleanup list)
-CONTAINER_NAMES="lotto stock music-lab insta-lab blog-lab realestate-lab agent-office personal packs-lab travel-proxy video-lab image-lab tarot-lab saju-lab frontend"
+CONTAINER_NAMES="lotto stock music-lab insta-lab blog-lab realestate-lab co-gahusb agent-office personal packs-lab travel-proxy video-lab image-lab tarot-lab saju-lab frontend"
 # Infra services (image-based; only `up`, no stop/rm, to preserve persistent data)
 INFRA_SERVICES="redis"
 # Health check targets
-HEALTH_ENDPOINTS="lotto stock travel-proxy music-lab insta-lab realestate-lab agent-office personal packs-lab video-lab image-lab tarot-lab saju-lab redis"
+HEALTH_ENDPOINTS="lotto stock travel-proxy music-lab insta-lab realestate-lab co-gahusb agent-office personal packs-lab video-lab image-lab tarot-lab saju-lab redis"
 # data directories (packs-lab uses a separate media/packs)
 DATA_DIRS="music stock insta realestate agent-office personal video image tarot saju"

--- a/stock/app/db.py
+++ b/stock/app/db.py
@@ -2,6 +2,7 @@ import sqlite3
 import os
 import hashlib
 import json
+import datetime as dt
 from typing import List, Dict, Any, Optional

 from app.screener.schema import ensure_screener_schema
@@ -125,6 +126,42 @@ def init_db():
        conn.execute("CREATE INDEX IF NOT EXISTS idx_holdings_sig_ticker "
                     "ON holdings_signals(ticker, date DESC);")

+        # Real-time trade alerts: watchlist / alert_state / alert_history
+        conn.execute("""
+            CREATE TABLE IF NOT EXISTS watchlist (
+                ticker      TEXT PRIMARY KEY,
+                name        TEXT,
+                note        TEXT,
+                params_json TEXT NOT NULL DEFAULT '{}',
+                added_at    TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
+            )
+        """)
+        conn.execute("""
+            CREATE TABLE IF NOT EXISTS trade_alert_state (
+                ticker           TEXT NOT NULL,
+                kind             TEXT NOT NULL,
+                condition        TEXT NOT NULL,
+                currently_firing INTEGER NOT NULL DEFAULT 0,
+                first_fired_at   TEXT,
+                last_fired_at    TEXT,
+                last_seen_at     TEXT,
+                PRIMARY KEY (ticker, kind, condition)
+            )
+        """)
+        conn.execute("""
+            CREATE TABLE IF NOT EXISTS trade_alert_history (
+                id          INTEGER PRIMARY KEY AUTOINCREMENT,
+                ticker      TEXT NOT NULL,
+                name        TEXT,
+                kind        TEXT NOT NULL,
+                condition   TEXT NOT NULL,
+                price       REAL,
+                detail_json TEXT,
+                fired_at    TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
+            )
+        """)
+        conn.execute("CREATE INDEX IF NOT EXISTS idx_tah_fired ON trade_alert_history(fired_at DESC)")
+
        # Bootstrap the screener schema (7 tables + default settings seed)
        ensure_screener_schema(conn)

@@ -379,3 +416,146 @@ def get_holdings_signal_history(ticker: str, limit: int = 30) -> list:
            "SELECT * FROM holdings_signals WHERE ticker=? ORDER BY date DESC LIMIT ?",
            (ticker, limit)).fetchall()
    return [_row_to_signal(r) for r in rows]
+
+
+# --- Real-time trade alerts: shared utilities ---
+
+def _now_iso() -> str:
+    return dt.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%fZ")
+
+
+# --- Watchlist CRUD ---
+
+def add_watchlist(ticker: str, name: str = None, note: str = None) -> None:
+    with _conn() as conn:
+        conn.execute(
+            "INSERT OR IGNORE INTO watchlist(ticker,name,note) VALUES(?,?,?)",
+            (ticker, name, note),
+        )
+        # Update name/note if the row already exists
+        conn.execute(
+            "UPDATE watchlist SET name=COALESCE(?,name), note=COALESCE(?,note) WHERE ticker=?",
+            (name, note, ticker),
+        )
+
+
+def remove_watchlist(ticker: str) -> bool:
+    with _conn() as conn:
+        cur = conn.execute("DELETE FROM watchlist WHERE ticker=?", (ticker,))
+        return cur.rowcount > 0
+
+
+def get_watchlist() -> list:
+    with _conn() as conn:
+        rows = conn.execute("SELECT * FROM watchlist ORDER BY added_at").fetchall()
+    return [
+        {"ticker": r["ticker"], "name": r["name"], "note": r["note"],
+         "params": json.loads(r["params_json"] or "{}"), "added_at": r["added_at"]}
+        for r in rows
+    ]
+
+
+# --- Trade Alert State ---
+
+def get_alert_state_firing() -> set:
+    with _conn() as conn:
+        rows = conn.execute(
+            "SELECT ticker,kind,condition FROM trade_alert_state WHERE currently_firing=1"
+        ).fetchall()
+    return {(r["ticker"], r["kind"], r["condition"]) for r in rows}
+
+
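+def set_alert_firing(ticker: str, kind: str, condition: str, firing: bool,
+                     at_iso: str = None, mark_fired: bool = True) -> None:
+    """Update the currently_firing state.
+
+    mark_fired=True (default): a notification was actually sent, so first/last_fired_at are updated.
+    mark_fired=False: the notification was suppressed by the cooldown; only the firing state is kept
+      and the fired timestamps are left alone (so the cooldown does not keep extending).
+    """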
+    now = at_iso or _now_iso()
+    with _conn() as conn:
+        if firing and mark_fired:
+            conn.execute(
+                """INSERT INTO trade_alert_state(ticker,kind,condition,currently_firing,first_fired_at,last_fired_at,last_seen_at)
+                   VALUES(?,?,?,1,?,?,?)
+                   ON CONFLICT(ticker,kind,condition) DO UPDATE SET
+                     currently_firing=1,
+                     first_fired_at=COALESCE(first_fired_at,excluded.first_fired_at),
+                     last_fired_at=excluded.last_fired_at,
+                     last_seen_at=excluded.last_seen_at""",
+                (ticker, kind, condition, now, now, now),
+            )
+        elif firing and not mark_fired:
+            conn.execute(
+                """INSERT INTO trade_alert_state(ticker,kind,condition,currently_firing,last_seen_at)
+                   VALUES(?,?,?,1,?)
+                   ON CONFLICT(ticker,kind,condition) DO UPDATE SET
+                     currently_firing=1, last_seen_at=excluded.last_seen_at""",
+                (ticker, kind, condition, now),
+            )
+        else:
+            conn.execute(
+                "UPDATE trade_alert_state SET currently_firing=0, last_seen_at=? WHERE ticker=? AND kind=? AND condition=?",
+                (now, ticker, kind, condition),
+            )
+
+
+def get_alert_last_fired_map() -> dict:
+    """{(ticker,kind,condition): last_fired_at ISO}, used for the cooldown check."""
+    with _conn() as conn:
+        rows = conn.execute(
+            "SELECT ticker,kind,condition,last_fired_at FROM trade_alert_state"
+        ).fetchall()
+    return {(r["ticker"], r["kind"], r["condition"]): r["last_fired_at"] for r in rows}
+
+
+def get_ticker_name(ticker: str) -> Optional[str]:
+    """Resolve the ticker name, trying watchlist → portfolio → krx_master in order; None if not found."""
+    with _conn() as conn:
+        for sql in (
+            "SELECT name FROM watchlist WHERE ticker=?",
+            "SELECT name FROM portfolio WHERE ticker=? LIMIT 1",
+            "SELECT name FROM krx_master WHERE ticker=?",
+        ):
+            try:
+                row = conn.execute(sql, (ticker,)).fetchone()
+            except sqlite3.OperationalError:
+                continue  # some test DBs lack this table
+            if row and row["name"]:
+                return row["name"]
+    return None
+
+
+def touch_alert_seen(keys: list, at_iso: str) -> None:
+    with _conn() as conn:
+        for (ticker, kind, condition) in keys:
+            conn.execute(
+                "UPDATE trade_alert_state SET last_seen_at=? WHERE ticker=? AND kind=? AND condition=?",
+                (at_iso, ticker, kind, condition),
+            )
+
+
+# --- Trade Alert History ---
+
+def add_alert_history(ticker: str, name: str, kind: str, condition: str, price, detail: dict) -> int:
+    with _conn() as conn:
+        cur = conn.execute(
+            "INSERT INTO trade_alert_history(ticker,name,kind,condition,price,detail_json) VALUES(?,?,?,?,?,?)",
+            (ticker, name, kind, condition, price, json.dumps(detail or {}, ensure_ascii=False)),
+        )
+        return cur.lastrowid
+
+
+def get_alert_history(days: int = 7) -> list:
+    with _conn() as conn:
+        rows = conn.execute(
+            "SELECT * FROM trade_alert_history WHERE fired_at >= strftime('%Y-%m-%dT%H:%M:%fZ','now', ?) ORDER BY fired_at DESC",
+            (f"-{int(days)} days",),
+        ).fetchall()
+    return [
+        {"id": r["id"], "ticker": r["ticker"], "name": r["name"], "kind": r["kind"],
+         "condition": r["condition"], "price": r["price"],
+         "detail": json.loads(r["detail_json"] or "{}"), "fired_at": r["fired_at"]}
+        for r in rows
+    ]
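
A note on the timestamp format used by `_now_iso` above: in Python's `strftime`, `%f` is six-digit microseconds, whereas in SQLite's `strftime`, `%f` is fractional seconds (`SS.SSS`). The Python format string `"%Y-%m-%dT%H:%M:%fZ"` therefore has no seconds field, although the value still round-trips through `strptime` with the same format. The sketch below shows one way to produce and parse the SQLite-style shape from Python. `utc_iso_millis` and `parse_utc_iso` are hypothetical helpers and are not part of this change.

```python
from datetime import datetime, timezone


def utc_iso_millis(now=None):
    """Format a UTC time as YYYY-MM-DDTHH:MM:SS.sssZ (millisecond precision)."""
    now = now or datetime.now(timezone.utc)
    # %f gives 6 digits of microseconds; dropping the last 3 leaves milliseconds.
    return now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


def parse_utc_iso(value):
    """Parse the shape produced by utc_iso_millis back into an aware datetime."""
    # strptime's %f accepts 1 to 6 digits, so the 3-digit fraction parses fine.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


sample = datetime(2026, 7, 3, 10, 45, 51, 123456, tzinfo=timezone.utc)
print(utc_iso_millis(sample))                   # SQLite-style shape
print(sample.strftime("%Y-%m-%dT%H:%M:%fZ"))    # shape produced by the format in _now_iso
```

If the stored format were changed along these lines, the parser in `_within_cooldown` would need the matching format string, since both sides currently rely on the same pattern.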
--- a/stock/app/main.py
+++ b/stock/app/main.py
@@ -21,6 +21,9 @@ from .db import (
    upsert_broker_cash, get_all_broker_cash, delete_broker_cash,
    upsert_asset_snapshot, get_asset_snapshots,
    add_sell_history, get_sell_history, update_sell_history, delete_sell_history,
+    add_watchlist, remove_watchlist, get_watchlist, get_alert_history,
+    get_alert_state_firing, set_alert_firing, touch_alert_seen, add_alert_history,
+    get_alert_last_fired_map, get_ticker_name,
 )
 from .scraper import fetch_market_news, fetch_major_indices
 from .price_fetcher import get_current_prices, get_current_prices_detail
@@ -28,6 +31,10 @@ from .ai_summarizer import summarize_news, OllamaError
 from .auth import verify_webai_key
 from . import webai_cache
 from . import holdings_intel
+from . import trade_alerts
+from .trade_alerts import (
+    build_monitor_set, current_session, diff_firing, DEFAULT_EXIT_PARAMS, DEFAULT_BUY_PARAMS,
+)

 app = FastAPI()
 install_access_log(app)
@@ -506,6 +513,90 @@ def get_webai_news_sentiment(date: str | None = None):
    return result


+@app.get("/api/webai/trade-alert/monitor-set", dependencies=[Depends(verify_webai_key)])
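+def get_trade_alert_monitor_set():
+    """For web-ai (the Windows worker) only: assemble the real-time trade alert monitor set (contract §5.1).
+
+    The session is first classified as pre/regular/after from the KST time, then gated on
+    weekday/holiday status (is_market_open) to decide whether it is ultimately closed.
+    """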
+    from datetime import datetime, timezone, timedelta
+    kst = timezone(timedelta(hours=9))
+    now_kst = datetime.now(kst)
+    session = current_session(now_kst)
+    if not is_market_open(now_kst.date()):
+        session = "closed"
+
+    from .db import _conn
+    conn = _conn()
+    try:
+        return build_monitor_set(conn, session, DEFAULT_EXIT_PARAMS, DEFAULT_BUY_PARAMS)
+    finally:
+        conn.close()
+
+
+class TradeAlertReport(BaseModel):
+    as_of: str | None = None
+    firing: list[dict] = []
+
+
+@app.post("/api/webai/trade-alert/report", dependencies=[Depends(verify_webai_key)])
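+def post_trade_alert_report(req: TradeAlertReport):
+    """For web-ai (the Windows worker) only: receive a firing report (contract §5.2).
+
+    After an edge diff (diff_firing) against the previous firing state, a new alert is
+    written to state (firing=True) and history only if delivery to agent-office succeeds.
+    If delivery fails the state is not adopted, so the same alert is detected as "new"
+    again on the next cycle and retried (idempotent). Cleared alerts are set to firing=False regardless of delivery.
+    """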
+    from datetime import datetime, timedelta
+    cooldown_h = float(os.getenv("TRADE_ALERT_COOLDOWN_HOURS", "6"))
+    now = datetime.utcnow()
+
+    prev = get_alert_state_firing()
+    last_fired = get_alert_last_fired_map()
+    d = diff_firing(req.firing, prev)
+
+    new_count = 0
+    suppressed = 0
+    for a in d["new"]:
+        key = (a["ticker"], a["kind"], a["condition"])
+        # Cooldown: if the same ticker/condition fired recently (clear → re-fire oscillation), suppress the re-alert
+        lf = last_fired.get(key)
+        if cooldown_h > 0 and _within_cooldown(now, lf, timedelta(hours=cooldown_h)):
+            set_alert_firing(*key, firing=True, mark_fired=False)   # keep firing, leave fired timestamps unchanged
+            suppressed += 1
+            continue
+        name = a.get("name") or get_ticker_name(a["ticker"])
+        alert = {**a, "name": name}
+        if trade_alerts.notify_agent_office([alert]):
+            set_alert_firing(*key, firing=True)                    # update fired timestamps (UTC)
+            add_alert_history(
+                a["ticker"], name, a["kind"], a["condition"],
+                a.get("price"), a.get("detail") or {},
+            )
+            new_count += 1
+
+    for ticker, kind, condition in d["cleared"]:
+        set_alert_firing(ticker, kind, condition, firing=False)
+
+    touch_alert_seen(d["seen"], req.as_of or "")
+
+    return {"new_alerts": new_count, "cleared": len(d["cleared"]), "suppressed": suppressed}
+
+
+def _within_cooldown(now, last_iso, cooldown) -> bool:
+    """True if last_iso (UTC ISO `%Y-%m-%dT%H:%M:%fZ`) is within cooldown of now."""
+    if not last_iso:
+        return False
+    from datetime import datetime
+    try:
+        lf = datetime.strptime(last_iso, "%Y-%m-%dT%H:%M:%fZ")
+    except (ValueError, TypeError):
+        return False
+    return (now - lf) < cooldown
+
+
@app.post("/api/portfolio", status_code=201)
 def create_portfolio_item(req: PortfolioItemRequest):
    """Add a portfolio item"""
@@ -653,6 +744,41 @@ def remove_sell_history(record_id: int):
    return {"ok": True}


+# --- Watchlist & Trade Alerts API (real-time trade alerts) ---
+
+class WatchlistItemRequest(BaseModel):
+    ticker: str
+    name: str | None = None
+    note: str | None = None
+
+
+@app.get("/api/stock/watchlist")
+def list_watchlist():
+    """List watchlist items"""
+    return {"watchlist": get_watchlist()}
+
+
+@app.post("/api/stock/watchlist", status_code=201)
+def create_watchlist_item(req: WatchlistItemRequest):
+    """Add a watchlist item (updates name/note if it already exists; idempotent)"""
+    add_watchlist(req.ticker, req.name, req.note)
+    return {"ok": True}
+
+
+@app.delete("/api/stock/watchlist/{ticker}")
+def delete_watchlist_item(ticker: str):
+    """Delete a watchlist item"""
+    if not remove_watchlist(ticker):
+        raise HTTPException(status_code=404, detail="not in watchlist")
+    return {"ok": True}
+
+
+@app.get("/api/stock/trade-alerts")
+def list_trade_alerts(days: int = 7):
+    """List trade alert history (last N days)"""
+    return {"alerts": get_alert_history(days)}
+
+
 # --- Holdings Intelligence API ---

@app.get("/api/stock/holdings/intel")
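
The edge-plus-cooldown decision in `post_trade_alert_report` can be sketched as a pure function. This is an illustration under assumptions: `classify_report` is a hypothetical name, `diff_firing` lives in `trade_alerts` and is not shown in this diff, so its exact behavior is assumed here (new = reported but not previously firing, cleared = previously firing but not reported), and delivery to agent-office is ignored.

```python
from datetime import datetime, timedelta


def classify_report(firing_keys, prev_firing, last_fired, now, cooldown):
    """Return (notify, suppressed, cleared) key lists for one report cycle."""
    current = set(firing_keys)
    notify, suppressed = [], []
    for key in firing_keys:
        if key in prev_firing:
            continue  # already firing last cycle: no rising edge, nothing to send
        last = last_fired.get(key)
        in_cooldown = (
            cooldown > timedelta(0) and last is not None and (now - last) < cooldown
        )
        # Suppressed keys still count as firing, but their fired time is not refreshed.
        (suppressed if in_cooldown else notify).append(key)
    cleared = sorted(k for k in prev_firing if k not in current)
    return notify, suppressed, cleared


now = datetime(2026, 7, 3, 12, 0)
key = ("005930", "sell", "example_condition")  # illustrative key, not from the source
print(classify_report([key], set(), {key: now - timedelta(hours=1)}, now, timedelta(hours=6)))
```

Because a suppressed key keeps its old `last_fired_at`, the cooldown window is measured from the last notification that was actually sent, so repeated oscillation cannot extend it indefinitely.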
--- a/stock/app/screener/ai_news/analyzer.py
+++ b/stock/app/screener/ai_news/analyzer.py
@@ -2,6 +2,7 @@

 from __future__ import annotations

+import datetime as dt
 import json
 import logging
 import os
@@ -59,13 +60,19 @@ async def score_sentiment(
    *,
    name: str | None = None,
    model: str = DEFAULT_MODEL,
+    asof: dt.date | None = None,
 ) -> Dict[str, Any]:
-    """Returns {ticker, score_raw, reason, news_count, tokens_input, tokens_output, model}."""
+    """Returns {ticker, score_raw, reason, news_count, tokens_input, tokens_output, model}.
+
+    If asof (the current KST date) is given, today's date is stated at the top of the prompt so the LLM judges relative to the present.
+    """
    news_block = _format_news_block(news)
    prompt = PROMPT_TEMPLATE.format(
        name=name or ticker, ticker=ticker,
        n=len(news), news_block=news_block,
    )
+    if asof is not None:
+        prompt = f"오늘 날짜: {asof.isoformat()} (이 시점 기준으로 뉴스를 평가하세요)\n\n" + prompt
    resp = await llm.messages.create(
        model=model,
        max_tokens=200,
--- a/stock/app/screener/ai_news/pipeline.py
+++ b/stock/app/screener/ai_news/pipeline.py
@@ -39,11 +39,11 @@ def _make_llm():

 async def _process_one(
    ticker: str, name: str, articles: List[Dict[str, Any]],
-    sem: asyncio.Semaphore, llm, model: str,
+    sem: asyncio.Semaphore, llm, model: str, asof: dt.date,
 ) -> Dict[str, Any]:
    async with sem:
        return await _analyzer.score_sentiment(
-            llm, ticker, articles, name=name, model=model,
+            llm, ticker, articles, name=name, model=model, asof=asof,
        )


@@ -110,7 +110,7 @@ async def refresh_daily(
            arts = articles_by_ticker.get(t, [])
            if not arts:
                continue  # no matched articles: no score is generated
-            tasks.append(_process_one(t, name_map.get(t, t), arts, sem, llm, model))
+            tasks.append(_process_one(t, name_map.get(t, t), arts, sem, llm, model, asof))
        raw_results = await asyncio.gather(*tasks, return_exceptions=True)

    successes: List[Dict[str, Any]] = []
--- a/stock/app/screener/router.py
+++ b/stock/app/screener/router.py
@@ -125,6 +125,16 @@ from . import telegram as _tg
 from .engine import Screener, ScreenContext


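+def _today_kst() -> dt.date:
+    """Today's date in KST.
+
+    The stock container runs python:3.12-alpine without tzdata, so TZ=Asia/Seoul has
+    no effect and date.today() returns the UTC date. To keep reports generated in the
+    08:00 hour (KST) from slipping back a day, shift explicitly to UTC+9 (the same
+    idiom as holdings_intel._today_kst).
+    """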
+    return (dt.datetime.utcnow() + dt.timedelta(hours=9)).date()
+
+
 def _resolve_asof(asof_str, conn: sqlite3.Connection) -> dt.date:
    if asof_str:
        return dt.date.fromisoformat(asof_str)
@@ -263,7 +273,7 @@ from . import snapshot as _snap

 @router.post("/snapshot/refresh")
 def post_snapshot_refresh(asof: Optional[str] = None):
-    asof_date = dt.date.fromisoformat(asof) if asof else dt.date.today()
+    asof_date = dt.date.fromisoformat(asof) if asof else _today_kst()
    if asof_date.weekday() >= 5:
        return {"asof": asof_date.isoformat(), "status": "skipped_weekend"}
    with _conn() as c:
@@ -300,7 +310,7 @@ from .ai_news import validation as _ai_validation

 @router.post("/snapshot/refresh-news-sentiment")
 async def post_refresh_news_sentiment(asof: Optional[str] = None):
-    asof_date = dt.date.fromisoformat(asof) if asof else dt.date.today()
+    asof_date = dt.date.fromisoformat(asof) if asof else _today_kst()
    if asof_date.weekday() >= 5:
        return {"asof": asof_date.isoformat(), "status": "skipped_weekend"}
    if _is_holiday(asof_date):
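
The `_today_kst` helper above shifts the UTC clock by nine hours because the container has no tzdata. The same idea can be sketched with a timezone-aware clock and a fixed offset, which also needs no tzdata. This is a minimal illustration, not the repository's code: `today_kst`, `KST`, and `sample` are hypothetical names.

```python
import datetime as dt
from typing import Optional

# Fixed UTC+9 offset; needs no tzdata package in the image.
KST = dt.timezone(dt.timedelta(hours=9))


def today_kst(now_utc: Optional[dt.datetime] = None) -> dt.date:
    """Return the calendar date in KST for the given (or current) UTC instant."""
    if now_utc is None:
        now_utc = dt.datetime.now(dt.timezone.utc)
    return now_utc.astimezone(KST).date()


# 23:30 UTC on 1 July is already 08:30 on 2 July in KST.
sample = dt.datetime(2026, 7, 1, 23, 30, tzinfo=dt.timezone.utc)
print(today_kst(sample))  # 2026-07-02
```

Accepting the instant as a parameter keeps the function testable without monkeypatching `datetime`.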
--- a/stock/app/trade_alerts.py
+++ b/stock/app/trade_alerts.py
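@@ -0,0 +1,138 @@
+"""Trade alerts: monitor-set assembly, edge diff, and agent-office push (no Telegram calls here).
+
+Contract §5.1 (docs/superpowers/specs/2026-07-02-realtime-trade-alerts-design.md):
+assembles the response the Windows worker receives from GET /api/webai/trade-alert/monitor-set.
+The NAS merges watchlist ∪ candidates from the latest successful screener run into buy_targets
+and held positions into sell_targets. TA and condition evaluation are the worker's responsibility.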
+"""
+import os
+import httpx
+
+from datetime import datetime, timedelta, timezone, time as _time
+from typing import Optional
+
+from app.db import get_all_portfolio, get_watchlist
+
+_KST = timezone(timedelta(hours=9))
+
+# KST session windows (hour, minute); weekday/holiday checks are gated separately by the caller via is_market_open
+_SESSIONS = [
+    ("pre",     (8, 30), (9, 0)),
+    ("regular", (9, 0),  (15, 30)),
+    ("after",   (16, 0), (18, 0)),
+]
+
+
+def current_session(now_kst) -> str:
+    """Classify the session as pre/regular/after/closed from now_kst's time of day only (ignores weekday and holidays)."""
+    t = now_kst.time()
+    for name, (sh, sm), (eh, em) in _SESSIONS:
+        if _time(sh, sm) <= t < _time(eh, em):
+            return name
+    return "closed"
+
+
+DEFAULT_EXIT_PARAMS = {"stop_pct": 0.08, "take_pct": 0.25, "trailing_pct": 0.10,
+                       "climax_vol_x": 3.0, "climax_close_pct": 0.97}
+DEFAULT_BUY_PARAMS = {"rsi_oversold": 30, "breakout_vol_mult": 1.5, "pullback_pct": 0.02}
+
+
+def latest_screener_candidates(conn) -> list:
+    """Candidates ({ticker, name}) from the latest successful (status='success') screener run."""
+    row = conn.execute(
+        "SELECT id FROM screener_runs WHERE status='success' ORDER BY asof DESC, id DESC LIMIT 1"
+    ).fetchone()
+    if not row:
+        return []
+    run_id = row[0]
+    rows = conn.execute(
+        "SELECT ticker, name FROM screener_results WHERE run_id=? ORDER BY rank", (run_id,)
+    ).fetchall()
+    return [{"ticker": r[0], "name": r[1]} for r in rows]
+
+
+def holding_high(conn, ticker: str, lookback_days: int = 60) -> Optional[float]:
+    """High used as the holding-period high for the trailing stop: the max high in krx_daily_prices over the last lookback_days."""
+    row = conn.execute(
+        "SELECT MAX(high) FROM krx_daily_prices WHERE ticker=? "
+        "AND date >= date('now', ?)",
+        (ticker, f"-{int(lookback_days)} days"),
+    ).fetchone()
+    return row[0] if row and row[0] is not None else None
+
+
+def build_monitor_set(conn, session: str, exit_params: dict, buy_params: dict) -> dict:
+    """Assemble the monitor-set response dict of contract §5.1.
+
+    buy_targets  = watchlist ∪ latest screener candidates (deduplicated by ticker, watchlist wins)
+    sell_targets = held positions (portfolio) + avg_price/qty/holding_high
+    """
+    buy: dict[str, dict] = {}
+    for w in get_watchlist():
+        buy[w["ticker"]] = {
+            "ticker": w["ticker"], "name": w["name"],
+            "source": "watch", "params": w.get("params") or {},
+        }
+    for c in latest_screener_candidates(conn):
+        if c["ticker"] not in buy:
+            buy[c["ticker"]] = {
+                "ticker": c["ticker"], "name": c["name"],
+                "source": "screener", "params": {},
+            }
+
+    sell_targets = []
+    for p in get_all_portfolio():
+        ticker = p["ticker"]
+        sell_targets.append({
+            "ticker": ticker,
+            "name": p.get("name"),
+            "avg_price": p.get("avg_price"),
+            "qty": p.get("quantity"),
+            "holding_high": holding_high(conn, ticker),
+            "params": {},
+        })
+
+    return {
+        "session": session,
+        "as_of": datetime.now(_KST).isoformat(),
+        "buy_targets": list(buy.values()),
+        "sell_targets": sell_targets,
+        "buy_params": buy_params,
+        "exit_params": exit_params,
+    }
+
+
+def diff_firing(reported: list, prev: set) -> dict:
+    """Edge diff between the worker's firing set (reported) and the previous firing state (prev).
+
+    Each reported item: {ticker,kind,condition,price,detail,name?}.
+    key = (ticker,kind,condition).
+    Returns {"new":[new alerts...], "cleared":[cleared keys...], "seen":[current keys...]}.
+    """
+    cur = {}
+    for a in reported:
+        key = (a["ticker"], a["kind"], a["condition"])
+        cur[key] = a
+    cur_keys = set(cur.keys())
+    new_keys = cur_keys - prev
+    cleared = sorted(prev - cur_keys)
+    return {
+        "new": [cur[k] for k in sorted(new_keys)],
+        "cleared": cleared,
+        "seen": sorted(cur_keys),
+    }
+
+
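+def notify_agent_office(alerts: list) -> bool:
+    """Push new alerts to agent-office (contract §5.2). Returns True when the send succeeds.
+
+    A failure (network error or non-200) returns False, so the caller leaves state and
+    history uncommitted and retries the same alert on the next cycle (idempotent, at-least-once).
+    """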
+    url = os.getenv("AGENT_OFFICE_URL", "http://agent-office:8000") + "/api/agent-office/stock/trade-alert"
+    try:
+        with httpx.Client(timeout=10) as c:
+            resp = c.post(url, json={"alerts": alerts})
+        return resp.status_code == 200
+    except httpx.HTTPError:
+        return False
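
The contract described by `diff_firing` and `notify_agent_office` (commit state only after a successful send, so a failed push is retried on the next cycle) can be sketched as one in-memory step. This is a hedged illustration, not the report endpoint itself: `process_report`, `Key`, and the `send` callback are hypothetical stand-ins, and the real handler persists state and history in the DB.

```python
from typing import Callable

Key = tuple[str, str, str]  # (ticker, kind, condition)


def process_report(reported: list[dict], prev: set[Key],
                   send: Callable[[list[dict]], bool]) -> tuple[set[Key], int]:
    """Return (next firing state, number of alerts delivered)."""
    cur = {(a["ticker"], a["kind"], a["condition"]): a for a in reported}
    new_keys = sorted(set(cur) - prev)
    # Keys that stopped firing are dropped; keys still firing are kept.
    next_state = prev & set(cur)
    if not new_keys:
        return next_state, 0
    if send([cur[k] for k in new_keys]):
        return next_state | set(new_keys), len(new_keys)
    # Send failed: keep the new keys out of the state so the next cycle retries them.
    return next_state, 0


alert = {"ticker": "005930", "kind": "buy", "condition": "buy_breakout",
         "price": 71500, "detail": {}}
state, sent = process_report([alert], set(), send=lambda alerts: False)
print(state, sent)  # set() 0
state, sent = process_report([alert], state, send=lambda alerts: True)
print(sent)  # 1
```

Because a failed send leaves the key out of the state, the same alert counts as new again on the next report, which gives at-least-once delivery.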
--- a/stock/tests/test_ai_news_analyzer.py
+++ b/stock/tests/test_ai_news_analyzer.py
@@ -58,6 +58,18 @@ async def test_score_sentiment_clamps_negative_out_of_range():
    assert out["score_raw"] == -10.0


+@pytest.mark.asyncio
+async def test_score_sentiment_includes_asof_date_in_prompt():
+    """When asof (the current KST date) is passed, the prompt includes today's date so the LLM judges as of that date."""
+    import datetime as _dt
+    llm = _mk_llm(json.dumps({"score": 5.0, "reason": "ok"}))
+    await analyzer.score_sentiment(
+        llm, "005930", NEWS, name="삼성전자", asof=_dt.date(2026, 7, 2),
+    )
+    user_msg = llm.messages.create.call_args.kwargs["messages"][0]["content"]
+    assert "2026-07-02" in user_msg
+
+
 @pytest.mark.asyncio
 async def test_score_sentiment_includes_summary_in_prompt():
    """If a summary is present it is included in the prompt; otherwise only the title."""
--- a/stock/tests/test_ai_news_pipeline.py
+++ b/stock/tests/test_ai_news_pipeline.py
@@ -39,7 +39,7 @@ async def test_refresh_daily_happy_path(conn):
    scores_by_ticker = {
        "005930": 7.5, "000660": 4.0, "373220": -6.0,
    }
-    async def fake_score(llm, ticker, news, *, name=None, model="m"):
+    async def fake_score(llm, ticker, news, *, name=None, model="m", asof=None):
        return {
            "ticker": ticker, "score_raw": scores_by_ticker[ticker],
            "reason": f"r{ticker}", "news_count": 1,
@@ -81,7 +81,7 @@ async def test_refresh_daily_failures_isolated(conn):
    }
    fake_stats = {"total_articles": 3, "matched_pairs": 3, "hit_tickers": 3}

-    async def fake_score(llm, ticker, news, *, name=None, model="m"):
+    async def fake_score(llm, ticker, news, *, name=None, model="m", asof=None):
        if ticker == "000660":
            raise RuntimeError("llm exploded")
        return {
@@ -116,7 +116,7 @@ async def test_refresh_daily_no_match_ticker_skipped(conn):
    }
    fake_stats = {"total_articles": 1, "matched_pairs": 1, "hit_tickers": 1}

-    async def fake_score(llm, ticker, news, *, name=None, model="m"):
+    async def fake_score(llm, ticker, news, *, name=None, model="m", asof=None):
        return {
            "ticker": ticker, "score_raw": 5.0, "reason": "r",
            "news_count": 1, "tokens_input": 100, "tokens_output": 20,
@@ -152,7 +152,7 @@ async def test_refresh_daily_sign_gate_no_positive_in_neg(conn):
    fake_stats = {"total_articles": 3, "matched_pairs": 3, "hit_tickers": 3}
    scores = {"005930": 6.0, "000660": 2.0, "373220": 0.5}  # all positive

-    async def fake_score(llm, ticker, news, *, name=None, model="m"):
+    async def fake_score(llm, ticker, news, *, name=None, model="m", asof=None):
        return {
            "ticker": ticker, "score_raw": scores[ticker], "reason": "r",
            "news_count": 1, "tokens_input": 1, "tokens_output": 1, "model": model,
@@ -183,7 +183,7 @@ async def test_refresh_daily_sign_gate_excludes_neutral(conn):
    fake_stats = {"total_articles": 3, "matched_pairs": 3, "hit_tickers": 3}
    scores = {"005930": 3.0, "000660": 0.0, "373220": -3.0}

-    async def fake_score(llm, ticker, news, *, name=None, model="m"):
+    async def fake_score(llm, ticker, news, *, name=None, model="m", asof=None):
        return {
            "ticker": ticker, "score_raw": scores[ticker], "reason": "r",
            "news_count": 1, "tokens_input": 1, "tokens_output": 1, "model": model,
--- a/stock/tests/test_ai_news_router.py
+++ b/stock/tests/test_ai_news_router.py
@@ -5,6 +5,21 @@ from fastapi.testclient import TestClient
 from app.main import app


+def test_today_kst_uses_kst_offset_not_utc(monkeypatch):
+    """The container runs in UTC (Alpine, no tzdata), so date.today() returns yesterday at 08:00 KST.
+    _today_kst() must shift to UTC+9 and return today (KST)."""
+    from app.screener import router
+
+    class _FrozenDT(dt.datetime):
+        @classmethod
+        def utcnow(cls):
+            # 2026-07-01 23:30 UTC == 2026-07-02 08:30 KST (the AI news report time slot)
+            return dt.datetime(2026, 7, 1, 23, 30, 0)
+
+    monkeypatch.setattr(router.dt, "datetime", _FrozenDT)
+    assert router._today_kst() == dt.date(2026, 7, 2)
+
+
 def test_refresh_news_sentiment_weekend_skip():
    # 2026-05-16 = Saturday
    client = TestClient(app)
--- a/stock/tests/test_trade_alerts_db.py
+++ b/stock/tests/test_trade_alerts_db.py
@@ -0,0 +1,48 @@
+import sqlite3
+import pytest
+
+@pytest.fixture
+def db(monkeypatch, tmp_path):
+    from app import db as _db
+    monkeypatch.setattr(_db, "DB_PATH", str(tmp_path / "stock.db"))
+    _db.init_db()
+    return _db
+
+def test_watchlist_add_get_remove(db):
+    db.add_watchlist("005930", "삼성전자", note="관심")
+    db.add_watchlist("005930", "삼성전자")  # idempotent
+    wl = db.get_watchlist()
+    assert [w["ticker"] for w in wl] == ["005930"]
+    assert wl[0]["name"] == "삼성전자"
+    assert db.remove_watchlist("005930") is True
+    assert db.get_watchlist() == []
+
+def test_alert_state_edge_firing_and_clear(db):
+    key = ("005930", "buy", "buy_breakout")
+    assert db.get_alert_state_firing() == set()
+    db.set_alert_firing(*key, firing=True, at_iso="2026-07-02T00:01:00Z")
+    assert key in db.get_alert_state_firing()
+    db.set_alert_firing(*key, firing=False)
+    assert key not in db.get_alert_state_firing()
+
+def test_alert_history_records_and_reads(db):
+    db.add_alert_history("005930", "삼성전자", "buy", "buy_breakout", 71500, {"vol": 2.1})
+    rows = db.get_alert_history(days=7)
+    assert len(rows) == 1
+    assert rows[0]["ticker"] == "005930" and rows[0]["kind"] == "buy"
+    assert rows[0]["detail"]["vol"] == 2.1
+
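+def test_alert_history_days_filter_format_consistency(db):
+    """fired_at is stored as ISO (T/Z), so the filter must also be ISO for the boundary-day comparison to be exact.
+    A record outside the 7-day boundary (exactly midnight 7 days ago) must be excluded. A format mismatch would wrongly include it."""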
+    db.add_alert_history("005930", "삼성", "buy", "buy_breakout", 71500, {})  # now
+    conn = sqlite3.connect(db.DB_PATH)
+    conn.execute(
+        "INSERT INTO trade_alert_history(ticker,name,kind,condition,price,detail_json,fired_at) "
+        "VALUES('000660','SK','sell','sell_stop_loss',60000,'{}', "
+        "strftime('%Y-%m-%dT%H:%M:%fZ','now','-7 days','start of day'))"
+    )
+    conn.commit(); conn.close()
+    tickers = [r["ticker"] for r in db.get_alert_history(days=7)]
+    assert "005930" in tickers
+    assert "000660" not in tickers
--- a/stock/tests/test_trade_alerts_edge.py
+++ b/stock/tests/test_trade_alerts_edge.py
@@ -0,0 +1,18 @@
+def test_diff_new_and_cleared_and_rearm():
+    from app.trade_alerts import diff_firing
+    reported = [{"ticker": "005930", "kind": "buy", "condition": "buy_breakout",
+                 "price": 71500, "detail": {}}]
+    # First report: prev is empty → new
+    d1 = diff_firing(reported, prev=set())
+    assert [a["condition"] for a in d1["new"]] == ["buy_breakout"]
+    assert d1["cleared"] == []
+    # Still firing: already in prev → nothing new
+    prev = {("005930", "buy", "buy_breakout")}
+    d2 = diff_firing(reported, prev=prev)
+    assert d2["new"] == []
+    # Cleared: reported is empty and the key is in prev → cleared
+    d3 = diff_firing([], prev=prev)
+    assert d3["cleared"] == [("005930", "buy", "buy_breakout")]
+    # Re-fire after re-arm: once prev is empty again, the alert is new
+    d4 = diff_firing(reported, prev=set())
+    assert len(d4["new"]) == 1
--- a/stock/tests/test_trade_alerts_monitorset.py
+++ b/stock/tests/test_trade_alerts_monitorset.py
@@ -0,0 +1,101 @@
+import sqlite3
+import pytest
+
+
+@pytest.fixture
+def conn(monkeypatch, tmp_path):
+    from app import db as _db
+    monkeypatch.setattr(_db, "DB_PATH", str(tmp_path / "stock.db"))
+    _db.init_db()
+    c = sqlite3.connect(_db.DB_PATH)
+    c.row_factory = sqlite3.Row
+    # One held position (actual add_portfolio_item signature: broker/ticker/name/quantity/avg_price; no market parameter)
+    _db.add_portfolio_item(ticker="000660", name="SK하이닉스", quantity=10,
+                            avg_price=180000, broker="kis")
+    # One watchlist item
+    _db.add_watchlist("005930", "삼성전자")
+    yield c
+    c.close()
+
+
+def test_build_monitor_set_merges_sources(conn):
+    from app import trade_alerts as ta
+    ms = ta.build_monitor_set(conn, session="regular",
+                               exit_params={"stop_pct": 0.08}, buy_params={"rsi_oversold": 30})
+    buy_tickers = {t["ticker"] for t in ms["buy_targets"]}
+    sell_tickers = {t["ticker"] for t in ms["sell_targets"]}
+    assert "005930" in buy_tickers            # watchlist
+    assert "000660" in sell_tickers           # held position
+    assert ms["session"] == "regular"
+    assert ms["exit_params"]["stop_pct"] == 0.08
+    sell = next(t for t in ms["sell_targets"] if t["ticker"] == "000660")
+    assert sell["avg_price"] == 180000 and sell["qty"] == 10
+
+
+def test_latest_screener_candidates_empty_when_no_run(conn):
+    from app import trade_alerts as ta
+    assert ta.latest_screener_candidates(conn) == []
+
+
+def test_latest_screener_candidates_picks_latest_success_run(conn):
+    from app import trade_alerts as ta
+    now = "2026-07-02T09:00:00Z"
+    conn.execute(
+        "INSERT INTO screener_runs (asof, mode, status, started_at, weights_json, "
+        "node_params_json, gate_params_json, top_n) VALUES (?,?,?,?,?,?,?,?)",
+        (now, "manual", "failed", now, "{}", "{}", "{}", 20),
+    )
+    conn.execute(
+        "INSERT INTO screener_runs (asof, mode, status, started_at, weights_json, "
+        "node_params_json, gate_params_json, top_n) VALUES (?,?,?,?,?,?,?,?)",
+        (now, "manual", "success", now, "{}", "{}", "{}", 20),
+    )
+    run_id = conn.execute("SELECT id FROM screener_runs WHERE status='success'").fetchone()[0]
+    conn.execute(
+        "INSERT INTO screener_results (run_id, rank, ticker, name, total_score, scores_json) "
+        "VALUES (?,?,?,?,?,?)",
+        (run_id, 1, "035720", "카카오", 88.5, "{}"),
+    )
+    conn.commit()
+    candidates = ta.latest_screener_candidates(conn)
+    assert candidates == [{"ticker": "035720", "name": "카카오"}]
+
+
+def test_holding_high_returns_max_high_within_lookback(conn):
+    from app import trade_alerts as ta
+    conn.execute(
+        "INSERT INTO krx_daily_prices (ticker, date, high) VALUES (?,?,?)",
+        ("000660", "2026-06-01", 200000),
+    )
+    conn.execute(
+        "INSERT INTO krx_daily_prices (ticker, date, high) VALUES (?,?,?)",
+        ("000660", "2026-06-20", 210000),
+    )
+    conn.commit()
+    assert ta.holding_high(conn, "000660", lookback_days=60) == 210000
+
+
+def test_holding_high_none_when_no_price_history(conn):
+    from app import trade_alerts as ta
+    assert ta.holding_high(conn, "999999") is None
+
+
+def test_build_monitor_set_dedupes_watchlist_and_screener_overlap(conn):
+    from app import trade_alerts as ta
+    now = "2026-07-02T09:00:00Z"
+    cur = conn.execute(
+        "INSERT INTO screener_runs (asof, mode, status, started_at, weights_json, "
+        "node_params_json, gate_params_json, top_n) VALUES (?,?,?,?,?,?,?,?)",
+        (now, "manual", "success", now, "{}", "{}", "{}", 20),
+    )
+    run_id = cur.lastrowid
+    # The screener candidate overlaps with the watchlist (005930)
+    conn.execute(
+        "INSERT INTO screener_results (run_id, rank, ticker, name, total_score, scores_json) "
+        "VALUES (?,?,?,?,?,?)",
+        (run_id, 1, "005930", "삼성전자", 90.0, "{}"),
+    )
+    conn.commit()
+    ms = ta.build_monitor_set(conn, session="regular", exit_params={}, buy_params={})
+    buy_tickers = [t["ticker"] for t in ms["buy_targets"]]
+    assert buy_tickers.count("005930") == 1
--- a/stock/tests/test_trade_alerts_monitorset_api.py
+++ b/stock/tests/test_trade_alerts_monitorset_api.py
@@ -0,0 +1,43 @@
+import datetime as dt
+import pytest
+from fastapi.testclient import TestClient
+
+
+def test_current_session_windows():
+    from app.trade_alerts import current_session
+    d = dt.date(2026, 7, 2)
+    assert current_session(dt.datetime.combine(d, dt.time(8, 40))) == "pre"
+    assert current_session(dt.datetime.combine(d, dt.time(10, 0))) == "regular"
+    assert current_session(dt.datetime.combine(d, dt.time(17, 0))) == "after"
+    assert current_session(dt.datetime.combine(d, dt.time(20, 0))) == "closed"
+
+
+@pytest.fixture
+def client(monkeypatch, tmp_path):
+    from app import db as _db
+    monkeypatch.setattr(_db, "DB_PATH", str(tmp_path / "stock.db"))
+    _db.init_db()
+    monkeypatch.setenv("WEBAI_API_KEY", "k")
+    from app.main import app
+    return TestClient(app)
+
+
+def test_monitor_set_requires_auth(client):
+    assert client.get("/api/webai/trade-alert/monitor-set").status_code == 401
+
+
+def test_monitor_set_ok(client):
+    r = client.get("/api/webai/trade-alert/monitor-set", headers={"X-WebAI-Key": "k"})
+    assert r.status_code == 200
+    body = r.json()
+    assert body["session"] in ("pre", "regular", "after", "closed")
+    assert "buy_targets" in body and "sell_targets" in body
+    assert body["exit_params"]["trailing_pct"] == 0.10
+
+
+def test_monitor_set_exit_params_include_climax(client):
+    """Climax parameters are centralized: the worker receives them from the NAS exit_params for tuning instead of hardcoding them."""
+    ep = client.get("/api/webai/trade-alert/monitor-set",
+                    headers={"X-WebAI-Key": "k"}).json()["exit_params"]
+    assert ep["climax_vol_x"] == 3.0
+    assert ep["climax_close_pct"] == 0.97
--- a/stock/tests/test_trade_alerts_report_api.py
+++ b/stock/tests/test_trade_alerts_report_api.py
@@ -0,0 +1,85 @@
+import pytest
+from unittest.mock import patch
+from fastapi.testclient import TestClient
+
+
+@pytest.fixture
+def client(monkeypatch, tmp_path):
+    from app import db as _db
+    monkeypatch.setattr(_db, "DB_PATH", str(tmp_path / "stock.db"))
+    _db.init_db()
+    monkeypatch.setenv("WEBAI_API_KEY", "k")
+    from app.main import app
+    return TestClient(app)
+
+
+def _report(client, firing):
+    return client.post("/api/webai/trade-alert/report",
+                       headers={"X-WebAI-Key": "k"},
+                       json={"as_of": "2026-07-02T09:01:00+09:00", "firing": firing})
+
+
+def test_report_new_edge_sends_and_persists(client):
+    firing = [{"ticker": "005930", "name": "삼성전자", "kind": "buy",
+               "condition": "buy_breakout", "price": 71500, "detail": {"vol": 2.0}}]
+    with patch("app.trade_alerts.notify_agent_office", return_value=True) as m:
+        r1 = _report(client, firing)
+    assert r1.json()["new_alerts"] == 1
+    assert m.called
+    # Second identical firing → still firing, 0 new
+    with patch("app.trade_alerts.notify_agent_office", return_value=True):
+        r2 = _report(client, firing)
+    assert r2.json()["new_alerts"] == 0
+    # One history row
+    assert len(client.get("/api/stock/trade-alerts?days=1").json()["alerts"]) == 1
+
+
+def test_report_send_failure_does_not_persist(client):
+    firing = [{"ticker": "005930", "name": "삼성전자", "kind": "buy",
+               "condition": "buy_breakout", "price": 71500, "detail": {}}]
+    with patch("app.trade_alerts.notify_agent_office", return_value=False):
+        r = _report(client, firing)
+    assert r.json()["new_alerts"] == 0            # send failed → not committed
+    # Retried on the next cycle (send succeeds) and alerted
+    with patch("app.trade_alerts.notify_agent_office", return_value=True):
+        r2 = _report(client, firing)
+    assert r2.json()["new_alerts"] == 1
+
+
+def test_report_cooldown_suppresses_immediate_refire(client):
+    """Even if the same ticker/condition clears and immediately re-fires, the re-alert is suppressed within the cooldown (default 6h)."""
+    firing = [{"ticker": "005930", "name": "삼성", "kind": "buy",
+               "condition": "buy_breakout", "price": 71500, "detail": {}}]
+    with patch("app.trade_alerts.notify_agent_office", return_value=True):
+        assert _report(client, firing).json()["new_alerts"] == 1   # first alert
+        _report(client, [])                                        # cleared
+        r = _report(client, firing)                                # immediate re-fire → suppressed by cooldown
+    assert r.json()["new_alerts"] == 0
+    assert r.json()["suppressed"] == 1
+
+
+def test_report_refire_after_cooldown_alerts(client, monkeypatch):
+    """With cooldown=0, a re-fire after clearing alerts again."""
+    monkeypatch.setenv("TRADE_ALERT_COOLDOWN_HOURS", "0")
+    firing = [{"ticker": "005930", "name": "삼성", "kind": "buy",
+               "condition": "buy_breakout", "price": 71500, "detail": {}}]
+    with patch("app.trade_alerts.notify_agent_office", return_value=True):
+        _report(client, firing)
+        _report(client, [])
+        r = _report(client, firing)
+    assert r.json()["new_alerts"] == 1
+
+
+def test_report_resolves_stock_name_from_watchlist(client):
+    """Even when the worker's firing entry has no name, the NAS resolves the stock name and includes it in the alert."""
+    from app import db
+    db.add_watchlist("000660", "SK하이닉스")
+    firing = [{"ticker": "000660", "kind": "buy", "condition": "buy_breakout",
+               "price": 180000, "detail": {}}]   # no name
+    with patch("app.trade_alerts.notify_agent_office", return_value=True) as m:
+        _report(client, firing)
+    sent_alert = m.call_args[0][0][0]
+    assert sent_alert["name"] == "SK하이닉스"
+    # The stock name is recorded in the history as well
+    alerts = client.get("/api/stock/trade-alerts?days=1").json()["alerts"]
+    assert alerts[0]["name"] == "SK하이닉스"
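
The cooldown behavior exercised by the tests above (a re-fire inside the window is suppressed, and a window of 0 always re-alerts) can be sketched as a pure partition step. This is an assumption-laden illustration rather than the endpoint's implementation, which is not shown in this diff: `split_by_cooldown` and the `last_fired` mapping are hypothetical names.

```python
import datetime as dt


def split_by_cooldown(new_keys, last_fired, now, cooldown_hours=6.0):
    """Partition newly firing keys into (alert, suppressed) by time since the last alert."""
    window = dt.timedelta(hours=cooldown_hours)
    alert, suppressed = [], []
    for key in new_keys:
        fired_at = last_fired.get(key)
        # Suppress only when the key alerted before and the window has not elapsed.
        if fired_at is not None and now - fired_at < window:
            suppressed.append(key)
        else:
            alert.append(key)
    return alert, suppressed


now = dt.datetime(2026, 7, 3, 10, 0, tzinfo=dt.timezone.utc)
key = ("005930", "buy", "buy_breakout")
last_fired = {key: now - dt.timedelta(hours=1)}  # alerted one hour ago

print(split_by_cooldown([key], last_fired, now))                    # suppressed
print(split_by_cooldown([key], last_fired, now, cooldown_hours=0))  # alerts again
```

Keeping the last alert time unchanged for suppressed keys means the window is measured from the last delivered alert, not from the last oscillation.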
--- a/stock/tests/test_watchlist_api.py
+++ b/stock/tests/test_watchlist_api.py
@@ -0,0 +1,22 @@
+import pytest
+from fastapi.testclient import TestClient
+
+@pytest.fixture
+def client(monkeypatch, tmp_path):
+    from app import db as _db
+    monkeypatch.setattr(_db, "DB_PATH", str(tmp_path / "stock.db"))
+    _db.init_db()
+    from app.main import app
+    return TestClient(app)
+
+def test_watchlist_crud(client):
+    assert client.get("/api/stock/watchlist").json()["watchlist"] == []
+    r = client.post("/api/stock/watchlist", json={"ticker": "005930", "name": "삼성전자"})
+    assert r.status_code == 201
+    wl = client.get("/api/stock/watchlist").json()["watchlist"]
+    assert wl[0]["ticker"] == "005930"
+    assert client.delete("/api/stock/watchlist/005930").status_code == 200
+    assert client.delete("/api/stock/watchlist/005930").status_code == 404
+
+def test_trade_alerts_history_empty(client):
+    assert client.get("/api/stock/trade-alerts?days=7").json()["alerts"] == []
Author	SHA1	Message	Date
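gahusb	9baea3a0e2	feat(stock): trade alert cooldown dedup + stock name resolution - Cooldown (TRADE_ALERT_COOLDOWN_HOURS, default 6h): suppresses re-alerts when the same ticker/condition oscillates between cleared and re-fired (set_alert_firing with mark_fired=False keeps the firing state without updating the fired time; counted as suppressed). - Stock name: even when the worker's firing entry has no name, the NAS resolves it via watchlist→portfolio→krx_master and includes it in the alert and history.	2026-07-03 16:14:51 +09:00
gahusb	80daa53558	feat(agent-office): add a one-line 'why buy/sell' rationale (💡) per condition to trade alerts	2026-07-03 16:14:51 +09:00
gahusb	35795abb0f	docs(README): document real-time trade alerts + the team rule on WSL worker /infra observability + the Alpine tzdata pitfall Covers stock real-time trade alerts (watchlist/trade_alert_state/history, webai contract, 1-minute Windows worker), agent-office trade alert notify + /watch bot and distributed worker observability, the team rule and tzdata under cautions, and an updated DB table list. (The existing harness engineering section is committed along with it.)	2026-07-03 11:01:24 +09:00
gahusb	4e47f5dd43	docs(CLAUDE.md): [team rule] every WSL docker worker must be observable in /infra (node_monitor WORKER_REGISTRY registration + heartbeat, 3 steps)	2026-07-03 10:48:17 +09:00
gahusb	246c8d5328	feat(agent-office): register the trade-monitor worker in node_monitor + change the trader link's from to the worker name WSL worker observability rule: every WSL docker worker must be monitorable from /infra. Registering trade-monitor (kind=trader) exposes it in /nodes and /infra. The hardcoded link from ('ai_trade') is replaced with w[name] so that each of multiple traders gets its own link. An undeployed worker has prev=None, so no down alert is raised.	2026-07-03 10:45:45 +09:00
gahusb	ed17193945	feat(stock): centralize climax parameters in trade alert exit_params (climax_vol_x 3.0, climax_close_pct 0.97)	2026-07-03 10:37:57 +09:00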
gahusb	c4b2fffeb4	docs(CLAUDE.md): add real-time trade alert endpoints to the catalog (stock watchlist/webai + agent-office notify/bot commands)	2026-07-02 20:09:07 +09:00
gahusb	c6540b2417	feat(agent-office): /watch /unwatch /watchlist bot commands	2026-07-02 20:05:59 +09:00
gahusb	2bce07c367	feat(agent-office): trade alert Telegram notify endpoint (you + wife)	2026-07-02 20:01:10 +09:00
gahusb	2906a2ae3e	feat(stock): webai report: edge diff→agent-office push→state/history (only on successful send)	2026-07-02 19:56:58 +09:00
gahusb	134b9e5d07	feat(stock): session classification + webai monitor-set endpoint Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EqCYBhvTcdeCTUDX3RhWx9	2026-07-02 19:51:57 +09:00
gahusb	bf84328d59	feat(stock): edge diff (new/cleared/re-armed) pure function	2026-07-02 19:48:45 +09:00
gahusb	d8b3267b98	feat(stock): monitor-set assembly logic	2026-07-02 15:51:06 +09:00
gahusb	89c52b1fb6	feat(stock): watchlist CRUD + alert history API	2026-07-02 15:45:14 +09:00
gahusb	01a8aee226	fix(stock): unify the trade alert history days filter format to ISO (fixes over-inclusion at the boundary day; review Important)	2026-07-02 15:43:22 +09:00
gahusb	b2c4ca0e0b	feat(stock): trade alert DB: watchlist/alert_state/history tables + helpers Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EqCYBhvTcdeCTUDX3RhWx9	2026-07-02 15:34:53 +09:00
gahusb	baa3a3075d	docs(stock): real-time trade alert BE implementation plan (9 tasks, TDD) watchlist/alert_state/history DB → CRUD API → monitor-set assembly → edge diff → webai monitor-set/report → agent-office Telegram (you + wife) → /watch bot commands → regression/deploy. For the worker (web-ai) and the tab (web-ui), only the contract (spec §5) is defined, to be handed off to their respective sessions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EqCYBhvTcdeCTUDX3RhWx9	2026-07-02 15:25:26 +09:00
gahusb	4cb9dc6a7c	docs(stock): real-time trade alert design spec (watchlist∪screener buy + exit+trailing sell, 1-minute Windows worker, NAS edge dedup) Six requirements finalized in brainstorming + architecture A (new Windows docker worker). TA and condition evaluation run on Windows; edge dedup state is persisted on the NAS (prevents restart spam). Defines the cross-repo contracts (webai monitor-set/report, agent-office notify, watchlist CRUD, heartbeat). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EqCYBhvTcdeCTUDX3RhWx9	2026-07-02 15:19:00 +09:00
gahusb	36e8d11060	fix(stock): fix the AI news report slipping by one day: correct asof to KST + inject the current date into the LLM Root cause: the stock container runs python:3.12-alpine without tzdata, so TZ=Asia/Seoul has no effect → date.today() returns the UTC date. The AI news report cron runs at 08:00 KST (= 23:00 UTC the previous day), so asof was computed as yesterday and the label, article window, and news_sentiment storage were all off by one day (on Mondays it was computed as Sunday UTC and even hit skip_weekend). - screener/router.py: add _today_kst() (= utcnow+9h, the holdings_intel idiom). The asof default of /snapshot/refresh and /snapshot/refresh-news-sentiment is now KST. - ai_news/analyzer.py: score_sentiment(asof=...) → states "today's date" at the top of the prompt so the LLM evaluates news as of the current date (user request). - ai_news/pipeline.py: refresh_daily threads asof through to score_sentiment. - Tests: two TDD Red→Green tests, the _today_kst KST correction and the analyzer asof injection. Existing pipeline mock signatures updated with asof. stock total 149 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EqCYBhvTcdeCTUDX3RhWx9	2026-07-02 14:38:51 +09:00
gahusb	db6fed72b3	feat(music-lab): pipeline hard-delete endpoint DELETE /api/music/pipeline/{id} With cancel alone (state→cancelled, removed only from the active/failed views), rows remained in the status=all view, so old dead pipelines could not be fully cleaned up. Adds hard delete via DELETE. - db.delete_pipeline(pid)→bool: deletes child rows (pipeline_feedback, pipeline_jobs) first, then video_pipelines (explicit cascade because SQLite FKs are not enforced). Returns whether the row existed. - DELETE /api/music/pipeline/{id}: 404 if missing, otherwise {"ok":true,"deleted":id}. No state guard (admin cleanup use, the same simple policy as cancel). - Three tests (delete + 404 + child-row cascade), TDD Red→Green. music-lab 152 passed. - Updated the CLAUDE.md endpoint catalog. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EqCYBhvTcdeCTUDX3RhWx9	2026-07-02 13:52:11 +09:00
gahusb	7cce5c422f	fix(agent-office): persist pipeline failure notification dedup in the DB (fixes re-alert spam on restart) youtube_publisher._notified_failed (an in-memory set) was lost on container restart, so existing failed pipelines (e.g. #3, which failed on an old video encoding version) were re-alerted as "new" on every restart; this spam bug is fixed by persisting to the notified_failed_pipelines table. Side bug fix: when failed polling raised an exception, the code mistook it for failed=[] and wiped the whole ledger → now early-returns on exception (ledger preserved). The in-progress *_pending approval dedup (_notified_state_per_pipeline) intentionally stays in memory (re-alerting approvals for live pipelines on restart is a useful reminder). Tests: added two tests, restart persistence and reproduction of a transient polling exception (TDD Red→Green). Leakage of the persistent table between tests, caused by DB_PATH being fixed at first import, is isolated with monkeypatch. All 140 agent-office tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EqCYBhvTcdeCTUDX3RhWx9	2026-07-01 15:20:07 +09:00
gahusb	94beecbfaf	docs(CLAUDE.md): agent-office 카탈로그에 /nodes 엔드포인트 + node_monitor.py 등재 분산 워커 관측 시스템 — GET /api/agent-office/nodes(heartbeat 생사+큐깊이+ dead-letter 집계, web-ui /infra 소비) 엔드포인트 표 추가 + 핵심파일에 node_monitor.py 추가. 상세는 infra_distributed_workers.md 메모리. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019LV86jBozkNhSFXJA412fq	2026-07-01 02:56:37 +09:00
gahusb	98b17f3a3a	fix(redis): bgsave fork 실패로 인한 쓰기 차단 해소 (--save "" + stop-writes off) 근본원인: NAS vm.overcommit_memory=0 + Committed_AS≈CommitLimit(98%)로 redis bgsave fork()가 거부되어 stop-writes-on-bgsave-error(기본 yes)가 모든 쓰기를 차단(6/29 20:36 이후). AOF가 durability를 담당하므로 실패하는 RDB 스냅샷을 비활성화(--save "")하고 stop-writes-on-bgsave-error no로 안전망 추가. 호스트 vm.overcommit_memory=1(sudo)은 별도 권장. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019LV86jBozkNhSFXJA412fq	2026-06-30 14:21:09 +09:00
gahusb	94cddccaa7	fix(agent-office): determine alive by heartbeat staleness + retry when a down/recovery transition send fails (final review I1·I2) I1: collect_status - determine alive from the ts age rather than from whether the heartbeat key exists. If age > NODE_STALE_THRESHOLD_SEC (90s, injectable via env), the node is dead even if the key exists. Added NODE_STALE_THRESHOLD_SEC=90 to config.py. I2: check_and_alert - if send_raw fails on a down/recovery transition, defer the _node_state update. The next cycle re-detects the same transition → retries the send (prevents losing down events). Tests: changed the _hb helper to default to the current time + 2 new tests (stale→dead, I2 retry regression). 14 passed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019LV86jBozkNhSFXJA412fq	2026-06-29 18:50:45 +09:00
gahusb	b49cc14ef3	fix(agent-office): update dead-letter _dl_notified only on successful send + guard collect_status exceptions (B4 review) - moved _dl_notified[name] = dl inside the if ok: block, so it is not updated when Telegram fails - added try/except around collect_status in check_and_alert, so the scheduler job survives - tests: moved import app.node_monitor as nm to the top of the file - tests: added regression test test_dl_notified_not_updated_on_telegram_failure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019LV86jBozkNhSFXJA412fq	2026-06-29 18:13:33 +09:00
gahusb	5d5ff27d29	feat(agent-office): node health 1-minute cron + Telegram alerts (down/recovery/dead-letter)	2026-06-29 18:06:38 +09:00
gahusb	2a0090a1d4	feat(agent-office): GET /api/agent-office/nodes endpoint Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019LV86jBozkNhSFXJA412fq	2026-06-29 18:01:00 +09:00
gahusb	ea1f0d103d	fix(agent-office): guard the node_monitor loop against exceptions + strengthen tests (B2 review) - wrapped the entire per-worker loop in try/except; on a Redis exception, set redis_ok=False + break (Blocker) - added UnicodeDecodeError to the heartbeat parsing except clause (Important) - hb.get('ts') or '' for safe handling of a null ts (Minor) - added 3 tests: paused fallback·processing aggregation·llen exception regression Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019LV86jBozkNhSFXJA412fq	2026-06-29 17:56:18 +09:00
gahusb	a3ae85cde1	feat(agent-office): node_monitor.collect_status (heartbeat + queue + dead-letter aggregation)	2026-06-29 17:50:16 +09:00
gahusb	363e95c5a9	chore(agent-office): redis dependency + REDIS_URL/dead-letter threshold settings	2026-06-29 17:44:45 +09:00
gahusb	c69b18243b	docs: add implementation plan for the distributed worker observability system (3-repo TDD plan) Part A (web-ai heartbeat) / Part B (agent-office aggregation + alerts) / Part C (web-ui Three.js dashboard). Each Part can be run and tested independently; the 2 contracts are locked as Global Constraints. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019LV86jBozkNhSFXJA412fq	2026-06-29 17:33:16 +09:00
gahusb	f0fad05f2d	docs: add design spec for the distributed worker observability system (NAS↔Windows) A 3-repo design covering heartbeat-based observation of music/video/image/insta-render + task-watcher + ai_trade, the agent-office /nodes aggregation API + Telegram alerts, and the web-ui Three.js pipeline visualization. Defines the heartbeat key schema + /nodes response schema as locked contracts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019LV86jBozkNhSFXJA412fq	2026-06-29 17:25:13 +09:00
gahusb	ed8ffdf343	docs: list co-gahusb in the service list, port, and nginx routing tables Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 03:52:30 +09:00
gahusb	c7036212e2	merge: co-gahusb DNS-rebinding 421 hotfix	2026-06-12 10:20:05 +09:00
gahusb	756d9fccf3	fix(co-gahusb): disable DNS-rebinding protection (resolves 421 on the public Host) - FastMCP automatically enables DNS rebinding protection on the default host (127.0.0.1) → allowed_hosts permits only localhost → the Host gahusb.synology.me forwarded by nginx gets 421. - The real security is the Bearer auth in front at nginx (401 before reaching MCP), so Host validation is disabled. - Added reproduction/regression tests + fixed the config.CO_BUS_KEY import-order isolation bug (23 passing). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 10:20:04 +09:00
gahusb	ea5cf49cea	merge: co-gahusb session collaboration team bus (MCP + Redis + advisory locks) - FastMCP streamable-http server (12 tools) + Bearer auth + Redis backend - messages/task board/locks/team_log, concurrent-write separation (ownership partition + locks) - registered in compose(18920)/nginx(/api/co/)/deploy + client wiring - 22 tests (all passing) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 07:51:00 +09:00
gahusb	d07a8dad76	feat(co-gahusb): BE client wiring (.mcp.json + role block + setup docs) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-12 07:34:08 +09:00
gahusb	d74bc189b5	feat(co-gahusb): register in the deploy SERVICES whitelist Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-12 07:32:10 +09:00
gahusb	d4405204f9	feat(co-gahusb): nginx public /api/co/ routing (Authorization forward, no-buffer) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-12 07:31:44 +09:00
gahusb	2c157334dc	feat(co-gahusb): register the docker-compose service (18920, depends_on redis) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-12 07:31:28 +09:00
gahusb	d840859fc9	fix(co-gahusb): update_task not_found guard for a nonexistent task_id	2026-06-12 07:30:03 +09:00
gahusb	e115eee159	feat(co-gahusb): FastMCP server (12 tools + Bearer auth + health)	2026-06-12 07:25:47 +09:00
gahusb	fc1ebf134d	docs(checkpoint): reflect completed oversight frontend deployment The ActivityTimeline frontend is now live on the NAS (deployed directly over SSH, bypassing the Z: mapping). This is a new commit on top of `56d0f5b`: feat/co-gahusb-team-bus depends on 56d0f5b as its base, so a new commit was made instead of an amend. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 07:23:45 +09:00
gahusb	d71937b6ee	feat(co-gahusb): team_log activity feed (capped, TDD)	2026-06-12 07:23:14 +09:00
gahusb	0cc4505af7	feat(co-gahusb): task board (create/claim/update/list, TDD)	2026-06-12 07:22:55 +09:00
gahusb	9c18f0a467	feat(co-gahusb): message inbox (post/read/mark_read, TDD)	2026-06-12 07:22:36 +09:00
gahusb	8212a51f90	feat(co-gahusb): advisory locks (acquire/release/heartbeat/list, TDD)	2026-06-12 07:20:30 +09:00
gahusb	0d466b235c	feat(co-gahusb): scaffold (Dockerfile·requirements·config)	2026-06-12 07:19:51 +09:00
gahusb	1129600341	docs: co-gahusb team bus implementation plan (11 tasks, TDD) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 01:31:06 +09:00
gahusb	2a0a2f3490	docs: co-gahusb session collaboration team bus design spec Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 01:26:11 +09:00
gahusb	56d0f5b8a8	docs(checkpoint): fully reflect 5/25~6/12 work + reorganize the board Added the items missed since 5/22 (tarot/saju split and creation, _shared logging, lotto v3 backtest, stock holdings intel, nginx CVE, insta card news v2 + autonomous issuance, agent oversight, music pipeline reliability) to the completed timeline. Reorganized the unfinished large feature (Video Studio frontend) + follow-up (music stuck detection) + backlog. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 01:18:48 +09:00
gahusb	796ac6d39f	test(agent-office): fix stale assertion in test_init_and_seed (fixed count→subset) The agent registry grew from 2 to 7, so the fixed len==2/{stock,music} assertion had gone stale. Changed it to verify a subset of the core seeds (robust to registry growth). This debt was flagged repeatedly in this session's audit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:48:58 +09:00
gahusb	18cea427be	docs(music): document the pipeline retry endpoint Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:46:04 +09:00
gahusb	6c178006d3	feat(agent-office): ytpub_retry Telegram callback → music-lab retry proxy Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:39:31 +09:00
gahusb	084e4f1b4d	feat(agent-office): youtube_publisher pipeline-failure Telegram notification + retry button Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:36:38 +09:00
gahusb	d048251a97	feat(agent-office): service_proxy pipeline_retry/list_failed_pipelines (+ music-lab status=failed filter) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:33:28 +09:00
gahusb	ef1a7a92fd	fix(music-lab): retry race guard (retrying transition) + failed_step validation + backoff empty-list guard - Fix 1: retry_pipeline transitions the state to 'retrying' right before bg.add_task → concurrent-retry 409 guard - Fix 2: added a called[pid/step] assert to test_retry_failed_pipeline_retriggers - Fix 3: return 409 if failed_step is not in STEPS (prevents a wrong prefix) - Fix 4: IndexError when STEP_RETRY_BACKOFF_SEC is an empty list → fall back to 0 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:31:19 +09:00
gahusb	44dbe7c426	feat(music-lab): POST /pipeline/{id}/retry: manually resume the failed step Resumes a terminal failed pipeline from its last failed step. If the step is publish and a youtube_video_id exists, returns 409 to prevent a duplicate upload. Added pythonpath=.. to pytest.ini (runs the TestClient tests without PYTHONPATH=..). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:23:24 +09:00
gahusb	e90e25d78f	feat(music-lab): orchestrator automatic step retry (excluding publish) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:20:29 +09:00
gahusb	d638666659	feat(music-lab): get_last_failed_step: determine the failed step for pipeline resumption Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:18:07 +09:00
gahusb	51eff1538e	docs(plan): music pipeline reliability/recovery implementation plan (7 tasks, TDD) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:12:33 +09:00
gahusb	ffb96de61d	docs(spec): music/YouTube pipeline reliability/recovery design Automatic step retry (excluding publish) + manual resume from the failed step for terminal failed pipelines (Telegram [Retry]). Covers the orchestrator + retry endpoint + youtube_publisher failure notification. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 00:08:01 +09:00