feat: Download the uploaded files (#31068)

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-06-26 16:02:18 +08:00 · 2026-01-19 16:48:13 +08:00
parent 2d4289a925
commit 62ac02a568
20 changed files with 1226 additions and 10 deletions
@@ -0,0 +1,52 @@
+## Purpose
+
+`api/controllers/console/datasets/datasets_document.py` contains the console (authenticated) APIs for managing dataset documents (list/create/update/delete, processing controls, estimates, etc.).
+
+## Storage model (uploaded files)
+
+- For local file uploads into a knowledge base, the binary is stored via `extensions.ext_storage.storage` under the key:
+  - `upload_files/<tenant_id>/<uuid>.<ext>`
+- File metadata is stored in the `upload_files` table (`UploadFile` model), keyed by `UploadFile.id`.
+- Dataset `Document` records reference the uploaded file via:
+  - `Document.data_source_info.upload_file_id`
+
+## Download endpoint
+
+- `GET /datasets/<dataset_id>/documents/<document_id>/download`
+
+  - Only supported when `Document.data_source_type == "upload_file"`.
+  - Performs dataset permission + tenant checks via `DocumentResource.get_document(...)`.
+  - Delegates `Document -> UploadFile` validation and signed URL generation to `DocumentService.get_document_download_url(...)`.
+  - Applies `cloud_edition_billing_rate_limit_check("knowledge")` to match other KB operations.
+  - Response body is **only**: `{ "url": "<signed-url>" }`.
+
+- `POST /datasets/<dataset_id>/documents/download-zip`
+
+  - Accepts `{ "document_ids": ["..."] }` (upload-file only).
+  - Returns `application/zip` as a single attachment download.
+  - Rationale: browsers often block multiple automatic downloads; a ZIP avoids that limitation.
+  - Applies `cloud_edition_billing_rate_limit_check("knowledge")`.
+  - Delegates dataset permission checks, document/upload-file validation, and download-name generation to
+    `DocumentService.prepare_document_batch_download_zip(...)` before streaming the ZIP.
+
+## Verification plan
+
+- Upload a document from a local file into a dataset.
+- Call the download endpoint and confirm it returns a signed URL.
+- Open the URL and confirm:
+  - Response headers force download (`Content-Disposition`), and
+  - Downloaded bytes match the uploaded file.
+- Select multiple uploaded-file documents and download as ZIP; confirm all selected files exist in the archive.
+
+## Shared helper
+
+- `DocumentService.get_document_download_url(document)` resolves the `UploadFile` and signs a download URL.
+- `DocumentService.prepare_document_batch_download_zip(...)` performs dataset permission checks, batches
+  document + upload file lookups, preserves request order, and generates the client-visible ZIP filename.
+- Internal helpers now live in `DocumentService` (`_get_upload_file_id_for_upload_file_document(...)`,
+  `_get_upload_file_for_upload_file_document(...)`, `_get_upload_files_by_document_id_for_zip_download(...)`).
+- ZIP packing is handled by `FileService.build_upload_files_zip_tempfile(...)`, which also:
+  - sanitizes entry names to avoid path traversal, and
+  - deduplicates names while preserving extensions (e.g., `doc.txt` → `doc (1).txt`).
+    Streaming the response and deferring cleanup is handled by the route via `send_file(path, ...)` + `ExitStack` +
+    `response.call_on_close(...)` (the file is deleted when the response is closed).
@@ -0,0 +1,18 @@
+## Purpose
+
+`api/services/dataset_service.py` hosts dataset/document service logic used by console and API controllers.
+
+## Batch document operations
+
+- Batch document workflows should avoid N+1 database queries by using set-based lookups.
+- Tenant checks must be enforced consistently across dataset/document operations.
+- `DocumentService.get_documents_by_ids(...)` fetches documents for a dataset using `id.in_(...)`.
+- `FileService.get_upload_files_by_ids(...)` performs tenant-scoped batch lookup for `UploadFile` (dedupes ids with `set(...)`).
+- `DocumentService.get_document_download_url(...)` and `prepare_document_batch_download_zip(...)` handle
+  dataset/document permission checks plus `Document -> UploadFile` validation for download endpoints.
+
+## Verification plan
+
+- Exercise document list and download endpoints that use the service helpers.
+- Confirm batch download uses constant query count for documents + upload files.
+- Request a ZIP with a missing document id and confirm a 404 is returned.
@@ -0,0 +1,35 @@
+## Purpose
+
+`api/services/file_service.py` owns business logic around `UploadFile` objects: upload validation, storage persistence,
+previews/generators, and deletion.
+
+## Key invariants
+
+- All storage I/O goes through `extensions.ext_storage.storage`.
+- Uploaded file keys follow: `upload_files/<tenant_id>/<uuid>.<ext>`.
+- Upload validation is enforced in `FileService.upload_file(...)` (blocked extensions, size limits, dataset-only types).
+
+## Batch lookup helpers
+
+- `FileService.get_upload_files_by_ids(tenant_id, upload_file_ids)` is the canonical tenant-scoped batch loader for
+  `UploadFile`.
+
+## Dataset document download helpers
+
+The dataset document download/ZIP endpoints now delegate “Document → UploadFile” validation and permission checks to
+`DocumentService` (`api/services/dataset_service.py`). `FileService` stays focused on generic `UploadFile` operations
+(uploading, previews, deletion), plus generic ZIP serving.
+
+### ZIP serving
+
+- `FileService.build_upload_files_zip_tempfile(...)` builds a ZIP from `UploadFile` objects and yields a seeked
+  tempfile **path** so callers can stream it (e.g., `send_file(path, ...)`) without hitting "read of closed file"
+  issues from file-handle lifecycle during streamed responses.
+- Flask `send_file(...)` and the `ExitStack`/`call_on_close(...)` cleanup pattern are handled in the route layer.
+
+## Verification plan
+
+- Unit: `api/tests/unit_tests/controllers/console/datasets/test_datasets_document_download.py`
+  - Verify signed URL generation for upload-file documents and ZIP download behavior for multiple documents.
+- Unit: `api/tests/unit_tests/services/test_file_service_zip_and_lookup.py`
+  - Verify ZIP packing produces a valid, openable archive and preserves file content.
@@ -0,0 +1,28 @@
+## Purpose
+
+Unit tests for the console dataset document download endpoint:
+
+- `GET /datasets/<dataset_id>/documents/<document_id>/download`
+
+## Testing approach
+
+- Uses `Flask.test_request_context()` and calls the `Resource.get(...)` method directly.
+- Monkeypatches console decorators (`login_required`, `setup_required`, rate limit) to no-ops to keep the test focused.
+- Mocks:
+  - `DatasetService.get_dataset` / `check_dataset_permission`
+  - `DocumentService.get_document` for single-file download tests
+  - `DocumentService.get_documents_by_ids` + `FileService.get_upload_files_by_ids` for ZIP download tests
+  - `FileService.get_upload_files_by_ids` for `UploadFile` lookups in single-file tests
+  - `services.dataset_service.file_helpers.get_signed_file_url` to return a deterministic URL
+- Document mocks include `id` fields so batch lookups can map documents by id.
+
+## Covered cases
+
+- Success returns `{ "url": "<signed>" }` for upload-file documents.
+- 404 when document is not `upload_file`.
+- 404 when `upload_file_id` is missing.
+- 404 when referenced `UploadFile` row does not exist.
+- 403 when document tenant does not match current tenant.
+- Batch ZIP download returns `application/zip` for upload-file documents.
+- Batch ZIP download rejects non-upload-file documents.
+- Batch ZIP download uses a random `.zip` attachment name (`download_name`), so tests only assert the suffix.
@@ -0,0 +1,18 @@
+## Purpose
+
+Unit tests for `api/services/file_service.py` helper methods that are not covered by higher-level controller tests.
+
+## What’s covered
+
+- `FileService.build_upload_files_zip_tempfile(...)`
+  - ZIP entry name sanitization (no directory components / traversal)
+  - name deduplication while preserving extensions
+  - writing streamed bytes from `storage.load(...)` into ZIP entries
+  - yields a tempfile path so callers can open/stream the ZIP without holding a live file handle
+- `FileService.get_upload_files_by_ids(...)`
+  - returns `{}` for empty id lists
+  - returns an id-keyed mapping for non-empty lists
+
+## Notes
+
+- These tests intentionally stub `storage.load` and `db.session.scalars(...).all()` to avoid needing a real DB/storage.