mirror of
https://github.com/YFGaia/dify-plus.git
synced 2026-06-26 16:02:18 +08:00
feat: Download the uploaded files (#31068)
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@@ -0,0 +1,52 @@
## Purpose

`api/controllers/console/datasets/datasets_document.py` contains the console (authenticated) APIs for managing dataset documents (list/create/update/delete, processing controls, estimates, etc.).

## Storage model (uploaded files)

- For local file uploads into a knowledge base, the binary is stored via `extensions.ext_storage.storage` under the key:
  - `upload_files/<tenant_id>/<uuid>.<ext>`
- File metadata is stored in the `upload_files` table (`UploadFile` model), keyed by `UploadFile.id`.
- Dataset `Document` records reference the uploaded file via:
  - `Document.data_source_info.upload_file_id`

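The storage model above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `build_upload_key`, `UploadFileRow`, `DocumentRow`, and `resolve_upload_file_id` are hypothetical names, and the dataclasses only stand in for the real `UploadFile` and `Document` models.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def build_upload_key(tenant_id: str, extension: str) -> str:
    """Return a key shaped like upload_files/<tenant_id>/<uuid>.<ext>."""
    return f"upload_files/{tenant_id}/{uuid.uuid4()}.{extension}"


@dataclass
class UploadFileRow:
    """Hypothetical stand-in for the UploadFile model (metadata row)."""
    id: str
    tenant_id: str
    key: str


@dataclass
class DocumentRow:
    """Hypothetical stand-in for the dataset Document model."""
    id: str
    data_source_type: str
    data_source_info: dict = field(default_factory=dict)


def resolve_upload_file_id(document: DocumentRow) -> Optional[str]:
    """Follow Document.data_source_info.upload_file_id for upload-file documents."""
    if document.data_source_type != "upload_file":
        return None
    return document.data_source_info.get("upload_file_id")
```

The lookup returns `None` for any other data source type, which is the condition the download endpoints treat as "not downloadable".
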
## Download endpoint

- `GET /datasets/<dataset_id>/documents/<document_id>/download`

  - Only supported when `Document.data_source_type == "upload_file"`.
  - Performs dataset permission + tenant checks via `DocumentResource.get_document(...)`.
  - Delegates `Document -> UploadFile` validation and signed URL generation to `DocumentService.get_document_download_url(...)`.
  - Applies `cloud_edition_billing_rate_limit_check("knowledge")` to match other KB operations.
  - Response body is **only**: `{ "url": "<signed-url>" }`.

- `POST /datasets/<dataset_id>/documents/download-zip`

  - Accepts `{ "document_ids": ["..."] }` (upload-file only).
  - Returns `application/zip` as a single attachment download.
  - Rationale: browsers often block multiple automatic downloads; a ZIP avoids that limitation.
  - Applies `cloud_edition_billing_rate_limit_check("knowledge")`.
  - Delegates dataset permission checks, document/upload-file validation, and download-name generation to
    `DocumentService.prepare_document_batch_download_zip(...)` before streaming the ZIP.

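The decision flow of the single-document endpoint might look something like the sketch below. It is framework-free and purely illustrative: `NotFound`, `Forbidden`, `get_download_payload`, and the dict-based document are assumptions made for the example, and the order of the checks is a guess rather than something this commit confirms.

```python
from typing import Callable, Dict, Mapping


class NotFound(Exception):
    """Illustrative stand-in for an HTTP 404 error."""
    status = 404


class Forbidden(Exception):
    """Illustrative stand-in for an HTTP 403 error."""
    status = 403


def get_download_payload(
    document: Mapping,
    current_tenant_id: str,
    upload_files: Mapping[str, object],
    sign_url: Callable[[str], str],
) -> Dict[str, str]:
    """Validate Document -> UploadFile and return the {"url": ...} response body."""
    if document["tenant_id"] != current_tenant_id:
        raise Forbidden("document belongs to another tenant")
    if document["data_source_type"] != "upload_file":
        raise NotFound("document is not backed by an uploaded file")
    upload_file_id = (document.get("data_source_info") or {}).get("upload_file_id")
    if not upload_file_id:
        raise NotFound("document has no upload_file_id")
    if upload_file_id not in upload_files:
        raise NotFound("referenced UploadFile does not exist")
    # The response body carries the signed URL and nothing else.
    return {"url": sign_url(upload_file_id)}
```

Keeping the response to a single `url` field means the client only has to follow the link; the file bytes never pass through this endpoint.
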
## Verification plan

- Upload a document from a local file into a dataset.
- Call the download endpoint and confirm it returns a signed URL.
- Open the URL and confirm:
  - Response headers force download (`Content-Disposition`), and
  - Downloaded bytes match the uploaded file.
- Select multiple uploaded-file documents and download as ZIP; confirm all selected files exist in the archive.

## Shared helper

- `DocumentService.get_document_download_url(document)` resolves the `UploadFile` and signs a download URL.
- `DocumentService.prepare_document_batch_download_zip(...)` performs dataset permission checks, batches
  document + upload file lookups, preserves request order, and generates the client-visible ZIP filename.
- Internal helpers now live in `DocumentService` (`_get_upload_file_id_for_upload_file_document(...)`,
  `_get_upload_file_for_upload_file_document(...)`, `_get_upload_files_by_document_id_for_zip_download(...)`).
- ZIP packing is handled by `FileService.build_upload_files_zip_tempfile(...)`, which also:
  - sanitizes entry names to avoid path traversal, and
  - deduplicates names while preserving extensions (e.g., `doc.txt` → `doc (1).txt`).

Streaming the response and deferring cleanup are handled by the route via `send_file(path, ...)` + `ExitStack` +
`response.call_on_close(...)` (the file is deleted when the response is closed).

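The entry-name handling described above could be approximated as follows. This is a sketch under assumptions, not the `FileService` implementation: `sanitize_entry_name`, `dedupe_entry_name`, and the `"file"` fallback name are hypothetical.

```python
import posixpath
from typing import Set


def sanitize_entry_name(name: str) -> str:
    """Drop directory components so an entry cannot escape the extraction root."""
    # Normalise Windows separators, then keep only the final path component.
    base = posixpath.basename(name.replace("\\", "/"))
    if base in ("", ".", ".."):
        return "file"  # hypothetical fallback for names with no usable component
    return base


def dedupe_entry_name(name: str, used: Set[str]) -> str:
    """Return a name not yet in `used`, inserting ' (n)' before the extension."""
    stem, ext = posixpath.splitext(name)
    candidate = name
    counter = 1
    while candidate in used:
        candidate = f"{stem} ({counter}){ext}"
        counter += 1
    used.add(candidate)
    return candidate
```

Sanitizing before deduplicating matters: two uploads named `a/doc.txt` and `b/doc.txt` only collide once their directory parts are removed.
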
@@ -0,0 +1,18 @@
## Purpose

`api/services/dataset_service.py` hosts dataset/document service logic used by console and API controllers.

## Batch document operations

- Batch document workflows should avoid N+1 database queries by using set-based lookups.
- Tenant checks must be enforced consistently across dataset/document operations.
- `DocumentService.get_documents_by_ids(...)` fetches documents for a dataset using `id.in_(...)`.
- `FileService.get_upload_files_by_ids(...)` performs tenant-scoped batch lookup for `UploadFile` (dedupes ids with `set(...)`).
- `DocumentService.get_document_download_url(...)` and `prepare_document_batch_download_zip(...)` handle
  dataset/document permission checks plus `Document -> UploadFile` validation for download endpoints.

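To illustrate the set-based lookup idea, here is a small sketch that uses the standard-library `sqlite3` module in place of the project's SQLAlchemy `id.in_(...)` query. The `documents` table, its columns, and both function names are assumptions made for the example; `LookupError` stands in for whatever the service raises to produce a 404.

```python
import sqlite3
from typing import Dict, List, Sequence


def fetch_documents_by_ids(
    conn: sqlite3.Connection, dataset_id: str, document_ids: Sequence[str]
) -> Dict[str, str]:
    """Load all requested documents with one IN (...) query instead of one query per id."""
    if not document_ids:
        return {}
    unique_ids = list(dict.fromkeys(document_ids))  # dedupe, keep first-seen order
    placeholders = ", ".join("?" for _ in unique_ids)
    rows = conn.execute(
        f"SELECT id, name FROM documents WHERE dataset_id = ? AND id IN ({placeholders})",
        [dataset_id, *unique_ids],
    ).fetchall()
    return {row[0]: row[1] for row in rows}


def order_like_request(document_ids: Sequence[str], found: Dict[str, str]) -> List[str]:
    """Return names in request order; a missing id is an error (a 404 in the endpoint)."""
    missing = [doc_id for doc_id in document_ids if doc_id not in found]
    if missing:
        raise LookupError(f"documents not found: {missing}")
    return [found[doc_id] for doc_id in document_ids]
```

Because the database does not guarantee row order for an `IN` query, the request order is restored from the id-keyed mapping afterwards.
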
## Verification plan

- Exercise document list and download endpoints that use the service helpers.
- Confirm batch download uses constant query count for documents + upload files.
- Request a ZIP with a missing document id and confirm a 404 is returned.

@@ -0,0 +1,35 @@
## Purpose

`api/services/file_service.py` owns business logic around `UploadFile` objects: upload validation, storage persistence,
previews/generators, and deletion.

## Key invariants

- All storage I/O goes through `extensions.ext_storage.storage`.
- Uploaded file keys follow: `upload_files/<tenant_id>/<uuid>.<ext>`.
- Upload validation is enforced in `FileService.upload_file(...)` (blocked extensions, size limits, dataset-only types).

## Batch lookup helpers

- `FileService.get_upload_files_by_ids(tenant_id, upload_file_ids)` is the canonical tenant-scoped batch loader for
  `UploadFile`.

## Dataset document download helpers

The dataset document download/ZIP endpoints now delegate “Document → UploadFile” validation and permission checks to
`DocumentService` (`api/services/dataset_service.py`). `FileService` stays focused on generic `UploadFile` operations
(uploading, previews, deletion), plus generic ZIP serving.

### ZIP serving

- `FileService.build_upload_files_zip_tempfile(...)` builds a ZIP from `UploadFile` objects and yields a seeked
  tempfile **path** so callers can stream it (e.g., `send_file(path, ...)`) without hitting "read of closed file"
  issues from file-handle lifecycle during streamed responses.
- Flask `send_file(...)` and the `ExitStack`/`call_on_close(...)` cleanup pattern are handled in the route layer.

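The "yield a path, clean up later" pattern can be sketched with the standard library alone. The context manager below is an illustration, not `FileService.build_upload_files_zip_tempfile(...)` itself: its name, signature, and the `(name, chunks)` input shape are assumptions.

```python
import os
import tempfile
import zipfile
from contextlib import ExitStack, contextmanager
from typing import Iterable, Iterator, Tuple


@contextmanager
def build_zip_tempfile(entries: Iterable[Tuple[str, Iterable[bytes]]]) -> Iterator[str]:
    """Write entries into a temporary ZIP and yield its path; delete the file on exit."""
    fd, path = tempfile.mkstemp(suffix=".zip")
    os.close(fd)  # only the path is handed out, never an open handle
    try:
        with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
            for name, chunks in entries:
                # Stream each entry chunk by chunk rather than loading it whole.
                with archive.open(name, "w") as member:
                    for chunk in chunks:
                        member.write(chunk)
        # The archive is fully written and closed before the caller sees the path.
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


def open_zip_with_deferred_cleanup(entries):
    """Enter the context via ExitStack so cleanup can run later, e.g. on response close."""
    stack = ExitStack()
    path = stack.enter_context(build_zip_tempfile(entries))
    return path, stack.close
```

In a route, the returned `stack.close` callable is the kind of function one would register with `response.call_on_close(...)` after `send_file(path, ...)`, so the tempfile outlives the handler but not the response.
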
## Verification plan

- Unit: `api/tests/unit_tests/controllers/console/datasets/test_datasets_document_download.py`
  - Verify signed URL generation for upload-file documents and ZIP download behavior for multiple documents.
- Unit: `api/tests/unit_tests/services/test_file_service_zip_and_lookup.py`
  - Verify ZIP packing produces a valid, openable archive and preserves file content.

@@ -0,0 +1,28 @@
## Purpose

Unit tests for the console dataset document download endpoint:

- `GET /datasets/<dataset_id>/documents/<document_id>/download`

## Testing approach

- Uses `Flask.test_request_context()` and calls the `Resource.get(...)` method directly.
- Monkeypatches console decorators (`login_required`, `setup_required`, rate limit) to no-ops to keep the test focused.
- Mocks:
  - `DatasetService.get_dataset` / `check_dataset_permission`
  - `DocumentService.get_document` for single-file download tests
  - `DocumentService.get_documents_by_ids` + `FileService.get_upload_files_by_ids` for ZIP download tests
  - `FileService.get_upload_files_by_ids` for `UploadFile` lookups in single-file tests
  - `services.dataset_service.file_helpers.get_signed_file_url` to return a deterministic URL
- Document mocks include `id` fields so batch lookups can map documents by id.

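The "deterministic URL" mocking idea can be shown in miniature with `unittest.mock`. Everything here is a stand-in: the `file_helpers` namespace, its one-argument signer, and `download_payload` are hypothetical and only mimic the shape of the code under test.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for the module that signs download URLs.
file_helpers = SimpleNamespace(
    get_signed_file_url=lambda upload_file_id: f"https://real.example/{upload_file_id}?sig=random"
)


def download_payload(upload_file_id: str) -> dict:
    """Code under test: looks the signer up at call time, so patching takes effect."""
    return {"url": file_helpers.get_signed_file_url(upload_file_id)}


def run_with_deterministic_signer(upload_file_id: str) -> dict:
    """Patch the signer so the returned URL is fixed and easy to assert on."""
    with patch.object(
        file_helpers, "get_signed_file_url", return_value="https://signed.example/file"
    ) as signer:
        payload = download_payload(upload_file_id)
    signer.assert_called_once_with(upload_file_id)
    return payload
```

Patching the name where it is looked up, rather than where it is defined, is what makes the replacement visible to the code under test; the original signer is restored when the `with` block exits.
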
## Covered cases

- Success returns `{ "url": "<signed>" }` for upload-file documents.
- 404 when document is not `upload_file`.
- 404 when `upload_file_id` is missing.
- 404 when referenced `UploadFile` row does not exist.
- 403 when document tenant does not match current tenant.
- Batch ZIP download returns `application/zip` for upload-file documents.
- Batch ZIP download rejects non-upload-file documents.
- Batch ZIP download uses a random `.zip` attachment name (`download_name`), so tests only assert the suffix.

@@ -0,0 +1,18 @@
## Purpose

Unit tests for `api/services/file_service.py` helper methods that are not covered by higher-level controller tests.

## What’s covered

- `FileService.build_upload_files_zip_tempfile(...)`
  - ZIP entry name sanitization (no directory components / traversal)
  - name deduplication while preserving extensions
  - writing streamed bytes from `storage.load(...)` into ZIP entries
  - yields a tempfile path so callers can open/stream the ZIP without holding a live file handle
- `FileService.get_upload_files_by_ids(...)`
  - returns `{}` for empty id lists
  - returns an id-keyed mapping for non-empty lists

## Notes

- These tests intentionally stub `storage.load` and `db.session.scalars(...).all()` to avoid needing a real DB/storage.