backend / Apr 20, 2026 / 10 min
canvas self-hosted gave us only rest polling. the fix was a spring + sqs poll chain with watermarks for a 15k/month qms pipeline.
The working Canvas integration was not a webhook. It was a Spring worker that reads qms-canvas-poll-trigger, calls Canvas with submitted_since, writes a per-assignment watermark, publishes qms-canvas-submission, and schedules the next trigger with a delayed SQS message.
The throughput target in our QMS docs was 15,000 submissions per month. That works out to roughly 500 submissions per day. The load-test design assumed a 4-hour peak window and treated 7 requests per second for 2 minutes as 12x the required peak ingestion rate. Moodle had a plugin webhook path. Canvas had a service-account token and the REST API.
The part that made polling tolerable was not the Canvas API call. It was the state around it: a database cursor per assignment, two SQS queues, a watchdog that reseeds silent chains every 30 minutes, and a chain id that kills old trigger loops after a reset.
Canvas gives you a per-assignment submissions endpoint. That is the ingestion boundary I would design around first, because the rest of the system becomes a cursor problem.
GET /api/v1/courses/{courseId}/assignments/{assignmentId}/submissions
    ?per_page=50
    &submitted_since={isoInstant}
That endpoint is the ingest surface in the Upthink backend. The poller does not read a course-level "all new submissions" feed. It loads active assignments from our database and calls Canvas per assignment with a cursor.
The local configuration makes the shape explicit. Canvas has 2 queue URLs, not a callback URL:
app:
  sqs:
    queue:
      canvas-submission-poll-trigger: http://sqs.ap-south-1.localhost:4566/000000000000/qms-canvas-poll-trigger
      canvas-submission: http://sqs.ap-south-1.localhost:4566/000000000000/qms-canvas-submission
  scheduler:
    canvas-poll-trigger-consumer-interval-ms: 30000
    canvas-watchdog-cron: "0 */30 * * * *"
That is the first constraint. The backend wakes every 30 seconds to check the trigger queue, and a watchdog runs every 30 minutes to find accounts whose poll chain stopped moving.
The submitted_since filter takes an ISO 8601 timestamp. In practice we still filter the response client-side because some Canvas environments can return stale submissions even when the query includes the cursor. The database watermark is the source of truth; the Canvas filter is just a way to reduce response size.
Cron is the wrong primary control loop when the cursor is per account and per assignment. The moment you have multiple API workers, retries, tenant isolation, and a reset button, a single scheduled method turns into a locking problem.
Distributed locking. With multiple API instances, 2 processes can poll the same Canvas account at the same time. You need a lock per account or a queue message per account. SQS already gave us the second option.
Fixed intervals. A cron interval is a deployment config. In the actual worker, the next delay is computed from consecutiveEmptyPolls: 120 seconds for the first 2 empty polls, then 300, 600, and 1200 seconds, plus up to 29 seconds of jitter. SQS DelaySeconds carries that decision on the next trigger.
Silent failures. If a self-feeding chain breaks, there may be no message left to retry. We store last_canvas_poll_at on lms_account, then the watchdog treats an account as stale after 10 minutes in cloud deploys. Local dev overrides that threshold to 0 because LocalStack SQS is wiped on restart while Postgres survives.
Operational overhead. We already had SQS consumers for Moodle submission ingestion, plagiarism, and passback. Canvas added 2 more queues: one trigger queue and one submission queue. That kept all long-running LMS ingestion inside the same worker model.
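The delay tiers above can be sketched as a small pure function. This is a minimal illustration, not the worker's actual code: the class and method names are hypothetical, the mapping from empty-poll count to tier (counts up to 2 stay at 120 seconds, then one tier per additional empty poll) is an assumption, and the clamp reflects the 900-second ceiling that SQS places on DelaySeconds.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; names and exact tier boundaries are assumptions. */
final class PollDelayPolicy {
    static final int MAX_SQS_DELAY_SECONDS = 900; // SQS rejects DelaySeconds above 900
    static final int MAX_JITTER_SECONDS = 29;

    /** Base delay before jitter, keyed on the number of consecutive empty polls. */
    static int baseDelaySeconds(int consecutiveEmptyPolls) {
        if (consecutiveEmptyPolls <= 2) return 120; // assumed: counts 0..2 stay hot
        if (consecutiveEmptyPolls == 3) return 300;
        if (consecutiveEmptyPolls == 4) return 600;
        return 1200;
    }

    /** Adds bounded jitter, then clamps to what SQS accepts. */
    static int delaySeconds(int consecutiveEmptyPolls, int jitterSeconds) {
        int jitter = Math.max(0, Math.min(jitterSeconds, MAX_JITTER_SECONDS));
        return Math.min(baseDelaySeconds(consecutiveEmptyPolls) + jitter, MAX_SQS_DELAY_SECONDS);
    }

    /** Picks a random jitter in [0, 29] so accounts do not wake in lockstep. */
    static int nextDelaySeconds(int consecutiveEmptyPolls) {
        int jitter = ThreadLocalRandom.current().nextInt(MAX_JITTER_SECONDS + 1);
        return delaySeconds(consecutiveEmptyPolls, jitter);
    }
}
```

Keeping the jitter as an explicit parameter in delaySeconds makes the tiers testable without seeding a random generator.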
A self-feeding queue has one rule: the poller attempts to schedule the next trigger before deleting the current one. In this codebase, the trigger queue is just the account-level clock, and the watchdog exists because that scheduling step can still fail.
qms-canvas-poll-trigger
  message: lmsAccountId, tenantId, consecutiveEmptyPolls, assignmentWatermarks, chainId
        |
        v
CanvasSubmissionPollConsumer
  - verifies tenant/account match before setting RLS
  - scans active courses and assignments
  - calls Canvas submissions endpoint with submitted_since
  - writes assignment watermarks
        |
        +--> qms-canvas-submission
        |      message: tenantId, lmsAccountId, courseRef, assignmentRef, studentRef
        |
        +--> delayed trigger back to qms-canvas-poll-trigger
Each Canvas account gets its own trigger chain. The poll-trigger consumer receives up to 5 messages with a 5-second long poll, extends visibility to 480 seconds before work starts, and processes each account separately.
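The ordering rule for a self-feeding chain (schedule the successor, then delete the current trigger) can be sketched without the AWS SDK. Everything here is hypothetical: TriggerQueue stands in for whatever wrapper the worker uses around SQS, and leaving the current message undeleted when the send fails is an assumption about the failure path, not something the codebase is confirmed to do.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the "send next before delete" rule. */
final class SelfFeedingTrigger {
    /** Stand-in for an SQS client wrapper; not a real AWS interface. */
    interface TriggerQueue {
        void sendDelayed(String body, int delaySeconds);
        void delete(String receiptHandle);
    }

    /** Returns true if the chain was continued and the current trigger removed. */
    static boolean finishCycle(TriggerQueue queue, String receiptHandle,
                               String nextBody, int delaySeconds) {
        try {
            queue.sendDelayed(nextBody, delaySeconds); // 1. schedule the successor first
        } catch (RuntimeException e) {
            // Assumed behavior: keep the current trigger so SQS can redeliver it
            // after the visibility timeout instead of losing the chain outright.
            return false;
        }
        queue.delete(receiptHandle); // 2. only then drop the current trigger
        return true;
    }

    /** In-memory queue used only to show the call order. */
    static final class RecordingQueue implements TriggerQueue {
        final List<String> calls = new ArrayList<>();
        boolean failSend;

        @Override
        public void sendDelayed(String body, int delaySeconds) {
            if (failSend) throw new IllegalStateException("send failed");
            calls.add("send:" + delaySeconds);
        }

        @Override
        public void delete(String receiptHandle) {
            calls.add("delete:" + receiptHandle);
        }
    }
}
```

If the delete ran first and the send then failed, no message would remain for the account, which is the silent-stop case the watchdog has to repair.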
The submission queue is separate because polling and file ingestion have different failure modes. qms-canvas-submission receives up to 10 messages at a time, also with a 5-second long poll. The app enforces MAX_RECEIVE_COUNT = 3, while Terraform sets maxReceiveCount = 4, so SQS moves a message to the DLQ only after it has already been received 4 times without being deleted.
The cloud queue settings are not exotic. The important bit is separating poll triggers from submission processing:
resource "aws_sqs_queue" "canvas_poll_trigger" {
  name                       = "${var.name_prefix}-canvas-poll-trigger"
  visibility_timeout_seconds = 300
  message_retention_seconds  = 86400
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.canvas_poll_trigger_dlq.arn
    maxReceiveCount     = 2
  })
}

resource "aws_sqs_queue" "canvas_submission" {
  name                       = "${var.name_prefix}-canvas-submission"
  visibility_timeout_seconds = 300
  message_retention_seconds  = 86400
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.canvas_submission_dlq.arn
    maxReceiveCount     = 4
  })
}
The failure mode to plan for: if the trigger chain breaks after a deploy, queue purge, or JVM crash, polling stops silently. We handled that with last_canvas_poll_at on the LMS account and a watchdog that runs on startup and every 30 minutes. When it reseeds an account, it writes a new canvas_poll_chain_id to the database and stamps that value on the trigger message. The consumer discards old trigger messages whose chain id no longer matches the database. This is how we kill parallel poll chains without purging SQS.
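A rough sketch of the watchdog's reseed step, using in-memory stand-ins rather than the real repository and SQS publisher. The Account and Trigger types are hypothetical, and treating an account with no recorded poll time as stale is an assumption. The isCurrent check mirrors the consumer-side rule: a trigger is honored when the database has no chain id yet or when the ids match.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical sketch; the real watchdog reads lms_account and publishes to SQS. */
final class CanvasWatchdogSketch {
    /** Minimal stand-in for the lms_account columns the watchdog touches. */
    static final class Account {
        final UUID id;
        Instant lastCanvasPollAt; // null means the account has never been polled
        UUID canvasPollChainId;

        Account(UUID id, Instant lastCanvasPollAt, UUID canvasPollChainId) {
            this.id = id;
            this.lastCanvasPollAt = lastCanvasPollAt;
            this.canvasPollChainId = canvasPollChainId;
        }
    }

    record Trigger(UUID lmsAccountId, UUID chainId) {}

    /** Rotates the chain id of every stale account and returns the triggers to enqueue. */
    static List<Trigger> reseedStale(List<Account> accounts, Instant now, Duration staleAfter) {
        List<Trigger> triggers = new ArrayList<>();
        for (Account a : accounts) {
            boolean stale = a.lastCanvasPollAt == null
                    || Duration.between(a.lastCanvasPollAt, now).compareTo(staleAfter) >= 0;
            if (!stale) continue;
            a.canvasPollChainId = UUID.randomUUID(); // triggers from older chains stop matching
            triggers.add(new Trigger(a.id, a.canvasPollChainId));
        }
        return triggers;
    }

    /** Consumer-side check: a mismatching chain id means the trigger is from a dead chain. */
    static boolean isCurrent(Trigger trigger, Account account) {
        return account.canvasPollChainId == null
                || account.canvasPollChainId.equals(trigger.chainId());
    }
}
```

With staleAfter set to zero, every account counts as stale, which matches the local-dev override where LocalStack loses its queues on restart.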
The poll trigger is not just tenant id and cursor. It carries enough state to make the chain resumable and discard stale messages:
public record CanvasPollTriggerMessage(
        UUID lmsAccountId,
        UUID tenantId,
        int consecutiveEmptyPolls,
        Map<String, String> assignmentWatermarks,
        UUID chainId
) {
    public CanvasPollTriggerMessage {
        if (assignmentWatermarks == null) assignmentWatermarks = Map.of();
        if (consecutiveEmptyPolls < 0) consecutiveEmptyPolls = 0;
    }
}
The first safety check happens before the worker sets the tenant RLS context. The trigger message contains both lmsAccountId and tenantId, but the worker treats those ids as untrusted until it has loaded the account without RLS and confirmed ownership:
// Load the account without RLS first; the ids in the message are still untrusted.
LmsAccount account = transactionTemplate.execute(status -> {
    LmsAccount a = lmsAccountReadPort.findByIdBypassRls(lmsAccountId)
            .filter(acc -> acc.getTenantId().equals(tenantId))
            .orElse(null);
    if (a != null) {
        // Ownership confirmed: only now scope the connection to the tenant.
        rlsConnectionManager.initializeRlsForTenant(tenantId);
    }
    return a;
});
if (account == null || !account.isOperational()) {
    return;
}
// Discard triggers from a chain that has since been replaced in the database.
UUID messageChainId = trigger.chainId();
UUID dbChainId = account.getCanvasPollChainId();
if (dbChainId != null && !dbChainId.equals(messageChainId)) {
    return;
}
The poller then resolves an assignment watermark, calls Canvas, publishes submission messages, and advances the watermark. This is the part that stopped repeated ingestion when Canvas returned the same rows again:
List<CanvasClient.CanvasSubmissionRef> submissions = canvasClient.getSubmissions(
                account.getBaseUrl(), token,
                course.lmsCourseRef(), assignment.getLmsAssignmentRef(),
                since)
        .stream()
        // Client-side guard: drop rows older than the watermark even if Canvas returned them.
        // getSubmissions accepts a null cursor, so the filter tolerates one too.
        .filter(s -> since == null || !s.submittedAt().isBefore(since))
        .toList();
if (submissions.isEmpty()) return 0;
Instant latestSubmittedAt = Instant.MIN;
List<CanvasSubmissionMessage> messages = new ArrayList<>(submissions.size());
for (var sub : submissions) {
    messages.add(new CanvasSubmissionMessage(
            tenantId,
            account.getId(),
            course.lmsCourseRef(),
            assignment.getLmsAssignmentRef(),
            sub.userId()));
    // Track the newest timestamp seen in this batch.
    if (sub.submittedAt().isAfter(latestSubmittedAt)) {
        latestSubmittedAt = sub.submittedAt();
    }
}
publishSubmissionMessages(messages);
// Move the cursor past the newest row so it is not matched again next cycle.
Instant nextWatermark = latestSubmittedAt.plusSeconds(1);
updatedWatermarks.put(assignment.getId().toString(), nextWatermark.toString());
That plusSeconds(1) matters. Canvas timestamps have second-level precision. Advancing by a millisecond still leaves the cursor inside the same timestamp bucket, which can cause the same submission to appear on the next cycle.
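A small model of why the one-second step is needed. The truncation to whole seconds is an assumption about how the cursor is effectively compared when timestamps carry second-level precision; it is not taken from Canvas internals, and the class name is hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical model of a cursor compared at whole-second granularity. */
final class WatermarkStep {
    /** True if a row with this timestamp would still match submitted_since=cursor. */
    static boolean matchesAgain(Instant submittedAt, Instant cursor) {
        // Assumed: sub-second parts of the cursor are ignored by the comparison.
        Instant bucket = cursor.truncatedTo(ChronoUnit.SECONDS);
        return !submittedAt.isBefore(bucket); // submitted_at >= cursor
    }

    /** The step the poller uses: jump to the next whole second. */
    static Instant nextWatermark(Instant latestSubmittedAt) {
        return latestSubmittedAt.plusSeconds(1);
    }
}
```

Under this model, a submission at 10:15:30 still matches a cursor of 10:15:30.001 because the cursor collapses back to 10:15:30, while a cursor of 10:15:31 excludes it.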
The Canvas API call follows pagination through the Link response header:
public List<CanvasSubmissionRef> getSubmissions(String baseUrl, String token,
                                                String courseId, String assignmentId,
                                                Instant since) {
    List<CanvasSubmissionRef> results = new ArrayList<>();
    String uri = "/api/v1/courses/" + courseId + "/assignments/" + assignmentId + "/submissions"
            + "?per_page=50"
            + (since != null ? "&submitted_since=" + since : "");
    while (uri != null) {
        var response = clientFor(baseUrl, token).get()
                .uri(uri)
                .exchangeToMono(resp -> {
                    // Keep the Link header next to the body so the loop can follow rel="next".
                    String linkHeader = resp.headers().asHttpHeaders().getFirst("Link");
                    return resp.bodyToMono(String.class)
                            .map(body -> Map.entry(body, linkHeader != null ? linkHeader : ""));
                })
                .block(TIMEOUT);
        if (response == null) break;
        // readTree throws a checked JsonProcessingException; its handling is not shown in this excerpt.
        JsonNode arr = objectMapper.readTree(response.getKey());
        for (JsonNode node : arr) {
            String submittedAt = node.path("submitted_at").asText(null);
            // Skip rows that have no submission timestamp.
            if (submittedAt == null || submittedAt.isBlank()) continue;
            results.add(new CanvasSubmissionRef(
                    node.path("id").asText(),
                    node.path("user_id").asText(),
                    Instant.parse(submittedAt)));
        }
        // Returns null when there is no usable next page, which ends the loop.
        uri = safeNextLink(response.getValue(), baseUrl);
    }
    return results;
}
The Link header is the part that catches people. Do not build ?page=2 URLs yourself: Canvas documents its pagination links as opaque, and the next link can carry a bookmark token instead of a page number. It returns a Link header like:
Link: <https://canvas.example.edu/api/v1/courses/123/submissions?...&page=bookmark:abc>; rel="next",
<https://canvas.example.edu/api/v1/courses/123/submissions?...&page=first>; rel="first"
Parse the rel="next" URL and follow it. The code also checks that absolute next-page URLs stay on the same origin as the configured Canvas base URL. If Canvas or a proxy returns a cross-origin next link, the worker stops pagination instead of letting WebClient follow an unexpected host.
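The body of safeNextLink is not shown in this post, so the following is only a plausible sketch of the parsing and same-origin check described above. The regular expression, the handling of relative links (passed through unchanged), and the default-port logic are all assumptions.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical reconstruction of a same-origin "next link" parser. */
final class LinkHeaderSketch {
    // Matches one entry such as: <https://host/path?page=x>; rel="next"
    private static final Pattern LINK =
            Pattern.compile("<([^>]+)>\\s*;\\s*rel=\"([^\"]+)\"");

    /** Returns the rel="next" URL if it stays on the base origin, otherwise null. */
    static String safeNextLink(String linkHeader, String baseUrl) {
        if (linkHeader == null || linkHeader.isBlank()) return null;
        Matcher m = LINK.matcher(linkHeader);
        while (m.find()) {
            if (!"next".equals(m.group(2))) continue;
            URI next;
            try {
                next = URI.create(m.group(1));
            } catch (IllegalArgumentException e) {
                return null; // malformed URL: stop paginating
            }
            // Relative links resolve against the configured host, so they pass.
            if (!next.isAbsolute()) return next.toString();
            return sameOrigin(next, URI.create(baseUrl)) ? next.toString() : null;
        }
        return null; // no rel="next" entry means this was the last page
    }

    private static boolean sameOrigin(URI a, URI b) {
        return a.getScheme() != null && a.getScheme().equalsIgnoreCase(b.getScheme())
                && a.getHost() != null && a.getHost().equalsIgnoreCase(b.getHost())
                && effectivePort(a) == effectivePort(b);
    }

    private static int effectivePort(URI u) {
        if (u.getPort() != -1) return u.getPort();
        return "https".equalsIgnoreCase(u.getScheme()) ? 443 : 80;
    }
}
```

Returning null for a cross-origin link ends the pagination loop, so the bearer token is never sent to a host other than the configured Canvas base URL.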
Canvas submissions enter the existing pipeline as references first, not as full payloads. The poller only enqueues identifiers, then a submission consumer fetches the full submission and attachment when it is ready to create the internal submission record.
public record CanvasSubmissionMessage(
        UUID tenantId,
        UUID lmsAccountId,
        String lmsCourseRef,
        String lmsAssignmentRef,
        String lmsStudentRef
) {}
The message intentionally avoids carrying attachment URLs. Canvas file URLs can expire or redirect to object storage, so the worker fetches the submission detail only when it is ready to process it:
CanvasClient.CanvasSubmissionDetail detail = canvasClient.getSingleSubmission(
        account.getBaseUrl(), token, lmsCourseRef, lmsAssignmentRef, lmsStudentRef);
if (detail == null || detail.attachments().isEmpty()) {
    return null;
}
// Only the first attachment is ingested in this flow.
CanvasClient.CanvasAttachment attachment = detail.attachments().get(0);
byte[] fileContent = canvasClient.downloadFile(account.getBaseUrl(), token, attachment.url());
return new CanvasSubmissionResult(
        lmsStudentRef,
        attachment.filename(),
        fileContent,
        detail.submittedAt());
The final handoff reuses the same internal submission creation path as the rest of the product. The only Canvas-specific part is that Canvas uses user_id as the student submission reference in this flow:
try {
    submitAssignmentUseCase.create(
            lmsAssignmentRef,
            tenantId,
            lmsStudentRef,
            result.lmsStudentId(),
            result.submittedAt(),
            SubmissionStatus.RECEIVED,
            result.fileName(),
            result.fileContent(),
            "CANVAS_SUBMISSION_QUEUE"
    );
} catch (DataIntegrityViolationException e) {
    log.info("Duplicate Canvas submission ignored");
}
The duplicate handling is deliberately downstream. The poller should reduce duplicate messages, but the database write path still has to tolerate the same Canvas student and assignment arriving twice.
I would keep the 2-queue split and change the parts that are currently hidden in code constants.
Make the delay policy configurable per account. The current adaptive delay is hardcoded: 120 seconds for the first 2 empty polls, then 300, 600, and 1200 seconds with jitter. The send step clamps DelaySeconds to 900, so the effective maximum delay is 15 minutes. That is fine for normal university workloads, but deadline-heavy courses deserve their own profile. I would move those thresholds onto the LMS account so a high-volume tenant can stay hot without making every Canvas account poll aggressively.
EventBridge Scheduler instead of self-feeding SQS. The watchdog and chain id logic work, but they exist because a self-feeding queue can go quiet. EventBridge Scheduler would give each Canvas account a scheduled rule that targets SQS directly. The tradeoff is a larger control-plane surface. In this codebase, SQS was already the worker model, so the self-feeding design kept deployment simpler.
Capture Canvas API drift as metrics. The client-side stale-submission filter was added because the API did not always behave like the docs implied. I would add a counter for "returned by Canvas but rejected by local watermark" per account and assignment. That would show which institutions have noisy Canvas responses before they turn into queue volume.
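One way the drift counter could look, kept to plain JDK types so it stands alone. The class and method names are hypothetical; in a Spring service this would more likely be a tagged counter from the metrics library already in use, and per-assignment tags may need a cardinality limit.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical counter for rows Canvas returned but the local watermark rejected. */
final class StaleSubmissionCounter {
    record Key(String lmsAccountId, String assignmentId) {}

    private final Map<Key, LongAdder> rejected = new ConcurrentHashMap<>();

    /** Call once per submission dropped by the client-side watermark filter. */
    void recordRejected(String lmsAccountId, String assignmentId) {
        rejected.computeIfAbsent(new Key(lmsAccountId, assignmentId), k -> new LongAdder())
                .increment();
    }

    /** Current count for one account and assignment; zero if nothing was rejected. */
    long rejectedCount(String lmsAccountId, String assignmentId) {
        LongAdder adder = rejected.get(new Key(lmsAccountId, assignmentId));
        return adder == null ? 0 : adder.sum();
    }
}
```

The poller would call recordRejected from the same place it applies the submitted_at filter, so the counter reflects exactly the rows the filter removed.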
The design limit I would document clearly: this is not real-time ingestion. In the current code, an idle Canvas account backs off to an effective 15-minute SQS delay, and the watchdog only checks stale chains every 30 minutes. That was acceptable for our grading workflow. It would be wrong for a product that promises seconds-level delivery.
why doesn't self-hosted canvas lms support real-time event streaming?
in our setup, the deployable integration surface was a canvas personal access token and the rest api. the hosted canvas event-streaming path depends on infrastructure we did not control, and the self-hosted instance did not give us a reliable webhook subscription surface for submissions. that left polling as the only path we could operate ourselves. the important design shift was treating polling as a durable queue workflow, not as a cron script.
can you use canvas lms webhooks in a self-hosted deployment?
do not assume you can. canvas has event and data-streaming concepts, but a self-hosted institution may not expose the services or admin surface you need to subscribe to submission events. for this project we confirmed the available path was rest api access through a service account. the production design therefore avoided inbound canvas callbacks entirely and stored canvas state in our own database and sqs queues.
what is a self-feeding sqs queue and why use it instead of a cron job?
a self-feeding sqs queue is a polling loop where each trigger message schedules the next trigger after it finishes the current poll cycle. in our implementation, the trigger message contains the lms account id, tenant id, empty-poll count, assignment watermarks, and a chain id. the consumer reads the trigger, scans active canvas assignments, publishes one canvas-submission message per new student submission, then sends a delayed trigger back to the same queue. compared to cron: sqs owns retries and dead-letter queues, delay seconds can change per account, and the watchdog can reseed a broken chain without adding a second scheduler model.
how do you handle duplicate submissions when polling the canvas api?
canvas api responses are not guaranteed to respect the cursor perfectly, so the poller uses three layers. first, it stores a per-assignment last_submission_at watermark in the database. second, it applies a client-side submitted_at >= since filter after reading the canvas response. third, it advances the next watermark by one second because canvas timestamps have second-level precision. the downstream processor still treats duplicates as possible and ignores database uniqueness collisions when the same student and assignment arrive again.