A walkthrough of understanding a new codebase at work
Overview¶
I will walk through understanding a codebase without any plan first, then review how I approached it after I'm done.
Raw Walkthrough¶
Right now, I'm trying to integrate my 80% complete independent feature into the main codebase.
The codebase uses Celery (with Redis as its message broker, running in a separate Docker container) to manage the lifecycles of asynchronous tasks/processes. The codebase also uses FastAPI and its native "wrapper" around websockets (a developer convenience; under the hood it is the Starlette websocket implementation) to expose data to frontend UIs. Client-server communication is hence a mix of conventional `GET` and `POST` methods, combined with "live-streamed" data.
How does the data get "live-streamed"?
A websocket connection is established. Conventional websocket boilerplate (not exactly boilerplate, but I don't have a better word for it right now) is added, such as token authentication and websocket connection management.
The core of the websocket is that (1) the user can send a JSON body over the active websocket connection, detailing an enum of actions the user can take, and (2) a continuous polling task reads ongoing task details from the Redis database, packages them into a Pydantic model (basically UI DTOs), and sends them to the client via the websocket at constant intervals.
User Session Management¶
There is also an additional feature to link the task data with users. The codebase uses Mixpanel, a product analytics tool that exposes a really simple Python SDK for accessing its API. We use the SDK and corresponding API to call methods that track the user's actions and session metadata while they use the VS Code extension. Metadata like the user `session_id` is then passed as a parameter to functions throughout the codebase.
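For context, basic usage of the Mixpanel Python SDK looks roughly like this; the token, event name, and properties below are made-up placeholders, not the codebase's actual values.

```python
from mixpanel import Mixpanel

# hypothetical project token; the real one would come from config
mp = Mixpanel("YOUR_PROJECT_TOKEN")

# track an action keyed by the user's session_id, with session metadata
mp.track("session-1234", "file_analyzed", {
    "extension_version": "0.3.1",
    "duration_ms": 4200,
})
```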
Tight coupling with the VS Code API
Apparently this is generally okay for now, because most clients interact with the backend through VS Code anyway.
Codebase Layout¶
From a very high-level look at the codebase, we only need to care about 2 folders: `/server` and `/tasks`, located in the root.
`/server` Layout¶
`/server` contains the following:

- `/routes` - Contains the endpoint methods, as well as the websocket.
- `/actions` - Service layer; methods that make up the core business logic of the application.
- `/models` - DTO layer. Contains model definitions that standardize data being passed around the codebase.
- `/utils` - Utilities, mainly for config settings and connections to `celery`, `redis`, `mixpanel`.
- Need to understand `middlewares.py` better...
Key areas to note about the `/server` module¶
- `analysis.py`

    Contains endpoints for clients to send files and get results.

    Dependency Injection for FastAPI

    From what I've seen, it's most commonly used in the parameters of API endpoint methods, as a validation step. E.g. a valid `redis.StrictRedis` object is needed as a parameter to the endpoint function. We use `redis: StrictRedis = Depends(get_redis_connection)` so that the validation function is called and injects a valid or `None` object right before the function runs at runtime (see the sketch after this list).

- `repository.py`

    Kind of similar to `analysis.py`, but this time it's to submit an entire repo.

- `session.py`

    API security is handled by `fastapi.security`. This handles `GET`, `POST`, and `DELETE` at the `/session` endpoint for user sessions. `fetch_session_models` is used at the `create` endpoint; it takes in valid user credentials (a session token) and a valid Redis connection. The "validity" of these two arguments is checked at the dependency-injection level, before the endpoint method is even called. `SessionData` is returned, containing the JSON-ized output of the Redis pipeline for that user (it will be `None` or something similar, depending on the token-existence check via `redis.StrictRedis.pipeline().hget()`).
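Here is a minimal sketch of that dependency-injection pattern. `get_redis_connection` is named in the codebase, but its body below, the endpoint path, and the Redis key layout are my assumptions for illustration.

```python
from fastapi import APIRouter, Depends, HTTPException
from redis import StrictRedis
from redis.exceptions import RedisError

router = APIRouter()

def get_redis_connection() -> StrictRedis:
    # FastAPI calls this dependency first and injects its return value
    # into the endpoint's `redis` parameter.
    conn = StrictRedis(host="localhost", port=6379, decode_responses=True)
    try:
        conn.ping()  # raises if the server is unreachable
    except RedisError:
        raise HTTPException(status_code=503, detail="Redis unavailable")
    return conn

@router.get("/session/{token}")
def read_session(token: str, redis: StrictRedis = Depends(get_redis_connection)):
    # By the time this body runs, `redis` is a validated connection.
    return redis.hgetall(f"session:{token}") or None
```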
FastAPI setup (to be updated)¶
The main file, `app.py`, is the application factory that sets up the whole FastAPI server.
Quick recap on what an application factory is
The application factory is a design pattern where we configure the main instance of the application with our own configuration. Many web frameworks like Flask and FastAPI use it, because users ultimately need configurability for their application instance.
Within the `create_app` function we can also ensure that settings are initialized globally throughout the application. This means we don't have to read the environment via `os` each time; the sketch below shows one way to do that.
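A minimal sketch of that idea, assuming a hypothetical `Settings` model and `get_settings` helper (not the codebase's actual names):

```python
import os
from functools import lru_cache
from pydantic import BaseModel

class Settings(BaseModel):
    redis_url: str
    log_level: str

@lru_cache  # evaluated once; every later call reuses the same instance
def get_settings() -> Settings:
    # Read the environment exactly once, at first use.
    return Settings(
        redis_url=os.getenv("REDIS_URL", "redis://localhost:6379/0"),
        log_level=os.getenv("LOG_LEVEL", "INFO"),
    )
```

Anywhere in the codebase, `get_settings()` then returns the same cached instance instead of hitting `os.environ` again.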
Overall, `app.py` loads configs and logging, creates the app with lifespan logic(?), adds routes and middleware, and adds metric collection plus the Redis and Celery integrations. Then, on app shutdown, it produces a PDF.
What is lifespan?
These are functions that execute partway, then yield until a future event occurs; in FastAPI's case, the code before the `yield` runs at startup and the code after it runs at shutdown.
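As an illustration only (the startup/shutdown bodies here are placeholders, not what `app.py` actually does):

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: runs once before the app begins serving requests.
    app.state.ready = True
    yield  # suspended here for the app's whole lifetime
    # Shutdown: runs when the server stops (e.g. produce the PDF report).
    app.state.ready = False

def create_app() -> FastAPI:
    # Application factory: build and configure the instance in one place.
    app = FastAPI(lifespan=lifespan)
    # register routers, middleware, metrics, redis/celery integrations here
    return app

app = create_app()
```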
Intended User Flow¶
This codebase is originally meant for the user to interact with through what seems to be the VS Code extension. I think there's some VS Code script layer that acts as an SDK for their UI and calls endpoints in our service.
Understanding Celery¶
According to the Celery documentation, it's a distributed system that processes vast amounts of messages and provides the operations needed to maintain such a system.
What's a task queue?¶
A task queue is a mechanism to distribute individual units of work across threads or machines.
You need a dedicated worker process that constantly monitors task queues for new work to perform.
`Celery` uses a broker to mediate between clients and workers. A client adds a message to the queue, then the broker delivers the message to be picked up by a monitoring worker. `Celery` can have multiple workers and brokers, i.e. we can have high availability and horizontal scaling. It can run on a single machine, on multiple machines, or even across data centers.
`Celery` needs a message transport to send and receive messages. `RabbitMQ` and `Redis` (and `Amazon SQS`) are broker transports that are feature-complete. This means we can almost plug and play with complete message-broker features using any of them, without needing much code on top of what they already provide.
A Quick Example¶
A client calls my endpoint, and that endpoint method calls a method that is "heavy". `Celery` will add the function call to the `Redis` queue. Under the hood, `Celery` serializes the message packet before pushing it to the queue; side note: the task name and arguments are serialized as `JSON` by `Celery` under the hood.

The worker method, which is decorated with the `@celery_app.task` decorator and was called earlier, is listening in a separate process. The result is then stored in some DB of your choice.

The workers match the task name in the serialized `JSON` to a registered `@celery_app.task` function, deserialize the arguments, and execute the Python function in another process.
Conclusion¶
I think the way I tackled understanding a codebase (this one in particular) was by looking at the routes first, because all code cascades from there. So I made a mental map and grouping of the different endpoints. Prior familiarity with common idiomatic practices, like having a utils folder, DTOs, and a service layer, helps.
On top of that, it's mostly reading up on the documentation of the other high-level libraries that facilitate the application. In this case, that's Celery and Redis.
Then, after getting that kind of broad overview of a library, I start to read deeper into how it is actually implemented in the code. E.g. `app.py` for FastAPI probably follows a standard format, but with our own sprinklings on top for our custom purposes. This is the stage where we might start to see some weird issues that weren't previously resolved. Also at this stage, we might look at some more nitty-gritty parts, like why we chose a certain design pattern (the factory pattern). Be prepared to also learn language-level details, because there will be some new kind of Python or style of writing that is just a little bit hard to grasp. Do let yourself spend time reading articles and learning about those.
!!! note "partial()
, yield
, *args, **kwargs
These were just some of the things that I had to break my flow to go and read up more on. Using LLMs to aid can help, but I've found that the time spent reading articles somehow makes more sense to me. Probably a combination of the time spent learning about something and understanding the purpose of those stuff helps in a deeper understanding and consequently memory retention.
Next steps¶
I will start trying to implement my feature, with the expectation of getting stuck diving deeper into something at some point!
TO FURTHER STUDY¶
- mixpanel middleware