CodeLens

About:

CodeLens is an interactive tool for visualizing the code representations commonly used by machine learning models for code (e.g., CodeBERT, GraphCodeBERT, Codex). Its primary goal is to help software engineers understand and explore these representations in a visual environment.
Currently, CodeLens supports three programming languages (Java, Python, and JavaScript) and four types of code representation: token sequence, abstract syntax tree (AST), data flow graph (DFG), and control flow graph (CFG).
CodeLens is a collaborative project between two research centers in Luxembourg: LIST – Luxembourg Institute of Science and Technology, and SnT – Interdisciplinary Center for Security, Reliability, and Trust.

Information for each representation:

Token: Source code is treated as plain text and converted into a linear sequence of tokens by a tokenizer. In the simplest case, each line of code is split into pieces at whitespace (tabs, spaces, newlines). Each piece is then represented by an integer: its ID in a so-called vocabulary. Depending on the tokenizer, a piece can be a word, a subword, or a character.
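A minimal sketch of the whitespace case, assuming a vocabulary built on the fly from the snippet itself. The helper names (whitespace_tokenize, build_vocab) are hypothetical and not part of CodeLens; tokenizers used by real models typically rely on a learned subword vocabulary instead.

```python
def whitespace_tokenize(code: str) -> list[str]:
    # str.split() with no arguments splits on any run of tabs, spaces, or newlines.
    return code.split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Assumption: IDs are assigned in order of first appearance.
    vocab: dict[str, int] = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


code = "x = x + 1\ny = x"
tokens = whitespace_tokenize(code)
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['x', '=', 'x', '+', '1', 'y', '=', 'x']
print(ids)     # [0, 1, 0, 2, 3, 4, 1, 0]
```

Repeated pieces such as x and = map to the same ID, which is what makes the integer sequence usable as model input.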
AST: An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of source code. Each node in the tree represents a construct occurring in the source code. When source code is converted to an AST, only structural information is preserved, such as variable types, the order and definition of executable statements, and identifiers; surface details such as whitespace and grouping parentheses are dropped.
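As an illustration for Python input only, the standard-library ast module can show which information an AST keeps. This is a sketch of the idea, not a description of the parser CodeLens uses, which the text above does not specify.

```python
import ast

source = "total = price * 2"
tree = ast.parse(source)

# Each node's class name is the construct it represents (Assign, BinOp, Name, ...).
node_types = [type(node).__name__ for node in ast.walk(tree)]

# Identifiers survive the conversion as Name nodes.
identifiers = sorted(n.id for n in ast.walk(tree) if isinstance(n, ast.Name))

# Extra spacing and redundant parentheses do not: both spellings give the same tree.
same_tree = ast.dump(tree) == ast.dump(ast.parse("total   =   (price * 2)"))

print(node_types)
print(identifiers)  # ['price', 'total']
print(same_tree)    # True
```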
DFG: A data flow graph (DFG) is, as the name suggests, a data-oriented graph representation that shows how data flows through source code. In a DFG, each node represents a variable or an expression, and each edge represents the flow of data between two nodes.
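A rough sketch of data flow edges for straight-line Python assignments, using the standard-library ast module. It is a deliberate simplification: nodes are variable names rather than individual occurrences, and branches, loops, and function calls are ignored. The function name data_flow_edges is hypothetical.

```python
import ast


def data_flow_edges(source: str) -> list[tuple[str, str]]:
    """Return (read_variable, written_variable) pairs for simple assignments."""
    edges = []
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue  # only plain top-level assignments are handled in this sketch
        # Variables read on the right-hand side; sorted for a stable order.
        reads = sorted({n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)})
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                edges.extend((read, target.id) for read in reads)
    return edges


code = "a = 1\nb = a + 2\nc = a * b"
edges = data_flow_edges(code)
print(edges)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Each pair reads as "the value of the first variable flows into the second": a feeds b, and both a and b feed c.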
CFG: A control flow graph (CFG) is, like a DFG, a graph-based representation. Unlike a DFG, a CFG is process-oriented: it represents all paths that might be traversed during execution. In a CFG, nodes represent basic blocks of statements and conditions, and edges describe the transfer of control and the subsequent access to or modification of the same variable. Note that a CFG includes two designated blocks: an entry block, where control enters the flow, and an exit block, where control leaves it.
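To make the entry block, exit block, and "all paths" idea concrete, here is a hand-written CFG for a small if/else snippet, stored as an adjacency dictionary. The graph is written by hand for illustration, not derived automatically from code, and the block labels and the all_paths helper are hypothetical.

```python
# Hand-built CFG for:  if x > 0: y = 1  else: y = -1;  return y
cfg = {
    "entry": ["cond: x > 0"],
    "cond: x > 0": ["then: y = 1", "else: y = -1"],
    "then: y = 1": ["return y"],
    "else: y = -1": ["return y"],
    "return y": ["exit"],
    "exit": [],
}


def all_paths(graph, node="entry", goal="exit", prefix=()):
    """Enumerate every path from node to goal by depth-first search."""
    prefix = prefix + (node,)
    if node == goal:
        return [prefix]
    paths = []
    for successor in graph[node]:
        # Skip blocks already on the current path so a loop cannot recurse forever.
        if successor not in prefix:
            paths.extend(all_paths(graph, successor, goal, prefix))
    return paths


paths = all_paths(cfg)
for path in paths:
    print(" -> ".join(path))
```

The snippet has one condition, so the search finds two paths, one through each branch; both start at the entry block and end at the exit block.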