Exploring Programming Languages on Wikipedia

by David Qin

[!!!!BREAKING NEWS!!!!]

Art Charlesworth, master of programming languages.

Programming languages have come a long way since Ada Lovelace wrote the first documented instance of a computer program in the mid-1800s. The earliest programming language was simply machine code. The next iteration, assembly language, sought to make code human-readable, but it was still relatively difficult to learn and write. Finally, in the 1950s, the first high-level languages were developed; this category includes the languages that programmers are most familiar with today. The intent of this project was to investigate relationships between various programming languages by applying spectral clustering to their Wikipedia pages, with clustering based on the cosine similarity between documents. Thirty-five languages, ranging from object-oriented to scripting to visual, were selected in the hope of seeing relationships between languages that are similar in syntax, style, and paradigms. The methods used and an analysis of the results are presented below.

The Wikipedia page for “List of Programming Languages” lists over 700 languages, but many of them are lesser known and only a small percentage are widely used. To filter for relatively popular languages, the final thirty-five languages were chosen based on the TIOBE Index for December 2018. This index is calculated according to a comprehensive set of criteria, which can be found on the TIOBE website. Ultimately, the index ranks languages by their rating, which is calculated from their number of hits on the most popular search engines. The full list of languages (by cluster) can be found at this link.

This list was used as input for spectral clustering of the corresponding Wikipedia pages, handled by a class and functions in Dr. Arnold’s wikitext module. First, a wikitext.WikiCorpus object was created to 1) download the data of the Wikipedia pages (using the Wikipedia API in its implementation), 2) generate bag-of-words representations of the pages’ paragraphs, 3) build the similarity matrix of the bag-of-words representations (using cosine similarity to build a graph representation of the documents), and 4) perform spectral clustering on the resulting similarity matrix. Separating the programming languages’ pages into nine clusters was found to depict the most meaningful relationships. Next, functions from wikitext were used to procedurally generate a set of webpages. The webpages included 1) a "Docs" page of the Wikipedia data gathered for each article, with links to a corresponding webpage of revision history information, 2) a "Clusters" page separating the programming languages’ articles into clusters, and 3) a "Viz" page with a scatterplot relating an article’s number of language pages to its number of internal links. With these pages, the performance of the clustering algorithm could be analyzed against the intended criteria of syntax, purpose, and paradigms.
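The bag-of-words and cosine-similarity steps described above can be sketched as follows. This is a minimal, standard-library illustration and not the wikitext module's actual code: the function names (bag_of_words, cosine_similarity, similarity_matrix) and the toy documents are assumptions made for the example, and the spectral clustering step itself is not shown.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    # '+' and '#' are kept so that names such as C++ and C# stay intact.
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Return document names and the pairwise cosine similarity matrix."""
    names = list(docs)
    bags = {name: bag_of_words(docs[name]) for name in names}
    matrix = [[cosine_similarity(bags[r], bags[c]) for c in names] for r in names]
    return names, matrix


# Toy stand-ins for article text (invented for illustration, not Wikipedia data).
toy_docs = {
    "R": "statistical computing and graphics for data analysis",
    "SAS": "statistical software for data analysis and analytics",
    "Dart": "client language for web and mobile applications",
}
names, matrix = similarity_matrix(toy_docs)
```

In this toy corpus the R and SAS entries share more words with each other than either shares with Dart, so their similarity score is higher. A spectral clustering step would take a matrix like this one as the weighted adjacency matrix of a graph over the documents.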

Taylor Arnold, master of Python.

The results of the clustering algorithm depicted some interesting relationships. Most clusters were very natural and made sense, such as the “BASIC and Its Derivations” cluster, the “Web Application Languages” cluster, and the “Data Analysis” cluster. The “Web Application Languages” cluster contained JavaScript and Dart, both of which are commonly used in web, server, and mobile applications. On the other hand, some clusters were composed of languages whose relationships were hard to identify, namely the “???” cluster. This cluster contained C#, SQL, Object Pascal, and Ada. These languages seem to have only the object-oriented paradigm and typing somewhat in common; SQL is not object-oriented, but some implementations use strong typing. However, strong typing is used in a multitude of languages, so this cluster is hard to justify without looking specifically at the words that caused these languages to cluster. Another interesting phenomenon is that the Java-based languages separated from the C-based languages, even though Java is derived from C as well. A possible reason is that a major feature of Java is the Java Virtual Machine (JVM), which distinguishes it from C, a language that compiles differently for different machines. Thus, languages that compile to the JVM were more likely to cite other features of Java in their Wikipedia articles. The clustering algorithm also naturally grouped Assembly Language, Fortran, and COBOL together in "Old Stuff". This grouping likely arose because, when these languages were in their prime, the high-level languages that are popular today were not yet robustly implemented. Thus, these Wikipedia articles referenced each other more frequently than any of the other downloaded articles did. Additionally, the "C and Its Derivatives" cluster was the largest, which made sense because so many languages have been influenced by or are based on C.

It was surprising, however, to see that the clustering algorithm seemed to pick up on some similarities in syntax. Because of how the wikitext module works, only paragraph sections of a Wikipedia page are parsed, and many of the coding examples on the pages were within or under non-paragraph tags. Even so, enough descriptions of syntax, or comparisons to languages with similar syntax, must have appeared throughout the paragraphs of the Wikipedia articles.
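To illustrate why code examples would be left out when only paragraphs are parsed, here is a small sketch using Python's standard html.parser module. It is not how the wikitext module is implemented; the ParagraphExtractor class and the sample HTML are hypothetical, written only to show text inside paragraph tags being kept while a code block is skipped.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect only the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
            self._current = []

    def handle_data(self, data):
        # Text outside <p> (for example inside <pre>) is ignored.
        if self.in_paragraph:
            self._current.append(data)


# Hypothetical page fragment: two paragraphs around a code example.
sample_html = (
    "<p>Dart has a C-style syntax.</p>"
    "<pre>void main() { print('Hello'); }</pre>"
    "<p>It compiles to JavaScript.</p>"
)
extractor = ParagraphExtractor()
extractor.feed(sample_html)
extractor.close()
paragraphs = extractor.paragraphs
```

Only the two prose sentences end up in paragraphs; the code inside the pre element never reaches the bag-of-words step, so any syntax signal has to come from prose descriptions like the first sentence.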

Programming languages have what are called “paradigms,” which are categorizations of their features. Paradigms include concurrent programming, generic programming, functional programming, object-oriented programming, and visual programming. Most languages are multi-paradigm. Overall, the clustering algorithm seemed to do a great job of capturing the paradigm similarities between languages. In the context of the clustering algorithm, this meant that the similarity matrix did a good job of picking up on similarities between languages. At a slightly deeper level, using term frequency-inverse document frequency (TF-IDF) to measure word importance seemed to be a good basis for measuring document similarity. For example, "SAS" was a "top word" on the R_(programming_language) page, as was "analytics" on both pages. On the other hand, grouping by paradigm would imply that the two visual programming languages, Scratch and LabVIEW, should have been clustered together as well. A possible explanation for their separation is that Scratch is more of a toy language used for educational purposes, whereas LabVIEW is actually used for real-world projects.
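As a rough illustration of how TF-IDF surfaces "top words," the sketch below weights each word by its frequency in a document times the logarithm of how rare it is across documents. The exact TF-IDF variant used by the wikitext module is not stated here, so this formula, the function names (tokenize, tf_idf, top_words), and the toy documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def tf_idf(docs):
    """Return {doc_name: {word: weight}} with weight = tf * log(N / df)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for bag in counts.values():
        doc_freq.update(bag.keys())  # each document counts once per word
    weights = {}
    for name, bag in counts.items():
        total = sum(bag.values())
        weights[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        }
    return weights


def top_words(weights, name, k=3):
    """The k highest-weighted words of one document (ties broken alphabetically)."""
    ranked = sorted(weights[name].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Toy stand-ins for article text (invented for illustration, not Wikipedia data).
toy_docs = {
    "R": "r is a language for statistics and analytics often compared with sas",
    "SAS": "sas is a suite for analytics and statistics",
    "Dart": "dart is a language for web and mobile applications",
    "Ada": "ada is a language for embedded and safety critical systems",
}
weights = tf_idf(toy_docs)
```

Words that appear in every toy document, such as "is" or "and", receive a weight of zero, while a word like "sas" that appears in only two documents is weighted more heavily than "language", which appears in three. That is the sense in which a shared but uncommon term can pull two articles together.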

Jory Denny, master of C++.

The most interesting part of this project was observing the clusters that were returned. Some interesting conclusions could perhaps be drawn from the revision histories, but the revision history metadata did not seem to provide significant insight into the popularity of a language. A language's popularity could be reflected in its article in various ways. For already-popular languages, an increase in popularity could correspond to increased scrutiny and editing of a page, or it could require additional sections to cover the language's new features. A less popular language, however, would likely reflect an increase in popularity through increased character counts and section counts in its article. Since the languages chosen for this project were already popular, the trends in the articles' metadata were ambiguous. For instance, Java, which was first in the TIOBE Index, had volatile character and section counts over the last five years, even though changes to the language were released multiple times over that period.

This study of programming languages was very interesting and is something I hope to look into more in the future. There are many natural extensions to this investigation. It might make more sense, for instance, to perform cluster analysis on coding examples alone to determine which languages are syntactically similar, and then to analyze the common or major use cases of the languages. This information could help a developer choose the best language for a new project. For instance, if a project requires an object-oriented language and the developer is most comfortable with a language syntactically similar to C, then the analysis might suggest using C++. An online source of data for this analysis could be an encyclopedia dedicated to programming languages, such as progopedia.com.