Categories
Tips

Git Internals Part 3: Understanding the staging area in Git

Software development is a messy and intensive process, which in theory, should be a linear, cumulative construction of functionalities and improvements in code, but is rather more complex. More often than not it is a series of intertwined, non-linear threads of complex code, partly finished features, old legacy methods, collections of TODO comments, and other things common to any human-driven and a largely hand-crafted process known to mankind.

Git was built to make our lives easier when dealing with this messy and complex approach to software development. Git made it possible to work effortlessly on many features at once and decide what you want to stage and commit to the repository. The staging area in Git is the main working area, but most of the developers know only a little about it.

In this article, we will be discussing the staging area in Git and how it is a fundamental part of version control and can be used effectively to make version control easier and uncomplicated.

What is Staging area?

To understand what is staging area is, let’s take a real-world example – suppose that you are moving to another place, and you have to pack your stuff into boxes and you wouldn’t want to mix the items meant for the bathroom, kitchen, bedroom, and the living room in the same box. So, you will take a box and start putting stuff into it, and if doesn’t make sense, you can also remove it before finally packing the box and labeling it.

Here, in this example, the box serves as the staging area, where you are doing the work (crafting your commit), whereas when you are done, then you are packing it and labeling it (committing the code).

In technical terms, the staging area is the middle ground between what you have done to your files (also known as the working directory) and what you had last committed (the HEAD commit). As the name implies, the staging area gives you space to prepare (stage) the changes that will be reflected on the next commit. This surely adds up some complexity to the process, but it also adds more flexibility to selectively prepare the commits as they can be modified several times in the staging area before committing.

Assume you’re working on two files, but only one is ready to commit. You don’t want to be forced to commit both files, but only the one that is ready. This is where Git’s staging area comes in handy. We place files in a staging area before committing what has been staged. Even the deletion of a file must be recorded in Git’s history, therefore deleted files must be staged before being committed.

What are git commands for the staging area?

git add

The command used to stage any change in Git is git add. The git add command adds a modification to the staging area from the working directory. It informs Git that you wish to include changes to a specific file in the next commit. However, git add has little effect on the repository—changes are not truly recorded until you execute git commit.

The common options available along with this command are as follows:

You can specify a <file> from which all changes will be staged. The syntax would be as follows:

git add <file>

Similarly, you can specify a <directory> for the next commit:

git add <directory>

You can also use a . to add all the changes from the present directory, such as the following:

git add .

git status

git status command is used to check the status of the files (untracked, modified, or deleted) in the present branch. It can be simply used as follows:

git status

git reset

In case, you have accidentally staged a file or directory and want to undo it or unstage it, then you can use git reset command. It can be used as follows:

git reset HEAD example.html

git rm

If you remove files, they will appear as deleted in git status, and you must use git add to stage them. Another option is to use the git rm command, which deletes and stages files in a single command:

To remove a file (and stage it)

git rm example.html

To remove a folder (and stage it)

git rm -r myfolder 

git commit

The git commit command saves a snapshot of the current staged changes in the project. Committed snapshots are “secure” versions of a project that Git will never alter unless you specifically ask it to.

Git may be considered a timeline management utility at a high level. Commits are the fundamental building blocks of a Git project timeline. Commits may be thought of as snapshots or milestones along a Git project’s history. Commits are produced with the git commit command to record the current status of a project.

Git Snapshots are never committed to the remote repository. As the staging area serves as a wall between the working directory and the project history, each developer’s local repository serves as a wall between their contributions and the central repository.

The most common syntax followed to create a commit in git is as follows:

git commit -m "commit message"

The above commands and their functionalities can be summed up simply in the following image:

git commit -m commit message

Conclusion

To summarize, git add is the first command in a series of commands that instructs Git to “store” a snapshot of the current project state into the commit history. When used alone, git add moves pending changes from the working directory to the staging area. The git status command examines the repository’s current state and can be used to confirm a git add promotion. To undo a git add, use the git reset command. The git commit command is then used to add a snapshot of the staging directory to the commit history of the repository.

This is all for this article, we will discuss more Git Internals in the next article. Do let me know if you have any feedback or suggestions for this series. 

If you want to read what we discussed in the earlier instalments of the series, you can find them below.

Git Internals Part 1- List of basic Concepts That Power your .git Directory here

Git Internals Part 2: How does Git store your data? here

Keep reading!

Categories
Tips

Git Internals Part 2: How does Git store your data?

In this article, we’ll be learning about the basics of the data storage mechanism for git. 

The most fundamental term we know regarding git and data storage is repositories. Let’s first understand what a git repository is and where it stands in terms of data storage in git.

Are you ready to influence the tech landscape? Take part in the Developer Nation Survey and be a catalyst for change. Your thoughts matter, and you could be the lucky recipient of our weekly swag and prizes! Start Here

Repositories

A git repository can be seen as a database containing all the information needed to retain and manage the revisions and history of a project. In git, repositories are used to retain a complete copy of the entire project throughout its lifetime. 

Git maintains a set of configuration values within each repository such as the repository user’s name and email address. Unlike the file data or other repository metadata, configuration settings are not propagated from one repository to another during a clone, or fork, or any other duplication operation. Instead of this, git manages and stores configuration settings on a per-site, per-user, and per-repository basis.

Inside a git repository, there are two data structures – the object store and the index. All of this repository data is stored at the root of your working directory inside a hidden folder named .git. You can read more about what’s inside your .git folder here.

As part of the system that allows a fully distributed VCS, the object store is intended to be effectively replicated during a cloning process. The index is temporary data that is private to a repository and may be produced or edited as needed.

Let’s discuss object storage and index in further depth in the next section.

Git Object Types

Object store lies at the heart of the git’s data storage mechanism. It contains your original data files, all the log messages, author information, and other information required to rebuild any version or branch of the project.

Git places the following 4 types of objects in its object store which form the foundation of git’s higher-level data structures:

  1. blobs
  2. trees
  3. commits
  4. tags

Let’s look a bit more about these object types:

Blobs

A blob represents each version of a file. “Blob” is an abbreviation for “binary big object,” a phrase used in computers to refer to a variable or file that may contain any data and whose underlying structure is disregarded by the application.

A blob is considered opaque it contains the data of a file but no metadata or even the file’s name.

Trees

A tree object represents a single level of directory data. It saves blob IDs, pathnames, and some metadata for all files in a directory. It may also recursively reference other (sub)tree objects, allowing it to construct a whole hierarchy of files and subdirectories.

Commits

Each change made into the repository is represented by a commit object, which contains metadata such as the author, commit date, and log message. 

Each commit links to a tree object that records the state of the repository at the moment the commit was executed in a single full snapshot. The initial commit, also known as the root commit, has no parents and the following most of the commits have single parents.

A Directed Acyclic Graph is used to arrange commits. For those who missed it in Data Structures, it simply implies that commits “flow” in one way. This is usually just the trail of history for your repository, which might be very basic or rather complicated if you have branches.

Tags

A tag object gives a given object, generally a commit, an arbitrary but presumably human-readable name such as Ver-1.0-Alpha.

All of the information in the object store evolves and changes over time, monitoring and modeling your project’s updates, additions, and deletions. Git compresses and saves items in pack files, which are also stored in the object store, to make better use of disc space and network traffic.

Index

The index is a transient and dynamic binary file that describes the whole repository’s directory structure. More specifically, the index captures a version of the general structure of the project at some point in time. The state of the project might be represented by a commit and a tree at any point in its history, or it could be a future state toward which you are actively building.

One of the primary characteristics of Git is the ability to change the contents of the index in logical, well-defined phases. The indicator distinguishes between gradual development stages and committal of such improvements.

How does git monitor object history?

The Git object store is organized and implemented as a storage system with content addresses. Specifically, each item in the object store has a unique name that is generated by applying SHA1 to the object’s contents, returning a SHA1 hash value.

Because the whole contents of an object contribute to the hash value, and because the hash value is thought to be functionally unique to that specific content, the SHA1 hash is a suitable index or identifier for that item in the object database. Any little modification to a file causes the SHA1 hash to change, resulting in the new version of the file being indexed separately.

For monitoring history, Git keeps only the contents of the file, not the differences between separate files for each modification. The contents are then referenced by a 40-character SHA1 hash of the contents, which ensures that it is almost certainly unique.

The fact that the SHA1 hash algorithm always computes the same ID for identical material, regardless of where that content resides, is a significant feature. In other words, the same file content in multiple folders or even on separate machines produces the same SHA1 hash ID. As a result, a file’s SHA1 hash ID is a globally unique identifier.

Every object has an SHA, whether it’s a commit, tree, or blob, so get to know them. Fortunately, they are easily identified by the first seven characters, which are generally enough to identify the entire string.

One fantastic benefit of saving only the content is that if you have two or more copies of the same file in your repository, Git will only save one internally.

Conclusion

In this article, we learned about the two primary data structures used by git to enable data storage, management, and tracking history. We also discussed the 4 types of object types and the different roles played by them in git’s data storage mechanism. 

This was all for this article, I hope you find it helpful. These are the fundamental components of Git as we know it today and use on a regular basis. We’ll be learning more about these Git internal concepts in the upcoming articles.
Keep reading. In case you want to connect with me, follow the links below:

LinkedIn | GitHub | Twitter | Dev

Categories
Tips

Git Internals Part 1- List of basic Concepts That Power your .git Directory

Git is the most popular and commonly used open-source version control system in the modern-day. However, we barely focus on the basic concepts that are the building blocks of this system. 

In this article, we will learn about the basic concepts that power your .git directory.

The .git directory

Whenever we initialize a git repository, a .git directory gets created in the project’s root. This is the place where Git stores all its information. Digging a bit deeper you can see the directory structure as below:

$ ls -C .git
COMMIT_EDITMSG  MERGE_RR    config      hooks       info        objects     rr-cache
HEAD        ORIG_HEAD   description index       logs        refs

The detailed structure looks like the following:
.
|-- COMMIT_EDITMSG
|-- FETCH_HEAD
|-- HEAD
|-- ORIG_HEAD
|-- branches
|-- config
|-- description
|-- hooks
|   |-- applypatch-msg
|   |-- commit-msg
|   |-- post-commit
|   |-- post-receive
|   |-- post-update
|   |-- pre-applypatch
|   |-- pre-commit
|   |-- pre-rebase
|   |-- prepare-commit-msg
|   `-- update
|-- index
|-- info
|   `-- exclude
|-- logs
|   |-- HEAD
|   `-- refs
|-- objects
`-- refs
    |-- heads
    |-- remotes
    |-- stash
    `-- tags

Directories inside the .git directory

The .git directory consists of the following directories:

hooks:
This directory contains scripts that are executed at certain times when working with Git, such as after a commit or before a rebase.

info:
You can use this file to ignore files for this project, however, it’s not versioned like a .gitignore file would be.

logs:
Contains the history of different branches. It is most commonly used with the git reflog command.

objects:
Git’s internal warehouse of blobs, all indexed by SHAs. You can see them as following:

$ ls -C .git/objects
09  24  28  45  59  6a  77  80  8c  97  af  c4  e7  info
11  27  43  56  69  6b  78  84  91  9c  b5  e4  fa  pack

These directory names are the first two letters of the SHA1 hash of the objects stored in git.

You can enquire a little further as following:

$ ls -C .git/objects/09
6b74c56bfc6b40e754fc0725b8c70b2038b91e  9fb6f9d3a104feb32fcac22354c4d0e8a182c1

These 38 character strings are the names of the files that contain objects stored in git. They are compressed and encrypted, so it’s impossible to view their contents directly. 

rebase-apply: 

The workbench for git rebase. It contains all the information related to the changes that have to be rebased.

refs:

The master copy of all refs that live in your repository, be they for stashes, tags, remote-tracking branches, or local branches. 

You can see the existing refs in your .git directory as below:

$ ls .git/refs
heads
tags
$ ls .git/refs/heads
master
$ ls .git/refs/tags
v1
v1-beta
$ cat .git/refs/tags/v1
fa3c1411aa09441695a9e645d4371e8d749da1dc

Now, having discussed the directories inside the .git directory, let’s explore the files that reside inside the .git directory and their uses.

Files in the .git directory

  1. COMMIT_EDITMSG:

This file contains the commit message of a commit in progress or the last commit. Any commit message provided by the user (e.g., in an editor session) will be available in this file. 

If the git commit exits due to an error before generating a commit, it will be overwritten by the next invocation of git commit.

It’s there for your reference once you have made the commit and is not actually used by Git.

2. config:

This configuration file contains the settings for this repository. Project-specific configuration variables can be dumped in here including aliases. 

$ cat .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
[user]
    name = Pragati Verma
    email = pragati.verma@gmail.com

This file is mostly used to define where the remote repository lives and some core settings, such as if your repository is bare or not.

3. description:

This description will appear when you see your repository or the list of all versioned repositories available while using Git web interfaces like gitweb or instaweb.

4. FETCH_HEAD:

FETCH_HEAD is a temporary ref that keeps track of what has recently been fetched from a remote repository. 

In most circumstances, git fetch is used first, which fetches a branch from the remote; FETCH_HEAD points to the branch’s tip (it stores the SHA1 of the commit, just as branches do). After that, git merge is used to merge FETCH_HEAD into the current branch.

5. HEAD:

HEAD is a symbolic reference pointing to wherever you are in your commit history. It’s the current ref that you’re looking at. 

HEAD can point to a commit, however, typically it points to a branch reference. It is attached to that branch, and when you do certain things (e.g., commit or reset), the attached branch will move along with HEAD. In most cases, it’s probably refs/heads/master. You can check it as follows:

$ cat .git/HEAD
ref: refs/heads/master

6. ORIG_HEAD:

When doing a merge, this is the SHA of the branch you’re merging into.

7. MERGE_HEAD:

When doing a merge, this is the SHA of the branch you’re merging from.

8. MERGE_MODE:

Used to communicate constraints that were originally given to git merge to git commit when merge conflicts and a separate git commit is needed to conclude it.

9. MERGE_MSG:

Enumerates conflicts that happen during your current merge.

10. index:

Git index refers to the “staging area” between the files you have on your filesystem and your commit history with meta-data such as timestamps, file names, and also SHAs of the files that are already wrapped up by Git. 

The files in your working directory are hashed and stored as objects in the index when you execute git add, making them “staged changes.”

11. packed-refs:

It solves the storage and performance issues by keeping the refs in a single file. When a ref is missing from the /refs directory hierarchy, it is searched for in this file and used if it is found.

Conclusion

In this article, we covered a brief overview of the basic concepts that make up your git directory. These are the fundamental components of Git as we know it today and use on a regular basis. We’ll be learning more about these Git internal concepts in the upcoming articles.

Keep reading. In case you want to connect with me, follow the links below:

LinkedIn | GitHub | Twitter | Dev

Bio 

Pragati Verma is a software developer and open-source enthusiast. She has also been an active writer on various platforms and has written for many organizations as a freelance writer. As a Junior Editor at Hackernoon, Pragati helps numerous writers every day to publish their content on Hackernoon.

In her spare time, Pragati loves to read books or watch movies.