diff

In computing, diff is a file comparison utility that displays the differences between two files. For text files it outputs the line-by-line changes made to a file; modern implementations also support binary files. The utility's output is called a diff or, more commonly, a patch, since it can be applied with the patch program. The output of other file comparison utilities is often called a diff as well.

History

The diff utility was developed in the early 1970s for the Unix operating system at AT&T Bell Labs in Murray Hill, New Jersey. The final version, distributed with the 5th Edition of Unix in 1974, was written entirely by Douglas McIlroy.

Algorithm

diff works by solving the longest common subsequence (LCS) problem. For example, given two sequences of elements:

  abcdfghjqz

  abcdefgijkrxyz

the task is to find the longest sequence of elements that appears in both sequences in the same order. In other words, we need a new sequence that can be obtained from the first sequence by deleting some elements and from the second sequence by deleting other elements. In this case, that sequence is

  abcdfgjz

Once the longest common subsequence is found, only a small step remains to obtain diff-like output:

  e h i k q r x y
  + - + + - + + +
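The two steps above can be sketched in Python. This is a minimal illustration using a textbook dynamic-programming table, not the algorithm that any particular diff implementation uses; the function names lcs_table and diff_ops are hypothetical, and the inputs are assumed to be plain strings.

```python
def lcs_table(a, b):
    """table[i][j] holds the LCS length of the suffixes a[i:] and b[j:]."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff_ops(a, b):
    """Return the LCS of a and b plus a list of ('+' or '-', element) changes."""
    table = lcs_table(a, b)
    common, ops = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Element belongs to the common subsequence: keep it.
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Dropping a[i] loses nothing: report a deletion.
            ops.append(("-", a[i]))
            i += 1
        else:
            # Otherwise b[j] is an insertion.
            ops.append(("+", b[j]))
            j += 1
    # Whatever is left over is deleted from a or inserted from b.
    ops.extend(("-", x) for x in a[i:])
    ops.extend(("+", x) for x in b[j:])
    return "".join(common), ops


common, ops = diff_ops("abcdfghjqz", "abcdefgijkrxyz")
print(common)
print(" ".join(sign + element for sign, element in ops))
```

For the two sequences above this finds the common subsequence abcdfgjz and the same eight insertions and deletions; the relative order of neighboring insertions and deletions depends on how ties are broken.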

Use

diff is invoked from the command line with the names of two files as arguments: diff original new. The output of the command describes the changes that must be made to the file original to obtain the file new. If original and new are directories, diff is automatically applied to every file that exists in both directories. All examples in this article use the following two files, original and new:

original:

   1 This part of the document
   2 remained unchanged
   3 from version to version.  If a
   4 there are no changes in it
   5 should not be displayed.
   6 Otherwise it does not contribute
   7 finding the optimal
   8 volume produced
   9 changes.
  10
  11 This paragraph contains
  12 obsolete text.
  13 It will be removed
  14 in the near future.
  15
  16 In this document
  17 required to hold
  18 spell checker.
  19 Error on the other hand
  20 in the word - not the end of the world.
  21 Rest of paragraph
  22 does not require changes.
  23 New text can be
  24 add to the end of the document.

new:

   1 This is an important note!
   2 Therefore it should
   3 be located
   4 at the beginning of this
   5 document!
   6
   7 This part of the document
   8 remained unchanged
   9 from version to version.  If a
  10 there are no changes in it
  11 should not be displayed.
  12 Otherwise it does not contribute
  13 finding the optimal
  14 amount of information.
  15
  16 In this document
  17 need to hold
  18 spell checker.
  19 Error on the other hand
  20 in the word - not the end of the world.
  21 Rest of paragraph
  22 does not require changes.
  23 New text can be
  24 add to the end of the document.
  25
  26 This paragraph contains
  27 important additions
  28 for this document.

The command diff original new produces the following normal diff output:

  0a1,6
  > This is an important note!
  > Therefore it should
  > be located
  > at the beginning of this
  > document!
  >
  8,14c14
  < volume produced
  < changes.
  <
  < This paragraph contains
  < obsolete text.
  < It will be removed
  < in the near future.
  ---
  > amount of information.
  17c17
  < required to hold
  ---
  > need to hold
  24a25,28
  >
  > This paragraph contains
  > important additions
  > for this document.

In this traditional output format, a stands for added, d for deleted and c for changed. The line numbers of the original file appear before the letter a, d or c, and the line numbers of the new file appear after it. Each added, deleted or changed line is preceded by an angle bracket: < marks lines from the original file and > marks lines from the new file.

By default, lines common to both files are not shown. Lines that have moved are shown as added at their new location and deleted from their previous location. ^[1]
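The change commands of the normal format (such as 8,14c14) are easy to take apart with a regular expression. The following is a rough sketch; the function name parse_change_command and the shape of the returned dictionary are illustrative assumptions, not part of any diff tool.

```python
import re

# "<old range><letter><new range>", where a range is N or N,M.
CHANGE_RE = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")
ACTIONS = {"a": "add", "c": "change", "d": "delete"}


def parse_change_command(line):
    """Split a normal-format change command into its action and line ranges."""
    match = CHANGE_RE.match(line)
    if match is None:
        raise ValueError(f"not a change command: {line!r}")
    old_start, old_end, letter, new_start, new_end = match.groups()
    return {
        "action": ACTIONS[letter],
        # A single number means a one-line range (or a position, for a and d).
        "old": (int(old_start), int(old_end or old_start)),
        "new": (int(new_start), int(new_end or new_start)),
    }


for command in ("0a1,6", "8,14c14", "17c17", "24a25,28"):
    print(command, parse_change_command(command))
```

For 8,14c14 this reports a change that replaces lines 8 to 14 of the original file with line 14 of the new file.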

Variations

Most diff implementations have remained outwardly unchanged since 1975. The modifications include improvements to the core algorithm, new command-line options and new output formats. The basic algorithm is described in the papers An O(ND) Difference Algorithm and Its Variations by Eugene W. Myers ^[2] and A File Comparison Program by Webb Miller and Myers. ^[3] The algorithm was independently discovered and described in Algorithms for Approximate String Matching by E. Ukkonen. ^[4] The first versions of the diff program were designed to compare text files line by line, using the newline character as the line separator. In the 1980s, support for binary files led to changes in the design and implementation of the program.

Edit script

An edit script for the ed editor can be generated by modern versions of diff with the -e option. For our example the result looks like this:

  24a

  This paragraph contains
  important additions
  for this document.
  .
  17c
  need to hold
  .
  8,14c
  amount of information.
  .
  0a
  This is an important note!
  Therefore it should
  be located
  at the beginning of this
  document!

  .

To use the resulting script to transform the file original into the file new, we need to append two lines to the script: one with the w (write) command and one with the q (quit) command, for example with printf "w\nq\n" >> mydiff. Here mydiff is the name of the file that holds the script. The conversion happens when we run ed -s original < mydiff.
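To show what ed does with such a script, here is a small Python sketch that applies only the a, c and d commands to a list of lines. It is an illustration under simplifying assumptions, not a replacement for ed: the name apply_ed_script is hypothetical, and w, q and text lines consisting of a single period are not handled. Because the script lists changes from the end of the file towards the beginning, each command can be applied directly without adjusting later line numbers.

```python
import re

COMMAND_RE = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed_script(lines, script):
    """Apply the a, c and d commands of an ed script to a list of lines."""
    result = list(lines)
    commands = iter(script)
    for command in commands:
        match = COMMAND_RE.match(command)
        if match is None:
            raise ValueError(f"unsupported command: {command!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        letter = match.group(3)
        new_text = []
        if letter in "ac":
            # Input text follows the command and ends with a lone period.
            for text in commands:
                if text == ".":
                    break
                new_text.append(text)
        if letter == "a":
            result[start:start] = new_text       # insert after line `start`
        elif letter == "c":
            result[start - 1:end] = new_text     # replace lines start..end
        else:
            result[start - 1:end] = []           # delete lines start..end
    return result


script = ["4a", "e", ".", "2c", "B", ".", "0a", "z", "."]
print(apply_ed_script(["a", "b", "c", "d"], script))
```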

Context format

The context format (-c) and the ability to traverse a directory tree recursively (-r) first appeared in 2.8 BSD, released in July 1981.

In the context format, the changed lines are shown together with the unchanged lines before and after the modified fragment. Including any number of unchanged lines provides context for the patch. The context serves as a reference for locating the fragment in the target file, even if the line numbers of the lines to be modified differ between the source and target files. The context format is more readable for people and more reliable when a patch is applied, and its output is accepted as input by the patch program.

The number of unchanged lines shown before and after the modified fragment can be set by the user and can even be zero; the default is usually three lines. If the context of one fragment overlaps with that of a neighboring fragment, diff avoids duplicating the unchanged lines and merges the adjacent fragments into one.

The command diff -c original new produces the following output:

  *** /path/to/original	timestamp
  --- /path/to/new	timestamp
  ***************
  *** 1,3 ****
  --- 1,9 ----
  + This is an important note!
  + Therefore it should
  + be located
  + at the beginning of this
  + document!
  +
    This part of the document
    remained unchanged
    from version to version.  If a
  ***************
  *** 5,20 ****
    should not be displayed.
    Otherwise it does not contribute
    finding the optimal
  ! volume produced
  ! changes.
  !
  ! This paragraph contains
  ! obsolete text.
  ! It will be removed
  ! in the near future.

    In this document
  ! required to hold
    spell checker.
    Error on the other hand
    in the word - not the end of the world.
  --- 11,20 ----
    should not be displayed.
    Otherwise it does not contribute
    finding the optimal
  ! amount of information.

    In this document
  ! need to hold
    spell checker.
    Error on the other hand
    in the word - not the end of the world.
  ***************
  *** 22,24 ****
  --- 22,28 ----
    does not require changes.
    New text can be
    add to the end of the document.
  +
  + This paragraph contains
  + important additions
  + for this document.
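Output in the same context format can also be produced without the diff program, for example with the difflib module from the Python standard library. The sketch below uses two tiny made-up lists of lines rather than the files above, and difflib's matching may group changes differently from diff on larger inputs.

```python
import difflib

# Two small, made-up inputs; each line keeps its trailing newline.
original = ["one\n", "two\n", "three\n"]
new = ["one\n", "2\n", "three\n"]

# n is the number of context lines around each change (three by default).
context = list(
    difflib.context_diff(original, new, fromfile="original", tofile="new", n=3)
)
print("".join(context), end="")
```

Changed lines are marked with an exclamation mark in both halves of the fragment, and the ranges are written as start,end pairs, as in the output above.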

Unified format

The unified format (or unidiff) inherits the technical improvements of the context format but shows the difference between the old and the new text more compactly. The unified format is usually selected with the -u command-line option. This output is often used as input to the patch program. Many projects specifically ask for diffs to be submitted in the unified format, which has made it the most common format for exchange between software developers.

Unified context diffs were first developed by Wayne Davison in August 1990 (unidiff appeared in volume 14 of comp.sources.misc). Richard Stallman added support for the unified format to the GNU Project's diff utility one month later, and the feature debuted in GNU diff 1.15, released in January 1991.

A file in the unified format starts with the same two header lines as the context format, except that the line for the original file starts with "---" and the line for the new file starts with "+++". They are followed by one or more modified fragments that contain the line-by-line changes. Unchanged lines start with a space, added lines start with a plus sign, and deleted lines start with a minus sign.

A fragment begins with range information, immediately followed by the added lines, the deleted lines and any number of context lines. The range information is surrounded by double @ signs and fits on one line, unlike the two lines used in the context format. It has the following form:

  @@ -l,s +l,s @@ optional section heading

The range information consists of two parts. The part for the original file starts with a minus sign, and the part for the new file starts with a plus sign. Each part has the form l,s, where l is the starting line number and s is the number of lines that the fragment covers in the respective file (for the original file, the number of displayed lines beginning with a space or a minus; for the new file, the number of lines beginning with a space or a plus). In many versions of GNU diff, the comma and the trailing s can be omitted from a range, in which case s defaults to 1. Note that the only really useful value is the l of the first range; the remaining values can be calculated from the diff.

The s value for the original file must equal the total number of context and deleted (including changed) lines in the fragment. The s value for the new file must equal the total number of context and added (including changed) lines in the fragment.
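The relationship between the range information and the body of a fragment can be checked mechanically. The following sketch parses a range line and counts the lines of a fragment; the names parse_hunk_header and count_hunk_lines are hypothetical, and an empty line is treated as a context line because the leading space of blank context lines is often lost in transit.

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count) from a range line."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a range line: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,  # omitted s means 1
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


def count_hunk_lines(body):
    """Count the lines a fragment covers in the original and in the new file."""
    old = sum(1 for line in body if line[:1] in (" ", "-", ""))
    new = sum(1 for line in body if line[:1] in (" ", "+", ""))
    return old, new


header = "@@ -22,3 +22,7 @@"
body = [
    " does not require changes.",
    " New text can be",
    " add to the end of the document.",
    "+",
    "+This paragraph contains",
    "+important additions",
    "+for this document.",
]
old_start, old_count, new_start, new_count = parse_hunk_header(header)
print(count_hunk_lines(body) == (old_count, new_count))
```

Here the fragment has three context lines and four added lines, so the counts are 3 for the original file and 7 for the new file, matching the range line.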

The range information can be followed by the heading of the section or function that the fragment belongs to. This is mainly useful for readers of the fragment. When the diff is created with GNU diff, the heading is identified by matching a regular expression. ^[5]

If a line has been changed, it is shown both as deleted and as added. Since the deleted and added lines belong to the same fragment, they appear next to each other. ^[6] For example:

  -check this dokument.  On
  +check this document.  On

The command diff -u original new produces the following output:

  --- /path/to/original	timestamp
  +++ /path/to/new	timestamp
  @@ -1,3 +1,9 @@
  +This is an important note!
  +Therefore it should
  +be located
  +at the beginning of this
  +document!
  +
   This part of the document
   remained unchanged
   from version to version.  If a
  @@ -5,16 +11,10 @@
   should not be displayed.
   Otherwise it does not contribute
   finding the optimal
  -volume produced
  -changes.
  -
  -This paragraph contains
  -obsolete text.
  -It will be removed
  -in the near future.
  +amount of information.
   
   In this document
  -required to hold
  +need to hold
   spell checker.
   Error on the other hand
   in the word - not the end of the world.
  @@ -22,3 +22,7 @@
   does not require changes.
   New text can be
   add to the end of the document.
  +
  +This paragraph contains
  +important additions
  +for this document.

Note that a tab character is used to separate the file names from the timestamps. It is not visible on screen and may be lost when the diff is copied and pasted from a console.
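The tab separator can be made visible with a short Python sketch based on the standard difflib module, which also writes a tab between the file name and the date. The input lines and the dates below are made up for illustration.

```python
import difflib

# Two small, made-up inputs; each line keeps its trailing newline.
original = ["one\n", "two\n", "three\n"]
new = ["one\n", "2\n", "three\n"]

unified = list(
    difflib.unified_diff(
        original,
        new,
        fromfile="original",
        tofile="new",
        fromfiledate="2020-01-01",  # placeholder timestamps
        tofiledate="2020-01-02",
        n=3,
    )
)
# repr() shows the otherwise invisible tab as \t in the two header lines.
for line in unified[:2]:
    print(repr(line))
print("".join(unified[2:]), end="")
```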

There are several modifications and extensions of the diff formats that various programs use and understand. For example, some version control systems, such as Subversion, put a version number, "working copy", or another comment in the diff header in addition to the timestamp.

Some programs can create diffs for several different files and merge them into one, using a header for each modified file that might look something like this:

  Index: path/to/file.cpp

The special case of files that do not end with a newline is not handled. Neither the unidiff utility nor the POSIX diff standard defines how such files are processed (moreover, files of this type are not "text" files under the POSIX definition ^[7]).

The patch program does not necessarily recognize implementation-specific extensions to the output of the diff command.

Notes

[1] David MacKenzie, Paul Eggert, and Richard Stallman. Comparing and Merging Files with GNU Diff and Patch. 1997. ISBN 0-9541617-5-0.
[2] E. Myers. An O(ND) Difference Algorithm and Its Variations. Algorithmica, 1986, vol. 1, no. 2, pp. 251-266.
[3] Webb Miller and Eugene W. Myers. A File Comparison Program. Software - Practice and Experience, 1985, vol. 15, no. 11, pp. 1025-1040.
[4] E. Ukkonen. Algorithms for Approximate String Matching. Information and Control, 1985, vol. 64, pp. 100-118.
[5] 2.2.3 Showing Which Sections Differences Are in, GNU diffutils manual.
[6] Unified Diff Format by Guido van Rossum, June 14, 2006.
[7] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_205 Section 3.205

Links

The GNU diffutils package, which includes diff. Distributed under the GNU General Public License.
diffutils for Win32 - part of GnuWin32
Online interface to diff (in Russian)
C# diff algorithm - source code of the diff algorithm and its variants in C#
