2018-09-18

Installing Hadoop 3.1.1 and running it in Local Mode

Hadoop Local Mode

Today I'll do a practice run of installing Hadoop, using https://qiita.com/Esfahan/items/39fd1e2f8b755eacec65 , http://www.atmarkit.co.jp/ait/articles/0902/27/news129_2.html and http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html as references.

Install Hadoop

Preparation before installing Hadoop

Hadoop runs on Java, so install OpenJDK first.

$ sudo yum install java-1.8.0-openjdk

# Switch the Java version in use, following http://www.yunabe.jp/tips/linux_default_java_version.html
$ sudo update-alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java
   2           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

Enter to keep the current selection[+], or type selection number: 2

$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Set the environment variables Hadoop needs in order to start.

[ec2-user@ip-172-31-16-22 ~]$ vim ~/.bash_profile 

# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/.local/bin:$HOME/bin

export PATH
export PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/ec2-user/.local/bin:/home/ec2-user/bin:/usr/java/latest/bin
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-8.b13.39.39.amzn1.x86_64/jre
export HADOOP_HOME=/home/ec2-user/hadoop

The JAVA_HOME path may look like it came out of nowhere. JAVA_HOME has to point at the root directory of the installed Java, so after yum install I looked it up with the command below.

# Find where a package installed with yum put its files: http://d.hatena.ne.jp/muupan/20130311/1362939424
$ rpm -ql  java-1.8.0-openjdk
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-8.b13.39.39.amzn1.x86_64/jre/bin/policytool
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-8.b13.39.39.amzn1.x86_64/jre/lib/amd64/libawt_xawt.so
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-8.b13.39.39.amzn1.x86_64/jre/lib/amd64/libjawt.so
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-8.b13.39.39.amzn1.x86_64/jre/lib/amd64/libjsoundalsa.so
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-8.b13.39.39.amzn1.x86_64/jre/lib/amd64/libsplashscreen.so
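
As an alternative to reading the rpm -ql listing, the same directory can probably be derived from the java binary itself. This is a minimal sketch, assuming GNU readlink (for -f) is available and that the alternatives symlinks resolve to a path ending in /bin/java. The helper name java_home_from_bin is made up for illustration.

```shell
# java_home_from_bin is a hypothetical helper, not part of any package.
# Given the resolved path of the java binary, strip the trailing /bin/java
# to obtain the directory that JAVA_HOME should point at.
java_home_from_bin() {
  printf '%s\n' "${1%/bin/java}"
}

# On a real host, resolve the alternatives symlinks first (assumes GNU readlink):
#   export JAVA_HOME="$(java_home_from_bin "$(readlink -f "$(command -v java)")")"
java_home_from_bin /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-8.b13.39.39.amzn1.x86_64/jre/bin/java
```

For the path shown in this post, the last line prints the same value that was written into .bash_profile by hand.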

Apply the .bash_profile settings.

$ exec $SHELL -l

Installing Hadoop

$ wget http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
$ tar xvfz hadoop-3.1.1.tar.gz
$ ln -s hadoop-3.1.1 hadoop

$ cd hadoop
$ ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share

$ bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:
...

The documentation says that at this point Hadoop 3.1.1 can run in the following modes:

Local (Standalone) Mode
Pseudo-Distributed Mode
Fully-Distributed Mode

I'll try Local (Standalone) Mode.

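Checking that it works

Running in Local Mode

That said, Local Mode seems to be the only mode that runs straight after installation with no configuration changes.

So, following the official Getting Started guide, let's run the sample command in Local Mode.

# Copy the config files to use as test input (cp needs the input directory to exist first)
$ mkdir input
$ cp etc/hadoop/*.xml input
# Find every string in the files under input that matches the regex dfs[a-z.]+ and write the matches, with their counts, under the output directory
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep input output 'dfs[a-z.]+'
# Even for a command with output this tiny, the result still comes out as a file named like part-r-00000 (note to self)
$ ls output/
part-r-00000  _SUCCESS
# The grep result is written to part-r-00000
$ cat output/part-r-00000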
1  dfsadmin
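
To make it clearer what the example job computes, here is a rough local equivalent as a plain pipeline. It is only a sketch: count_matches is a made-up helper name, the -o and -h options are GNU/BSD grep extensions, and the output format is not byte-for-byte what Hadoop writes.

```shell
# count_matches is a hypothetical helper: print "count<TAB>match" for every
# string in the given files that matches the extended regex, most frequent first.
count_matches() {
  pattern=$1
  shift
  grep -ohE "$pattern" "$@" | sort | uniq -c | sort -rn | awk '{ print $1 "\t" $2 }'
}

# Roughly what the Hadoop example computed, run from the hadoop directory:
#   count_matches 'dfs[a-z.]+' input/*.xml
```

The MapReduce version does the same thing in two stages (count the matches, then sort by count), which is why even a one-line result arrives as a part-r-00000 file.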

After this I tweaked the configuration to try Pseudo-Distributed Mode, but it didn't work. Rebuilding the instance from scratch looks faster than debugging it, so that's it for today.

2018-09-17

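Notes on installing rbenv on Linux, installing Ruby with rbenv, and installing the mysql2 gem

rbenv gcc gem mysql2

Sometimes you want to throw queries at MySQL from a whole bunch of instances, right?

So today, as groundwork for writing Ruby code that connects to MySQL, I'll do a practice run of the environment setup in the title on EC2.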

TL;DR

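rbenv and its plugin ruby-build are cloned from GitHub, so install git first
exec $SHELL -l starts a new login shell and replaces the current process with it, so settings written to ~/.bash_profile (which is read at shell startup) take effect on the spot
Installing Ruby requires a C compiler; install gcc for that
- Compiling the C parts of native-extension gems also requires a C/C++ compiler.
- Xcode bundles compilers for the C family of languages, which is probably why updating Xcode lets gems install (the native extensions then compile)
Packages with a -devel suffix contain the objects and headers used for development, and are sometimes needed when compiling libraries written in C
Installing the mysql2 gem requires mysql-devel
- When a gem install fails in the native extension step, a long message appears and the eye is drawn to the list of options, but somewhere in the middle a single line says exactly what to do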

Installing rbenv

I'll follow https://qiita.com/inouet/items/478f4228dbbcd442bfe8 .

Install git

$ sudo yum -y install git

rbenv is downloaded from GitHub, and a Git client makes that easy, so install git.

Download rbenv into the home directory and add its executables to PATH

$ git clone https://github.com/sstephenson/rbenv.git ~/.rbenv

$ echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bash_profile

While we're at it, set things up so that rbenv is initialized whenever a shell starts.

$ echo 'eval "$(rbenv init -)"' >> ~/.bash_profile

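Settings written to ~/.bash_profile are only read the next time bash starts, so run

$ exec $SHELL -l

to apply what was just appended to ~/.bash_profile right away.

exec $SHELL -l starts a new bash process (bash being the value of $SHELL) as a login shell, which reads the updated ~/.bash_profile, and replaces the currently running bash process with it*1.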
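
The "replaces the current process" part can be observed directly. The following is a small illustrative sketch (same_pid_demo is a made-up name): a shell prints its PID, then uses exec to start another shell, which prints its own PID.

```shell
# The outer sh prints its PID, then exec replaces it with a new sh,
# which prints its own PID. Both lines show the same number, because
# exec swaps the program inside the existing process instead of forking.
same_pid_demo() {
  sh -c 'echo $$; exec sh -c "echo \$\$"'
}

same_pid_demo
```

Without exec, the inner sh would run as a child process with a different PID, and the old shell (with its old environment) would still be there when the child exits.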

Install ruby-build, an rbenv plugin

$ git clone https://github.com/sstephenson/ruby-build.git ~/.rbenv/plugins/ruby-build

rbenv manages which Ruby version is used system-wide or under a particular directory, but actually installing Ruby requires the ruby-build plugin.

So ruby-build, also downloaded from GitHub, goes under plugins/ in the rbenv install path.

Installing Ruby 2.5.1

The first attempt to install Ruby fails because there is no C compiler

Now let's try installing Ruby with rbenv.

$ rbenv install 2.5.1
Downloading ruby-2.5.1.tar.bz2...
-> https://cache.ruby-lang.org/pub/ruby/2.5/ruby-2.5.1.tar.bz2
Installing ruby-2.5.1...

BUILD FAILED (Amazon Linux AMI 2018.03 using ruby-build 20180822-8-g336584c)

Inspect or clean up the working tree at /tmp/ruby-build.20180916020903.3029
Results logged to /tmp/ruby-build.20180916020903.3029.log

Last 10 log lines:
tool/config.sub already exists
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking target system type... x86_64-pc-linux-gnu
checking for gcc... no
checking for cc... no
checking for cl.exe... no
configure: error: in `/tmp/ruby-build.20180916020903.3029/ruby-2.5.1':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details

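The install fails because no usable C compiler is available.

Install gcc as the C compiler

$ sudo yum install gcc

gcc is the name of both the C compiler and the executable used to compile*2.

Incidentally, you sometimes hear that installing or reinstalling Xcode made the mysql gem installable*3. Xcode is originally an IDE for the C family of languages, so my guess is that a C compiler came along with it, or was updated to a working state.

The second attempt to install Ruby fails because the openssl, readline, and zlib extensions are missing

Pulling myself together, I tried the install once more: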

[ec2-user@ip-172-31-19-189 ~]$ rbenv install 2.5.1
Downloading ruby-2.5.1.tar.bz2...
-> https://cache.ruby-lang.org/pub/ruby/2.5/ruby-2.5.1.tar.bz2
Installing ruby-2.5.1...

BUILD FAILED (Amazon Linux AMI 2018.03 using ruby-build 20180822-8-g336584c)

Inspect or clean up the working tree at /tmp/ruby-build.20180916021333.14684
Results logged to /tmp/ruby-build.20180916021333.14684.log

Last 10 log lines:
The Ruby openssl extension was not compiled.
The Ruby readline extension was not compiled.
The Ruby zlib extension was not compiled.
ERROR: Ruby install aborted due to missing extensions
Try running `yum install -y openssl-devel readline-devel zlib-devel` to fetch missing dependencies.

Configure options used:
  --prefix=/home/ec2-user/.rbenv/versions/2.5.1
  LDFLAGS=-L/home/ec2-user/.rbenv/versions/2.5.1/lib 
  CPPFLAGS=-I/home/ec2-user/.rbenv/versions/2.5.1/include

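The install fails with the message above.

Installing openssl-devel readline-devel zlib-devel

openssl, readline, and zlib are, respectively, a library for SSL communication, a library for command-line editing and history*4, and a library for compressing and decompressing data. The build is missing these extensions, and the message says to install the packages with the -devel suffix.

According to https://www.unknownengineer.net/entry/2017/04/07/162346 , a -devel package contains the library objects and header files (*.h and the like) needed for development. They are probably not needed just to use the library or command on its own, but they can be required when compiling a program that uses the library.

In other words, compiling Ruby needs the development headers and libraries for openssl, readline, and zlib.

So let's install them.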

$ sudo yum install -y openssl-devel readline-devel zlib-devel

Note that a -devel package depends on the package without -devel. Roughly speaking, installing openssl-devel gives you the contents of openssl plus the development objects and headers, so after sudo yum install -y openssl-devel the openssl command is available.

$ sudo yum install -y openssl-devel readline-devel zlib-devel

...


Installed:
  openssl-devel.x86_64 1:1.0.2k-12.110.amzn1                 readline-devel.x86_64 0:6.2-9.14.amzn1                 zlib-devel.x86_64 0:1.2.8-7.18.amzn1                

Dependency Installed:
  keyutils-libs-devel.x86_64 0:1.5.8-3.12.amzn1           krb5-devel.x86_64 0:1.15.1-19.43.amzn1                   libcom_err-devel.x86_64 0:1.42.12-4.40.amzn1          
  libkadm5.x86_64 0:1.15.1-19.43.amzn1                    libselinux-devel.x86_64 0:2.1.10-3.22.amzn1              libsepol-devel.x86_64 0:2.1.7-3.12.amzn1              
  libverto-devel.x86_64 0:0.2.5-4.9.amzn1                 ncurses-devel.x86_64 0:5.7-4.20090207.14.amzn1          

Dependency Updated:
  krb5-libs.x86_64 0:1.15.1-19.43.amzn1                                               openssl.x86_64 1:1.0.2k-12.110.amzn1                                              

Complete!
$ openssl
OpenSSL> exit

It looks like this AMI already had openssl installed, though!
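
Since the failures above all come down to missing headers, a quick pre-flight check before running rbenv install might save a round trip. This is only a sketch: missing_headers is a made-up helper, and the header paths under /usr/include are my assumption about where the three -devel packages put their files.

```shell
# missing_headers is a hypothetical helper. It prints each header, relative
# to the given include directory, that is NOT present there.
missing_headers() {
  include_dir=$1
  shift
  for header in "$@"; do
    [ -f "$include_dir/$header" ] || printf '%s\n' "$header"
  done
}

# Assumed header locations for openssl-devel, readline-devel and zlib-devel:
missing_headers /usr/include openssl/ssl.h readline/readline.h zlib.h
```

If nothing is printed, all three headers were found. Anything that is printed points at a -devel package that still needs installing.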

Third time's the charm: the Ruby install succeeds, so switch to using the installed version

$ rbenv install 2.5.1
Downloading ruby-2.5.1.tar.bz2...
-> https://cache.ruby-lang.org/pub/ruby/2.5/ruby-2.5.1.tar.bz2
Installing ruby-2.5.1...
Installed ruby-2.5.1 to /home/ec2-user/.rbenv/versions/2.5.1

$ ruby -v
ruby 2.0.0p648 (2015-12-16) [x86_64-linux]
$ rbenv global 2.5.1
$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]

Installing the mysql2 gem

Installing the mysql2 gem with no preparation fails because MySQL is not installed

$ gem install mysql2

Fetching: mysql2-0.5.2.gem (100%)
Building native extensions. This could take a while...
ERROR:  Error installing mysql2:
    ERROR: Failed to build gem native extension.

    current directory: /home/ec2-user/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/mysql2-0.5.2/ext/mysql2
/home/ec2-user/.rbenv/versions/2.5.1/bin/ruby -r ./siteconf20180916-10601-y92qux.rb extconf.rb
checking for rb_absint_size()... yes
checking for rb_absint_singlebit_p()... yes
checking for rb_wait_for_single_fd()... yes
checking for -lmysqlclient... no
-----
mysql client is missing. You may need to 'apt-get install libmysqlclient-dev' or 'yum install mysql-devel', and try again.
-----
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
    --with-opt-dir
    --without-opt-dir
    --with-opt-include
    --without-opt-include=${opt-dir}/include
    --with-opt-lib
    --without-opt-lib=${opt-dir}/lib
    --with-make-prog
    --without-make-prog
    --srcdir=.
    --curdir
    --ruby=/home/ec2-user/.rbenv/versions/2.5.1/bin/$(RUBY_BASE_NAME)
    --with-mysql-dir
    --without-mysql-dir
    --with-mysql-include
    --without-mysql-include=${mysql-dir}/include
    --with-mysql-lib
    --without-mysql-lib=${mysql-dir}/lib
    --with-mysql-config
    --without-mysql-config
    --with-mysql-dir
    --without-mysql-dir
    --with-mysql-include
    --without-mysql-include=${mysql-dir}/include
    --with-mysql-lib
    --without-mysql-lib=${mysql-dir}/lib
    --with-mysqlclientlib
    --without-mysqlclientlib

To see why this extension failed to compile, please check the mkmf.log which can be found here:

  /home/ec2-user/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/extensions/x86_64-linux/2.5.0-static/mysql2-0.5.2/mkmf.log

extconf failed, exit code 1

Gem files will remain installed in /home/ec2-user/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/mysql2-0.5.2 for inspection.
Results logged to /home/ec2-user/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/extensions/x86_64-linux/2.5.0-static/mysql2-0.5.2/gem_make.out

MySQL is not installed, so the error says:

mysql client is missing. You may need to 'apt-get install libmysqlclient-dev' or 'yum install mysql-devel', and try again.

Fair enough. So let's run 'yum install mysql-devel'.
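
In this log the actionable line sits between two lines made of five hyphens, so it can be pulled out mechanically. The sketch below assumes that layout, which is simply what the mysql2 output above happens to look like and not a guaranteed format. extract_hint is a made-up helper name.

```shell
# extract_hint is a hypothetical helper: print only the lines that sit between
# the "-----" separator lines, which is where the advice appears in this log.
extract_hint() {
  awk '/^-----$/ { inside = !inside; next } inside { print }' "$@"
}

# Possible usage on a failing install:
#   gem install mysql2 2>&1 | extract_hint
```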

Install mysql-devel

$ sudo yum install mysql-devel
Loaded plugins: priorities, update-motd, upgrade-helper
amzn-main                                                                                                                               | 2.1 kB  00:00:00     
amzn-updates                                                                                                                            | 2.5 kB  00:00:00     
Resolving Dependencies
--> Running transaction check
---> Package mysql-devel.noarch 0:5.5-1.6.amzn1 will be installed
--> Processing Dependency: mysql55-devel >= 5.5 for package: mysql-devel-5.5-1.6.amzn1.noarch
--> Processing Dependency: /usr/bin/mysql_config for package: mysql-devel-5.5-1.6.amzn1.noarch
--> Running transaction check
---> Package mysql55.x86_64 0:5.5.61-1.22.amzn1 will be installed
--> Processing Dependency: real-mysql55-libs(x86-64) = 5.5.61-1.22.amzn1 for package: mysql55-5.5.61-1.22.amzn1.x86_64
--> Processing Dependency: mysql-config for package: mysql55-5.5.61-1.22.amzn1.x86_64
---> Package mysql55-devel.x86_64 0:5.5.61-1.22.amzn1 will be installed
--> Running transaction check
---> Package mysql-config.x86_64 0:5.5.61-1.22.amzn1 will be installed
---> Package mysql55-libs.x86_64 0:5.5.61-1.22.amzn1 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

===============================================================================================================================================================
 Package                                Arch                            Version                                    Repository                             Size
===============================================================================================================================================================
Installing:
 mysql-devel                            noarch                          5.5-1.6.amzn1                              amzn-main                             2.7 k
Installing for dependencies:
 mysql-config                           x86_64                          5.5.61-1.22.amzn1                          amzn-updates                           49 k
 mysql55                                x86_64                          5.5.61-1.22.amzn1                          amzn-updates                          7.5 M
 mysql55-devel                          x86_64                          5.5.61-1.22.amzn1                          amzn-updates                          201 k
 mysql55-libs                           x86_64                          5.5.61-1.22.amzn1                          amzn-updates                          816 k

Transaction Summary
===============================================================================================================================================================
Install  1 Package (+4 Dependent packages)

Total download size: 8.6 M
Installed size: 32 M
Is this ok [y/d/N]: y
Downloading packages:
(1/5): mysql-config-5.5.61-1.22.amzn1.x86_64.rpm                                                                                        |  49 kB  00:00:00     
(2/5): mysql-devel-5.5-1.6.amzn1.noarch.rpm                                                                                             | 2.7 kB  00:00:00     
(3/5): mysql55-5.5.61-1.22.amzn1.x86_64.rpm                                                                                             | 7.5 MB  00:00:00     
(4/5): mysql55-libs-5.5.61-1.22.amzn1.x86_64.rpm                                                                                        | 816 kB  00:00:00     
(5/5): mysql55-devel-5.5.61-1.22.amzn1.x86_64.rpm                                                                                       | 201 kB  00:00:00     
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                                           15 MB/s | 8.6 MB  00:00:00     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : mysql55-libs-5.5.61-1.22.amzn1.x86_64                                                                                                       1/5 
  Installing : mysql-config-5.5.61-1.22.amzn1.x86_64                                                                                                       2/5 
  Installing : mysql55-5.5.61-1.22.amzn1.x86_64                                                                                                            3/5 
  Installing : mysql55-devel-5.5.61-1.22.amzn1.x86_64                                                                                                      4/5 
  Installing : mysql-devel-5.5-1.6.amzn1.noarch                                                                                                            5/5 
  Verifying  : mysql-config-5.5.61-1.22.amzn1.x86_64                                                                                                       1/5 
  Verifying  : mysql55-libs-5.5.61-1.22.amzn1.x86_64                                                                                                       2/5 
  Verifying  : mysql55-5.5.61-1.22.amzn1.x86_64                                                                                                            3/5 
  Verifying  : mysql55-devel-5.5.61-1.22.amzn1.x86_64                                                                                                      4/5 
  Verifying  : mysql-devel-5.5-1.6.amzn1.noarch                                                                                                            5/5 

Installed:
  mysql-devel.noarch 0:5.5-1.6.amzn1                                                                                                                           

Dependency Installed:
  mysql-config.x86_64 0:5.5.61-1.22.amzn1 mysql55.x86_64 0:5.5.61-1.22.amzn1 mysql55-devel.x86_64 0:5.5.61-1.22.amzn1 mysql55-libs.x86_64 0:5.5.61-1.22.amzn1

Complete!

The only package passed to yum install was mysql-devel, but look at the dependency summary:

Dependencies Resolved

===============================================================================================================================================================
 Package                                Arch                            Version                                    Repository                             Size
===============================================================================================================================================================
Installing:
 mysql-devel                            noarch                          5.5-1.6.amzn1                              amzn-main                             2.7 k
Installing for dependencies:
 mysql-config                           x86_64                          5.5.61-1.22.amzn1                          amzn-updates                           49 k
 mysql55                                x86_64                          5.5.61-1.22.amzn1                          amzn-updates                          7.5 M
 mysql55-devel                          x86_64                          5.5.61-1.22.amzn1                          amzn-updates                          201 k
 mysql55-libs                           x86_64                          5.5.61-1.22.amzn1                          amzn-updates                          816 k

Transaction Summary
===============================================================================================================================================================
Install  1 Package (+4 Dependent packages)

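As this shows, mysql-devel pulls in mysql55 as a dependency, so the MySQL client and server programs get installed as well*5.

Put that way, yum install mysql might seem sufficient, but when I tried it separately, it turned out that compiling the C part of the mysql2 gem needs the headers in the mysql-devel package.

Installing the mysql2 gem again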

$ gem install mysql2
Building native extensions. This could take a while...
Successfully installed mysql2-0.5.2
Parsing documentation for mysql2-0.5.2
Installing ri documentation for mysql2-0.5.2
Done installing documentation for mysql2 after 0 seconds
1 gem installed

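It worked.

I was hoping to run into some more baffling trouble and build up know-how from troubleshooting it, so I'm slightly disappointed that everything went smoothly, but oh well.

That's all from the field.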

*1:https://www.gnu.org/software/bash/manual/html_node/Bourne-Shell-Builtins.html#Bourne-Shell-Builtins

*2:http://e-words.jp/w/gcc.html

*3:https://qiita.com/unsoluble_sugar/items/1403ddf0ac9709b1aae6#xcode%E3%82%92%E6%9C%80%E6%96%B0%E3%81%AB%E3%82%A2%E3%83%83%E3%83%97%E3%83%87%E3%83%BC%E3%83%88 and others

*4:https://ja.wikipedia.org/wiki/GNU_Readline Presumably this is what lets you press the up arrow in irb and see the previous command...?

*5:mysql55 installs both the server and the client, but a single host often needs only one of them, so the server and client programs can also be installed separately https://dev.mysql.com/doc/refman/5.6/ja/linux-installation-rpm.html

2018-09-16

Notes on some shell tricks for producing a CSV from a CSV

CSV shell gawk

Reordering columns

awk does the job.

Given a line like

aaa,ddd,fff,bbb,ccc,eee

suppose you want to reorder the columns into

aaa,bbb,ccc,ddd,eee,fff

Then:

$ echo 'aaa,ddd,fff,bbb,ccc,eee' | awk -F ',' '{ print $1 "," $4 "," $5 "," $2 "," $6 "," $3 }'
aaa,bbb,ccc,ddd,eee,fff
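
When the column order changes often, typing out $1 "," $4 ... by hand gets tedious. One possible generalization is sketched below. reorder_csv is a made-up helper, and like the one-liner above it only handles simple CSV with no commas inside quoted fields.

```shell
# reorder_csv is a hypothetical helper: reorder comma-separated columns
# according to a comma-separated list of 1-based column numbers.
# It only handles simple CSV with no quoted commas.
reorder_csv() {
  awk -F ',' -v order="$1" '
    BEGIN { n = split(order, idx, ",") }
    {
      line = $(idx[1])
      for (i = 2; i <= n; i++) line = line "," $(idx[i])
      print line
    }'
}

echo 'aaa,ddd,fff,bbb,ccc,eee' | reorder_csv 1,4,5,2,6,3
# prints: aaa,bbb,ccc,ddd,eee,fff
```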

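Some fields in the CSV are double-quoted and some are not

aaa,"bbb",ccc

Suppose you want every field in the line above to be double-quoted, and the number of columns is small enough to handle by hand. Then: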

$ echo 'aaa,"bbb",ccc' | gawk -v FPAT='([^,]+)' '{print "\"" $1 "\"," "\""$2"\"," "\""$3"\""}' | sed -e 's/""/"/g'
"aaa","bbb","ccc"

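gawk's -v FPAT='pattern' specifies the pattern that counts as the contents of one field, so set it to runs of characters other than , .
Any doubled " is easy to strip with sed afterwards, so simply add " around every column.

Something like that. Just in case, check beforehand that the originally quoted "bbb" column does not itself contain "" somewhere, for example with

$ echo 'aaa,"bbb",ccc' | cut -d ',' -f 2 | grep '""'

(this check assumes that nothing tricky, such as a comma, ends up in the first column).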
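
A different approach, which is not the one used above, is to add quotes only to fields that lack them. That avoids the sed cleanup step and works with plain awk, so gawk is not required. This is a sketch under the same assumption that no field contains a comma. quote_fields is a made-up name.

```shell
# quote_fields is a hypothetical helper: wrap every comma-separated field in
# double quotes unless it is already quoted. Assumes no field contains a comma.
quote_fields() {
  awk -F ',' -v OFS=',' '{
    for (i = 1; i <= NF; i++)
      if ($i !~ /^".*"$/) $i = "\"" $i "\""
    print
  }'
}

echo 'aaa,"bbb",ccc' | quote_fields
# prints: "aaa","bbb","ccc"
```

A side effect of splitting on the separator is that empty fields are kept (they become ""), whereas a field pattern like [^,]+ never matches an empty field.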

Some fields contain , inside them

gawk's -v FPAT='pattern' works here too.

$ echo '"aaa","b,b,b","ccc"' | gawk -v FPAT='(\"[^\"]+\")' '{ print $2 "  " $3 }'
"b,b,b"  "ccc"

For a CSV that also mixes quoted and unquoted fields as in the previous section, combine the two patterns.

$ echo 'aaa,"b,b,b",ccc' | gawk -v FPAT='([^,]+)|(\"[^\"]+\")' '{print "\"" $1 "\"," "\""$2"\"," "\""$3"\""}' | sed -e 's/""/"/g'
"aaa","b,b,b","ccc"

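When even trickier patterns are present and reordering doesn't work

Use a distinctive column near the one you want as a landmark, and muddle through with tr + grep, sed, and so on.

aaa,bbb,,"cccc,ddddd","e,ff,g","State",hhh,ii

For a line like the one above, if the "State" field is known to take one of a few fixed values, the column after it can be extracted like this: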

$ echo 'aaa,bbb,,"cccc,ddddd","e,ff,g","State",hhh,ii' | tr ',' '\n' | grep -n1 "State" | sed -n '3,3 p' | cut -d '-' -f 2
hhh

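The steps are:

Crudely split the line at every separator with tr ',' '\n'
Search line by line with grep for the easily recognizable column, printing the surrounding lines as well
Based on how many lines the target is from that landmark, pull it out with sed -n 'start,end p'
grep -n1 means -n (print line numbers) plus -1 (one line of context), so the context line that comes out starts with "number-"; use cut to take the second field as the target value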

$ echo 'aaa,bbb,,"cccc,ddddd","e,ff,g","State",hhh,ii' | tr ',' '\n'
aaa
bbb

"cccc
ddddd"
"e
ff
g"
"State"
hhh
ii

$ echo 'aaa,bbb,,"cccc,ddddd","e,ff,g","State",hhh,ii' | tr ',' '\n' | grep -n1 "State" 
8-g"
9:"State"
10-hhh

$ echo 'aaa,bbb,,"cccc,ddddd","e,ff,g","State",hhh,ii' | tr ',' '\n' | grep -n1 "State" |  sed -n '3,3 p'
10-hhh

$ echo 'aaa,bbb,,"cccc,ddddd","e,ff,g","State",hhh,ii' | tr ',' '\n' | grep -n1 "State" |  sed -n '3,3 p' | cut -d '-' -f 2
hhh
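
One weakness of the cut -d '-' -f 2 step is that it truncates a target value that itself contains a hyphen. A variation that avoids line numbers altogether is sketched below. field_after is a made-up helper, and -A is a GNU/BSD grep extension.

```shell
# field_after is a hypothetical helper: print the comma-separated chunk that
# immediately follows the first chunk matching the given landmark pattern.
field_after() {
  tr ',' '\n' | grep -A 1 "$1" | sed -n '2p'
}

echo 'aaa,bbb,,"cccc,ddddd","e,ff,g","State",hhh,ii' | field_after '"State"'
# prints: hhh
```

Because grep -A 1 prints the matching line followed by the next one, the second output line is always the chunk right after the landmark, with no prefix to strip.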

Assigning a different value depending on a column's value

is_current_year="false"
target_date="2017/01/02"
echo $target_date | grep '2018' 1>/dev/null && is_current_year="true"
echo $is_current_year
# false

is_current_year="false"
target_date="2018/02/03"
echo $target_date | grep '2018' 1>/dev/null && is_current_year="true"
echo $is_current_year
# true

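The idea: use grep to check whether a column's value matches a particular pattern, and if it does, overwrite a variable that holds the value derived from that column.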
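
The same check can also be written without spawning grep, using the shell's own case pattern matching. This sketch additionally anchors the year to the start of the string, which the grep '2018' version does not do. is_current_year is a made-up helper name.

```shell
# is_current_year is a hypothetical helper: print "true" when the date string
# starts with the given year followed by "/", and "false" otherwise.
is_current_year() {
  case $1 in
    "$2"/*) echo true ;;
    *)      echo false ;;
  esac
}

is_current_year "2017/01/02" 2018   # prints: false
is_current_year "2018/02/03" 2018   # prints: true
```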

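Joining with data from another CSV while cutting the lookup time as much as possible

There are times when data living in separate data stores can only be exported as CSV, so you end up having to join CSVs against each other (better if it never happens).
When pulling rows out of one of the CSVs and you want them as quickly as possible, grep's -m option stops reading as soon as it has found the requested number of matching lines.

# Print just one employee from the sales department (営業部), anyone will do, from 社員データ.csv (the employee data file)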
grep -m 1 営業部 社員データ.csv
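
Put together, a CSV-to-CSV join built on grep -m 1 might look like the sketch below. It is an illustration only: join_first_match is a made-up helper, the key is assumed to be the first column of both files, and the key is used inside a regular expression, so it must not contain regex metacharacters.

```shell
# join_first_match is a hypothetical helper. For each "id,..." line of the
# left file, look up the first line of the right file whose first column is
# the same id, and print the left line with the rest of that right line appended.
join_first_match() {
  left=$1
  right=$2
  while IFS= read -r line; do
    id=${line%%,*}
    match=$(grep -m 1 "^$id," "$right") || continue   # skip ids with no match
    printf '%s,%s\n' "$line" "${match#*,}"
  done < "$left"
}
```

grep still scans the right-hand file from the top for every left-hand row, so -m 1 only saves the part of the scan after the first hit. For large files that are already sorted by key, the standard join command is likely the better fit.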

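Other things to keep in mind

Newline handling and the like can differ between environments, so re-check the command output on the server that will run the batch job
With three or more kinds of special values, extracting the values one at a time and replacing them with sed is probably more accurate
gawk may not be installed, so install it if needed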

That's all from the field.

2018-09-15

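Deploying a website that is less than a single page with Capistrano

Capistrano

Today I'll try out Capistrano, a build and deployment automation framework written in Ruby.

What Capistrano can do

To begin with, you could just read 入門 Capistrano 3 ~ 全ての手作業を生まれる前に消し去りたい, but here is a short summary of what Capistrano can do.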

Define settings shared across the build
- Write set :xxx, value_of_xxx, and any task that runs after the set can read the value with fetch :xxx
Define the work done during a build as tasks
- Work that other builds can reuse, such as building a Docker image, can be split out into a library. Some of these libraries are distributed as gems
  - For example https://github.com/reproio/capistrano-dockerbuild and https://github.com/seuros/capistrano-sidekiq
Specify dependencies between tasks with notation such as after_task => :before_task
- after_task => :before_task declares a dependency: :before_task must run before :after_task
- before task_a, task_b and after task_b, task_a are ordering hooks: both make task_b run before task_a (no dependency is declared, so neither task is stated to require the other)

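What I'll do today

Deploy a simple website where nginx returns an HTML page when you access hostname/
The deploy target is EC2
(It's probably unnecessary, but) for practice, restart nginx before and after the deploy

I'll write Capistrano configuration files for this and run them.

The Capistrano version used is 3.11.0, and roughly speaking the configuration files will look like this.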

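# config/deploy.rb
# Settings for what runs on the local host

# Fetch the HTML source code
# Pack the source code into a zip for upload

# config/deploy/production.rb
# Settings for what runs on the deploy target server

# EC2 server settings
# Upload and extract the zip of HTML files
# Stop and start nginx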

TL;DR

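I wrote this further down as well, but here is a summary of what I did today.

Fetch the source code from a GitHub repository
- Branch selection is not covered yet
Specify the deploy target by writing server in config/deploy/#{environment}.rb
- ssh_options specifies the private key etc. used for the ssh login
deploy_to specifies which directory on the target server to upload into
- Deployment into that directory happens by running git clone in it when task :deploy starts
Write commands to run on remote servers inside on roles(:role), and tasks to run on the local machine inside run_locally
- execute "command" specifies a command to run within a task
- upload! specifies a file to copy from directly under the git repository inside the deploy_to directory to another directory
  - When the git plugin is used exactly as cap install leaves it, upload! is in effect a copy on the remote host*1, so write upload! inside on roles(:role)
task :a => :b declares that :b runs before :a whenever :a is executed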

Notes from actually doing it

Installing Capistrano and generating the config files

$ gem install capistrano
$ cap --version
Capistrano Version: 3.11.0 (Rake Version: 12.3.1)

$ cap install
mkdir -p config/deploy
create config/deploy.rb
create config/deploy/staging.rb
create config/deploy/production.rb
mkdir -p lib/capistrano/tasks
create Capfile
Capified

$ tree .
.
├── Capfile
├── config
│   ├── deploy
│   │   ├── production.rb
│   │   └── staging.rb
│   └── deploy.rb
├── html
│   └── index.html # HTML to deploy
└── lib
    └── capistrano
        └── tasks

Fetching the HTML source code

I'll manage the files shown above, including the HTML to deploy, in a private GitHub repository, and have the deploy fetch the source code from that private repository.

# config/deploy.rb
# config valid for current version and patch releases of Capistrano
lock "~> 3.11.0"

set :application, "capistrano_experiment"
set :repo_url, "git@github.com:woshidan/capistranotest.git"

task :download do
  run_locally do # commands inside run_locally do ... end run on the local machine (here, my own PC); this is called a run_locally block
    info "downloading master branch source from GitHub." # info, warn, etc. write log output at the corresponding log level
    execute "git checkout master && git pull origin master" # inside a task, run a command with execute "command"
  end
end

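To keep things simple, assume the git repository has already been cloned into the working directory and that only the master branch is deployed*2.

At this point the task is only defined. Nothing yet says to run the download task during a deploy, so running cap production deploy does not trigger it, and the downloading master branch source from GitHub. log line is not written to log/capistrano.log either.

Packing the source code into a zip for upload

This task zips up the files that need to go to the server. This time there is hardly enough to justify it, but it packs index.html into a zip.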

# config/deploy.rb

task :archive => :download do
  run_locally do
    info "archive html directory to zip"
    execute "zip -r html.zip html"
  end
end

The :archive => :download part means: run the :archive task only after the :download task has run.

Launching an EC2 instance and installing nginx

Since the plan is a site where nginx serves static files, launch a suitable EC2 instance and install nginx with the commands below. The security group crudely allows SSH, HTTP, and HTTPS access from My IP*3 only.

# Reference: http://d.hatena.ne.jp/january/20130819
$ sudo yum install nginx
$ which nginx # confirm it is installed
/usr/sbin/nginx
$ sudo nginx # start nginx
$ ps -ef | grep nginx # confirm the master and worker processes are running
root      2759     1  0 09:27 ?        00:00:00 nginx: master process nginx
nginx     2760  2759  0 09:27 ?        00:00:00 nginx: worker process
ec2-user  2762  2686  0 09:27 pts/0    00:00:00 grep --color=auto nginx

Accessing the host's address shows the nginx welcome page (screenshot omitted), which confirms that nginx is running.

Looking at /etc/nginx/nginx.conf shows:

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;

    # Load modular configuration files from the /etc/nginx/conf.d directory.
    # See http://nginx.org/en/docs/ngx_core_module.html#include
    # for more information.
    include /etc/nginx/conf.d/*.conf;

    index   index.html index.htm;

    server {
        listen       80 default_server;
        listen       [::]:80 default_server;
        server_name  localhost;
        root         /usr/share/nginx/html;

and checking with

$ less /usr/share/nginx/html/index.html | grep Welcome
        <h1>Welcome to <strong>nginx</strong> on the Amazon Linux AMI!</h1>

confirms where the source of the displayed page lives. So for today, the goal is to change root in the server directive to /var/www and upload index.html under /var/www.

# conf file after the change
    index   index.html index.htm;

    server {
        listen       80 default_server;
        listen       [::]:80 default_server;
        server_name  localhost;
        root         /var/www; # changed this line
        # root         /usr/share/nginx/html;

        # Load configuration files for the default server block.
        include /etc/nginx/default.d/*.conf;

        location / {
        }

        # redirect server error pages to the static page /40x.html
        #
        # error_page 404 /404.html; # commented out for now, since preparing 40x pages etc. under /var/www is a hassle
        #     location = /40x.html {
        # }

        # redirect server error pages to the static page /50x.html
        #
        # error_page 500 502 503 504 /50x.html;
        #     location = /50x.html {
        # }

Also, prepare a temporary page to show until the deploy completes,

[ec2-user@ip-172-31-27-79 ~]$ sudo mkdir /var/www
[ec2-user@ip-172-31-27-79 ~]$ sudo touch /var/www/index.html
[ec2-user@ip-172-31-27-79 ~]$ sudo vim /var/www/index.html

then restart nginx and confirm that the configuration change took effect.

[ec2-user@ip-172-31-27-79 ~]$ sudo service nginx stop
Stopping nginx:                                            [  OK  ]
[ec2-user@ip-172-31-27-79 ~]$ sudo service nginx start
Starting nginx:                                            [  OK  ]

(Screenshot omitted: the page shown after the restart.)

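Writing the EC2 server settings

Now that the deploy target server exists, write its settings into the Capistrano configuration file.

# config/deploy/production.rb

# server: host name (anything reachable will do, so an IP address here)
# user: login user
# roles: group used for deploy work; referenced later as on roles(:web) when defining tasks that run on the remote host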
server "5x.xxx.xxx.x0", user: "ec2-user", roles: %w{web}

Incidentally, once a stage configuration file such as config/deploy/production.rb is in place, the tasks defined earlier can be run:

$ cap production download
00:00 download
      downloading master branch source from GitHub.
      01 git checkout master && git pull origin master
      01 Already on 'master'
      01 From https://github.com/woshidan/capistranotest
      01  * branch            master     -> FETCH_HEAD
      01 Already up-to-date.
    ✔ 01 woshidan@localhost 1.083s

# task :archive => :download do
$ cap production archive
00:00 download
      downloading master branch source from GitHub.
      01 git checkout master && git pull origin master
      01 Already on 'master'
      01 From https://github.com/woshidan/capistranotest
      01  * branch            master     -> FETCH_HEAD
      01 Already up-to-date.
    ✔ 01 woshidan@localhost 1.097s
00:01 archive
      archive html directory to zip
      01 zip -r html.zip html
      01 updating: html/ (stored 0%)
      01 updating: html/index.html (deflated 20%)
    ✔ 01 woshidan@localhost 0.042s

So the defined tasks are now runnable.

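Uploading and extracting the zip of HTML files

For this website, the concrete deploy work is uploading the zip of HTML files that nginx serves, then extracting and placing the uploaded files. Let's write the task that does this.

To upload a file with Capistrano, use upload! path_to_local_file, path_on_target_server. The setting looks like this: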

# config/deploy.rb

task :deploy => :archive do
  on roles(:web) do
    upload! "./html.zip", "/home/ec2-user/deploy_target_dir" # upload ./html.zip on the local host to /home/ec2-user/deploy_target_dir on the remote host
  end
end

As it stands, however, the deploy fails with the following error:

00:01 git:wrapper
      01 mkdir -p /tmp
(Backtrace restricted to imported tasks)
cap aborted!
SSHKit::Runner::ExecuteError: Exception while executing as ec2-user@5x.xxx.xxx.x0: Authentication failed for user ec2-user@5x.xxx.xxx.x0


Caused by:
Net::SSH::AuthenticationFailed: Authentication failed for user ec2-user@5x.xxx.xxx.x0

Tasks: TOP => deploy:check => git:check => git:wrapper
(See full trace by running task with --trace)
The deploy has failed with an error: Exception while executing as ec2-user@5x.xxx.xxx.x0: Authentication failed for user ec2-user@5x.xxx.xxx.x0


** DEPLOY FAILED
** Refer to log/capistrano.log for details. Here are the last 20 lines:

To get past this authentication error, add SSH options to config/deploy/production.rb (the file that describes the deploy target).

# config/deploy/production.rb

server "5x.xxx.xxx.x0", user: "ec2-user", roles: %w{web}
set :ssh_options, keys: %w{./capistrano_test.pem}, auth_methods: %w{publickey}
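A small note on the keys: value. Net::SSH documents :keys as an array of private key file names, and in Ruby %w{...} builds an array while %{...} builds a plain string. The snippet below is just an illustration of that literal syntax; the helper name key_literals is made up.

```ruby
# %{...} is a string literal, %w{...} splits on whitespace into an array.
# key_literals is a hypothetical helper used only to show the difference.
def key_literals
  {
    string: %{./capistrano_test.pem},
    array:  %w{./capistrano_test.pem}
  }
end

literals = key_literals
p literals[:string].class # String
p literals[:array].class  # Array
```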

With that in place, the deploy now fails with a different message:

 DEBUG [0691849d] Command: ( export GIT_ASKPASS="/bin/echo" GIT_SSH="/tmp/git-ssh-capistrano_experiment-production-woshidan.sh" ; /usr/bin/env git ls-remote git@github.com:woshidan/capistranotest.git HEAD )

 DEBUG [0691849d]   /usr/bin/env: 

 DEBUG [0691849d]   git

 DEBUG [0691849d]   : No such file or directory

This failure means that git is not installed on the deploy target server, so install it there:

$ sudo yum install git

That installs git on the deploy target server*4.

In fact, if you leave the Capfile exactly as cap install generated it, something unexpected happens with a deploy task like this one:

task :deploy => :archive do
  on roles(:web) do
    upload! "./html.zip", "/var/www" # upload ./html.zip on the local host to /var/www on the remote host
  end
end

As soon as the deploy task starts, Capistrano appears to run git clone in the directory given by :deploy_to, or in /var/www/#{application}.

task :deploy => :archive do

end

Even with the body of the :deploy task emptied like this, the log looks as follows:

00:00 archive
      archive html directory to zip
      01 zip -r html.zip html
      01 updating: html/ (stored 0%)
      01 updating: html/index.html (deflated 20%)
    ✔ 01 woshidan@localhost 0.042s
    ✔ 02 ec2-user@5x.xxx.xxx.x0 0.421s
00:01 git:wrapper
      01 mkdir -p /tmp
    ✔ 01 ec2-user@5x.xxx.xxx.x0 0.092s
      Uploading /tmp/git-ssh-capistrano_experiment-production-woshidan.sh 100.0%
      02 chmod 700 /tmp/git-ssh-capistrano_experiment-production-woshidan.sh
    ✔ 02 ec2-user@5x.xxx.xxx.x0 0.092s
00:01 git:check
      01 git ls-remote git@github.com:woshidan/capistranotest.git HEAD
      01 b132c31d1fc37f08848d8b860bf57f20ad4ef635   HEAD
    ✔ 01 ec2-user@5x.xxx.xxx.x0 1.774s
00:03 deploy:check:directories
      01 mkdir -p /home/ec2-user/upload_prepare_dir/shared /home/ec2-user/upload_prepare_dir/releases
    ✔ 01 ec2-user@5x.xxx.xxx.x0 0.132s
00:04 git:clone
      The repository mirror is at /home/ec2-user/upload_prepare_dir/repo
00:04 git:update
      01 git remote set-url origin git@github.com:woshidan/capistranotest.git
    ✔ 01 ec2-user@5x.xxx.xxx.x0 0.093s
      02 git remote update --prune
      02 Fetching origin
      02 From github.com:woshidan/capistranotest
      02    f4e5e6b..b132c31  master     -> master
    ✔ 02 ec2-user@5x.xxx.xxx.x0 2.188s
00:06 git:create_release
      01 mkdir -p /home/ec2-user/upload_prepare_dir/releases/20180913134338
    ✔ 01 ec2-user@5x.xxx.xxx.x0 0.100s
      02 git archive master | /usr/bin/env tar -x -f - -C /home/ec2-user/upload_prepare_dir/releases/20180913134338
    ✔ 02 ec2-user@5x.xxx.xxx.x0 0.096s
00:07 deploy:set_current_revision
      01 echo "b132c31d1fc37f08848d8b860bf57f20ad4ef635" > REVISION
    ✔ 01 ec2-user@5x.xxx.xxx.x0 0.092s
00:07 deploy:symlink:release
      01 ln -s /home/ec2-user/upload_prepare_dir/releases/20180913134338 /home/ec2-user/upload_prepare_dir/releases/current
    ✔ 01 ec2-user@5x.xxx.xxx.x0 0.092s
      02 mv /home/ec2-user/upload_prepare_dir/releases/current /home/ec2-user/upload_prepare_dir
    ✔ 02 ec2-user@5x.xxx.xxx.x0 0.090s
00:07 deploy:cleanup
      Keeping 5 of 6 deployed releases on 5x.xxx.xxx.x0
      01 rm -rf /home/ec2-user/upload_prepare_dir/releases/20180913112304
    ✔ 01 ec2-user@5x.xxx.xxx.x0 0.118s
00:07 deploy:log_revision
      01 echo "Branch master (at b132c31d1fc37f08848d8b860bf57f20ad4ef635) deployed as release 20180913134338 by woshidan" >> /home/ec2-user/upload_prepare_dir/r…
    ✔ 01 ec2-user@5x.xxx.xxx.x0 0.098s

So git clone runs regardless of what the :deploy task contains.

# config/deploy.rb
set :deploy_to, "/home/ec2-user/upload_prepare_dir"

task :deploy => :archive do
  on roles(:web) do
    upload! "./html.zip", "/home/ec2-user/deploy_target_dir" # upload ./html.zip on the local host to /home/ec2-user/deploy_target_dir on the remote host
  end
end

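With the settings above, Capistrano appears to run git clone and create the directories that manage versions of the cloned code under /home/ec2-user/upload_prepare_dir, and upload! appears to copy the file from there into the directory given as its second argument*5.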

[ec2-user@ip-172-31-27-79 ~]$ pwd
/home/ec2-user
$ tree .
.
├── deploy_target_dir
│   └── html.zip
├── html
│   └── index.html
└── upload_prepare_dir
    ├── current -> /home/ec2-user/upload_prepare_dir/releases/20180913134913
    ├── releases
    │   └── 20180913134913
    │       ├── Capfile
    │       ├── config
    │       │   ├── deploy
    │       │   │   ├── production.rb
    │       │   │   └── staging.rb
    │       │   └── deploy.rb
    │       ├── html
    │       │   └── index.html
    │       └── REVISION
    ├── repo
    │   ├── branches
    │   ├── config
    │   ├── description
    │   ├── FETCH_HEAD
    │   ├── HEAD
    │   ├── hooks
    │   │   ├── applypatch-msg.sample
    │   │   ├── commit-msg.sample
    │   │   ├── post-update.sample
    │   │   ├── pre-applypatch.sample
    │   │   ├── pre-commit.sample
    │   │   ├── prepare-commit-msg.sample
    │   │   ├── pre-push.sample
    │   │   ├── pre-rebase.sample
    │   │   ├── pre-receive.sample
    │   │   └── update.sample
    │   ├── info
    │   │   └── exclude
    │   ├── objects
    │   │   ├── info
    │   │   └── pack
    │   │       ├── pack-455700f6724c559c3d0264e92c2888bf6b191610.idx
    │   │       └── pack-455700f6724c559c3d0264e92c2888bf6b191610.pack
    │   ├── packed-refs
    │   └── refs
    │       ├── heads
    │       └── tags
    ├── revisions.log
    └── shared

20 directories, 27 files

The above is the output of running tree . on the deploy target server.
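The deploy:cleanup step in the earlier log ("Keeping 5 of 6 deployed releases") removes old directories under releases. The idea can be sketched in plain Ruby as below. This is not Capistrano's own code, just an assumption-laden illustration: the method name releases_to_remove is made up, and only the oldest and newest directory names come from the log while the four in between are invented.

```ruby
# Returns the release directory names that would be removed when only the
# newest `keep` releases are retained. The names are timestamps, so sorting
# them as strings also sorts them chronologically.
def releases_to_remove(release_names, keep: 5)
  sorted = release_names.sort
  return [] if sorted.size <= keep

  sorted.first(sorted.size - keep)
end

# First and last names are taken from the log; the rest are hypothetical.
releases = %w[
  20180913112304 20180913120000 20180913123000
  20180913130000 20180913133000 20180913134338
]
p releases_to_remove(releases)
```

With six releases and keep: 5, only the oldest directory is selected, which lines up with the rm -rf in the log.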

Next, I add the steps that extract the uploaded zip file and place the result at /var/www/index.html*6.

# config/deploy.rb

task :deploy => :archive do
  on roles(:web) do
    execute "mkdir -p /home/ec2-user/deploy_target_dir"
    upload! "./html.zip", "/home/ec2-user/deploy_target_dir" # upload ./html.zip on the local host to /home/ec2-user/deploy_target_dir on the remote host
    execute "cd /home/ec2-user/deploy_target_dir && unzip -o /home/ec2-user/deploy_target_dir/html.zip"
    execute "sudo cp /home/ec2-user/deploy_target_dir/html/index.html /var/www/index.html"
  end
end

Stopping and restarting nginx

Nothing in this deploy requires it, but restarting the application during a deploy is common, so I restart nginx here as practice.

# config/deploy.rb

task :deploy => :archive do
  on roles(:web) do
    execute "mkdir -p /home/ec2-user/deploy_target_dir"
    upload! "./html.zip", "/home/ec2-user/deploy_target_dir" # upload ./html.zip on the local host to /home/ec2-user/deploy_target_dir on the remote host
    execute "cd /home/ec2-user/deploy_target_dir && unzip -o /home/ec2-user/deploy_target_dir/html.zip"
    execute "sudo cp /home/ec2-user/deploy_target_dir/html/index.html /var/www/index.html"

    # Added below
    execute "sudo service nginx stop"
    execute "sudo service nginx start"
  end
end

I was able to confirm the deployed result, so this looks good.

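Today I used capistrano to review the following:

Fetching source code from a GitHub repository
- Specifying a branch is not covered yet
Specifying the deploy target server by writing server in config/deploy/#{environment}.rb
- ssh_options specifies the private key and other options used for the SSH login
Specifying with deploy_to which directory on the deploy target server to upload to
- Deployment to that directory is done by running git clone there when task :deploy starts
Writing commands that run on the remote server with on roles(:role), and tasks that run on the local machine with run_locally
- execute "command" specifies a command to run inside a task
- upload! specifies a file to copy from directly under the git repository inside the deploy_to directory to another directory
  - When the git plugin is used as installed by cap install, upload! is in effect a copy on the remote host*7, so upload! is written inside on roles(:role)
Specifying with task :a => :b that :b runs before :a whenever :a is executed

Things I feel I skipped:

Deploying the branch I am currently on, provided it has been pushed
Creating a deploy user and uploading files directly into the deploy target directory
Deploying an app that does a bit more than serve static files
Setting up a build server and allowing uploads only from it
Details of the plugins

I think those can wait until next time.

That's all from the field for now.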

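*1: Perhaps it looks like an upload when the deploy process is viewed as a whole?

*2: This would not work if the work were done on something like a build server, but today is about learning capistrano, so I will deal with that separately

*3: When you choose My IP for the IP address in an AWS security group, AWS seems to fill in the range of IPs allocated by the ISP you are connecting from

*4: Reference: https://qiita.com/himatani/items/87d54752021879e1ec89

*5: I had not touched the Capfile at all, so the cause is probably around https://github.com/capistrano/capistrano/blob/220db8fabab15b9d5cd5c9ab1f2744e0aa346eb0/lib/capistrano/scm/tasks/git.rake#L1-L2 or the install_plugin Capistrano::SCM::Git line in the Capfile...

*6: In practice it would be better to upload to /var/www/, the directory nginx serves, and make something like /var/www/current/index.html the path nginx publishes. That requires setting up a deploy user and so on, but I am out of time today, so I went with this

*7: Perhaps it looks like an upload when the deploy process is viewed as a whole?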

2018-09-09

Notes on the minimum work needed to launch a verification instance with Terraform

Terraform AWS

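When I build a verification environment and need to tweak the settings slightly and rebuild it, or later want to run it on several machines, I sometimes do it with Terraform*1.

Here are notes on the steps I do every time.

Common

Variable settings and so on

Create a disposable, passphrase-less authentication key for each app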

$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/woshidan/.ssh/id_rsa): ./app_secret
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in ./app_secret.
Your public key has been saved in ./app_secret.pub.

access_key = "ACCCCCCCCCCCCCCCCCC"
secret_key = "/VXXXXXXKXXXXXXXXXXXXXXXXXXXXXXXXXX"
region = "ap-northeast-1"
ssh_key_path = "./app_secret.pub"

variable access_key {}
variable secret_key {}
variable region {}

# ssh-keygen -t rsa -f secret_key
variable ssh_key_path {}

# terraform plan --var-file=tf.vars
provider "aws" {
    access_key = "${var.access_key}"
    secret_key = "${var.secret_key}"
    region = "${var.region}"
}

resource "aws_key_pair" "app_secret" {
  key_name   = "app_secret"
  public_key = "${file("./app_secret.pub")}"
}

A rough security group

resource "aws_security_group" "sg-app-server" {
  name        = "app-server-sg"
  // Uses the default VPC unless one is specified

  # SSH access from anywhere
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # HTTP access from anywhere
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # HTTPS access from anywhere
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port       = 0
    to_port         = 0
    protocol        = "-1"
    cidr_blocks     = ["0.0.0.0/0"]
  }
}

Quick setup for building an AMI

refs: https://www.terraform.io/docs/providers/aws/r/instance.html

resource "aws_instance" "web" {
  // aws ec2 describe-images --owners amazon --filters 'Name=name,Valailable' | jq -r '.Images | sort_by(.CreationDate) | last(.[]).ImageId'
  ami           = "ami-08847abae18baa040"
  instance_type = "t2.micro"
  security_groups = ["${aws_security_group.sg-app-server.name}"]
  key_name = "${aws_key_pair.app_secret.key_name}"

  tags {
    Name = "HelloWorld"
  }
}
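The jq part of the comment in the resource above (sort_by(.CreationDate) and take the last ImageId) can also be written in Ruby when jq is not at hand. This is a minimal sketch under assumptions: the method name latest_image_id is made up, the sample JSON uses invented image IDs and dates, and it only relies on the Images, ImageId, and CreationDate fields that the jq expression itself references.

```ruby
require "json"

# Picks the ImageId of the most recently created image from the JSON that
# `aws ec2 describe-images` prints. CreationDate is an ISO 8601 string, so
# comparing the strings orders the images chronologically.
def latest_image_id(describe_images_json)
  images = JSON.parse(describe_images_json).fetch("Images")
  newest = images.max_by { |image| image.fetch("CreationDate") }
  newest && newest.fetch("ImageId")
end

# Hypothetical sample data, not real AMI IDs.
sample = <<~EOS
  {"Images": [
    {"ImageId": "ami-older", "CreationDate": "2018-06-22T22:26:53.000Z"},
    {"ImageId": "ami-newer", "CreationDate": "2018-08-11T02:30:11.000Z"}
  ]}
EOS
p latest_image_id(sample)
```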

ssh -i app_secret ec2-user@ec2-12-345-67-89.ap-northeast-1.compute.amazonaws.com

Commands I often use when building an AMI

I feel like I end up installing Java 1.8 fairly often.

yum install java-1.8.0-openjdk

refs: https://docs.aws.amazon.com/ja_jp/cli/latest/userguide/awscli-install-linux-python.html

# If you need to look for an older version: http://www.atmarkit.co.jp/flinux/rensai/linuxtips/901instoldver.html
yum search python

yum install python

# For ami-08847abae18baa040
$ python --version
Python 2.7.14

# Linux EC2 instances on AWS ship with both Python 2 and Python 3, and Python 2 is the default
# To switch, pages such as https://aws.amazon.com/jp/premiumsupport/knowledge-center/python-boto3-virtualenv/ are helpful
$ python3 --version
Python 3.6.2

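I feel like I learned a lot more than this, but for now this covers last week... That's all from the field.

*1: You could call it a lack of planning, but it is also handy when I want to try something extra after a round of verification has wrapped up...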

2018-08-27

How I worked hard on the environment setup when sending a PR to ecs_auto_scaler in ecs_deploy

AWS ECS AutoScalingGroup ecs_deploy ecs_auto_scaler

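The other day I sent a PR to ecs_auto_scaler. It was merged and has been running for a while without apparent problems, so today I am going to brag a little.

I covered ecs_auto_scaler itself in the previous post, so please refer to that.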

woshidan.hatenablog.com

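This gem makes several services work together, and with gems like this, building an environment to verify the behavior is quite a lot of work.

For ecs_auto_scaler, even leaving out roles and security groups, the minimum is:

Cluster A, the cluster being managed (ecs_auto_scaler manages the number of container instances and tasks per cluster)
- An AutoScaling Group that launches the container instances registered to cluster A
  - The launch template used by that AutoScaling Group
- Services that must be running and services that need not be, because ecs_auto_scaler checks per service whether required tasks exist*1
A host that runs ecs_auto_scaler
Two CloudWatch alarms, one for the upscale trigger and one for the downscale trigger

Preparing all of this from scratch every time is tedious, so when I verified the behavior at work the other day, I made it possible to stand everything up with Terraform, although it took a while.

So today I will brag about that and call it a day. Ta-da.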

gist.github.com

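Note that the security group is wide open for verification purposes: inbound HTTP, HTTPS, and SSH are open to all IPs, and outbound is open entirely. Tear it down as soon as you finish verifying. That's all from the field.

Differences between ECS autoscaling by ecs_deploy and ECS autoscaling by AWS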

AWS ECS AutoScalingGroup OSS ecs_deploy

Today, partly to review ECS, I will talk about ecs_deploy, one of the OSS projects maintained at my company.

TL;DR

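The ecs_deploy gem contains Capistrano task definitions for deploying to ECS and a script that autoscales ECS
ECS autoscaling adjusts the number of container instances in a cluster through the AutoScaling Group's desired setting, and the number of tasks through the ECS service's desired count
- In ECS autoscaling by AWS, the two layers are adjusted independently, so the AutoScaling Group can stop an instance that still has tasks running, which causes errors
- In ecs_auto_scaler autoscaling, on scale-in the script inspects the state of the tasks on the ECS instances and adjusts the AutoScaling Group settings only after confirming that no required tasks are running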

About the reproio/ecs_deploy gem

The ecs_deploy gem is one of a family of scripts, written in various languages, that help with deploying to ECS*1. I think its distinguishing feature is that it has you describe ECS deploys as Capistrano tasks.

Roughly speaking, the ecs_deploy gem contains:

Capistrano task definitions for deploying to ECS
Ruby code used inside those Capistrano tasks
- EcsDeploy ... scripts that call the ECS API to register services and tasks with ECS
- EcsDeploy::EcsAutoScaler ... a script that automatically adjusts (autoscales) the number of tasks and the number of container instances in a cluster for given ECS tasks and clusters
  - Started with ecs_auto_scaler <config yaml>

What I worked on this time was ecs_auto_scaler, so I will explain it a bit further.

To avoid confusion, this article refers to the ECS Service AutoScaling provided by AWS as "ECS autoscaling by AWS", and to autoscaling by the ecs_auto_scaler included in ecs_deploy as "ecs_auto_scaler autoscaling".

ECS autoscaling requires adjusting both the ECS Service and the AutoScaling Group settings

Put simply, ecs_auto_scaler is a script that autoscales ECS.

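When this script was written in January 2016, ECS autoscaling by AWS did not exist yet*2. Broadly summarized, both kinds of autoscaling do the following:

On receiving a CloudWatch alarm*3
- Increase or decrease the desired count of the ECS service
- Increase or decrease the desired setting of the AutoScaling Group*4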
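The alarm-driven adjustment described above boils down to "move a desired value up or down by some step, within bounds", and it applies to both the ECS service's desired count and the AutoScaling Group's desired setting. Here is a minimal sketch of that arithmetic in plain Ruby. It is not taken from ecs_auto_scaler or from AWS; the method name next_desired and the step, min, and max parameters are assumptions for illustration.

```ruby
# Returns the next desired value after an upscale or downscale trigger,
# kept within [min, max]. `step` is how much one trigger changes the value.
# All names and defaults here are hypothetical.
def next_desired(current, trigger, step: 1, min: 0, max: 10)
  delta =
    case trigger
    when :upscale then step
    when :downscale then -step
    else raise ArgumentError, "unknown trigger: #{trigger.inspect}"
    end

  (current + delta).clamp(min, max)
end

p next_desired(2, :upscale)
p next_desired(0, :downscale)
```

The clamp keeps a burst of alarms from pushing the value outside the configured range.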

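According to the Amazon Web Services blog post on Auto Scaling with Amazon ECS, ECS autoscaling by AWS uses:

The ECS Service's Scaling Policy to adjust the number of tasks in an ECS service
The Scaling Policy of the AutoScaling Group that the container instances belong to, to adjust the number of container instances in the ECS cluster

In contrast, ecs_auto_scaler autoscaling changes the number of tasks and container instances by directly stopping ECS tasks and the AutoScaling Group's container instances, and by directly calling the APIs that change desired and desired count.

My point is that, although the two approaches differ somewhat in the settings and APIs they use, ECS autoscaling has to manage settings in two layers: the ECS service and the cluster's AutoScaling Group.

Differences between ECS autoscaling by AWS and ecs_auto_scaler autoscaling

So what actually happens as a result of the differences between these two approaches?

In ECS autoscaling by AWS, the ECS Service's Scaling Policy and the AutoScaling Group's Scaling Policy run independently of each other. As a result, the AutoScaling Group's Scaling Policy can stop a container instance that still has tasks running, which causes errors.

To deal with this problem, on scale-in ecs_auto_scaler inspects the state of the tasks on the ECS instances and adjusts the AutoScaling Group settings only after confirming that no required tasks are running*5.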
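The scale-in check described here, "only remove instances on which no required task is running", can be sketched with plain Ruby data. This is not the code in ecs_deploy (see *5 for that); the method names, the hash keys :id and :running_services, and the sample values are all assumptions for illustration, and no AWS API is called.

```ruby
# Returns the ids of instances that run none of the required services,
# i.e. the ones that are safe to remove on scale-in.
# Each instance is a hypothetical hash: { id: "...", running_services: [...] }.
def removable_instance_ids(instances, required_services)
  instances
    .select { |instance| (instance[:running_services] & required_services).empty? }
    .map { |instance| instance[:id] }
end

# Picks at most `count` instances to terminate from the removable ones.
def scale_in_targets(instances, required_services, count)
  removable_instance_ids(instances, required_services).first(count)
end

# Hypothetical sample data.
instances = [
  { id: "i-aaa", running_services: ["web"] },
  { id: "i-bbb", running_services: [] },
  { id: "i-ccc", running_services: ["batch"] }
]
p scale_in_targets(instances, ["web"], 1)
```

If fewer removable instances exist than requested, fewer are returned, so the AutoScaling Group is never shrunk onto an instance that still runs a required task.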

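This problem can also be handled when using ECS autoscaling by AWS*6. The advantage of ecs_auto_scaler is that, with EcsDeploy::EcsAutoScaler from ecs_deploy, the autoscaling logic, including the check of processes on the ECS instances, runs on a host outside the managed cluster*7, so the monitored instances need no special configuration.

In exchange, it calls the AWS API over the network to autoscale, so it would probably struggle to manage a cluster large enough to exceed the AWS API rate limits*8, for example one with around 200 container instances*9.

I thought I could write this up easily, but it turned out to be harder than expected, which made me a little nervous. That's all from the field.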

*1: For example, https://github.com/fabfuel/ecs-deploy for Python, https://github.com/silinternational/ecs-deploy/blob/develop/ecs-deploy for shell script, https://www.npmjs.com/package/ecs-deploy for JS, and so on

*2: https://aws.amazon.com/jp/blogs/news/automatic-scaling-with-amazon-ecs/ ECS autoscaling by AWS was announced in May 2016

*3: Strictly speaking this is slightly different: the later target tracking scaling policy looks at a specific CloudWatch metric rather than a CloudWatch alarm and tries to keep its value close to a target https://docs.aws.amazon.com/ja_jp/AmazonECS/latest/developerguide/service-autoscaling-targettracking.html

*4: Min and Max are also adjusted; for details see https://dev.classmethod.jp/cloud/aws/comprehend-auto-scaling-desired-capacity/

*5: For details, see around https://github.com/reproio/ecs_deploy/blob/master/lib/ecs_deploy/auto_scaler.rb#L350-L356

*6: https://developers.cyberagent.co.jp/blog/archives/14664/

*7: That is the assumption, probably.

*8: I have heard that the ECS API starts returning errors after something like 1000 calls per hour...

*9: I have a feeling that past 200 instances you would first consider changing the instance type, but I don't know AWS well...
